Automated Similarity Judgment Program

The Automated Similarity Judgment Program (ASJP) is a collaborative project that applies computational methods to comparative linguistics. It is based on a freely accessible (open access) database of word lists, each consisting of 40 basic vocabulary items. More than half of the world's languages are already covered, and the database is continuously updated. In addition to language isolates and languages with an established family affiliation, the database also contains pidgins, creoles, mixed languages, and constructed languages. The word lists are stored in a dedicated, simplified and standardized transcription (ASJPcode). The database has been used to estimate the dates at which language families diverged, with a method that is related to glottochronology but differs from it in some respects. It has also been used, among other things, to determine the homelands of language families, to investigate sound symbolism (onomatopoeia), and to compare phylogenetic methods.

History

Original goals

The ASJP was originally developed to determine objectively how similar words with the same meaning are across languages, and to produce automated language classifications from the observed lexical similarities. In the first ASJP publication, two words with the same meaning in the compared languages were judged similar if they shared at least two identical sounds. The similarity between two languages was then calculated as the proportion of words in the list that were judged similar. The method was tested on 100-word lists from 250 languages belonging to various language families (e.g. Austroasiatic, Indo-European, Mayan, and Muskogean).
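The judgment rule described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the publication: it assumes that every character of a transcription stands for one sound and that shared sounds are counted regardless of their position in the word. The function names `words_similar` and `language_similarity` are hypothetical.

```python
from collections import Counter
from typing import Mapping


def words_similar(word_a: str, word_b: str, min_shared: int = 2) -> bool:
    """Return True if the two words share at least `min_shared` sounds.

    Assumption: one character = one sound, and shared sounds are counted
    as a multiset intersection, ignoring their position in the word.
    """
    shared = Counter(word_a) & Counter(word_b)
    return sum(shared.values()) >= min_shared


def language_similarity(list_a: Mapping[str, str], list_b: Mapping[str, str]) -> float:
    """Proportion of meanings present in both lists whose words are judged similar."""
    meanings = sorted(set(list_a) & set(list_b))
    if not meanings:
        raise ValueError("the two word lists share no meanings")
    hits = sum(words_similar(list_a[m], list_b[m]) for m in meanings)
    return hits / len(meanings)
```

Two word lists are passed as mappings from a meaning (for example "hand") to its transcription, and the result is a value between 0 and 1.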

The ASJP consortium

The ASJP consortium was founded around 2008. The aim was to bring together around 25 professional linguists and other interested parties who work as volunteer transcribers or support the project in other ways. The driving force behind the founding of the consortium was Cecil H. Brown. Søren Wichmann is the day-to-day curator of the project. A third central member of the consortium is Eric W. Holman, who developed most of the software.

Shorter word lists

The word lists originally used were based on the 100-item Swadesh list. It was later shown, however, that a 40-word subset of this list yields language classifications that are just as good, if not slightly better. Since then, only these 40 words have been collected for each language.

Levenshtein distance

Since 2008, ASJP publications have assessed similarity with a measure based on the Levenshtein distance (LD). The Levenshtein distance between two words is the minimum number of insertions, deletions, and substitutions needed to transform one word, treated as a character string, into the other. Differences in word length are corrected for by dividing the LD by the number of characters in the longer of the two words, which gives the normalized Levenshtein distance (Levenshtein Distance Normalized, LDN). The divided LDN (Levenshtein Distance Normalized Divided, LDND) between two languages is the average LDN of all word pairs with the same meaning divided by the average LDN of all word pairs with different meanings. This second normalization corrects for chance similarities.
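The three measures can be illustrated with a short sketch. It is not the ASJP software itself: it assumes that each character of a transcription counts as one symbol and that both word lists are given as mappings from a meaning to a single word. The function names are hypothetical.

```python
from typing import Mapping


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def ldn(a: str, b: str) -> float:
    """LD divided by the length of the longer word (0.0 for two empty words)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list_a: Mapping[str, str], list_b: Mapping[str, str]) -> float:
    """Mean LDN of same-meaning pairs divided by mean LDN of different-meaning pairs."""
    meanings = sorted(set(list_a) & set(list_b))
    same = [ldn(list_a[m], list_b[m]) for m in meanings]
    different = [ldn(list_a[m], list_b[n])
                 for m in meanings for n in meanings if m != n]
    if not same or not different:
        raise ValueError("at least two shared meanings are required")
    mean_different = sum(different) / len(different)
    if mean_different == 0:
        raise ValueError("all different-meaning pairs are identical")
    return (sum(same) / len(same)) / mean_different
```

A value near 0 indicates that words with the same meaning are far more alike than words with different meanings, while a value near 1 indicates that the resemblance is no greater than would be expected by chance.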

Word list

The ASJP uses the following words for its word lists. The list is similar to Sergei Yakhontov's abridged Swadesh list, but differs from it in some respects.

Parts of the body

eye
ear
nose
tongue
tooth
hand
knee
blood
bone
breast (female)
liver
skin

Animals and plants

louse
dog
fish
horn (of an animal)
tree
leaf

People

person
name

Nature

sun
star
water
fire
stone
path
mountain
night

Verbs and adjectives

drink
die
see
hear
come
new
full

Numerals and pronouns

one
two
I
you
we

Phoneme list

The 2016 version of the ASJP database uses the following symbols to encode phonemes: p b f v m w 8 t d s z c n r l S Z C j T 5 y k g x N q X h 7 L 4 G ! i e E 3 a u o
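As a small illustration, a transcription can be checked against this inventory. The sketch below covers only the symbols listed above. Any further modifier characters that real ASJPcode transcriptions may contain are not handled, and the names `ASJP_SYMBOLS` and `unknown_symbols` are hypothetical.

```python
from typing import List

# The symbols listed above, written without separating spaces.
ASJP_SYMBOLS = frozenset("pbfvmw8tdszcnrlSZCjT5ykgxNqXh7L4G!ieE3auo")


def unknown_symbols(transcription: str) -> List[str]:
    """Return the characters of `transcription` that are not in the list above."""
    return [ch for ch in transcription if ch not in ASJP_SYMBOLS]
```

An empty result means that the transcription uses only symbols from the list.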

External links

ASJP Database, official website: http://asjp.clld.org/

References

Wichmann, Søren, André Müller, Annkathrin Wett, Viveka Velupillai, Julia Bischoffberger, Cecil H. Brown, Eric W. Holman, Sebastian Sauppe, Zarina Molochieva, Pamela Brown, Harald Hammarström, Oleg Belyaev, Johann-Mattis List, Dik Bakker, Dmitry Egorov, Matthias Urban, Robert Mailhammer, Agustina Carrizo, Matthew S. Dryer, Evgenia Korovina, David Beck, Helen Geyer, Patience Epps, Anthony Grant, and Pilar Valenzuela. 2013. The ASJP Database (version 16). http://asjp.clld.org/
Brown, Cecil H., Eric W. Holman, Søren Wichmann, and Viveka Velupillai. 2008. Automated classification of the world's languages: A description of the method and preliminary results. STUF - Language Typology and Universals 61.4: 285-308.
Holman, Eric W., Cecil H. Brown, Søren Wichmann, André Müller, Viveka Velupillai, Harald Hammarström, Sebastian Sauppe, Hagen Jung, Dik Bakker, Pamela Brown, Oleg Belyaev, Matthias Urban, Robert Mailhammer, Johann-Mattis List, and Dmitry Egorov. 2011. Automated dating of the world's language families based on lexical similarity. Current Anthropology 52.6: 841-875.
Wichmann, Søren, André Müller, and Viveka Velupillai. 2010. Homelands of the world's language families: A quantitative approach. Diachronica 27.2: 247-276.
Wichmann, Søren, Eric W. Holman, and Cecil H. Brown. 2010. Sound symbolism in basic vocabulary. Entropy 12.4: 844-858.
Pompei, Simone, Vittorio Loreto, and Francesca Tria. 2011. On the accuracy of language trees. PLoS ONE 6: e20109.
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331-354.
Wichmann, Søren, Eric W. Holman, Dik Bakker, and Cecil H. Brown. 2010. Evaluating linguistic distance measures. Physica A 389: 3632-3639 (doi: 10.1016/j.physa.2010.05.011).
http://asjp.clld.org/static/Guidelines.pdf
