ISO 639

from Wikipedia, the free encyclopedia

ISO 639 is an international standard of the International Organization for Standardization (ISO) that defines identifiers for the names of languages (language codes). The standard consists of six sub-standards: five of them define identifiers with two letters (ISO 639-1), three letters (ISO 639-2, ISO 639-3 and ISO 639-5) or four letters (ISO 639-6); one part contains guidelines for application (ISO 639-4).

Application

The identifiers defined in the standard are used in lexicography, linguistics, libraries, information services and data exchange, among other fields. They serve to identify languages unambiguously and to mark the language of documents. They are not intended as abbreviations, not least because a code does not always resemble the name of the language it designates.

The code is defined in lower case. This clearly differentiates between the language code (lower case) and the country codes according to the ISO 3166 standard (upper case).

The language codes in this standard cover natural languages and planned languages, but not languages created for machine processing, such as programming languages.

Sub-standards

The officially introduced sub-standards are:

  • ISO 639-1:2002 - Codes for the representation of names of languages - Part 1: Alpha-2 code
  • ISO 639-2:1998 - Codes for the representation of names of languages - Part 2: Alpha-3 code
  • ISO 639-3:2007 - Codes for the representation of names of languages - Part 3: Alpha-3 code for comprehensive coverage of languages
  • ISO 639-4:2010 - Codes for the representation of names of languages - Part 4: Implementation guidelines and general principles for language coding
  • ISO 639-5:2008 - Codes for the representation of names of languages - Part 5: Alpha-3 code for language families and groups
  • ISO 639-6:2009 - Codes for the representation of names of languages - Part 6: Alpha-4 code for comprehensive coverage of language variants

Schematic overview of the ISO 639 sub-standards

| | ISO 639-1 | ISO 639-2 | ISO 639-3 | ISO 639-5 |
| Entries | > 200 | > 500 | > 6900 | |
| Possible combinations | 676 | 17,576 | 17,576 | 17,576 |
| Individual languages | Individual languages | Individual languages and language groups with a strong common affiliation | Individual languages (including macrolanguages) | |
| Collective groups | *) | Collective groups for language families or for the other languages of a family | | Collective groups for language families |

*) With Bihari (bh), ISO 639-1 includes a collective language code for a language group.

ISO 639-1

Part 1 of the standard was created for use in terminology, lexicography and linguistics. Until it was officially adopted in 2002, it was maintained under the name ISO 639. Precursors are the Requests for Comments RFC 1766 (March 1995) and RFC 3066 (January 2001). ISO 639-1 is intended to cover not only the languages most widely used in literature, but also the most “developed” languages with a “specialized” vocabulary. Not only individual languages but also language families are included. Each language is represented by a two-letter identifier (alpha-2 code). For example, de stands for German and fr for French. With the 26 Latin letters, 26 × 26 = 676 different identifiers are possible, of which 209 are assigned (as of March 2014). The standard is administered by the International Information Center for Terminology (Infoterm), which was founded by UNESCO.

Further language codes may be added, but only for languages that are added to ISO 639-2 at the same time. Entries that already exist in ISO 639-2 are no longer given two-letter identifiers. This is to ensure compatibility.

ISO 639-2

The later standard ISO 639-2 extends ISO 639-1 to a larger number of languages. Every language code defined in ISO 639-1 is also found in this standard with a three-letter code (alpha-3 code).

For ISO 639-2, the identifier was extended to three letters, so that 26 × 26 × 26 = 17,576 language codes are theoretically possible. So far, 506 identifiers for individual languages and language families have been included (as of March 2014), including the languages of ISO 639-1. The standard is aimed at use in “terminology and bibliography”: it is meant to serve the needs of libraries and to allow works from around the world to be labeled as broadly as possible. Languages were included for which a sufficient body of literature had been published. Since the focus is on the written language, no distinction was made between languages that are very similar in their written form but differ in their spoken form. For example, there is no distinction between Chinese languages such as Standard Chinese and Cantonese.
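The sizes of the two code spaces follow from simple counting over the 26 Latin letters. The following minimal sketch reproduces them and relates them to the assignment counts quoted in this article (209 and 506, as of March 2014); the function name code_space is purely illustrative.

```python
from string import ascii_lowercase


def code_space(length: int) -> int:
    """Number of identifiers of the given length over the 26 Latin letters."""
    return len(ascii_lowercase) ** length


alpha2 = code_space(2)  # two-letter identifiers (ISO 639-1)
alpha3 = code_space(3)  # three-letter identifiers (ISO 639-2, -3 and -5)
print(alpha2, alpha3)

# Share of each space in use, based on the counts cited in the text.
print(f"{209 / alpha2:.1%} of the alpha-2 space is assigned")
print(f"{506 / alpha3:.1%} of the alpha-3 space is assigned in ISO 639-2")
```

The comparison illustrates why the move to three letters leaves ample room for additions, while the two-letter space is already used to a considerable degree.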

The US Library of Congress takes care of this sub-standard and publishes the current code list.

ISO 639-2 extends ISO 639-1 and includes all of its language codes. The two-letter identifiers are continued as three-letter identifiers, in most cases by adding just one letter so that the two remain similar (see below for the special case of the ISO 639-2/B identifiers). The basis for the language codes of this standard was the MARC Code List for Languages, which has been in use since 1968 and is also administered by the Library of Congress.

Historical languages such as Middle High German (gmh, for "German, Middle High") and Old High German (goh, for "German, Old High") are among the identifiers that were added.

Collective language codes

A special feature is the collective language codes, which are not provided for in ISO 639-1. They make it possible to identify groups of languages whose individual members are not meant to receive identifiers of their own. This can apply to small languages for which only a small number of literary works exist or for which no significant increase is expected. Collective codes either cover a whole language family, such as the Iroquoian languages under the identifier iro, or they provide a collective code for all remaining languages of a family in which some member languages have their own entry. The latter is the case for the Sami language family (identifier smi for the others), whose member North Sami already has its own identifier (sme). In the table of language codes, collective codes are usually marked by appending "languages" to the name for the former kind of group and "(other)" for the latter. If a language code is available for an individual language, it should be preferred and no collective code should be assigned. This can also affect language codes that are newly added to the standard.

The standard does not specify how individual languages without their own entry are assigned to one of the collective language codes of ISO 639-2. The Library of Congress, however, refers to the above-mentioned MARC Code List for Languages, which can fulfill this function.

Terminological and bibliographical language codes (T / B)

Another difference from ISO 639-1 and the other sub-standards is the use of terminological codes and bibliographic codes, which are referred to as ISO 639-2/T and ISO 639-2/B. This distinction is made for 22 entries. It exists largely because, before the standard came into use, libraries already had conventions for three-letter identifiers that differed considerably from the two-letter names of ISO 639-1. German is one of these cases: its B code is ger, its T code deu.

Since the naming was meant to continue ISO 639-1, it was decided to introduce two codes wherever the identifiers differed. The terminological identifier continues the designation of ISO 639-1, while the bibliographic identifier exists for reasons of compatibility and reflects the earlier, widely used designation. The standard does not allow T and B codes to be mixed and requires the parties involved to agree on which kind is used before exchanging data.
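Because T and B codes must not be mixed within one data exchange, software typically normalizes incoming codes to one kind. The following is a minimal sketch of such a normalization. It assumes a deliberately partial mapping that contains only the two pairs named in this article (ger/deu and chi/zho); the function names are hypothetical, and a real implementation would load the complete list from the registration authority.

```python
# Partial B -> T mapping: only the pairs mentioned in this article.
# This is an illustrative assumption, not the full ISO 639-2 list.
B_TO_T = {
    "ger": "deu",  # German
    "chi": "zho",  # Chinese
}
T_TO_B = {t: b for b, t in B_TO_T.items()}


def to_terminological(code: str) -> str:
    """Return the ISO 639-2/T form of a code; other codes pass through unchanged."""
    code = code.lower()
    return B_TO_T.get(code, code)


def to_bibliographic(code: str) -> str:
    """Return the ISO 639-2/B form of a code; other codes pass through unchanged."""
    code = code.lower()
    return T_TO_B.get(code, code)


print(to_terminological("ger"), to_bibliographic("zho"), to_terminological("epo"))
```

Codes that have only one form, such as epo for Esperanto, are returned unchanged by both functions.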

Changes

Language codes can be added or changed and their descriptions amended, provided that the stability of the standard is preserved. Language codes according to ISO 639-2/B, which exist only to ensure compatibility, are excluded from changes. A code that has been abandoned after a change should be reused after five years at the earliest.

ISO 639-3

The ISO 639-3 standard was published on February 5, 2007 and, building on the first two sub-standards, is intended to provide comprehensive coverage of all languages of the world. The three-letter identifiers of ISO 639-2 are continued, so ISO 639-3 can theoretically also have 17,576 different identifiers. In practice the number is lower, among other reasons because ISO 639-5 also uses alpha-3 codes, which are disjoint from those of ISO 639-3. All known languages are to be recorded, including living, extinct, historical and constructed languages. More than 6,900 languages have been included in the standard so far. The complete list is intended primarily for use in information technology, where a complete listing of all languages is desirable. It also includes entries for the Swiss German dialects (gsw, for "German, SWiss"), Kölsch (ksh) and the Bavarian dialects (bar).

It is managed by the organization SIL International, which uses the Ethnologue to record living languages (with exceptions) and their language codes. In the 15th edition of the Ethnologue, the codes previously assigned by SIL were aligned with those of ISO 639-2 to achieve conformity. The entries for other historical and artificial languages come from Linguist List.

With the exception of the bibliographic identifiers (ISO 639-2/B), all identifiers for individual languages in ISO 639-2 are found in this standard. Collective language codes are not used. The three-letter codes are kept unambiguous across the sub-standards, so that codes already used as bibliographic or collective identifiers cannot be reassigned in ISO 639-3.

Macrolanguages

One extension is the use of so-called macrolanguages (umbrella languages, not to be confused with macro families). Several individual languages are subsumed in one entry, for example the Chinese languages in the entry zho, which contains the individual languages Standard Chinese, Hakka, Min Nan and Wu, among others. The more than 50 macrolanguages are formally listed as individual languages in ISO 639-1 (where they are included) and ISO 639-2.

In contrast to languages represented by collective language codes, macrolanguages are intended to group individual languages where, from certain points of view, it appears necessary to treat them as a single language. The registration authority gives examples of this:

  • there is a single highly developed language used by speakers of related languages who share a sense of common identity (Arabic),
  • there is a common written form (Chinese languages with the Chinese script), or
  • different groups develop separately, so that clear identification is necessary, but a common identity still exists (Croatian, Serbian, Bosnian).

As a concept, macrolanguages can reconcile the different approaches of Parts 2 and 3: a single entry from ISO 639-2 that subsumes several entries of ISO 639-3 is inserted into the structure of the third sub-standard. Every macrolanguage code has an equivalent in ISO 639-2, with the exception of Serbo-Croatian (as of August 2007), which originally had an entry in ISO 639-1 that is now obsolete.

Some individual languages that are combined in macrolanguages also have their own entries in ISO 639-1 or ISO 639-2. Norwegian functions as a macrolanguage with the code nor, but Bokmål (nb, nob) and Nynorsk (nn, nno) also have corresponding entries in the other standards.

Grouping languages into macrolanguages can lead to name conflicts, as in the case of Malay. While the code mly denotes the individual language, msa stands for the Malay macrolanguage entry. To avoid confusion, the names of these entries are given a qualifying addition in the list of identifiers.
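The relationship between a macrolanguage and its member languages can be modeled as a simple lookup table. The sketch below is illustrative only: the Norwegian entry (nor with nob and nno) comes from the text above, while the ISO 639-3 member codes shown for zho (cmn for Standard Chinese, hak, nan, wuu) are not stated in this article and cover only the four languages it names. The function names are hypothetical.

```python
# Partial macrolanguage table (assumption: only the examples from the text).
MACROLANGUAGES = {
    "nor": {"nob", "nno"},                # Norwegian: Bokmål, Nynorsk
    "zho": {"cmn", "hak", "nan", "wuu"},  # Chinese: Standard Chinese, Hakka, Min Nan, Wu
}

# Reverse index: individual language -> macrolanguage.
MACRO_OF = {
    member: macro
    for macro, members in MACROLANGUAGES.items()
    for member in members
}


def macrolanguage_of(code: str):
    """Return the macrolanguage a code belongs to, or None if it has none here."""
    return MACRO_OF.get(code.lower())


def same_macrolanguage(a: str, b: str) -> bool:
    """True if both codes are identical or fall under the same macrolanguage."""
    a, b = a.lower(), b.lower()
    return MACRO_OF.get(a, a) == MACRO_OF.get(b, b)


print(macrolanguage_of("nob"), same_macrolanguage("nob", "nno"))
```

Such a table lets an application that only knows the ISO 639-2 code nor still match content tagged with the more specific nob or nno.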

ISO 639-4

ISO 639-4 explains how the standards of the ISO 639 series are to be applied. It does not itself define any language codes. It was published in July 2010.

ISO 639-5

ISO 639-5, which was published on May 15, 2008, extends the collective identifiers of ISO 639-2. The existing identifiers from ISO 639-2 were included. This part of the standard does not share any language codes with ISO 639-3; the two sets of identifiers are disjoint.

This sub-standard offers a hierarchy of language families and allows the codes from sub-standards 1–3 to be structured. Language data can thus be tagged at different levels of generality.

ISO 639-6

The ISO 639-6 standard, published on November 17, 2009, defines four-letter codes (alpha-4) and extends the language codes of Parts 1–3. It was withdrawn on November 25, 2014.

Integration and relationships of the individual sub-standards

The language codes defined in the various sub-standards interact and allow labeling at different levels of granularity. This integration was to be completed only with the publication of ISO 639-4 and ISO 639-6.

Representation of the integration of the individual language codes from the various sub-standards using the example of Manx . The alpha-4 codes are from the draft ISO 639-6 and are subject to change prior to publication.

The standards of the ISO 639 series are related to one another in different ways. ISO 639-3 defines the set of all individual languages (supplemented by the macrolanguage codes), while Part 5 defines a hierarchy of language families. These clearly delimited sets are partly found in the two older sub-standards, Parts 1 and 2, where their elements stand side by side without structure. ISO 639-1 is a subset of Part 2, since the criteria for receiving a two-letter code are stricter.

Representation of the set relationships defined by the sub-standards.

Administration

The code lists are managed by designated registration authorities, whose task is to accept and review requests for new identifiers and for changes to existing entries. The registration authority for ISO 639-1 is Infoterm, for ISO 639-2 the Library of Congress, and for ISO 639-3 SIL International.

Where possible, an identifier should be derived from the name of the language in the language itself. Exceptions may be made if countries in which the language is spoken wish to use a different name.

Special identifiers

The two standards ISO 639-2 and ISO 639-3 define special identifiers that allow texts to be labeled flexibly. Among them is mis (from English "missing code") for languages to which no code has yet been assigned.

The identifiers from qaa to qtz (including all alphabetically intermediate identifiers) are reserved for local use and are not assigned by the registration authority.

The identifier zxx was only introduced later, for marking documents without linguistic content. It can be used for documents that do not contain any text, e.g. sheet music or photos.

Two further special identifiers are mul (from English "multiple languages"), intended for content in several languages when it is not practical to list all individual identifiers, and und (from English "undetermined") for a language that cannot be identified.
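The special identifiers and the local-use range described above can be recognized with a few lines of code. The following is a minimal sketch, assuming that any other string of three Latin letters counts as a regular (potentially assignable) code; the function name classify and the returned labels are illustrative.

```python
import re

# Special identifiers named in the text, with short descriptions.
SPECIAL = {
    "mis": "no code assigned yet",
    "mul": "multiple languages",
    "und": "undetermined",
    "zxx": "no linguistic content",
}

# qaa..qtz: first letter q, second letter a-t, third letter a-z.
_LOCAL_USE = re.compile(r"q[a-t][a-z]")
_ALPHA3 = re.compile(r"[a-z]{3}")


def classify(code: str) -> str:
    """Classify a three-letter code as special, local use, regular or invalid."""
    code = code.lower()
    if code in SPECIAL:
        return "special: " + SPECIAL[code]
    if _LOCAL_USE.fullmatch(code):
        return "local use"
    if _ALPHA3.fullmatch(code):
        return "regular"
    return "invalid"


for sample in ("und", "qaa", "qtz", "qua", "deu"):
    print(sample, "->", classify(sample))
```

The local-use range covers 20 × 26 = 520 identifiers; a code such as qua lies outside it because its second letter comes after t.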

Designation of the language according to RFC 5646

Request for Comments 5646 (RFC 5646) describes how the language codes of ISO 639 are combined with other standards for identifying languages and scripts. It covers the interaction of language codes (ISO 639), geographic codes (ISO 3166-1) and script codes (ISO 15924).

The ISO 3166-1 standard identifies geographical entities and can thus be used to designate the languages and dialects of a specific region. Like ISO 639-1, ISO 3166-1 uses two-letter codes, and it recommends writing geographic codes in capital letters. Language and region codes often coincide: de denotes the German language according to ISO 639-1 and DE the country Germany according to ISO 3166-1; similarly, fr denotes the French language and FR the territory of the French state. However, the same letters can also stand for unrelated things in the two standards, such as BE for Belgium and be for the Belarusian language, or EU for the European Union and eu for the Basque language (“Euskara”). In practice these overlaps cause no problems, as the language code always comes first, before the hyphen.

Writing systems can be identified with ISO 15924. They are represented by a four-letter code, the first letter of which is usually capitalized. For example, Cyrl stands for the Cyrillic script and Latn for the Latin script.

An example of a code according to RFC 5646 is fr-Latn-CA, for French written in the Latin alphabet as used in Canada.

RFC 5646 specifies that no distinction is made between upper and lower case, so fr-Latn-CA is identical to fr-latn-ca. The mixed-case form is meant for human readers, whereas case must be ignored in internal processing.
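The casing conventions and the case-insensitive comparison can be sketched as follows. This is a simplified illustration that only handles tags made of a language code followed by optional script and region codes; it does not implement the full RFC 5646 grammar (variants, extensions, private use), and the function names are hypothetical.

```python
def format_tag(tag: str) -> str:
    """Apply the conventional casing to a language[-script][-region] tag."""
    subtags = tag.split("-")
    formatted = [subtags[0].lower()]       # language code comes first, lower case
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            formatted.append(sub.title())  # script code (ISO 15924), e.g. Latn
        elif len(sub) == 2 and sub.isalpha():
            formatted.append(sub.upper())  # region code (ISO 3166-1), e.g. CA
        else:
            formatted.append(sub.lower())  # anything else is left in lower case
    return "-".join(formatted)


def same_tag(a: str, b: str) -> bool:
    """Compare two tags while ignoring case, as required for processing."""
    return a.lower() == b.lower()


print(format_tag("FR-latn-ca"))
print(same_tag("fr-Latn-CA", "fr-latn-ca"))
```

Keeping comparison and display separate mirrors the rule above: processing relies only on same_tag, while format_tag is used when a tag is shown to people.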

Examples of the language codes according to ISO 639

This table lists various language entries, sorted by language code, and shows the relationships between the sub-standards of ISO 639. Living, historical and artificial languages are listed. Some identifiers do not exist in the other standards, or they exist in a different form.

| Language | ISO 639-1 | ISO 639-2 (B / T) | ISO 639-3 | Kind of example |
| Old Church Slavonic | cu | chu | chu | historical language, sacred language |
| German | de | ger / deu | deu | B and T identifiers for ISO 639-2 |
| Esperanto | eo | epo | epo | constructed language (planned language) |
| Ancient Greek | - | grc | grc | historical language, sacred language, scientific terminology (especially medicine and humanities) |
| Upper Sorbian | - | hsb | hsb | minority language |
| Iroquoian languages | - | iro | - | collective identifier for a language family |
| Japanese | ja | jpn | jpn | alpha-2 and alpha-3 identifiers do not share two letters |
| Latin | la | lat | lat | historical language, sacred language, scientific terminology (especially medicine) |
| Latgalian | lv | lav | lav | falls under the Latvian language without its own entry |
| Ladakhi | - | sit | lbj | language without its own code in ISO 639-2, where it falls under "other Sino-Tibetan languages" |
| Sanskrit | sa | san | san | historical language, still in use as a second language |
| North Sami | se | sme | sme | language with its own code despite the existence of an associated collective identifier |
| Other Sami languages | - | smi | - | language family with a collective identifier, only for languages without a separate entry |
| Klingon | - | tlh | tlh | constructed language invented for the entertainment industry |
| Chinese languages | zh | chi / zho | zho | entry for a language family with a common written language but without mutual intelligibility in the spoken language; macrolanguage in ISO 639-3 |
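The table can be used to check the remark that a three-letter code usually keeps both letters of its two-letter counterpart. The sketch below uses only the alpha-2 and alpha-3 (T) pairs from the table above and tests whether the alpha-2 code appears, in order, inside the alpha-3 code; the helper name is_subsequence is illustrative.

```python
# alpha-2 -> alpha-3 (T) pairs taken from the example table.
ALPHA2_TO_ALPHA3 = {
    "cu": "chu", "de": "deu", "eo": "epo", "ja": "jpn", "la": "lat",
    "lv": "lav", "sa": "san", "se": "sme", "zh": "zho",
}


def is_subsequence(short: str, long: str) -> bool:
    """True if the letters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(letter in remaining for letter in short)


# Pairs where the alpha-3 code does not preserve both alpha-2 letters.
irregular = sorted(
    a2 for a2, a3 in ALPHA2_TO_ALPHA3.items() if not is_subsequence(a2, a3)
)
print(irregular)  # ['ja']
```

Among the table's examples, only Japanese (ja, jpn) breaks the pattern, which matches the note in its row.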

Further precursors and related standards

  • In the German-speaking world, the DIN 2335 standard, adopted in 1986, was used in the past.
  • ISO 15924 (Script Codes) for the identification of writing systems
  • The Library of Congress also maintains the MARC Code List for Languages.
  • The National Information Standards Organization maintains ANSI/NISO Z39.53 (Codes for the Representation of Languages for Information Interchange), a standard for language identifiers that is also administered by the Library of Congress.
