List of Unicode properties

from Wikipedia, the free encyclopedia

The Unicode Standard not only encodes a very large number of characters but also defines properties for each of them that describe the character and its behavior. For example, the properties of the letter Ä show that it is an uppercase letter, that the corresponding lowercase letter is ä, and that it can be decomposed into an A followed by a combining diaeresis (trema).
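
The Ä example can be checked with a short sketch. It assumes Python's standard `unicodedata` module, which exposes a subset of the Unicode Character Database for whatever Unicode version the interpreter ships with; the variable names are only illustrative.

```python
import unicodedata

ch = "\u00c4"  # Ä

name = unicodedata.name(ch)                    # Name property
category = unicodedata.category(ch)            # General_Category ("Lu" = uppercase letter)
lower = ch.lower()                             # lowercase mapping
decomposition = unicodedata.decomposition(ch)  # Decomposition_Mapping as hex code points

print(name, category, lower, decomposition)
```

The decomposition is returned as a string of hexadecimal code points, here the letter A (U+0041) followed by the combining diaeresis (U+0308).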

General

Formally, a Unicode property is a mapping from code points to a particular set of values. The data is published in several simple text files and as an XML file.
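
As a rough illustration of what such a text file looks like, the sketch below parses one semicolon-separated line in the style of `UnicodeData.txt`. The sample line and the field positions reflect my reading of that file's format and should be checked against the actual data file; `parse_unicode_data_line` is a hypothetical helper, not part of any library.

```python
# Illustrative sample line in the style of UnicodeData.txt (assumed, verify against the real file).
SAMPLE_LINE = (
    "00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;"
    "LATIN CAPITAL LETTER A DIAERESIS;;;00E4;"
)


def parse_unicode_data_line(line):
    """Split one line into a few named properties (assumed field order)."""
    fields = line.strip().split(";")
    return {
        "code_point": int(fields[0], 16),
        "Name": fields[1],
        "General_Category": fields[2],
        "Canonical_Combining_Class": int(fields[3]),
        "Bidi_Class": fields[4],
        "Decomposition_Mapping": fields[5],
        "Simple_Lowercase_Mapping": fields[13],
    }


record = parse_unicode_data_line(SAMPLE_LINE)
print(record)
```

Empty fields in such a line stand for the default value of the corresponding property.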

Values

The possible values depend on the property. Most properties are enumerated properties, whose values come from a fixed set. Two special cases of enumerated properties are catalog properties and binary properties. Catalog properties are characterized by the fact that the set of possible values grows with new Unicode versions. Binary properties are enumerated properties with exactly two values, true (Y) and false (N); they indicate whether or not the property applies to a character.

There are also string properties, which assign each character a string of Unicode characters; numeric properties, which assign each character a number; and miscellaneous properties, which cannot be assigned to any of these categories.

Default values

Properties have one or more default values for two reasons. First, the default value is often omitted from the data files to keep them compact. Second, programs must be able to handle text created according to a newer Unicode version, which may contain characters that were not yet assigned when the program was developed. For enumerated properties, one value is usually defined as the default; in a few cases there are several default values, assigned depending on the block. For binary properties, the default value is always N, i.e. not applicable.

For string properties, the default value is always the character itself.
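
Default values can be observed in a small sketch, assuming Python's `unicodedata` module. The noncharacter U+FFFF is used here because it will never be assigned; the fallback string `"<no name>"` is an arbitrary choice for illustration.

```python
import unicodedata

noncharacter = "\uffff"  # a code point that is never assigned a character

# General_Category falls back to the default value "Cn" (unassigned).
default_category = unicodedata.category(noncharacter)

# The Name lookup has no value here, so the caller-supplied fallback is returned.
default_name = unicodedata.name(noncharacter, "<no name>")

# A character without a lowercase mapping maps to itself.
unchanged = "\u4e2d".lower()  # 中

print(default_category, default_name, unchanged)
```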

Aliases

Many properties have one or more aliases in addition to their actual name, often abbreviations. Short aliases are also often defined for the possible values of enumerated properties.

Status

Many properties are normative, i.e. binding for programs that conform to the Unicode Standard and interpret the property. Other properties are marked as informative and serve only as additional information without binding force. A group of properties is marked as contributory. These properties should not be used on their own; they are defined so that other properties can be derived from them. They usually identify a set of exceptional characters that would otherwise not be covered. Finally, there are provisional properties, which were included tentatively to see whether they prove useful in practice.

Some properties are also marked as deprecated. These should no longer be used, for various reasons, but remain in the Unicode Standard for backward compatibility.

Stability

To ensure backward compatibility, some properties, once assigned to a character, are never changed or are changed only in certain predefined ways. For example, the stability policy stipulates that the name of a character is never changed, even if it turns out to be incorrect.

Properties

The following lists show all Unicode properties as of Unicode 6.3, grouped as in the official documentation. For each property, the name, an abbreviated alias (if available), the status, the value type, and a description are given.

General

The general properties give a rough overview of a character. Among other things, they are used in regular expressions in environments that support querying Unicode properties, such as Perl.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Name | na | normative | miscellaneous | Name of the character * |
| Name_Alias | | normative | miscellaneous | Aliases, mainly used for control characters, for which the Name property is empty |
| Block | blk | normative | catalog | Unicode block in which the character is located |
| Age | age | normative / informative | catalog | Version in which the character was encoded |
| General_Category | gc | normative | enumerated | Rough classification of all characters, see separate section |
| Script | sc | informative | catalog | Script of the character, e.g. Latin, Greek, Cyrillic; Common for characters that are used in several scripts |
| Script_Extensions | | informative | miscellaneous | Scripts for characters that are used in several scripts |
| White_Space | WSpace | normative | binary | Marks a character as white space |
| Alphabetic | Alpha | informative | binary | Characters from alphabets |
| Hangul_Syllable_Type | hst | normative | enumerated | Determines the syllable blocks in Korean |
| Noncharacter_Code_Point | NChar | normative | binary | Code points reserved as noncharacters |
| Default_Ignorable_Code_Point | DI | normative | binary | Characters that should be ignored in rendering if the program does not support them |
| Deprecated | Dep | normative | binary | Deprecated characters that should no longer be used |
| Logical_Order_Exception | LOE | normative | binary | Characters that must be swapped with the following character before the Unicode Collation Algorithm can be applied |
| Variation_Selector | VS | normative | binary | Variation selectors, which choose between different display variants of the preceding character |

* In addition to individual characters, some character sequences also have their own name.

General category

The General_Category property is one of the basic properties, used both in the Unicode Standard itself and in much other technical documentation. It classifies all characters as letters, numbers, punctuation, and so on, according to their main use. The following table lists the possible values.

| Category | Code | Meaning | Examples |
|---|---|---|---|
| Letter | L | | |
| Uppercase letter | Lu | Uppercase letter | A, Ä, Δ, DŽ |
| Lowercase letter | Ll | Lowercase letter | a, ä, δ, dž |
| Titlecase letter | Lt | Characters in titlecase; only a few characters, which encode a digraph | Dž |
| Modifier letter | Lm | Letters that modify the preceding letter | Letters from the Unicode block Spacing Modifier Letters |
| Other letter | Lo | Letters from scripts without a case distinction (e.g. Hebrew), CJK, and others | ב, 丌 |
| Mark | M | | |
| Nonspacing mark | Mn | Combining character that is placed above or below the preceding character | Combining diacritical marks |
| Spacing mark | Mc | Combining character that itself takes up space | Indic vowel signs |
| Enclosing mark | Me | Combining character that completely surrounds the preceding character | Combining enclosing circle |
| Number | N | | |
| Decimal digit | Nd | Digits | 0, 1 |
| Letter number | Nl | Letters that are used as numbers | |
| Other number | No | Other numbers, such as superscript, circled, or fraction characters | ², ½, ② |
| Punctuation | P | | |
| Connector punctuation | Pc | Characters that join two parts into one word | Underscore |
| Dash punctuation | Pd | Various dashes: hyphen, dash, etc. | hyphen-minus (-), en dash (–) |
| Open punctuation | Ps | Opening brackets | (, [, { |
| Close punctuation | Pe | Closing brackets | ), ], } |
| Initial quote punctuation | Pi | Opening quotation marks (may also serve as closing quotation marks, depending on the language) | « |
| Final quote punctuation | Pf | Closing quotation marks (may also serve as opening quotation marks, depending on the language) | » |
| Other punctuation | Po | Punctuation marks that do not fall into any of the above categories | ! . , : ; ? § |
| Symbol | S | | |
| Math symbol | Sm | Symbols used in mathematical contexts | +, <, >, ± |
| Currency symbol | Sc | Symbols that denote a currency | $, € |
| Modifier symbol | Sk | Symbols that modify the preceding character | Symbols from the Unicode block Spacing Modifier Letters |
| Other symbol | So | Symbols that do not fall into any of the above categories | ⛔, © |
| Separator | Z | | |
| Space separator | Zs | Spaces of various widths | Space, no-break space |
| Line separator | Zl | Line separator | U+2028 |
| Paragraph separator | Zp | Paragraph separator | U+2029 |
| Other | C | | |
| Control | Cc | General control characters | BEL |
| Format | Cf | Control characters for formatting | Soft hyphen, bidirectional control characters |
| Surrogate | Cs | Surrogates | |
| Private use | Co | Characters for private use | U+F8FF |
| Unassigned | Cn | Code points to which no character has (yet) been assigned | |
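
The two-letter codes in the table can be queried directly. The following is a minimal sketch, assuming Python's `unicodedata.category`, that groups the characters of a string by their General_Category; `group_by_category` is a hypothetical helper name.

```python
import unicodedata
from collections import defaultdict


def group_by_category(text):
    """Map each General_Category code to the characters of text that carry it."""
    groups = defaultdict(list)
    for ch in text:
        groups[unicodedata.category(ch)].append(ch)
    return dict(groups)


# A, ä, ǅ (U+01C5), space, 1, ², underscore, opening parenthesis, €
groups = group_by_category("A\u00e4\u01c5 1\u00b2_(\u20ac")
print(groups)
```

The first letter of each code gives the major class (L, M, N, P, S, Z, C), so a coarser grouping only needs `unicodedata.category(ch)[0]`.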

Uppercase and lowercase

Many properties concern case. They determine whether a character is an uppercase or lowercase letter, which lowercase letter corresponds to a given uppercase letter and vice versa, and more. To compare strings regardless of case, a normal form called case folding is defined. These properties are used, among other things, by the various Unicode casing algorithms.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Uppercase | Upper | informative | binary | Marks a character as an uppercase letter |
| Lowercase | Lower | informative | binary | Marks a character as a lowercase letter |
| Cased | | informative | binary | All characters that are uppercase, lowercase, or titlecase letters |
| Simple_Lowercase_Mapping | slc | normative | string | Corresponding lowercase letter (if it is a single character) |
| Simple_Titlecase_Mapping | stc | normative | string | Corresponding titlecase letter (if it is a single character) |
| Simple_Uppercase_Mapping | suc | normative | string | Corresponding uppercase letter (if it is a single character) |
| Simple_Case_Folding | scf | normative | string | Corresponding case-folded letter (if it is a single character) |
| Lowercase_Mapping | lc | informative | string | Corresponding mapping that also covers more complex conversions |
| Titlecase_Mapping | tc | informative | string | Corresponding mapping that also covers more complex conversions |
| Uppercase_Mapping | uc | informative | string | Corresponding mapping that also covers more complex conversions |
| Case_Folding | cf | normative | string | Corresponding mapping that also covers more complex conversions |
| Soft_Dotted | SD | normative | binary | i, j, and similar characters whose dot is dropped when they are uppercased or combined with a diacritical mark |
| Case_Ignorable | CI | informative | binary | Characters that are irrelevant to questions of case |
| Changes_When_Lowercased | CWL | informative | binary | Characters that change when converted to lowercase |
| Changes_When_Titlecased | CWT | informative | binary | Characters that change when converted to titlecase |
| Changes_When_Uppercased | CWU | informative | binary | Characters that change when converted to uppercase |
| Changes_When_Casefolded | CWCF | informative | binary | Characters that change when case-folded |
| Changes_When_Casemapped | CWCM | informative | binary | Characters that change under any case conversion |
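
The difference between simple one-character mappings and the more complex mappings shows up with ß. This is a hedged sketch using Python's built-in string methods, which, as far as I know, apply the full case mappings and case folding rather than the simple ones.

```python
sharp_s = "\u00df"  # ß

# The full uppercase mapping expands one character into two.
upper = sharp_s.upper()

# Case folding yields a form suited to caseless comparison.
folded = sharp_s.casefold()
same = "STRASSE".casefold() == "stra\u00dfe".casefold()

# The digraph dž (U+01C6) has a separate titlecase form Dž (U+01C5).
dz_title = "\u01c6".title()

print(upper, folded, same, dz_title)
```

Comparing with `casefold()` rather than `lower()` matters here, because `"STRASSE".lower()` and `"straße".lower()` are different strings.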

Numeric

The following properties concern the numeric properties of characters, especially the number characters in Unicode.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Numeric_Value | nv | normative | numeric | Numeric value of the character |
| Numeric_Type | nt | normative | enumerated | Type (decimal, digit, numeric) |
| ASCII_Hex_Digit | AHex | normative | binary | ASCII characters used as hexadecimal digits, i.e. 0 to 9, a to f, and A to F |
| Hex_Digit | Hex | informative | binary | Characters used as hexadecimal digits, including their variants |
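
The three numeric types can be told apart with a short sketch, assuming the `numeric`, `digit`, and `decimal` functions of Python's `unicodedata` module; passing `None` as the second argument is just a way to get a fallback instead of an exception.

```python
import unicodedata

# Numeric_Type "numeric": a value, but not a digit.
half = unicodedata.numeric("\u00bd")  # ½

# Numeric_Type "digit": a digit that is not part of a decimal system.
superscript_two = unicodedata.digit("\u00b2")  # ²

# Numeric_Type "decimal": a digit usable in positional decimal notation.
arabic_three = unicodedata.decimal("\u0663")  # ARABIC-INDIC DIGIT THREE

# The superscript two has no decimal value, so the fallback is returned.
no_decimal = unicodedata.decimal("\u00b2", None)

print(half, superscript_two, arabic_three, no_decimal)
```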

Normalization

A number of properties concern the various forms of normalization of Unicode text.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Canonical_Combining_Class | ccc | normative | enumerated / numeric | Specifies which combining characters interact with each other and in which order they are sorted |
| Decomposition_Mapping | dm | normative | string | Decomposition of a character |
| Decomposition_Type | dt | normative / informative | enumerated | Type of decomposition (canonical, or a change of font, line-break behavior, etc.) |
| Composition_Exclusion | CE | normative | binary | Characters with a canonical decomposition that should not be used in the composed normalization forms |
| Full_Composition_Exclusion | Comp_Ex | normative | binary | Characters with a canonical decomposition that should not be used in the composed normalization forms |
| FC_NFKC_Closure | FC_NFKC | normative, deprecated | string | Resulting form when the character is first case-folded and then converted to NFKC |
| NFC_Quick_Check | NFC_QC | normative | enumerated | Allows a quick check of whether a string is in the given normalization form |
| NFKC_Quick_Check | NFKC_QC | normative | enumerated | Allows a quick check of whether a string is in the given normalization form |
| NFD_Quick_Check | NFD_QC | normative | enumerated | Allows a quick check of whether a string is in the given normalization form |
| NFKD_Quick_Check | NFKD_QC | normative | enumerated | Allows a quick check of whether a string is in the given normalization form |
| Expands_On_NFC | XO_NFC | normative, deprecated | binary | Characters that become several characters when converted to the corresponding normalization form |
| Expands_On_NFD | XO_NFD | normative, deprecated | binary | Characters that become several characters when converted to the corresponding normalization form |
| Expands_On_NFKC | XO_NFKC | normative, deprecated | binary | Characters that become several characters when converted to the corresponding normalization form |
| Expands_On_NFKD | XO_NFKD | normative, deprecated | binary | Characters that become several characters when converted to the corresponding normalization form |
| NFKC_Casefold | NFKC_CF | informative | string | Character after conversion to NFKC followed by case folding |
| Changes_When_NFKC_Casefolded | CWKCF | informative | binary | Characters that change when they are first converted to NFKC and then case-folded |
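
How decomposition mappings and combining classes act together can be seen in a small sketch, assuming `unicodedata.normalize` and `unicodedata.combining` from Python's standard library.

```python
import unicodedata

composed = "\u00c4"  # Ä

# NFD applies the canonical Decomposition_Mapping: A + combining diaeresis.
decomposed = unicodedata.normalize("NFD", composed)

# NFC composes the sequence again.
recomposed = unicodedata.normalize("NFC", decomposed)

# Canonical_Combining_Class of the combining diaeresis (U+0308).
ccc = unicodedata.combining("\u0308")

# Marks are sorted by combining class: dot below (220) moves before diaeresis (230).
reordered = unicodedata.normalize("NFD", "a\u0308\u0323")

# NFKC also applies compatibility decompositions, e.g. for the fi ligature.
ligature = unicodedata.normalize("NFKC", "\ufb01")

print(len(decomposed), recomposed == composed, ccc, ligature)
```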

Presentation

The following properties play a role in the appearance of text.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Joining_Group | jg | normative | enumerated | Determines how or whether a letter joins with its neighbors, see Arabic in Unicode |
| Joining_Type | jt | normative | enumerated | Determines how or whether a letter joins with its neighbors, see Arabic in Unicode |
| Join_Control | Join_C | normative | binary | Control characters for ligatures and letter joining |
| Line_Break | lb | normative | enumerated | Determines the line-breaking behavior for the Unicode Line Breaking Algorithm |
| Grapheme_Cluster_Break | GCB | informative | enumerated | Used in the segmentation algorithms to determine the boundaries of graphemes, sentences, and words |
| Sentence_Break | SB | informative | enumerated | Used in the segmentation algorithms to determine the boundaries of graphemes, sentences, and words |
| Word_Break | WB | informative | enumerated | Used in the segmentation algorithms to determine the boundaries of graphemes, sentences, and words |
| East_Asian_Width | ea | informative | enumerated | Width of a character, which plays a role in rendering East Asian text |
| Prepended_Concatenation_Mark | PCM | informative | binary | Characters that span the following characters, such as the Syriac abbreviation mark |
Bidi

The following properties are used for displaying bidirectional text.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Bidi_Class | bc | normative | enumerated | Determines the writing direction in the Unicode Bidirectional Algorithm |
| Bidi_Control | Bidi_C | normative | binary | Bidirectional control characters |
| Bidi_Mirrored | Bidi_M | normative | binary | Indicates whether a character must be displayed mirrored in right-to-left text |
| Bidi_Mirroring_Glyph | bmg | informative | miscellaneous | Possible mirror image of the character, e.g. ( as the mirror image of ); in some cases no such character exists |
| Bidi_Paired_Bracket | bpb | normative | miscellaneous | Counterpart of a bracket |
| Bidi_Paired_Bracket_Type | bpt | normative | enumerated | Distinguishes opening and closing brackets |
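
Bidi_Class and Bidi_Mirrored can be inspected with a short sketch, assuming the `bidirectional` and `mirrored` functions of Python's `unicodedata` module. It only reads the properties and does not implement the bidirectional algorithm itself.

```python
import unicodedata

# Latin A, Hebrew alef, Arabic alef, and an ASCII digit.
sample = "A\u05d0\u06271"
classes = [unicodedata.bidirectional(ch) for ch in sample]

# Bidi_Mirrored is reported as 1 (true) or 0 (false).
paren_mirrored = unicodedata.mirrored("(")
letter_mirrored = unicodedata.mirrored("A")

print(classes, paren_mirrored, letter_mirrored)
```

The values are the short aliases of the Bidi_Class values: L (left-to-right), R (right-to-left), AL (Arabic letter), and EN (European number).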

Identifiers

The following properties provide one way of defining the characters allowed in identifiers. In contrast to classic programming languages, which allow only ASCII characters, languages that use these properties allow most Unicode characters in identifiers. One example of a language whose syntax largely permits this range is JavaScript.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| ID_Start | IDS | informative | binary | Characters that can appear at the beginning of an identifier |
| ID_Continue | IDC | informative | binary | Characters that can appear in the subsequent positions of an identifier |
| XID_Start | XIDS | informative | binary | Characters that can appear at the beginning of an identifier |
| XID_Continue | XIDC | informative | binary | Characters that can appear in the subsequent positions of an identifier |
| Pattern_Syntax | Pat_Syn | normative | binary | Characters that can be used in the syntax of patterns |
| Pattern_White_Space | Pat_WS | normative | binary | Characters that should be treated as white space |
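
Python is another language whose identifier syntax is, according to its language reference, based on XID_Start and XID_Continue (with the underscore additionally allowed at the start). A minimal sketch using the built-in `str.isidentifier` method, with arbitrary sample strings:

```python
candidates = ["\u00c4rger", "a1", "1abc", "a-b", "\u5909\u6570"]

# str.isidentifier tests the identifier grammar only; it does not check for keywords.
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(name, ok)
```

A digit is allowed in a later position but not at the start, and the hyphen-minus is not allowed at all.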

CJK

Some properties apply to CJK characters. A number of further properties exist; see the Unihan section.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Ideographic | Ideo | informative | binary | CJK ideographs |
| IDS_Binary_Operator | IDSB | normative | binary | Ideographic description characters |
| IDS_Trinary_Operator | IDST | normative | binary | Ideographic description characters |
| Unified_Ideographic | UIdeo | normative | binary | Chinese characters that can be used in ideographic description sequences |
| Radical | | normative | binary | Radicals that can be used in ideographic description sequences |

Others

Some properties mainly provide information about a character without being intended for a specific application.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Math | | informative | binary | Mathematical characters in Unicode |
| Quotation_Mark | QMark | informative | binary | Quotation marks |
| Dash | | informative | binary | Horizontal dashes of various lengths |
| Hyphen | | informative, deprecated | binary | Hyphens and similar characters; originally used for line breaking and replaced there by the Line_Break property |
| STerm | | informative | binary | Characters that mark the end of a sentence |
| Terminal_Punctuation | Term | informative | binary | Punctuation marks that usually mark the end of a sentence |
| Diacritic | Dia | informative | binary | Diacritical marks |
| Extender | Ext | informative | binary | Characters that extend the preceding letter, such as length marks |
| Grapheme_Base | Gr_Base | normative | binary | Older property for determining graphemes; see Grapheme_Cluster_Break in the Presentation section for the newer method |
| Grapheme_Extend | Gr_Ext | normative | binary | Older property for determining graphemes; see Grapheme_Cluster_Break in the Presentation section for the newer method |
| Grapheme_Link | Gr_Link | informative, deprecated | binary | Older property for determining graphemes; can be derived from the Canonical_Combining_Class property |
| Unicode_1_Name | na1 | informative | miscellaneous | Old name from Unicode version 1.0 |
| ISO_Comment | isc | informative, deprecated | miscellaneous | Originally used for comments in the ISO 10646 names list, now empty |
| Indic_Matra_Category | | provisional | enumerated | Determines the placement of dependent vowels in Indic scripts |
| Indic_Syllabic_Category | | provisional | enumerated | Determines the categories of syllable-forming components in Indic scripts |

Contributory properties

These properties are not used on their own; other properties are derived from them. Mostly they are sets of exceptions that are not covered by the general category.

| Property | Alias | Status | Values | Description |
|---|---|---|---|---|
| Other_Alphabetic | OAlpha | contributory | binary | For Alphabetic |
| Other_Default_Ignorable_Code_Point | ODI | contributory | binary | For Default_Ignorable_Code_Point |
| Other_Grapheme_Extend | OGr_Ext | contributory | binary | For Grapheme_Extend |
| Other_ID_Start | OIDS | contributory | binary | For backward compatibility of ID_Start |
| Other_ID_Continue | OIDC | contributory | binary | For backward compatibility of ID_Continue |
| Other_Lowercase | OLower | contributory | binary | For Lowercase |
| Other_Math | OMath | contributory | binary | For Math |
| Other_Uppercase | OUpper | contributory | binary | For Uppercase |
| Jamo_Short_Name | JSN | contributory | miscellaneous | For the Name of Korean syllable blocks |

Unihan

For CJK characters, which were included in Unicode as part of Han unification, there is a separate database that provides properties specifically for these characters. The source properties identify the character's encoding in various national character sets. In addition to the properties listed here, there are a number of provisional properties that provide further information on pronunciation, meaning, alternative encodings, etc.

| Property | Status | Values | Description |
|---|---|---|---|
| kAccountingNumeric | informative | numeric | Numeric value of forgery-resistant (accounting) numerals |
| kOtherNumeric | informative | numeric | Numeric value of a character that is rarely used as a numeral |
| kPrimaryNumeric | informative | numeric | Numeric value of an ordinary numeral |
| kCompatibilityVariant | normative | string | Normalization of the character if it is a compatibility variant |
| kIICore | normative | miscellaneous | Character that should be present on all systems |
| kIRG_GSource | normative | miscellaneous | Source: China / Singapore |
| kIRG_HSource | normative | miscellaneous | Source: Hong Kong |
| kIRG_JSource | normative | miscellaneous | Source: Japan |
| kIRG_KPSource | normative | miscellaneous | Source: North Korea |
| kIRG_KSource | normative | miscellaneous | Source: South Korea |
| kIRG_MSource | normative | miscellaneous | Source: Macau |
| kIRG_TSource | normative | miscellaneous | Source: Taiwan |
| kIRG_USource | normative | miscellaneous | Source: USA |
| kIRG_VSource | normative | miscellaneous | Source: Vietnam |
| kRSUnicode | informative | miscellaneous | Radical and number of additional strokes |
| kMandarin | informative | miscellaneous | Pinyin reading |
| kTotalStrokes | informative | miscellaneous | Number of strokes including the radical |
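
To give an idea of how such a database can be consumed, the sketch below parses lines in a tab-separated layout of code point, property name, and value. That layout and the two sample entries are assumptions made for illustration and should be checked against the actual Unihan data files; `parse_unihan` is a hypothetical helper.

```python
# Illustrative sample in an assumed "U+XXXX<TAB>property<TAB>value" layout.
SAMPLE = "# comment line\nU+4E00\tkMandarin\ty\u012b\nU+4E00\tkTotalStrokes\t1\n"


def parse_unihan(text):
    """Return {character: {property: value}} for all data lines in text."""
    table = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, prop, value = line.split("\t", 2)
        char = chr(int(code[2:], 16))  # strip the "U+" prefix
        table.setdefault(char, {})[prop] = value
    return table


unihan = parse_unihan(SAMPLE)
print(unihan)
```

All values are kept as strings here; a real consumer would convert numeric properties such as kTotalStrokes as needed.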

Sources

  • Mark Davis, Ken Whistler: Unicode Standard Annex #44: Unicode Character Database. (online)
  • John H. Jenkins, Richard Cook, Ken Lunde: Unicode Standard Annex #38: Unicode Han Database. (online)
  • Ken Whistler, Asmus Freytag: Unicode Technical Report #23: The Unicode Character Property Model. (online)
  • Eric Muller: Unicode Standard Annex #42: Unicode Character Database in XML. (online)

References

  1. perlretut: More on characters, strings, and character classes. Perl documentation at perldoc.perl.org
  2. Addison Phillips: Unicode Standard Annex #34: Unicode Named Character Sequences. (online)
  3. ECMAScript Language Specification, 5.1 Edition, 7.6 Identifier Names and Identifiers

Web links