List of Unicode properties
The Unicode standard not only encodes a very large number of characters, but also defines a number of properties for each of these characters that describe the character and its behavior. For example, the properties of the letter Ä show that it is an uppercase letter, that the corresponding lowercase letter is ä, and that it can be decomposed into an A followed by a combining diaeresis (trema).
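The properties mentioned for Ä can be inspected with a short sketch. It assumes Python's standard `unicodedata` module, which exposes a subset of the Unicode Character Database; the dictionary keys are illustrative names, not official property names.

```python
import unicodedata

# U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS, written as an escape so the
# precomposed form is used regardless of how this file was saved.
ch = "\u00c4"

info = {
    "name": unicodedata.name(ch),
    "general_category": unicodedata.category(ch),    # "Lu" = uppercase letter
    "lowercase": ch.lower(),                          # corresponding lowercase letter
    "decomposition": unicodedata.decomposition(ch),  # code points in hexadecimal
}

for key, value in info.items():
    print(f"{key}: {value}")
```

The decomposition is reported as hexadecimal code points: U+0041 (A) followed by U+0308 (combining diaeresis).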
General
Formally, a Unicode property is defined as a mapping from code points to a particular range of values. The data is published in several simple text files and as an XML file.
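As an illustration of the text-file form, the following sketch parses one record in the semicolon-separated layout used by the main data file. The file name `UnicodeData.txt`, the 15-field layout and the field positions are stated here as assumptions about the Unicode Character Database, not taken from this article; the record type and function name are hypothetical.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    """A few of the fields of one record (hypothetical helper type)."""
    code_point: int
    name: str
    general_category: str
    canonical_combining_class: int
    bidi_class: str
    decomposition: str
    simple_lowercase: str


def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    """Parse one semicolon-separated record (assumed 15-field layout)."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        code_point=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        canonical_combining_class=int(fields[3]),
        bidi_class=fields[4],
        decomposition=fields[5],
        simple_lowercase=fields[13],  # empty when there is no mapping
    )


# Sample record for U+00C4 in the assumed layout.
SAMPLE_LINE = (
    "00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;"
    "LATIN CAPITAL LETTER A DIAERESIS;;;00E4;"
)

record = parse_unicode_data_line(SAMPLE_LINE)
print(record)
```

Empty fields are left as empty strings, which mirrors the way default values are omitted from the tables.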
Values
Different value ranges are possible depending on the property. Most properties are enumerated properties, whose range of values consists of a fixed set. Enumerated properties are further subdivided into catalog properties and binary properties. Catalog properties are characterized by the fact that the set of possible values grows gradually with new Unicode versions. Binary properties are enumerated properties with exactly two values, true (Y) and false (N), which indicate whether or not the property applies to a character.
There are also string properties, which assign each character a string of Unicode characters, numeric properties, which assign each character a number, and other properties that cannot be assigned to any of these categories.
Default values
Properties have one or more default values for several reasons. First, the default value is often omitted from the data tables to keep them compact. Second, programs must be able to handle text that was created under a newer Unicode version and may therefore contain characters that were not yet assigned when the program was developed. For enumerated properties, one value is usually defined as the default; in a few cases there are several default values, which are assigned depending on the block. For binary properties, the default value is always N, i.e. not applicable.
For string properties, the default value is always the character itself.
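A few of these defaults can be observed with Python's standard `unicodedata` module; this is a minimal sketch, and the variable names are illustrative. It assumes that U+FFFF, a noncharacter, reports the general category used for unassigned code points.

```python
import unicodedata

# U+FFFF is a noncharacter and therefore never gets a character assigned.
noncharacter = "\uffff"

# Enumerated property: unassigned code points fall back to the default "Cn".
category_default = unicodedata.category(noncharacter)

# Numeric property: "A" has no numeric value, so the supplied fallback is returned.
numeric_default = unicodedata.numeric("A", None)

# Binary property: Bidi_Mirrored does not apply to "A" (0 corresponds to N).
mirrored_default = unicodedata.mirrored("A")

# String property: a character without a lowercase mapping maps to itself.
lowercase_default = "1".lower()

print(category_default, numeric_default, mirrored_default, lowercase_default)
```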
Aliases
Many properties have one or more aliases in addition to their actual name. Often these are abbreviations. Short aliases are also often specified for the possible values of enumerated properties.
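Alias handling can be sketched as a small lookup table. The aliases below are taken from the tables in this article; the function names are hypothetical, and the matching rule (ignoring case, spaces, underscores and hyphens) is a simplification chosen for this example rather than the exact matching rule of the standard.

```python
import re

# Short alias -> long property name, taken from the tables in this article.
ALIASES = {
    "gc": "General_Category",
    "sc": "Script",
    "blk": "Block",
    "WSpace": "White_Space",
    "ccc": "Canonical_Combining_Class",
}


def _loose_key(name: str) -> str:
    """Normalize a name for comparison (simplified loose matching)."""
    return re.sub(r"[\s_-]+", "", name).lower()


# Both the short alias and the long name resolve to the long name.
_LOOKUP = {}
for short, long_name in ALIASES.items():
    _LOOKUP[_loose_key(short)] = long_name
    _LOOKUP[_loose_key(long_name)] = long_name


def resolve_property_name(name: str):
    """Return the long property name, or None if the name is unknown."""
    return _LOOKUP.get(_loose_key(name))


print(resolve_property_name("gc"))
print(resolve_property_name("white-space"))
```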
Status
Many properties are normative, i.e. binding for programs that conform to the Unicode standard and interpret the property. Other properties are marked as informative and serve only as additional information without being binding. A further group of properties is marked as contributing. These properties should not be used on their own; they were defined so that other properties can be derived from them, and they usually identify an exceptional set of characters that would otherwise not be covered. Finally, there are provisional properties, which were initially included with reservations to see whether they would prove themselves in practice.
Some properties are also marked as deprecated ("obsolete"). These should no longer be used for various reasons, but remain in the Unicode standard for backward compatibility.
Stability
To ensure backward compatibility, some properties, once they have been set for a character, are never changed or are changed only in certain ways that are known in advance. For example, the standard stipulates that the name of a character will never be changed, even if it turns out to be incorrect.
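A commonly cited case of name stability is U+FE18, whose formal name contains the misspelling "BRAKCET". The sketch below uses Python's standard `unicodedata` module and assumes Python 3.3 or later, where `unicodedata.lookup` is documented to accept name aliases as well; that the corrected spelling is registered as such an alias is an assumption of this example.

```python
import unicodedata

ch = "\ufe18"

# The formal name keeps its historical typo because names are never changed.
official_name = unicodedata.name(ch)
print(official_name)

# The corrected spelling is assumed to be available as a name alias.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
resolved = unicodedata.lookup(corrected)
print(resolved == ch)
```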
Properties
The following lists show all Unicode properties, grouped as in the official documentation, as of Unicode 6.3. For each property, the name, an abbreviated alias (if available), the status, the type of value range and a description are given.
General
The general properties give a rough overview of the character. They are used, among other things, in regular expressions, provided the engine supports querying Unicode properties, as Perl does.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Name | na | normative | Others | Name of the character |
Name_Alias | | normative | Others | Aliases, mainly used for control characters, for which the Name property remains empty |
Block | blk | normative | Catalog | Unicode block in which the character is located |
Age | age | normative / informative | Catalog | Version in which the character was first assigned |
General_Category | gc | normative | enumerated | Rough classification of all characters, see the separate section |
Script | sc | informative | Catalog | The character's writing system, e.g. Latin, Greek, Cyrillic, etc.; Common for characters that are used in several writing systems |
Script_Extensions | | informative | Others | Writing systems for characters that are used in several systems |
White_Space | WSpace | normative | binary | Marks a character as white space |
Alphabetic | Alpha | informative | binary | Characters from alphabets |
Hangul_Syllable_Type | hst | normative | enumerated | Determination of the syllable blocks in Korean |
Noncharacter_Code_Point | NChar | normative | binary | Reserved code points that are not characters |
Default_Ignorable_Code_Point | DI | normative | binary | Characters that should be ignored in the display if the program does not support them |
Deprecated | Dep | normative | binary | Deprecated characters that should no longer be used |
Logical_Order_Exception | LOE | normative | binary | Characters that must be swapped with the following character before the Unicode Collation Algorithm can be applied |
Variation_Selector | VS | normative | binary | Variation selectors that choose between different display variants of the preceding character |
General category
The property General_Category is one of the basic properties used both in the Unicode standard itself and in many other technical documents. It divides all characters into letters, numbers, punctuation and others according to their main use. The following table lists the possible values.
Category | Code | Meaning | Examples |
---|---|---|---|
Letter | L | | |
Uppercase letter | Lu | Uppercase letter | A, Ä, Δ, Ǆ |
Lowercase letter | Ll | Lowercase letter | a, ä, δ, ǆ |
Titlecase letter | Lt | Characters in titlecase form; these are just a few characters, each of which encodes a digraph | ǅ |
Modifier letter | Lm | Letters that modify the preceding letter | Letters from the Unicode block Spacing Modifier Letters |
Other letter | Lo | Letters from scripts without case distinction (e.g. Hebrew), CJK and others | ב, 丌 |
Mark (combining character) | M | | |
Nonspacing mark | Mn | Combining character that is placed above or below the preceding character | Combining diacritical marks |
Spacing mark | Mc | Combining character that itself takes up space | Indic vowel signs |
Enclosing mark | Me | Combining character that completely surrounds the preceding character | Combining enclosing circle |
Number | N | | |
Decimal digit | Nd | Digits | 0, 1 |
Letter number | Nl | Letters that are used as numbers | Ⅲ |
Other number | No | Other numbers, such as superscript digits, circled digits or fractions | ², ½, ② |
Punctuation | P | | |
Connector punctuation | Pc | Characters that join two parts into one word | Underscore |
Dash punctuation | Pd | Various dashes: hyphen, en dash, etc. | -, ‐, – |
Open punctuation | Ps | Opening brackets | (, [, { |
Close punctuation | Pe | Closing brackets | ), ], } |
Initial quotation mark | Pi | Opening quotation marks (can also be used as closing quotation marks depending on the language) | « |
Final quotation mark | Pf | Closing quotation marks (can also be used as opening quotation marks depending on the language) | » |
Other punctuation | Po | Punctuation marks that do not fall into any of the above categories | ! . , : ; ? § |
Symbol | S | | |
Math symbol | Sm | Symbols used in mathematical contexts | +, <, >, ± |
Currency symbol | Sc | Symbols that denote a currency | $, € |
Modifier symbol | Sk | Symbols that modify the preceding character | Symbols from the Unicode block Spacing Modifier Letters |
Other symbol | So | Symbols that do not fall into any of the above categories | ⛔, © |
Separator | Z | | |
Space separator | Zs | Spaces of various widths | Space, no-break space |
Line separator | Zl | Line separator (U+2028) | |
Paragraph separator | Zp | Paragraph separator (U+2029) | |
Other | C | | |
Control | Cc | General control characters | BEL |
Format | Cf | Control characters for formatting | Soft hyphen, bidirectional control characters |
Surrogate | Cs | Surrogates | |
Private use | Co | Characters for private use | U+F8FF |
Unassigned | Cn | Code points to which no character has (yet) been assigned | |
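The two-letter codes in the table are what Python's standard `unicodedata.category` returns, and the first letter of a code identifies the major class. The following sketch counts categories in a short string; the helper names and the English labels in `MAJOR_CLASSES` are choices made for this example.

```python
import unicodedata
from collections import Counter

# First letter of the two-letter code -> major class (labels chosen here).
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def major_class(ch: str) -> str:
    """Return the major class of a single character."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]


def category_histogram(text: str) -> Counter:
    """Count how often each two-letter General_Category code occurs."""
    return Counter(unicodedata.category(ch) for ch in text)


# U+01C4, U+01C5, U+01C6 are the uppercase, titlecase and lowercase DZ digraphs.
sample = "\u01c4 \u01c5 \u01c6 = 3 \u20ac!"
histogram = category_histogram(sample)
print(histogram)
```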
Upper / lower case
Many properties deal with case. They determine whether a character is an uppercase or lowercase letter, which lowercase letter corresponds to a given uppercase letter and vice versa, and more. To compare character strings regardless of case, a normal form called case folding is defined. These properties are used, among other things, by the various Unicode casing algorithms.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Uppercase | Upper | informative | binary | Marks a character as an uppercase letter |
Lowercase | Lower | informative | binary | Marks a character as a lowercase letter |
Cased | | informative | binary | Denotes all characters that are uppercase, lowercase or titlecase letters |
Simple_Lowercase_Mapping | slc | normative | String | Corresponding lowercase letter (if it is a single character) |
Simple_Titlecase_Mapping | stc | normative | String | Corresponding titlecase letter (if it is a single character) |
Simple_Uppercase_Mapping | suc | normative | String | Corresponding uppercase letter (if it is a single character) |
Simple_Case_Folding | scf | normative | String | Corresponding case-folded letter (if it is a single character) |
Lowercase_Mapping | lc | informative | String | Corresponding lowercase mapping, which also covers more complex conversions |
Titlecase_Mapping | tc | informative | String | Corresponding titlecase mapping, which also covers more complex conversions |
Uppercase_Mapping | uc | informative | String | Corresponding uppercase mapping, which also covers more complex conversions |
Case_Folding | cf | normative | String | Corresponding case folding, which also covers more complex conversions |
Soft_Dotted | SD | normative | binary | i, j and similar characters whose dot is removed when they are capitalized or combined with diacritical marks |
Case_Ignorable | CI | informative | binary | Characters that are irrelevant to questions of upper and lower case |
Changes_When_Lowercased | CWL | informative | binary | Characters that change when converted to lowercase |
Changes_When_Titlecased | CWT | informative | binary | Characters that change when converted to titlecase |
Changes_When_Uppercased | CWU | informative | binary | Characters that change when converted to uppercase |
Changes_When_Casefolded | CWCF | informative | binary | Characters that change when converted to the case-folded normal form |
Changes_When_Casemapped | CWCM | informative | binary | Characters that change under any case conversion |
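The difference between single-character mappings and the more complex ones shows up with characters such as ß. The sketch below relies on the behavior of Python's built-in string methods, which are assumed here to apply the full mappings; `caseless_equal` is a hypothetical helper name.

```python
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

# The full uppercase mapping and the case folding are two characters long.
full_upper = sharp_s.upper()
folded = sharp_s.casefold()

# U+01C6 (lowercase dz digraph) has a distinct titlecase form, U+01C5.
dz_title = "\u01c6".title()


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings regardless of case by comparing their case foldings."""
    return a.casefold() == b.casefold()


print(full_upper, folded, dz_title == "\u01c5")
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

Comparing the case-folded forms rather than the lowercased forms is what makes "Straße" and "STRASSE" compare equal.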
Numeric
The following properties deal with the numerical properties of characters, especially the number characters in Unicode.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Numeric_Value | nv | normative | numeric | Numeric value of the character |
Numeric_Type | nt | normative | enumerated | Type (decimal, digit, numeric) |
ASCII_Hex_Digit | AHex | normative | binary | ASCII characters that are used for hexadecimal digits, that is, 0 to 9, a to f and A to F |
Hex_Digit | Hex | informative | binary | Characters used for hexadecimal digits, including their variants |
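The three numeric types can be approximated with the `decimal`, `digit` and `numeric` functions of Python's standard `unicodedata` module. This is a sketch under the assumption that those functions reflect the three types; the function names `numeric_type` and `is_ascii_hex_digit` are hypothetical.

```python
import string
import unicodedata


def numeric_type(ch: str) -> str:
    """Classify a character as Decimal, Digit, Numeric or None (as a string)."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"


def is_ascii_hex_digit(ch: str) -> bool:
    """True for 0 to 9, a to f and A to F only."""
    return len(ch) == 1 and ch in string.hexdigits


for ch in ["7", "\u0663", "\u00b2", "\u00bd", "\u2162", "A"]:
    print(repr(ch), numeric_type(ch), unicodedata.numeric(ch, None))
```

The fullwidth letter U+FF26 is an example of a hexadecimal digit variant that is not an ASCII hexadecimal digit.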
Normalization
A number of properties deal with the different types of normalization of Unicode texts.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Canonical_Combining_Class | ccc | normative | enumerated / numeric | Specifies which combining characters interact with each other and in which order they should be sorted |
Decomposition_Mapping | dm | normative | String | Indicates the decomposition of a character |
Decomposition_Type | dt | normative / informative | enumerated | Indicates the type of decomposition (canonical, changes the font, changes the line-breaking behavior, etc.) |
Composition_Exclusion | CE | normative | binary | Characters with a canonical decomposition that should not be used in the composed normal forms |
Full_Composition_Exclusion | Comp_Ex | normative | binary | Characters with a canonical decomposition that should not be used in the composed normal forms |
FC_NFKC_Closure | FC_NFKC | normative, deprecated | String | Corresponding case-folded form when the character is first converted to the case-folded normal form and then to NFKC |
NFC_Quick_Check | NFC_QC | normative | enumerated | Enables a quick test of whether a character string is in NFC |
NFKC_Quick_Check | NFKC_QC | normative | enumerated | Enables a quick test of whether a character string is in NFKC |
NFD_Quick_Check | NFD_QC | normative | enumerated | Enables a quick test of whether a character string is in NFD |
NFKD_Quick_Check | NFKD_QC | normative | enumerated | Enables a quick test of whether a character string is in NFKD |
Expands_On_NFC | XO_NFC | normative, deprecated | binary | Characters that become multiple characters when converted to NFC |
Expands_On_NFD | XO_NFD | normative, deprecated | binary | Characters that become multiple characters when converted to NFD |
Expands_On_NFKC | XO_NFKC | normative, deprecated | binary | Characters that become multiple characters when converted to NFKC |
Expands_On_NFKD | XO_NFKD | normative, deprecated | binary | Characters that become multiple characters when converted to NFKD |
NFKC_Casefold | NFKC_CF | informative | String | The character after conversion to NFKC and then to the case-folded normal form |
Changes_When_NFKC_Casefolded | CWKCF | informative | binary | Characters that change when they are first converted to NFKC and then to the case-folded normal form |
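Several of these properties can be seen at work through `unicodedata.normalize` from Python's standard library. The sketch below is illustrative only; the choice of U+0958 as an example of a character excluded from composition is an assumption of this example.

```python
import unicodedata

composed = "\u00c4"  # Ä as a single code point

# Canonical decomposition (NFD) yields A followed by U+0308.
decomposed = unicodedata.normalize("NFD", composed)
# NFC recombines the pair into the single code point again.
recomposed = unicodedata.normalize("NFC", decomposed)

# The combining diaeresis has a non-zero Canonical_Combining_Class.
ccc_diaeresis = unicodedata.combining("\u0308")

# The ligature U+FB01 only has a compatibility decomposition:
# NFC leaves it alone, NFKC replaces it by "fi".
ligature = "\ufb01"
ligature_nfc = unicodedata.normalize("NFC", ligature)
ligature_nfkc = unicodedata.normalize("NFKC", ligature)

# U+0958 is assumed to be excluded from composition, so even NFC
# returns its decomposed form.
qa_nfc = unicodedata.normalize("NFC", "\u0958")

print([hex(ord(c)) for c in decomposed], ccc_diaeresis)
print(ligature_nfkc, [hex(ord(c)) for c in qa_nfc])
```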
Presentation
The following properties play a role in the appearance of text.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Joining_Group | jg | normative | enumerated | Determines how or whether a letter connects with its neighbors, see Arabic in Unicode |
Joining_Type | jt | normative | enumerated | Determines how or whether a letter connects with its neighbors, see Arabic in Unicode |
Join_Control | Join_C | normative | binary | Control characters for ligatures and letter connections |
Line_Break | lb | normative | enumerated | Determines the line-breaking behavior for the Unicode line breaking algorithm |
Grapheme_Cluster_Break | GCB | informative | enumerated | Used in the segmentation algorithms to determine the boundaries of graphemes |
Sentence_Break | SB | informative | enumerated | Used in the segmentation algorithms to determine the boundaries of sentences |
Word_Break | WB | informative | enumerated | Used in the segmentation algorithms to determine the boundaries of words |
East_Asian_Width | ea | informative | enumerated | Indicates the width of a character, which plays a role in the display of East Asian texts |
Prepended_Concatenation_Mark | PCM | informative | binary | Characters that span the following characters, such as the Syriac abbreviation mark |
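One practical use of East_Asian_Width is estimating how many terminal columns a string occupies. The following is a deliberately simplified heuristic, not a rule from the standard: it assumes wide and fullwidth characters take two columns, everything else one, and combining characters none. `display_columns` is a hypothetical helper.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Rough column count for monospaced output (simplified heuristic)."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining characters are drawn on the preceding character.
            continue
        width = unicodedata.east_asian_width(ch)
        # "W" (wide) and "F" (fullwidth) are counted as two columns;
        # ambiguous-width characters are treated as narrow here.
        columns += 2 if width in ("W", "F") else 1
    return columns


for sample in ["A", "\uff21", "\u4e2d", "\uff71", "e\u0301"]:
    widths = [unicodedata.east_asian_width(c) for c in sample]
    print(repr(sample), widths, display_columns(sample))
```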
Bidi
The following properties are available for displaying bidirectional text.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Bidi_Class | bc | normative | enumerated | Determines the writing direction in the Unicode bidirectional algorithm |
Bidi_Control | Bidi_C | normative | binary | Bidirectional control characters |
Bidi_Mirrored | Bidi_M | normative | binary | Indicates whether a character must be displayed mirrored in right-to-left text |
Bidi_Mirroring_Glyph | bmg | informative | Others | Possible mirror image of the character, e.g. ( as the mirror image of ); in some cases no such character exists |
Bidi_Paired_Bracket | bpb | normative | Others | Counterpart of a bracket |
Bidi_Paired_Bracket_Type | bpt | normative | enumerated | Distinguishes opening and closing brackets |
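Bidi_Class and Bidi_Mirrored are available through Python's standard `unicodedata` module. The sketch below guesses the base direction of a text from its first strongly directional character; it is a simplified illustration and not a full implementation of the bidirectional algorithm. The function name and the "ltr"/"rtl" labels are choices made here.

```python
import unicodedata

# Strong Bidi_Class values mapped to a direction label (labels chosen here).
STRONG_CLASSES = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Return the direction of the first strongly directional character."""
    for ch in text:
        direction = STRONG_CLASSES.get(unicodedata.bidirectional(ch))
        if direction is not None:
            return direction
    return default


print(first_strong_direction("123 abc"))
print(first_strong_direction("123 \u05d0bc"))  # digits are weak, Hebrew alef is strong
print(unicodedata.mirrored("("), unicodedata.mirrored("A"))
```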
Identifier
The following properties are one way of defining the characters allowed in identifiers. In contrast to classic programming languages, which only allow ASCII characters, languages that use these properties allow most Unicode characters in identifiers. One example of a language whose syntax largely allows this range is JavaScript.
Property | Short | Status | Values | Description |
---|---|---|---|---|
ID_Start | IDS | informative | binary | Character that can be at the beginning of an identifier |
ID_Continue | IDC | informative | binary | Character that can appear in the subsequent positions of an identifier |
XID_Start | XIDS | informative | binary | Character that can be at the beginning of an identifier |
XID_Continue | XIDC | informative | binary | Character that can appear in the subsequent positions of an identifier |
Pattern_Syntax | Pat_Syn | normative | binary | Characters that can be used as syntax characters in patterns |
Pattern_White_Space | Pat_WS | normative | binary | Characters that should be treated as white space in patterns |
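Python is another language whose identifier syntax is, to my understanding, based on XID_Start and XID_Continue (plus the underscore), so its built-in `str.isidentifier` can serve as a quick illustration. `is_usable_identifier` and the candidate list are hypothetical names introduced for this sketch.

```python
import keyword


def is_usable_identifier(name: str) -> bool:
    """True if the name is syntactically an identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)


CANDIDATES = ["Ärger", "größe2", "2größe", "変数", "a-b", "class"]
results = {name: is_usable_identifier(name) for name in CANDIDATES}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

A digit may continue an identifier but not start one, which is why "größe2" is accepted and "2größe" is not.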
CJK
Some properties apply to CJK characters. There are also a number of other properties, see the Unihan section.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Ideographic | Ideo | informative | binary | CJK ideograph |
IDS_Binary_Operator | IDSB | normative | binary | Ideographic description character |
IDS_Trinary_Operator | IDST | normative | binary | Ideographic description character |
Unified_Ideograph | UIdeo | normative | binary | Chinese character that can be used in ideographic description sequences |
Radical | | normative | binary | Radical that can be used in ideographic description sequences |
Others
Some properties are mainly used to provide information about a character without being intended for special applications.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Math | | informative | binary | Mathematical characters in Unicode |
Quotation_Mark | QMark | informative | binary | Quotation marks |
Dash | | informative | binary | Horizontal lines of different lengths |
Hyphen | | informative, deprecated | binary | Hyphens and similar characters; originally used for line breaking and replaced there by the Line_Break property |
STerm | | informative | binary | Characters that mark the end of a sentence |
Terminal_Punctuation | Term | informative | binary | Punctuation marks that usually mark the end of a sentence |
Diacritic | Dia | informative | binary | Diacritical marks |
Extender | Ext | informative | binary | Characters that extend the preceding letter, such as length marks |
Grapheme_Base | Gr_Base | normative | binary | Older property for determining graphemes; see Grapheme_Cluster_Break in the Presentation section for the newer method |
Grapheme_Extend | Gr_Ext | normative | binary | Older property for determining graphemes; see Grapheme_Cluster_Break in the Presentation section for the newer method |
Grapheme_Link | Gr_Link | informative, deprecated | binary | Older property for determining graphemes; can be derived from the Canonical_Combining_Class property |
Unicode_1_Name | na1 | informative | Others | Old name in Unicode version 1.0 |
ISO_Comment | isc | informative, deprecated | Others | Originally used for comments in the ISO 10646 name list, now empty |
Indic_Matra_Category | | provisional | enumerated | Determines the placement of dependent vowels in Indic scripts |
Indic_Syllabic_Category | | provisional | enumerated | Determines the categories of syllable-forming components in Indic scripts |
Contributing Properties
These properties are not used on their own but serve to derive other properties. In most cases they are exception sets that are not covered by the general category.
Property | Short | Status | Values | Description |
---|---|---|---|---|
Other_Alphabetic | OAlpha | contributing | binary | For Alphabetic |
Other_Default_Ignorable_Code_Point | ODI | contributing | binary | For Default_Ignorable_Code_Point |
Other_Grapheme_Extend | OGr_Ext | contributing | binary | For Grapheme_Extend |
Other_ID_Start | OIDS | contributing | binary | For backward compatibility of ID_Start |
Other_ID_Continue | OIDC | contributing | binary | For backward compatibility of ID_Continue |
Other_Lowercase | OLower | contributing | binary | For Lowercase |
Other_Math | OMath | contributing | binary | For Math |
Other_Uppercase | OUpper | contributing | binary | For Uppercase |
Jamo_Short_Name | JSN | contributing | Others | For the Name of Korean syllable blocks |
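How a contributing property feeds into a derived one can be sketched for Alphabetic. The sketch assumes that Alphabetic is roughly the union of the letter categories, letter numbers and Other_Alphabetic; this is my reading of the derivation and leaves out the contributions of Other_Lowercase and Other_Uppercase. The set `SAMPLE_OTHER_ALPHABETIC` is a one-element illustrative sample, not the real data, and `is_alphabetic` is a hypothetical helper.

```python
import unicodedata

# General categories assumed to be alphabetic by themselves.
ALPHABETIC_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})

# Illustrative sample of Other_Alphabetic: U+0345 COMBINING GREEK YPOGEGRAMMENI,
# a nonspacing mark that the general category alone would not classify as alphabetic.
SAMPLE_OTHER_ALPHABETIC = frozenset({0x0345})


def is_alphabetic(ch: str, other_alphabetic=SAMPLE_OTHER_ALPHABETIC) -> bool:
    """Derive a simplified Alphabetic value from the category plus an exception set."""
    if unicodedata.category(ch) in ALPHABETIC_CATEGORIES:
        return True
    return ord(ch) in other_alphabetic


for ch in ["A", "\u2162", "\u0345", "\u0301"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_alphabetic(ch))
```

Without the exception set, U+0345 would be classified like any other nonspacing mark, which is the gap that the contributing property fills.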
Unihan
For CJK characters, which were included in Unicode as part of Han unification, there is a separate database that provides properties specifically for these characters. The source properties give the character's encoding in various national character sets. In addition to the properties listed here, there are a number of other provisional properties that provide further information on pronunciation, meaning, alternative encodings, etc.
Property | Status | Values | Description |
---|---|---|---|
kAccountingNumeric | informative | numeric | Numeric value of forgery-resistant number characters |
kOtherNumeric | informative | numeric | Numeric value of a character that is rarely used as a number character |
kPrimaryNumeric | informative | numeric | Numeric value of an ordinary number character |
kCompatibilityVariant | normative | String | Normalization of the character if it is a compatibility variant |
kIICore | normative | Others | Character that should be present on all systems |
kIRG_GSource | normative | Others | Source: China / Singapore |
kIRG_HSource | normative | Others | Source: Hong Kong |
kIRG_JSource | normative | Others | Source: Japan |
kIRG_KPSource | normative | Others | Source: North Korea |
kIRG_KSource | normative | Others | Source: South Korea |
kIRG_MSource | normative | Others | Source: Macau |
kIRG_TSource | normative | Others | Source: Taiwan |
kIRG_USource | normative | Others | Source: USA |
kIRG_VSource | normative | Others | Source: Vietnam |
kRSUnicode | informative | Others | Radical and number of additional strokes |
kMandarin | informative | Others | Pinyin reading |
kTotalStrokes | informative | Others | Number of strokes including the radical |
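The Unihan data can be read with a few lines of code. The sketch below assumes a tab-separated layout of the form code point, property name, value, with comment lines starting with `#`; that layout, the sample values for U+4E09, and the `radical.strokes` format of kRSUnicode are assumptions made for illustration. Both function names are hypothetical.

```python
def parse_unihan_lines(lines):
    """Collect {code point: {property: value}} from tab-separated lines."""
    properties = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, prop, value = line.split("\t", 2)
        if not code.startswith("U+"):
            raise ValueError(f"unexpected code point field: {code!r}")
        properties.setdefault(int(code[2:], 16), {})[prop] = value
    return properties


def split_radical_strokes(value: str):
    """Split an assumed 'radical.strokes' value into two integers."""
    radical, strokes = value.split(".")
    # Apostrophes after the radical number (variant radical forms) are dropped.
    return int(radical.rstrip("'")), int(strokes)


# Illustrative excerpt in the assumed layout, not copied from the real files.
SAMPLE_LINES = [
    "# illustrative excerpt",
    "U+4E09\tkPrimaryNumeric\t3",
    "U+4E09\tkTotalStrokes\t3",
    "U+4E09\tkRSUnicode\t1.2",
]

unihan = parse_unihan_lines(SAMPLE_LINES)
radical, extra_strokes = split_radical_strokes(unihan[0x4E09]["kRSUnicode"])
print(chr(0x4E09), unihan[0x4E09], radical, extra_strokes)
```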
Sources
- Mark Davis, Ken Whistler: Unicode Standard Annex #44: Unicode Character Database. (online)
- John H. Jenkins, Richard Cook, Ken Lunde: Unicode Standard Annex #38: Unicode Han Database. (online)
- Ken Whistler, Asmus Freytag: Unicode Technical Report #23: The Unicode Character Property Model. (online)
- Eric Muller: Unicode Standard Annex #42: Unicode Character Database in XML. (online)
References
- perlretut: More on characters, strings, and character classes. Perl documentation at perldoc.perl.org
- Addison Phillips: Unicode Standard Annex #34: Unicode Named Character Sequences. (online)
- ECMAScript Language Specification, 5.1 Edition, 7.6 Identifier Names and Identifiers
Web links
- Unicode Character Database
- Overview of all properties
- Unicode browser of the ICU project (English)
- Graphemica , overview of all properties of a character
- Codepoints , overview of all properties of a character, including search