List of Unicode properties

The Unicode standard not only encodes a very large number of characters, but also defines a number of properties for each of them that describe the character and its behavior. For example, the properties of the letter Ä show that it is an uppercase letter, that the corresponding lowercase letter is ä, and that it can be decomposed into an A followed by a combining diaeresis (trema).
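As a minimal sketch, the three facts about Ä mentioned above can be read with Python's standard `unicodedata` module. The variable names are chosen for this example only, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

ch = "Ä"  # U+00C4

name = unicodedata.name(ch)                 # Name property
category = unicodedata.category(ch)         # General_Category, given as its short alias
lower = ch.lower()                          # lowercase mapping
decomp = unicodedata.decomposition(ch)      # decomposition as hexadecimal code points

print(name, category, lower, decomp)
```

The decomposition is returned as a string of hexadecimal code points, here the letter A followed by the combining diaeresis.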

General

Formally, Unicode properties are defined as mappings from code points to a certain range of values. The data is made available in several plain text files and as an XML file.
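The text files are line-oriented. As a sketch, assuming the semicolon-separated layout of `UnicodeData.txt`, a single record can be split into named fields as follows. The sample line is embedded here so the example does not need the file, and the field labels in `FIELDS` are descriptive labels chosen for this example, not official column headers.

```python
# One record in the layout of UnicodeData.txt (15 semicolon-separated fields).
SAMPLE_LINE = (
    "00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;"
    "LATIN CAPITAL LETTER A DIAERESIS;;;00E4;"
)

# Descriptive labels for the fields (assumed names for this sketch).
FIELDS = [
    "code", "Name", "General_Category", "Canonical_Combining_Class",
    "Bidi_Class", "Decomposition", "Decimal_Value", "Digit_Value",
    "Numeric_Value", "Bidi_Mirrored", "Unicode_1_Name", "ISO_Comment",
    "Simple_Uppercase_Mapping", "Simple_Lowercase_Mapping",
    "Simple_Titlecase_Mapping",
]

def parse_line(line):
    """Split one record into a dict; empty fields stay empty strings."""
    values = line.rstrip("\n").split(";")
    return dict(zip(FIELDS, values))

record = parse_line(SAMPLE_LINE)
print(record["General_Category"], record["Simple_Lowercase_Mapping"])
```

Empty fields mean that the default value applies, which is why most of the record is blank.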

Values

Different value ranges are possible depending on the property. Most properties are enumerating properties, whose range of values consists of a fixed set. Two special cases are distinguished: catalog properties and binary properties. Catalog properties are characterized by the fact that the number of possible values grows gradually with new Unicode versions. Binary properties are enumerating properties with exactly two values, true (`Y`) and false (`N`); they indicate whether or not the property applies to a character.

There are also string properties, which assign each character a string of Unicode characters, numeric properties, which assign each character a number, and other properties that cannot be assigned to any of these categories.
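The value types can be illustrated with Python's `unicodedata` module, which returns a differently typed result for each kind of property. This is only a sketch: the variable names are invented for the example, and the module exposes just a subset of the properties.

```python
import unicodedata

ch = "\u2162"  # ROMAN NUMERAL THREE

enumerated = unicodedata.category(ch)       # one value from a fixed set
binary = bool(unicodedata.mirrored("("))    # applies or does not apply
string_value = ch.lower()                   # a string of characters
numeric_value = unicodedata.numeric(ch)     # a number

print(enumerated, binary, string_value, numeric_value)
```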

Default values

Properties have one or more default values for several reasons. On the one hand, the default value is often omitted from the data tables to keep them compact. On the other hand, programs must be able to deal with text that was created according to a newer Unicode version and may therefore contain characters that were not yet assigned when the program was developed. For enumerating properties, one value is usually defined as the default; in a few cases there are several default values, which are assigned depending on the block. For binary properties, the default value is always N, i.e. not applicable.

With string properties, the default value is always the character itself.
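A short sketch of how defaults show up in practice, using Python's `unicodedata` module. It assumes that U+0378 is unassigned in the Unicode version bundled with the interpreter; the fallback text `"<no name>"` is an arbitrary value supplied by the caller.

```python
import unicodedata

unassigned = "\u0378"  # assumed to be unassigned in the bundled Unicode version

category = unicodedata.category(unassigned)        # default General_Category
mirrored = unicodedata.mirrored(unassigned)        # binary default: not applicable
name = unicodedata.name(unassigned, "<no name>")   # no name, so the fallback is returned
lowered = "1".lower()                              # string default: the character itself

print(category, mirrored, name, lowered)
```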

Aliases

Many properties have one or more aliases in addition to their actual name. Often these are abbreviations. Short aliases are also often specified for the possible values of enumerating properties.
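As an illustration of value aliases, Python's `unicodedata.category` returns the short alias of the `General_Category` value. The sketch below maps a few of them to their long names; the dictionary `GC_LONG_NAMES` and the helper `long_category` are hypothetical names, and the mapping is deliberately incomplete.

```python
import unicodedata

# Partial mapping from short value aliases to long value names.
GC_LONG_NAMES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Nd": "Decimal_Number",
    "Zs": "Space_Separator",
    "Po": "Other_Punctuation",
}

def long_category(ch):
    """Return the long value name if known, otherwise the short alias."""
    short = unicodedata.category(ch)
    return GC_LONG_NAMES.get(short, short)

print(long_category("A"), long_category("7"), long_category("+"))
```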

Status

Many properties are normative, i.e. binding for programs that conform to the Unicode standard and interpret the property. Other properties are marked as informative and serve only as additional, non-binding information. A group of properties is marked as contributing. These properties should not be used on their own, but have been defined so that other properties can be derived from them. They usually identify a set of exceptional characters that would otherwise not be covered. Finally, there are provisional properties, which were initially included with reservations to see whether they would prove themselves in practice.

Some properties are also marked as deprecated. These should no longer be used, for various reasons, but remain in the Unicode standard for backward compatibility.

Stability

To ensure backward compatibility, some properties, once they have been set for a character, are never changed or are changed only in certain previously defined ways. For example, the standard stipulates that the name of a character will never be changed, even if it turns out to be incorrect.
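A well-known case is U+FE18, whose formal name contains a spelling mistake that cannot be corrected. The sketch below assumes a Python version whose `unicodedata.lookup` resolves name aliases (documented since Python 3.3); the variable names are chosen for the example.

```python
import unicodedata

ch = "\uFE18"

# The formal name keeps its misspelling ("BRAKCET") for stability reasons.
formal_name = unicodedata.name(ch)

# The corrected spelling exists only as a name alias.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
same_char = unicodedata.lookup(corrected) == ch

print(formal_name, same_char)
```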

Properties

The following lists show all Unicode properties, grouped as in the official documentation, as of Unicode 6.3. For each property, the name, an abbreviated alias (if available), the status, the type of value range and a description are given.

General

The general properties give a rough overview of the character. They are used, among other things, in regular expressions, provided the engine supports querying Unicode properties, as Perl does.

property	Short	status	values	description
`Name`	`na`	normative	Others	Name of the character ^*
`Name_Alias`		normative	Others	Aliases, mainly used for control characters, for which the `Name` property remains empty
`Block`	`blk`	normative	Catalog	Unicode block in which the character is located
`Age`	`age`	normative / informative	Catalog	Version in which the character was first encoded
`General_Category`	`gc`	normative	enumerating	rough breakdown of all characters, see separate section
`Script`	`sc`	informative	Catalog	The character's writing system, e.g. Latin, Greek, Cyrillic; `Common` for characters that are used in several writing systems
`Script_Extensions`		informative	Others	Writing systems for characters that are used in several systems
`White_Space`	`WSpace`	normative	binary	marks a character as white space
`Alphabetic`	`Alpha`	informative	binary	Characters from alphabets
`Hangul_Syllable_Type`	`hst`	normative	enumerating	Determination of the syllable blocks in Korean
`Noncharacter_Code_Point`	`NChar`	normative	binary	Code points permanently reserved for internal use (noncharacters)
`Default_Ignorable_Code_Point`	`DI`	normative	binary	Characters that should be ignored in the display if the program does not support them
`Deprecated`	`Dep`	normative	binary	deprecated characters that should no longer be used
`Logical_Order_Exception`	`LOE`	normative	binary	Characters that must be swapped with the following characters before the Unicode Collation Algorithm can be used
`Variation_Selector`	`VS`	normative	binary	Variant selectors that choose between different display variants of the previous character

^* In addition to individual characters, some character strings also have their own name.
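Where a regular expression engine cannot query Unicode properties (Python's standard `re` module, for example, has no `\p{...}` syntax), a property query can be approximated by filtering on the property value directly. The following is a sketch only; the helper name `find_by_category` is invented for this example.

```python
import unicodedata

def find_by_category(text, prefix):
    """Return the characters whose General_Category starts with `prefix`.

    A one-letter prefix such as "L" selects a whole major class,
    a two-letter value such as "Lu" selects one category.
    """
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]

sample = "Ärger in Δ-Sektor 7!"
uppercase = find_by_category(sample, "Lu")
numbers = find_by_category(sample, "N")
punctuation = find_by_category(sample, "P")

print(uppercase, numbers, punctuation)
```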

General category

The property `General_Category` is one of the basic properties, used both in the Unicode standard itself and in many other technical documents. It divides all characters into letters, numbers, punctuation and others according to their main use. The following table lists the possible values.

category	code	meaning	Examples
Letter	`L`
Uppercase letter	`Lu`	Uppercase letter	A, Ä, Δ, Ǆ
Lowercase letter	`Ll`	Lowercase letter	a, ä, δ, ǆ
Titlecase letter	`Lt`	Characters in titlecase form. These are just a few characters that encode a digraph	ǅ
Modifier letter	`Lm`	Letters that modify the preceding letter	Letters from the Unicode block Spacing Modifier Letters
Other letter	`Lo`	Letters from scripts that do not distinguish case (e.g. Hebrew), CJK and others	ב, 丌
Mark	`M`
Nonspacing	`Mn`	Combining character that is placed above or below the preceding character without taking up space of its own	Combining diacritical marks
Spacing	`Mc`	Combining character that itself takes up space	Indic vowel signs
Enclosing	`Me`	Combining character that completely surrounds the preceding character	Combining enclosing circle
Number	`N`
Decimal digit	`Nd`	Digits	0, 1
Letter number	`Nl`	Letters that are used as numbers	Ⅲ
Other number	`No`	Other numbers, such as superscript, circled or fraction characters	², ½, ②
Punctuation	`P`
Connector	`Pc`	Characters that join two parts into one word	Underscore
Dash	`Pd`	Various dashes: hyphen, en dash, etc.	- (hyphen-minus), U+2013 (en dash), U+2014 (em dash)
Opening	`Ps`	Opening brackets	(, [, {
Closing	`Pe`	Closing brackets	), ], }
Initial quotation mark	`Pi`	Opening quotation marks (can also be used as closing quotation marks depending on the language)	«
Final quotation mark	`Pf`	Closing quotation marks (can also be used as opening quotation marks depending on the language)	»
Other punctuation	`Po`	Punctuation marks that do not fall into any of the above categories	! . , : ; ? §
Symbol	`S`
Math symbol	`Sm`	Symbols used in mathematical contexts	+, <, >, ±
Currency symbol	`Sc`	Symbols that denote a currency	$, €
Modifier symbol	`Sk`	Symbols that modify the preceding character	Symbols from the Unicode block Spacing Modifier Letters
Other symbol	`So`	Symbols that do not fall into any of the above categories	⛔, ©
Separator	`Z`
Space separator	`Zs`	Spaces of different widths	Space, no-break space
Line separator	`Zl`		Line separator (U+2028)
Paragraph separator	`Zp`		Paragraph separator (U+2029)
Other	`C`
Control	`Cc`	General control characters	BEL
Format	`Cf`	Control characters for formatting	Soft hyphen, bidirectional control characters
Surrogate	`Cs`	Surrogates
Private use	`Co`	Characters for private use	U+F8FF
Unassigned	`Cn`	Code points to which no character has (yet) been assigned
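Because the first letter of each two-letter code identifies the major class, a text can be summarized by major class with very little code. The sketch below uses Python's `unicodedata` module; the names `MAJOR_CLASSES` and `major_class_counts` are invented for this example.

```python
import unicodedata
from collections import Counter

# First letter of the General_Category code -> major class from the table above.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def major_class_counts(text):
    """Count the characters of `text` per major General_Category class."""
    return Counter(MAJOR_CLASSES[unicodedata.category(ch)[0]] for ch in text)

counts = major_class_counts("Preis: 5 € (½ kg)")
print(dict(counts))
```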

Upper / lower case

Many properties deal with letter case. They determine whether a character is an uppercase or lowercase letter, which lowercase letter corresponds to a given uppercase letter and vice versa, and more. To compare character strings regardless of case, a normal form called case folding is defined. These properties are used, among other things, by the various Unicode casing algorithms.

property	Short	status	values	description
`Uppercase`	`Upper`	informative	binary	marks a character as an uppercase letter
`Lowercase`	`Lower`	informative	binary	marks a character as a lowercase letter
`Cased`		informative	binary	marks all characters that are uppercase, lowercase or titlecase letters
`Simple_Lowercase_Mapping`	`slc`	normative	String	corresponding lowercase letter (if it is a single character)
`Simple_Titlecase_Mapping`	`stc`	normative	String	corresponding titlecase letter (if it is a single character)
`Simple_Uppercase_Mapping`	`suc`	normative	String	corresponding uppercase letter (if it is a single character)
`Simple_Case_Folding`	`scf`	normative	String	corresponding case-folded letter (if it is a single character)
`Lowercase_Mapping`	`lc`	informative	String	full lowercase mapping, which also covers more complex conversions
`Titlecase_Mapping`	`tc`	informative	String	full titlecase mapping, which also covers more complex conversions
`Uppercase_Mapping`	`uc`	informative	String	full uppercase mapping, which also covers more complex conversions
`Case_Folding`	`cf`	normative	String	full case folding, which also covers more complex conversions
`Soft_Dotted`	`SD`	normative	binary	`i`, `j` and similar characters whose dot disappears when a diacritical mark is placed above them
`Case_Ignorable`	`CI`	informative	binary	Characters that are ignored when determining the case context
`Changes_When_Lowercased`	`CWL`	informative	binary	Characters that change when converted to lowercase
`Changes_When_Titlecased`	`CWT`	informative	binary	Characters that change when converted to titlecase
`Changes_When_Uppercased`	`CWU`	informative	binary	Characters that change when converted to uppercase
`Changes_When_Casefolded`	`CWCF`	informative	binary	Characters that change when converted to case-folded form
`Changes_When_Casemapped`	`CWCM`	informative	binary	Characters that change with any case conversion
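The difference between simple and full mappings, and the use of case folding for comparison, can be sketched with Python's built-in string methods, which apply the full mappings. The helper `caseless_equal` is a hypothetical name and shows only the simplest form of caseless matching, without normalization.

```python
# Full uppercase mapping: one character becomes two.
upper_full = "ß".upper()

# Titlecase differs from uppercase for the digraph letters.
title_digraph = "\u01C6".title()   # lowercase dž digraph

# Case folding produces a form intended for comparison, not for display.
folded = "Straße".casefold()

def caseless_equal(a, b):
    """Compare two strings after case folding (no normalization applied)."""
    return a.casefold() == b.casefold()

print(upper_full, title_digraph, folded, caseless_equal("STRASSE", "straße"))
```

Note that `str.lower()` alone is not sufficient here: it leaves ß unchanged, so `"STRASSE".lower()` and `"straße".lower()` differ.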

Numeric

The following properties deal with the numeric properties of characters, especially the number characters in Unicode.

property	Short	status	values	description
`Numeric_Value`	`nv`	normative	numeric	numeric value of the character
`Numeric_Type`	`nt`	normative	enumerating	Type (decimal, digit, numeric)
`ASCII_Hex_Digit`	`AHex`	normative	binary	ASCII characters that are used for hexadecimal digits, that is, `0` to `9`, `a` to `f` and `A` to `F`
`Hex_Digit`	`Hex`	informative	binary	Characters used for hexadecimal digits, including their variants
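Python's `unicodedata` module offers three lookups, `decimal`, `digit` and `numeric`, which line up with the three numeric types. The sketch below uses them to approximate `Numeric_Type`; the function name `numeric_type` is invented, and the approximation relies on the module's behavior rather than on the property data itself.

```python
import unicodedata

def numeric_type(ch):
    """Approximate Numeric_Type from the narrowest lookup that succeeds."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"

half = unicodedata.numeric("½")          # Numeric_Value of VULGAR FRACTION ONE HALF
arabic_three = unicodedata.decimal("٣")  # ARABIC-INDIC DIGIT THREE

print(numeric_type("7"), numeric_type("²"), numeric_type("½"), numeric_type("A"))
```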

Normalization

A number of properties deal with the different types of normalization of Unicode texts.

property	Short	status	values	description
`Canonical_Combining_Class`	`ccc`	normative	enumerating / numeric	specifies which combining characters interact with each other and in which order they should be sorted
`Decomposition_Mapping`	`dm`	normative	String	indicates the decomposition of a character
`Decomposition_Type`	`dt`	normative / informative	enumerating	indicates the type of decomposition (canonical, changes the font / the break behavior / etc.)
`Composition_Exclusion`	`CE`	normative	binary	Characters with a canonical decomposition that should not be used in the composed normal forms
`Full_Composition_Exclusion`	`Comp_Ex`	normative	binary	Characters with a canonical decomposition that should not be used in the composed normal forms
`FC_NFKC_Closure`	`FC_NFKC`	normative, deprecated	String	associated form obtained when the character is first case folded and then converted to NFKC
`NFC_Quick_Check`	`NFC_QC`	normative	enumerating	enables a quick test to determine whether a character string is in NFC
`NFKC_Quick_Check`	`NFKC_QC`	normative	enumerating	enables a quick test to determine whether a character string is in NFKC
`NFD_Quick_Check`	`NFD_QC`	normative	enumerating	enables a quick test to determine whether a character string is in NFD
`NFKD_Quick_Check`	`NFKD_QC`	normative	enumerating	enables a quick test to determine whether a character string is in NFKD
`Expands_On_NFC`	`XO_NFC`	normative, deprecated	binary	Characters that become multiple characters when converted to NFC
`Expands_On_NFD`	`XO_NFD`	normative, deprecated	binary	Characters that become multiple characters when converted to NFD
`Expands_On_NFKC`	`XO_NFKC`	normative, deprecated	binary	Characters that become multiple characters when converted to NFKC
`Expands_On_NFKD`	`XO_NFKD`	normative, deprecated	binary	Characters that become multiple characters when converted to NFKD
`NFKC_Casefold`	`NFKC_CF`	informative	String	Result of converting the character to NFKC and then case folding it
`Changes_When_NFKC_Casefolded`	`CWKCF`	informative	binary	Characters that change when they are first converted to NFKC and then case folded
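The effect of these properties can be observed with Python's `unicodedata` module. This is a small sketch with invented variable names; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

composed = "\u00C4"                                    # Ä as a single code point
decomposed = unicodedata.normalize("NFD", composed)    # A followed by U+0308

# Canonical_Combining_Class of the combining diaeresis.
ccc = unicodedata.combining("\u0308")

# A compatibility decomposition: the "fi" ligature becomes two letters.
compat = unicodedata.normalize("NFKC", "\uFB01")

# Quick test whether a string is already in a given normal form.
is_nfc = unicodedata.is_normalized("NFC", decomposed)

print(len(decomposed), ccc, compat, is_nfc)
```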

Presentation

The following properties play a role in the appearance of text.

property	Short	status	values	description
`Joining_Group`	`jg`	normative	enumerating	determines how or whether a letter connects with its neighbors, see Arabic in Unicode
`Joining_Type`	`jt`	normative	enumerating	determines how or whether a letter connects with its neighbors, see Arabic in Unicode
`Join_Control`	`Join_C`	normative	binary	Control characters for ligatures and letter combinations
`Line_Break`	`lb`	normative	enumerating	determines the line breaking behavior for the Unicode line breaking algorithm
`Grapheme_Cluster_Break`	`GCB`	informative	enumerating	used in the segmentation algorithms to determine the boundaries of graphemes
`Sentence_Break`	`SB`	informative	enumerating	used in the segmentation algorithms to determine the boundaries of sentences
`Word_Break`	`WB`	informative	enumerating	used in the segmentation algorithms to determine the boundaries of words
`East_Asian_Width`	`ea`	informative	enumerating	indicates the width of a character, which plays a role in the representation of East Asian texts
`Prepended_Concatenation_Mark`	`PCM`	informative	binary	Characters that span the following characters, such as the Syriac abbreviation mark
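As an illustration of `East_Asian_Width`, the following sketch estimates how many terminal columns a string occupies. The function `display_width` is a hypothetical helper; it treats wide (`W`) and fullwidth (`F`) characters as two columns and everything else as one, and it deliberately ignores combining marks and ambiguous-width characters.

```python
import unicodedata

def display_width(text):
    """Rough column count: wide and fullwidth characters take two cells."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

print(display_width("abc"), display_width("中文"), display_width("\uFF21"))
```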

Bidi

The following properties are used for displaying bidirectional text.

property	Short	status	values	description
`Bidi_Class`	`bc`	normative	enumerating	determines the writing direction in the Unicode bidi algorithm
`Bidi_Control`	`Bidi_C`	normative	binary	Bidirectional control character
`Bidi_Mirrored`	`Bidi_M`	normative	binary	indicates whether a character must be displayed mirrored in right-to-left text
`Bidi_Mirroring_Glyph`	`bmg`	informative	Others	possible mirror image of the character, e.g. `(` as a mirror image for `)`; in some cases no such character exists
`Bidi_Paired_Bracket`	`bpb`	normative	Others	Counterpart of a bracket
`Bidi_Paired_Bracket_Type`	`bpt`	normative	enumerating	indicates opening and closing brackets
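Two of these properties, `Bidi_Class` and `Bidi_Mirrored`, are exposed by Python's `unicodedata` module. A minimal sketch, with the helper name `bidi_classes` chosen for this example:

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class short alias of every character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letter, Hebrew alef, Arabic alef, European digit, opening parenthesis.
classes = bidi_classes("a\u05D0\u06271(")

paren_mirrored = unicodedata.mirrored("(")    # 1: glyph is mirrored in right-to-left text
letter_mirrored = unicodedata.mirrored("a")   # 0: not mirrored

print(classes, paren_mirrored, letter_mirrored)
```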

Identifier

The following properties provide one way of defining the characters allowed in identifiers. In contrast to classic programming languages, which only allow ASCII characters, languages that use these properties allow most Unicode characters in identifiers. One example of a language whose syntax largely allows this range is JavaScript.

property	Short	status	values	description
`ID_Start`	`IDS`	informative	binary	Character that can be at the beginning of an identifier
`ID_Continue`	`IDC`	informative	binary	Character that can appear in the following positions in an identifier
`XID_Start`	`XIDS`	informative	binary	Character that can be at the beginning of an identifier (variant that is stable under NFKC normalization)
`XID_Continue`	`XIDC`	informative	binary	Character that can appear in the following positions in an identifier (variant that is stable under NFKC normalization)
`Pattern_Syntax`	`Pat_Syn`	normative	binary	Characters that can be used as syntax characters in patterns
`Pattern_White_Space`	`Pat_WS`	normative	binary	Characters that should be treated as white space
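Python is another language whose identifier syntax is based on `XID_Start` and `XID_Continue`, with the underscore additionally allowed as a first character. Its built-in `str.isidentifier()` can therefore serve as a quick check; the candidate strings below are arbitrary examples.

```python
candidates = ["naïve", "变量", "_tmp1", "2x", "a-b"]

# Keep only the strings that are syntactically valid identifiers.
valid = [name for name in candidates if name.isidentifier()]

print(valid)
```

`2x` is rejected because a digit may continue an identifier but not start one, and `a-b` because the hyphen is not allowed in identifiers at all.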

CJK

Some properties apply to CJK characters. There are also a number of other properties; see the Unihan section.

property	Short	status	values	description
`Ideographic`	`Ideo`	informative	binary	CJK character
`IDS_Binary_Operator`	`IDSB`	normative	binary	Ideographic description character
`IDS_Trinary_Operator`	`IDST`	normative	binary	Ideographic description character
`Unified_Ideograph`	`UIdeo`	normative	binary	Han character that can be used in ideographic description sequences
`Radical`		normative	binary	Radical that can be used in ideographic descriptive sequences

Others

Some properties are mainly used to provide information about a character without being intended for special applications.

property	Short	status	values	description
`Math`		informative	binary	Mathematical characters in Unicode
`Quotation_Mark`	`QMark`	informative	binary	quotation marks
`Dash`		informative	binary	horizontal lines of different lengths
`Hyphen`		informative, deprecated	binary	Hyphen and similar characters; originally used for line breaking and replaced there by the `Line_Break` property
`STerm`		informative	binary	Characters that mark the end of a sentence
`Terminal_Punctuation`	`Term`	informative	binary	Punctuation marks that usually mark the end of a sentence
`Diacritic`	`Dia`	informative	binary	Diacritical mark
`Extender`	`Ext`	informative	binary	Characters that extend the preceding letter, such as length characters
`Grapheme_Base`	`Gr_Base`	normative	binary	older property for determining graphemes; see `Grapheme_Cluster_Break` in the Presentation section for the newer method
`Grapheme_Extend`	`Gr_Ext`	normative	binary	older property for determining graphemes; see `Grapheme_Cluster_Break` in the Presentation section for the newer method
`Grapheme_Link`	`Gr_Link`	informative, deprecated	binary	older property for determining graphemes; can be derived from the `Canonical_Combining_Class` property
`Unicode_1_Name`	`na1`	informative	Others	old name in the Unicode version 1.0
`ISO_Comment`	`isc`	informative, deprecated	Others	originally used for comments in the ISO 10646 name list, now empty
`Indic_Matra_Category`		provisional	enumerating	determines the placement of dependent vowels in Indic scripts
`Indic_Syllabic_Category`		provisional	enumerating	determines the category of syllable-forming components in Indic scripts

Contributing Properties

These properties are not used on their own, but serve to derive other properties. Most of them are sets of exceptions that are not covered by the general category.

property	Short	status	values	description
`Other_Alphabetic`	`OAlpha`	contributing	binary	For `Alphabetic`
`Other_Default_Ignorable_Code_Point`	`ODI`	contributing	binary	For `Default_Ignorable_Code_Point`
`Other_Grapheme_Extend`	`OGr_Ext`	contributing	binary	For `Grapheme_Extend`
`Other_ID_Start`	`OIDS`	contributing	binary	for backward compatibility of `ID_Start`
`Other_ID_Continue`	`OIDC`	contributing	binary	for backward compatibility of `ID_Continue`
`Other_Lowercase`	`OLower`	contributing	binary	For `Lowercase`
`Other_Math`	`OMath`	contributing	binary	For `Math`
`Other_Uppercase`	`OUpper`	contributing	binary	For `Uppercase`
`Jamo_Short_Name`	`JSN`	contributing	Others	for the `Name` of Korean syllable blocks
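How a contributing property feeds into a derived one can be sketched for `Alphabetic`. The version below is a simplification: it combines the letter categories and letter numbers with a set of exceptions. `OTHER_ALPHABETIC_SAMPLE` is a tiny, hand-picked excerpt assumed for illustration, not the full `Other_Alphabetic` list, and `is_alphabetic` is a hypothetical helper.

```python
import unicodedata

# Tiny illustrative excerpt of Other_Alphabetic (assumed for this sketch):
# COMBINING GREEK YPOGEGRAMMENI and HEBREW POINT SHEVA, both category Mn.
OTHER_ALPHABETIC_SAMPLE = {"\u0345", "\u05B0"}

# Categories that count as alphabetic in this simplified derivation.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def is_alphabetic(ch):
    """Simplified derivation: general category plus the exception set."""
    return (
        unicodedata.category(ch) in ALPHABETIC_CATEGORIES
        or ch in OTHER_ALPHABETIC_SAMPLE
    )

print(is_alphabetic("A"), is_alphabetic("\u0345"), is_alphabetic("\u0301"))
```

The combining marks in the sample set would be missed by a test on the general category alone, which is exactly the gap a contributing property fills.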

Unihan

For CJK characters, which were included in Unicode as part of Han unification, there is a separate database that provides properties specifically for these characters. The source properties give the character's encoding in various national character sets. In addition to the properties listed here, there are a number of other provisional properties that provide further information on pronunciation, meaning, alternative encodings, etc.

property	status	values	description
`kAccountingNumeric`	informative	numeric	numeric value of a character used as a forgery-resistant numeral in accounting
`kOtherNumeric`	informative	numeric	numeric value of a character that is rarely used as a numeral
`kPrimaryNumeric`	informative	numeric	numeric value of an ordinary numeral
`kCompatibilityVariant`	normative	String	Normalization of the character if it is a compatibility variant
`kIICore`	normative	Others	Character that should be present on all systems
`kIRG_GSource`	normative	Others	Source: China / Singapore
`kIRG_HSource`	normative	Others	Source: Hong Kong
`kIRG_JSource`	normative	Others	Source: Japan
`kIRG_KPSource`	normative	Others	Source: North Korea
`kIRG_KSource`	normative	Others	Source: South Korea
`kIRG_MSource`	normative	Others	Source: Macau
`kIRG_TSource`	normative	Others	Source: Taiwan
`kIRG_USource`	normative	Others	Source: USA
`kIRG_VSource`	normative	Others	Source: Vietnam
`kRSUnicode`	informative	Others	Radical and number of further strokes
`kMandarin`	informative	Others	Pinyin reading
`kTotalStrokes`	informative	Others	Number of strokes including radical
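The Unihan data files store one property value per line as three tab-separated fields: code point, property name and value. The sketch below parses a few embedded lines in that layout. The sample values are assumed for illustration and should be checked against the actual database; `parse_unihan` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative lines in the Unihan file layout (values assumed for the example).
SAMPLE = """\
# comment lines start with a hash sign
U+4E00\tkTotalStrokes\t1
U+4E00\tkMandarin\tyī
U+4E09\tkPrimaryNumeric\t3
"""

def parse_unihan(text):
    """Build {character: {property: value}} from Unihan-style lines."""
    table = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code, prop, value = line.split("\t", 2)
        # "U+4E00" -> the character with that code point
        table[chr(int(code[2:], 16))][prop] = value
    return table

table = parse_unihan(SAMPLE)
print(table["一"], table["三"])
```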

Sources

Mark Davis, Ken Whistler: Unicode Standard Annex #44: Unicode Character Database. (online)
John H. Jenkins, Richard Cook, Ken Lunde: Unicode Standard Annex #38: Unicode Han Database. (online)
Ken Whistler, Asmus Freytag: Unicode Technical Report #23: The Unicode Character Property Model. (online)
Eric Muller: Unicode Standard Annex #42: Unicode Character Database in XML. (online)

References

perlretut: More on characters, strings, and character classes. Perl documentation at perldoc.perl.org
Addison Phillips: Unicode Standard Annex #34: Unicode Named Character Sequences. (online)
ECMAScript Language Specification, 5.1 Edition, 7.6 Identifier Names and Identifiers

Web links

Unicode Character Database
Overview of all properties
Unicode browser of the ICU project (English)
Graphemica, overview of all properties of a character
Codepoints, overview of all properties of a character, including search
