Extended Unix Code

from Wikipedia, the free encyclopedia

Extended Unix Code (abbreviated EUC) is an 8-bit character encoding used mainly for Chinese, Japanese, and Korean. EUC is a collective term for several encodings, each of which can encode up to four different character sets depending on the country. It was originally developed by the Open Software Foundation (OSF), Unix International (UI), and Unix System Laboratories Pacific (USLP) as the standard encoding for UNIX systems. Today it is used less and less, because it has often been replaced by more common local encodings (Shift-JIS, Big5, etc.) and/or by Unicode (UTF-8).

Common features

All EUC encodings share the following properties:

  • They support up to four different character sets, called code sets in EUC terminology. Code set 0 is always 7-bit ASCII; code sets 1–3 differ from one EUC variant to another.
  • Code set 0 is always encoded directly as a single byte.
  • Two special prefix bytes (single-shift characters) switch to code set 2 or code set 3 for the character that follows: SS2 (0x8e) and SS3 (0x8f).
  • The non-ASCII range 0xa0–0xff is used for the bytes of multi-byte characters.

Code sets 1 to 3 can be encoded in several ways, depending on the EUC variant. The following byte sequences are possible:

Code set    Variant 1                  Variant 2                             Variant 3
Code set 0  1 byte: 0x21–0x7e
Code set 1  1 byte: 0xa0–0xff          2 bytes: 0xa0–0xff, 0xa0–0xff         3 bytes: 0xa0–0xff, 0xa0–0xff, 0xa0–0xff
Code set 2  2 bytes: 0x8e, 0xa0–0xff   3 bytes: 0x8e, 0xa0–0xff, 0xa0–0xff   4 bytes: 0x8e, 0xa0–0xff, 0xa0–0xff, 0xa0–0xff
Code set 3  2 bytes: 0x8f, 0xa0–0xff   3 bytes: 0x8f, 0xa0–0xff, 0xa0–0xff   4 bytes: 0x8f, 0xa0–0xff, 0xa0–0xff, 0xa0–0xff
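
Because the first byte of every sequence identifies its code set, an EUC byte string can be split into characters without any lookup tables. The following Python sketch illustrates this under stated assumptions: the function name split_euc and its widths parameter are hypothetical, and any byte below 0x80 (including space and control characters) is treated as code set 0. It only separates the byte sequences; it does not map them to characters.

```python
SS2 = 0x8E  # prefix byte that selects code set 2
SS3 = 0x8F  # prefix byte that selects code set 3


def split_euc(data: bytes, widths: dict[int, int]) -> list[tuple[int, bytes]]:
    """Split an EUC byte string into (code set, raw bytes) pairs.

    `widths` (hypothetical parameter) gives, for code sets 1-3, the number
    of bytes from 0xa0-0xff per character, not counting an SS2/SS3 prefix.
    """
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # code set 0: a single ASCII byte
            code_set, length = 0, 1
        elif lead == SS2:               # code set 2: prefix + payload
            code_set, length = 2, 1 + widths[2]
        elif lead == SS3:               # code set 3: prefix + payload
            code_set, length = 3, 1 + widths[3]
        elif lead >= 0xA0:              # code set 1: payload only
            code_set, length = 1, widths[1]
        else:
            raise ValueError(f"unexpected byte 0x{lead:02x} at offset {i}")

        chunk = data[i:i + length]
        if len(chunk) < length:
            raise ValueError(f"truncated sequence at offset {i}")
        payload = chunk[1:] if code_set in (2, 3) else chunk
        if code_set != 0 and any(b < 0xA0 for b in payload):
            raise ValueError(f"invalid trail byte at offset {i}")

        result.append((code_set, chunk))
        i += length
    return result


# Widths matching the EUC-JP layout: 2, 1 and 2 payload bytes.
EUC_JP_WIDTHS = {1: 2, 2: 1, 3: 2}
parts = split_euc(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1", EUC_JP_WIDTHS)
print([(cs, raw.hex()) for cs, raw in parts])
# [(0, '41'), (1, 'a4a2'), (2, '8eb1'), (3, '8fb0a1')]
```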

EUC-JP

EUC-JP is the variant used in Japan.

Code set 0 is ASCII (more precisely, JIS-Roman) and is encoded directly as one byte in the range 0x21 to 0x7e.

Code set 1 is JIS X 0208:1997 and is encoded in two bytes (variant 2 in the table above).

Code set 2 is half-width katakana, which is also encoded in two bytes (variant 1 in the table). The second byte comes only from the range 0xa1 to 0xdf, since there are only 56 katakana (and a handful of special characters). These values correspond to the 1-byte encoding of JIS X 0201:1997, with the byte 0x8e added as a prefix.

Code set 3 is JIS X 0212:1990 and is encoded in three bytes (variant 2 in the table).
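
The layout described above can be observed with the euc_jp codec from Python's standard library. The sketch below is illustrative only: the helper name euc_jp_code_set is hypothetical, and it infers the code set purely from the first encoded byte.

```python
def euc_jp_code_set(ch: str) -> tuple[int, bytes]:
    """Encode one character as EUC-JP and report which code set it uses."""
    raw = ch.encode("euc_jp")
    lead = raw[0]
    if lead == 0x8E:        # SS2 prefix: half-width katakana
        return 2, raw
    if lead == 0x8F:        # SS3 prefix: JIS X 0212
        return 3, raw
    if lead >= 0xA1:        # two high bytes: JIS X 0208
        return 1, raw
    return 0, raw           # a single ASCII byte


# "A", hiragana A (U+3042), half-width katakana A (U+FF71)
for ch in ["A", "\u3042", "\uff71"]:
    code_set, raw = euc_jp_code_set(ch)
    print(f"U+{ord(ch):04X}: code set {code_set}, bytes {raw.hex(' ')}")
# U+0041: code set 0, bytes 41
# U+3042: code set 1, bytes a4 a2
# U+FF71: code set 2, bytes 8e b1
```

The second byte of the half-width katakana, 0xb1, lies in the range 0xa1 to 0xdf stated above.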

EUC-KR

EUC-KR is the EUC variant used in Korea. It is similar to ISO-2022-KR and, like it, is based on the KS X 1001 character set.

EUC-CN

EUC-CN is used in China and is equivalent to GB2312. It encodes simplified Chinese characters.
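
The equivalence with GB2312 can be checked in Python, whose standard library accepts euc_cn as an alias for its gb2312 codec. This is a minimal sketch; the helper name encode_both is hypothetical.

```python
def encode_both(text: str) -> tuple[bytes, bytes]:
    """Encode text under both codec names so the results can be compared."""
    return text.encode("euc_cn"), text.encode("gb2312")


euc, gb = encode_both("\u4e2d\u6587")  # the two characters of "Chinese (language)"
print(euc == gb)     # True
print(euc.hex(" "))  # d6 d0 ce c4
```

Each character occupies two bytes, both in the range 0xa0–0xff.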

EUC-TW

EUC-TW was originally developed for Taiwan but is rarely used; Big5 is much more common there. Both encode traditional Chinese characters.