White space

from Wikipedia, the free encyclopedia

Whitespace (also written "white space"; English pronunciation /ˈwaɪtspeɪs/), or whitespace characters, is a term in computer science for characters in a text that a text editor or word processor normally displays only as empty space, but that nevertheless occupy storage. They are used primarily to separate words (spaces), to group digits, and to prevent or permit line breaks (non-breaking spaces and narrow spaces of various widths).

Which characters count as whitespace depends on the context: almost always at least spaces and tabs, and usually line breaks as well. Many programs can make these characters visible and distinguishable with stand-in formatting symbols (for example, a dedicated symbol for line breaks, · for spaces and/or > for tabs).
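As a rough illustration of such a "show whitespace" display, the following Python sketch replaces whitespace characters with visible markers. The function name `show_whitespace` is hypothetical, and the `¶` marker for line breaks is an assumed choice; only `·` and `>` come from the text above.

```python
# Marker symbols: "·" and ">" follow the text above; "¶" is an assumption.
MARKERS = {" ": "·", "\t": ">", "\n": "¶\n"}


def show_whitespace(text: str) -> str:
    """Return text with spaces, tabs and line feeds replaced by visible markers."""
    # Characters without a marker are passed through unchanged.
    return "".join(MARKERS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(show_whitespace("a b\tc\n"), end="")
```

The line feed is kept after its marker so that the marked-up text still breaks at the same places as the original.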

On the one hand, these characters play a special role in programming. In many programming languages they separate reserved words and identifiers such as variable names from one another. Some programming languages (such as Python) require the source code to be formatted in a particular way with whitespace characters (indentation of blocks).

On the other hand, depending on the syntax of the programming language, it often makes no difference whether one or several of these characters follow one another. For this reason, diff programs and the comparison functions of IDEs in particular offer an "ignore whitespace" option.
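A minimal sketch of one possible "ignore whitespace" comparison is shown below. It assumes that only the amount of whitespace is ignored, not its presence; real comparison tools offer several variants. The function names are hypothetical.

```python
def normalize(line: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    # str.split() without arguments splits on runs of whitespace and
    # discards leading and trailing whitespace.
    return " ".join(line.split())


def equal_ignoring_whitespace(a: str, b: str) -> bool:
    """Compare two texts line by line after normalizing whitespace."""
    lines_a = [normalize(line) for line in a.splitlines()]
    lines_b = [normalize(line) for line in b.splitlines()]
    return lines_a == lines_b


if __name__ == "__main__":
    # Tab versus four spaces, and two spaces versus one: treated as equal.
    print(equal_ignoring_whitespace("if x:\n\treturn  1", "if x:\n    return 1"))
    # Whitespace present versus absent: still treated as different.
    print(equal_ignoring_whitespace("x = 1", "x=1"))
```

With this choice, re-indenting a line does not register as a change, while joining two tokens into one still does.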

When the characters of a text document are counted, whitespace is sometimes left out of the count.

Regular expressions

For regular expressions, two slightly different definitions are common for the characters that the character classes \s and [:space:] treat as whitespace. In Perl-compatible regular expressions (PCRE), whitespace comprises at least the space (U+0020), the horizontal tab (U+0009), the line feed (U+000A), the form feed (U+000C) and the carriage return (U+000D). In regular expressions according to the POSIX standard, the vertical tab (U+000B) also counts as whitespace. In both cases the active locale may add further characters, for example the ideographic space (U+3000) in a Japanese locale.
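How much the set behind \s depends on the engine and its mode can be seen in Python's `re` module, used here purely as an illustration: it is neither PCRE nor a POSIX implementation. With the `re.ASCII` flag, \s matches only the six ASCII whitespace characters; without it, \s on text strings also matches further Unicode whitespace. The dictionary `NAMES` and the helper `matched` are hypothetical names for this sketch.

```python
import re

# Characters discussed above, keyed by the character itself.
NAMES = {
    "\u0020": "space",
    "\u0009": "horizontal tab",
    "\u000A": "line feed",
    "\u000B": "vertical tab",
    "\u000C": "form feed",
    "\u000D": "carriage return",
    "\u3000": "ideographic space",
}

ascii_ws = re.compile(r"\s", re.ASCII)  # only [ \t\n\r\f\v]
unicode_ws = re.compile(r"\s")          # Unicode-aware for str patterns


def matched(pattern: "re.Pattern") -> list:
    """Return the names of all sample characters that the pattern matches."""
    return [name for ch, name in NAMES.items() if pattern.fullmatch(ch)]


if __name__ == "__main__":
    print("ASCII mode:  ", matched(ascii_ws))
    print("Unicode mode:", matched(unicode_ws))
```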

The ECMAScript standard, and thus JavaScript, has its own definition of the characters that regular expressions treat as whitespace. It includes, among others, the non-breaking space (U+00A0), the byte order mark (U+FEFF) and all characters defined as whitespace in version 3.0 of the Unicode Standard.

Unicode

In Unicode, several properties are assigned to each code point, that is, to each Unicode character. Among other things, the characters are divided into general categories (General_Category, gc). The characters considered whitespace fall into the category for control characters (Cc) and the three separator categories for lines (Zl), paragraphs (Zp) and spaces (Zs). There is no general category for whitespace as such. In addition, each character is assigned a bidirectional class (Bidi_Class, bc). A class named White_Space (WS) exists there for use within the Unicode bidirectional algorithm, but it contains only various spaces. Characters such as tabs and line feeds do not count as whitespace in this sense; they belong to separate bidirectional classes for common separators (CS), segment separators (S) and paragraph separators (B).
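The two properties can be inspected with Python's standard `unicodedata` module, as in the sketch below. The helper `properties` is a hypothetical name, and the values returned depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def properties(ch: str) -> tuple:
    """Return (General_Category, Bidi_Class) for a single character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


if __name__ == "__main__":
    samples = ["\u0009", "\u000A", "\u0020", "\u00A0", "\u2028", "\u2029"]
    for ch in samples:
        gc, bc = properties(ch)
        print(f"U+{ord(ch):04X}  gc={gc}  bc={bc}")
```

The horizontal tab, for instance, has the general category Cc and the bidirectional class S, whereas the space has Zs and WS.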

Unicode identifies 25 characters as whitespace by means of the White_Space property:

  • Several control characters, specifically the horizontal tab (U+0009), the vertical tab (U+000B), the line feed (U+000A), the form feed (U+000C) and the carriage return (U+000D)
  • The space (U+0020)
  • The next line control character (U+0085)
  • The non-breaking space (U+00A0)
  • The Ogham space mark (U+1680)
  • Eleven spaces of various widths, including quad, thin and hair spaces (U+2000 to U+200A)
  • The line and paragraph separators (U+2028 and U+2029)
  • The narrow non-breaking space (U+202F)
  • The medium mathematical space (U+205F)
  • The ideographic space (U+3000)
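The list above can be written down as an explicit set, sketched here in Python. The names `WHITE_SPACE` and `is_white_space` are hypothetical; the code points are exactly those listed. Python's built-in `str.isspace` uses its own rule and is not guaranteed to agree with this set for every character.

```python
# The 25 code points listed above, built range by range.
WHITE_SPACE = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000D + 1)]    # tab, LF, VT, FF, CR
    + ["\u0020", "\u0085", "\u00A0", "\u1680"]
    + [chr(cp) for cp in range(0x2000, 0x200A + 1)]  # eleven fixed-width spaces
    + ["\u2028", "\u2029", "\u202F", "\u205F", "\u3000"]
)


def is_white_space(ch: str) -> bool:
    """Return True if ch is one of the code points listed above."""
    return ch in WHITE_SPACE


if __name__ == "__main__":
    print(len(WHITE_SPACE))
    print(is_white_space("\u00A0"), is_white_space("\u200B"))
```

The zero width space (U+200B) used in the last line is not in the list, so the check rejects it.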

For use in software development, and especially in programming languages, Unicode defines a second property called Pattern_White_Space (named after the patterns of regular expressions) with only 11 characters (U+0009 to U+000D, U+0020, U+0085, U+200E, U+200F, U+2028 and U+2029). Notably, the non-breaking and script-specific spaces are absent from it.
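As a sketch of how a language implementation might use this property, the following Python code splits source text at the 11 characters named above. The names `PATTERN_WHITE_SPACE` and `split_tokens` are hypothetical, and a real lexer would do considerably more than split on whitespace.

```python
# The 11 code points named in the text above.
PATTERN_WHITE_SPACE = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000D + 1)]
    + ["\u0020", "\u0085", "\u200E", "\u200F", "\u2028", "\u2029"]
)


def split_tokens(source: str) -> list:
    """Split source into tokens at runs of Pattern_White_Space characters."""
    tokens = []
    current = []
    for ch in source:
        if ch in PATTERN_WHITE_SPACE:
            # A separator ends the current token, if one is open.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(split_tokens("let\tx =\u200E 1"))
    # The non-breaking space is not a separator under this property.
    print(split_tokens("a\u00A0b"))
```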

This list, too, is only a recommendation: the designers of a programming language may deviate from it, but they are advised to use the Unicode definition as the basis for their own.
