Text file

from Wikipedia, the free encyclopedia
Figure: The 95 printable characters of the original ASCII.

In information technology, a text file is a file that contains displayable characters. These can be subdivided by control characters such as line and page breaks. The counterpart to the text file is the binary file. Strictly speaking, text files are also stored in binary form, but the two terms are used as complements because the interpretation of the binary content is what matters: in a text file, the content is interpreted as a sequence of characters from a character set; in a binary file, any other interpretation of the content is possible. As a result, in contrast to a binary file, a text file can be read without special programs and can be viewed and edited with a simple text editor such as Notepad under Windows or vi or Nano under Unix.

In contrast to this technical definition, in which the file format is decisive, colloquial use of the term is often based primarily on the content that is visible to the end user: somewhat vaguely, all files whose purpose is to present readable text are referred to as "text files", regardless of the form in which the text is saved. The files that conventional word processing or publishing software generates when saving, however, often have complex formats which, in addition to the text, contain meta information describing the layout, the structure and the fonts used; pictures or graphics can also be embedded. Such files are therefore not text files in the technical sense, as the formats are often binary and special software is required for display.

In a text file in the technical sense, the set of available characters is determined by the underlying character encoding. The most common encodings are ASCII and UTF-8, an encoding of Unicode. Such a text file does not necessarily have to contain text; it can also contain ASCII art, for example, i.e. pictures composed of the available characters. If the file does contain text, and no special processing or knowledge of a special notation is required to understand its meaning, the content is referred to as plain text. The set of characters actually used is often further restricted by a natural or formal language. Text files that require a specific notation, such as HTML files, can be edited with a simple text editor, but there are often special programs that make editing easier, for example with special highlighting or automatic formatting.

History

In the early days of electronic data processing, the distinction between text and binary files was easier to draw than it is today. In a text file, each character was always converted directly into a specific bit pattern. The file could be sent straight to a terminal, printer or teletype, character by character, without conversion by a special program. The Baudot code used for transmission between teleprinters is also the origin of the control characters "line feed" and "carriage return" found in text files.

A character encoding is used to convert the physically stored bit sequences into text. In the past, one character was almost always converted into exactly one byte, i.e. usually a group of 8 bits, which allowed for 256 (2^8) different characters. ASCII in its original definition actually used only 7 bits.

With 7- or 8-bit character sets, only one script can be used in a file; the use of different languages is possible only to a limited extent. East Asian writing systems, such as Japanese, Chinese and Korean, can practically not be represented at all. ISO 2022, in 1986, was the first standard that made it possible to use different scripts in one text file and that also provided for scripts with more than 256 different characters. However, this standard was widely used only in East Asia and was superseded by Unicode, first published in 1991, which is intended to cover all existing writing systems in the long term.

Since the introduction of Unicode at the latest, converting a character into its binary representation has become more complicated: there are several variants for doing so, and a character is not always encoded with the same number of bytes.

As the exchange of files between different computer systems has become more important, not least because of the Internet, and text files are easier to process than binary files, the text format has gained in importance. However, because text files are used in so many different ways, the term itself has become less precise and more blurred.

Differentiation of binary and text files

Many operating systems have conventions for using the file name extension to identify the file type. Under Windows and macOS, the extension .txt is usually appended to the name of a text file; this extension is also sometimes used on other operating systems such as Linux.

The Multipurpose Internet Mail Extensions (MIME), designed to standardize the technical format of e-mails, define so-called media types, which are now also used in many areas beyond e-mail to identify the file type. The media type text indicates text. The complete type specification adds a subtype that specifies the purpose of the text. For text files that directly contain the "actual" text and are not intended for specific machine processing, the full type is text/plain.

No special formatting, such as bold highlighting, can be specified for the text contained in a text file. Some encodings do, however, allow the stacking of diacritical marks or the representation of bidirectional text.

A file created with a word processor (such as Microsoft Word or LibreOffice Writer) is normally not a text file, even if only text was entered, as the text can only be displayed and edited again using a suitable word processor. Likewise, a text in PostScript (.ps), Portable Document Format (PDF, .pdf) or TeX DVI (.dvi) is not a text file because it contains encoded format information, which can also be binary. Texts that are read in using a scanner are also not text files; rather, they are image files, unless they are converted into a text file after scanning using text recognition software (OCR, optical character recognition).

With data compression, a considerably greater reduction in size can usually be achieved for text files than for binary files. This is because text files have a lower information density than most binary files, which common compression algorithms exploit, for example by using Huffman coding.
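The difference can be illustrated with a minimal sketch. It assumes Python's standard zlib module (DEFLATE, which includes Huffman coding) and made-up sample data; the helper name compression_ratio is hypothetical. Highly repetitive text like this sample compresses far better than typical prose, so the numbers only show the tendency.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


# Repetitive English text: low information density.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("ascii")

# Random bytes stand in for dense binary data such as already compressed files.
dense = os.urandom(len(text))

print(f"text:  {compression_ratio(text):.3f}")
print(f"dense: {compression_ratio(dense):.3f}")
```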

Identification of the end of the line

There are basically two ways of defining where a new line should begin in the text: defining a constant number of characters per line or using defined special characters to mark the end of the line.

Definition of a constant line length

Using a fixed line length has the advantage that the position of a specific line within the character string (byte sequence) of the file can be determined without having to read the file line by line. However, it has the disadvantage that lines with shorter content have to be filled up (see padding); this is usually done with spaces. As a result, the file takes up more space than necessary whenever the line length is not fully used. Such a fixed line length is only used on mainframe systems. The record length is managed by the file system or must be specified when accessing the file. A record length of 80 characters is very common, as this number of characters can be displayed in one line on character-oriented terminals, which in turn goes back to the historical punch cards.
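The offset arithmetic behind this advantage can be sketched as follows. The example assumes a single-byte encoding (ASCII), so that one character equals one byte; the names pad_record and read_record are hypothetical and do not correspond to any mainframe API.

```python
RECORD_LENGTH = 80  # characters per line, as on punch cards


def pad_record(line: str, length: int = RECORD_LENGTH) -> str:
    """Pad a line with spaces to the fixed record length."""
    if len(line) > length:
        raise ValueError("line longer than record length")
    return line.ljust(length)


def read_record(data: bytes, index: int, length: int = RECORD_LENGTH) -> str:
    """Fetch record number `index` (0-based) directly by offset arithmetic."""
    start = index * length
    # Trailing spaces are treated as padding and removed.
    return data[start:start + length].decode("ascii").rstrip(" ")


lines = ["first line", "second", "third and last line"]
data = "".join(pad_record(line) for line in lines).encode("ascii")

print(len(data))             # three records of RECORD_LENGTH bytes each
print(read_record(data, 2))  # no need to scan the first two records
```

Note that stripping the padding also removes spaces that were genuinely part of the line end; a fixed-length format cannot tell the two apart.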

Identification by means of control characters

The usual choice of character for marking the end of a line goes back to the original direct output of text files on teleprinters or printers, whose design corresponded to that of a typewriter. There, the "commands" carriage return (CR) and line feed (LF) were both necessary to continue printing at the beginning of the next line; on a teletype these were two separate keys. These two control characters were consequently the obvious candidates for marking the end of a line when files are stored electronically. In principle, however, one of the two characters is sufficient, and this choice led to inconsistent definitions, which still complicates the exchange of files across systems:

  • Mainly on IBM PC-compatible computers, under operating systems such as PC-compatible DOS or Windows, the sequence CR LF (ASCII: two bytes) is used to mark the end of a line.
  • Under Unix and Unix-like operating systems, for example Linux, the end of the line is marked by LF alone (ASCII, UTF-8: one byte).
  • Older operating systems from Apple, for example System 1 to Mac OS 9 (1984-2001), used a third option, CR alone.
  • In the world of IBM mainframes, EBCDIC provides a further special character, New Line (NL), in addition to these two characters.

Most of the problems in this regard arise when exchanging files, e.g. between the Windows and Unix platforms, since these largely use the same character encoding and, apart from the line end character, no conversion of the files is usually necessary.
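A conversion between the CR LF, CR and LF conventions listed above can be sketched in a few lines. This is a minimal illustration that assumes an ASCII-compatible single-byte or UTF-8 encoding (it does not handle UTF-16 or the EBCDIC NL character); the function name normalize_newlines is hypothetical.

```python
def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    """Convert CR LF, lone CR and lone LF line endings to `newline`."""
    # Replace the two-byte sequence first so it is not counted as two line ends.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)


windows_text = b"line one\r\nline two\r\n"
print(normalize_newlines(windows_text))        # converted to the Unix convention
print(normalize_newlines(b"a\nb\n", b"\r\n"))  # converted to the Windows convention
```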

Further control characters

In addition to marking the end of the line, other control characters can occur in text files, especially when ASCII is used. These were especially common when the content of text files was still transferred directly to a terminal or printer. The most important are Form Feed (FF), which marks the position of a page break in the text, and Horizontal Tabulation (HT), the tab character, which marks an indentation of the text.

To influence the display of the text in an even more differentiated manner, escape sequences were sometimes used in text files. They consist of the introductory control character Escape (ESC) followed by a sequence of additional characters that encode a display instruction. The established standards here are the ANSI escape sequences according to ANSI X3.41-1974 and X3.64-1977, which were originally developed to control terminals such as the VT models by DEC. In the era of the dot matrix printer, the ESC/P standard introduced by Epson was widely used for printing, so that escape sequences of this type could also be found in text files.
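As an illustration of how such a sequence is embedded in ordinary text, the following sketch wraps a word in the ANSI sequences for "bold" (ESC [ 1 m) and "reset attributes" (ESC [ 0 m). The helper name bold is hypothetical, and whether the effect is visible depends on the terminal used.

```python
ESC = "\x1b"  # the Escape control character (ASCII 27)


def bold(text: str) -> str:
    """Wrap text in the ANSI sequences for 'bold on' and 'reset'."""
    return f"{ESC}[1m{text}{ESC}[0m"


marked = bold("warning")
print(repr(marked))  # the raw characters as they would be stored in the file
print(marked)        # a terminal that understands ANSI sequences renders bold text
```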

Character encoding

The physical binary content of a text file is converted into text according to a rule that is fixed for the respective file. The following character encodings are used:

  • ASCII is the most widely used format, especially when the various extensions of the standard are included.
  • ISO 8859-1 (also known as Latin-1) and ISO 8859-15 are standardized extensions of ASCII that form the basis of the Windows-1252 code used by Windows in the English and Western European language areas.
  • EBCDIC is an encoding used on mainframe computers from IBM.
  • Unicode is an international standard that aims to cover all meaningful characters worldwide. In contrast to the encodings above, Unicode cannot make do with 8 bits (i.e. one byte), since it defines far more than 256 different characters.

With Unicode, the simple mapping of one character to one byte can no longer be used. There are different methods of converting Unicode into a byte sequence. The most common encodings aim to minimize file size when the most frequent characters occur. For this, however, the rule that each character is always encoded with the same number of bytes is "sacrificed". An example is the widespread UTF-8 encoding, which also has the special feature that all characters contained in the original ASCII are encoded in one byte, exactly as in ASCII. The binary content of a file that consists exclusively of such characters is therefore identical, regardless of whether it was encoded in ASCII or UTF-8.
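Both properties of UTF-8, the variable number of bytes per character and the compatibility with ASCII, can be checked with a short sketch using Python's built-in str.encode. The helper name utf8_bytes and the sample characters are chosen for illustration only.

```python
def utf8_bytes(char: str) -> bytes:
    """Return the UTF-8 byte sequence for a single character."""
    return char.encode("utf-8")


# U+0041 (A), U+00E4 (a with diaeresis), U+20AC (euro sign)
for char in ("\u0041", "\u00e4", "\u20ac"):
    encoded = utf8_bytes(char)
    print(f"U+{ord(char):04X}", len(encoded), encoded.hex(" "))

# A text made up only of original ASCII characters is byte-identical in both encodings.
sample = "plain text"
print(sample.encode("ascii") == sample.encode("utf-8"))
```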

Figure: Incorrect display when the wrong encoding is used (umlauts and eszett appear as special characters).

With Unicode, there is also the convention of using special byte sequences (so-called byte order marks) at the beginning of a file to indicate which Unicode encoding is used. This is necessary, among other reasons, because on many systems, including Windows, the earlier ASCII-based encodings and Unicode are used in parallel. With such a marker, the boundary to the binary file begins to blur.
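A reader of such files might check for a byte order mark roughly as follows. This is a sketch based on the BOM constants in Python's standard codecs module; the function name detect_bom and the returned encoding labels are assumptions made for the example.

```python
import codecs

# Longer marks are checked first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding announced by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom("text".encode("utf-8-sig")))  # "utf-8-sig" writes a UTF-8 mark
print(detect_bom(b"text"))                     # plain ASCII bytes carry no mark
```

A file without a mark is not necessarily ASCII; the mark is a convention, and many UTF-8 files omit it.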

If a text file is interpreted using an incorrect character encoding, it can be completely illegible when entirely incompatible encodings are involved, such as ASCII and EBCDIC. If, on the other hand, a different encoding derived from the original ASCII is used, only the special characters, for example the German umlauts, are displayed incorrectly, as these are not among the first 128 standardized characters of ASCII.
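Both cases can be reproduced with a small sketch. It assumes Python's built-in codecs "utf-8", "iso-8859-1" and "cp037" (one of several EBCDIC code pages); the helper name misread and the sample words are made up for the example.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# Two ASCII-derived encodings: only the umlaut is garbled.
print(misread("K\u00e4se", "utf-8", "iso-8859-1"))

# ASCII bytes read as EBCDIC: no character survives.
print(misread("text file", "ascii", "cp037"))
```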

Exchange between different systems

When text files are transferred from one system to a different type of system, it must be checked whether the character encodings used by the two systems match. The method used to mark the end of a line must also be taken into account (see above). The exchange of files that use only the first 128 characters of ASCII is usually unproblematic between systems that use ASCII or an encoding derived from it. The Unicode encoding UTF-8 also matches ASCII exactly if only these characters are used. If, on the other hand, other characters are used, a conversion is often necessary. However, a conversion only needs to be carried out if the file is actually displayed on the target system itself. If the file is merely stored there and later transferred back to a system that uses the original encoding for display, a conversion would be unnecessary and possibly even harmful, since information can be lost through the double conversion.

When text files are exchanged as e-mail attachments, inconsistencies can occur. The problem usually lies with the sender: the mail client often cannot correctly determine the encoding of the text file, yet for reasons of user-friendliness does not ask the user for it, and thus enters no or incorrect encoding information in the mail. In principle, most mail clients in use today are able to convert the encoding if necessary.

For a direct file transfer between systems, a special program is usually used. It also takes care of the necessary conversions, even if the encodings of the two systems are completely different, for example when exchanging files between Windows and IBM mainframes. For such a transfer it is usually necessary to specify whether the file is a text or a binary file, which determines whether it is converted or not; the content of a binary file would be destroyed by such a conversion.

Use of text files

The original and simplest use of text files is to transmit the text they contain as the actual information (plain text). Text files can, however, also be used to transmit more complex data using a formal structure that has to be defined in advance. The file is then usually no longer intended primarily for direct use by the user, but is processed further by a specific program or maintained by a system administrator.

Today, text files are used in this way in many cases where binary files would actually seem the natural choice because only machine processing takes place. The main disadvantage of binary files here is that their structure varies far more across system boundaries than that of text files (see for example byte order). On the other hand, text files have the disadvantage that more storage space is required to store the same information and that the data often have to be converted back into binary format for further processing. However, since the cross-system exchange of data has become more and more important, especially due to the Internet, data storage in text files is now common practice.
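The trade-off can be made concrete with a single number stored both ways. The sketch below uses Python's standard struct module; the value and the choice of a 32-bit unsigned integer are arbitrary assumptions for the example.

```python
import struct

value = 1234567890

# Binary: the same 32-bit integer yields different bytes depending on byte order.
little = struct.pack("<I", value)  # little-endian
big = struct.pack(">I", value)     # big-endian
print(little.hex(" "), "|", big.hex(" "))

# Text: the decimal digits are stored in reading order on every system,
# but take more space and must be parsed before calculating with them.
as_text = str(value).encode("ascii")
print(len(little), len(as_text), int(as_text.decode("ascii")))
```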

Text format is also often used for configuration files that are maintained by administrators or privileged users. With a binary format, a special configuration program would be required in each case; with a text format, the configuration file can be edited directly using a text editor.

Tabular data

Text files are used to store data with a tabular structure for various reasons. Files structured in this way can be processed further with a spreadsheet program (e.g. Calc from the LibreOffice and Apache OpenOffice packages, or Microsoft Excel). Database data are often exported in this way in order to exchange them between different application programs, even though the XML format seems the natural choice for such a case today.

There are several methods of tabulating data in text files, of which the following are the most common:

  • Separation of columns by tabulator: The tabulator character, a special control character, is used within a line to identify the column boundaries.
  • CSV format: This format, whose name originally stood for Comma Separated Values, is similar to the separation by tab characters, except that the comma is usually used as a separator in the English-speaking world and the semicolon in German-speaking countries.
  • Definition of a constant number of characters per column: In order to be able to use such a file, you must know the width of each individual column. This definition is not saved in the file itself.
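The three methods listed above could be read back roughly as follows. The sketch uses Python's standard csv module for the two delimiter-based variants and plain string slicing for fixed column widths; the function names and the sample data are hypothetical, and the column widths have to be supplied by the caller because they are not stored in the file.

```python
import csv
import io


def parse_delimited(text: str, delimiter: str) -> list:
    """Parse tab-, comma- or semicolon-separated text into rows of fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def parse_fixed_width(text: str, widths: list) -> list:
    """Cut each line into columns of known width and strip the padding."""
    rows = []
    for line in text.splitlines():
        fields, start = [], 0
        for width in widths:
            fields.append(line[start:start + width].strip())
            start += width
        rows.append(fields)
    return rows


print(parse_delimited("name\tcity\nAda\tLondon\n", "\t"))
print(parse_delimited("name;city\nAda;London\n", ";"))
print(parse_fixed_width("name city  \nAda  London\n", [5, 6]))
```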

XML

XML (Extensible Markup Language) is a meta file format: it defines the format in which the structure of a file is described. XML is deliberately a text format that should be readable for humans and machines alike and should allow cross-system exchange of XML data without problems.

XML files are basically text files whose rough structure is standardized and which are mainly used for data exchange or data storage; the exact purpose is not specified by XML itself. An example of a format based on XML is SVG (Scalable Vector Graphics), a graphics format that is stored in a text file in an essentially legible form.
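That an SVG graphic is ordinary, machine-readable text can be seen in a short sketch. The SVG snippet is made up for illustration, and the parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up SVG document: plain text that describes a red circle.
svg_text = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="red"/>
</svg>"""

root = ET.fromstring(svg_text)
# Element names are qualified with the SVG namespace in curly braces.
circle = root.find("{http://www.w3.org/2000/svg}circle")
print(root.tag)
print(circle.attrib)
```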

The word processor file formats OpenDocument (OpenOffice.org) and Office Open XML (the newer versions of Microsoft Word, recognizable by the file extension .docx instead of .doc) are based on XML; in both formats the XML parts are normally packed into a ZIP archive, so only the unpacked XML files are text files. It should be noted, however, that the "text" that becomes visible when such a file is inspected directly is not the "actual" text content of the document but a description of the text document on a meta level.

Other file formats

In addition to XML formats, there are also some widely used, mostly older markup languages that are usually stored in the form of a text file.

  • HTML, the language for designing content on the World Wide Web, is structurally related to XML.
  • Rich Text Format (RTF) is a language for exchanging formatted text between word processing programs, even on different platforms.
  • TeX and LaTeX form a typesetting system that uses a special language for text composition, which is stored in text files.
  • PostScript is a file format that enables professional print formatting and is saved in the form of a text file. The binary data contained in graphics are converted into text as hexadecimal digits. Since many printers can interpret this format directly, many word processing or desktop publishing programs output their results in PostScript format. However, PostScript is being replaced by PDF in some areas.

In addition, there are many other formats, some of them proprietary, whose structure can only be determined if a corresponding specification is available.

View and edit text files

Under Windows and its predecessor MS-DOS, both from Microsoft, the command line commands TYPE and MORE are used to display text files. Text editors for directly displaying and editing text files exist under all operating systems, for example vi or Nano under Unix. Practically all text editors allow you to search for specific text content directly in a file. Many text editors also offer support for displaying special file formats, so that various syntax elements are highlighted according to their meaning (for example by coloring). With the help of a text editor, a file can usually also be printed.

Both when displaying a file in a text editor and when printing it, the problem can arise that the indentation of lines is not displayed correctly. This is mostly because the file contains the tab control character, for which there is no standard definition of how far the indentation should be. How many characters are indented is therefore a configuration setting of the editor or printer. To make matters worse, in a text editor the difference between a space and a tab character is usually impossible or difficult to see.
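The dependence on the configured tab width can be reproduced with Python's built-in str.expandtabs, which replaces each tab with spaces up to the next tab stop. The sample line and the two widths are arbitrary choices for illustration.

```python
line = "\tindented\ttext"

for tab_size in (4, 8):
    # The stored line is identical; only the assumed tab width differs.
    print(repr(line.expandtabs(tab_size)))
```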

Text editors often automatically insert "soft" line breaks if the window used is not wide enough to display the entire line. Such "soft" line breaks can also be inserted when printing. These line breaks are not contained in the file itself and can fall in different places if the output is on another medium. It is often difficult for the user to distinguish them from the actual, "hard" line breaks, that is, the line breaks that the user has deliberately inserted into the file, for example using the corresponding key, and that are also saved in the file.
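The distinction can be sketched with Python's standard textwrap module, used here as a stand-in for an editor's display logic. The function name display and the sample text are hypothetical; real editors may break lines at different positions.

```python
import textwrap

# One "hard" line break is stored in the text; the first line is too long for a narrow window.
stored = "A long first line that does not fit into a narrow window.\nShort line."


def display(text: str, width: int) -> list:
    """Return the screen lines shown for `text` in a window `width` columns wide."""
    screen_lines = []
    for hard_line in text.split("\n"):
        # textwrap.wrap breaks lines for display only; `text` itself stays unchanged.
        screen_lines.extend(textwrap.wrap(hard_line, width) or [""])
    return screen_lines


for width in (20, 40):
    print(width, display(stored, width))
```

The number of displayed lines changes with the window width, while the stored text keeps exactly one line break.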

Literature

  • Sascha Kersken: IT manual for IT specialists. Galileo Computing, Bonn 2009, ISBN 978-3-8362-1420-9.
  • Nell B. Dale, John Lewis: Computer science illuminated. Jones and Bartlett Publishers, Sudbury 2007, ISBN 0-7637-4149-3.


This version was added to the list of articles worth reading on January 8, 2010.