CSV (file format)

from Wikipedia, the free encyclopedia
CSV
File extension : .csv
MIME type : text / csv


The file format CSV stands for English Comma-separated values (rare character separated values ) and describes the structure of a text file easily for storage or exchange of structured data . The file name extension is .csv .

There is no general standard for the CSV file format, but it is fundamentally described in RFC 4180 . The character encoding to be used is also not specified; 7-bit ASCII code is widely considered to be the lowest common denominator.

Tables or lists of different lengths can be displayed in CSV files .

The Simple Data Format (SDF) was used as an alternative to the CSV format .

More complicated, for example nested data structures can be saved using additional rules or in linked CSV files. However, other formats such as JSON , XML or EDIFACT are more suitable for saving them in a file .

File structure

Some characters within the text file have a special function for structuring the data.

  • A character is used to separate data records . As a rule, this is the line break of the file-generating operating system - in practice, in the Windows operating system it is often actually two characters.
  • A character is used to separate data fields (columns) within the data records. The comma is generally used for this . Depending on the software involved and the user settings, semicolons , colon , tabs , spaces or other characters are also common.
  • In order to be able to use special characters within the data (e.g. commas in decimal values ), a field delimiter (also: text delimiter ) is used. Usually this field delimiter is the quotation mark " . If the field delimiter itself is contained in the data, it is doubled in the data field (see masking marks ).

The first record can be a header record that defines the column names.

According to RFC 4180 , paragraph 2, point 4, each data record should contain the same number of columns - but this is not always observed.

Formatting of the data fields

The formatting of the data itself is not specified. This means that the formats used must be agreed between the users involved. The following are particularly affected:

  • Dates and times
    • The order of the individual details (year, month, day, hour, minute, second, ...) cannot always be clearly identified.
    • A further complicating factor is that nationally different separators are used, especially when specifying dates.
    • The most harmless hurdle in this case is that the numerical values ​​occur with and without a leading zero.

Examples: Is 04/03/02 March 4, 2002, April 3, 2002, March 2, 2004, or a completely different value? Does 8:09 correspond to “nine minutes past eight in the morning”, “8:00 p.m.” or is it a “duration of 8 minutes and 9 seconds”?

  • Texts
    • In contrast to XML, CSV does not provide for a note on the character set used within the file. The character coding used should be defined in advance between all parties involved.
  • Numerical values
    • According to the original specification for the CSV data format, number fields with a fixed minimum width can be used. Then numerical values ​​are supplemented with leading zeros in order to obtain the minimum width.
    • Different decimal and thousand separators have become established in different countries. These symbols can even be used in opposite directions across national borders.
    • Sometimes a thousand separator is not used.
    • The wealth of formats for currency information is unmanageable.
  • Empty field
    • The field content "" is sometimes interpreted as empty content and sometimes as a single quotation mark.

Calculations

The CSV format describes data records linked to one another line by line. Calculations are not provided, but many programs such as LibreOffice Calc , OpenOffice.org Calc , Excel and Gnumeric accept appropriate arithmetic expressions. These are based on the respective program. For example, the following (first) line works for the programs mentioned:

100;200;=A1+B1

Depending on the language environment, named functions can also be used.

software

Applications

  • The CSV file format is often used to exchange data between different computer programs, for example database tables .
  • The password file of /etc/passwdthe Unix user administration is a CSV file with the separator ":".

example

The following source text of a CSV file, with the comma (,) as field separator and line break (CR LF) as data record separator and with column headings in the first line:

Stunde,Montag,Dienstag,Mittwoch,Donnerstag,Freitag
1,Mathematik,Deutsch,Englisch,Mathematik,Kunst
2,Sport,Französisch,Geschichte,Sport,Geschichte
3,Sport,"Religion (ev., kath.)",Kunst,,Kunst

represents the following table:

hour Monday Tuesday Wednesday Thursday Friday
1 mathematics German English mathematics art
2 Sports French history Sports history
3 Sports Religion (ev., Cath.) art art

In this example, the quotation marks are used to mark the comma between ev and kath in the last line as text. The third element of this line is religion (ev, cath) . In addition, all ASCII characters are processed in the form of the 8-bit ISO 8859-1 character set .

Norms and standards

The CSV file format is also specified as Request for Comments (RFC):

  • RFC 4180 - Common Format and MIME Type for Comma-Separated Values ​​(CSV) Files; from October 2005.
  • RFC 7111 - URI Fragment Identifiers for the text / csv Media Type; from January 2014.

Individual evidence

  1. Dirk Louis, Peter Müller: The Java 6 Codebook , Pearson Education, 2007, ISBN 978-3-8273-2465-8 , p. 259: “The most common variation is the replacement of the comma with another separator, which is why CSV often also is understood as an acronym for Character Separated Values ​​"
  2. Page of fgetcsv in php manual