Column-oriented database

from Wikipedia, the free encyclopedia

A column-oriented database is a database management system that physically stores its content in columns (and not in lines). This has advantages in applications such as a data warehouse , where aggregates are formed from large numbers of similar elements. The column-oriented approach is in contrast to the row- oriented approach that most known database systems follow.

description

A database usually presents its data as two-dimensional tables made up of rows and columns; however, these must be stored in one-dimensional form. For example, a database might contain the following table:

Personal No Surname First name salary
1 Schmidt Joseph 40000
2 Müller Maria 50000
3 Meier Julia 44000

This simple table contains a column for the personnel number, name columns and a salary.

This table exists in memory and on the computer's hard drive . Both types of storage have in common that the data are arranged in a one-dimensional sequence of bytes from the point of view of the operating system . The task to be solved is to map the two-dimensional structure of a database table in a one-dimensional sequence of bytes.

A line-oriented database concatenates all data values ​​in a line, followed by the next line, and so on.

1,Schmidt,Josef,40000;2,Müller,Maria,50000;3,Meier,Julia,44000;

Instead, a column-oriented database works column by column:

1,2,3;Schmidt,Müller,Meier;Josef,Maria,Julia;40000,50000,44000;

The physical organization of a database is strongly influenced by partitioning , indexes , caching , views , OLAP cubes and transactional aspects such as write-ahead logging . Taking all these influences into account, it turns out that OLTP systems tend to be line-oriented, while OLAP systems strive for a balance between line and column orientation.

Advantages and disadvantages

Comparisons between row-oriented and column-oriented systems typically focus primarily on the efficiency of hard disk access, which is significant time consuming compared to other computer operations. Reading a megabyte of sequentially stored data can take as long as a single random access . And since the access time of the hard disks improves only slowly compared to the CPU speed (see Moore's Law ) , this view will remain as long as the systems store their data on hard disks. In a greatly simplified form, the following observations give you an idea of ​​the advantages and disadvantages of the column and row-oriented organization.

  • Column-oriented systems are more efficient when an aggregate has to be formed over many rows but only a few columns, since, in contrast to the row-oriented system, you only have to read these and not all columns.
    Example:SELECT SUM(Gehalt) FROM tabelle;
  • Column-oriented systems are more efficient if a column receives a new value for all rows of the table at the same time, since the column data can be written efficiently and the data in the other columns do not have to be taken into account.
    Example salary increase:UPDATE tabelle SET Gehalt = Gehalt * 1.03;
  • Row-oriented systems are more efficient when many columns of a single row are required at the same time and when the row width is very large, since the entire row can then be read with a single disk access.
    Example:SELECT * FROM tabelle WHERE Personalnr = 1;
  • Line-oriented systems are more efficient when inserting a new line if all the data for this line is available at once, since the line can then be written with a single access.
    Example:INSERT INTO tabelle (Personalnr, Nachname, Vorname, Gehalt) VALUES (4, 'Maier', 'Karl-Heinz', 45000);

In practice, line-oriented architectures are beneficial for typical OLTP tasks (e.g. accounting systems) with many interactive transactions. Column-oriented systems are well suited for OLAP tasks (e.g. analytical information systems ), which are typically characterized by a small number of very complex queries over all data records. But there are also a number of tried and tested line-oriented relational OLAP databases that can process terabytes or even petabytes of data, such as Teradata , or the IBM PureData System for Analytics (IBM Netezza).

compression

Column data has a uniform data type; therefore there are some options for disk space optimization available in column-oriented systems that are not possible with line-oriented data. For example, many compression schemes such as the Lempel-Ziv-Welch algorithm (LZW) or run-length coding take advantage of the similarity of neighboring data for compression. While these techniques can be used for row-oriented data, a typical implementation will produce less effective results.

To improve compression, some implementations (for example Vertica ) sort the columns. When used in conjunction with bitmap indexes , sorting can improve compression by an order of magnitude. In order to improve the compression of the lexicographical order in run-length coding, it is advisable to use the columns with less cardinality as the first sorting key. In a table with the columns Name , Gender , Age , it would be best to sort on the basis of gender (cardinality 2), then age (cardinality <150), and then the name.

In the case of a column-oriented database, where each individual column can be compressed, the order of the columns in the table does not have any influence on the compression. In any case, the sequence can lead to better compression rates for composite indices. However, when re-sorting, the usefulness of an index for a query that only specifies a part of the index fields can be lost. For example, if the index includes name and plant across all employees, the compression will have to be increased if it is rearranged by plant and name. Afterwards, however, the index can usually no longer be used for a search by name.

Column compression leads to a reduction in disk space consumption at the expense of reading speed. All of the data in a single column can be read much more efficiently if this data is stored in the same location, as is the case with a row-oriented architecture. Access to individual data becomes more difficult as the compression increases, since large amounts of data must first be decompressed in order to read a single record. Therefore, column-oriented architectures are often enriched with additional mechanisms in order to minimize the need to access compressed data.

Since the mid-2000s, the assumption that the bottom line is that compression is slower no longer applies. With increasing computing power, it has usually become faster to fetch small amounts of data from disk and then decompress instead of reading large amounts of uncompressed data. The same applies to write access. The manufacturers of line-oriented databases such as Oracle also rely on compression and recommend this on suitable servers to increase speed.

Implementations

Column storage came in the form of inverted files in the early days of database systems, starting in the 1970s. For example, Statistics Canada implemented the RAPID system as early as 1976 and used it for the Canadian census and some other statistical applications. RAPID was also used by other statistical organizations around the world until the 1980s, and Statistics Canada even into the 1990s.

For many years Sybase IQ was the only product available on the market in the field of column-oriented database systems. However, this has now changed significantly due to many open source and proprietary applications:

Individual evidence

  1. George P. Copeland, Setrag N. Khoshafian: A decomposition storage model . SIGMOD '85, 1985, doi: 10.1145 / 318898.318923 .
  2. ^ C-Store: A column-oriented DBMS . (PDF; 174 kB) In: Stonebraker et al .: Proceedings of the 31st VLDB Conference , Trondheim 2005
  3. Pat & Betty O'Neil, Xuedong Chen, Stephen Revilak: The Star Schema Benchmark and Augmented Fact Table Indexing . (PDF; 501 kB) TPC Technology Conference 8/24/09
  4. DJ Abadi, SR Madden, N. Hachem: Column stores vs. row stores: how different are they really? SIGMOD'08, 2008, pp. 967-980.
  5. ^ N. Bruno: Teaching an old elephant new tricks . (PDF; 185 kB) CIDR '09, 2009.
  6. Daniel Lemire, Owen Kaser, Kamel Aouiche: Sorting improves word-aligned bitmap indexes . In: Data & Knowledge Engineering , 69 (1), 2010, pp. 3–28. arxiv : 0901.3751
  7. ^ Daniel Lemire, Owen Kaser: Reordering Columns for Smaller Indexes , arxiv : 0909.1346
  8. Slezak et al .: Brighthouse: an analytic data warehouse for ad-hoc queries (PDF; 456 kB), Proceedings of the 34th VLDB Conference, Auckland 2008
  9. Oracle Advanced Compression . Oracle Technet
  10. ^ Turner, Hammond, Cotton: A DBMS for Large Statistical Databases . In: Proceedings of VLDB 1979 , Rio de Janeiro.
  11. ibm.com
  12. http://www.vermontdatabase.com/rc21home.htm
  13. Information technology updates: MariaDB ColumnStore