Index structure

Index structures (indices) are used in computer science to ensure quick access to data in an extensive data collection. Data are usually managed sequentially on a storage medium . The processing of a search query would be associated with linear effort, in the worst case the entire database would have to be searched.

If a specific data record is now searched for in this amount of data using a search criterion, an expensive search can be avoided using an index structure. The index makes it possible to quickly determine the position of the data record within the medium.

Known procedures

Index structures are themselves special data structures such as

sorted series (efficient search using binary search )
Hash tables
Tree structures like binary trees , B-trees , B * -tree

There are also special index structures for special requirements. For example, spatial databases use R-trees to index multidimensional data . These trees allow multi-dimensional search criteria and distance calculations.

Bitmap index , grid file , kd tree , R tree, UB tree , area tree for multi-dimensional data structures

Functional principle based on an example

A typical index card system can be referred to as an index structure: The set of addresses is subdivided in such a way that, based on the search criterion ("family name"), only a certain subset of all addresses is possible. A typical subdivision would be according to the first letter. If the name "Miller" is then searched for, only the subset that begins with "M" has to be searched. This subset can usually be found quickly using a marker in the card index. If this subset is still too large, the subsets can be further subdivided using the second or third letter. Instead of searching through all names, only the names that begin with "Garbage" are searched. This can be done in a comparatively short time.

However, such an index card system is not suitable for finding out who's birthday is on a particular day: one would always have to search through the entire index. It is said that the “birthday” attribute was not indexed . A remedy here is a second index, in which usually only a reference to the primary index is stored for each calendar day, here typically the name. This is formally referred to as a “secondary index” or “external index”. In order to send birthday greeting cards for a specific day, for example, it is necessary to first determine the names of the recipients and then look up the associated addresses in the primary index. The advantage, however, is that an address change only has to be made in the primary address file, not additionally in the birthday list.

Any completely sorted list also represents a simple type of index structure that can be searched efficiently using the so-called binary search . Here, the list is mentally divided into two parts, and the existing sorting makes it easy to determine in which of the two parts the element sought must be located. Such a search is also possible, for example, in a printed telephone book. A more advanced index structure based on the same idea is the B-tree . The main advantages here are that they can be updated more easily. In a manually managed address book, however, you have the problem that you cannot insert any entries, so you can neither create a complete sorted list nor a B-tree there - the index card system described above (which does not completely sort within an initial letter) but yes. This is also not a problem with the data volumes that are usually kept by hand.

use

Databases are the best-known application area for index structures . Since particularly large amounts of data have to be processed here - especially so large amounts of data that they do not completely fit into the main memory - fast access is critical here. In this way, suitable indexes can be defined on tables , which lead to a considerable increase in performance. See also the database index , which is the technical implementation of the inverted file .

Geographic information systems often use index structures in the form of a database, but direct integration of the index structure can sometimes be necessary in order to achieve the desired performance. Spatial index structures are used here to limit the search area.

A less obvious area of application is computer graphics , particularly in real-time applications and 3D computer games. Spatial index structures such as the BSP tree are often used here to manage the environment.

Types

Index structures can be divided into different types based on the way they work.

Internal and external index structures

Internal index structures store the actual data themselves, external structures only store references to the data (for example in another index). This is also referred to as a primary or secondary index . A secondary index generally uses less memory and is easier to keep in sync with changes to the data. However, it is also slower since the actual data must always be fetched from the primary index.

An example of a secondary index would be the following: Address data are stored on index cards as described above, sorted by family name. A list is maintained as a secondary index which maps the first name to the family name, for example "Max" to "Miller" and "Mustermann". However, only the family name is stored here, not the full address. For example, if the telephone number changes, it is sufficient to update it in the primary index. However, if you want to find out the telephone number of each “Max”, you must first look up the family name in the secondary index, then the telephone number with the family name in the primary index. If the full address were also stored under the first name, this would be a second primary index, but any change would then have to be made in both files.

Secondary indices allow the emulation of multi-dimensional index structures. In many cases, however, they are inferior to a genuinely multidimensional index structure.

Dynamic and static index structures

An index structure is called dynamic if it dynamically adapts its structure to changes to the data, otherwise it is called static .

Static index structures allow faster access to the data, but must be recalculated in a laborious manner if the data is changed. For example, when inserting into a sorted row, on average half of the stored data must be shifted. Dynamic index structures allow the index to be updated dynamically with every change. It is also said that when changes are made, only a “local” change is made to the index structure, i.e. only a small part of the index is changed.

A typical example for a static index structure is the (static) hash table and for a dynamic index structure the B-tree . The tree depth and structure are automatically adapted to the amount of data. Changes to the tree often only happen in the affected leaf or a path from the root to the affected leaf.

One and multi-dimensional index structures

Typical index structures index according to a single attribute (for example "last name"), even if the data contains additional attributes. In order to allow requests for such additional attributes (or a combination thereof), additional one-dimensional secondary indices are created. But if you want to inquire about two attributes at the same time, the section must be calculated. In unfavorable cases, a great deal of data is initially loaded and then discarded when the section is calculated.

An index structure that can efficiently answer queries with more than one attribute (for example “first name” and “last name” at the same time) is called multidimensional . This is mainly used for spatial data, for example in geographic information systems . Inquiries of the type “What is the nearest petrol station?” Or “All objects in rectangle R” cannot be answered on the basis of a single dimension, but require a rectangle query in the index structure. Queries for only individual attributes can then often be answered more quickly by choosing the query rectangle so that it completely covers the value range of the attributes that are not required.

Clustered and non-clustered index structures

An index structure is called clustered if the data is physically sorted exactly as it is needed in the index and in queries. In this way, area inquiries can be answered more efficiently, while successive operations on neighboring elements usually require the same data pages and can thus be buffered .

An address book sorted alphabetically by family name with separate pages for each letter corresponds functionally to a clustered index. A hash table is usually not clustered; the data of neighboring keys are usually in completely different data pages. A B + tree is an example of a clustered index structure, each data page corresponds to a disjoint interval of the key space.

Thin and densely populated index structures

An index structure is said to be sparse if it also allows "gaps" (i.e. unoccupied memory locations) in its data structures. Sparsely populated index structures therefore occupy additional memory and such positions must be ignored when searching. With dynamic index structures, however, this is also an important advantage: new data can simply be inserted into such a gap. Conversely, when data is removed, new gaps can arise without the index having to be laboriously reorganized. The reduced reorganization effort is therefore a clear advantage in the case of frequently changing data, despite the increased memory consumption and the increased search time. Many sparse index structures allow dynamic size adjustment and try to guarantee, for example, that they do not take up more than twice as much space as necessary.

An ordered series is an example of a densely populated index structure. Hash tables and trees are usually sparse.

Consideration

Disadvantages of index structures are an additional administrative effort due to the structure itself and, in some cases, a high memory requirement.