Aggregation (OLAP)

In connection with the management of large amounts of data in a data warehouse , aggregation (also called consolidation or compression ) refers to the consolidation of a series of facts into a single fact. For example, the mean , the minimum or maximum or the sum can be determined from a set of numbers . Such functions that assign a single value to a set of numbers are called aggregation functions (or summary functions). The result is then used as a substitute for the source data. The reverse is known as refinement .

Aggregation and refinement are operations of Online Analytical Processing (OLAP). They form an OLAP cube (English Cube ab) to a smaller (aggregation) or larger (refinement) cubes. The operations take place along classification paths and thus change the classification levels (also: consolidation levels) of the individual dimensions .

The corresponding operations in a DBMS are also referred to as "roll up" and " roll / drill down ". A simple aggregation is possible in SQL using GROUP BY.

Some multidimensional databases use memory-based analysis. Due to the high performance of the storage medium and the limited amount of data, they enable aggregations at the time of the query ( calculation on the fly ).

example

A data warehouse contains sales by date, industry and business as individual facts. The date dimension contains a classification path with the classification levels day-month-year-TOP. By aggregating along this path, using the sum function, for example, the number of sales by year, industry and business can be determined. When aggregating to the top classification level TOP, the date dimension coincides; the result cube contains the total number of sales by industry and business over all times. The use of aggregate functions and, if necessary, other arithmetic operations to determine a single number is also called the determination of a key figure or a key performance indicator (KPI).

Aggregate functions

Aggregate functions are functions that combine certain properties. The functions used in an aggregation can be divided into

Distributive functions : sum (SUM), number (COUNT), maximum (MAX), minimum (MIN) ...

Formally, a distributive function can be described as follows: The data to be aggregated is divided into partitions and it is the attribute that is to be aggregated. Then there is a function that operates on a set of the same results as and can be represented as: ${\ displaystyle F}$ ${\ displaystyle J}$ ${\ displaystyle X_ {i}}$ ${\ displaystyle G}$ ${\ displaystyle F}$ ${\ displaystyle F}$

${\ displaystyle F (\ {X_ {i} \}) = G (\ {F (\ {X_ {i, j} \}) | j = 1 \ dots J) \})}$ .

This is the entirety of all attribute values in the database and are those attribute values that are in the partition . ${\ displaystyle \ {X_ {i} \}}$ ${\ displaystyle \ {X_ {i, j} \}}$ ${\ displaystyle j}$

So you first calculate the cardinality of the individual partitions and then add up the individual results.

Algebraic functions : mean (AVG), truncated mean (truncated AVG), standard deviation, top N ...

Algebraically, functions are defined in the same way as before, but you have greater freedom here with regard to the function that works on the individual partitions. If you previously had to use the same function that is used for the entire database, you can select a different function here . An algebraic aggregation function thus has the following representation: ${\ displaystyle F}$ ${\ displaystyle H}$ ${\ displaystyle F}$

${\ displaystyle F (\ {X_ {i} \}) = G (\ {H (\ {X_ {i, j} \}) | j = 1 \ dots J) \})}$ .

Here is a function that operates on a set of result tuples of the function . and are defined as above. ${\ displaystyle G}$ ${\ displaystyle H}$ ${\ displaystyle \ {X_ {i} \}}$ ${\ displaystyle \ {X_ {i, j} \}}$

The expressions and denote the first or second element of a 2-tuple. ${\ displaystyle x.first}$ ${\ displaystyle x.second}$

Holistic functions : median , rank , percentile, most common value ...

Holistic functions are aggregation functions for which neither of the previous two definitions apply.

Aggregatability

Requirements for the dimensions

No overlap in the assignment of classification nodes. Example: the subject area of a student is not free of overlap, because it cannot be ruled out that a student is registered for several subject areas.
Completeness of the decomposition per classification level. Examples: The postcode of a resident's main place of residence is incomplete because there are also residents who do not have a permanent address. They require the expansion of the dimension to include a zero value or an 'other' value.
Type compatibility of fact and aggregate function. Example: For average values, a repeated averaging at a higher aggregation level is usually not possible.

Data types of facts are

fully aggregable if all the facts that quantitatively describe a flow of a good in a certain time unit. All aggregate functions can be used sensibly here. Examples: sales, costs, electricity consumption, goods issue.
Can be partially aggregated if all the facts for which some aggregate functions cannot be used or can only be used to a limited extent. These are 1. all facts of the type piece or stock. They describe an existing quantity at a specific point in time. Example: number of inhabitants, flood level, account balance. These facts cannot be summed up over the time dimension. They can already be totaled over other dimensions that receive the reference at the same point in time. 2. There are other reasons that restrict the use of some aggregate functions. Examples: school grades, dates, averages. Here, summation is generally not useful or not defined at all. Most of the other aggregate functions can already be used.
cannot be aggregated if all facts for which no aggregate functions at all can be used meaningfully. Example: departments, postcodes, account numbers, surnames. If no order (sortability) is defined for the data type, then even the functions maximum and minimum cannot be used.

Dimension hierarchy and aggregation paths

Dimensions can often be broken down as hierarchies.

If each node is functionally dependent on its predecessors, then a linear aggregation path can result. Example:

Tag --- Monat --- Quartal --- Jahr

It can happen that different aggregation paths are possible in one dimension. Example:

Tag --- Woche --- Jahr
Tag --- Monat --- Jahr

There are two different aggregation paths to lead from day to year. Both give the same result. However, you cannot get from the week to the month and vice versa.

This example also shows that different aggregation paths often involve the risk that the results do not exactly match. If the turn of the year does not coincide with the change of a week, there may be deviations in the two aggregation paths. A clear handling must be found for the treatment of the last week of the year, otherwise the aggregation path day --- week --- year is wrong.

literature

Lenz and Shoshani: Summarizability in OLAP and Statistical Databases . SSDBM, 1997.