Star scheme

from Wikipedia, the free encyclopedia
Star schema: The fact table has a key composed of the primary keys of the individual dimension tables as the primary key

The star schema is a special form of a data model , the aim of which is not normalization , but rather an optimization for efficient read operations. The main fields of application are data warehouses and OLAP applications.

The term star schema comes from the fact that the tables are arranged in a star shape: in the center is a fact table around which several dimension tables are grouped.

A star schema is usually denormalized . Possible anomalies and increased memory requirements are accepted for performance reasons. An improvement is possible through the snowflake scheme related to the star scheme . There, however, multi-level dimension tables must be linked using join queries .

definition

The so-called star schema has established itself as the logical database schema for data warehouse applications. This schema consists of a fact table and several dimension tables, which are arranged in a star-shaped manner around a fact table in a query-friendly manner and which in this schema relate to exactly one fact table. The name of this schema is derived from the star-shaped arrangement of the dimension tables around the fact table at the center. The mentioned fact table has information-carrying attributes, such as B. sales, periods, costs, etc., and as the primary key a composite key from the primary keys of the dimension tables involved.

Each dimension table has a 1: n relationship with a fact table. The one-to-many relationship is mediated by a key from the dimension table and a foreign key from the fact table. The fact table implicitly integrates many-to-many relationships in a single table and therefore contains a lot of redundancy. The key of the fact table consists of the primary key of the respective dimension table as a foreign key .

The star scheme allows the selection, summary and navigation of the measured values ​​or facts. The dimension tables are usually not normalized and are therefore denormalized: There are functional dependencies between non-key attributes , so that the 3rd normal form (3NF) is violated. However, this violation is accepted with this scheme, because the data structure enables a better processing speed at the expense of data integrity and storage space.

Fact and dimension tables

The data to be managed is called facts ; they are typically continuously stored in the fact table . Other names for the facts are metrics, metrics, or measures. Fact tables can become very large, which forces a data warehouse to gradually condense (aggregate) the data and finally delete or outsource it ( archiving ) after a holding period . The tables contain key figures or results that can be derived from ongoing business and reflect economic performance, such as: B. profitability, costs, performance / revenue, expenses, income, expenses, income, etc. However, only when these figures are put into context do they make sense. An example is that sales in a certain area are compared with specified products in a defined period of time, which reflects dimensions in which the economic performance is evaluated and analyzed.

In contrast, the dimension table contains the "descriptive" data. The fact table contains foreign keys to the dimension entries that define their meaning. Typically, the total number of foreign keys on the dimension tables simultaneously represents the primary key in the fact table. This implies that there can only be one entry for a combination of dimensions.
Dimension tables are comparatively static and usually considerably smaller than fact tables. The term “dimension” comes from the fact that each dimension table represents one dimension of a multidimensional
OLAP cube .

Due to the existence of functional dependencies between non-key attributes, the third normal form is deliberately violated in the dimension tables . In order to satisfy the 3NF, the relevant dimension table would have to be broken down into individual hierarchical tables, but for reasons of performance one refrains from normalizing the dimension tables with the star scheme and accepts the redundancy that occurs.

The advantage of separating facts and dimensions is that the facts can be analyzed generically and independently according to each dimension. An OLAP application does not need “knowledge” about the meaning of a dimension. The interpretation is left to the user alone.

However, the size of dimension tables should be considered. Fact tables can often contain more than 10 million records in a star schema. Dimension tables are smaller, but individual dimensions can be large. To reduce such large data sets and the associated shortened access times, individual, very large dimension tables can be converted into a snowflake scheme by normalization .

Slowly Changing Dimensions

One problem with the star schema is that data in the dimension tables is related to data in the fact tables over a long period of time. However, changes to the dimensional data may become necessary over time. However, these changes are usually not allowed to affect data before the change. For example, if the seller changes for a product group, the respective entry in the dimension table must not be simply overwritten. Instead, a new entry must be generated, otherwise the sales figures of the previous seller would no longer be ascertainable. Slowly changing dimensions are a concept to avoid such conflicts . This concept combines methods in data warehousing in order to record changes in dimension tables and, if necessary, to document them historically.

Properties star scheme

  • Dimension tables
    • Primary key to identify the dimension values
    • Mapping of the dimension hierarchy through attributes
    • Denormalized, that is, non- normalized dimension tables
  • Fact table
    • Foreign key to the dimension tables, i.e. the lowest level of each dimension is included as a key in the fact table
    • Foreign keys on the dimensions form composite primary keys for the facts

Pros & cons of star scheme

advantages

  • Fast processing of queries: Analytical queries are typically at a higher aggregation level, and by not normalizing the dimension tables, joins are saved. Furthermore, a special join ( star join ) can be optimized well.
  • Volume of data: Dimension tables are very small compared to fact tables. The additional data volume due to a denormalization of the dimension table does not have to be taken into account.
  • Change anomalies can be easily controlled as there are hardly any changes to classifications.
  • simple, intuitive data model: the star schema has significantly fewer relations than a converging snowflake schema and the JOIN depth is not greater than 1
  • Comprehensibility and traceability: the reporting system can be modernized using the star scheme. For this purpose, data collections for trend recognition and data mining can be created.

disadvantage

  • Deteriorated response time behavior for frequent queries of very large dimension tables (browsing functionality)
  • Redundancy within a dimension table through the multiple storage of identical values ​​or facts.
  • Aggregation formation is difficult

Star scheme vs. Snowflake scheme (normalized)

Star scheme Snowflake scheme
target
  • User-friendly query (aggregate access; simple, intuitive data model)
  • Redundancy minimization through normalization
  • Efficient transaction processing
Result
  • Simple, local and standardized data model
  • a fact table and a few dimension tables
  • Complex and specific scheme
  • Many entities and relationships in large data models

Requirements diagram of the star scheme

1. Collect operational requirements:
Building a star schema begins with the question

  • Which facts are of interest according to which criteria?
  • Available data, required evaluations & table contents

2. Create requirements diagram:
Defined specifications can be subsumed in a requirements diagram

  • Required indicators:
    • Attributes that evaluate the outcome of a business unit
    • Question: How well?
  • Dimensions:
    • Attributes along which the indicators are measured
    • Ask, what? When? Where?
  • Categories:
    • Value ranges of a dimension
    • Questions: How exactly?

Star join

In typical queries, the star scheme leads to so-called star joins, which look like this:

SELECT Fakt- oder Dimensionsattribut
FROM Fakt- oder Dimensionstabellen
WHERE Bedingung
GROUP BY Fakt- oder Dimensionsattribut
ORDER BY Fakt- oder Dimensionsattribut

example

Example star schema for a query query

For example, the query selects several relevant measures from the fact table, links the fact lines with one or more dimensions using the substitute keys, fills the business columns of the dimension tables with filter predicates, groups by one or more business columns and finally aggregates the measures retrieved from the fact table via one certain period. The following example shows the total sales of a product over a defined period of time.

select ProductAlternateKey, CalendarYear, sum(SalesAmount)
from FactInternetSales Fact

join DimProduct
on DimProduct.ProductKey = Fact.ProductKey

join DimTime
on Fact.OrderDateKey = TimeKey

where ProductAlternateKey like 'XYZ%'
      and CalendarYear between 2008 and 2009

group by ProductAlternateKey,CalendarYear

The number of joins used (in this example there are 2 joins) are independent of the length of the aggregation paths in the star scheme, in contrast to the snowflake scheme.

See also

literature

  • A. Kemper, A. Eickler: Database systems. An introduction . 6th edition. Oldenbourg Wissenschaftsverlag, Munich 2006, ISBN 3-486-57690-9 .
  • R. Kimball, M. Ross: The Data Warehouse Toolkit. The Complete Guide to Dimensional Modeling . 2nd Edition. John Wiley & Sons, New York 2002, ISBN 0-471-20024-7 (English).
  • P. Rob, C. Coronel, K. Crockett: Database systems: design, implementation & management . Cengage Learning, London 2008, ISBN 1-84480-732-0 (English).
  • L. Langit: Foundations of SQL Server 2005 Business Intelligence . 1st edition. Apress, New York 2007, ISBN 978-1-59059-834-4 (English).

Web links