Query optimizer


The query optimizer is the part of a database management system (DBMS) that attempts to compute an optimal evaluation plan for a database query.

Not every DBMS has a query optimizer. The DBMS IMS, for example, does not need one: it is a hierarchical DBMS, data access is not formulated in SQL, and the evaluation plan must be specified explicitly with each query.

Query optimizers are typically used with relational DBMS, because SQL only specifies which data should be retrieved, not how to retrieve it.

There are two types of query optimizers: rule-based and cost-based query optimizers.

Query Optimizer Tasks

The task of the query optimizer is to minimize the response time of data access. Especially for complex SQL queries, several access paths are often possible, and their response times can differ considerably; the optimizer therefore has to select an access path with the shortest possible response time.

In a first step, the possible access paths are determined. In a second step, the alternatives are evaluated, and finally one access path is selected.

To determine the possible access paths, the optimizer also examines whether the SQL query can be reformulated so that it returns an identical result set, for example by transforming a subselect into a join or an OR predicate into a union.
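
As an illustration, the following sketch shows two such equivalent formulations as SQL strings. The tables orders and customers and their columns are purely hypothetical; assuming customers.id is unique, both statements return the same result set, so the optimizer is free to consider access paths for either form.

    # Two hypothetical, equivalent formulations of the same query
    # (table and column names are assumptions chosen for illustration).
    subselect_form = """
        SELECT o.id, o.amount
        FROM   orders o
        WHERE  o.customer_id IN (SELECT c.id
                                 FROM   customers c
                                 WHERE  c.country = 'DE')
    """

    join_form = """
        SELECT o.id, o.amount
        FROM   orders o
        JOIN   customers c ON c.id = o.customer_id
        WHERE  c.country = 'DE'
    """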

If two or more tables are joined, there are several ways to do this.

Example with 3 tables A, B and C:

(A * B) * C
(A * C) * B
(B * C) * A
(B * A) * C
(C * A) * B
(C * B) * A

With 3 tables there are 6 variants, as listed above. If 10 tables are joined together, there are about 3.6 million theoretically possible variants (10! = 3,628,800).

Most DBMS provide several join algorithms that can be used to perform a join. If, for example, there are 4 join algorithms to choose from, then for the variant (A * B) * C alone there are exactly 4 × 4 = 16 ways to execute the two joins. This results in 6 × 16 = 96 different combined variants for joining the three tables with one another.
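
This counting can be sketched as follows, under the simplifying assumption that every ordering of the tables is combined with an independent algorithm choice for each of the joins:

    from math import factorial

    def variant_count(num_tables: int, num_join_algorithms: int) -> int:
        # Orderings of the tables (num_tables!) combined with an independent
        # algorithm choice for each of the (num_tables - 1) joins.
        orderings = factorial(num_tables)
        algorithm_choices = num_join_algorithms ** (num_tables - 1)
        return orderings * algorithm_choices

    print(variant_count(3, 4))    # 6 orderings * 16 algorithm choices = 96
    print(variant_count(10, 1))   # 10! = 3,628,800 orderings alone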

If there are indexes for the tables, there are further options for structuring data access.

Some DBMS can execute a query in parallel on several processors if suitable hardware is available.

The number of possible access variants can quickly become very large, and the individual variants usually differ considerably in their execution time:

  • Depending on the relative sizes of the tables involved in a join, certain join algorithms deliver results quickly, while others are very time-consuming to execute.
  • The use of an index is usually only worthwhile if the predicate concerned (e.g. ID = 1234) is highly selective; otherwise reading the entire table reaches the goal more quickly (see the sketch after this list).
  • Parallel processing requires additional coordination effort and is therefore only beneficial in certain cases.
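
The rule of thumb for index use from the list above can be sketched like this; the 5 % threshold is an arbitrary illustrative figure, not a value used by any particular DBMS:

    def prefer_index_scan(matching_rows: int, total_rows: int,
                          selectivity_threshold: float = 0.05) -> bool:
        # An index access only pays off when the predicate selects a small
        # fraction of the table; otherwise a sequential full table scan is
        # usually faster. The 5 % threshold is purely illustrative.
        return matching_rows / total_rows < selectivity_threshold

    print(prefer_index_scan(1, 1_000_000))        # True  -> use the index
    print(prefer_index_scan(600_000, 1_000_000))  # False -> full table scan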

It is the job of the query optimizer to determine which of the various access variants has the best execution time.

Rule-based query optimizer

A rule-based query optimizer uses only the existing data structures (tables, indexes) to determine the evaluation plan, applying a fixed set of rules. One such rule might be: if an existing index can be used, then use it; only if no suitable index is available, perform a full table scan.
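
A minimal sketch of such a fixed rule; all table, index, and function names are illustrative, and note that no information about the stored data is consulted:

    from dataclasses import dataclass, field

    @dataclass
    class Index:
        name: str
        leading_column: str

    @dataclass
    class Table:
        name: str
        indexes: list = field(default_factory=list)

    def choose_access_path(table: Table, predicate_columns: set) -> tuple:
        # Fixed rule: prefer any index whose leading column appears in the
        # predicate; otherwise fall back to a full table scan.
        for index in table.indexes:
            if index.leading_column in predicate_columns:
                return ("index scan", index.name)
        return ("full table scan", None)

    employees = Table("employees", [Index("ix_emp_id", "id")])
    print(choose_access_path(employees, {"id"}))    # ('index scan', 'ix_emp_id')
    print(choose_access_path(employees, {"name"}))  # ('full table scan', None)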

Rule-based query optimizers do not use information about the data stored in the tables to make their decisions. In particular, the number of records in the tables involved, values that occur particularly frequently, and the sort order of the records are not taken into account.

Up to version 7, the DBMS Oracle had only a rule-based query optimizer. One of its rules stated that the order in which the individual tables are accessed during a join depends on the order in which the tables are specified in the SQL statement. This had the advantage that the programmer could influence which evaluation plan was used by the way the SQL statement was formulated.

Rule-based query optimizers have the advantage that the evaluation plans they determine are static. If a program was developed and tested in a test environment with a small amount of data, then you can be sure that the same evaluation plans will be used in a larger, production environment. Periodic analysis of the data to determine statistics is not required.

The decisive disadvantage of a rule-based query optimizer is that its rigid rules are not based on the actual volume of data. It is like choosing a route through the city center using only a city map, without knowledge of current traffic jams and roadworks. A cost-based query optimizer usually produces better results than a rule-based one.

Cost-based query optimizer

The cost-based optimizer uses statistical evaluations of the stored data for its decisions. These statistics need to be updated from time to time. It is advisable to refresh the statistics after every major change to the data volume.

Depending on the DBMS, different statistical values are determined (a sketch of such a statistics record follows this list):

  • the number of records and the space used per table and index
  • the number of different values per column (this information is particularly useful for columns that also appear in an index)
  • the largest and the smallest value per column
  • the sort status of a table compared to the sort order of an index (clustering order)
  • Some DBMS (e.g. Oracle) can determine histograms for each column. Other DBMS (such as DB2) can determine frequency distributions of the most common values for each table column.
  • key figures about the existing hardware, such as the access speed of the hard disks, the size of the main memory, and the number of available processors
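
The following sketch shows how such statistics might be grouped per table and per column; the field names are assumptions chosen for illustration and do not correspond to the catalog of any real DBMS:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ColumnStatistics:
        distinct_values: int                    # number of different values
        lowest_value: Optional[str] = None      # smallest value in the column
        highest_value: Optional[str] = None     # largest value in the column
        histogram: Optional[list] = None        # value distribution, if supported

    @dataclass
    class TableStatistics:
        row_count: int                          # number of records
        pages_used: int                         # space used by table and indexes
        clustering_factor: float                # sort status relative to an index
        columns: dict = field(default_factory=dict)  # column name -> ColumnStatistics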

In a first step, the cost-based optimizer determines all theoretically possible access plans. In a second step, the cost that executing each access plan would cause is estimated. The term "cost" primarily refers to computing time; the use of other system resources (e.g. memory requirements) can also be included.

An exact calculation of the costs is usually too time-consuming, or simply not possible, in practice. Heuristics are therefore used to estimate the costs. The access plan that receives the best rating is ultimately executed to retrieve the requested data.
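
The core of this step can be sketched as follows; the cost function is a deliberately crude stand-in for the heuristics of a real optimizer, and all figures are invented:

    def estimate_cost(plan: dict) -> float:
        # Toy heuristic: estimated rows to read times a per-row cost factor
        # (an index access touches fewer rows, but each access is dearer).
        return plan["rows_to_read"] * plan["cost_per_row"]

    def choose_cheapest_plan(candidate_plans: list) -> dict:
        # The plan with the lowest estimated cost is the one that gets executed.
        return min(candidate_plans, key=estimate_cost)

    plans = [
        {"name": "index scan",      "rows_to_read": 1_200,     "cost_per_row": 4.0},
        {"name": "full table scan", "rows_to_read": 1_000_000, "cost_per_row": 0.01},
    ]
    print(choose_cheapest_plan(plans)["name"])   # 'index scan' (4,800 vs. 10,000)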

The quality of a cost-based optimizer depends on how sophisticated its calculation models are. Since the SQL syntax is very extensive, many special forms and exceptions must be taken into account. There is a risk that the execution plan with the lowest calculated cost is not actually optimal, or is even significantly worse: in such cases the actually cheapest execution plan received a worse rating due to the heuristics used and was therefore discarded. The goal is not necessarily to find the absolute best of the many possible execution plans. If the second-best execution plan is only slightly worse in real cost, it does not matter that it is chosen. However, if the heuristic misjudges reality so badly that a much worse execution plan is actually used, then the optimizer has made a poor choice.

If a cost-based optimizer is used, it is important to collect statistics on the stored data periodically. If the data volume changes and the statistics are not updated afterwards, then the queries are optimized based on the outdated statistics. This increases the risk that the mathematically optimal access path is actually not the optimal one.

Another problem is the general incompleteness of the statistical data. Only certain parameters can be determined, while in reality many more factors influence the cost of data access. There are often interactions between the individual predicates that are not captured by the statistical model.

Determining histograms or frequency distributions, if the DBMS offers this possibility, is very time-consuming, and additional storage space is required to store this data.

It is not only important that the execution of a query is optimized; the determination of the optimal execution plan itself must not take too long either. So that the search does not take too long for complex accesses, many DBMS examine only a part of the theoretically possible access plans, for example a maximum of 2,000. All other access plans are not analyzed at all. For a complex access with, say, 200,000 theoretically possible access plans, this means that only 1% of all access plans are examined more closely. The probability of finding a good and fast access plan with this approach is rather low. In fact, it often happens that for SQL statements in which many tables (e.g. more than 10) are joined, execution plans with an unsatisfactory runtime are used, even though other execution plans exist that would deliver the desired data a hundred times faster.
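
The pruning described above can be sketched as follows, reusing a cost-estimation function like the one in the earlier example; the cutoff of 2,000 plans is the example figure mentioned above:

    from itertools import islice

    def choose_plan_with_cutoff(plan_generator, estimate_cost, max_plans: int = 2_000):
        # Only the first max_plans candidates are costed at all; every plan
        # beyond the cutoff is never examined, so a much faster plan that
        # happens to lie outside the examined fraction is simply missed.
        examined = list(islice(plan_generator, max_plans))
        return min(examined, key=estimate_cost)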

Another disadvantage of the cost-based optimizer is the unexpected change of an execution plan. Since the execution plan is not fixed once and for all but is determined anew each time, after every update of the statistics, and especially after every release change of the DBMS software, there is a chance that a less favorable execution plan will be chosen: one that is rated better by the cost model but is actually worse than the one used before. Such a change of execution plan can increase the execution time by a factor of 10 or 100. This can be the reason why a program that has run for months with good performance suddenly performs significantly worse.
