Selectivity (computer science)

Selectivity is a measure that is used in computer science for database queries on database tables in relational database systems ; it determines the proportion of data records that are not filtered out of the result set by a selection condition in a query, relative to the total number of data records in the database on which the query is based. In the most widely used query language for relational databases, SQL , selection conditions are specified in the WHERE clause of the query.

definition

Selectivity is often referred to in the literature with (lower case sigma ) and can therefore easily be confused with the selection operator in relational algebra, which is why the abbreviation sel is also used. ${\ displaystyle \ sigma}$

In formal terms, selectivity describes the proportion of qualifying tuples (data records) in a database table that meet the predicate (a comparison expression) of the selection of this query. Be now ${\ displaystyle P}$ ${\ displaystyle \ sigma}$

${\ displaystyle | \ sigma _ {P} (A) |}$ the number of data records that meet the selection predicate P of a database query and
${\ displaystyle | A |}$ the number of data records in the database table on which this query is based,

then the following formula applies to the calculation of the selectivity sel:

 $sel_{P}={\dfrac {|\sigma _{P}(A)|}{|A|}}$

In a join two database tables B and C, in the calculation of selectivity, the proportion of tuples that will qualify for the result set relative to the total number of lines of cross product represented by B and C:

 $sel_{BC}={\dfrac {|B\bowtie C|}{|B\times C|}}={\dfrac {|B\bowtie C|}{|B|\cdot |C|}}$

Since a selection predicate always restricts the amount of results compared to the amount of data queried, sel is a value between 0 and 1, i.e. it can be used as a percentage of those data records that are not filtered out by a condition in a query, relative to the total number of data records in the data set that the query is to be interpreted.

Examples

A relational database has the following table CUSTOMERS with 1000 entries:

ID	SURNAME
1	Müller
2	Schmitt
3	Schulz
...	...
999	cutter
1000	Meier

The following query is now made with SQL :

 select *
 from   kunden

In this case, the selectivity for the above query is 1, since no selection condition is specified in the query and no data records are filtered out by the underlying database table when generating the result set, so that the cardinality of the result set and that of the queried database table are the same :

${\ displaystyle sel = {\ dfrac {| \ sigma _ {P} (CUSTOMER) |} {| CUSTOMER |}} = {\ dfrac {1000} {1000}} = 1}$

Now the same query is supplemented with a WHERE clause, which represents a selection condition that restricts the result set by filtering the data records in the queried table:

 select *
 from   kunden
 where  id < 221

220 records of the table CUSTOMERS pass the filter of this query; their selectivity is thus 0.22 (220 divided by 1000).

Finally, when it comes to query

  select *
  from   kunden
  where  id > 1000

the result set is empty, that is, all data records of the underlying data set are filtered out of the result set using the selection predicate, so that the selectivity is 0 (0 divided by 1000).

The absolute number of records in the result set generated by a query generally does not play a role in determining selectivity, as the following query shows:

  select count(*)
  from   kunden
  where  id < 401

Due to the aggregation by the SQL function count , the query generates only one data record as a result, but the selectivity of the query is 0.4 (400 data records pass the filter of the selection predicate and 400 divided by 1000 is 0.4).

  select *
  from   kunden
  where  id = 300

The predicate ID = 300 has a selectivity of 1/1000 = 0.001, since only one record of the 1000 existing records in the table is determined.

  select *
  from   kunden
  where  name = 'Maier'

It must be taken into account here that several customers can also bear the Maier name . If there are 5 customers with this name, then the selectivity of the query is 5/1000 = 0.005.

Estimation of the selectivity in DBMS

Many database management systems (DBMS) have a cost-based query optimizer that tries to find the optimal access strategy for a database query , among other things by estimating the selectivity of the individual predicates of the query. The selectivity is not exactly determined by the query optimizer because it would be too time-consuming, but only estimated on the basis of statistical data. The statistics collected by a DBMS include, for example, information about the total number of rows in a table, the number of different values in the table columns ( cardinality ), the number of NULL values in a column, etc.

With a predicate of the form column = value - provided that the cardinality of the relevant column is known in the statistical data - the selectivity of this predicate can also be estimated, where c corresponds to the cardinality of the column. The estimate of the selectivity is correct in this case if there is an even distribution of the values in the relevant column. ${\ displaystyle sel = {\ dfrac {1} {c}}}$

If several individual predicates are linked to one another by logical operators or if a single predicate is negated, then the selectivity of the entire condition can be calculated from the estimate of the selectivities of the individual predicates; let the selectivity of the predicate and the selectivity of the predicate be : ${\ displaystyle sel_ {1}}$ ${\ displaystyle P_ {1}}$ ${\ displaystyle sel_ {2}}$ ${\ displaystyle P_ {2}}$

When the two predicates are linked by means of an AND link ( conjunction ), the selectivity of the link can be estimated by multiplying the individual selectivities:

${\ displaystyle sel (P_ {1} \; AND \; P_ {2}) = sel_ {1} * sel_ {2}}$

In the case of an OR link ( disjunction ), the selectivity of the link can be estimated by adding the individual selectivities. However, the intersection has to be subtracted from this:

${\ displaystyle sel (P_ {1} \; OR \; P_ {2}) = sel_ {1} + sel_ {2} -sel_ {1} * sel_ {2}}$

If a predicate is negated with NOT, the selectivity can be estimated by subtracting the selectivity from 1:

${\ displaystyle sel (NOT \; P_ {1}) = 1-sel_ {1}}$

Strong vs. weak selectivity

If the selectivity is high, that is, its value is close to or equal to 1, then one speaks of weak selectivity ; if the value is low, i.e. close to or equal to 0, then one speaks of strong selectivity . The line between strong and weak selectivity is blurred.

In practice it is necessary for software developers , database administrators and the query optimizer of a database system to be able to categorize the selectivity of a query into strong and weak selectivity. To ensure a good performance to achieve in a query, it is important that the number of data blocks , which from the hard drive to keep reading, as low as possible. For queries with strong selectivity, other options for accessing the data on the hard disk are suitable than for those with weak selectivity:

In the case of SQL queries with poor selectivity, a scan of the entire table ( full table scan ) is most efficient: since a selection predicate only filters out a few data records from the query table, a large number of data records qualify for the result set; This means that a very large percentage of the data blocks that store the data records of the queried table must be read from the hard disk into the main memory anyway ; the additional reading of data blocks of a database index would mean unnecessary overhead in this case . In special cases, when a database index contains all the data that is queried, a full index scan can be carried out instead of a full table scan .

In the case of SQL queries with strong selectivity, it is best to access the data using a suitable database index. If, for example, only one data record of the queried table fulfills the selection predicate of the query, then it would not be efficient to read all blocks with all data records of this table into the main memory by means of a full table scan. Instead, only very few blocks have to be read during an index access, that is, a few index blocks and only the block from the queried table that stores the data record sought.

The query optimizer therefore tries to estimate the selectivity of each query and to select the optimal access method. Developers and database administrators, in turn, must use the selectivity and frequency of queries to decide which indexes need to be created, which the query optimizer can then use when executing the query.

Individual evidence

↑ ^a ^b ^c Alfons Kemper, André Eickler: Database systems. Oldenbourg Verlag 2004, ISBN 3-486-25706-4 , page 254

[ke-1] Alfons Kemper, André Eickler: Database systems. Oldenbourg Verlag 2004, ISBN 3-486-25706-4 , page 254