Relational algebra
In theory, the database is meant by relational algebra or relational algebra a set of operations for manipulating relations . It enables relations to be filtered, linked, aggregated or otherwise modified in order to formulate queries to a database.
Usually, queries and programs are not formulated directly in a relational algebra, but in a declarative language such as SQL , XQuery , SPARQL or Datalog . These programs and queries are usually first translated into an (generally extended) relational algebra. The resulting operator tree is then transformed with the help of relational laws in order to enable the most efficient possible evaluation of the requests.
History and meaning
In 1941 Alfred Tarski presented ideas of a relational algebra for the first time in his paper “On the calculus of relations”. In particular, he introduced the relational operations “union”, “average” and “join”, although he limited himself to two-digit relations.
At the end of his article, he mentions that his aim was not so much to present new results as to arouse interest in a particular logical theory that had not been considered before:
"The aim of this paper has been, not so much to present new results, as to awaken interest in a certain neglected logical theory, and to formulate some new problems concerning this theory."
In the late 1960s, Edgar F. Codd developed the foundations of today's relational algebra at the IBM Research Laboratory in San Jose. It is not known whether the work of Tarski inspired him. At the beginning of his paper from 1969 he asserted that the relational model was superior in many aspects to the graph model and the network model , which at the time were "en vogue" (French: "in fashion").
“The first part of this paper is concerned with an explanation of a relational view of data. This view (or model) of data appears to be superior in several respects to the graph or network model [1, 2] presently in vogue. "
He is referring to the fact that the time it takes to answer queries depends very much on the structure of the respective network. If data is to be retrieved that is adjacent in the network, the user only has to wait very briefly for a response. However, if the desired data is widely scattered in the network, the waiting time can be unreasonably long. When creating a network model, the database developers had to consider all conceivable requests from the outset, as subsequent changes to the data model were very difficult to implement. To solve this problem, Codd had the idea of no longer storing the data in a network, but in relations (tables) that can be linked to one another in different ways depending on the request:
"Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation)."
He dared the following prophetic prognosis that databases would in future contain many relationships in stored form:
"The large, integrated data banks of the future will contain many relations of various degrees in stored form."
Late 1970, d. H. In the same year in which Codd's work became public, Rudolf Bayer and Ed McCreight presented the B-tree , a data structure that enables relations with a large number of tuples to be stored on a hard drive in such a way that tuples can be read and the modification of tuples can be done highly efficiently.
In the 1970s, the success story of the relational databases, including the associated SQL language , began on the basis of these two works . At Codd's place of work, d. H. at the IBM Research Laboratory in San Jose, the SEQUEL language and the experimental System R database system were developed. SEQUEL was later renamed to SQL. At the beginning of the 1980s there were the first commercial relational database systems for the query language SQL: Db2 from IBM and Oracle from Relational Software Inc. Today, the world of databases is indispensable for SQL (see, for example, category: Relational database management system ). But also various other languages, such as initially QBE or QUEL and later Datalog , XQuery or SPARQL , are ultimately based on Codds' idea of using relations to store data.
General
Relational algebra defines operations that can be applied to a set of relations . This means that relations can be filtered, linked or aggregated, for example. The results of all operations are also relations. For this reason the relational algebra is called closed .
Relational algebra is important as a theoretical basis for query languages in relational databases . Here the operations of relational algebra are implemented in so-called database operators. If every operation of relational algebra in the query language can be implemented by (at least) one expression, it is said to be relationally complete; the expression can link several database operators. If every operation can also be implemented by (exactly) one database operator, it is called strictly relationally complete; so only one database operator can ever be contained in one and the same converting expression. If the condition of strict relational completeness also applies in the other direction, i.e. there is a corresponding operation of relational algebra for each database operator, then the query language is called equivalent to relational algebra, in short: relational equivalent.
Since there are (several) minimal sets of operations for relational algebra from which all further operations can be put together, it is sufficient for (strictly) relational completeness to compare the query language with these "basic operations". This follows from the fact that relational algebra is trivially self-equivalent and is already completely described (with regard to operations) by a minimal system of operations. A common minimal system of operations consists of the six operations: projection, selection, cross product, union, difference and renaming.
Because of its theoretical clarity, relational algebra is often used as a yardstick for evaluating the power or expressiveness of query languages. a. by means of the comparative terms just described. However, one should not infer from the closer proximity of a query language to relational algebra to its greater power. Query languages that are relationally complete or even strictly relationally complete often have a significantly larger range of functions than would be possible by implementing the relations-algebra operations alone. For example, in relational algebra there is no possibility of forming the transitive envelope of a relation, which is interesting in the case of back-referencing relations. The strict relational completeness of a query language tends to indicate a minimum functionality, and the relational equivalence rather a maximum functionality, while the non-strict relational completeness provides the least concrete information about the query language.
In contrast to calculi , relational algebra is safe ; that is, it gives a finite result in finite time. Relational algebra is also an example of a procedural language ; in contrast to calculi, which are mostly formalized as descriptive languages .
Operations
Set operations
In order to be able to perform set operations on the relations R and S, both must be compatible with each other. The type compatibility of two relations is given if
- R and S have the same degree (number of attribute elements)
- the value range of the attributes of R and S is identical
The type compatibility is also called union compatibility.
Union
With the union R ∪ S, all tuples of relation R are combined with all tuples of relation S to a single relation. The prerequisite for this is that R and S have the same relationship scheme. That is, they have the same attributes and attribute types. Duplicates will be deleted upon merging.
definition
example
|
|
|
requirement
- Unification compatibility of R and S
Intersection
The result of the average operation R ∩ S are all the tuples that can be found in both R and S. The average amount can also by the amount of difference expressed: R ∩ S = R \ (R \ S)
definition
example
|
|
|
requirement
- Unification compatibility of R and S
difference
With the operation R \ S or R - S, all tuples that are also present in the second relation S are removed from the first relation R. The difference (as well as the symmetrical difference) is not a monotonic operation, therefore relational algebra is not monotonic either in comparison to other declarative query languages (e.g. Datalog ).
definition
example
|
|
|
requirement
- Unification compatibility of R and S
Symmetrical difference
The symmetrical difference R △ S is the set of all tuples that are contained in either R or S, but not in both at the same time.
The operation can be derived from the basic operations:
example
|
|
|
requirement
- Unification compatibility of R and S
Cartesian product (cross product)
The Cartesian product R × S is an operation that is similar to the Cartesian product from set theory .
The result of the Cartesian product is the set of all combinations of the tuples from R and S, i.e. that is, each row of one table is combined with each row of the other table. If all characteristics (columns) are different, the result table includes the sum of the characteristics of the source tables. Features of the same name in the two tables are referenced by placing the table name in front. The number of tuples (lines) in the result table is the result of multiplying the number of lines in the output table.
definition
Any two relations and are given. The Cartesian product is defined by
example
|
|
|
projection
The projection corresponds to the projection mapping from set theory and can also be called attribute restriction. It extracts individual attributes from the original set of attributes and is therefore to be understood as a kind of selection at column level, i.e. the projection hides columns. If β is the list of attributes, write π β (R) or in the linear notation R [β]. β is also called the projection list . Duplicates in the result relation are eliminated.
definition
Let R be a relation over {A 1 ,…, A k } and β ⊆ {A 1 ,…, A k }.
The t β : = (β), that is, the tuples only receive the attributes from the attribute list β.
example
|
|
|
requirement
- The specified columns must be in R.
selection
During the selection, a comparison expression ( predicate ) can be used to determine which tuples are to be included in the result set. So tuples ("lines") are hidden. One writes or in the linear notation R [expression]. Expression is then called a selection condition .
definition
Let R be a relation.
Expression denotes a formula . This can consist of:
- Constant selections Attribute θ Constant , where θ is a common (suitable) comparison operator.
- Attribute selections Attribute θ attribute
- A combination of a formula with logical predicates ∧, ∨, ¬ (brackets as usual).
example
|
|
|
requirement
- Each element of the specified column must be comparable with the comparison value via the conditional operator.
Join
A join designates the two consecutive operations Cartesian product and selection . The selection condition is usually a comparison of attributes A θ B, where θ is a suitable comparison operator. The general network is therefore also referred to as the θ network (theta network). The equi-join is a special case of the general join (see below).
definition
For two relations and is the result of the general combination with a formula expression as selection condition
The derivation is:
example
|
|
|
|
Join corruption
In the event of a join corruption, the table is first split, except for one column . The 2 tables are then joined over the common column .
Be and , then:
example
|
|
Equi-join
With the equi-join, the Cartesian product is formed first. The selection then takes place with the condition that the content of certain columns must be identical. The equi-join is a general combination with a formula of the form A = B .
definition
The equi-join is used for the relations R, S and associated attributes A (is an attribute of R) and B (is an attribute of S)
example
Here:
|
|
|
|
Natural join
The natural join consists of the equi-join and an additional hiding of the duplicated columns (projection). The join takes place via the attributes (columns) which have the same name in both relations. If there are no common attributes, the result of the natural compound is the Cartesian product. The natural compound is commutative and associative, that is, it applies as well as what plays a role in the optimization of queries. The number of attributes of the result relation is the sum of the numbers of the two output relations minus the number of compound attributes.
definition
For two relations and is the result of the natural union
example
|
|
|
Semi join
The semi join calculates the proportion of a natural join that remains after a reduction to the left relation.
definition
For two relations and is the result of half the natural union
example
|
|
|
Outer join
In contrast to the equi-join, the outer join also includes the tuples of the left ( left outer join ) or right ( right outer join ) table in the result relation that cannot find a join partner. The non-existent attributes of the join relation are filled with null values.
The combination of the left and right outer join is called an outer join or a full outer join. All tuples are included in the result relation and those attributes of a tuple that have not found a join partner in the other relation are filled with zero values.
The outer join can be used with or without a ( natural outer join ) join condition.
example
|
|
|
renaming
With this operation attributes and relations can be renamed. This surgery is important to
- Enable joins from different named relations,
- To enable Cartesian products where there are the same attribute names, in particular with the same relation
- To enable set operations between relations with different attributes.
The notation is , linear R [old → new].
definition
We construct a new tuple set t ' from the old one:
example
|
|
division
definition
Since division is a derived operation, we define it with the help of the other operations of relational algebra. Let R, S be relations and the sets of attributes belonging to R and the sets of attributes belonging to S. .
The division is then defined by:
In plain words containing ie those of attributes , which in any combination with the attributes of in occur.
example
Given is a relation R that contains fathers and mothers, their children and the age of these children. In addition, there is a relation S that contains some children and their ages: Maria (4) and Sabine (2). If you divide R by S, the result is a relation that only contains those married couples who have both a daughter Maria at age 4 and a daughter Sabine at age 2:
|
|
|
Division is used when the question contains "for all". For our example the question is: "Select all parents (father, mother) who have a child with the name Maria and the age 4 and a child with the name Sabine and the age 2 (the relation S)."
Minimalism and completeness
A minimal set of operations, that is, a set of operations that is at least necessary to be able to form all expressions of relational algebra, includes
- projection
- selection
- Cartesian product
- Union
- difference
- renaming
All other operations ( e.g. joins ) can be simulated using these basic operations.
Any other set of operations is relationally complete if they have the same power as the operations above.
Relational Algebra Extensions
Relational algebra is not powerful enough to be able to completely map other query languages, especially SQL, into relational algebra. There are e.g. B. no way GROUP BY/HAVING
to translate the SQL operators , aggregate functions and null values into relational algebra. Here we consider some extensions (some of which have a similar effect) that enable a complete mapping into relational algebra and thus a complete theoretical consideration of these query languages.
Zero values
SQL allows the use of NULL values , which IS NULL
can be queried with the special predicate . This is particularly important when creating outer joins that create a relation that contain all values of one relation, as well as all values of the other for which the join condition is true, otherwise NULL values. This cannot be mapped with relational algebra.
One possibility is the definition of null values with a three-valued logic, as in SQL, that is, the Boolean operators are extended using truth tables in such a way that it is specified how to proceed when an operand is NULL.
∧ | true | false | ZERO |
---|---|---|---|
true | true | false | ZERO |
false | false | false | false |
ZERO | ZERO | false | ZERO |
Selection conditions or combinations that are applied to null values result in NULL. One difficulty with this (i.e., with the SQL-like treatment of null values) is that the results of queries with subqueries that return NULL do not necessarily match the user's intention. This type of zero-value treatment is also not orthogonal ; H. the behavior on one level (Boolean operators, 3-value logic) does not correspond to that on another (compound, 3-value logic is mapped to 2-value).
Another possibility is to differentiate between two different types of zero values, each of which means “any” or “not defined”.
Grouping operator and aggregate functions
The grouping applies functions to the same attributes in a relation. The operator γ receives a list of functions and a list of attributes. The functions are then applied to tuples for which the attributes of the attribute list are the same. The output is a new relation consisting of the attribute list and a new attribute that contains the results of the function list.
The functions are then the usual aggregate functions count, sum, max, avg …
.
definition
Let R be a relation and A = {A 1 ,…, A n } attributes from R. Let F (X) be a function list f 1 (x 1 ),…, f n (x n ). The grouping is then
For an empty set of attributes (i.e. γ F (X); {} (…)) an additional attribute is generated that contains the value of the function application over the entire relation. This is used to break down the relation with the selection into partial relations with the same attributes, which are then reassembled with the function application.
Furthermore, a grouping with an empty function list has no effect.
NF²
The NF² model is an extension of the relational database model. The name stands for Non-first-normal-form (NFNF), which should indicate that the condition of atomic attribute values of the 1st normal form is broken. As a result, sets of attributes and sets of sets are allowed, which means that an attribute of a relation can again be a relation. The domain (range of values) of a combined attribute is the cross product of the attribute domains involved.
NF² extends relational algebra in such a way that, in addition to the usual (appropriately adapted) operations of relational algebra, two operations are added that nest a relation (nesting ν) and deinterleave a relation (denesting μ). Nesting combines a number of attributes in a sub-relation that is given a new attribute name. De-nesting removes nesting. These operations serve to transform NF² relations into the 1st normal form and vice versa. The operations are generally not bijective.
For the above reasons, NF² does not need any foreign keys.
Multi-set semantics
As a result of queries, SQL returns a multiset , i.e. a set that can contain multiple elements. This was done for performance reasons in order to save the additional step of removing duplicates. Strictly speaking, only queries DISTINCT
that are specified can be translated into relational algebra .
For relational algebra one can then additionally specify a function bag-to-set , which removes the duplicates from a multiset and thus creates a set, and the basic operations then simply as a multiset {{t | ...}} specify. However, care must be taken when defining derived operations.
Examples
As the relationship schemes for the examples, we take the classic sample database consisting of the schemes customer , supplier and goods . The schemes are:
- CUSTOMER (customer number, name, place of residence, account balance)
- SUPPLIER (supplier number, name, place, telephone)
- GOODS (goods number, description, supplier number, price)
Basic operations of relational algebra are then used as follows:
- The prices of all goods:
- π name, price (GOODS)
- All suppliers from Bremen:
- σ place = 'Bremen' (SUPPLIER)
- Customers with a negative account balance:
- σ account balance <0 (CUSTOMER)
- Rename the location of the SUPPLIER (for example to be able to perform quantity operations):
- ρ place of residence ← place (SUPPLIER)
Since the results of relational algebra are again relations (the RA is orthogonal), the operations can again be applied to the results of operations. This allows complex queries. For a simpler notation, we assume that the cross product implicitly renames the attributes so that the new attribute names are qualified with the relation name, i.e. H. the supplier number from the relation WARE becomes WARE.
- The telephone numbers of all suppliers who deliver vegetables in Bremen:
- π Telephone (σ Name = 'Vegetables' ∧ Place = 'Bremen' ∧ SUPPLIER.supplier no = WARE.supplier no (SUPPLIER × GOODS))
- All locations that contain at least one supplier and at least one customer
- π place (ρ place ← place of residence (CUSTOMER)) ∩ π place (SUPPLIER)
See also
literature
- Edgar F. Codd : A Relational Model of Data for Large Shared Data Banks. In: Communications of the ACM. 6/13/1970, pp. 377-387. (The fundamental work with which Codd first presented the relational data model in 1970. ACM Classics or Peter Morcinek: FHTW Berlin PDF; 1.4 MB)
- Alfons Kemper, André Eickler: Database Systems - An Introduction. ISBN 3-486-57690-9
- Peter Kandzia, Hans-Joachim Klein: Theoretical foundations of relational database systems. ISBN 3-411-14891-8
- Andreas Heuer, Gunter Saake : Databases: Concepts and Languages. MITP Verlag, ISBN 3-8266-0619-1 , p. 297 ff.
- H. Buff Database Theory . Book-on-Demand , Norderstedt, ISBN 3-0344-0201-5 , 312 pages
- H.-J. Schek, P. Pistor: Data Structures for an Integrated Data Base Management and Information Retrieval System (Non First Normal Form NF²), Proceedings of the 8th International Conference on Very Large Data Bases, 1982, ISBN 0-934613-14-1 , p 197-207
- Dirk Leinders, Jerzy Tyskiewicz, Jan Van den Bussche: On the expressive power of semijoin queries : arxiv : cs.DB / 0308014
Web links
- Explanatory videos for relational algebra , Big Data Analytics Group, Saarland University
- RAT, Software Rational Algebra Translator to SQL
- SELECT2OBree: Conversion of SQL into relational algebra
- Database / introduction to joins. In: SELFHTML . Retrieved July 4, 2017 (Illustrative Examples).
- Bibliography server
- Further examples
Individual evidence
- ^ Jeffery D. Ullman: Principles od Database and Knowledgebase Systems - Volume I: Classical Database Systems . Computer Science Press, 1988, ISBN 0-7167-8158-1 , pp. 53 .
- ^ Jeffery D. Ullman: Principles of Database and Knowledgebase Systems - Volume I: Classical Database Systems . Computer Science Press, 1988, ISBN 0-7167-8158-1 , pp. 210 .
- ↑ a b Torsten Grust, Jens Teubner: Relational Algebra: Mother Tongue - XQuery: Fluent . In: Proc. of the first Twente Data Management Workshop on XML Databases . Enschede, The Netherlands 2004, pp. 9–16 ( uni-konstanz.de ).
- ↑ a b Richard Cyganiak: A relational algebra for SPARQL . Ed .: Hewlett-Packard Development Company. Bristol 2005 ( https://www.semanticscholar.org/paper/A-relational-algebra-for-SPARQL-Cyganiak/129ee5496d9b54fdfdc11214765f90d047fe57a1 , http://www.hpl.hp.com/techreports/2005/HPL-2005-170 .html ).
- ↑ a b Werner Kießling, Gerhard Köstler: Multimedia course database systems . Springer-Verlag, Berlin / Heidelberg 1998, ISBN 3-540-63836-9 .
- ^ Jeffery D. Ullman: Principles od Database and Knowledge-base Systems - Volume II: The New Technologies . Computer Science Press, 1989, ISBN 0-7167-8162-X , pp. 633-733 . , Chapter 11
- ^ A b Alfred Tarski: On the calculus of relations . In: The Journal of Symbolic Logic . tape 6 , no. 3 . Association for Symbolic Logic, New York September 1941, p. 73-89 , JSTOR : 2268577 .
- ^ A b c Edgar F. Codd: Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks . In: ACM SIGMOD Record . tape 38 , no. 1 . Association for Computing Machinery, New York 1969, p. 17-36 ( acm.org ).
- ^ A b Edgar F. Codd: A Relational Model of Data for Large Shared Data Banks . In: Communications of the ACM . tape 13 , no. 6 . Association for Computing Machinery, New York 1970, pp. 377-387 ( acm.org ).
- ^ Rudolf Bayer, Edward M. McCreight: Organization and Maintenance of Large Ordered Indexes . In: Proceedings of the 1970 ACM SIGFIDET (SIGFIDET '70) . Association for Computing Machinery, 1970, pp. 107-141 .
- ^ Rudolf Bayer, Edward M. McCreight: Organization and Maintenance of Large Ordered Indexes . In: Acta Informatica . tape 1 , no. 3 . Springer-Verlag, 1972, p. 173-189 .
- ↑ Alexis Leon and Mathews Leon: SQL - A Complete Reference . Tata McGraw-Hill, New Delhi 1999, ISBN 0-07-463708-8 , pp. 68–69 ( limited preview in Google Book Search).
- ^ Jeffery D. Ullman: Principles od Database and Knowledgebase Systems - Volume I: Classical Database Systems . Computer Science Press, 1988, ISBN 0-7167-8158-1 , pp. 195-210 .
- ^ Jeffery D. Ullman: Principles od Database and Knowledgebase Systems - Volume I: Classical Database Systems . Computer Science Press, 1988, ISBN 0-7167-8158-1 , pp. 185-195 .
- ↑ Günther Pernul, Rainer Unland: Databases in the company - analysis, modeling and use . In: Textbooks Business Information Systems . 2nd, corrected edition. Oldenbourg Wissenschaftsverlag, ISBN 3-486-27210-1 , p. 228, 248 ( limited preview in Google Book search).