Join algorithms
Joinalgorithmen are possible strategies ( algorithms ) for the implementation of joins .
The optimal strategy depends on the size and structure of the relations involved in the join, the indices used or usable , the size of the main memory and the type of join (natural join, θ-join or equi-join).
Nested loop join
All tuples from the relation are selected one after the other and compared with each tuple from the relation .
The algorithm is suitable for every join and can be extended to several relations.
Pseudocode
Implementation of with as an outer relation and as an inner relation in pseudocode :
foreach r in R do
foreach s in S do
if r.a = s.b do
⇒ Ausgabe von (r,s)
endif
end foreach
end foreach
rating
Since all tuples are linked in the cross product , there is a computational effort of .
The number of blocks to be read is in the worst case , since for each tuple in R all blocks of S are reloaded. If both relations fit into the memory, then blocks have to be read for this best case . If one of the relations fits completely into the memory, it is selected as the inner relation, otherwise it makes more sense to choose the smaller relation as the outer relation.
variants
With the index nested loop join , existing or created indexes are compared. The number of block accesses then depends on the size and structure of the index structures.
Block nested loop join
In the event that both relations and do not fit into main memory, a block-by-block comparison improves the nested loop join.
The algorithm is suitable for any join.
Pseudocode
Implementation of in pseudocode :
foreach Block_r in R do
foreach Block_s in S do
foreach r in Block_r do
foreach s in Block_s do
if r.a = s.a do
⇒ Ausgabe von (r,s)
endif
end foreach
end foreach
end foreach
end foreach
rating
As for the nested loop join, there is complexity .
The number of blocks to be read is in the worst case , in the best case .
variants
If both relations do not fit in the main memory, as many blocks as possible are loaded into the main memory from the outer relation. This reduces the number of block accesses to the inner relation.
Sort-merge join
Both relations are sorted according to their join attributes. The result can then be calculated with a single scan through both sorted relations.
The algorithm is only suitable for natural join and equi-join.
Pseudocode
Implementation of in pseudocode :
pr := erstes Tupel in R ps := erstes Tupel in S while pr ≠ endofR and ps ≠ endofS do // Sammeln aller Tupel mit gleichen Joinattributen aus S Ms := Menge mit Inhalt ps foreach ts in S > ps if ts.a = ps.a Ms += Menge mit Inhalt ts elseif ps := ts break foreach endif end foreach // suche passendes Anfangstupel in R foreach tr in R > pr pr = tr if tr.a ≥ ts.a break foreach endif end foreach // Tupel ausgeben foreach tr in R > pr if tr.a > ts.a break foreach endif foreach ts in Ms ⇒ Ausgabe von (tr,ts) end foreach pr = tr end foreach end while
rating
The number of block accesses for sorting is in the worst case , analogous for .
variants
If indexes exist, they can be used to get the sorted join attributes. When reading out the tuples, the random distribution can lead to block accesses in the worst case .
When hybrid merge join , therefore, a relation is sorted. Then this is merged with the tuple addresses of the other relation via the index . The result set is sorted according to the tuple addresses and the tuples are read out with as few block accesses as possible.
Hash join (simple hash)
The first relation creates a hash table with the join attributes. The join attributes of the second relation are also hashed. If the value points to a non-empty bucket in the hash table, the bucket is compared with the current tuple.
The hash table is generally created for the smaller relation. The hash join is only suitable for equi-joins.
Pseudocode
Für alle Elemente s aus S wiederhole Hashe über den Join Attributen s(b); Trage Tupel entsprechend den Werten in Hashtabelle ein end Für alle Elemente r aus R wiederhole Hashe über den Join Attributen r(a); if r auf nicht-leeres Bucket hashed if r s speichere r s in Ergebnismenge end end end
rating
The average runtime of the hash join is:
variants
Hash-partioned join
The relations and are broken down into disjoint partitions bis (analogous for R) using an equally and randomly distributing hash function . The tuples of the two relations fall into the same buckets if the value ranges are similar.
Only the partitions and have to be compared because they contain the tuples with the same join attributes. The algorithm is only suitable for natural join and equi-join.
Implementations of the hash partitioned join
Parallel join
Parallelized algorithms distribute the work on a join over several processors or computers. They are used in parallel databases or in work-sharing frameworks such as MapReduce , Dryad and Hadoop .
Any join algorithm can run on the individual processors.
Fragment-and-Replicate Join
An extension of the partitioned join for θ joins: the elements of the cross product of the partition pairs and are passed to one processor each. The data transfer increases significantly due to the replication.