Join algorithms

Joinalgorithmen are possible strategies ( algorithms ) for the implementation of joins .

The optimal strategy depends on the size and structure of the relations involved in the join, the indices used or usable , the size of the main memory and the type of join (natural join, θ-join or equi-join).

Nested loop join

All tuples from the relation are selected one after the other and compared with each tuple from the relation . ${\ displaystyle {\ mathit {R}}}$ ${\ displaystyle {\ mathit {S}}}$

The algorithm is suitable for every join and can be extended to several relations.

Pseudocode

Implementation of with as an outer relation and as an inner relation in pseudocode : ${\ displaystyle R \ bowtie _ {Ra \ theta Sb} S}$ ${\ displaystyle {\ mathit {R}}}$ ${\ displaystyle {\ mathit {S}}}$

  foreach r in R do
    foreach s in S do
      if r.a = s.b do
        ⇒ Ausgabe von (r,s)
      endif
    end foreach
  end foreach

rating

Since all tuples are linked in the cross product , there is a computational effort of . ${\ displaystyle {\ mathcal {O}} (| R | \ cdot | S |)}$

The number of blocks to be read is in the worst case , since for each tuple in R all blocks of S are reloaded. If both relations fit into the memory, then blocks have to be read for this best case . If one of the relations fits completely into the memory, it is selected as the inner relation, otherwise it makes more sense to choose the smaller relation as the outer relation. ${\ displaystyle | R | \ cdot b_ {s} + b_ {r}}$ ${\ displaystyle b_ {s} + b_ {r}}$

variants

With the index nested loop join , existing or created indexes are compared. The number of block accesses then depends on the size and structure of the index structures.

Block nested loop join

In the event that both relations and do not fit into main memory, a block-by-block comparison improves the nested loop join. ${\ displaystyle {\ mathit {R}}}$ ${\ displaystyle {\ mathit {S}}}$

The algorithm is suitable for any join.

Pseudocode

Implementation of in pseudocode : ${\ displaystyle R \ bowtie _ {Ra \ theta Sa} S}$

  foreach Block_r in R do
    foreach Block_s in S do
      foreach r in Block_r do
        foreach s in Block_s do
          if r.a = s.a do
            ⇒ Ausgabe von (r,s)
          endif
        end foreach
      end foreach
    end foreach
  end foreach

rating

As for the nested loop join, there is complexity . ${\ displaystyle {\ mathcal {O}} (| R | \ cdot | S |)}$

The number of blocks to be read is in the worst case , in the best case . ${\ displaystyle b_ {r} \ cdot b_ {s} + b_ {r}}$ ${\ displaystyle b_ {s} + b_ {r}}$

variants

If both relations do not fit in the main memory, as many blocks as possible are loaded into the main memory from the outer relation. This reduces the number of block accesses to the inner relation.

Sort-merge join

Both relations are sorted according to their join attributes. The result can then be calculated with a single scan through both sorted relations.

The algorithm is only suitable for natural join and equi-join.

Pseudocode

Implementation of in pseudocode : ${\ displaystyle R \ bowtie _ {Ra = Sa} S}$

 p_r := erstes Tupel in R
 p_s := erstes Tupel in S
 while p_r ≠ endof_R and p_s ≠ endof_S do
   // Sammeln aller Tupel mit gleichen Joinattributen aus S
   M_s := Menge mit Inhalt p_s
   foreach t_s in S > p_s
     if t_s.a = p_s.a
       M_s += Menge mit Inhalt t_s
     elseif
       p_s := t_s
       break foreach
     endif
   end foreach
   // suche passendes Anfangstupel in R
   foreach t_r in R > p_r
     p_r = t_r
     if t_r.a ≥ t_s.a
       break foreach
     endif
   end foreach
   // Tupel ausgeben
   foreach t_r in R > p_r
     if t_r.a > t_s.a
       break foreach
     endif
     foreach t_s in M_s
       ⇒ Ausgabe von (t_r,t_s)
     end foreach
     p_r = t_r
   end foreach
 end while

rating

The number of block accesses for sorting is in the worst case , analogous for . ${\ displaystyle {\ mathit {S}}}$ ${\ displaystyle b_ {s} \ cdot \ left (2 \ cdot \ log \ left ({\ frac {b_ {s}} {b_ {free}}} \ right) \ right) + b_ {s}}$ ${\ displaystyle {\ mathit {R}}}$

variants

If indexes exist, they can be used to get the sorted join attributes. When reading out the tuples, the random distribution can lead to block accesses in the worst case . ${\ displaystyle | S | + | R |}$

When hybrid merge join , therefore, a relation is sorted. Then this is merged with the tuple addresses of the other relation via the index . The result set is sorted according to the tuple addresses and the tuples are read out with as few block accesses as possible. ${\ displaystyle {\ mathit {S}}}$ ${\ displaystyle {\ mathit {R}}}$ ${\ displaystyle {\ mathit {R}}}$

Hash join (simple hash)

The first relation creates a hash table with the join attributes. The join attributes of the second relation are also hashed. If the value points to a non-empty bucket in the hash table, the bucket is compared with the current tuple.

The hash table is generally created for the smaller relation. The hash join is only suitable for equi-joins.

Pseudocode

${\ displaystyle R \ bowtie _ {Ra = Sb} S}$

Für alle Elemente s aus S wiederhole
    Hashe über den Join Attributen s(b);
    Trage Tupel entsprechend den Werten in Hashtabelle ein
end
Für alle Elemente r aus R wiederhole
    Hashe über den Join Attributen r(a);
    if r auf nicht-leeres Bucket hashed
        if r  $\theta$  s
            speichere r  $\bullet$  s in Ergebnismenge
        end
    end
end

rating

The average runtime of the hash join is: ${\ displaystyle {\ mathcal {O}} (n + m)}$

variants

Hash-partioned join

The relations and are broken down into disjoint partitions bis (analogous for R) using an equally and randomly distributing hash function . The tuples of the two relations fall into the same buckets if the value ranges are similar. ${\ displaystyle {\ mathit {R}}}$ ${\ displaystyle {\ mathit {S}}}$ ${\ displaystyle n_ {h}}$ ${\ displaystyle H_ {s_ {0}}}$ ${\ displaystyle H_ {s_ {n_ {h}}}}$

Only the partitions and have to be compared because they contain the tuples with the same join attributes. The algorithm is only suitable for natural join and equi-join. ${\ displaystyle H_ {s_ {i}}}$ ${\ displaystyle H_ {r_ {i}}}$

Implementations of the hash partitioned join

Parallel join

Parallelized algorithms distribute the work on a join over several processors or computers. They are used in parallel databases or in work-sharing frameworks such as MapReduce , Dryad and Hadoop .

Any join algorithm can run on the individual processors.

Fragment-and-Replicate Join

An extension of the partitioned join for θ joins: the elements of the cross product of the partition pairs and are passed to one processor each. The data transfer increases significantly due to the replication. ${\ displaystyle H_ {r_ {i}}}$ ${\ displaystyle H_ {s_ {i}}}$