Closest pair of points

From Wikipedia, the free encyclopedia

The closest pair of points problem (also closest pair problem) is the task of finding, in a set of points in the plane, the two points that lie closest to each other.

Formal description

Given a set of points P, we look for two points p, q ∈ P with p ≠ q such that the Euclidean distance dist(p, q) is minimal.

Brute force algorithm

The naive approach is to simply calculate all the distances between all possible pairs of points and choose the smallest one:

minDist = infinity
for all p in P:
 for all q in P:
  if p ≠ q and dist(p,q) < minDist:
   minDist = dist(p,q)
   closestPair = (p,q)
return closestPair

Since both the inner and the outer loop are executed n times, the runtime of the algorithm is O(n²).
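The brute-force approach above can be sketched in Python as follows (a minimal sketch; the function name `closest_pair_brute_force` and the use of `itertools.combinations` to enumerate unordered pairs are illustrative choices, not part of the original pseudocode):

```python
from itertools import combinations
from math import dist, inf

def closest_pair_brute_force(points):
    """Check every pair of points and keep the one with the smallest distance."""
    min_dist = inf
    best = None
    for p, q in combinations(points, 2):  # all unordered pairs with p != q
        d = dist(p, q)
        if d < min_dist:
            min_dist = d
            best = (p, q)
    return min_dist, best
```

Using `combinations` checks each pair only once, which halves the work compared to the two nested loops but does not change the quadratic asymptotic running time.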

Divide & Conquer algorithm

First, the given set of points P is sorted once by x and once by y coordinate; this yields two sorted lists P_x and P_y.

The divide step splits P into two halves P_L and P_R, lying to the left and to the right of the median of the x coordinates.

The algorithm is now called recursively on both halves; this yields the closest pair of points within each half, but does not yet account for pairs whose points lie in different halves.

The conquer step combines the two halves by selecting the minimum of the distances found. Two cases have to be considered:

  1. The smallest distance may come directly from the left or the right half. In this case nothing further needs to be done.
  2. However, two points in different halves may be closer to each other than the smallest distance found within the halves. This case must be handled separately:

(Figure: examination of the neighbors of P)

In principle, all such cross-boundary pairs of points would have to be checked individually. But since there are still quadratically many of them, this gains nothing in running time: the recursion equation would be T(n) = 2·T(n/2) + n²/4, which is still in O(n²). It is better to exploit the fact that an upper bound δ is already known for the smallest distance, namely the minimum of the two distances returned by the recursive calls.

First, only points whose distance in the x direction from the dividing line is at most δ need to be considered. For these points, in turn, only partners whose distance in the y direction is smaller than δ come into question. It follows that for each point P to be checked, only a constant number of potential partners has to be examined: if one imagines a grid of cell width δ/2 in the vicinity of the dividing line, each grid cell can contain at most one point, since otherwise a distance smaller than δ would already have been found. Since only those cells have to be checked that are at most δ away from P (at most 12 above and 12 below), at most 24 further distances have to be computed per point, so the combination step takes linear instead of quadratic time. Overall, the recursion equation is now T(n) = 2·T(n/2) + 24·n, which is in O(n log n).
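The whole divide & conquer scheme can be sketched in Python as follows (a sketch assuming pairwise distinct points given as (x, y) tuples; the helper name `_rec` and the partitioning via a set are implementation choices of this sketch):

```python
from math import dist, inf

def closest_pair(points):
    """O(n log n) divide & conquer closest pair; returns (distance, (p, q))."""
    px = sorted(points)                       # sorted by x coordinate
    py = sorted(points, key=lambda p: p[1])   # sorted by y coordinate
    return _rec(px, py)

def _rec(px, py):
    n = len(px)
    if n <= 3:  # base case: brute force on at most 3 points
        best = (inf, None)
        for i in range(n):
            for j in range(i + 1, n):
                best = min(best, (dist(px[i], px[j]), (px[i], px[j])))
        return best
    mid = n // 2
    mid_x = px[mid][0]
    lx, rx = px[:mid], px[mid:]               # divide at the median x
    left_set = set(lx)                        # assumes distinct points
    ly = [p for p in py if p in left_set]     # keep both lists y-sorted
    ry = [p for p in py if p not in left_set]
    best = min(_rec(lx, ly), _rec(rx, ry))    # conquer: best of both halves
    delta = best[0]
    # strip: points within delta of the dividing line, already in y order
    strip = [p for p in py if abs(p[0] - mid_x) < delta]
    for i, p in enumerate(strip):
        for q in strip[i + 1 : i + 8]:        # only a constant number of candidates
            if q[1] - p[1] >= delta:
                break
            best = min(best, (dist(p, q), (p, q)))
    return best
```

Filtering the y-sorted list against a set keeps both orderings intact without re-sorting, which is what makes the merge step linear.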

Randomized algorithm

How it works

The randomized algorithm is based on inserting the points into a hash table in random order, where inserting a point, as well as testing whether it undercuts the previously known minimum distance δ, takes expected time O(1). The hash table maps each grid cell of size (δ/2)² to the point contained in it, if any. Although the hash table has to be rebuilt after every update of δ, the total number of insert operations across all of these hash tables is still expected O(n). Ignoring for the moment how the hash table is actually used, this yields the following algorithm:

create a random order of the points p_1, …, p_n
δ = dist(p_1, p_2)
insert p_1 and p_2 into the hash table (grid size δ/2)
for i = 3..n
  if p_i has a distance δ' < δ to one of the points p_1, …, p_{i-1}:
    δ := δ'
    rebuild the hash table (grid size δ'/2)
  insert p_i into the hash table

Grid structure and hash function

For simplicity, one may assume that all points have coordinates between (0,0) and (1,1); if necessary, a coordinate transformation is carried out first. The plane containing the points is then covered by a grid of cell width δ/2. A universal hash function is required that, for a given grid cell, makes it possible to determine in O(1) whether one of the points lies in that cell, and that allows new points to be inserted into the existing hash table in O(1).
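In Python, the grid cell of a point can be computed directly from its coordinates; a built-in dict then serves as the hash table, since dicts offer expected O(1) lookup and insertion (the function name `cell` is an illustrative choice for this sketch):

```python
def cell(point, delta):
    """Map a point with coordinates in [0,1) to its grid cell of side delta/2."""
    side = delta / 2
    return (int(point[0] // side), int(point[1] // side))
```

A hash table entry is then simply `table[cell(p, delta)] = p`.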

Inserting new points

(Figure: inserting a new point P)

Each time a new point P is inserted, it must be checked whether it undercuts the already known minimum distance δ. From the coordinates of P one can immediately read off into which grid cell P falls. Because of the division into grid cells of size (δ/2)², it only needs to be determined whether one of the surrounding grid cells already contains a point. Only the surrounding 5×5 block of cells can give rise to such a distance: every point outside this area is, because of the grid size, at least 2·(δ/2) = δ away from P and therefore cannot have a distance < δ to P. Since each grid cell contains at most one point (otherwise a distance < δ would already have been found), at most 25 points have to be checked. If there is no point in this area, P can simply be inserted into the existing hash table.
If, however, a point at distance δ' < δ is found in this area, then δ is updated to the newly found minimum distance. As a consequence, the previous hash table becomes useless, because its grid width no longer equals δ/2. A new hash table with the correct δ is therefore needed; it is simply built from scratch for all points up to P. For efficiency, the distance check can be skipped during this rebuild, since the minimum distance δ of all points up to P is already known.
Thus, after inserting a new point, one always has a correct hash table with grid width δ/2, and each grid cell contains at most one point.
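The insertion step, including the occasional rebuild, can be sketched as follows (a minimal sketch assuming at least two pairwise distinct points with coordinates in [0,1); a plain dict plays the role of the hash table, and the function names are illustrative):

```python
import random
from math import dist

def cell(p, delta):
    """Grid cell of a point for grid width delta/2."""
    side = delta / 2
    return (int(p[0] // side), int(p[1] // side))

def closest_pair_randomized(points):
    """Randomized closest pair; returns the minimum distance delta."""
    pts = points[:]
    random.shuffle(pts)                      # random insertion order
    delta = dist(pts[0], pts[1])
    table = {cell(p, delta): p for p in pts[:2]}
    for i in range(2, len(pts)):
        p = pts[i]
        cx, cy = cell(p, delta)
        # check the surrounding 5x5 block of cells for a closer point
        new_delta = delta
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                q = table.get((cx + dx, cy + dy))
                if q is not None:
                    new_delta = min(new_delta, dist(p, q))
        if new_delta < delta:
            delta = new_delta
            # rebuild the table from scratch with the new, smaller grid width
            table = {cell(q, delta): q for q in pts[:i]}
        table[cell(p, delta)] = p
    return delta
```

Note that the rebuild needs no distance checks: the minimum distance of all points inserted so far is already known, so no two of them can share a cell.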

Runtime analysis

Only the number of insert operations for new points is relevant for the running time. Trivially, every point has to be inserted at least once, so the running time is in Ω(n).

It remains to determine the cost of the occasional rebuilding of the hash table, for which all points known so far have to be reinserted. To this end, one states the probability that a rebuild becomes necessary when inserting the i-th point, together with the cost it incurs. Intuitively, the larger the number of points already inserted, the greater the effort of a rebuild, but the smaller its probability. More precisely:

The probability that the i-th inserted point establishes a new distance δ' < δ is at most 2/i: for this to happen, it would have to be one of the two points that realize δ'. Since the insertion order of the i points is random and exactly two of them meet this requirement, the probability is at most 2/i.

Now define the random variable X_i: let X_i = 1 if δ' < δ when inserting the i-th point, and X_i = 0 otherwise.

Then the total number of insert operations is n + Σ_{i=3}^{n} i·X_i: the n points that have to be inserted anyway, plus the insert operations caused by rebuilds (a rebuild when inserting the i-th point reinserts i points).

By linearity of expectation, E[Σ_{i=3}^{n} i·X_i] = Σ_{i=3}^{n} i·E[X_i] ≤ Σ_{i=3}^{n} i·(2/i) ≤ 2n.

That is, the expected total number of insert operations for all rebuilds of the hash table is at most 2n. Overall, the expected running time is O(n + 2n), i.e. linear.
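The expected bound of roughly 3n insert operations can be checked empirically with a small simulation (an illustrative sketch: it counts insertions directly instead of maintaining a hash table, using a naive O(i) minimum-distance check per step; the function name `count_inserts` is an assumption of this sketch):

```python
import random
from math import dist

def count_inserts(n, trials=30):
    """Average number of hash-table insert operations, including rebuilds."""
    total = 0
    for _ in range(trials):
        pts = [(random.random(), random.random()) for _ in range(n)]
        delta = dist(pts[0], pts[1])
        inserts = 2                     # the first two points
        for i in range(2, n):
            d = min(dist(pts[i], q) for q in pts[:i])
            if d < delta:               # new minimum distance found
                delta = d
                inserts += i            # rebuild reinserts all i previous points
            inserts += 1                # insert the new point itself
        total += inserts
    return total / trials
```

For uniformly random points, the average should stay close to the analytical bound of about 3n.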
