Binary search algorithm

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Caesura (talk | contribs) at 09:27, 27 July 2007 (Wikipedia:Avoid self-references). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

A binary search algorithm (or binary chop) is a technique for finding a particular value in a sorted list. It makes progressively better guesses, and closes in on the sought value, by comparing an element halfway with what has been determined to be an element too low in the list and one too high in the list. A binary search finds the median element in a list, compares its value to the one you are searching for, and determines if it’s greater than, less than, or equal to the one you want. A guess that turns out to be too high becomes the new top of the list, and one too low the new bottom of the list. The binary search's next guess is halfway between the new list's top and bottom. Pursuing this strategy iteratively, it narrows the search by a factor 2 each time, and finds your value. A binary search is an example of a divide and conquer algorithm (more specifically a decrease and conquer algorithm) and a dichotomic search (more at Search algorithm).

The algorithm

The most common application of binary search is to find a specific value in a sorted list. To cast this in the frame of the guessing game (see Example below), realize that we are now guessing the index, or numbered place, of the value in the list. This is useful because, given the index, other data structures will contain associated information. Suppose a data structure containing the classic collection of name, address, telephone number and so forth has been accumulated, and an array is prepared containing the names, numbered from one to N. A query might be: what is the telephone number for a given name X. To answer this the array would be searched and the index (if any) corresponding to that name determined, whereupon it would be used to report the associated telephone number and so forth. Appropriate provision must be made for the name not being in the list (typically by returning an index value of zero), indeed the question of interest might be only whether X is in the list or not.

If the list of names is in sorted order, a binary search will find a given name with far fewer probes than the simple procedure of probing each name in the list, one after the other in a linear search, and the procedure is much simpler than organising a hash table though that would be faster still, typically averaging just over one probe. This applies for a uniform distribution of search items but if it is known that some few items are much more likely to be sought for than the majority then a linear search with the list ordered so that the most popular items are first may do better.

The binary search begins by comparing the sought value X to the value in the middle of the list; because the values are sorted, it is clear whether the sought value would belong before or after that middle value, and the search then continues through the correct half in the same way. Only the sign of the difference is inspected: there is no attempt at an interpolation search based on the size of the differences.

The most straightforward implementation is recursive and recursively searches the subrange indicates by the comparison, as in this pseudocode implementation:

   BinarySearch(A[0..N-1], value, low, high) {
       if (high < low)
           return not_found
       mid = (low + high) / 2
       if (A[mid] > value)
           return BinarySearch(A, value, low, mid-1)
       else if (A[mid] < value)
           return BinarySearch(A, value, mid+1, high)
       else
           return mid
   }

It is invoked with initial low and high values of 0 and N-1. We can eliminate the tail recursion above and convert this to an iterative implementation:

   BinarySearch(A[0..N-1], value) {
       low = 0
       high = N - 1
       while (low <= high) {
           mid = (low + high) / 2
           if (A[mid] > value)
               high = mid - 1
           else if (A[mid] < value)
               low = mid + 1
           else
               return mid
       }
       return not_found
   }

Some implementations may not include the early termination branch, preferring to check at the end if the value was found, shown below. Checking to see if the value was found during the search (as opposed to at the end of the search) may seem a good idea, but there are extra computations involved in each iteration of the search. Also, with an array of length N using the low and high indices, the probability of actually finding the value on the first iteration is 1 / N, and the probability of finding it later on (before the end) is the about 1 / (high - low). The following checks for the value at the end of the search:

       while (low < high) {
           mid = (low + high)/2;
           if (A[mid] < value){ 
               low = mid + 1; 
           }
           else {
                high = mid; //can't be high = mid-1: here A[mid] >= value,
                            //  so high can't be < mid if A[mid] == value
           }   
       }
       if (low < N) and (A[low] == value)
           return low
       else
           return not_found                 

This algorithm has two other advantages. At the end of the loop, low points to the first entry greater than or equal to value, so a new entry can be inserted if no match is found. Moreover, it only requires one comparison; which could be significant for complex keys in languages which do not allow the result of a comparison to be saved.

In practice, one frequently uses a three-way comparison instead of two comparisons per loop. Also, real implementations using fixed-width integers with modular arithmetic need to account for the possibility of overflow. One frequently-used technique for this is to compute mid differently:

           mid = low + ((high - low) / 2)

Equal elements

The elements of the list are not necessarily all unique. If one searches for a value that occurs multiple times in the list, the index returned will be of the first-encountered equal element, and this will not necessarily be that of the first, last, or middle element of the run of equal-key elements but will depend on the positions of the values. Modifying the list even in seemingly unrelated ways such as adding elements elsewhere in the list may change the result.

To find all equal elements an upward and downward linear search can be carried out from the initial result, stopping each search when the element is no longer equal. Thus, e.g. in a table of cities sorted by country, we can find all cities in a given country.

Sort key

A list of pairs (p,q) can be sorted based on just p. Then the comparisons in the algorithm need only consider the values of p, not those of q. For example, in a table of cities sorted on a column "country" we can find cities in Germany by comparing country names with "Germany", instead of comparing whole rows. Such partial content is called a sort key.

Correctness and testing

Binary search is one of the trickiest "simple" algorithms to program correctly. A study has shown that an astounding 90 percent of professional programmers fail to code a binary search correctly after a whole hour of working on it, and another study shows that accurate code for it is only found in five out of twenty textbooks. (Kruse, 1999) Given this insight, it is important to remember that the best way to verify the correctness of a binary search algorithm is to thoroughly test it on a computer. It is difficult to visually analyze the code without making a mistake.

To that end, the following code will thoroughly test a binary search at every index for many multiple lengths of arrays:

       for(length = 1; length < 2049; length++){ //make array longer on each iteration
           int *A = new int[length];
           for(i = 0; i < length; i++)          //init array values from 0 to length-1
               A[i] = i;
           for(value = 0; value < length; value++){     //search for every array value
               index = binary_search(A, length, value);
               if (index >= 0)
                   passed = true;
               else passed = false;      //if this line executes, BUG in binary search
           }
           delete []A;
       }

In the above test-code if passed is ever false, then the binary search function has a bug. Note that a binary search function should check whether it found the value before it exits; if it did not find it, then it should return a negative index for the above test-code. With duplicate values in an array, a binary search will only find one of those values and therefore doesn't test whether every index is accessible.

Performance

Binary search is a logarithmic algorithm and executes in O(log n) time. Specifically, iterations are needed to return an answer. In most cases it is considerably faster than a linear search. It can be implemented using recursion or iteration, as shown above. In some languages it is more elegantly expressed recursively; however, in some C-based languages tail recursion is not eliminated and the recursive version requires more stack space.

Binary search can interact poorly with the memory hierarchy (i.e. caching), because of its random-access nature. For in-memory searching, if the interval to be searched is small, a linear search may have superior performance simply because it exhibits better locality of reference. For external searching, care must be taken or each of the first several probes will lead to a disk seek. A common technique is to abandon binary searching for linear searching as soon as the size of the remaining interval falls below a small value such as 8 or 16.

Examples

An example of binary search in action is a simple guessing game in which a player has to guess a positive integer, between 1 and N, selected by another player, using only questions answered with yes or no. Supposing N is 16 and the number 11 is selected, the game might proceed as follows.

  • Is the number greater than 8? (Yes).
  • Is the number greater than 12? (No)
  • Is the number greater than 10? (Yes)
  • Is the number greater than 11? (No)

Therefore, the number must be 11. At each step, we choose a number right in the middle of the range of possible values for the number. For example, once we know the number is greater than 8, but less than or equal to 12, we know to choose a number in the middle of the range [9, 12] (in this case 10 is optimal).

At most questions are required to determine the number, since each question halves the search space. Note that one less question (iteration) is required than for the general algorithm, since the number is constrained to a particular range.

Even if the number we're guessing can be arbitrarily large, in which case there is no upper bound N, we can still find the number in at most steps (where k is the (unknown) selected number) by first finding an upper bound by repeated doubling. For example, if the number were 11, we could use the following sequence of guesses to find it:

  • Is the number greater than 1? (Yes)
  • Is the number greater than 2? (Yes)
  • Is the number greater than 4? (Yes)
  • Is the number greater than 8? (Yes)
  • Is the number greater than 16? (No, N=16, proceed as above)

( We know the number is greater than 8 )

  • Is the number greater than 12? (No)
  • Is the number greater than 10? (Yes)
  • Is the number greater than 11? (No)

As one simple example, in revision control systems, it is possible to use a binary search to see in which revision a piece of content was added to a file. We simply do a binary search through the entire version history; if the content is not present in a particular version, it appeared later, while if it is present it appeared at that version or sooner. This is far quicker than checking every difference.

There are many occasions unrelated to computers when a binary search is the quickest way to isolate a solution we seek. In troubleshooting a single problem with many possible causes, we can change half the suspects, see if the problem remains and deduce in which half the culprit is; change half the remaining suspects, and so on. See: Shotgun debugging.

People typically use a mixture of the binary search and interpolative search algorithms when searching a telephone book, after the initial guess we exploit the fact that the entries are sorted and can rapidly find the required entry. For example when searching for Smith, if Rogers and Thomas have been found, one can flip to the page halfway between the previous guesses, if this shows Samson, we know that Smith is somewhere between the Samson and Thomas pages so we can bisect these.

Language support

Many standard libraries provide a way to do binary search. C provides bsearch in its standard library. C++'s STL provides algorithm functions lower bound and upper bound. Java offers a set of overloaded binarySearch() static methods in the classes Arrays and Collections for performing binary searches on Java arrays and Lists, respectively. They must be arrays of primitives, or the arrays or Lists must be of a type that implements the Comparable interface, or you must specify a custom Comparator object. Microsoft's .NET Framework 2.0 offers static generic versions of the Binary Search algorithm in its collection base classes. An example would be System.Array's method BinarySearch<T>(T[] array, T value). Python provides the bisect module. COBOL can perform binary search on internal tables using the SEARCH ALL statement.

Applications to complexity theory

Even if we do not know a fixed range the number k falls in, we can still determine its value by asking simple yes/no questions of the form "Is k greater than x?" for some number x. As a simple consequence of this, if you can answer the question "Is this integer property k greater than a given value?" in some amount of time then you can find the value of that property in the same amount of time with an added factor of log k. This is called a reduction, and it is because of this kind of reduction that most complexity theorists concentrate on decision problems, algorithms that produce a simple yes/no answer.

For example, suppose we could answer "Does this n x n matrix have determinant larger than k?" in O(n2) time. Then, by using binary search, we could find the (ceiling of the) determinant itself in O(n2log d) time, where d is the determinant; notice that d is not the size of the input, but the size of the output.

See also

External links


References

  • Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Section 6.2.1: Searching an Ordered Table, pp.409–426.
  • Kruse, Robert L.: "Data Structures and Program Design in C++", Prentice-Hall, 1999, ISBN 0-13-768995-0, page 280.