Data stream algorithm

In computer science, a data stream algorithm is an algorithm that reads the data of one or more data streams sequentially and processes it directly ("online").

Application

Many of today's applications in computer science have to process a data stream because the amount of data is extremely large and supplied continuously. This is the case, for example, when recording routing data in networks, recording telecommunications data, processing bank transactions, or following stock market tickers.

Mathematical perspective and efficiency requirements

The continuously accumulating data is modeled as a stream, that is, a sequence of input characters whose length is often unknown but is assumed to be very large.

An algorithm processing the stream may only read it character by character; random access, i.e. "jumping" to arbitrary input characters, is not permitted.

In the data stream scenario, the amount of data imposes essentially two efficiency requirements: the space complexity of the data stream algorithm should be sublinear, ideally logarithmic or polylogarithmic, and so should the computing time per input character.

Certain problems can still be solved exactly by data stream algorithms, since the entire input can be read. Nevertheless, the requirements of sublinear storage space and sublinear computing time per input character often make this impossible, so that only approximate solutions can be given and randomization has to be used.

This is because, with sublinear storage space, a data stream algorithm cannot store the entire input, but only a summary of what has been seen so far. The algorithm is said to store a sketch of the input seen so far.

In the following example an algorithm is presented that can solve the given problem exactly.

Examples

Number of elements

The number of elements in a data stream can easily be determined with a counter. The memory requirement can be further reduced with randomized algorithms.
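As an illustration of both statements, the following sketch shows an exact counter next to one well-known randomized technique, the Morris approximate counter. The text does not say which randomized algorithm is meant, so the choice of the Morris counter is an assumption, and the function names `exact_count` and `morris_count` are hypothetical.

```python
import random


def exact_count(stream):
    """Count the elements in one pass using a single integer counter."""
    count = 0
    for _ in stream:
        count += 1
    return count


def morris_count(stream, rng=None):
    """Approximate the number of elements while storing only an exponent x.

    Assumption: this is the Morris counter, used here as one example of a
    randomized counting algorithm. The result is an estimate, not the
    exact count.
    """
    if rng is None:
        rng = random.Random()
    x = 0
    for _ in stream:
        # Increment the exponent with probability 2**-x.
        if rng.random() < 2.0 ** -x:
            x += 1
    # 2**x - 1 is an unbiased estimate of the number of elements seen.
    return 2 ** x - 1
```

The exact counter needs about log n bits for n elements, whereas the exponent `x` grows only to roughly log n and therefore needs about log log n bits. The price is accuracy: a single run can deviate considerably from the true count, so in practice several independent counters are often averaged.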

Missing number

Let $\pi$ be a permutation of the numbers $\{1, \dots, n\}$ with one missing element $s$.

A simple way to find the missing number would be to collect all the numbers, sort them, and then scan the sorted sequence for the missing item. To do this, however, all numbers would have to be stored. The memory consumption of this algorithm is $n \cdot 4$ bytes if each number is stored as a 32-bit integer. For $n = 1{,}000{,}000{,}000$, for example, around 3.7 GB would have to be stored. To achieve adequate performance, this data would have to be kept in main memory, which is not possible on most PCs because of the large volume of data. The hard drive would therefore have to be accessed, which slows this algorithm down considerably.
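The sorting approach described above can be sketched as follows. This is a minimal illustration only; the function name `missing_number_by_sorting` and the convention of passing `n` as a parameter are assumptions.

```python
def missing_number_by_sorting(stream, n):
    """Find the missing number by storing and sorting the whole input.

    Memory use is linear in n, which is exactly what a data stream
    algorithm is supposed to avoid.
    """
    numbers = sorted(stream)  # stores all n - 1 numbers
    for expected, actual in enumerate(numbers, start=1):
        if actual != expected:
            return expected
    # All of 1..n-1 were present, so n itself is missing.
    return n
```

For the input 3, 1, 5, 2 with n = 5, the sorted sequence is 1, 2, 3, 5, and the scan reports 4 as the first number that does not appear in its expected position.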

If all numbers $1, \dots, n$ were contained in the data stream, the sum of the elements of the stream would be $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$ according to the Gaussian sum formula. Therefore, by summing the $n-1$ elements contained in $\pi$, the number $s$ sought can be determined after reading the entire input: $s = \frac{n(n+1)}{2} - \sum \pi$. This algorithm only needs to store one number, the running sum, from which it then determines $s$; the memory space is therefore only O(log n) bits. It is thus far more memory-efficient than the sorting approach.
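The sum-based algorithm can be sketched in a few lines. The function name `missing_number` is hypothetical, and the sketch assumes that `n` is known in advance and that the stream contains every number from 1 to n except one, each exactly once.

```python
def missing_number(stream, n):
    """Return the number from 1..n that does not occur in the stream.

    Only the running sum is stored, so the input is read once and the
    memory use does not grow with the length of the stream beyond the
    O(log n) bits needed for the sum.
    """
    total = 0
    for value in stream:
        total += value
    # Gaussian sum formula: 1 + 2 + ... + n = n(n + 1) / 2.
    return n * (n + 1) // 2 - total
```

For the input 3, 1, 5, 2 with n = 5, the expected sum is 15 and the observed sum is 11, so the missing number is 4. In a language with fixed-width integers, the running sum needs a type wide enough for n(n+1)/2; with n = 1,000,000,000 a 64-bit integer suffices.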