Data Stream Management System

from Wikipedia, the free encyclopedia

A data stream management system (DSMS) is a software system for managing continuous data streams. It is comparable to a database management system (DBMS), which manages databases. In contrast to a DBMS, in which a query runs once against static data and then terminates, a DSMS must be able to execute continuous queries over data streams. Special query languages such as the Continuous Query Language (CQL) can be used to formulate such queries.

Data stream management systems are still relatively new in the database world. Alongside a few initial general-purpose developments, there are a growing number of smaller projects with different focuses. Whereas static data is managed almost exclusively with general-purpose database management systems, streaming data is still often handled by systems developed or adapted specifically for the application.

Differences from a DBMS

Figure: Data processing in a DBMS
Figure: Data processing in a DSMS

In conventional database systems, short-lived queries are issued against a database that remains unchanged while the query is evaluated (see transaction system). A query is started and stays in the system until its result has been computed and returned; after that, it no longer exists in the system. In other words, the data is persistent and the queries are volatile. In a data stream management system, queries are installed once and remain in the system until they are explicitly removed. They are evaluated over constantly changing data, namely data streams. Their results are also updated continuously, so the results themselves form a data stream. Here the queries are persistent and the data is volatile. These two complementary principles are also known in information retrieval, for example, as ad hoc queries (new queries over the same documents) and routing tasks (new documents for fixed queries).

The following table compares the various characteristics of a Database Management System (DBMS) and a Data Stream Management System (DSMS):

Database Management System (DBMS)            | Data Stream Management System (DSMS)
Persistent data (relations)                  | Volatile data streams
Random access                                | Sequential access
One-time queries                             | Continuous queries
(Theoretically) unlimited secondary storage  | Limited main memory
Only the current state is relevant           | Arrival order is taken into account
Relatively low update rate                   | Possibly extremely high update rate
Little or no timing requirements             | Real-time requirements
Exact data is assumed                        | Stale or imprecise data
Plannable query processing                   | Variable data arrival and characteristics

Basic concepts

As the table above shows, a DSMS rests on some basic concepts that differ from those of a conventional DBMS. The most important are continuous queries and windows.

Continuous queries

A continuous query is installed in the system once and runs until it is removed. It has one or more input data streams and one or more output data streams. Its result is therefore not a single result set, as with a query in a DBMS, but itself a data stream. Results should be produced in near real time, so the latency between the arrival of new data and the output of a new result is highly relevant.
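
A continuous query can be pictured as a function that consumes one stream and lazily produces another. The following is a minimal Python sketch under that assumption, not the implementation of any particular DSMS; the name `continuous_filter` and the sample sensor readings are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Element = Dict[str, Any]


def continuous_filter(
    stream: Iterable[Element], predicate: Callable[[Element], bool]
) -> Iterator[Element]:
    """Hypothetical continuous query: runs as long as the input delivers data."""
    for element in stream:      # the input is potentially unbounded
        if predicate(element):
            yield element       # each result is emitted at once, forming an output stream


# Illustrative input only; a real stream would not be a finite list.
readings = [{"sensor": "a", "x": 3}, {"sensor": "b", "x": 12}, {"sensor": "a", "x": 15}]
alerts = list(continuous_filter(readings, lambda e: e["x"] > 10))
print(alerts)  # [{'sensor': 'b', 'x': 12}, {'sensor': 'a', 'x': 15}]
```

Because the function is a generator, it never needs the whole input at once, which mirrors the idea that the query stays installed while data keeps arriving.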

For a continuous query, it is important to define when a new output is produced. A time-driven model generates new outputs as a clock advances, for example the system clock; a new output could be generated once a minute. Alternatively, in an event-driven model, new outputs are produced when certain events occur in the data stream. For example, each new element in a stream could trigger a new output, since that element can affect the result at that point in time. This is called a tuple-driven model.
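
The difference between the two output models can be sketched with a running sum. This is an illustrative Python sketch; the function names are hypothetical, and the time-driven variant assumes that the clock is derived from element timestamps (in seconds, arriving in non-decreasing order) rather than from the system time.

```python
from typing import Iterable, Iterator, Optional, Tuple

Element = Tuple[float, float]  # (timestamp in seconds, value)


def tuple_driven_sum(stream: Iterable[Element]) -> Iterator[Tuple[float, float]]:
    """Emit an updated running sum after every arriving element."""
    total = 0.0
    for ts, value in stream:
        total += value
        yield ts, total


def time_driven_sum(
    stream: Iterable[Element], period: float
) -> Iterator[Tuple[float, float]]:
    """Emit the running sum once per `period` of stream time."""
    total = 0.0
    next_tick: Optional[float] = None
    for ts, value in stream:
        if next_tick is None:
            next_tick = ts + period      # first tick is one period after the first element
        while ts >= next_tick:           # one output for every tick that has passed
            yield next_tick, total
            next_tick += period
        total += value


events = [(0, 1.0), (20, 2.0), (70, 4.0), (130, 8.0)]
print(list(tuple_driven_sum(events)))     # [(0, 1.0), (20, 3.0), (70, 7.0), (130, 15.0)]
print(list(time_driven_sum(events, 60)))  # [(60, 3.0), (120, 7.0)]
```

The tuple-driven variant produces one output per input element, while the time-driven variant produces one output per minute of stream time regardless of how many elements arrived.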

Windows

Data streams are potentially infinite, so they produce a potentially infinite amount of data. However, only a limited amount of memory is available when processing continuous queries, which mostly takes place in main memory. Windows are one way to limit the amount of data that must be kept in memory. A second motivation follows from the nature of continuous queries: they are meant to deliver results for the current data flowing into the DSMS. Often only recent data is relevant, while older data is no longer needed for the current results. Windows therefore express a limit on the validity of data elements.

Windows restrict the view of a data stream to its most recent elements. Time-based and element-based (also: tuple-based) windows are the most common. In a time-based window, elements are kept in the system for a fixed, predefined period, for example 30 minutes. An element-based window holds at most a predefined number of elements, for example the 1000 most recent ones. An example of a query with a time-based window is: "Calculate the average of attribute 'x' over all data stream elements from the last 30 minutes."
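
The two window types can be sketched as small buffers that evict old elements on insertion. This is a minimal Python sketch; the class names `CountWindow` and `TimeWindow` are hypothetical, timestamps are assumed to be in seconds and to arrive in non-decreasing order, and `average` assumes at least one element has been inserted.

```python
from collections import deque
from typing import Deque, Tuple


class CountWindow:
    """Element-based window: keeps at most `size` of the most recent values."""

    def __init__(self, size: int) -> None:
        self.items: Deque[float] = deque(maxlen=size)

    def insert(self, value: float) -> None:
        self.items.append(value)  # a full deque drops its oldest value automatically

    def average(self) -> float:
        return sum(self.items) / len(self.items)


class TimeWindow:
    """Time-based window: keeps values newer than `span` seconds before the latest arrival."""

    def __init__(self, span: float) -> None:
        self.span = span
        self.items: Deque[Tuple[float, float]] = deque()

    def insert(self, ts: float, value: float) -> None:
        self.items.append((ts, value))
        while self.items[0][0] <= ts - self.span:  # evict expired elements
            self.items.popleft()

    def average(self) -> float:
        return sum(v for _, v in self.items) / len(self.items)


last_30_min = TimeWindow(span=30 * 60)
for ts, x in [(0, 10.0), (600, 20.0), (2000, 30.0)]:
    last_30_min.insert(ts, x)
print(last_30_min.average())  # 25.0 (the element from t=0 has expired)

last_two = CountWindow(size=2)
for x in [1.0, 2.0, 3.0]:
    last_two.insert(x)
print(last_two.average())  # 2.5
```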

Both element-based and time-based windows can be defined in different ways. The main distinction is between sliding and tumbling windows. The difference lies in the step size of the window, also called its periodicity. A sliding window advances with the data stream, so the step size is minimal: in an element-based window, exactly one element is removed for each new element added. The step size can be increased until it equals the window size; the result is called a tumbling window. Here the window is filled up to the specified size. When the next element arrives that would exceed that size, all previous elements become invalid at once, and a new window is built up step by step until it again reaches the maximum size. Time-based windows work analogously: a 30-minute window with a 30-minute step size is a tumbling window.
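
The role of the step size can be illustrated with one function that covers both cases. This is a simplified Python sketch with a hypothetical name `count_windows`; it assumes `1 <= step <= size` and, unlike the description above, reports only completely filled windows rather than windows that are still being built up.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def count_windows(stream: Iterable[int], size: int, step: int) -> Iterator[List[int]]:
    """Yield full element-based windows of `size`, advancing by `step` elements."""
    buffer: Deque[int] = deque(maxlen=size)
    since_last = 0  # elements seen since the last emitted window
    for element in stream:
        buffer.append(element)
        since_last += 1
        if len(buffer) == size and since_last >= step:
            yield list(buffer)
            since_last = 0


data = [1, 2, 3, 4, 5, 6]
print(list(count_windows(data, size=3, step=1)))  # sliding: [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
print(list(count_windows(data, size=3, step=3)))  # tumbling: [[1, 2, 3], [4, 5, 6]]
```

With a step size of 1, consecutive windows overlap in all but one element; with a step size equal to the window size, they do not overlap at all.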

One-pass paradigm

The resources in terms of computing time and storage space for calculating results on data streams are limited. Algorithms that process data streams therefore typically do not first save the data in full and then iterate over the entire data set to generate results, but process each individual element in the data stream only once. This is called the one-pass paradigm: a data element only passes through an algorithm once. If a new element reaches the algorithm, the result of the calculation is adjusted and no new access to the element is necessary at a later point in time. Therefore the algorithm does not have to save any old elements, only the current intermediate result.

A simple counter illustrates this. Suppose the number of elements is to be counted. When a new element arrives, the counter is incremented by one and stored, and the element can be discarded. Only the current count needs to be kept.
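
The counter can be extended to other aggregates that need only a constant-size intermediate result. As an illustration beyond the counter named in the text, the following Python sketch maintains a running mean in one pass; the class name `RunningMean` is hypothetical.

```python
class RunningMean:
    """One-pass aggregate: stores only a count and the current mean, never old elements."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        """Fold one new element into the intermediate result and return the new mean."""
        self.count += 1
        self.mean += (value - self.mean) / self.count  # incremental mean update
        return self.mean


agg = RunningMean()
results = [agg.update(v) for v in [2.0, 4.0, 9.0]]
print(agg.count, results)  # 3 [2.0, 3.0, 5.0]
```

Each element is touched exactly once and can be discarded immediately after `update` returns.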

Processing of streams and relations

Figure: Structure of a DSMS

Whereas conventional (relational) database systems manage data in tables (relations), a DSMS adds data streams as fundamental data objects. A data stream can be understood as a continuous sequence of time-value pairs. Because data streams are in principle infinite, they have to be converted into relations for processing. Conversely, relations can be converted back into data streams (see figure). Pure relations can be processed with conventional methods. Converting streams into other streams takes a detour through relations. The Continuous Query Language, which is based on SQL, offers various operators for these conversions.
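
The detour through relations can be sketched in a few lines. In the Python sketch below, a tuple-based window turns a stream into a sequence of relation states, and two relation-to-stream functions modeled on CQL's Istream (insertions) and Dstream (deletions) operators turn those states back into streams. The function names are hypothetical, and the sketch assumes set semantics with distinct elements, whereas CQL relations are bags.

```python
from typing import FrozenSet, Iterable, Iterator, List

Relation = FrozenSet[str]


def to_relations(stream: Iterable[str], rows: int) -> Iterator[Relation]:
    """Stream-to-relation: after each arrival, the relation holds the last `rows` elements."""
    window: List[str] = []
    for element in stream:
        window.append(element)
        window = window[-rows:]
        yield frozenset(window)


def istream(relations: Iterable[Relation]) -> Iterator[str]:
    """Relation-to-stream: emit elements that are new in each relation state."""
    previous: Relation = frozenset()
    for current in relations:
        yield from sorted(current - previous)
        previous = current


def dstream(relations: Iterable[Relation]) -> Iterator[str]:
    """Relation-to-stream: emit elements that dropped out of each relation state."""
    previous: Relation = frozenset()
    for current in relations:
        yield from sorted(previous - current)
        previous = current


states = list(to_relations(["a", "b", "c", "d"], rows=2))
print(list(istream(states)))  # ['a', 'b', 'c', 'd']
print(list(dstream(states)))  # ['a', 'b']
```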

Formulation, planning and optimization of queries

As in conventional database systems, queries are formulated in a declarative language and optimized for execution with the help of a query plan. Because as many queries as possible should be processed concurrently, the installed queries are combined so that shared subqueries can be reused.

A plan consists of operators, queues, and states. The operators correspond to those known from conventional databases, such as filtering, sorting, joins, and mathematical operators, plus operators for the input and output of data streams. The operators of a plan are connected by queues, into which data objects are written sequentially and from which the next operator reads them in the same order. States hold intermediate results, such as the content of a specified window.
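
The interplay of operators, queues, and states can be sketched as follows. This is an illustrative Python sketch rather than the architecture of a specific DSMS; the classes `Filter` and `WindowSum` and the naive scheduler `run` are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


class Filter:
    """Stateless operator: forwards elements that satisfy a predicate."""

    def __init__(self, predicate: Callable[[int], bool], inbox: Deque[int], outbox: Deque[int]) -> None:
        self.predicate, self.inbox, self.outbox = predicate, inbox, outbox

    def step(self) -> bool:
        if not self.inbox:
            return False
        element = self.inbox.popleft()  # queues are read in arrival order
        if self.predicate(element):
            self.outbox.append(element)
        return True


class WindowSum:
    """Stateful operator: its state is the content of a window of `size` elements."""

    def __init__(self, size: int, inbox: Deque[int], outbox: Deque[int]) -> None:
        self.state: Deque[int] = deque(maxlen=size)
        self.inbox, self.outbox = inbox, outbox

    def step(self) -> bool:
        if not self.inbox:
            return False
        self.state.append(self.inbox.popleft())
        self.outbox.append(sum(self.state))
        return True


def run(operators: List) -> None:
    """Naive scheduler: step every operator in turn until no queue has work left."""
    progressed = True
    while progressed:
        progressed = False
        for op in operators:
            if op.step():
                progressed = True


source: Deque[int] = deque([1, 8, 3, 9, 7])
middle: Deque[int] = deque()
sink: Deque[int] = deque()
run([Filter(lambda x: x > 2, source, middle), WindowSum(2, middle, sink)])
print(list(sink))  # [8, 11, 12, 16]
```

A real system would schedule operators more carefully and bound queue sizes; the sketch only shows how queues decouple operators and where state lives.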

Example

Figure: DSMS query plan (DSMS-Anfrageplan.png)

A news portal wants to display on its page the latest news on the currently most discussed topics, as well as the news volume for a day. News items arrive in one data stream, and the currently important topics arrive in another data stream as the "zeitgeist". Each news item is assigned to one topic. Specifically, the page should show the titles of news items from the last hour on the last 10 topics, as well as the number of all related news items within the last 24 hours. Formulated in CQL, these are two queries (the identifiers are German: Nachrichten = news, Thema = topic, Titel = title):

Q1: SELECT Titel FROM Nachrichten N [RANGE 1 HOUR], Zeitgeist Z [RANGE 10] WHERE N.Thema = Z.Thema

Q2: SELECT COUNT(*) FROM Nachrichten N [RANGE 1 DAY], Zeitgeist Z [RANGE 10] WHERE N.Thema = Z.Thema

The DSMS uses these queries to create a plan that is as efficient as possible, which could look like the one shown in the figure. The titles and topics of the news items are first projected and placed in a queue. The zeitgeist topics are likewise placed in a queue and from there into a window of length 10. News items and the topic window are combined by a JOIN operator, and the result flows into a window that holds all matching news items for one day. The result of query Q2 is computed from this window using the COUNT operator. For query Q1, the larger window is followed by a smaller window of one hour.
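
The idea of sharing work between Q1 and Q2 can be simulated in a few lines of Python. This sketch is an assumption-laden illustration, not the plan from the figure: the class `NewsPlan` and its methods are hypothetical, timestamps are in seconds and arrive in order, topics in the zeitgeist window are assumed to be distinct, and the join with the topic window is evaluated when a query is read rather than before the one-day window.

```python
from collections import deque
from typing import Deque, List, Tuple

HOUR = 3600
DAY = 24 * HOUR


class NewsPlan:
    """Both queries share one topic window and one 24-hour window of news items."""

    def __init__(self, topic_rows: int = 10) -> None:
        self.topics: Deque[str] = deque(maxlen=topic_rows)        # Zeitgeist, last 10 topics
        self.day_window: Deque[Tuple[float, str, str]] = deque()  # Nachrichten, last 24 hours

    def on_topic(self, topic: str) -> None:
        self.topics.append(topic)

    def on_news(self, ts: float, title: str, topic: str) -> None:
        self.day_window.append((ts, title, topic))  # projection: only title and topic are kept

    def _expire(self, now: float) -> None:
        while self.day_window and self.day_window[0][0] <= now - DAY:
            self.day_window.popleft()

    def q1(self, now: float) -> List[str]:
        """Titles from the last hour on current topics (smaller window over the shared one)."""
        self._expire(now)
        return [title for ts, title, topic in self.day_window
                if ts > now - HOUR and topic in self.topics]

    def q2(self, now: float) -> int:
        """Number of news items from the last 24 hours on current topics."""
        self._expire(now)
        return sum(1 for _, _, topic in self.day_window if topic in self.topics)


plan = NewsPlan()
plan.on_topic("election")
plan.on_topic("weather")
plan.on_news(0, "Polls open", "election")
plan.on_news(10 * HOUR, "Storm warning", "weather")
plan.on_news(10 * HOUR + 1800, "Football final", "sport")
plan.on_news(10 * HOUR + 3000, "Results in", "election")

now = 10 * HOUR + 3500
print(plan.q1(now))  # ['Storm warning', 'Results in']
print(plan.q2(now))  # 3
```

Both queries read from the same one-day buffer, so the news stream is stored only once, which is the kind of reuse the plan above aims for.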
