Unstructured data

from Wikipedia, the free encyclopedia

In business informatics and computational linguistics, unstructured data is digitized information that has no formalized structure and that computer programs therefore cannot access in aggregated form through a single interface. Examples are digital text in natural language and digital sound recordings of human speech.

Classification

A distinction is made between unstructured, semi-structured, and structured data. An e-mail, for example, has a certain structure: it contains a recipient, a sender, and possibly a subject line. This makes it semi-structured data. The content of the e-mail itself, however, has no structure.
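The e-mail case can be made concrete with a minimal sketch that uses Python's standard `email` module. The message text and addresses are invented for illustration. The header fields can be read by name, while the body comes back as a single free-text string.

```python
from email import message_from_string

# Hypothetical message, invented purely for illustration.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob, the numbers look better than last quarter. Can we talk tomorrow?\n"
)

msg = message_from_string(raw)

# Header fields are addressable by name: the semi-structured part.
headers = {name: msg[name] for name in ("From", "To", "Subject")}

# The body is one opaque string of natural language: the unstructured part.
body = msg.get_payload()

print(headers)
print(repr(body))
```

A program can filter or sort on `headers` directly, but anything it wants to know about `body` has to be extracted by further analysis.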

The automatic usability of unstructured data is limited because there is no data model for it and usually no metadata. In text documents, metadata and data are also mixed together. Modeling is necessary to derive structure from them. The term unstructured data is also used when documents are stored without any data warehousing in place. Such documents cannot be indexed and therefore cannot be searched together.
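To illustrate what indexing adds, the following is a minimal sketch of an inverted index over a handful of text documents. The file names, their contents, and the helper functions `tokenize`, `build_index`, and `search` are assumptions made for this example and do not describe any particular product.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical documents, invented purely for illustration.
documents = {
    "memo.txt": "The supplier contract expires in March.",
    "mail-17.txt": "Please renew the contract with the supplier.",
    "notes.txt": "Lunch menu for the March offsite.",
}

index = build_index(documents)
print(sorted(search(index, "supplier contract")))
```

Once such an index exists, separately stored documents can be queried together. Without it, each document would have to be opened and read individually.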

Significance

Much data is unstructured at its origin. It gains structure when it is brought into a schema through human intervention. The process of structuring can have disadvantages, since it is often associated with a loss of information. In the corporate environment, important information is often held in unstructured data, and failing to capture it can also cause legal problems. The fields of knowledge management and data management therefore deal with its integration and administration.
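The loss of information during structuring can be shown with a small sketch. The `Complaint` schema, the regular expressions, and the sample note are hypothetical and were chosen only to make the point visible. Whatever the schema has no field for is dropped.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Complaint:
    """Hypothetical target schema with only two fields."""
    order_id: Optional[str]
    amount_eur: Optional[float]


def to_record(text):
    """Extract the schema fields from free text; everything else is discarded."""
    order = re.search(r"order\s+#?(\d+)", text, re.IGNORECASE)
    amount = re.search(r"(\d+(?:\.\d+)?)\s*EUR", text)
    return Complaint(
        order_id=order.group(1) if order else None,
        amount_eur=float(amount.group(1)) if amount else None,
    )


# Hypothetical free-text note, invented purely for illustration.
note = (
    "Customer called about order #4711, wants a refund of 59.90 EUR "
    "and hinted that she may cancel her contract."
)

record = to_record(note)
print(record)
```

The resulting record holds the order number and the amount, but the hint about a possible cancellation, which may be the most important part of the note, does not survive the transformation.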

To give unstructured data structure, the open-source UIMA framework (Unstructured Information Management Architecture) is available. It is a framework for building applications that process unstructured information.

Handling of unstructured data

The following methods in particular can be used to structure the data:

  1. Text analysis and text mining have been on the market for many years, and the products are mature. Various small specialized vendors have developed tools for this, and some business intelligence vendors have acquired such technologies under market pressure. Text mining can be done manually, with statistical methods, with machine learning, or with natural language processing. It can supply terms and concepts for thesauri, which can become indispensable for further business intelligence analyses.
  2. Machine learning is based on statistical methods such as Bayesian classifiers, artificial neural networks, or latent semantic analysis (LSA). It is much more effective than classical statistical methods, but it is not applicable everywhere. It requires supervision and training of the machines and, as with data mining procedures, deep knowledge of the subject matter.
  3. Linguistic techniques can be faster than machine learning and sometimes more accurate. They can reduce ambiguity but still require human intervention. Their models are easier to understand than those of LSA and machine learning.
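
As an illustration of the Bayesian classifiers mentioned in item 2, the following is a minimal from-scratch sketch of a multinomial naive Bayes text classifier with add-one smoothing. The class name `NaiveBayes`, the labels, and the training sentences are assumptions made for this example. A real application would need far more training data and a proper evaluation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training data, invented purely for illustration.
classifier = NaiveBayes()
classifier.train("The invoice amount is wrong, please send a refund", "billing")
classifier.train("I was charged twice on my invoice", "billing")
classifier.train("The app crashes when I open the settings page", "technical")
classifier.train("Login fails with an error after the update", "technical")

print(classifier.classify("refund for a wrong invoice"))
print(classifier.classify("the app shows an error"))
```

The need for labeled examples in `train` reflects the point made above: the method depends on training and on someone who knows the subject well enough to assign the labels.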

