Semi-structured data

In database research (computer science), semi-structured data is information that does not conform to a general, predefined structure but instead carries part of its structural information with it.

While structured data management must be based on a database model that describes the form of the data elements (objects), there is no such model for semi-structured data.

Semi-structured data do not have to conform to a type model, so a collection of semi-structured data can be extended as required. A structural model can be inferred from the data afterwards.

With the help of grammars and lexicons, semi-structured data can be brought into a form that has the following characteristics:

(E1) The data collection consists of one or more sequences of objects.

(E2) Objects can either be broken down into attributes (complex objects) or they are atomic objects.

(E3) Atomic objects contain values of a known, elementary data type.

Semi-structured data with the properties (E1), (E2) and (E3) are referred to as well-formed semi-structured data.
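The three properties can be read as a recursive check. The following is a minimal sketch, not a definitive implementation. It assumes that a data collection is given as a Python list of sequences, that complex objects are dictionaries mapping attribute names to objects, and that `str`, `int`, `float` and `bool` stand in for the known elementary data types. The names `ATOMIC_TYPES`, `is_well_formed_object` and `is_well_formed` are hypothetical.

```python
from typing import Any

# Assumption: these Python types play the role of the elementary data types in (E3).
ATOMIC_TYPES = (str, int, float, bool)


def is_well_formed_object(obj: Any) -> bool:
    """(E2)/(E3): an object is atomic, or a mapping of attribute names to objects."""
    if isinstance(obj, ATOMIC_TYPES):
        return True
    if isinstance(obj, dict):
        return all(
            isinstance(name, str) and is_well_formed_object(value)
            for name, value in obj.items()
        )
    return False


def is_well_formed(collection: Any) -> bool:
    """(E1): the collection consists of one or more sequences of objects."""
    return (
        isinstance(collection, list)
        and len(collection) >= 1
        and all(
            isinstance(seq, list) and all(is_well_formed_object(o) for o in seq)
            for seq in collection
        )
    )


cars = [[{"Manufacturer": "Volkswagen", "Model": "Passat", "Mileage": 35600}]]
```

Here `is_well_formed(cars)` holds, whereas a collection containing a value of an unknown type, such as `None`, is rejected.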

The Object Exchange Model (OEM) has established itself as the de facto model for semi-structured data. Data exhibiting these properties can also be described as well-formed XML documents.

Isn't semi-structured also structured?

With one exception described below, semi-structured data cannot be accommodated in a structured database model. However, there are methods with which data types can be recognized from semi-structured data.

If the data types (classes) and thus also the relations are known, one has an entity-relationship model. Once this model exists, however, it can only be filled with data that fits its structure and no longer with further semi-structured data.

For semi-structured files that follow the OEM, the following also holds: the formal description of the OEM makes it possible to create a consistent, structured data model that can look like this:

[Figure: Relational data model for mapping semi-structured objects]

This data model contains only three basic types: the nodes, which represent the objects; the edges, which represent attributes or references; and the leaves, which represent the properties of the reference.

This means that all semi-structured objects of an OEM model can also be written into this data model. It is referred to below as the OEM DB model.
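One possible reading of such a three-type model is sketched below. It is an illustration under assumptions, not the schema from the figure: every object gets a row in a node table, every attribute becomes an edge row `(parent_id, label, child_id)`, and every atomic value becomes a leaf row `(node_id, value)`. The function name `to_oem_tables` and the row layouts are hypothetical.

```python
from itertools import count


def to_oem_tables(record: dict):
    """Flatten a nested record into three tables: nodes, edges and leaves."""
    ids = count(1)
    nodes = []   # one entry per object: node_id
    edges = []   # one row per attribute: (parent_id, label, child_id)
    leaves = []  # one row per atomic value: (node_id, value)

    def visit(obj) -> int:
        node_id = next(ids)
        nodes.append(node_id)
        if isinstance(obj, dict):
            # Complex object: each attribute becomes an edge to a child node.
            for label, child in obj.items():
                edges.append((node_id, label, visit(child)))
        else:
            # Atomic object: store its value in the leaf table.
            leaves.append((node_id, obj))
        return node_id

    visit(record)
    return nodes, edges, leaves


nodes, edges, leaves = to_oem_tables(
    {"Vehicle": {"Manufacturer": {"Name": "Volkswagen", "Place": "Wolfsburg"},
                 "Model": "Passat"}}
)
```

Because the three tables have the same shape for every input, any nested record can be stored without changing the schema, which is the point made in the text.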

Semi-structured data cannot be written into an arbitrary DB model; this is possible only with models that hold a single abstract data type for all objects.

ssd notation

In their book “Data on the Web”, Serge Abiteboul, Peter Buneman and Dan Suciu use the so-called ssd (semi-structured data) notation, which is less well known than the XML notation. However, this notation offers a very short and clear representation of semi-structured data:

Data records with attribute-value tuples are noted as follows:

{Manufacturer: "Volkswagen", Model: "Passat", Mileage: "35,600"}

The value of an attribute can in turn be a nested record:

{Vehicle: {Manufacturer: {Name: "Volkswagen", Place: "Wolfsburg"}, Model: "Passat", Mileage: "35,600"}}
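As an illustration, nested records of this kind map naturally onto nested dictionaries. The sketch below serializes such a dictionary back into ssd-style text. It assumes that attribute-value pairs are separated by commas and that all atomic values are written as quoted strings. The function name `to_ssd` is hypothetical.

```python
def to_ssd(value) -> str:
    """Render a nested dictionary as ssd-style text (tree-shaped data only)."""
    if isinstance(value, dict):
        # Assumption: pairs are separated by ", " inside curly braces.
        inner = ", ".join(f"{label}: {to_ssd(child)}" for label, child in value.items())
        return "{" + inner + "}"
    # Atomic values are always written as quoted strings in this sketch.
    return f'"{value}"'


record = {
    "Vehicle": {
        "Manufacturer": {"Name": "Volkswagen", "Place": "Wolfsburg"},
        "Model": "Passat",
        "Mileage": "35,600",
    }
}
text = to_ssd(record)
```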

So far, an element can contain data or attribute-value pairs and can have further elements subordinate to it, so the notation presented up to this point represents data as trees. Once the semi-structured data is described as an OEM model, at least the node elements can reference any other element of the data collection. This is possible because every element is assigned a unique ID, for example Vehicle: &o1. To reference one element from another, an attribute is given the unique ID of the target, for example Manufacturer: &o2. All references that do not point to elements subordinate to the referencing element itself are called cross-references here.

Because the directed edges make it possible to move in cycles within the graph, such data collections are referred to as cyclic. A cyclic graph is shown below in ssd notation:

{
Vehicle: &o1 {Model: "Passat",
      Mileage: "35,600",
      FirstRegistration: "02/2007",
      Manufacturer: &o2,
      Engine: &o3
},

Manufacturer: &o2 {Name: "Volkswagen",
       Place: "Wolfsburg",
       Products: {UsedCar: &o1,
       Engine: &o3}
},

Engine: &o3 {Name: "OttoV2",
       Fuel: "Petrol",
       Displacement: "2.0 litres",
       PS: "120"
       }
}
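Whether such a collection is cyclic can be checked with a depth-first search over the reference edges. The following is a minimal sketch, assuming that objects are stored in a dictionary keyed by their ID and that references are wrapped in a small `Ref` class. The labels are given in English and shortened, and the names `Ref`, `references` and `has_cycle` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A reference to another object by its ID, e.g. &o2."""
    oid: str


# Assumption: a reduced version of the vehicle example, keyed by object ID.
objects = {
    "o1": {"Model": "Passat", "Manufacturer": Ref("o2"), "Engine": Ref("o3")},
    "o2": {"Name": "Volkswagen", "Products": {"UsedCar": Ref("o1"), "Engine": Ref("o3")}},
    "o3": {"Name": "OttoV2", "Fuel": "Petrol"},
}


def references(value):
    """Yield every object ID referenced anywhere inside a value."""
    if isinstance(value, Ref):
        yield value.oid
    elif isinstance(value, dict):
        for child in value.values():
            yield from references(child)


def has_cycle(objs: dict) -> bool:
    """Depth-first search; reaching an object that is still being visited means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    state = {oid: WHITE for oid in objs}

    def visit(oid: str) -> bool:
        state[oid] = GREY
        for target in references(objs[oid]):
            if state[target] == GREY:
                return True
            if state[target] == WHITE and visit(target):
                return True
        state[oid] = BLACK
        return False

    return any(state[oid] == WHITE and visit(oid) for oid in objs)
```

For the example, `has_cycle(objects)` is true because the vehicle references its manufacturer, whose product list references the vehicle again.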

XML

In contrast, the notation of semi-structured data in XML, which has been standardized by the World Wide Web Consortium (W3C), is very widespread. XML serves as a data exchange format on the Internet and is also used as a data storage format in many applications.

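In XML, the attributes of so-called elements, whose names can be freely defined, are written in the following notation:

<element attribute_1="value_1" attribute_2="value_2" attribute_n="value_n" />

The ssd record

{Vehicle: {Model: "Passat"}}

looks like this in XML:

<Vehicle Model="Passat" />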

An element can contain further content and/or further sub-elements:

<element [attribute_1 = "value_1"] [attribute_2 = "value_2"] [attribute_n = "value_n"]> content1 <sub-element_1 /> <sub-element_2 /> .... </element>

Thus there are two possibilities within XML to specify properties of objects:

through XML attributes
through sub-elements

The ssd record shown above can also be described using a sub-element:

<Vehicle><Model>Passat</Model></Vehicle>
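The two ways of specifying a property can be compared with Python's standard `xml.etree.ElementTree` module. This is only an illustrative sketch; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same property, once as an XML attribute and once as a sub-element.
as_attribute = ET.fromstring('<Vehicle Model="Passat" />')
as_subelement = ET.fromstring("<Vehicle><Model>Passat</Model></Vehicle>")

# Attributes are read with get(), the text of a sub-element with findtext().
model_from_attribute = as_attribute.get("Model")
model_from_subelement = as_subelement.findtext("Model")
```

Both variables hold the same value, but a program reading the document has to know which of the two forms was used.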

Document Type Definitions

Another notation exists for XML documents, which is called DTD (Document Type Definition). This notation describes the structure of an XML document.

XML files with a DTD are more “structured” than XML files without a DTD. XML files without a DTD have no typing.

Elements (tags) and their attributes can be freely defined within an XML document, without restrictions. In principle, a DTD may define only some of the elements within the XML document. A DTD can be used to define which elements may exist and which attributes these elements may or must have; the set of permitted attribute values can also be restricted. In addition, the number of possible subordinate elements can be defined with DTDs. The types described in the DTD can be inferred.

Although the XML document is subject to an object description, we cannot speak of structured data.

Despite the possibility of further structuring with DTDs, we are still on the semi-structured level of data management. The reason for this is that, from a technical point of view, structured data is subject to a so-called data dictionary, which describes the structure of the data.

The structure of the entities includes, among other things, the relationships, attributes and values with their data types. Access to the stored data without a data dictionary is not possible.

It is different with semi-structured data, which are basically structured like a text file. The values of the attributes are also not defined with data structure specifications such as string, integer, float, date, number, etc., but are always represented as character strings.
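A short sketch with Python's standard `xml.etree.ElementTree` module illustrates this: the parser returns every value as a character string, and any conversion to a numeric type is left to the application. The element names and the mileage value are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<Vehicle><Model>Passat</Model><Mileage>35600</Mileage></Vehicle>"
)

# The parser delivers the value as text, not as a number.
raw = doc.findtext("Mileage")

# The application decides that this particular string is meant as an integer.
mileage = int(raw)
```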

This means that an XML file that has been validated with a DTD can be edited and changed independently of the DTD. Different XML files, which in turn can be validated with one and the same DTD, thus belong to the same equivalence class.

Since the structure of the DTD is derived from the processing algorithms, semi-structured data in XML with a DTD can only be generated by one program in one version and processed further by one program in one version, unless semantically oriented query or processing methods are used.

DTDs may also be generated by type recognition methods such as simulation (Abiteboul), since this method recognizes types of objects (“classes”). Program changes, as here in the analysis system, also require the DTD to be adapted.

In addition, the semi-structured conception offers the possibility that elements, which in this case describe words and sentence phrases, can follow one another as desired. The DTD notation offers parameter entities that allow any order and number of sub-elements of a higher-level element. With structured ER modeling, this is not possible in a direct way.

Literature

Serge Abiteboul, Peter Buneman, Dan Suciu: Data on the Web. From Relations to Semistructured Data and XML. Morgan Kaufmann Publishers, San Francisco, California 2000, ISBN 1-55860-622-X.
Francois Bry, Michael Kraus, Dan Olteanu, Sebastian Schaffert: Current catchphrase "Semi-structured data". 2001 (PDF, accessed April 26, 2011).