Metadata

Metadata or meta information is structured data that contains information about characteristics of other data.

The data described by metadata are often larger collections of data such as documents, books, databases or files. Information about the properties of an individual object (for example, a person's name) is also referred to as its metadata.

Computer users are often unaware that their data carries metadata that is not immediately visible, and that this metadata may be of greater use to cybercriminals or authorities than the data itself.

Introductory examples

Typical metadata for a book are, for example, the name of the author, the edition, the year of publication, the publisher and the ISBN. The metadata of a computer file includes the file name, the access rights and the date of the last modification.
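The file metadata mentioned above can be read with a few lines of Python. The following is a minimal sketch using only the standard library; the helper name describe_file and the temporary file are assumptions made for this illustration.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return a few metadata fields for the file at `path` (hypothetical helper)."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        # Access rights rendered in the familiar "-rw-r--r--" style.
        "permissions": stat.filemode(info.st_mode),
        # Date of the last modification as a timezone-aware timestamp.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "size_bytes": info.st_size,
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("the actual data")
    example_path = handle.name

file_metadata = describe_file(example_path)
os.remove(example_path)
print(file_metadata)
```

None of the returned fields is part of the file's content; the file system keeps them alongside it.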

Differentiation between data and metadata

While the term metadata is relatively new, the underlying principle of references and formal descriptions has been used in library practice for centuries. A valid distinction between metadata and ordinary data can, however, only be drawn for a specific case, since the designation depends on the point of view. For the reader of a book, the content is the actual data, while the name of the author or the edition number is metadata. For the publisher of a book catalog, on the other hand, these two details are properties of books in general: "Author" and "Edition number" are the metadata, and the specific values ("Karl May", "17") are the actual data.

Intended use

When trying to differentiate between data and metadata, it is helpful to introduce the notion of “purpose”. The purpose determines the result: to accomplish a specific purpose, that is, to achieve a specific result, metadata is needed. The result can itself consist of data, and metadata in its role as data can be part of the result.

Examples:

Purpose: Search within a library for the locations (call numbers) of all available books by a specific author
Metadata: "Name of the author" and "Available"
Result: "Call number" (the location can be found via the call number)
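The library example can be sketched in Python as follows. The record layout, the function name find_call_numbers and all catalog entries, including the call numbers, are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    author: str
    title: str
    available: bool
    call_number: str


def find_call_numbers(catalog, author):
    """Select entries via the metadata fields `author` and `available`.

    The result is itself a piece of metadata: the call number.
    """
    return [
        entry.call_number
        for entry in catalog
        if entry.author == author and entry.available
    ]


# Hypothetical sample data; the call numbers follow no real scheme.
catalog = [
    CatalogEntry("Karl May", "Winnetou I", True, "LIT 830 MAY 1"),
    CatalogEntry("Karl May", "Old Surehand", False, "LIT 830 MAY 7"),
    CatalogEntry("Jane Doe", "Example Title", True, "LIT 820 DOE 1"),
]
print(find_call_numbers(catalog, "Karl May"))
```

The content of the books never enters the search; only metadata fields are read and returned.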

Use

In many cases there is no conscious separation between the object level and the meta level. For example, one speaks of looking for a book in a catalog, not of looking for its metadata. When metadata is used, it is often expected to be directly coupled with the user data, forming an inseparable part of a closed, self-describing system.

Metadata is often used to describe information resources, making them easier to find and allowing relationships between the materials to be established. As a rule, this requires cataloging with a certain degree of standardization (e.g. through library cataloging rules).

Storage

There are various options for storing metadata:

In the document itself. A book, for example, records its author and year of publication. In HTML documents, the <meta> element is used to specify, for example, the language, author, company or keywords.

In associated reference works, for example in the library catalog for a book in a library.

For computer files, in the file attributes. Most file systems only allow well-defined metadata in file attributes; others (such as HPFS with its extended attributes) allow any data to be associated with a file. It is also common to encode the file type in the file itself or its name, typically in the filename extension or as a magic number at the beginning of the file.
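The two ways of recording the file type can be compared in a short Python sketch. The lookup table covers only a handful of well-known signatures, and the function names are assumptions for this example.

```python
import os

# A few well-known magic numbers (leading byte sequences); far from complete.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\xff\xd8\xff": "jpeg",
}


def type_from_extension(filename):
    """Read the file type from the filename extension, if there is one."""
    extension = os.path.splitext(filename)[1]
    return extension[1:].lower() or None


def type_from_magic(data):
    """Read the file type from the first bytes of the file content."""
    for magic, kind in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return kind
    return None


# A file whose name claims PNG while its content starts like a PDF.
header = b"%PDF-1.7\n"
print(type_from_extension("report.PNG"), type_from_magic(header))
```

The extension is metadata kept outside the content and is easy to change, while the magic number travels with the bytes of the file, so the two can disagree.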

There are a number of data formats and data models for storing and transferring metadata, such as Dublin Core or Exif, which can be represented in different formats, including human-readable ones.

Interoperable metadata

In technical terms, “operable” means “designed so that it can be used and operated on”. The prefix “inter” comes from Latin and means “between”. Interoperable metadata are metadata from potentially different sources that are related to one another (“inter”) in such a way that it is possible to work (“operate”) with them together.

Standards for interoperable metadata have the task of making metadata from different sources usable together. To do this, they first of all cover the aspects of semantics, data model and syntax.

The semantics describe the meaning, which is usually defined by standardization bodies (see Dublin Core). The data model defines which structure the metadata can have. In connection with metadata, “data” can be understood as the statements made about an object to be described (document, resource, ...), and “model” as a description of how these statements are structured; in the context of metadata, the term data model thus means something like “grammar” or “structure of statements”. Examples of metadata data models are simple attribute/value pairs (e.g. HTML meta elements) or statements with subject, predicate and object (e.g. triples in RDF). The syntax, finally, is used to represent the statements formed according to the data model. An example of a representation format is XML (Extensible Markup Language).
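The two data models named here can be placed side by side in a small Python sketch. It collects attribute/value pairs from HTML meta elements using the standard library parser and then restates them as subject/predicate/object tuples. The class MetaCollector, the function to_triples, the page content and the example.org address are all assumptions made for this illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.pairs[attributes["name"]] = attributes["content"]


def to_triples(subject, pairs):
    """Restate attribute/value pairs as (subject, predicate, object) tuples."""
    return [(subject, predicate, value) for predicate, value in pairs.items()]


page = """<html><head>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="metadata, example">
</head><body></body></html>"""

collector = MetaCollector()
collector.feed(page)
triples = to_triples("https://example.org/page", collector.pairs)
print(collector.pairs)
print(triples)
```

In the attribute/value form the described object is implicit (the page itself), whereas each triple names its subject explicitly.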

The following relationship exists between these three aspects: the semantics are represented by constructs of the data model. The data model is in turn represented by syntactic constructs. The syntactic constructs are finally composed of characters from an agreed character set (such as Unicode). These three aspects can be understood as hierarchically stacked layers, since each layer builds on the layer below. The layers are independent of one another, i.e. the choice of a specific standard in one layer is independent of the other layers (as in the layer models of network communication, for example the ISO/OSI model). A given semantics can be represented by constructs of different data models (e.g. attribute/value pairs, triples), which in turn can be represented by different syntaxes (graphs, XML formats).

The fourth aspect, orthogonal to these layers, is identification, which affects all three layers. In order to process metadata from different sources in a meaningful way, it must be unambiguously identifiable (worldwide) which semantics, which data model and which syntax are involved. This requires an identification mechanism such as that provided by URIs (Uniform Resource Identifiers).

Generic framework

All four aspects - semantics, data model, syntax and identification - are required to create standards for interoperable metadata. They can therefore be grouped together in a framework. A framework provides a kind of basic structure that describes the most important elements or components of a system and their relationships, without making precise specifications about their design. It thus functions as a "reference system" that allows new components to be integrated in a meaningful way. Since a framework shows elements and their relationships, it can easily be visualized by arranging the elements graphically. The figure “Generic Framework” shows a framework for metadata on a meta level. In contrast to specific forms of frameworks, i.e. the instance level, it describes a generalized framework on the meta level, recognizable by the generic names of its components.

An example of a concrete framework for metadata is RDF (Resource Description Framework) of the World Wide Web Consortium (W3C). RDF contains all four of the above aspects in a specific form, as shown in the figure.

RDF as a framework for metadata

The components in detail:

Semantics: Domain-specific semantics can be imported via namespaces, so that the semantics of an RDF vocabulary can be extended as required
Data model: RDF has a fixed data model that allows statements about resources in the form of triples with subject, predicate and object
Syntax: Any syntax can be used to represent such statements, for example RDF/XML, graphs or the N-Triples notation; however, RDF/XML is the normative syntax
Identification: URIs are mandatory as a universal identification mechanism

Following the idea of a framework, RDF itself does not define any domain-specific semantics, but only specifies a mechanism by which further semantics can be integrated via namespaces identified by URIs. What RDF does define is a common data model in the form of triples and the universal use of URIs as an identification mechanism. URIs are used to identify the individual components of a triple (subject, predicate, object) as well as their values and data types. The concrete syntax for representing the triples can, again following the idea of a framework, be chosen freely, with RDF/XML provided as the standard. With RDF Schema, RDF also includes a schema language for defining custom metadata vocabularies.
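The separation of data model and syntax can be illustrated with a small Python sketch that holds triples in memory and writes them out in the line-based N-Triples notation. This is a deliberately simplified serializer written for this example, not a complete implementation: it only handles IRIs and plain string literals, and the book and publisher addresses are hypothetical. The predicates use the Dublin Core element namespace.

```python
from dataclasses import dataclass

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


def term_to_ntriples(term):
    """Render one term; only IRIs and plain string literals are supported."""
    if isinstance(term, IRI):
        return f"<{term.value}>"
    escaped = (
        term.value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """Write each (subject, predicate, object) triple as one line."""
    return "\n".join(
        f"{term_to_ntriples(s)} {term_to_ntriples(p)} {term_to_ntriples(o)} ."
        for s, p, o in triples
    )


book = IRI("https://example.org/book/1")  # hypothetical resource
triples = [
    (book, IRI(DC + "creator"), Literal("Karl May")),
    (book, IRI(DC + "publisher"), IRI("https://example.org/publisher/1")),
]
ntriples = to_ntriples(triples)
print(ntriples)
```

The list of triples is the data model; to_ntriples is only one possible syntax for it, and the same list could equally be written out as RDF/XML or drawn as a graph.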

RDF Schema relates to RDF as XML Schema relates to XML. An RDF schema is itself a valid RDF document, just as an XML schema is itself a valid XML document. In both cases, the schema is written in a specialized subset of the language it describes. However, while XML Schema describes syntactic restrictions (e.g. element names and frequency of occurrence), RDF Schema describes semantic restrictions, for example that a property "hasPublished" may only be used on instances of the class "human" or "legal entity", but not on instances of the class "animal". In schema terms, the property "hasPublished" has the domain "human" or "legal entity".
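The "hasPublished" example can be mimicked in a few lines of Python. The sketch follows the reading given in the text and treats the domain as a check on statements; an actual RDF Schema processor would typically use domain declarations to infer class membership rather than to reject statements. All names and data below are invented for the illustration.

```python
# Allowed subject classes per property, as described in the text.
ALLOWED_DOMAINS = {"hasPublished": {"Human", "LegalEntity"}}


def check_domain(statements, class_of):
    """Return the statements whose subject lies outside the property's domain."""
    violations = []
    for subject, predicate, obj in statements:
        domain = ALLOWED_DOMAINS.get(predicate)
        if domain is not None and class_of.get(subject) not in domain:
            violations.append((subject, predicate, obj))
    return violations


# Hypothetical instances and their classes.
class_of = {"alice": "Human", "acme": "LegalEntity", "rex": "Animal"}
statements = [
    ("alice", "hasPublished", "book1"),
    ("acme", "hasPublished", "report7"),
    ("rex", "hasPublished", "book2"),
]
violations = check_domain(statements, class_of)
print(violations)
```

All three statements are syntactically identical triples; only the semantic rule about the subject's class singles out the third one.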

Just as XML, following the principles of simplicity and extensibility, fundamentally changed the world of data by making it possible to define data formats that can be exchanged between different systems and programs through a uniform syntax, a standardized type system and its text-based nature, RDF tries to change the world of metadata by introducing a uniform data model. As a framework, RDF also builds on proven principles such as simplicity and extensibility.

Examples in application areas

The following sections provide examples and standard formats for metadata in application areas.

Metadata in Statistics

In statistical databases, data that does not directly represent the content of a statistic is referred to as metadata, such as industry or occupational titles, directories of municipalities and other catalogs. Statistical metadata also includes descriptions of the data fields in survey forms, and possibly complete form descriptions. The actual statistical data, as opposed to metadata, is referred to as microdata and macrodata.

In survey research, specific metadata about the survey is referred to as paradata.

Metadata for geospatial data

In the INSPIRE Directive and in the German law on access to digital spatial data based on it (Geodata Access Act - GeoZG) there is a legal definition of metadata in the field of spatial information processing: "Metadata is information that describes spatial data or spatial data services and enables spatial data and spatial data services to be determined, included in directories and used." (§ 3 Paragraph 2 GeoZG)

Metadata in software development

In software development, the term metadata is used for various purposes:

Parts of a program's source text that are evaluated not by the actual translation tool (usually a compiler) but by additional tools are called metadata. This metadata is mostly used for documentation or, in the form of annotations, for code generation. Examples are annotations in Java or attributes in the .NET Framework.

A form that differs from classic programming is the use of metadata in universal software. Most of the required application functions are available precompiled and are called and parameterized via a metadata engine. The desired target application must first be described declaratively using specific metadata. This approach is followed in particular by data warehouse and business intelligence products. Some manufacturers such as Tenfold, Data-Warehouse GmbH and Scopeland Technology also apply this principle to the creation of database applications that write data.

Metadata is also understood to be the definition of data sets in a data dictionary of a database.

The information held in a version control system can also be used as metadata. It often makes it possible to identify the author of each line of program code. For this purpose, the user data (the source code) is correlated with metadata from the version control repository. In many version control systems (such as Git and SVN) the built-in command for this is called blame.

Metadata in music recordings

Typical metadata for music and other sound recordings are, for example, title, artist, composer, publication date, music publisher or the ISRC number. With digital sound recordings it is possible to store this meta information directly in the file (for example in the ID3 tag of MP3 files).
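As a rough illustration of metadata stored inside the file, the following Python sketch reads an ID3v1 tag, the older fixed-size variant of ID3 that occupies the last 128 bytes of an MP3 file. The function name parse_id3v1 and all sample values are made up for this example; the newer ID3v2 tags use a different layout and are not handled here.

```python
import struct

# ID3v1 layout: "TAG", title, artist, album, year, comment, genre byte.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")  # 128 bytes in total


def parse_id3v1(data):
    """Return the ID3v1 fields found at the end of `data`, or None."""
    if len(data) < ID3V1.size:
        return None
    tag, title, artist, album, year, comment, genre = ID3V1.unpack(
        data[-ID3V1.size:]
    )
    if tag != b"TAG":
        return None

    def text(raw):
        # Fields are padded with NUL bytes; ID3v1 text is nominally Latin-1.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "comment": text(comment),
        "genre": genre,
    }


# Build a fake file in memory: placeholder audio bytes followed by a tag.
fake_audio = b"\xff\xfb" + b"\x00" * 64
fake_tag = ID3V1.pack(
    b"TAG", b"Example Title", b"Example Artist", b"Example Album", b"2004", b"", 0
)
id3_fields = parse_id3v1(fake_audio + fake_tag)
print(id3_fields)
```

Because the tag sits at a fixed position relative to the end of the file, a player can read it without decoding any of the audio data.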

In addition to the primary data required to create a conventional music library, there is much more complex content-related music metadata. This includes, for example, style, main and secondary instruments, genre, tempo, key, dynamics, vocal character and the description of moods and scenes. According to Wilbert Hirsch, composer and pioneer of music categorization, this content-related metadata is referred to as “secondary music metadata”. Although far more laborious to index, this secondary metadata forms the basis for content-based music categorization.

Digital image metadata

Metadata of digital photos, such as the date and time of the photo, focal length, aperture, exposure time and other technical parameters (possibly also the geographical coordinates of the photo location), are now stored by almost all digital cameras at the beginning of the image file in Exif format. Using suitable software, a digital image (photo, scan or graphic) can additionally be given metadata in IPTC format. This essentially covers the image title, image description, location (GPS coordinates, city, state, country), author (photographer) or copyright holder, contact details of the copyright holder or licensor, copyright provisions and search terms (keywords). Many image editing programs add or change metadata when digital photos (or images in general) are edited, so that conclusions can be drawn about the image editing software used.

Metadata when communicating on the Internet

The Internet protocol suite follows a layer model. This can be illustrated using the standard for sending e-mails as an example. The protocol commonly used to transfer e-mails is the Simple Mail Transfer Protocol (SMTP). Its position in the Internet protocol layers can be specified precisely:

SMTP in the TCP/IP protocol stack:
Application	SMTP
Transport	TCP
Internet	IP (IPv4, IPv6)
Network access	Ethernet	Token bus	Token ring	FDDI	...

From the point of view of the senders and recipients of e-mails, all layers below the application layer can be viewed as metadata. This becomes particularly apparent when the application layer is encrypted. Even then, the lower layers (TCP/IP) still expose enough information to determine the sending and receiving servers (often corresponding to the domain part of an e-mail address) as well as the length of the message and the time it was sent. In the case of frequent e-mail traffic between two parties, the mere frequency information can allow an investigating third party to draw conclusions about the content of the e-mails.
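How little is needed for such an analysis can be sketched in Python. The records below contain only what would remain visible with encrypted content: a timestamp, the two servers and the message size. All values and host names are invented, and the function name contact_frequency is an assumption for this example.

```python
from collections import Counter

# (timestamp, sending server, receiving server, size in bytes); no content.
records = [
    ("2024-01-05T09:12:00", "mail.example.org", "mx.example.net", 18432),
    ("2024-01-05T09:40:00", "mail.example.org", "mx.example.net", 2210),
    ("2024-01-06T22:03:00", "mail.example.org", "mx.example.net", 950),
    ("2024-01-07T08:15:00", "mail.example.com", "mx.example.net", 4096),
]


def contact_frequency(records):
    """Count how often each (sender, receiver) pair communicated."""
    return Counter((sender, receiver) for _, sender, receiver, _ in records)


frequency = contact_frequency(records)
print(frequency.most_common(1))
```

The count alone already singles out the most active pair of servers, without a single message body having been read.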

In principle, the same situation arises with other network protocols, such as instant messaging services or the World Wide Web. In general, one speaks in this context of traffic data or marginal data (arising from the use of electronic infrastructure).

According to Section 206 (5) of the German Criminal Code, telecommunications secrecy covers not only the content of a telecommunication but also "its specific circumstances, in particular the fact whether someone is or was involved in a telecommunication process".

Social criticism

The Italian philosopher and media theorist Matteo Pasquinelli put forward the thesis that the data explosion makes a new form of control possible: a “society of metadata”. Metadata could be used to establish new forms of biopolitical control over masses and behavior, for example by evaluating online activities in social networks or passenger flows in public transport. Pasquinelli sees the problem not in individuals being monitored at every turn, as in totalitarian systems, but in their being measured, so that society as an aggregate becomes predictable and controllable.

Literature

Gunnar Auth: Metadata - Basics and Importance in Data Warehousing. In: Gunnar Auth: Process-oriented organization of metadata management for data warehouse systems. BoD, Norderstedt 2004, ISBN 978-3-8334-1926-3, pp. 27-74.
Ingrid Schmidt: Modeling of metadata. In: Henning Lobin; Lothar Lemnitzer: Text technology. Perspectives and Applications. Stauffenburg, Tübingen 2004, ISBN 3-86057-287-3, pp. 143-164.
Ulrich Hambuch: Success factor metadata management: The relevance of metadata management for data quality in business intelligence. Vdm, Saarbrücken 2008, ISBN 3-639-07879-9.

Web links

Martin Warnke: Data and metadata. Online resources for image science; zeitenblicke.de, 2003
Metadata Standards Crosswalk. Getty Standards and Digital Resource Management Program (English)

References

[1] Wiretapping scandal: metadata is often more informative than the actual content. In: datensicherheit.de. September 23, 2013, accessed September 11, 2017.
[2] Adrian Lobe: Philosophy - The Society of Metadata. In: Süddeutsche.de. July 31, 2018, accessed September 3, 2018.