XML retrieval


XML retrieval, or XML information retrieval (XML IR), is the content-based retrieval of documents structured with the Extensible Markup Language (XML).

Requests

Most approaches to XML retrieval build on techniques from information retrieval (IR) and compute, for example, the similarity between a keyword query and a document. In XML retrieval, a query can also contain structural information. So-called content-and-structure (CAS) queries let the user specify the XML structure that must or may contain the desired search term.
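
As a minimal sketch of the difference between a content-only query and a CAS query, the following Python example filters elements first by a structural path and then by keyword. The sample document, the function name cas_match and the strict (exact-path) interpretation of the structural constraint are assumptions made for illustration; real XML IR systems typically also allow the structure to be matched vaguely.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
DOC = """
<article>
  <title>XML retrieval</title>
  <section>
    <title>Ranking</title>
    <p>Ranking combines content and structure.</p>
  </section>
  <footnote>See the section on ranking.</footnote>
</article>
"""


def cas_match(root, path, keyword):
    """Return the elements selected by `path` whose text contains `keyword`.

    `path` uses the limited XPath subset supported by ElementTree.
    """
    keyword = keyword.lower()
    return [
        el for el in root.findall(path)
        if keyword in "".join(el.itertext()).lower()
    ]


root = ET.fromstring(DOC)

# Content-only query: any descendant element mentioning "ranking".
content_only = cas_match(root, ".//*", "ranking")

# CAS query: only section titles mentioning "ranking".
cas = cas_match(root, ".//section/title", "ranking")

print([el.tag for el in content_only])
print([el.text for el in cas])
```

The content-only query returns every element whose text mentions the term, including the footnote, while the CAS query narrows the result to the single section title.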

Use of XML structure

The self-describing structure of XML documents can considerably improve search over them. Ways of exploiting it include CAS queries, assigning different weights to different XML elements (e.g. so that a title element counts more than a footnote), and focused retrieval of document parts rather than whole documents.
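
Element weighting can be sketched as a weighted term frequency. The weights in ELEMENT_WEIGHTS, the function name and the two sample documents are hypothetical and chosen only to show the effect; the sketch also counts only the text directly inside each element and ignores tail text after child elements.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-element weights: a hit in a title counts more than
# a hit in a footnote. Unlisted elements get the default weight 1.0.
ELEMENT_WEIGHTS = {"title": 3.0, "p": 1.0, "footnote": 0.25}


def weighted_term_frequency(root, term):
    """Sum the occurrences of `term`, weighted by the enclosing element."""
    term = term.lower()
    score = 0.0
    for el in root.iter():
        words = re.findall(r"\w+", (el.text or "").lower())
        score += ELEMENT_WEIGHTS.get(el.tag, 1.0) * words.count(term)
    return score


doc_a = ET.fromstring("<doc><title>XML ranking</title><p>Some text.</p></doc>")
doc_b = ET.fromstring(
    "<doc><title>XML</title><footnote>ranking ranking</footnote></doc>"
)

score_a = weighted_term_frequency(doc_a, "ranking")  # one hit in a title
score_b = weighted_term_frequency(doc_b, "ranking")  # two hits in a footnote
print(score_a, score_b)
```

With these assumed weights, a single occurrence in the title outscores two occurrences in a footnote.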

Ranking

Ranking, i.e. assessing the relevance of a result, can take both content similarity and structural similarity into account in XML retrieval. Structural similarity here means the similarity between the structure specified in the CAS query and the structure of the document being evaluated. The result of a structured query can be either a complete document or an XML element nested at any depth within a document. The aim is to find the smallest result with the highest relevance, where relevance also includes specificity, i.e. the extent to which the result is focused on the requested topic.
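
The idea of returning the smallest, most focused element can be illustrated with a toy scoring function. The formula below (fraction of query terms covered, multiplied by the fraction of the element's words that are query terms) is an assumption made for this sketch, not a measure taken from the literature, and it covers only the content side of ranking, not structural similarity.

```python
import re
import xml.etree.ElementTree as ET


def tokens(el):
    """Lower-cased words of an element, including all nested text."""
    return re.findall(r"\w+", "".join(el.itertext()).lower())


def score(el, query_terms):
    """Toy relevance: coverage of the query times specificity of the element."""
    words = tokens(el)
    if not words or not query_terms:
        return 0.0
    coverage = sum(t in words for t in query_terms) / len(query_terms)
    specificity = sum(w in query_terms for w in words) / len(words)
    return coverage * specificity


def best_element(root, query):
    """Return the highest-scoring element at any nesting depth."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    return max(root.iter(), key=lambda el: score(el, query_terms))


# Hypothetical sample document.
root = ET.fromstring(
    "<article><title>Cooking</title>"
    "<section><title>Pasta</title><p>Tomato sauce needs basil.</p></section>"
    "<section><p>Bread needs flour and water.</p></section>"
    "</article>"
)

best = best_element(root, "tomato sauce")
print(best.tag, best.text)
```

The whole article and the enclosing section both contain the query terms, but the paragraph scores highest because a larger share of its text is about the query.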

XML search engines

The INitiative for the Evaluation of XML Retrieval (INEX) was founded in 2002 and provides a platform for evaluating XML IR algorithms. Three research areas feed into XML retrieval:

  • XML query languages: Query languages such as the W3C standard XQuery support complex queries but return only exact matches, i.e. they perform no relevance calculation and no ranking of results. They therefore have to be extended to support vague search based on relevance calculation. Most of these approaches also require precise knowledge of the schema underlying the documents (XML Schema or DTD).
  • Databases: Classic database systems now also support storing semi-structured data, which has led to the development of XML databases. Such approaches are often very formal, focus on the search itself rather than on ranking, and target experienced users who can formulate complex queries.
  • Information retrieval: Classic information retrieval models such as the vector space model rely on relevance calculations, but they ignore document structure and allow only simple queries. They also assume a static notion of a document, so results usually consist of complete documents. However, they can be extended to handle structural information and dynamic document retrieval. Such approaches use document subtrees (index terms plus structure) as dimensions of the vector space.
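
The extension of the vector space model mentioned in the last item can be sketched as follows. Each dimension is a pair of an element path and an index term, which is a strong simplification of the document subtrees used in the literature; the function names, the raw-count weighting and the sample data are assumptions made for illustration.

```python
import math
import re
from collections import Counter
import xml.etree.ElementTree as ET


def structural_terms(xml_text):
    """Count (path, word) pairs, e.g. ('book/title', 'xml')."""
    counts = Counter()

    def walk(el, path):
        path = f"{path}/{el.tag}" if path else el.tag
        for word in re.findall(r"\w+", (el.text or "").lower()):
            counts[(path, word)] += 1
        for child in el:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical query and documents, all expressed as small XML fragments.
query = structural_terms("<book><title>xml</title></book>")
doc1 = structural_terms("<book><title>XML</title><author>Smith</author></book>")
doc2 = structural_terms(
    "<book><title>Databases</title><author>XML</author></book>"
)

sim1 = cosine(query, doc1)  # term appears in the requested structure
sim2 = cosine(query, doc2)  # same term, but under a different path
print(round(sim1, 3), sim2)
```

Both documents contain the term "xml", but only the first contains it under book/title, so only the first has a non-zero similarity to the query. A plain vector space model without the path component would not distinguish the two.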

Literature

  • S. Amer-Yahia, M. Lalmas: XML Search: Languages, INEX and Scoring. SIGMOD Rec. Vol. 35, No. 4, 2006.
  • S. Liu, Q. Zou, W. Chu: Configurable Indexing and Ranking for XML Information Retrieval. In: Proc. of the 27th Annual International ACM SIGIR Conference, ACM Press, 2004.
  • S. Pal: XML Retrieval - A Survey. Technical Report, CVPR, 2007.

References

  1. J. Winter, O. Drobnik: An Architecture for XML Information Retrieval in a Peer-to-Peer Environment. ACM PIKM2007 at ACM 16th Conference on Information and Knowledge Management (CIKM 2007), Lisbon, Portugal, 2007.
  2. S. Malik, A. Trotman, M. Lalmas, N. Fuhr: Overview of INEX 2006. In: Proc. of the Fifth Workshop of the INitiative for the Evaluation of XML Retrieval, Germany, 2007.
  3. N. Fuhr, N. Gövert, G. Kazai, M. Lalmas (eds.): INitiative for the Evaluation of XML Retrieval (INEX). In: Proc. of the First INEX Workshop, Dagstuhl, Germany, 2002, ERCIM Workshop Proceedings, France, 2003.
  4. Torsten Schlieder, H. Meuss: Querying and Ranking XML Documents. Journal of the American Society for Information Science and Technology, Vol. 53, No. 6, 2002.