Document retrieval

from Wikipedia, the free encyclopedia

Document retrieval is the computerized process of retrieving documents that could be relevant to a user's information need. The user expresses this need in the form of a search query. Document retrieval is also often referred to as information retrieval; in most cases the terms are used synonymously.

An organization's memory is held in its documents. Poor access to the content of these documents means poor access to the knowledge that the organization has produced or acquired over time. Document retrieval is therefore of great importance, since information that is no longer accessible has to be worked out again.

History

Even before the Middle Ages, people organized information in such a way that it could be found and used again at a later date. The simplest example is the index of a book: it consists of a set of words or terms, each linked to the pages on which information about that term can be found. Such an index is part of every information system.

In his 1945 article As We May Think, Vannevar Bush described his vision of a system he called Memex, a kind of extension of the brain. An individual would store all of their information and records in it and be able to retrieve them quickly and flexibly.

Since the 1940s, the problem of storing and efficiently retrieving information has received increasing attention. The reason was a rapid growth in the amount of information, together with a desire for faster access to it. The space needed to keep this information on paper in folders and offices soon became insufficient. The digitization of data began, bringing the problems of efficient storage and retrieval to the fore. The invention of the CD opened up a new way of storing data compactly and distributing it easily. Retrieval methods were researched, but tests at a commercially relevant scale were few. With the spread of the Internet, it finally became possible for every user to publish information online. Modern search engines try to cope with this new flood of information.

Since the first generation of document retrieval systems, research has been confronted with the central question of what constitutes relevant information. An understanding of this problem, and the tools needed to design and operate document retrieval systems for such volumes of information, were still not fully available at the beginning of the 21st century. Repeated incidents in which companies lost large sums of money due to a lack of document control confirm this.

The first commercial document retrieval systems were:

  • DIALOG was designed by Lockheed and provided access to published research articles.
  • LexisNexis provided specialist databases.
  • STAIRS was developed by IBM and was intended for free-text search.
  • FAIRS was developed by Fujitsu (Japan) and is similar to STAIRS.
  • GOLEM is an interactive database system from Siemens.
  • GRIPS was developed by the German Institute for Medical Documentation and Information (DIMDI).

Definition

A document retrieval system (DRS) is understood to mean the entirety of the methodological principles, technical processes and facilities that enable the largely computer-aided provision of information. This information can consist of sound, images, video and text. The interplay of the two components, information indexing (indexing) and information retrieval (retrieval), is essential.

The representation of the content-related characteristics of a document in a form that can be used for document retrieval is referred to as the content-related document description. The extraction of such content characteristics is called indexing. According to DIN 31623, indexing covers all methods, and their application, that lead to the assignment of descriptors and terms to documents for the purpose of describing their content and retrieving them in a targeted way. The retrieval process is commonly referred to as the search. The result of the search, i.e. the set of documents returned by the document retrieval system, is called the system proposal.

Recall and precision are the measures usually used to assess the quality of document retrieval. Recall (completeness of the search) is the ratio of the number of relevant documents in the system proposal to the number of all documents that are relevant to the search query. Precision (accuracy of the search) is the proportion of relevant documents among all documents in the system proposal. Since each of these values says little on its own, they are often combined in so-called recall-precision graphs.
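
The two definitions above can be illustrated with a minimal Python sketch. The function name, the document identifiers and the example sets are assumptions made for illustration only; they do not come from any particular system.

```python
def recall_and_precision(proposal, relevant):
    """Return (recall, precision) for one query.

    proposal: set of document ids returned by the system (the system proposal)
    relevant: set of all document ids that are relevant to the query
    """
    hits = len(proposal & relevant)  # relevant documents that were actually returned
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(proposal) if proposal else 0.0
    return recall, precision


# Hypothetical example: 4 documents returned, 5 relevant documents exist, 3 overlap.
proposal = {"d1", "d2", "d3", "d7"}
relevant = {"d1", "d2", "d3", "d4", "d5"}
recall, precision = recall_and_precision(proposal, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.60 precision=0.75
```

Returning more documents can only keep recall the same or raise it, but it typically lowers precision, which is why the two values are usually reported together.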

Relevance is a key concept in the theory of IR systems. According to Saracevic, relevance is a measure of the correspondence between a document and the search query from the point of view of a neutral judge. The user's notion of relevance (also referred to as pertinence) and that of the system rarely match. A central problem of document retrieval becomes clear here: it is not possible to determine before a search query is made (and especially not at the time of indexing) which information will be relevant for future users.

Further definitions

  • A DRS does not inform the user about the subject of his search query. It merely provides information about the existence or non-existence, and the location, of documents that could be relevant to the search query.
  • A DRS comprises the hardware and software that support the user in obtaining the information he is looking for. The main goal of a DRS is to minimize the effort the user has to spend on finding that information.
  • Document retrieval describes the computer-aided process of retrieving documents. A user makes a request in the form of a query and receives a list of documents sorted by relevance. These documents may or may not contain the information he is looking for. The ordering of the system proposal does not have to correspond to the user's own expectations of relevance.

Differentiation from data retrieval

The following table compares some of the differences between document retrieval and classic data retrieval. For a detailed discussion of the differences and similarities, the interested reader is referred to the literature.

                     Data retrieval   Document retrieval
Search               exact            incomplete, "as good as possible"
Query language       artificial       natural
Query specification  complete         incomplete
Model                deterministic    probabilistic
Success criterion    correctness      benefit to the user

Data retrieval normally searches for an exactly specified object, for example "Bob's address". The result of the search is either the object sought (Bob's address) or the finding that it does not exist in the database searched. A corresponding SQL query might look like this: SELECT Adresse FROM Angestellte WHERE NAME = 'Bob'. This search query is fully specified in an artificial language. It is answered either with Bob's address or with a message that Bob's address does not exist in the database. The result of the search is correct only if Bob's correct address was returned. The outcome of the search is deterministic: either the correct data is available or it is not.
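
The deterministic behaviour described above can be reproduced with a small sketch using Python's built-in sqlite3 module. The table layout, the stored address and the helper name lookup_address are assumptions for illustration; only the table and column names are taken from the query in the text.

```python
import sqlite3

# In-memory database with a hypothetical table matching the query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Angestellte (Name TEXT PRIMARY KEY, Adresse TEXT)")
conn.execute("INSERT INTO Angestellte VALUES ('Bob', '12 Example Street')")


def lookup_address(name):
    """Return the stored address for name, or None if no such row exists."""
    row = conn.execute(
        "SELECT Adresse FROM Angestellte WHERE Name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


print(lookup_address("Bob"))    # 12 Example Street
print(lookup_address("Alice"))  # None
```

There is no notion of a "partly right" answer here: the row either exists or it does not.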

Document retrieval, by contrast, does not look for Bob's address itself, but for information about, for example, the area in which Bob lives. At first it is not clear what a query that provides the user with this information should look like. For a possible query such as Bob Adresse Umgebung (Bob address surroundings), the DRS returns suggestions, which the user can then search through for information that is useful to him. The user's need for information is expressed here in natural language, but it is not fully specified. For a full specification, the user would already need to know what he is looking for. In addition, it is not clear which suggestions the DRS will make and whether it can and will provide the desired information. A probabilistic model is used here. Because of these uncertainties, a search result cannot be described as correct or incorrect. The documents presented to the user can be useful or useless to him. Accordingly, the success criterion of a search is the benefit to the user.

Structure of a document retrieval system

Greatly simplified representation of a document retrieval system.

Indexing

The object of indexing is to assign a set of index terms or keywords to documents. The index terms should:

  • reflect the content of the document as fully as possible.
  • describe the document in such a way that it differs as much as possible from documents with similar content.

These keywords can be generated either automatically or manually by an indexer. They offer a logical view of a document. The most complete representation of a document is its full content. However, this leads to a high storage requirement for the index, which would then be the same size as the documents it indexes. Therefore, a document representation has to be found that fulfills the two requirements listed above as completely as possible. This process usually consists of the following steps.

First, special characters and frequently occurring words, such as articles and conjunctions, are removed using a stop list. A stop list contains all words that are irrelevant for describing the content of the document and that are therefore removed from the text. These words are then also left out of search queries, which simplifies the search process. In addition, this step reduces the size of the original document by 30–50%.
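
As a minimal sketch of this first step, the following Python snippet drops special characters and filters tokens against a stop list. The stop list shown is a tiny, made-up example; real stop lists are considerably longer and language-specific.

```python
import re

# A tiny illustrative stop list; real stop lists are much longer.
STOP_LIST = {"a", "an", "and", "the", "of", "in", "is", "to", "or", "for"}


def remove_stop_words(text):
    """Lower-case the text, drop special characters, and filter out stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_LIST]


print(remove_stop_words("The storage and retrieval of documents is the topic!"))
# ['storage', 'retrieval', 'documents', 'topic']
```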

Then all words are reduced to their stem by removing their suffixes (so-called stemming). In this way, all words that are semantically equivalent are mapped to the same word stem; for example, the German terms Fahrer (driver), fahren (to drive) and Fahrschule (driving school) are mapped to fahr. The assumption behind stemming is that words with the same stem belong to the same word family and can therefore be treated as the same. However, this simplification can also lead to errors, since there are words with the same stem but different meanings, such as Neutron and Neutralize. In addition, identical words can have different meanings in different contexts. The result of this processing step is a class for each word stem. If a word of a class occurs in a document, this class is assigned to the document as a keyword.
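
The idea of stem classes can be sketched as follows. The suffix list and the two helper functions are deliberately simplistic assumptions for illustration; this is not an established stemming algorithm, and it uses English example words.

```python
from collections import defaultdict

# Suffixes to strip, longest first. A toy list, not a real stemming algorithm.
SUFFIXES = ("ing", "ers", "er", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_classes(words):
    """Group words by their stem; each group is one keyword class."""
    classes = defaultdict(set)
    for word in words:
        classes[stem(word)].add(word)
    return dict(classes)


print(stem("indexing"), stem("indexed"), stem("indexer"))  # index index index
print(stem_classes(["search", "searching", "searched", "index", "indexers"]))
```

A rule set this crude merges unrelated words just as readily as related ones, which is exactly the kind of error the paragraph above warns about.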

Finally, all index terms are weighted according to the model implemented in the DRS. An index is then created that enables a quick search in the set of index terms by linking each term to the documents in which it occurs. If necessary, other important information, such as the position of the term in the document or the author, can be stored as well. A frequently encountered index structure is the inverted file. Further data structures and their descriptions, such as sequential files, index-sequential files and multi-lists, can be found in Chapter 4.
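
An inverted file that also records term positions might be sketched like this in Python. The function name and the three example documents are hypothetical, and term weighting is left out because it depends on the model used.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> term list."""
    index = defaultdict(dict)
    for doc_id, terms in documents.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical documents, already reduced to index terms.
documents = {
    "d1": ["document", "retrieval", "system"],
    "d2": ["data", "retrieval"],
    "d3": ["document", "index", "document"],
}
index = build_inverted_file(documents)
print(index["retrieval"])  # {'d1': [1], 'd2': [1]}
print(index["document"])   # {'d1': [0], 'd3': [0, 2]}
```

Looking up a term is then a single dictionary access, independent of how many documents do not contain it.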

Clustering can also be used, in which similar documents are assigned to a common cluster. The search in such a pre-classified pool of information is called a cluster search and takes place in two steps. Initially, only clusters with high relevance are sought. Then the documents in these clusters are inspected and the most relevant are selected. Clustering is intended to increase the efficiency of document retrieval systems by reducing the number of document comparisons required. It is obvious that this can reduce effectiveness, since relevant documents in clusters that were not selected are never inspected.
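
The two-step cluster search can be sketched as below. The choice of cosine similarity over raw term counts, the use of summed term counts as the cluster representative, and all names and data are assumptions for illustration; the text does not prescribe a particular similarity measure.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_search(query, clusters, top_clusters=1, top_docs=2):
    """Step 1: rank clusters by centroid; step 2: rank documents inside the best ones."""
    q = Counter(query)
    # Represent each cluster by the summed term counts of its documents.
    centroids = {
        name: sum((Counter(terms) for terms in docs.values()), Counter())
        for name, docs in clusters.items()
    }
    best = sorted(centroids, key=lambda n: cosine(q, centroids[n]), reverse=True)[:top_clusters]
    candidates = {d: Counter(t) for n in best for d, t in clusters[n].items()}
    return sorted(candidates, key=lambda d: cosine(q, candidates[d]), reverse=True)[:top_docs]


# Hypothetical, pre-clustered collection of index terms.
clusters = {
    "databases": {"d1": ["sql", "query", "table"], "d2": ["sql", "index", "table"]},
    "retrieval": {"d3": ["query", "ranking", "relevance"], "d4": ["stemming", "index"]},
}
print(cluster_search(["ranking", "relevance"], clusters))  # ['d3', 'd4']
```

With top_clusters=1, documents d1 and d2 are never compared with the query, which saves work but would miss them if they happened to be relevant.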

Retrieval

The process of locating the information a user wants to obtain consists of several steps. First of all, the user has to convert his need for information into a form that the search engine can understand, a so-called query. This query is then converted into a query representation. A query goes through most of the processes that documents go through during indexing. All the processes described below take place while the user waits for the answer to his search query. First, terms and characters that are irrelevant to the search, such as "I'm looking for information on:", are removed. Then irrelevant terms are removed with the help of the stop list, and stemming is carried out. Finally, the query representation is generated, at which point the logical operators required by the search algorithm can also be inserted.

It is also possible to expand the terms of the query and thus include related terms in the search. These related terms can be synonyms found in electronic thesauri, or terms that have a special connection with the query term due to semantic properties (e.g. a certain word order). This processing step frees the user from having to try out all variants of his query in order to get as many relevant documents as possible in the search result. Recall may thus be increased, but precision will decrease if the expanded terms lead to the retrieval of irrelevant documents.
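
The query processing steps above can be sketched in Python as follows. The stop list, the thesaurus entries, the toy stemmer and the AND/OR query representation are all assumptions chosen for illustration; an actual DRS would use whatever its retrieval model requires.

```python
import re

STOP_LIST = {"i", "am", "looking", "for", "information", "on", "the", "a", "of", "and"}
# Hypothetical thesaurus mapping a term to related terms.
THESAURUS = {"car": ["automobile", "vehicle"], "repair": ["maintenance"]}


def stem(word):
    """Toy stemmer: strip a trailing plural 's' (an assumption, not a real algorithm)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def process_query(text, expand=True):
    """Turn a natural-language query into a list of OR-groups joined by AND."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = [stem(t) for t in tokens if t not in STOP_LIST]
    groups = []
    for term in terms:
        variants = [term] + (THESAURUS.get(term, []) if expand else [])
        groups.append(variants)
    return groups


def to_boolean(groups):
    """Render the query representation with explicit logical operators."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)


print(to_boolean(process_query("I am looking for information on: car repairs")))
# (car OR automobile OR vehicle) AND (repair OR maintenance)
```

Calling process_query with expand=False yields the unexpanded query, which makes it easy to compare the effect of expansion on recall and precision.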

Finally, the actual search takes place. The search algorithms used are determined by the model implemented in the DRS. The index is searched for documents that contain terms of the query. For each such document, a so-called similarity score with the query is calculated. The calculation is carried out using an algorithm that is likewise determined by the model implemented in the DRS. The documents are then sorted, or ranked, according to their similarity scores. The sorted list is made available to the user (possibly with a brief description of each document), who can then take a closer look at the list or at the content of the documents. Some systems also offer user-based relevance feedback, so that the user can mark the documents that are relevant to him. The system then starts a new search based on these ratings and provides a revised list of documents that (hopefully) contains more documents relevant to the user. The process of relevance feedback can be repeated as often as required.
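
How scoring and ranking might look is sketched below. Because the scoring algorithm depends on the model, this example simply assumes a vector space model with a tf-idf style weighting and cosine similarity; the function name, the weighting formula and the example documents are illustrative choices, not something the text prescribes.

```python
import math
from collections import Counter


def rank(query_terms, documents):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    n = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for terms in documents.values() for term in set(terms))

    def weights(terms):
        # Term frequency times a smoothed inverse document frequency.
        tf = Counter(terms)
        return {t: tf[t] * math.log(1 + n / df[t]) for t in tf if df[t]}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    q = weights(query_terms)
    scores = {doc_id: cosine(q, weights(terms)) for doc_id, terms in documents.items()}
    # Only documents that share at least one term with the query are returned.
    return sorted(((d, s) for d, s in scores.items() if s > 0), key=lambda x: x[1], reverse=True)


# Hypothetical documents, already reduced to index terms.
documents = {
    "d1": ["document", "retrieval", "system"],
    "d2": ["data", "retrieval"],
    "d3": ["document", "index", "document"],
}
for doc_id, score in rank(["document", "index"], documents):
    print(doc_id, round(score, 3))
```

In this example d3 is ranked ahead of d1 because it matches both query terms, while d2 shares no term with the query and is not returned at all.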

Theoretical document retrieval models

The following theoretical models are implemented in document retrieval systems. The choice of the model has an impact on the search algorithms and the calculations of the rankings and scores. These are described in detail in Chapter 2.

Classic models:

Modern probabilistic models:

Alternative paradigms:

References

  1. Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval. ACM Press, 1999, ISBN 0-201-39829-X.
  2. V. Bush: As We May Think. In: Atlantic Monthly. Volume 176 (1), Pages 101-108, 1945, doi: 10.1.1.128.2127.
  3. Elizabeth D. Liddy: Automatic Document Retrieval. In: Encyclopedia of Language & Linguistics. 2nd edition, Elsevier Limited, 2005, CNLP (Memento from 23 August 2012 in the Internet Archive) (DOI not available).
  4. Various authors: Handbook of modern data processing. Forkel-Verlag, issue 133, January 1987, ISSN 0723-5208.
  5. D. C. Blair: The challenge of commercial document retrieval, Part I: Major issues, and a framework based on search exhaustivity, determinacy of representation and document collection size. In: Information Processing and Management: an International Journal. Volume 38, Issue 2, Pages 273-291, Pergamon Press, Inc., Tarrytown, New York, March 2002, doi: 10.1016/S0306-4573(01)00024-3.
  6. J. Panyr: Relevance problems in information retrieval systems. In: Nachr. f. Dokum. Pages 2-4, 1986.
  7. T. Saracevic: Relevance: A Review of and a Framework for the Thinking on the Notion in Information Science. In: Journal of the ASIS. Pages 321-343, 1975.
  8. C. J. van Rijsbergen: Information Retrieval. Butterworth-Heinemann, 1979, ISBN 0-408-70929-4.
  9. Gerald Kowalski: Information Retrieval - Architecture and Algorithms. Springer, 2011, ISBN 978-1-4419-7715-1.
  10. D. C. Blair: The data-document distinction in information retrieval. In: Communications of the ACM. Volume 27, Issue 4, Pages 369-374, New York, April 1984, doi: 10.1145/358027.358049.
  11. D. C. Blair: The data-document distinction revisited. In: ACM SIGMIS Database. Volume 37, Issue 1, Pages 77-96, New York, Winter 2006, doi: 10.1145/1120501.1120507.
  12. W. S. Cooper, M. E. Maron: Foundations of Probabilistic and Utility-Theoretic Indexing. In: Journal of the ACM. Volume 25, Pages 67-80, 1978, doi: 10.1145/322047.322053.
  13. S. E. Robertson, M. E. Maron, W. S. Cooper: Probability of relevance: a Unification of Two Competing Models for Document Retrieval. In: Information Technology: Research and Development. Volume 1, Pages 1-21, 1982.
  14. W. S. Cooper: On Selecting a Measure of Retrieval Effectiveness, Part I: The "Subjective" Philosophy of Evaluation. In: Journal of the American Society for Information Science. Volume 24, Pages 87-100, 1973, doi: 10.1002/asi.4630240204.
  15. G. Salton: Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968, ISBN 0070544859.
  16. L. Goodman, W. Kruskal: Measures of association for cross-classifications. In: Journal of the American Statistical Ass. Volume 49, Pages 732-764, 1954, doi: 10.2307/2281536.
  17. L. Goodman, W. Kruskal: Measures of association for cross-classifications II: Further discussions and references. In: Journal of the American Statistical Ass. Volume 54, Pages 123-164, 1959, doi: 10.1080/01621459.1959.10501503.
  18. J. L. Kuhns: The continuum of coefficients of association. In: Statistical Association Methods for Mechanized Documentation. Pages 33-39, Washington, 1965 (DOI not available).
  19. R. M. Cormack: A review of classification. In: Journal of the Royal Statistical Society. Series A, Volume 134, Pages 321-353, 1971, doi: 10.2307/2344237.

Web links