Thematic search

In information technology, thematic search (also "topic search") is a special form of semantic search in which users look for documents that deal with a specific topic, whether on the Web, in a digital library or in a local archive.

Background

For example, users may be interested in finding all documents related to "heart disease" (or "crime", "astronomy", "post-war period", etc.). Search engines that are purely keyword-based can find such documents only if the desired topic appears literally as a term in the text. This is often not the case: many relevant texts deal with a specific area or partial aspect of the topic without explicitly naming it. A specialist article on atrial fibrillation provides relevant information on the topics "health" and "heart disease" even if these words do not appear in the text. Similarly, a report on galaxies belongs to the topic "astronomy" even if that term is never mentioned. Conventional full-text search engines cannot automatically assign keywords to related topics, so many relevant documents are simply not found. Specialized search techniques and environments offer better options: they systematically record thematic relationships between terms and take them into account both when describing document content in the search index and when answering user queries.
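The difference between the two approaches can be illustrated with a minimal sketch. The three sample documents and the `TERM_TOPICS` table are invented for illustration; a real system would rely on a far larger vocabulary and proper linguistic matching rather than plain substring tests.

```python
# Hypothetical mini-collection (texts invented for illustration).
DOCS = {
    "d1": "Atrial fibrillation is treated with anticoagulants.",
    "d2": "Spiral galaxies contain billions of stars.",
    "d3": "Heart disease remains a leading cause of death.",
}

# Hypothetical table assigning terms to topics (an assumption, not a real resource).
TERM_TOPICS = {
    "atrial fibrillation": "heart disease",
    "heart disease": "heart disease",
    "galaxies": "astronomy",
}


def keyword_search(query, docs):
    """Return ids of documents that contain the query string literally."""
    q = query.lower()
    return sorted(doc_id for doc_id, text in docs.items() if q in text.lower())


def thematic_search(topic, docs, term_topics):
    """Return ids of documents that contain any term assigned to the topic."""
    hits = set()
    for doc_id, text in docs.items():
        lowered = text.lower()
        for term, assigned_topic in term_topics.items():
            if assigned_topic == topic and term in lowered:
                hits.add(doc_id)
    return sorted(hits)


keyword_hits = keyword_search("heart disease", DOCS)
thematic_hits = thematic_search("heart disease", DOCS, TERM_TOPICS)
```

In this sketch the keyword search misses the atrial fibrillation article because the phrase "heart disease" does not occur in it, whereas the thematic search finds it through the term-to-topic table.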

Thematic search transfers traditional library search, which is characterized by classification systems and catalogs, to the computer-based world of digital archives, libraries, forums and platforms, and expands and enriches it with new forms of interaction. Almost every form of research in electronic text collections that goes beyond a mere keyword search is a form of thematic search, although combining it with keyword-based search can be quite useful. For example, a corporation may want to find all reports on the topic "environment" in which it is mentioned. A political party might want to know which press articles on the topics "internet", "economics" or "social affairs" mention it, and which articles on the same topics mention other parties. If the connection between documents and the topics occurring in them is recorded electronically in a search engine, users can be offered an overview of which topics occur with which relevance in the document collection; interesting documents can then be found by navigating through topic hierarchies. Thematic "tag clouds" (see below) are a special form of this visual thematic access. If the messages or documents are also provided with time stamps, as in news collections, the change in the importance of topics over time can be shown as well. The more the user's focus lies on gaining an overview, analyzing existing topics from different perspectives and recognizing relationships between topics, documents and sources, the less the interaction is a "search" in the strict sense. More generally, one can therefore speak of "thematic access" to content.

Thematic keywording, tagging and term clouds

To enable thematic search, many Internet forums have posts and articles manually indexed and tagged by topic. With "social tagging", the users assign the tags themselves. For visual navigation in the document collection, users are often presented with "tag clouds" that display frequently assigned topics. Clicking on a topic then leads to the relevant documents. If all documents carry a sufficient number of good-quality tags, this results in an interesting and intuitive form of thematic search. In practice, however, manual keywording often proves inadequate because many documents remain untagged. If tag clouds are used nevertheless, the result usually falls short of expectations.

To be independent of manually assigned labels and to take all texts into account, simpler types of term clouds merely display the most frequent or most conspicuous terms of the underlying text collection. Only terms that appear literally in the text are recorded. Terms that often appear together are placed closer to each other in the cloud. However, the "world knowledge" created in this way reflects only the coincidences of the particular document collection and often turns out to be questionable on closer inspection.
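The statistics behind such a term cloud can be sketched as follows. The sample texts and the stopword list are invented for illustration, and counting co-occurrence per document is only one of several possible choices.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample collection and stopword list (assumptions for illustration).
DOCS = [
    "heart attack risk and heart surgery",
    "galaxy survey finds distant galaxy cluster",
    "heart surgery reduces risk",
]
STOPWORDS = {"and", "finds"}


def term_frequencies(docs):
    """Count how often each non-stopword term occurs across all documents."""
    counts = Counter()
    for text in docs:
        counts.update(w for w in text.lower().split() if w not in STOPWORDS)
    return counts


def cooccurrences(docs):
    """Count, for each pair of terms, the number of documents containing both."""
    pairs = Counter()
    for text in docs:
        terms = sorted({w for w in text.lower().split() if w not in STOPWORDS})
        pairs.update(combinations(terms, 2))
    return pairs


freq = term_frequencies(DOCS)   # could drive the font size in the cloud
pairs = cooccurrences(DOCS)     # could drive the placement of terms
```

Only literal terms are counted here, so the pairs reflect nothing more than which words happen to share a document in this particular collection.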

Fully automatic thematic keywording and annotation of documents based on real world knowledge requires greater effort. It can be achieved with special semantic networks that have a computational-linguistic foundation. In such networks, keywords, names and phrases are explicitly assigned to thematic areas, which are arranged in an extensive topic hierarchy of main topics and sub-topics. When the keywords occur in a text, the topics of the document are recognized using the knowledge stored in the network. To be generally applicable, the recorded keywords and the topic hierarchy must have encyclopedic coverage. Services for fully automatic thematic indexing of text documents based on this principle already exist on the Internet.
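The principle can be sketched with a toy network. The hierarchy `PARENT` and the assignments in `TERM_TOPICS` are hypothetical, and the plain substring matching used here stands in for the linguistic analysis (inflection, compounds, disambiguation) that a real system would need.

```python
from collections import Counter

# Hypothetical topic hierarchy: each topic points to its parent (None = main topic).
PARENT = {
    "heart disease": "health",
    "health": None,
    "astronomy": "science",
    "science": None,
}

# Hypothetical assignment of keywords and phrases to topics.
TERM_TOPICS = {
    "atrial fibrillation": "heart disease",
    "cardiac arrest": "heart disease",
    "galaxy": "astronomy",
    "telescope": "astronomy",
}


def annotate(text):
    """Score topics by keyword occurrences and pass each score up the hierarchy."""
    lowered = text.lower()
    scores = Counter()
    for term, topic in TERM_TOPICS.items():
        n = lowered.count(term)  # naive substring count; misses inflected forms
        while n and topic is not None:
            scores[topic] += n
            topic = PARENT[topic]
    return scores


scores = annotate("Atrial fibrillation can precede cardiac arrest.")
```

Because scores are propagated to parent topics, the sample sentence is annotated with "heart disease" and also with the main topic "health", although neither word occurs in it.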

Related methods

Thematic search, or more generally thematic access, is a special form of "semantic search". The following are the main related methods and problems:

Methods for determining the semantic proximity of terms
These methods determine how closely keywords are related without, however, associating the terms with a topic hierarchy. A well-known example is "Latent Semantic Indexing". Newer approaches automatically extract the knowledge implicit in Wikipedia about the relationships between topics and terms and make it usable.
Classic thesauri
Classic thesauri order the vocabulary of a subject according to broader and narrower concepts and similar relationships; they often also contain a simple thematic taxonomy. However, most thesauri are too narrow in subject coverage for use in general search engines.
Formal ontology
Formal ontologies are used in medical informatics and in many other areas for the automated analysis of texts. They capture special relationships between concepts and instances; which relationships are chosen depends on the subject area being modeled.
Text classification
In text classification, documents are automatically sorted into different classes. The given classes often correspond to certain topics (sport, politics, ...), but typically only a relatively small selection of topics is used, and these are not hierarchically organized.
Story tracking
In story tracking, articles and posts that deal with one specific news story are tracked across media over a longer period of time.
