Luhn's extraction algorithm

from Wikipedia, the free encyclopedia

Automatic Creation of Literature Abstracts is a work by Hans Peter Luhn from 1958. It describes the first implementation of an algorithm for sentence extraction . The aim of this sentence extraction was the automatic selection of sentences for the preparation of an abstract .

Emergence

Luhn saw the frequency with which a word appears in the text as a useful measure of the relevance of words . He sees the justification for this assumption in the fact that an author repeats certain words that are connected with the topic in his argument and description of various aspects. He was also of the opinion that the position of relevant words within a sentence says something about the importance of that sentence. From a combination of these two values, Luhn wanted to determine the relevance of the sentences.

principle

When the algorithm is carried out, a kind of “ inventory list ” is first created with all the words that occur and their frequency. Since Luhn was of the opinion that only words with medium frequency say something about the significance of a sentence and words with very high frequency are rather meaningless because they are too general, he wanted to exclude these general words with very high occurrences in the text. He saw two ways to do this:

  • Compare these high-frequency words to a list of general words and exclude those words that are considered general from the relevance calculation
  • Define an upper and a lower limit value for the frequency in order to exclude words that are too general or that occur too rarely.

Luhn decided on the second, simpler variant. In order to find the optimal limit values, one had to rely on the experience gained from many sample articles.

The significance value of a sentence is not simply calculated from the relevant words it contains. Since Luhn also wanted to take into account the position and relationship of relevant words, only parts of sentences that contained relevant words should be considered. It was determined that a relevant word only belongs to a group of words (called a cluster ) if there are no more than four or five unimportant words between it and the next relevant word. The significance factor is therefore calculated as follows:

After the sentences have been sorted according to their relevance, the sentence or sentences with the highest relevance values ​​should be selected for the summary.

Luhn's balance sheet

According to Luhn, the results, i.e. the automatically generated extracts, show that it is possible with his algorithm to automatically generate summaries that reflect the main topic of the original almost as well as conventional summaries.

Per

One advantage of summaries created in this way is their reliability, consistency, and persistence. This is because the different abilities and orientations of people have no influence on the summary. In Luhn's opinion, users of summary systems will gradually learn how to interpret the generated summaries. In this way, users will see that some words relate to comments from previous, unextracted sentences.

Contra

However, he also sees some disadvantages that the automatically generated summaries have with them. He mentions, for example, the loss of fluency in summaries. He also sees problems if the style of an author deviates significantly from the general public, since this may mean that lower-value sentences can be selected.

outlook

Despite the disadvantages, Luhn is of the opinion that considerable and worthwhile savings in human effort can be achieved with the automatic creation of summaries (see HP Luhn: Automatic Creation of Literature Abstracts. In: IBM Journal of Research & Development 2 (2), April 1958, pages 159-165.)

However, Luhn also saw opportunities to improve his algorithm. On the one hand, its approach could be changed to include summaries of text on specific topics or areas of study. On the other hand, he saw the need to generate summaries of variable length. For example, summaries could be created that are tailored to the needs of the individual user. If the significance values ​​of the individual sentences do not exceed a certain limit, the article can be rejected as "too general" for the user's interests.

Web links