Apriori algorithm


The Apriori algorithm is a method of association analysis, a field of data mining. It is used to find meaningful and useful relationships in transaction-based databases, which are expressed in the form of so-called association rules. A frequent application of the Apriori algorithm is market basket analysis: the items are the products on offer, and each purchase is a transaction containing the purchased items. The algorithm then determines correlations of the form:

When shampoo and aftershave were purchased, shaving cream was also purchased 90% of the time.

A suitable database consists of a table of transactions (rows) over binary items (columns). The Apriori algorithm finds relationships between sets of items that occur in a large part of the transactions. The association rules that are output have the form $X \Rightarrow Y$, where $X$ and $Y$ are sets of items; such a rule states that if the item set $X$ occurs in a large part of the transactions, then the item set $Y$ is also frequently contained in them.

Requirements

The Apriori algorithm is applied to databases of a particular form:

  • $I = \{i_1, i_2, \ldots, i_m\}$ is the set of possible items
  • $D$ is the database consisting of transactions
  • a transaction $t \subseteq I$ combines a set of items

Typically, a set of more than 500,000 transactions over a very large item set is analyzed. The database is stored in a denormalized database table with one column per possible item. Each row represents a transaction; items contained in the transaction are marked with a 1, items not contained with a 0. A transaction can therefore also be viewed as a vector with $m$ dimensions.
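
As a minimal sketch of this encoding (the item names and transactions below are made up for illustration and are not part of the article), the binary representation could be built as follows:

```python
# Minimal sketch of the denormalized 0/1 encoding described above.
# The items and transactions are illustrative only.

items = ["shampoo", "aftershave", "shaving cream", "toothpaste"]  # the item set I

# Each transaction is the set of items bought together.
transactions = [
    {"shampoo", "aftershave", "shaving cream"},
    {"shampoo", "toothpaste"},
    {"aftershave", "shaving cream"},
]

# Encode every transaction as a vector with one 0/1 dimension per item in I.
vectors = [[1 if item in t else 0 for item in items] for t in transactions]

print(vectors[0])  # [1, 1, 1, 0] for the first transaction
```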

An association rule is of the form

$X \Rightarrow Y$

where

$X, Y \subseteq I$ and $X \cap Y = \emptyset$

Evaluation of rules

Association rules are assessed using two probabilistic measures: support and confidence. The Apriori algorithm expects the values $minsupp$ and $minconf$, which specify the minimum support and the minimum confidence a rule must have in order to be taken into account.

Support

The support of an item set is the probability that this item set occurs in a transaction.

Let $X \subseteq I$ be an item set. Then

$supp(X) = P(X) = \frac{|\{t \in D : X \subseteq t\}|}{|D|}$

The support of an association rule $X \Rightarrow Y$, with $X, Y \subseteq I$ and $X \cap Y = \emptyset$, is defined as

$supp(X \Rightarrow Y) = supp(X \cup Y)$

The support of a rule therefore indicates the relative frequency with which the rule occurs in the database. A high support is usually desirable in order to find statements that hold for a large part of the database.
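
A minimal sketch of this support computation in Python (the transactions are illustrative, not taken from a real data set):

```python
# Sketch: supp(X) = |{t in D : X is a subset of t}| / |D|, with illustrative transactions.

transactions = [
    {"shampoo", "aftershave", "shaving cream"},
    {"shampoo", "toothpaste"},
    {"aftershave", "shaving cream"},
    {"shampoo", "aftershave", "shaving cream", "toothpaste"},
]

def support(item_set, transactions):
    """Relative frequency of transactions that contain every item of item_set."""
    return sum(1 for t in transactions if item_set <= t) / len(transactions)

print(support({"shampoo"}, transactions))                      # 0.75
print(support({"aftershave", "shaving cream"}, transactions))  # 0.75
```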

Confidence

Let $X \Rightarrow Y$ be an association rule, with $X, Y \subseteq I$ and $X \cap Y = \emptyset$.

The confidence of the rule corresponds to the probability of the conclusion under the condition of the premise:

$conf(X \Rightarrow Y) = P(Y \mid X)$

so

$conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}$

Confidence measures the relative frequency with which the conclusion occurs given the premise. A high value is also desirable for the confidence. Put more simply: the confidence measures for what proportion of the transactions in which $X$ occurs, $Y$ also occurs. To compute the confidence, the number of transactions that satisfy the whole rule (i.e. the support of $X \cup Y$) is divided by the number of transactions that contain $X$.
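
A minimal sketch of the confidence computation, using the relation $conf(X \Rightarrow Y) = supp(X \cup Y) / supp(X)$ and the same kind of illustrative transactions:

```python
# Sketch: conf(X => Y) = supp(X union Y) / supp(X), with illustrative transactions.

transactions = [
    {"shampoo", "aftershave", "shaving cream"},
    {"shampoo", "toothpaste"},
    {"aftershave", "shaving cream"},
    {"shampoo", "aftershave", "shaving cream", "toothpaste"},
]

def support(item_set, transactions):
    return sum(1 for t in transactions if item_set <= t) / len(transactions)

def confidence(premise, conclusion, transactions):
    """Share of the transactions containing the premise that also contain the conclusion."""
    return support(premise | conclusion, transactions) / support(premise, transactions)

# Rule {shampoo, aftershave} => {shaving cream}
print(confidence({"shampoo", "aftershave"}, {"shaving cream"}, transactions))  # 1.0
```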

The algorithm

The Apriori algorithm receives as inputs

  • the database $D$
  • the minimum support $minsupp$
  • the minimum confidence $minconf$

and outputs a set of association rules that satisfy both $minsupp$ and $minconf$.

The algorithm works in two steps, both of which use a common sub-routine, Apriori-Gen:

  1. Finding frequent item sets
  2. Generation of association rules

Finding frequent item sets

The search for frequent item sets starts with 1-element sets and continues iteratively with $k$-element sets until no item sets with sufficient support are found. In each iteration, a set of candidate sets is generated by means of Apriori-Gen, and each candidate is checked for the property $supp > minsupp$. If no new sets can be found, the algorithm stops and outputs the sets found.

  1. Compute all 1-element item sets with support > $minsupp$: $F_1$.
  2. For $k = 2, 3, \ldots$:
    1. Compute the set of candidates $C_k$ from $F_{k-1}$ by means of Apriori-Gen.
    2. Compute the actual support of all sets in $C_k$.
    3. Add the sets with sufficient support to $F_k$.
    4. If $F_k = \emptyset$, stop.
  3. Return $\bigcup_k F_k$.

The returned set contains all frequent item sets.
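
The loop can be sketched in Python as follows. The `apriori_gen` helper is the candidate-generation routine described in the next section (a sketch of it is given there); `transactions` (a list of item sets), `minsupp` and the `support` function from above are assumed to be available:

```python
# Sketch of step 1 of the Apriori algorithm: finding all frequent item sets.
# Assumes the support() sketch from above and the apriori_gen() sketch from the
# next section; transactions is a list of sets, minsupp the minimum support.

def find_frequent_itemsets(transactions, minsupp):
    items = set().union(*transactions)
    # F_1: all 1-element item sets with sufficient support.
    f_k = {frozenset([i]) for i in items
           if support(frozenset([i]), transactions) > minsupp}
    frequent = set(f_k)
    while f_k:
        # C_k: candidates generated from F_{k-1} by means of Apriori-Gen.
        candidates = apriori_gen(f_k)
        # F_k: those candidates whose actual support is high enough.
        f_k = {c for c in candidates if support(c, transactions) > minsupp}
        frequent |= f_k
    # Union of all F_k.
    return frequent
```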

Apriori-Gen

The sub-routine Apriori-Gen is used both for finding frequent item sets and for generating association rules. Instead of directly computing the support of all possible item sets, Apriori-Gen generates a set of candidates for further examination on the basis of item sets already found to be frequent.

The routine receives as input a set of frequent $(k-1)$-item sets ($F_{k-1}$) and returns a set of $k$-item sets ($C_k$) as possible candidates. It is based on the principle that all subsets of a frequent item set are frequent, while all supersets of a non-frequent item set are likewise non-frequent. This avoids unnecessary support computations.

  1. Generate $k$-item sets by merging two $(k-1)$-item sets that have $k-2$ items in common, and add them to $C_k$. This step guarantees that only one element at a time is added to the new set.
  2. For each set in $C_k$, check whether all of its $(k-1)$-element subsets are contained in $F_{k-1}$. If not, remove it from $C_k$.
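
A sketch of the routine in Python, with the frequent $(k-1)$-item sets represented as a set of frozensets (the function name `apriori_gen` and this representation are choices of the sketch):

```python
from itertools import combinations

def apriori_gen(f_k_minus_1):
    """Candidate generation: frequent (k-1)-item sets -> candidate k-item sets (sketch)."""
    if not f_k_minus_1:
        return set()
    k_minus_1 = len(next(iter(f_k_minus_1)))  # size of the input sets

    # Step 1 (join): merge two (k-1)-item sets that have k-2 items in common,
    # so that exactly one new element is added at a time.
    candidates = set()
    for a in f_k_minus_1:
        for b in f_k_minus_1:
            union = a | b
            if len(union) == k_minus_1 + 1:
                candidates.add(union)

    # Step 2 (prune): remove every candidate that has a (k-1)-element subset
    # not contained in F_{k-1}; such a candidate cannot be frequent.
    return {c for c in candidates
            if all(frozenset(s) in f_k_minus_1 for s in combinations(c, k_minus_1))}
```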

Example

The input to Apriori-Gen is a set $F_3$ of frequent 3-item sets.

Step 1 of the Apriori-Gen routine computes from it a candidate set $C_4$ of 4-item sets.

Step 2 removes those candidates again whose 3-element subsets are not all contained in $F_3$. Such subsets are not frequent, so their supersets need not be considered.

The remaining candidates form the result of Apriori-Gen.
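
As a concrete illustration, the `apriori_gen` sketch above can be run on the candidate-generation instance from the Agrawal/Srikant paper listed under Literature; the concrete sets are an assumed example input of the sketch:

```python
# Illustrative run of the apriori_gen sketch above. The input F3 is the classic
# candidate-generation instance from Agrawal/Srikant (1994), used here only as
# an assumed example.

F3 = {frozenset(s) for s in ({1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4})}

C4 = apriori_gen(F3)
print(C4)
# Step 1 joins pairs sharing two items and yields the candidates
# {1,2,3,4}, {1,2,3,5} and {1,3,4,5}.
# Step 2 prunes {1,2,3,5} and {1,3,4,5}, because e.g. {1,2,5} and {1,4,5}
# are not contained in F3, so these candidates cannot be frequent.
# Remaining result: {frozenset({1, 2, 3, 4})}
```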

Generation of association rules

Only item sets that are themselves frequent need to be considered for this step of the algorithm; these were computed in step 1 of the Apriori algorithm. The Apriori-Gen routine used in step 1 is used again when generating the association rules.

For each frequent item set found, an attempt is made to generate association rules. One starts with the shortest possible (1-element) conclusions, which are then iteratively enlarged. The following pseudocode is executed for each frequent item set $Z$:

  1. Compute all association rules of the form $Z \setminus Y \Rightarrow Y$ with $|Y| = 1$ and with $conf \geq minconf$.
  2. Generate $H_1$ from item sets each consisting of one conclusion found in this way. Then, for $k = 2, 3, \ldots$:
    1. Generate $H_k$ from $H_{k-1}$ by means of Apriori-Gen.
    2. For each conclusion $Y \in H_k$, check whether $conf(Z \setminus Y \Rightarrow Y) \geq minconf$. If not, remove $Y$ from $H_k$.
    3. If $H_k = \emptyset$, stop.
  3. Return the rules $Z \setminus Y \Rightarrow Y$ for all conclusions $Y$ found.

The generated rules all satisfy $minsupp$ and $minconf$.
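
A sketch of this step for a single frequent item set $Z$, reusing the `support` and `apriori_gen` sketches from above (the function and variable names are choices of the sketch, not part of the article):

```python
# Sketch of step 2 of the Apriori algorithm for one frequent item set Z.
# Assumes the support() and apriori_gen() sketches from above; transactions is
# the database, minconf the minimum confidence.

def confidence(premise, conclusion, transactions):
    return support(premise | conclusion, transactions) / support(premise, transactions)

def rules_for_itemset(z, transactions, minconf):
    """All rules (Z \\ Y) => Y with conf >= minconf, with conclusions Y grown iteratively."""
    rules = []
    if len(z) < 2:
        return rules  # a rule needs a non-empty premise and a non-empty conclusion
    # H_1: 1-element conclusions whose rule reaches the minimum confidence.
    h = {frozenset([item]) for item in z
         if confidence(z - {item}, frozenset([item]), transactions) >= minconf}
    rules += [(z - y, y) for y in h]
    while h:
        # Enlarge the conclusions by means of Apriori-Gen and keep only those
        # that are proper subsets of Z and still reach the minimum confidence.
        h = {y for y in apriori_gen(h)
             if y < z and confidence(z - y, y, transactions) >= minconf}
        rules += [(z - y, y) for y in h]
    return rules
```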

Literature

  • Rakesh Agrawal, Tomasz Imieliński, Arun Swami: Mining Association Rules between Sets of Items in Large Databases. In: Peter Buneman, Sushil Jajodia (eds.): Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (= SIGMOD Record. Vol. 22, No. 2, June 1993). ACM, New York NY 1993, ISBN 0-89791-592-5, pp. 207-216, doi:10.1145/170035.170072.
  • Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules. In: Jorge Bocca, Matthias Jarke, Carlo Zaniolo (eds.): Very Large Data Bases. Proceedings of the 20th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., Hove et al. 1994, ISBN 1-55860-153-8, pp. 487-499, online (PDF; 282 kB).
  • Jean-Marc Adamo: Data Mining for Association Rules and Sequential Patterns. Sequential and Parallel Algorithms. Springer, New York NY et al. 2001, ISBN 0-387-95048-6.
  • Christoph Beierle, Gabriele Kern-Isberner: Methods of knowledge-based systems: fundamentals, algorithms, applications. 4th edition. Vieweg + Teubner Verlag, 2008, p. 147 ff., ISBN 3834805041.
