Association analysis

Association analysis is the search for strong rules. The resulting association rules describe correlations between things that occur together. The purpose of an association analysis is therefore to identify items (elements of a set, such as the individual items in a shopping cart) whose presence implies the occurrence of other items within a transaction. A relationship uncovered in this way between two or more items can then be represented as a rule of the form "If item (set) A, then item (set) B", or A → B.

Fields of application

A typical field of application is retail, in so-called shopping basket analysis, which is used to initiate targeted advertising measures. For example: 80 percent of purchases that contain beer also contain potato chips, and both products appear together in 10 percent of all purchases. Such findings are often used in cross-marketing.

Parameters

Parameters of association rules are:

Support: relative frequency of examples in which the rule is applicable.

$\mathrm{supp}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{N}$, where $N$ is the cardinality of the total data. Note that the support $\mathrm{supp}(X \cup Y)$ on the right-hand side is defined on the item set $X \cup Y$ and corresponds to the absolute frequency of that item set in the total data. The union of the two sides of the rule is used here in order to capture all elements of the total data that contain both the item set $X$ and the item set $Y$.

Confidence: relative frequency of examples in which the rule is correct.

$\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$

The confidence of a rule describes the proportion of all elements of the total data that contain both $X$ and $Y$, relative to those elements that contain $X$.

Lift: the lift indicates by how much the confidence of the rule exceeds the expected value, i.e. it shows the general significance of a rule.

$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$

Here the supports are taken as relative frequencies (proportions of all transactions), and:

$\begin{aligned} \mathrm{lift}(X \Rightarrow Y) > 1 &\rightarrow X \text{ and } Y \text{ are positively correlated} \\ \mathrm{lift}(X \Rightarrow Y) < 1 &\rightarrow X \text{ and } Y \text{ are negatively correlated} \\ \mathrm{lift}(X \Rightarrow Y) = 1 &\rightarrow X \text{ and } Y \text{ are independent} \end{aligned}$
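The three parameters can be illustrated with a minimal Python sketch. It assumes that each transaction is stored as a set of item names and that support is measured as a relative frequency; the function names and the sample transactions are hypothetical and chosen only for illustration.

```python
def support(itemset, transactions):
    """Relative support: share of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """supp(X u Y) / supp(X); assumes X occurs in at least one transaction."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """supp(X u Y) / (supp(X) * supp(Y)); assumes X and Y each occur at least once."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))


# Hypothetical sample data: five shopping baskets.
transactions = [
    frozenset({"toothbrush", "toothpaste"}),
    frozenset({"toothbrush", "toothpaste", "soap"}),
    frozenset({"toothbrush"}),
    frozenset({"toothpaste", "soap"}),
    frozenset({"soap"}),
]

x, y = {"toothbrush"}, {"toothpaste"}
print(support(x | y, transactions))    # joint support of the rule
print(confidence(x, y, transactions))  # share of toothbrush baskets with toothpaste
print(lift(x, y, transactions))        # above 1 suggests a positive correlation
```

In this sample, two of the five baskets contain both items and three contain each item individually, so the sketch gives a support of 2/5, a confidence of 2/3 and a lift of 10/9.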

Example

Consider the association rule {toothbrush} → {toothpaste}.

Support: the support is the proportion of all transactions to which the rule {toothbrush} → {toothpaste} applies. To calculate it, the number of transactions in which both item sets of interest occur is divided by the number of all transactions.
Confidence: in what proportion of the transactions that contain {toothbrush} does {toothpaste} also occur? To calculate the confidence, the number of transactions that satisfy the rule is divided by the number of transactions that contain {toothbrush}.

$\mathrm{confidence}(\{\text{toothbrush}\} \rightarrow \{\text{toothpaste}\}) = \frac{\mathrm{supp}(\{\text{toothbrush}, \text{toothpaste}\})}{\mathrm{supp}(\{\text{toothbrush}\})}$

Lift: assume that 10 percent of all customers buy {toothbrush, toothpaste}, 20 percent of all customers buy {toothbrush}, and 40 percent of all customers buy {toothpaste}. Then the rule has a lift of 1.25.
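With the percentages assumed in the lift example written as proportions, the calculation can be spelled out as follows. The confidence is included as a by-product of the same figures.

$\mathrm{lift}(\{\text{toothbrush}\} \rightarrow \{\text{toothpaste}\}) = \frac{0.10}{0.20 \times 0.40} = \frac{0.10}{0.08} = 1.25$

$\mathrm{confidence}(\{\text{toothbrush}\} \rightarrow \{\text{toothpaste}\}) = \frac{0.10}{0.20} = 0.5$

Since the lift is greater than 1, toothbrush and toothpaste count as positively correlated: half of the toothbrush buyers also buy toothpaste, compared with 40 percent of customers overall.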

Procedure

Algorithms must be designed so that they find all association rules that meet a minimum confidence and a minimum support, both of which are fixed in advance. The methods should not require any assumptions about the characteristics to be analyzed. Such assumptions would in any case be impractical in, for example, a mail-order business with many thousands of items.
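To make this requirement concrete, the following Python sketch enumerates every candidate rule by brute force and keeps those that reach both thresholds. It is deliberately naive and is not the AIS, Apriori or FPGrowth algorithm; the function name `find_rules`, the thresholds and the sample baskets are assumptions made purely for illustration. Because the number of candidate item sets grows exponentially with the number of items, this approach is only workable for very small item catalogues.

```python
from itertools import combinations


def support(itemset, transactions):
    """Relative support: share of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def find_rules(transactions, min_support, min_confidence):
    """Return (X, Y, support, confidence) for every rule X -> Y that meets both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            supp_xy = support(itemset, transactions)
            # Skip item sets that are too rare (or never occur at all).
            if supp_xy < min_support or supp_xy == 0:
                continue
            # Split the item set into every non-empty antecedent X and consequent Y.
            for k in range(1, size):
                for antecedent in combinations(candidate, k):
                    x = frozenset(antecedent)
                    y = itemset - x
                    conf = supp_xy / support(x, transactions)
                    if conf >= min_confidence:
                        rules.append((x, y, supp_xy, conf))
    return rules


# Hypothetical sample data: five shopping baskets.
baskets = [
    frozenset({"beer", "chips"}),
    frozenset({"beer", "chips", "bread"}),
    frozenset({"beer", "bread"}),
    frozenset({"bread", "milk"}),
    frozenset({"chips", "milk"}),
]

for x, y, supp, conf in find_rules(baskets, min_support=0.4, min_confidence=0.6):
    print(sorted(x), "->", sorted(y), round(supp, 2), round(conf, 2))
```

With these sample baskets and thresholds, only the pairs {beer, bread} and {beer, chips} reach the minimum support, and each yields one rule in either direction.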

The first algorithm for association analysis was the AIS algorithm (named after its developers Agrawal, Imieliński and Swami), from which the Apriori algorithm was developed. Apriori is increasingly being replaced by the much more efficient FPGrowth algorithm.

Web links

Enrico Lüdecke: Determination of association rules from large amounts of data. Schmalkalden University of Applied Sciences. Retrieved June 25, 2014.

References

[1] R. Agrawal, T. Imieliński, A. Swami: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93). 1993, p. 207. doi:10.1145/170035.170072.
[2] R. Agrawal, T. Imieliński, A. Swami: Database Mining: A Performance Perspective. In: IEEE Transactions on Knowledge and Data Engineering, Special Issue on Learning and Discovery in Knowledge-Based Databases. 5 (6), December 1993.
[3] Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach (archived October 31, 2008, in the Internet Archive).
