The Apriori Algorithm for Association Rule Mining

 
Apriori: A Candidate Generation & Test Approach
 
Apriori pruning principle: If there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)

Method:
- Initially, scan the DB once to get the frequent 1-itemsets
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
- Test the candidates against the DB
- Terminate when no frequent or candidate set can be generated
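
To make the pruning principle concrete, here is a minimal Python sketch (not from the slides; the function name and example sets are illustrative): a candidate is worth generating and testing only if every one of its (k-1)-subsets is already known to be frequent.

    from itertools import combinations

    def has_infrequent_subset(candidate, frequent_prev):
        """True if some (k-1)-subset of the k-itemset candidate is not
        frequent, in which case Apriori never generates/tests the candidate."""
        k = len(candidate)
        return any(frozenset(sub) not in frequent_prev
                   for sub in combinations(candidate, k - 1))

    # If {D} is infrequent (as in the example slide later in the deck),
    # any 2-itemset containing D is pruned without ever being counted:
    frequent_1 = {frozenset("A"), frozenset("B"), frozenset("C"), frozenset("E")}
    print(has_infrequent_subset(frozenset("AD"), frequent_1))  # True -> prune
    print(has_infrequent_subset(frozenset("AC"), frequent_1))  # False -> test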
 
Steps In Apriori
 
The Apriori algorithm is a sequence of steps followed to find the frequent itemsets in a given database. This data mining technique applies the join and prune steps iteratively until no further frequent itemsets can be found. A minimum support threshold is given in the problem or is assumed by the user.

#1) In the first iteration of the algorithm, each item is taken as a candidate 1-itemset. The algorithm counts the occurrences of each item.

#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose occurrence satisfies min_sup is determined: only candidates whose count is greater than or equal to min_sup are carried forward to the next iteration; the others are pruned.

#3) Next, the frequent 2-itemsets with min_sup are discovered. For this, in the join step the candidate 2-itemsets are generated by joining the frequent 1-itemsets with each other.

#4) The candidate 2-itemsets are pruned using the min_sup threshold, so the table now holds only the 2-itemsets that meet min_sup.

#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration relies on the antimonotone property: every 2-itemset subset of a candidate 3-itemset must itself be frequent. If all 2-itemset subsets are frequent, the candidate is kept and counted; otherwise it is pruned.

#6) The next step forms 4-itemsets by joining the 3-itemsets with themselves and pruning any candidate whose subsets do not meet the min_sup criterion. The algorithm stops when no further frequent itemsets can be generated. A short sketch of the first two steps follows.
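
As a concrete illustration of steps #1 and #2, here is a minimal Python sketch using the transaction database from the example slide below and min_sup = 2 (support measured as an absolute count; the variable names are ours):

    from collections import Counter

    transactions = [
        {"A", "C", "D"},
        {"B", "C", "E"},
        {"A", "B", "C", "E"},
        {"B", "E"},
    ]
    min_sup = 2

    # Step #1: count the occurrences of every single item (candidate set C1).
    counts = Counter(item for t in transactions for item in t)

    # Step #2: keep only items meeting the minimum support (frequent set L1).
    L1 = {item: c for item, c in counts.items() if c >= min_sup}
    print(L1)  # e.g. {'A': 2, 'C': 3, 'B': 3, 'E': 3}; {'D': 1} is pruned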
 
 
The Apriori Algorithm—An Example
 
min_sup = 2

Database TDB:

Tid | Items
----|------------
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
Prune with min_sup → L1: {A}:2, {B}:3, {C}:3, {E}:3

Join L1 with itself → C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
Prune with min_sup → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

Join L2 with itself → C3: {B,C,E}
3rd scan → C3 count: {B,C,E}:2 → L3: {B,C,E}:2
 
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
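
The pseudo-code translates almost line for line into Python. The following is a sketch under our own naming, not the original authors' implementation; the candidate-generation strategy (join all pairs, then subset-prune) matches the method described on the next slide.

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise search: returns every frequent itemset with its support."""
        transactions = [frozenset(t) for t in transactions]
        # L1 = {frequent items}
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset({item})
                counts[key] = counts.get(key, 0) + 1
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent = dict(Lk)
        while Lk:  # loop until no frequent set can be generated
            k = len(next(iter(Lk)))
            # C(k+1): candidates generated from Lk (join, then Apriori-prune)
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k))}
            # one DB scan: count candidates contained in each transaction t
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            Lk = {c: n for c, n in counts.items() if n >= min_sup}
            frequent.update(Lk)
        return frequent  # the union over all k of Lk

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(db, min_sup=2))  # contains frozenset({'B','C','E'}): 2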
 
 
Implementation of Apriori
 
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning

Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
  abcd from abc and abd
  acde from acd and ace
Pruning:
  acde is removed because ade is not in L3
C4 = {abcd}
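
A Python sketch of this two-step generation, assuming each itemset is kept as a sorted tuple so the self-join can match on the first k-1 items (function and variable names are ours):

    from itertools import combinations

    def gen_candidates(Lk):
        """Self-join Lk on the first k-1 items, then Apriori-prune the result."""
        Lset, k = set(Lk), len(Lk[0])
        joined = [p + (q[-1],) for p in Lk for q in Lk
                  if p[:-1] == q[:-1] and p[-1] < q[-1]]
        # prune candidates having any k-subset that is not in Lk
        return [c for c in joined if all(s in Lset for s in combinations(c, k))]

    L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
    print(gen_candidates(L3))
    # [('a','b','c','d')] -- acde is pruned because ade is not in L3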
 
 
How to Count Supports of Candidates?
 
Why is counting supports of candidates a problem?
- The total number of candidates can be very large
- One transaction may contain many candidates

Method:
- Candidate itemsets are stored in a hash tree
- A leaf node of the hash tree contains a list of itemsets and counts
- An interior node contains a hash table
- The subset function finds all the candidates contained in a transaction
 
 
Counting Supports of Candidates Using Hash Tree
 
[Figure: a hash tree over the candidate 3-itemsets, with interior nodes branching on the hash of each item (branches 1,4,7 / 2,5,8 / 3,6,9, i.e. h(i) = i mod 3). The transaction 1 2 3 5 6 is matched by recursively splitting it into prefix + remainder (1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, ...), descending the tree and checking the candidate lists at the leaves against the transaction.]
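
A simplified sketch of the subset function in Python. For brevity it checks prefixes against a plain set of candidates; a genuine hash tree would instead hash each item to descend interior nodes and compare only at the leaves. The candidate set below reuses a few 3-itemsets from the original figure:

    def subset(candidates, transaction, k=3):
        """Yield every k-item candidate contained in the transaction, by
        recursively extending a prefix (1 + 2 3 5 6, 1 2 + 3 5 6, ...)."""
        def recurse(prefix, rest):
            if len(prefix) == k:
                if prefix in candidates:
                    yield prefix
                return
            for i, item in enumerate(rest):
                yield from recurse(prefix + (item,), rest[i + 1:])
        yield from recurse((), tuple(sorted(transaction)))

    C3 = {(1, 2, 4), (1, 3, 6), (3, 5, 6), (3, 5, 7), (6, 8, 9)}
    print(list(subset(C3, [1, 2, 3, 5, 6])))  # [(1, 3, 6), (3, 5, 6)]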
 
 
Candidate Generation: An SQL Implementation
 
SQL implementation of candidate generation. Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation. [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98]
 
Methods To Improve Apriori Efficiency

Many methods are available for improving the efficiency of the algorithm.

Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and their corresponding counts. It uses a hash function for generating the table.

Transaction Reduction: This method reduces the number of transactions scanned in later iterations. Transactions that do not contain any frequent items are marked or removed (see the sketch after this list).

Partitioning: This method requires only two database scans to mine the frequent itemsets. It relies on the fact that for any itemset to be frequent in the database, it must be frequent in at least one of the partitions of the database.

Sampling: This method picks a random sample S from database D and then searches for frequent itemsets in S. A globally frequent itemset may be missed; the risk can be reduced by lowering min_sup.

Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database.
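
As an illustration, here is a minimal sketch of the transaction-reduction idea in Python (our own naming, not from the slides; the extra transaction {A, D} is added for illustration). A transaction containing no frequent k-itemset cannot contribute to any frequent (k+1)-itemset, so it can be skipped in later scans.

    def reduce_transactions(transactions, Lk):
        """Drop transactions containing no itemset from Lk; only the
        survivors can support a (k+1)-itemset in the next scan."""
        return [t for t in transactions if any(c <= t for c in Lk)]

    db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"),
          frozenset("BE"), frozenset("AD")]
    L2 = {frozenset("AC"), frozenset("BC"), frozenset("BE"), frozenset("CE")}
    print(len(reduce_transactions(db, L2)))  # 4 -- {A, D} is dropped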
 
Applications Of Apriori Algorithm
 
Some fields where Apriori is used:

In Education: extracting association rules by mining data on admitted students through their characteristics and specialties.
In Medicine: for example, analysis of a patient database.
In Forestry: analysis of the probability and intensity of forest fires using forest fire data.
Apriori is used by many companies, such as Amazon in its recommender system, and by Google for the auto-complete feature.
 
References
 
https://www.geeksforgeeks.org/apriori-algorithm/
https://www.softwaretestinghelp.com/apriori-algorithm/
https://www.edureka.co/blog/apriori-algorithm/