Understanding the Apriori Algorithm for Association Rule Mining
The Apriori algorithm is a popular method in data mining for finding frequent itemsets in a dataset. It involves steps like candidate generation, testing, and pruning to iteratively identify the most frequent itemset. By setting a minimum support threshold, the algorithm efficiently discovers patterns in transaction databases by following a systematic approach.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Apriori: A Candidate Generation & Test Approach Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB 94, Mannila, et al. @ KDD 94) Method: Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Test the candidates against DB Terminate when no frequent or candidate set can be generated 2
Steps In Apriori Apriori algorithm is a sequence of steps to be followed to find the most frequent itemset in the given database. This data mining technique follows the join and the prune steps iteratively until the most frequent itemset is achieved. A minimum support threshold is given in the problem or it is assumed by the user. #1) In the first iteration of the algorithm, each item is taken as a 1-itemsets candidate. The algorithm will count the occurrences of each item. #2) Let there be some minimum support, min_sup ( eg 2). The set of 1 itemsets whose occurrence is satisfying the min sup are determined. Only those candidates which count more than or equal to min_sup, are taken ahead for the next iteration and the others are pruned. #3) Next, 2-itemset frequent items with min_sup are discovered. For this in the join step, the 2- itemset is generated by forming a group of 2 by combining items with itself. #4) The 2-itemset candidates are pruned using min-sup threshold value. Now the table will have 2 itemsets with min-sup only. #5) The next iteration will form 3 itemsets using join and prune step. This iteration will follow antimonotone property where the subsets of 3-itemsets, that is the 2 itemset subsets of each group fall in min_sup. If all 2-itemset subsets are frequent then the superset will be frequent otherwise it is pruned. #6) Next step will follow making 4-itemset by joining 3-itemset with itself and pruning if its subset does not meet the min_sup criteria. The algorithm is stopped when the most frequent itemset is achieved.
The Apriori AlgorithmAn Example Supmin = 2 Itemset {A} {B} {C} {D} {E} sup 2 3 3 1 3 Itemset {A} {B} {C} {E} sup 2 3 3 3 Database TDB L1 C1 Tid 10 20 30 40 Items A, C, D B, C, E A, B, C, E B, E 1st scan Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} sup 1 2 1 2 3 2 C2 C2 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} L2 2nd scan Itemset {A, C} {B, C} {B, E} {C, E} sup 2 2 3 2 Itemset {B, C, E} sup 2 Itemset {B, C, E} L3 C3 3rd scan 4
The Apriori Algorithm (Pseudo-Code) Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk != ; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return kLk; 5
Implementation of Apriori How to generate candidates? Step 1: self-joining Lk Step 2: pruning Example of Candidate-generation L3={abc, abd, acd, ace, bcd} Self-joining: L3*L3 abcd from abc and abd acde from acd and ace Pruning: acde is removed because ade is not in L3 C4 = {abcd} 6
How to Count Supports of Candidates? Why counting supports of candidates a problem? The total number of candidates can be very huge One transaction may contain many candidates Method: Candidate itemsets are stored in a hash-tree Leaf node of hash-tree contains a list of itemsets and counts Interior node contains a hash table Subset function: finds all the candidates contained in a transaction 7
Counting Supports of Candidates Using Hash Tree Subset function Transaction: 1 2 3 5 6 3,6,9 1,4,7 2,5,8 1 + 2 3 5 6 2 3 4 5 6 7 1 3 + 5 6 3 6 7 3 6 8 1 4 5 3 5 6 3 5 7 6 8 9 3 4 5 1 3 6 1 2 + 3 5 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 8
Candidate Generation: An SQL Implementation SQL Implementation of candidate generation Suppose the items in Lk-1 are listed in an order Step 1: self-joining Lk-1 insert intoCk select p.item1, p.item2, , p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, , p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1 Step 2: pruning forall itemsets c in Ckdo forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD 98] 9
Methods To Improve Apriori Efficiency Many methods are available for improving the efficiency of the algorithm. Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and its corresponding count. It uses a hash function for generating the table. Transaction Reduction: This method reduces the number of transactions scanning in iterations. The transactions which do not contain frequent items are marked or removed. Partitioning: This method requires only two database scans to mine the frequent itemsets. It says that for any itemset to be potentially frequent in the database, it should be frequent in at least one of the partitions of the database. Sampling: This method picks a random sample S from Database D and then searches for frequent itemset in S. It may be possible to lose a global frequent itemset. This can be reduced by lowering the min_sup. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database.
Applications Of Apriori Algorithm Some fields where Apriori is used: In Education Field: Extracting association rules in data mining of admitted students through characteristics and specialties. In the Medical field: For example Analysis of the patient's database. In Forestry: Analysis of probability and intensity of forest fire with the forest fire data. Apriori is used by many companies like Amazon in the Recommender System and by Google for the auto-complete feature.
References https://www.geeksforgeeks.org/apriori- algorithm/ https://www.softwaretestinghelp.com/apriori- algorithm/ https://www.edureka.co/blog/apriori- algorithm/