Understanding FP-Growth Algorithm for Association Analysis in Data Warehousing and Data Mining
FP-Growth algorithm is a powerful method for discovering frequent itemsets in data sets. It utilizes a compact data structure called an FP-tree to efficiently mine frequent patterns. By encoding data into an FP-tree representation, the algorithm can identify frequent itemsets directly from memory, making it more efficient compared to traditional methods that require repeated passes over disk-stored data.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
FP Growth Algorithm Association Analysis (Unit - IV) DWDM
FP Growth Algorithm FP-growth algorithm takes a radically different approach for discovering frequent itemsets. The algorithm encodes the data set using a compact data structure called an FP frequent itemsets directly from this structure FP- -Tree Representation Tree Representation An FP An FP- -tree is a compressed representation of the input data tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. As different transactions can have several items in common, their paths may overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly from the structure in memory instead of making repeated passes over the data stored on disk. FP- -tree tree and extracts FP
FP Tree Representation Figure 6.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label label of an item along with a counter counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node root node represented by the null null symbol.
FP Tree Representation 1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in Figure, a is the most frequent item, followed by b, c, d, and e.
FP Tree Representation 2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null a a b b to encode the transaction. Every node along the path has a frequency count of 1. null
FP Tree Representation 3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null b c d. Every node along this path also has a frequency count equal to one. 4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null a c d e, overlaps with the path for the first transaction, null a b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one. 5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown in Figure 6.24.
FP Tree Representation The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items.
FP Tree Representation The size of an FP-tree also depends on how the items are ordered. If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP- tree is shown in Figure 6.25. An FP-tree also contains a list of pointers connecting pointers connecting between nodes that have the between nodes that have the same items. same items. These pointers, represented as dashed lines in Figures 6.24 and 6.25, help to facilitate the rapid access of individual items in the tree.
Frequent Itemset Generation using FP-Growth Algorithm Steps in FP Steps in FP- -Growth Algorithm: Growth Algorithm: Step Step- -1: 1: Scan the database to build Frequent 1 the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. Step Step- -2: 2: For each transaction, the respective Ordered Step Step- -3: 3: Construct the FP tree. by scanning each Ordered Step Step- -4: 4: For each item, the Conditional Pattern Base Conditional Pattern Base is computed which is path labels of all the paths which lead to any node of the given item in the frequent-pattern tree. Step Step- -5: 5: For each item, the Conditional Frequent Pattern Tree is built. Conditional Frequent Pattern Tree is built. Step Step- -6: Frequent Pattern rules 6: Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set to each corresponding item. Frequent 1- -item set item set which will contain all Ordered- -Item set Ordered- -Item set Item set is built. Item set
Frequent Itemset Generation in FP-Growth Algorithm Example: Given Database: min_support=3 The frequency of each individual item is computed:-
Frequent Itemset Generation in FP-Growth Algorithm A Frequent Pattern set Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order descending order of their respective frequencies. L = {K : 5, E : 4, M : 3, O : 3, Y : 3} L = {K : 5, E : 4, M : 3, O : 3, Y : 3} Now, for each transaction, the respective Ordered iterating the Frequent Pattern set and checking if the current item is contained in the transaction. The following table is built for all the transactions: Ordered- -Item set Item set is built. It is done by
Frequent Itemset Generation in FP-Growth Algorithm Now, all the Ordered-Item sets are inserted into a Trie Data Structure. a) a) Inserting the set {K, E, M, O, Inserting the set {K, E, M, O, Y}: Y}: All the items are simply linked one after the other in the order of occurrence in the set and initialize the support count for each item as 1.
Frequent Itemset Generation in FP-Growth Algorithm b) Inserting the set {K, E, O, Y}: Inserting the set {K, E, O, Y}: Till the insertion of the elements K and E, simply the support count is increased by 1. There is no direct link between E and O, therefore a new node for the item O is initialized with the support count as 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count as 1 and link the new node of O with the new node of Y.
Frequent Itemset Generation in FP-Growth Algorithm c) Inserting the set {K, E, M}: Inserting the set {K, E, M}: Here simply the support count of each element is increased by 1.
Frequent Itemset Generation in FP-Growth Algorithm d) Inserting the set {K, M, Y}: Inserting the set {K, M, Y}: Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.
Frequent Itemset Generation in FP-Growth Algorithm e) Inserting the set {K, E, O}: Inserting the set {K, E, O}: Here simply the support counts of the respective elements are increased.
Frequent Itemset Generation in FP-Growth Algorithm Now, for each item starting from leaf which is path labels of all the paths which lead to any node of the given item in the frequent-pattern tree. leaf, the Conditional Pattern Base Conditional Pattern Base is computed
Frequent Itemset Generation in FP-Growth Algorithm Now for each item, the Conditional Frequent Pattern Tree is built. Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common in all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the paths in the Conditional Pattern Base. summing the support counts of all The itemsets whose support count >= min_support value are retained in the Conditional Frequent Pattern Tree and the rest are discarded.
Frequent Itemset Generation in FP-Growth Algorithm From the Conditional Frequent Pattern tree, the Frequent Pattern rules generated by pairing the items of the Conditional Frequent Pattern Tree set to the corresponding to the item as given in the below table. Frequent Pattern rules are For each row, two types of association rules can be inferred for example for the first row which contains the element, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both the rules is calculated and the one with confidence greater than or equal to the minimum confidence value is retained.