Introduction to Association Analysis in Data Mining

Association analysis in data mining discovers rules from transaction data that predict the occurrence of one item given the presence of others. This process identifies relationships and patterns within datasets, yielding insights into consumer behavior, such as product associations in retail or viewing preferences on platforms like Netflix. This chapter also defines key terms such as itemset, support count, and frequent itemset, and introduces the metrics used to evaluate association rules.


Uploaded on Sep 11, 2024



Presentation Transcript


  1. Stats 202: Statistical Aspects of Data Mining Professor Rajan Patel Lecture 5 = Start Chapter 6 Agenda: 1) Reminder: Midterm is on Monday, July 14th 2) Lecture over Chapter 6

  2. Announcement Midterm Exam: The midterm exam will be Monday, July 14, during the scheduled class time. The best option is to take it in the classroom (even SCPD students). Remote students who absolutely cannot come to the classroom that day should make arrangements with SCPD to take the exam with their proctor. You will submit the exam through Scoryst. You are allowed one 8.5 x 11 inch sheet (front and back) containing notes. No books or computers are allowed, but please bring a handheld calculator. The exam will cover the material that we covered in

  3. Introduction to Data Mining by Tan, Steinbach, Kumar Chapter 6: Association Analysis

  4. What is Association Analysis: Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction. Examples: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. Implication means co-occurrence, not causality! Industry examples: Netflix and Amazon related-video recommendations; Safeway coupons for products

  5. Definitions: Itemset = a collection of one or more items, e.g. {Milk, Bread, Diaper}; a k-itemset is an itemset that contains k items. Support count (σ) = frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2. Support (s) = fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5. Frequent itemset = an itemset whose support is greater than or equal to a minsup threshold
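The counts on this slide can be reproduced in a few lines. The sketch below uses a hypothetical five-transaction basket (an illustrative assumption, chosen so that σ({Milk, Bread, Diaper}) = 2 and s = 2/5, matching the slide); the function names are mine, not the textbook's.

```python
# Hypothetical 5-transaction market baskets (illustrative; chosen so that
# {Milk, Bread, Diaper} appears in exactly 2 of the 5 transactions).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Support count sigma(X): number of transactions containing all of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))  # 2
print(support(X, transactions))        # 0.4
```

With minsup = 0.4, this itemset would count as frequent, since 2/5 = 0.4 meets the threshold.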

  6. Another Definition: Association Rule = an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}

  7. Even More Definitions: Association Rule Evaluation Metrics. Support (s) = fraction of transactions that contain both X and Y, i.e. s(X → Y) = σ(X ∪ Y) / |T|. Confidence (c) = measures how often items in Y appear in transactions that contain X, i.e. c(X → Y) = σ(X ∪ Y) / σ(X). Example: {Milk, Diaper} → {Beer}
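Both metrics reduce to ratios of support counts. A minimal sketch, reusing the same hypothetical five-transaction basket assumed earlier (not data given on the slide):

```python
# Hypothetical 5-transaction market baskets (illustrative assumption).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing every item in the set."""
    return sum(1 for t in transactions if itemset <= t)

def rule_support(X, Y):
    """s(X -> Y) = sigma(X u Y) / |T|."""
    return sigma(X | Y) / len(transactions)

def rule_confidence(X, Y):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return sigma(X | Y) / sigma(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(rule_support(X, Y))               # 0.4
print(round(rule_confidence(X, Y), 2))  # 0.67
```

Note the asymmetry: confidence conditions on the antecedent X, so c({Milk, Diaper} → {Beer}) and c({Beer} → {Milk, Diaper}) generally differ even though their support is identical.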

  8. In class exercise #19: Compute the support for itemsets {a}, {b, d}, and {a,b,d} by treating each transaction ID as a market basket.

  9. In class exercise #20: Use the results in the previous problem to compute the confidence for the association rules {b, d} → {a} and {a} → {b, d}. State what these values mean in plain English.

  10. In class exercise #21: Compute the support for itemsets {a}, {b, d}, and {a,b,d} by treating each customer ID as a market basket.

  11. In class exercise #22: Use the results in the previous problem to compute the confidence for the association rules {b, d} → {a} and {a} → {b, d}. State what these values mean in plain English.

  12. An Association Rule Mining Task: Given a set of transactions T, find all rules having both - support ≥ minsup threshold - confidence ≥ minconf threshold. Brute-force approach: - List all possible association rules - Compute the support and confidence for each rule - Prune rules that fail the minsup and minconf thresholds - Problem: this is computationally prohibitive!
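To see why brute force is prohibitive: over d items, each item can go in the antecedent, the consequent, or neither, and after excluding rules with an empty side the total number of candidate rules is R = 3^d − 2^(d+1) + 1. A quick sketch of how fast this grows:

```python
def num_possible_rules(d):
    # Total association rules over d items: R = 3^d - 2^(d+1) + 1
    # (3 choices per item, minus the partitions where the antecedent
    #  or the consequent ends up empty).
    return 3**d - 2**(d + 1) + 1

for d in (3, 6, 10, 20):
    print(d, num_possible_rules(d))
# Even 6 items yield 602 candidate rules; 20 items yield ~3.5 billion.
```

A real store with thousands of distinct products makes enumerating every rule hopeless, which motivates the decoupling on the next slides.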

  13. The Support and Confidence Requirements can be Decoupled: {Milk, Diaper} → {Beer} (s=0.4, c=0.67); {Milk, Beer} → {Diaper} (s=0.4, c=1.0); {Diaper, Beer} → {Milk} (s=0.4, c=0.67); {Beer} → {Milk, Diaper} (s=0.4, c=0.67); {Diaper} → {Milk, Beer} (s=0.4, c=0.5); {Milk} → {Diaper, Beer} (s=0.4, c=0.5). All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.
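The six confidences above can be checked mechanically: every rule shares σ({Milk, Diaper, Beer}) in its numerator, so only the denominator σ(X) varies. A sketch, again assuming the textbook-style five-transaction basket used earlier (not shown on the slide):

```python
from itertools import combinations

# Hypothetical 5-transaction market baskets (illustrative assumption).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(X):
    return sum(1 for t in transactions if X <= t)

itemset = {"Milk", "Diaper", "Beer"}
s = sigma(itemset) / len(transactions)  # shared by every rule below

# Enumerate every binary partition X -> (itemset - X)
for k in (1, 2):
    for X in (set(c) for c in combinations(sorted(itemset), k)):
        Y = itemset - X
        c = sigma(itemset) / sigma(X)  # only sigma(X) changes per rule
        print(f"{sorted(X)} -> {sorted(Y)}: s={s}, c={round(c, 2)}")
```

Every line prints s=0.4, while the confidences range from 0.5 to 1.0, which is exactly the decoupling argument: filter itemsets by support once, then filter partitions by confidence.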

  14. Two Step Approach: 1) Frequent Itemset Generation = Generate all itemsets whose support ≥ minsup 2) Rule Generation = Generate high confidence (confidence ≥ minconf) rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Note: Frequent itemset generation is still computationally expensive and your book discusses algorithms that can be used
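The two steps can be sketched end-to-end. This is a naive sketch under the same assumed five-transaction basket: step 1 enumerates every subset (which is exactly the expensive part that algorithms like Apriori prune), and step 2 partitions each frequent itemset into candidate rules.

```python
from itertools import combinations

# Hypothetical 5-transaction market baskets (illustrative assumption).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup, minconf = 0.4, 0.6
n = len(transactions)
items = sorted(set().union(*transactions))

def sigma(X):
    return sum(1 for t in transactions if X <= t)

# Step 1: frequent itemsets (naive enumeration of all subsets;
# the book's algorithms prune this search instead).
frequent = [set(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sigma(set(c)) / n >= minsup]

# Step 2: high-confidence rules, one candidate per binary partition
# of each frequent itemset of size >= 2.
rules = []
for F in frequent:
    for k in range(1, len(F)):
        for X in (set(c) for c in combinations(sorted(F), k)):
            if sigma(F) / sigma(X) >= minconf:
                rules.append((X, F - X))

for X, Y in rules:
    print(sorted(X), "->", sorted(Y))
```

Because support is computed per itemset in step 1 and never changes across a partition, step 2 only has to re-check confidence, which is the payoff of the decoupling on the previous slide.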

  15. In class exercise #23: Use the two step approach to generate all rules having support ≥ .4 and confidence ≥ .6 for the transactions below.

  16. In class exercise #23: Use the two step approach to generate all rules having support ≥ .4 and confidence ≥ .6 for the transactions below.

1) Create a CSV file: one row per transaction, one column per item

Milk  Beer  Diapers  Butter  Cookies  Bread
   1     1        1       0        0      0
   1     0        0       1        0      1
   1     0        1       0        1      0

2) Find itemsets of size 2 that have support >= 0.4

data = read.csv("ice23.csv")
num_transactions = dim(data)[1]
num_items = dim(data)[2]
item_labels = colnames(data)
for (col in 1:(num_items-1)) {
  for (col2 in (col+1):num_items) {
    sup = sum(data[,col] * data[,col2]) / num_transactions
    if (sup >= 0.4) {
      print(item_labels[c(col, col2)])
    }
  }
}

  17. Drawback of Confidence

         Coffee  ~Coffee  total
Tea          15        5     20
~Tea         75        5     80
total        90       10    100

Association Rule: Tea → Coffee
Confidence(Tea → Coffee) = P(Coffee | Tea) = 15/20 = 0.75

  18. Drawback of Confidence

         Coffee  ~Coffee  total
Tea          15        5     20
~Tea         75        5     80
total        90       10    100

Association Rule: Tea → Coffee
Confidence(Tea → Coffee) = P(Coffee | Tea) = 0.75, but support(Coffee) = P(Coffee) = 0.9. Although confidence is high, the rule is misleading: confidence(~Tea → Coffee) = P(Coffee | ~Tea) = 75/80 = 0.9375
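One commonly proposed alternative is lift (also called interest), which repairs this drawback by comparing the rule's confidence against the consequent's baseline support. A minimal sketch using the Tea/Coffee counts above:

```python
# Contingency counts from the Tea/Coffee slide (100 transactions total)
n_total = 100
n_tea = 20          # transactions containing Tea
n_coffee = 90       # transactions containing Coffee
n_tea_coffee = 15   # transactions containing both

conf = n_tea_coffee / n_tea      # P(Coffee | Tea) = 0.75
p_coffee = n_coffee / n_total    # P(Coffee) = 0.90

# Lift = confidence / baseline rate of the consequent.
lift = conf / p_coffee
print(round(lift, 3))  # 0.833: below 1, so drinking tea actually makes
                       # buying coffee LESS likely, despite c = 0.75
```

Lift = 1 means the antecedent and consequent are statistically independent; values above 1 indicate positive association and values below 1 negative association, which is exactly the signal confidence alone missed here.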

  19. Other Proposed Metrics:
