Clustering Algorithms in Data Science

Final Review
Lei Chen
Clustering Algorithms
K-Means
Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):
E = Σ_{i=1..k} Σ_{p ∈ C_i} d(p, c_i)²
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions
Heuristic methods: the k-means and k-medoids algorithms
k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
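To make the heuristic concrete, here is a minimal NumPy sketch of Lloyd's k-means iteration; the toy data, k, and iteration count are illustrative placeholders rather than anything prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random objects from the data set.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each object goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned objects.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage with toy 2-D data
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.9, 5.3]])
labels, centroids = kmeans(X, k=2)
```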
Clustering Algorithms
K-Means
K-Medoids
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
Uses real objects to represent the clusters:
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih; if TC_ih < 0, i is replaced by h
3. Assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change
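As an illustration of the swap-cost test TC_ih, here is a hedged NumPy sketch of a PAM-style loop; the precomputed distance matrix D and the greedy best-swap search are assumptions made for the example, not details fixed by the slides.

```python
import numpy as np

def total_swap_cost(D, medoids, i, h):
    """Change in total cost if medoid i is swapped with non-medoid h.
    D is a precomputed pairwise distance matrix (hypothetical input)."""
    proposed = [h if m == i else m for m in medoids]
    cost_now = D[:, medoids].min(axis=1).sum()
    cost_new = D[:, proposed].min(axis=1).sum()
    return cost_new - cost_now   # TC_ih < 0 means the swap improves the clustering

def pam(D, k, max_iter=100):
    """Minimal PAM loop: keep applying improving swaps until none remains."""
    n = D.shape[0]
    medoids = list(range(k))            # arbitrary initial representatives
    for _ in range(max_iter):
        best = min(((total_swap_cost(D, medoids, i, h), i, h)
                    for i in medoids for h in range(n) if h not in medoids),
                   key=lambda t: t[0])
        if best[0] >= 0:
            break                        # no swap with negative total cost
        medoids[medoids.index(best[1])] = best[2]
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Example usage with a toy 1-D data set turned into a distance matrix
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.1]])
D = np.abs(X - X.T)
print(pam(D, k=2))
```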
Clustering Algorithms
K-Means
K-Medoids
Hierarchical Clustering
Hierarchical Clustering
Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min(t_ip, t_jq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max(t_ip, t_jq)
Average: average distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = avg(t_ip, t_jq)
Centroid: distance between the centroids of two clusters, i.e., dist(K_i, K_j) = dist(C_i, C_j)
Medoid: distance between the medoids of two clusters, i.e., dist(K_i, K_j) = dist(M_i, M_j)
Medoid: a chosen, centrally located object in the cluster
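The element-wise criteria above can be written down directly; the NumPy sketch below (the cluster arrays Ki, Kj and the Euclidean metric are illustrative assumptions) mirrors those definitions. In practice, libraries such as SciPy expose the same criteria, e.g., scipy.cluster.hierarchy.linkage with method='single', 'complete', 'average', or 'centroid'.

```python
import numpy as np

def linkage_distance(Ki, Kj, method="single"):
    """Distance between two clusters Ki, Kj (arrays of points) under the
    single, complete, average, or centroid criterion defined above."""
    pairwise = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)
    if method == "single":      # smallest pairwise distance
        return pairwise.min()
    if method == "complete":    # largest pairwise distance
        return pairwise.max()
    if method == "average":     # mean pairwise distance
        return pairwise.mean()
    if method == "centroid":    # distance between the cluster centroids
        return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
    raise ValueError(f"unknown method: {method}")

# Example usage with two toy clusters
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(linkage_distance(A, B, "single"), linkage_distance(A, B, "complete"))
```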
Clustering Algorithms
K-Means
K-Medoids
Hierarchical Clustering
Density-based Clustering
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(q): {p belongs to D | dist(p, q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to N_Eps(q)
core point condition: |N_Eps(q)| ≥ MinPts
Density-Reachable and Density-Connected
Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i
Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points at the interior of a cluster
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
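For reference, scikit-learn ships an implementation of this algorithm; in the short usage sketch below the data, eps, and min_samples values are purely illustrative (Eps maps to eps and MinPts to min_samples).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data: two dense blobs plus one isolated noise point (values are illustrative).
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
              [9.0, 0.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # cluster ids; noise points are labelled -1
```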
Clustering Algorithms
K-Means
K-Medoids
Hierarchical Clustering
Density-based Clustering
Fuzzy set-based Clustering
Measuring Clustering Quality
Fuzzy (Soft) Clustering
Example: let the cluster features be
C_1: "digital camera" and "lens"
C_2: "computer"
Fuzzy clustering: k fuzzy clusters C_1, ..., C_k, represented as a partition matrix M = [w_ij]
P1: for each object o_i and cluster C_j, 0 ≤ w_ij ≤ 1 (fuzzy set)
P2: for each object o_i, Σ_{j=1..k} w_ij = 1 (equal participation in the clustering)
P3: for each cluster C_j, 0 < Σ_{i=1..n} w_ij < n (ensures there is no empty cluster)
Let c_1, ..., c_k be the centers of the k clusters
For an object o_i, the sum of squared error (SSE), where p is a parameter: SSE(o_i) = Σ_{j=1..k} w_ij^p dist(o_i, c_j)²
For a cluster C_j, SSE: SSE(C_j) = Σ_{i=1..n} w_ij^p dist(o_i, c_j)²
Measure how well a clustering fits the data: SSE(C) = Σ_{i=1..n} Σ_{j=1..k} w_ij^p dist(o_i, c_j)²
The EM (Expectation Maximization) Algorithm
The k-means algorithm has two steps at each iteration:
Expectation step (E-step): given the current cluster centers, each object is assigned to the cluster whose center is closest to the object: an object is expected to belong to the closest cluster
Maximization step (M-step): given the cluster assignment, for each cluster the algorithm adjusts the center so that the sum of distances from the objects assigned to this cluster to the new center is minimized
The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models
E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters
M-step: finds the new clustering or parameters that minimize the sum of squared error (SSE) or maximize the expected likelihood
Fuzzy Clustering Using the EM Algorithm
Initially, let c_1 = a and c_2 = b
1st E-step: assign each object o to the clusters with membership weights computed from the current centers c_1 and c_2
1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)
Iteratively calculate this until the cluster centers converge or the change is small enough
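A minimal sketch of this E/M loop in NumPy, assuming the standard fuzzy c-means membership update with fuzzifier p (the data, p, and iteration count are illustrative; the slides do not prescribe this exact update):

```python
import numpy as np

def fuzzy_cmeans(X, k, p=2.0, n_iter=50, seed=0):
    """Minimal fuzzy-clustering EM loop (fuzzy c-means style), fuzzifier p > 1."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # E-step: membership weights from distances to the current centers.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (p - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)      # rows sum to 1 (property P2)
        # M-step: recompute each center as a weighted mean, minimizing SSE.
        Wp = W ** p
        centers = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
    return W, centers

# Example usage on toy 2-D data
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
W, centers = fuzzy_cmeans(X, k=2)
```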
Clustering Algorithms
K-Means
K-Medoids
Hierarchical Clustering
Density-based Clustering
Fuzzy set-based Clustering
Probabilistic Model-Based Clustering
Measuring Clustering Quality
Model-Based Clustering
A set C of k probabilistic clusters C_1, ..., C_k with probability density functions f_1, ..., f_k, respectively, and their probabilities ω_1, ..., ω_k.
Probability of an object o generated by cluster C_j: P(o | C_j) = ω_j f_j(o)
Probability of o generated by the set of clusters C: P(o | C) = Σ_{j=1..k} ω_j f_j(o)
Since objects are assumed to be generated independently, for a data set D = {o_1, ..., o_n}, we have P(D | C) = Π_{i=1..n} P(o_i | C) = Π_{i=1..n} Σ_{j=1..k} ω_j f_j(o_i)
Task: find a set C of k probabilistic clusters such that P(D | C) is maximized
However, maximizing P(D | C) is often intractable, since the probability density function of a cluster can take an arbitrarily complicated form
To make it computationally feasible (as a compromise), assume the probability density functions are some parameterized distributions
Univariate Gaussian Mixture Model
O = {o_1, ..., o_n} (n observed objects), Θ = {θ_1, ..., θ_k} (parameters of the k distributions), and P_j(o_i | θ_j) is the probability that o_i is generated from the j-th distribution using parameter θ_j; the mixture density is P(o_i | Θ) = Σ_{j=1..k} ω_j P_j(o_i | θ_j)
Univariate Gaussian mixture model
Assume the probability density function of each cluster follows a 1-d Gaussian distribution, and suppose that there are k clusters
The probability density function of cluster j is centered at μ_j with standard deviation σ_j, i.e., θ_j = (μ_j, σ_j) and P_j(o_i | θ_j) = (1 / (√(2π) σ_j)) exp(−(o_i − μ_j)² / (2σ_j²))
Computing Mixture Models with EM
Given n objects O = {o_1, ..., o_n}, we want to mine a set of parameters Θ = {θ_1, ..., θ_k} s.t. P(O | Θ) is maximized, where θ_j = (μ_j, σ_j) are the mean and standard deviation of the j-th univariate Gaussian distribution
We initially assign random values to the parameters θ_j, then iteratively conduct the E- and M-steps until convergence or a sufficiently small change
At the E-step, for each object o_i, calculate the probability that o_i belongs to each distribution
At the M-step, adjust the parameters θ_j = (μ_j, σ_j) so that the expected likelihood P(O | Θ) is maximized
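The NumPy sketch below spells out these two steps for a univariate Gaussian mixture; the random initialization, the toy data, and the use of mixture weights w are illustrative choices, not details fixed by the slides.

```python
import numpy as np

def gmm_em_1d(o, k, n_iter=100, seed=0):
    """EM for a univariate Gaussian mixture; returns weights, means, stds.
    A minimal sketch: random initialization, then alternate E- and M-steps."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(o, size=k, replace=False).astype(float)
    sigma = np.full(k, o.std() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each distribution j for each object o_i.
        dens = (w / (np.sqrt(2 * np.pi) * sigma)
                * np.exp(-(o[:, None] - mu) ** 2 / (2 * sigma ** 2)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters so the expected likelihood increases.
        nk = resp.sum(axis=0)
        w = nk / len(o)
        mu = (resp * o[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (o[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return w, mu, sigma

# Example usage on a toy 1-D sample
o = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])
print(gmm_em_1d(o, k=2))
```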
Frequent Item Sets
Brute-Force Solution
Frequent Itemset Generation
Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Match each transaction against every candidate
Complexity ~ O(NMw) => expensive since M = 2^d !!!
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce the size of N as the size of the itemsets increases
Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction
Frequent Item Sets
Brute-Force Solution
Apriori Property and Algorithm
Reducing Number of Candidates
Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
The Apriori principle holds due to the following property of the support measure: ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y), i.e., the support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
Apriori Algorithm
Method:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
Generate length (k+1) candidate itemsets from length k
frequent itemsets
Prune candidate itemsets containing subsets of length k that
are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those
that are frequent
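A compact Python sketch of this loop (the transaction data and the absolute minsup value are illustrative; real implementations add many optimizations):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: transactions is a list of item sets,
    minsup an absolute support count; returns {frozenset: support}."""
    transactions = [set(t) for t in transactions]
    # Length-1 frequent itemsets
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]): s for i in items
            if (s := sum(i in t for t in transactions)) >= minsup}
    result, k = dict(freq), 1
    while freq:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        prev = list(freq)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates containing an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count support by scanning the DB, keep only the frequent candidates
        freq = {c: s for c in candidates
                if (s := sum(c <= t for t in transactions)) >= minsup}
        result.update(freq)
        k += 1
    return result

# Example usage
T = [{"bread", "milk"}, {"bread", "diaper", "beer"}, {"milk", "diaper", "beer"},
     {"bread", "milk", "diaper"}, {"bread", "milk", "beer"}]
print(apriori(T, minsup=3))
```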
Frequent Item Sets
Brute-Force Solution
Apriori Property and Algorithm
Hashing Tree
Reducing Number of Comparisons
Candidate counting:
Scan the database of transactions to determine the support of each candidate itemset
To reduce the number of comparisons, store the candidates in a hash structure
Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7},
{3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
 Hash function
 Max leaf size: max number of itemsets stored in a leaf node (if number of
candidate itemsets exceeds max leaf size, split the node)
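To illustrate the idea, here is a toy sketch that buckets the 15 candidates at the root level of a hash tree using h(item) = item mod 3; the hash function, the single-level tree, and the leaf handling are simplifying assumptions made for the example.

```python
from collections import defaultdict

# The 15 candidate 3-itemsets listed above
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
              (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7),
              (6, 8, 9), (3, 6, 7), (3, 6, 8)]

def bucket(itemset, h=lambda item: item % 3):
    """Root-level bucket of a hash tree: hash on the first item of the itemset."""
    return h(itemset[0])

tree = defaultdict(list)
for c in candidates:
    tree[bucket(c)].append(c)   # deeper levels would hash on later items / split leaves

# A transaction only needs to be matched against the buckets its items hash to.
print(dict(tree))
```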
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is
frequent
Closed Itemset
An itemset is closed if none of its immediate supersets has the
same support as the itemset
Frequent Item Sets
Brute-Force Solution
Apriori Property and Algorithm
Hashing Tree
FP-tree
FP-tree
Scan the database once to store all essential information in a data structure called an FP-tree (Frequent Pattern Tree)
The FP-tree is concise and is used in directly generating large itemsets
FP-tree
Step 1: Deduce the ordered frequent items. For items with the same frequency, the order is given by the alphabetical order.
Step 2: Construct the FP-tree from the above data.
Step 3: From the FP-tree above, construct the FP-conditional tree for each item (or itemset).
Step 4: Determine the frequent patterns.
Frequent Item Sets
Brute-Force Solution
Apriori Property and Algorithm
Hashing Tree
FP-tree
Continuous and Categorical Attributes
Frequent Item Sets
Brute-Force Solution
Apriori Property and Algorithm
Hashing Tree
FP-tree
Continuous and Categorical Attributes
Sequence Pattern Mining
Sequential Pattern Mining: Definition
Given:
a database of sequences
a user-specified minimum support threshold, minsup
Task:
Find all subsequences with support ≥ minsup
Sequential Pattern Mining: Challenge
Sequential Pattern Mining: Example
Minsup = 50%
Examples of frequent subsequences:
< {1,2} >        s=60%
< {2,3} >        s=60%
< {2,4} >        s=80%
< {3} {5} >      s=80%
< {1} {2} >      s=80%
< {2} {2} >      s=60%
< {1} {2,3} >    s=60%
< {2} {2,3} >    s=60%
< {1,2} {2,3} >  s=60%
Generalized Sequential Pattern (GSP)
Step 1: Make the first pass over the sequence database D to yield all the 1-element frequent sequences
Step 2: Repeat until no new frequent sequences are found
Candidate Generation: merge pairs of frequent subsequences found in the (k-1)-th pass to generate candidate sequences that contain k items
Candidate Pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences
Support Counting: make a new pass over the sequence database D to find the support for these candidate sequences
Candidate Elimination: eliminate candidate k-sequences whose actual support is less than minsup
Candidate Generation
Base case (k=2):
Merging two frequent 1-sequences <{i_1}> and <{i_2}> will produce two candidate 2-sequences: <{i_1} {i_2}> and <{i_1 i_2}>
General case (k>2):
A frequent (k-1)-sequence w_1 is merged with another frequent (k-1)-sequence w_2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w_1 is the same as the subsequence obtained by removing the last event in w_2
The resulting candidate after merging is given by the sequence w_1 extended with the last event of w_2:
If the last two events in w_2 belong to the same element, then the last event in w_2 becomes part of the last element in w_1
Otherwise, the last event in w_2 becomes a separate element appended to the end of w_1
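A small Python sketch of this merge rule; sequences are represented as lists of tuples (each tuple is an element), which is an illustrative encoding rather than anything mandated by GSP.

```python
def drop_first_event(seq):
    """Remove the first event (item) of a sequence of elements (tuples)."""
    head, *rest = seq
    return ([tuple(head[1:])] if len(head) > 1 else []) + list(map(tuple, rest))

def drop_last_event(seq):
    """Remove the last event (item) of a sequence of elements (tuples)."""
    *rest, tail = seq
    return list(map(tuple, rest)) + ([tuple(tail[:-1])] if len(tail) > 1 else [])

def gsp_merge(w1, w2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence, or None."""
    if drop_first_event(w1) != drop_last_event(w2):
        return None
    last_event = w2[-1][-1]
    if len(w2[-1]) > 1:                      # last two events of w2 share an element
        return w1[:-1] + [tuple(w1[-1]) + (last_event,)]
    return list(w1) + [(last_event,)]        # otherwise append a new element

# Example from the slides: <{1} {2 3} {4}> merged with <{2 3} {4 5}>
w1 = [(1,), (2, 3), (4,)]
w2 = [(2, 3), (4, 5)]
print(gsp_merge(w1, w2))   # -> [(1,), (2, 3), (4, 5)]
```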
Candidate Generation Examples
Merging the sequences w_1 = <{1} {2 3} {4}> and w_2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}>, because the last two events in w_2 (4 and 5) belong to the same element
Merging the sequences w_1 = <{1} {2 3} {4}> and w_2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}>, because the last two events in w_2 (4 and 5) do not belong to the same element
We do not have to merge the sequences w_1 = <{1} {2 6} {4}> and w_2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}>, because if the latter is a viable candidate, then it can be obtained by merging w_1 with <{1} {2 6} {5}>
GSP Example
Frequent Item Sets
Brute-Force Solution
Apriori Property and Algorithm
Hashing Tree
FP-tree
Continuous and Categorical Attributes
Sequence Pattern Mining
Time Constraint-based Sequence Pattern Mining
Timing Constraints (I)
Mining Sequential Patterns with Timing Constraints
Approach 1:
Mine sequential patterns without timing constraints
Postprocess the discovered patterns
Approach 2:
Modify GSP to directly prune candidates that violate
timing constraints
Question: does the Apriori principle still hold?
Apriori Principle for Sequence Data
Contiguous Subsequences
s is a contiguous subsequence of w = <e_1> <e_2> ... <e_k> if any of the following conditions hold:
1. s is obtained from w by deleting an item from either e_1 or e_k
2. s is obtained from w by deleting an item from any element e_i that contains more than 2 items
3. s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)
Examples: s = < {1} {2} >
is a contiguous subsequence of < {1} {2 3}>, < {1 2} {2} {3}>, and < {3 4} {1 2} {2 3} {4} >
is not a contiguous subsequence of < {1} {3} {2}> and < {2} {1} {3} {2}>
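The recursive definition translates almost directly into code; in the sketch below a sequence is a list of frozensets, which is an illustrative representation chosen for the example.

```python
def contiguous_subsequences(w):
    """All contiguous subsequences obtainable from w by one item deletion,
    following rules 1 and 2 of the definition above (w is a list of frozensets)."""
    out = []
    for idx, elem in enumerate(w):
        # Rule 1 applies to the first/last element; rule 2 to elements with > 2 items.
        if idx in (0, len(w) - 1) or len(elem) > 2:
            for item in elem:
                reduced = elem - {item}
                out.append(w[:idx] + ([reduced] if reduced else []) + w[idx + 1:])
    return out

def len_items(seq):
    return sum(len(e) for e in seq)

def is_contiguous_subsequence(s, w):
    """Rule 3: recursive closure of single item deletions."""
    if s == w:
        return True
    return any(is_contiguous_subsequence(s, shorter)
               for shorter in contiguous_subsequences(w)
               if len_items(shorter) >= len_items(s))

# Examples from the slide: <{1} {2}> is contiguous in <{1} {2 3}> but not in <{1} {3} {2}>
s = [frozenset({1}), frozenset({2})]
print(is_contiguous_subsequence(s, [frozenset({1}), frozenset({2, 3})]))                 # True
print(is_contiguous_subsequence(s, [frozenset({1}), frozenset({3}), frozenset({2})]))    # False
```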
Modified Candidate Pruning Step
Without maxgap constraint:
A candidate k-sequence is pruned if at least one of
its (k-1)-subsequences is infrequent
With maxgap constraint:
A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent
Outlier Detection
Statistical Methods
Outlier Detection (1): Statistical Methods
Statistical methods (also known as model-based methods) assume that the
normal data follow some statistical model (a stochastic model)
The data not following the model are outliers.
Effectiveness of statistical methods: highly depends on whether the assumption of the statistical model holds in the real data
There are rich alternatives using various statistical models, e.g., parametric vs. non-parametric
Example: first use a Gaussian distribution to model the normal data
For each object y in region R, estimate g_D(y), the probability that y fits the Gaussian distribution
If g_D(y) is very low, y is unlikely to be generated by the Gaussian model and is thus an outlier
Statistical Approaches
Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
Idea: learn a generative model fitting the given data set, and then
identify the objects in low probability regions of the model as outliers
Methods are divided into two categories: 
parametric
 vs. 
non-parametric
Parametric method
Assumes that the normal data is generated by a parametric
distribution with parameter 
θ
The probability density function of the parametric distribution 
f
(
x, 
θ
)
gives the probability that object 
x
 is generated by the distribution
The smaller this value, the more likely x is an outlier
Non-parametric method
Does not assume an a priori statistical model; the model is determined from the input data
Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
Examples: histogram and kernel density estimation
Parametric Methods I: Detecting Univariate Outliers Based on the Normal Distribution
Univariate data: A data set involving only one attribute or variable
Often assume that data are generated from a normal distribution, learn
the parameters from the input data, and identify the points with low
probability as outliers
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
Use the maximum likelihood method to estimate μ and σ
Taking derivatives of the log-likelihood with respect to μ and σ², we derive the maximum likelihood estimates μ̂ = (1/n) Σ x_i and σ̂² = (1/n) Σ (x_i − μ̂)²
For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.51
Then (24 − 28.61) / 1.51 = −3.04 < −3, so 24 is an outlier, since the region μ̂ ± 3σ̂ contains 99.7% of the data under the normality assumption
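A short sketch of this check in NumPy; the 3σ cutoff is the rule implied above, while the function name and printout are illustrative.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Fit a normal distribution by maximum likelihood and flag points whose
    standardized deviation exceeds the threshold (the 3-sigma rule)."""
    mu_hat = x.mean()
    sigma_hat = x.std()              # MLE uses the 1/n estimator
    z = (x - mu_hat) / sigma_hat
    return z, x[np.abs(z) > threshold]

temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
z, flagged = zscore_outliers(temps)
print(np.round(z, 2), flagged)       # 24.0 stands far from the rest of the sample
```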
Outlier Discovery:
Statistical Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g., mean, variance)
number of expected outliers
Drawbacks
most tests are for single attribute
In many cases, data distribution may not be known
Parametric Methods II: Detection of
Multivariate Outliers
Multivariate data: A data set involving two or more attributes or
variables
Transform the multivariate outlier detection task into a univariate
outlier detection problem
Method 1: compute the Mahalanobis distance
Let ō be the mean vector for a multivariate data set. The Mahalanobis distance from an object o to ō is MDist(o, ō) = (o − ō)^T S^{−1} (o − ō), where S is the covariance matrix
Use Grubbs' test on this measure to detect outliers
Method 2: use the χ²-statistic, χ² = Σ_{i=1..n} (o_i − E_i)² / E_i, where E_i is the mean of the i-th dimension among all objects and n is the dimensionality
If the χ²-statistic is large, then the object o is an outlier
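Method 1 in a few lines of NumPy (the toy data is illustrative; a real test would compare the resulting distances against a Grubbs'-test or χ² cutoff):

```python
import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance of each row of X to the sample mean,
    MDist(o, o_bar) = (o - o_bar)^T S^{-1} (o - o_bar)."""
    o_bar = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - o_bar
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

# Toy multivariate data; the last row is far from the bulk of the points
X = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.2, 2.2], [8.0, 9.0]])
print(mahalanobis_distances(X))
```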
 
Non-Parametric Methods: Detection Using Histogram
The model of normal data is learned from the
input data without any 
a priori
 structure.
Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
Outlier detection using a histogram:
The figure shows the histogram of purchase amounts in transactions
A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
Problem: it is hard to choose an appropriate bin size for the histogram
Too small a bin size → normal objects fall in empty/rare bins: false positives
Too big a bin size → outliers fall in some frequent bins: false negatives
Solution: Adopt kernel density estimation to estimate the probability
density distribution of the data.  If the estimated density function is high,
the object is likely normal.  Otherwise, it is likely an outlier.
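A simple sketch of histogram-based scoring (the bin count, data, and the 1 − frequency score are illustrative choices; kernel density estimation would replace the histogram in the smoothed variant mentioned above):

```python
import numpy as np

def histogram_outlier_scores(x, bins=10):
    """Score each value by the (low) relative frequency of its histogram bin;
    values landing in rare bins look more outlier-like."""
    counts, edges = np.histogram(x, bins=bins)
    freq = counts / counts.sum()
    # np.digitize with the inner edges maps each value to its 0-based bin index
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
    return 1.0 - freq[idx]          # higher score = rarer bin

amounts = np.array([30, 45, 50, 55, 60, 70, 80, 90, 120, 7500], dtype=float)
print(histogram_outlier_scores(amounts).round(3))
```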
Outlier Detection
Statistical Methods
Proximity-based Methods
Distance-based
Density-based
Outlier Detection (2): Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object significantly deviates from the proximity of most of the other objects in the same data set
The effectiveness of proximity-based methods highly relies on the
proximity measure.
In some applications, proximity or distance measures cannot be
obtained easily.
Often have a difficulty in finding a group of outliers which stay close to
each other
Two major types of proximity-based outlier detection
Distance-based vs. density-based
Example: model the proximity of an object using its 3 nearest neighbors
Objects in region R are substantially different
from other objects in the data set.
Thus the objects in R are outliers
Proximity-Based Approaches: Distance-Based vs.
Density-Based Outlier Detection
Intuition: Objects that are far away from the others are
outliers
Assumption of proximity-based approach: The proximity of
an outlier deviates significantly from that of most of the
others in the data set
Two types of proximity-based outlier detection methods
Distance-based outlier detection: An object o is an
outlier if its neighborhood does not have enough other
points
Density-based outlier detection: An object o is an outlier
if its density is relatively much lower than that of its
neighbors
Distance-Based Outlier Detection
For each object o, examine the number of other objects in the r-neighborhood of o, where r is a user-specified distance threshold
An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o
An object o is a DB(r, π) outlier if ||{o' | dist(o, o') ≤ r}|| / ||D|| ≤ π
Equivalently, one can check the distance between o and its k-th nearest neighbor o_k, where k = ⌈π ||D||⌉; o is an outlier if dist(o, o_k) > r
Efficient computation: nested-loop algorithm
For any object o_i, calculate its distance from the other objects, and count the number of other objects in the r-neighborhood
If π·n other objects are within distance r, terminate the inner loop
Otherwise, o_i is a DB(r, π) outlier
Efficiency: the actual CPU time is not O(n²) but linear in the data set size, since for most non-outlier objects the inner loop terminates early
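A direct nested-loop sketch of this test in Python (the data, r, and π values are illustrative; the early-termination trick is the one described above):

```python
import numpy as np

def db_outliers(X, r, pi):
    """Nested-loop detection of DB(r, pi) outliers: o is an outlier if the
    fraction of objects within distance r of o is at most pi."""
    n = len(X)
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if i != j and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count > pi * n:      # enough near neighbors: o_i is not an outlier
                    break               # early termination of the inner loop
        else:
            outliers.append(i)          # inner loop ran to completion: DB(r, pi) outlier
    return outliers

# Toy example: the last point is isolated
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [0.2, 0.1], [5.0, 5.0]])
print(db_outliers(X, r=0.5, pi=0.4))
```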
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposed by
statistical methods
We need multi-dimensional analysis without knowing
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O in
a dataset T such that at least a fraction p of the objects in T
lies at a distance greater than D from O
Algorithms for mining distance-based outliers [Knorr & Ng,
VLDB’98]
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
Density-Based Outlier Detection
Local outliers: outliers relative to their local neighborhoods, instead of the global data distribution
In the figure, o_1 and o_2 are local outliers relative to C_1, o_3 is a global outlier, and o_4 is not an outlier. However, proximity-based clustering cannot find that o_1 and o_2 are outliers (e.g., comparing with o_4).
Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors
Method: use the relative density of an object against its neighbors as the indicator of the degree of the object being an outlier
k-distance of an object o, dist_k(o): distance between o and its k-th nearest neighbor
k-distance neighborhood of o: N_k(o) = {o' | o' in D, dist(o, o') ≤ dist_k(o)}
N_k(o) could be bigger than k since multiple objects may have identical distance to o
Local Outlier Factor: LOF
Reachability distance from o' to o: reachdist_k(o ← o') = max{dist_k(o), dist(o, o')}, where k is a user-specified parameter
Local reachability density of o: lrd_k(o) = ||N_k(o)|| / Σ_{o' ∈ N_k(o)} reachdist_k(o' ← o)
LOF (local outlier factor) of an object o is the average of the ratios of the local reachability density of o's k-nearest neighbors to that of o: LOF_k(o) = ( Σ_{o' ∈ N_k(o)} lrd_k(o') / lrd_k(o) ) / ||N_k(o)||
The lower the local reachability density of o, and the higher the local reachability densities of o's kNN, the higher the LOF
This captures a local outlier whose local density is relatively low compared to the local densities of its kNN
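scikit-learn provides an implementation of LOF; in the usage sketch below the toy data and the n_neighbors value are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two groups of different density plus one point that is only locally unusual
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.1, -0.1],
              [5.0, 5.0], [5.05, 5.0], [5.0, 5.05],
              [0.8, 0.8]])

lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 marks points judged as outliers
scores = -lof.negative_outlier_factor_  # higher = more outlier-like
print(labels, scores.round(2))
```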
Density-Based Local Outlier Detection
M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000.
Distance-based outlier detection is based on the global distance distribution
It encounters difficulties in identifying outliers if the data is not uniformly distributed
Ex.: C_1 contains 400 loosely distributed points, C_2 has 100 tightly condensed points, and there are 2 outlier points o_1, o_2
A distance-based method cannot identify o_2 as an outlier
Need the concept of a local outlier
Local outlier factor (LOF)
Assume the outlier notion is not crisp
Each point has a LOF
Data Cube and OLAP
Data Cube
Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
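As a rough analogy (not a real OLAP engine), the pandas sketch below mimics slice, roll-up, and pivot on a toy fact table; the column names and data are illustrative.

```python
import pandas as pd

# Toy sales fact table with two dimensions (time, location) and one measure
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto", "Toronto", "Vancouver"],
    "amount":  [100, 150, 120, 160, 90, 110],
})

# Slice/dice: select a sub-cube, e.g., only Q1
q1 = sales[sales["quarter"] == "Q1"]

# Roll up along the location hierarchy: aggregate away the city dimension
rollup = sales.groupby("quarter", as_index=False)["amount"].sum()

# Pivot: reorient the (quarter, city) view into a 2-D cross-tab
pivot = sales.pivot_table(index="quarter", columns="city", values="amount", aggfunc="sum")
print(rollup, pivot, sep="\n\n")
```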
Fig. 3.10: Typical OLAP Operations
Data Cube and OLAP
Data Cube
General method to build up a data cube (lecture 10)
Efficient Computation of Data Cubes
General cube computation heuristics (Agarwal et al.'96)
Computing full/iceberg cubes:
Top-down: multi-way array aggregation
Bottom-up: bottom-up computation (BUC)
Integrating top-down and bottom-up
Measures
Clustering Measures
Frequent Item Set Measures
Web Databases
PageRank
HITS
