Correlated Histograms Clustering

A novel unsupervised learning technique that leverages the underlying statistics of a dataset across its different dimensions to identify cluster centroids
Brice Brosig, Lone Star Analysis
Co-Author: Randy Allen, Lone Star Analysis
Agenda
Background
Motivation
Correlated Histograms Clustering
Application – Training Syllabus Outcomes
Application – Robustness Against Noisy Data
Summary
Future Work
Acknowledgements
Background
Supervised learning
The dataset has known, "ground truth" labels.
The task is to fit a model to the data such that it generalizes well to new, unseen data.
Unsupervised learning (clustering)
The practitioner has a dataset without labels.
The task is typically to learn one or more of the following:
The number of classes the instances fall into.
Where those categories are in the domain of the dataset (borders and/or centroids).
What instances fall into which categories.
Semi-supervised learning
A combination of the two – some of the data is labeled, and unsupervised learning tasks can aid the supervised learning.
Motivation
Unlabeled and/or noisy data
Practitioners mostly deal with data that is "messy". From sensors to surveys, we must make use of data that is not easily visualized or categorized – if at all.
No a priori knowledge of the number of clusters
We often can't make assumptions about the data beforehand. Often, the number of clusters is one of the things we want to find out when using a clustering technique.
An interest in the centroids
Centroids express the actual characteristics of the different clusters. These characteristics can be more useful than just knowing which instances fall into which category.
A new clustering / neighborhood metric
Almost all other clustering techniques use distance as the metric to build clusters. This requires careful consideration and normalization of the data.
Motivation
For comparison, common clustering methods, the parameters they require, and the geometry (metric) they use:

Method name                  | Parameters                                                        | Geometry (metric used)
K-Means                      | number of clusters                                                | Distances between points
Affinity propagation         | damping, sample preference                                        | Graph distance (e.g., nearest-neighbor graph)
Mean-shift                   | bandwidth                                                         | Distances between points
Spectral clustering          | number of clusters                                                | Graph distance (e.g., nearest-neighbor graph)
Ward hierarchical clustering | number of clusters or distance threshold                          | Distances between points
Agglomerative clustering     | number of clusters or distance threshold, linkage type, distance | Any pairwise distance
DBSCAN                       | neighborhood size                                                 | Distances between nearest points
OPTICS                       | minimum cluster membership                                        | Distances between points
Gaussian mixtures            | many                                                              | Mahalanobis distances to centers
BIRCH                        | branching factor, threshold, optional global clusterer            | Euclidean distance between points
Bisecting K-Means            | number of clusters                                                | Distances between points
Correlated Histograms Clustering
The algorithm proceeds in three steps:
A. Create a histogram or density estimate of each dimension.
   Prefer the Harrell-Davis quantile-respectful density estimate (QRDE).
B. Use those histograms to compute the modality and the location of each mode.
   Prefer the Lowland modality detection technique to identify modes.
C. Correlate the modes across dimensions:
   1. Pick a mode from some dimension and find the nearest data point.
   2. Look at the other components of that point.
   3. Find the nearest modes to those other components.
   4. Repeat for all modes and dimensions.
To find the modes, we make use of Andrey Akinshin's Lowland modality technique alongside the Harrell-Davis quantile-respectful density estimate.
[Figure (slides 6-9): scatter of x-values vs. y-values with per-dimension histograms; modes A, B, and C are marked as the correlation steps proceed.]
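To make the steps concrete, here is a minimal sketch in Python. It substitutes a Gaussian KDE with prominence-based peak finding for the preferred Harrell-Davis QRDE and Lowland modality detection; the names `find_modes`, `correlated_histograms`, and `prominence_frac` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Correlated Histograms Clustering.
# NOTE: KDE + peak prominence is a stand-in for the preferred
# Harrell-Davis QRDE and Lowland modality detection.
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde


def find_modes(values, prominence_frac=0.05, grid_size=512):
    """Steps A+B: estimate one dimension's density and locate its modes."""
    grid = np.linspace(values.min(), values.max(), grid_size)
    density = gaussian_kde(values)(grid)
    # Keep peaks whose prominence exceeds a fraction of the tallest peak;
    # a smaller fraction behaves like the slides' "high sensitivity".
    peaks, _ = find_peaks(density, prominence=prominence_frac * density.max())
    if peaks.size == 0:  # unimodal edge case: fall back to the global max
        peaks = np.array([np.argmax(density)])
    return grid[peaks]


def correlated_histograms(X, prominence_frac=0.05):
    """Step C: correlate per-dimension modes into candidate centroids."""
    X = np.asarray(X, dtype=float)
    n_dims = X.shape[1]
    modes = [find_modes(X[:, j], prominence_frac) for j in range(n_dims)]
    centroids = set()
    for j in range(n_dims):
        for m in modes[j]:
            # C1: nearest data point to this mode, along this dimension.
            point = X[np.argmin(np.abs(X[:, j] - m))]
            # C2+C3: snap each component of that point to its nearest mode.
            centroid = tuple(
                round(float(modes[k][np.argmin(np.abs(modes[k] - point[k]))]), 3)
                for k in range(n_dims)
            )
            centroids.add(centroid)  # C4: repeat for all modes and dimensions
    return sorted(centroids)
```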
Application – Training Syllabus Outcomes
Suppose we have implemented a pilot training syllabus and evaluated trainees on several metrics.
Knowing the number of outcomes and the characteristics of those outcomes can be useful.
It is likely that one has more than 3 metrics to evaluate, and therefore visualization is difficult.
What is needed:
Discovery of the number of clusters (how many outcomes).
The centroids of each cluster (the characteristics of outcomes).
The ability to do so with n-dimensional data (more than 3 metrics).
Correlated Histograms does all of these!
Application – Training Syllabus Outcomes
[Figure (slides 11-12): our dataset of training metrics, scattered and histogrammed.]
Application – Training Syllabus Outcomes
Identification of the cluster centroids gives us:
The number of outcomes.
A datapoint associated with each outcome:
(1.654, 1.120)
(4.533, 0.441)
(8.324, 0.240)
We walk away knowing the number of trainee types our syllabus produces and ways to describe each type!
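As a usage illustration (a hypothetical demo, not the authors' dataset), three synthetic blobs placed near the centroids reported on this slide can be fed to the sketch above; the exact recovered values depend on the random sample and the sensitivity setting.

```python
# Hypothetical demo: three synthetic blobs near the slide's centroids.
# Exact output depends on the sample; expect roughly three centroids.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([1.65, 1.12], 0.15, size=(200, 2)),
    rng.normal([4.53, 0.44], 0.15, size=(200, 2)),
    rng.normal([8.32, 0.24], 0.15, size=(200, 2)),
])
print(correlated_histograms(X))  # centroids near the three blob means
```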
Application – Robustness Against Noisy Data
Another scenario, run at low sensitivity:
Centroids:
(-8.638, -5.119)
(1.845, 0.537)
Low sensitivity finds 2 centroids amongst the noise!
Application – Robustness Against Noisy Data
The same scenario, run at high sensitivity:
Centroids:
(-8.476, -5.357)
(-4.506, 0.483)
(1.847, 0.483)
High sensitivity identifies an additional centroid!
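Continuing the hypothetical sketch, the sensitivity trade-off maps onto the mode detector's threshold: a large prominence fraction (low sensitivity) suppresses weak modes, while a small one (high sensitivity) lets an additional low-density cluster, or a noise artifact, surface. The scenario below merely echoes the slide's setup with invented values.

```python
# Hypothetical noisy scenario echoing the slide: two strong clusters, one
# weak cluster, and uniform background noise (not the authors' data).
X_noisy = np.vstack([
    rng.normal([-8.5, -5.2], 0.4, size=(150, 2)),
    rng.normal([1.85, 0.5], 0.4, size=(150, 2)),
    rng.normal([-4.5, 0.5], 0.4, size=(40, 2)),  # small, easy-to-miss cluster
    rng.uniform(-12, 6, size=(120, 2)),          # background noise
])
for frac in (0.20, 0.02):  # low sensitivity vs. high sensitivity
    found = correlated_histograms(X_noisy, prominence_frac=frac)
    print(f"prominence_frac={frac}: {len(found)} centroid(s)")
```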
Summary
Correlated Histograms is an unsupervised learning technique with applications anywhere clustering is the task at hand.
The differences are:
You get centroids of clusters rather than classifications of instances.
These centroids are derived from the underlying statistics of the data rather than from distances between points.
Key take-away: statistics is largely underutilized as a metric in classical clustering techniques! Correlated Histograms leverages statistics and can lead to great insights into messy, unfamiliar data.
Future Work
Handling data that is "oddly shaped" with respect to the orthogonal vectors.
Swapping out the Harrell-Davis QRDE for other density estimates, classic histograms, or "adaptive histograms" (also from Andrey Akinshin); a sketch of this swap follows below.
Other modality detection techniques.
Checking agreement between some number of nearest points.
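One speculative way to pursue the estimator swap above is to refactor the earlier sketch so the density estimator is a pluggable callable; `find_modes_pluggable` and `histogram_density` are hypothetical names, not part of the authors' method.

```python
def find_modes_pluggable(values, density_fn, prominence_frac=0.05):
    # Any estimator returning (grid, density) can stand in for the KDE.
    grid, density = density_fn(values)
    peaks, _ = find_peaks(density, prominence=prominence_frac * density.max())
    return grid[peaks] if peaks.size else grid[[np.argmax(density)]]

def histogram_density(values, bins=64):
    """Classic-histogram estimator: bin centers plus normalized counts."""
    counts, edges = np.histogram(values, bins=bins, density=True)
    return (edges[:-1] + edges[1:]) / 2, counts

modes = find_modes_pluggable(X_noisy[:, 0], histogram_density)
```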
Acknowledgements
Dr. Randy Allen for mentoring me and for co-authoring this paper.
