Correlated Histograms Clustering
Correlated Histograms Clustering is a novel unsupervised learning technique that uses the underlying statistics of a dataset across its dimensions to identify cluster centroids. The approach suits unlabeled or noisy data and requires no prior knowledge of the number of clusters. Because it returns cluster centroids, practitioners can extract meaningful characteristics from datasets that are otherwise hard to visualize or categorize, and it offers a statistics-based alternative to the distance metrics used by most classical clustering methods.
Correlated Histograms Clustering
A novel unsupervised learning technique that leverages the underlying statistics of a dataset across its different dimensions to identify cluster centroids.
Brice Brosig, Lone Star Analysis
Co-Author: Randy Allen, Lone Star Analysis
Agenda
- Background
- Motivation
- Correlated Histograms Clustering
- Application: Training Syllabus Outcomes
- Application: Robustness Against Noisy Data
- Summary
- Future Work
- Acknowledgements
Background
Supervised learning: a dataset with known, ground-truth labels. The task is to fit a model to the data such that it also fits new, unseen data well.
Unsupervised learning (clustering): the practitioner has a dataset without labels. The task is typically to learn one or more of the following:
- The number of classes the instances fall into.
- Where those categories are in the domain of the dataset (borders and/or centroids).
- Which instances fall into which categories.
Semi-supervised learning: a combination of the two; some of the data is labeled, and unsupervised learning tasks can aid the supervised learning.
Motivation
Unlabeled and/or noisy data: practitioners mostly deal with data that is messy. From sensors to surveys, we must make use of data that is not easily visualized or categorized, if at all.
No a priori knowledge of the number of clusters: we often can't assume things about the data beforehand. Often, the number of clusters is one of the things we want to find out when using a clustering technique.
An interest in the centroids: centroids express the actual characteristics of the different clusters. These characteristics can be more useful than just knowing which instances fall into which category.
A new clustering/neighborhood metric: almost all other clustering techniques use distance as the metric to build clusters, which requires careful consideration and normalization of the data.
Motivation

| Method | Parameters | Geometry (metric used) |
| --- | --- | --- |
| K-Means | number of clusters | Distances between points |
| Affinity propagation | damping, sample preference | Graph distance (e.g., nearest-neighbor graph) |
| Mean-shift | bandwidth | Distances between points |
| Spectral clustering | number of clusters | Graph distance (e.g., nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Any pairwise distance |
| DBSCAN | neighborhood size | Distances between nearest points |
| OPTICS | minimum cluster membership | Distances between points |
| Gaussian mixtures | many | Mahalanobis distances to centers |
| BIRCH | branching factor, threshold, optional global clusterer | Euclidean distance between points |
| Bisecting K-Means | number of clusters | Distances between points |
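For contrast with the table, a minimal scikit-learn sketch (the data and settings are illustrative, not from the presentation): a distance-based method such as K-Means only yields centroids after the cluster count is supplied as a parameter, which is exactly the a priori knowledge Correlated Histograms avoids.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data with three groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means must be told n_clusters up front; a wrong guess degrades the result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # centroids, but only given the right count
```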
Correlated Histograms Clustering
1. Create a histogram or density estimate of each dimension (prefer the Harrell-Davis Quantile-Respectful Density Estimate).
2. Use those histograms to compute the modality and the location of each mode (prefer the Lowland Modality technique to identify modes).
3. Pick a mode from some dimension and find the nearest point.
4. Look at the other components of that point.
5. Find the nearest modes to those other components.
6. Repeat steps 3-5 for all modes and dimensions.
[Figure, animated across four slides: histograms of the x-values and y-values whose modes are linked through the data, producing the correlated mode pairs (x1, y1), (x2, y2), (x3, y3).]
* To find the modes we make use of Andrey Akinshin's Lowland Modality technique alongside the Harrell-Davis Quantile-Respectful Density Estimate.
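For orientation, a minimal Python sketch of these steps. It substitutes a classic NumPy histogram and SciPy peak detection for the Harrell-Davis QRDE and Lowland Modality technique the authors prefer, and the helper names (`find_modes`, `correlated_histograms`) are ours, so treat it as an approximation of the idea rather than the authors' implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def find_modes(values, bins=30):
    """Steps 1-2: histogram one dimension and locate its modes.
    Peak detection here stands in for the Lowland Modality technique."""
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks, _ = find_peaks(counts, prominence=0.1 * counts.max())
    return centers[peaks]

def correlated_histograms(X, bins=30):
    """Steps 3-6: correlate per-dimension modes through nearest data points."""
    n_points, n_dims = X.shape
    modes = [find_modes(X[:, j], bins) for j in range(n_dims)]
    centroids = set()
    for j in range(n_dims):
        for m in modes[j]:
            # Step 3: the data point nearest this mode along dimension j.
            point = X[np.argmin(np.abs(X[:, j] - m))]
            # Steps 4-5: snap each component of that point to its nearest mode.
            centroid = tuple(
                float(modes[k][np.argmin(np.abs(modes[k] - point[k]))])
                for k in range(n_dims)
            )
            centroids.add(centroid)
    return sorted(centroids)
```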
Application: Training Syllabus Outcomes
Suppose we have implemented a pilot training syllabus and evaluated trainees on several metrics. Knowing the number of outcomes and the characteristics of those outcomes can be useful. It is likely that one has more than 3 metrics to evaluate, and therefore visualization is difficult.
What is needed:
- Discovery of the number of clusters (how many outcomes).
- The centroids of each cluster (the characteristics of the outcomes).
- The ability to do so with n-dimensional data (more than 3 metrics).
Correlated Histograms does all of these!
Application: Training Syllabus Outcomes
[Figure, shown across two slides: our dataset of training metrics, scattered and histogrammed along each dimension.]
Application: Training Syllabus Outcomes
Identification of the cluster centroids gives us:
- The number of outcomes.
- A datapoint associated with each outcome: (1.654, 1.120), (4.533, 0.441), (8.324, 0.240).
We walk away knowing the number of trainee types our syllabus produces and a way to describe each type!
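As a hypothetical illustration, the earlier `correlated_histograms` sketch run on synthetic two-metric data whose generating centers echo (but do not reproduce) the slide's numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three illustrative outcome groups in two training metrics (synthetic data).
centers = np.array([[1.65, 1.12], [4.53, 0.44], [8.32, 0.24]])
X = np.vstack([c + 0.15 * rng.standard_normal((100, 2)) for c in centers])

# Both the number of outcomes and a representative datapoint for each
# should fall out: roughly one recovered centroid per generating center.
print(correlated_histograms(X))
```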
Application: Robustness Against Noisy Data
Another scenario. Centroids found:
- (-8.638, -5.119)
- (1.845, 0.537)
Low sensitivity: finds 2 centroids amongst the noise!
The same scenario, run with high sensitivity. Centroids found:
- (-8.476, -5.357)
- (-4.506, 0.483)
- (1.847, 0.483)
High sensitivity identifies an additional centroid!
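The slides do not say how the sensitivity setting is implemented. One plausible analogue, shown here as a hypothetical `sensitivity` parameter grafted onto the earlier sketch, is the prominence floor used during mode detection: admitting weaker histogram peaks lets weaker modes, and hence additional centroids, surface.

```python
# Hypothetical sensitivity knob for the find_modes helper from the earlier
# sketch; this is an assumed analogue, not the authors' actual parameter.
import numpy as np
from scipy.signal import find_peaks

def find_modes(values, bins=30, sensitivity=0.5):
    """Higher sensitivity lowers the required peak prominence, so weaker
    modes survive; lower sensitivity keeps only the dominant modes."""
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    floor = (1.0 - sensitivity) * counts.max()  # prominence threshold
    peaks, _ = find_peaks(counts, prominence=floor)
    return centers[peaks]
```

At low sensitivity only the strongest modes per dimension pass the floor, while at high sensitivity a weaker mode can survive, mirroring the additional centroid found above.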
Summary
Correlated Histograms is an unsupervised learning technique with applications anywhere clustering is the task at hand. The differences:
- You get the centroids of clusters rather than a classification of instances.
- Those centroids are derived from the underlying statistics of the data rather than from distances between points.
Key take-away: statistics is largely underutilized as a metric in classical clustering techniques! Correlated Histograms leverages statistics and can lead to great insights into messy, unfamiliar data.
Future Work
- Handling data that is oddly shaped with respect to the orthogonal vectors.
- Swapping out the Harrell-Davis QRDE for other density estimates, classic histograms, or adaptive histograms (also from Andrey Akinshin).
- Other modality-detection techniques.
- Checking agreement between some number of nearest points.
Acknowledgements
Dr. Randy Allen, for mentoring me and for co-authoring this paper.