Correlated Histograms Clustering
Correlated Histograms Clustering is a novel unsupervised learning technique that uses the underlying statistics of a dataset across its dimensions to identify cluster centroids. The approach suits unlabeled or noisy data and requires no prior knowledge of the number of clusters. Because it returns cluster centroids, practitioners can extract meaningful characteristics from datasets that are otherwise hard to visualize or categorize, and it offers a statistics-based alternative to the distance metrics used by most classical clustering methods.
Correlated Histograms Clustering
A novel unsupervised learning technique that leverages the underlying statistics of a dataset across its different dimensions to identify cluster centroids.
Brice Brosig, Lone Star Analysis
Co-Author: Randy Allen, Lone Star Analysis
Agenda
- Background
- Motivation
- Correlated Histograms Clustering
- Application: Training Syllabus Outcomes
- Application: Robustness Against Noisy Data
- Summary
- Future Work
- Acknowledgements
Background
Supervised learning: a dataset with known, ground-truth labels. The task is to fit a model to the data such that it also fits new, unseen data well.
Unsupervised learning (clustering): the practitioner has a dataset without labels. The task is typically to learn one or more of the following:
- The number of classes the instances fall into.
- Where those categories are in the domain of the dataset (borders and/or centroids).
- Which instances fall into which categories.
Semi-supervised learning: a combination of the two; some of the data is labeled, and unsupervised learning tasks can aid the supervised learning.
Motivation
Unlabeled and/or noisy data: practitioners mostly deal with data that is messy. From sensors to surveys, we must make use of data that is not easily visualized or categorized, if at all.
No a priori knowledge of the number of clusters: we often can't assume things about the data beforehand. Often, the number of clusters is one of the things we want to find out when using a clustering technique.
An interest in the centroids: centroids express the actual characteristics of the different clusters. These characteristics can be more useful than just knowing which instances fall into which category.
A new clustering/neighborhood metric: almost all other clustering techniques use distance as the metric to build clusters, which requires careful consideration and normalization of the data.
Motivation

| Method | Parameters | Geometry (metric used) |
| --- | --- | --- |
| K-Means | number of clusters | Distances between points |
| Affinity propagation | damping, sample preference | Graph distance (e.g., nearest-neighbor graph) |
| Mean-shift | bandwidth | Distances between points |
| Spectral clustering | number of clusters | Graph distance (e.g., nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Any pairwise distance |
| DBSCAN | neighborhood size | Distances between nearest points |
| OPTICS | minimum cluster membership | Distances between points |
| Gaussian mixtures | many | Mahalanobis distances to centers |
| BIRCH | branching factor, threshold, optional global clusterer | Euclidean distance between points |
| Bisecting K-Means | number of clusters | Distances between points |
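For contrast with the table, a minimal scikit-learn sketch (the data and settings are illustrative, not from the presentation): a distance-based method such as K-Means only yields centroids after the cluster count is supplied as a parameter, which is exactly the a priori knowledge Correlated Histograms avoids.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data with three groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means must be told n_clusters up front; a wrong guess degrades the result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # centroids, but only given the right count
```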
Correlated Histograms Clustering
1. Create a histogram or density estimate of each dimension (prefer the Harrell-Davis Quantile-Respectful Density Estimate).
2. Use those histograms to compute the modality and the location of each mode (prefer the Lowland Modality technique to identify modes).
3. Pick a mode from some dimension and find the nearest point.
4. Look at the other components of that point.
5. Find the nearest modes to those other components.
6. Repeat steps 3-5 for all modes and dimensions.
[Figure, animated across four slides: histograms of the x-values and y-values whose modes are linked through the data, producing the correlated mode pairs (x1, y1), (x2, y2), (x3, y3).]
* To find the modes we make use of Andrey Akinshin's Lowland Modality technique alongside the Harrell-Davis Quantile-Respectful Density Estimate.
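For orientation, a minimal Python sketch of these steps. It substitutes a classic NumPy histogram and SciPy peak detection for the Harrell-Davis QRDE and Lowland Modality technique the authors prefer, and the helper names (`find_modes`, `correlated_histograms`) are ours, so treat it as an approximation of the idea rather than the authors' implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def find_modes(values, bins=30):
    """Steps 1-2: histogram one dimension and locate its modes.
    Peak detection here stands in for the Lowland Modality technique."""
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks, _ = find_peaks(counts, prominence=0.1 * counts.max())
    return centers[peaks]

def correlated_histograms(X, bins=30):
    """Steps 3-6: correlate per-dimension modes through nearest data points."""
    n_points, n_dims = X.shape
    modes = [find_modes(X[:, j], bins) for j in range(n_dims)]
    centroids = set()
    for j in range(n_dims):
        for m in modes[j]:
            # Step 3: the data point nearest this mode along dimension j.
            point = X[np.argmin(np.abs(X[:, j] - m))]
            # Steps 4-5: snap each component of that point to its nearest mode.
            centroid = tuple(
                float(modes[k][np.argmin(np.abs(modes[k] - point[k]))])
                for k in range(n_dims)
            )
            centroids.add(centroid)
    return sorted(centroids)
```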
Application: Training Syllabus Outcomes
Suppose we have implemented a pilot training syllabus and evaluated trainees on several metrics. Knowing the number of outcomes and the characteristics of those outcomes can be useful. It is likely that one has more than 3 metrics to evaluate, and therefore visualization is difficult.
What is needed:
- Discovery of the number of clusters (how many outcomes).
- The centroids of each cluster (the characteristics of the outcomes).
- The ability to do so with n-dimensional data (more than 3 metrics).
Correlated Histograms does all of these!
Application: Training Syllabus Outcomes
[Figure, shown across two slides: our dataset of training metrics, scattered and histogrammed along each dimension.]
Application: Training Syllabus Outcomes
Identification of the cluster centroids gives us:
- The number of outcomes.
- A datapoint associated with each outcome: (1.654, 1.120), (4.533, 0.441), (8.324, 0.240).
We walk away knowing the number of trainee types our syllabus produces and a way to describe each type!
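As a hypothetical illustration, the earlier `correlated_histograms` sketch run on synthetic two-metric data whose generating centers echo (but do not reproduce) the slide's numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three illustrative outcome groups in two training metrics (synthetic data).
centers = np.array([[1.65, 1.12], [4.53, 0.44], [8.32, 0.24]])
X = np.vstack([c + 0.15 * rng.standard_normal((100, 2)) for c in centers])

# Both the number of outcomes and a representative datapoint for each
# should fall out: roughly one recovered centroid per generating center.
print(correlated_histograms(X))
```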
Application: Robustness Against Noisy Data
Another scenario. Centroids found:
- (-8.638, -5.119)
- (1.845, 0.537)
Low sensitivity: finds 2 centroids amongst the noise!
The same scenario, run with high sensitivity. Centroids found:
- (-8.476, -5.357)
- (-4.506, 0.483)
- (1.847, 0.483)
High sensitivity identifies an additional centroid!
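The slides do not say how the sensitivity setting is implemented. One plausible analogue, shown here as a hypothetical `sensitivity` parameter grafted onto the earlier sketch, is the prominence floor used during mode detection: admitting weaker histogram peaks lets weaker modes, and hence additional centroids, surface.

```python
# Hypothetical sensitivity knob for the find_modes helper from the earlier
# sketch; this is an assumed analogue, not the authors' actual parameter.
import numpy as np
from scipy.signal import find_peaks

def find_modes(values, bins=30, sensitivity=0.5):
    """Higher sensitivity lowers the required peak prominence, so weaker
    modes survive; lower sensitivity keeps only the dominant modes."""
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    floor = (1.0 - sensitivity) * counts.max()  # prominence threshold
    peaks, _ = find_peaks(counts, prominence=floor)
    return centers[peaks]
```

At low sensitivity only the strongest modes per dimension pass the floor, while at high sensitivity a weaker mode can survive, mirroring the additional centroid found above.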
Summary
Correlated Histograms is an unsupervised learning technique with applications anywhere clustering is the task at hand. The differences:
- You get the centroids of clusters rather than a classification of instances.
- Those centroids are derived from the underlying statistics of the data rather than from distances between points.
Key take-away: statistics is largely underutilized as a metric in classical clustering techniques! Correlated Histograms leverages statistics and can lead to great insights into messy, unfamiliar data.
Future Work
- Handling data that is oddly shaped with respect to the orthogonal vectors.
- Swapping out the Harrell-Davis QRDE for other density estimates, classic histograms, or adaptive histograms (also from Andrey Akinshin).
- Other modality-detection techniques.
- Checking agreement between some number of nearest points.
Acknowledgements
Dr. Randy Allen, for mentoring me and for co-authoring this paper.