Efficient Anomaly Detection for Batch Systems Using Machine Learning
A lightning talk on using Collectd metrics and job data from HTCondor batch systems for anomaly detection. Challenges with raw historical data are addressed through data collection and manipulation, followed by ML-based anomaly detection. Isolation Forest, One-Class SVM, K-means with PCA, and autoencoders are applied to flag anomalies per hostgroup.
Presentation Transcript
Batch Anomaly Detection. Lightning Talks Session 2. Eya ABID. Supervisors: Martin ADAM & Jaroslava Schovancova. 15/09/2022
Topic. Problem: HTCondor batch system monitoring data is too large to be inspected for anomalies the traditional way. Solution: use Collectd metrics plus HTCondor job data to explore the options for anomaly detection.
Overview. Progress: MONIT's raw historical data was a challenge to deal with. Data is collectable, clean, and consumable. Some anomaly detection techniques have been applied to the metrics data.
Steps. Data collection and manipulation: PySpark within the SWAN services. Data mining: ML techniques are used to get a better view and understanding of the data.
Job Data. Procedure to create time-series from JobEvent data failed.
HW Data. Network, CPU, memory, and various other usage metrics across the different collectd plugins for each hostgroup. We tried using the ADMON Python API, but it restricted our data manipulation process, so we went back to manual processing.
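The manual processing described on this slide can be pictured as turning long-format metric records into one feature row per hostgroup and time bin. The sketch below is only an illustration: it uses pandas in place of the PySpark setup mentioned in the talk, and the column names (`hostgroup`, `timestamp`, `metric`, `value`), the hostgroup and metric labels, and the 5-minute bin are all assumptions, not the actual MONIT schema.

```python
import pandas as pd


def build_feature_matrix(records: pd.DataFrame, freq: str = "5min") -> pd.DataFrame:
    """Pivot long-format metric records into one row per (hostgroup, time bin).

    The column names used here are hypothetical, not the MONIT schema.
    """
    records = records.copy()
    # Align every sample to the start of its time bin.
    records["timestamp"] = pd.to_datetime(records["timestamp"]).dt.floor(freq)
    # One column per metric; samples in the same bin are averaged.
    wide = records.pivot_table(
        index=["hostgroup", "timestamp"],
        columns="metric",
        values="value",
        aggfunc="mean",
    )
    return wide.sort_index()


# Tiny made-up sample: two hostgroups, two metrics.
sample = pd.DataFrame(
    {
        "hostgroup": ["hg_a", "hg_a", "hg_a", "hg_a", "hg_b", "hg_b"],
        "timestamp": [
            "2022-09-01 00:00:10",
            "2022-09-01 00:01:10",
            "2022-09-01 00:00:20",
            "2022-09-01 00:06:00",
            "2022-09-01 00:00:30",
            "2022-09-01 00:00:40",
        ],
        "metric": ["cpu_user", "cpu_user", "memory_used", "cpu_user", "cpu_user", "memory_used"],
        "value": [10.0, 20.0, 50.0, 30.0, 5.0, 40.0],
    }
)

features = build_feature_matrix(sample)
print(features)
```

Bins in which a metric was not reported come out as NaN, so a real pipeline would still need to decide how to fill or drop them before feeding the matrix to a detector.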
Data Samples
Algorithms and results. IsolationForest predicted 17 anomalies in one day for a single hostgroup. One-Class SVM predicted more than 30 anomalies in one day for a single hostgroup.
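As a rough illustration of how these two detectors can be run on the same data, here is a minimal sketch using scikit-learn's `IsolationForest` and `OneClassSVM`. The data is synthetic (one day of per-minute samples with three made-up metrics and 15 injected outliers), and the `contamination` and `nu` values are arbitrary assumptions; none of it reproduces the talk's data or settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for one day of per-minute metrics of one hostgroup
# (think CPU, memory, network); this is NOT the talk's data.
scales = np.array([5.0, 4.0, 10.0])
X = rng.normal(loc=[50.0, 60.0, 100.0], scale=scales, size=(1440, 3))

# Inject 15 outliers, each shifted by 8 standard deviations per metric
# in a random direction.
injected_idx = rng.choice(1440, size=15, replace=False)
signs = rng.choice([-1.0, 1.0], size=(15, 3))
X[injected_idx] += signs * 8.0 * scales

X_scaled = StandardScaler().fit_transform(X)

# Both estimators label inliers as 1 and outliers as -1.
iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
iso_labels = iso.fit_predict(X_scaled)

svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.02)
svm_labels = svm.fit_predict(X_scaled)

iso_flagged = np.flatnonzero(iso_labels == -1)
svm_flagged = np.flatnonzero(svm_labels == -1)

print("IsolationForest flagged:", len(iso_flagged))
print("One-Class SVM flagged:  ", len(svm_flagged))
print("Flagged by both:        ", len(np.intersect1d(iso_flagged, svm_flagged)))
```

The number of points each model flags depends heavily on `contamination` (which sets the score threshold for `IsolationForest`) and `nu` (an upper bound on the fraction of training errors for `OneClassSVM`), so differing anomaly counts between the two are expected unless these parameters are tuned against labeled data.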
Results
Algorithms and results. K-means + PCA predicted 1 anomaly in one day for a single hostgroup.
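The slide does not say how K-means and PCA were combined. One common pattern, sketched below as an assumption, is to project the scaled metrics onto a few principal components, cluster them with K-means, and flag the points that lie unusually far from their cluster centroid. The data is synthetic, and the number of components, the number of clusters, and the percentile cutoff are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for 1000 samples of 6 metrics; NOT the talk's data.
X = rng.normal(size=(1000, 6))
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 principal components before clustering.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)

# Distance from each point to the centroid of its assigned cluster.
centroids = kmeans.cluster_centers_[kmeans.labels_]
dist = np.linalg.norm(X_2d - centroids, axis=1)

# Flag points whose distance exceeds a high percentile (arbitrary cutoff).
threshold = np.percentile(dist, 99.9)
flagged = np.flatnonzero(dist > threshold)

print("Flagged sample indices:", flagged)
```

With a cutoff this strict, only the most extreme points are reported, which is one way such a pipeline can end up far more conservative than Isolation Forest or One-Class SVM on the same data.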
Algorithms and results. Autoencoders (TensorFlow): Since we don't have solid labels for our data, we cannot establish a good anomaly detection judgment. Another problem was the uniformity of the data (overfitting). Still, the model achieved good encoding-decoding accuracy.
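Autoencoders are commonly used for anomaly detection by scoring each sample with its reconstruction error and flagging the samples with the highest errors. The sketch below illustrates that idea only. To stay lightweight it uses scikit-learn's `MLPRegressor` with a 2-unit hidden layer as a stand-in for the TensorFlow model from the talk, and the synthetic data, bottleneck size, and percentile threshold are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic "normal" data: 6 correlated metrics driven by 2 latent factors.
latent = rng.normal(size=(2000, 2))
mixing = rng.normal(size=(2, 6))
normal = latent @ mixing + 0.1 * rng.normal(size=(2000, 6))

# Synthetic anomalies: large values that ignore the correlation structure.
anomalies = 5.0 * rng.normal(size=(20, 6))

scaler = StandardScaler().fit(normal)
normal_s = scaler.transform(normal)
anomalies_s = scaler.transform(anomalies)

# A network trained to reproduce its own input through a 2-unit bottleneck
# (a stand-in for the TensorFlow autoencoder, not the talk's model).
autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,), activation="tanh", max_iter=1000, random_state=0
)
autoencoder.fit(normal_s, normal_s)


def reconstruction_error(model, data):
    """Mean squared reconstruction error per sample."""
    return np.mean((data - model.predict(data)) ** 2, axis=1)


normal_err = reconstruction_error(autoencoder, normal_s)
anomaly_err = reconstruction_error(autoencoder, anomalies_s)

# Without labels the threshold is a guess; here, the 99th percentile of
# the errors on the training data.
threshold = np.percentile(normal_err, 99)
print("Mean error on normal data:", normal_err.mean())
print("Mean error on anomalies:  ", anomaly_err.mean())
print("Anomalies above threshold:", int((anomaly_err > threshold).sum()), "of", len(anomaly_err))
```

The threshold is where the labeling problem from the slide shows up: good reconstruction accuracy says the model encodes the data well, but without ground truth there is no way to check whether the samples above the cutoff are real anomalies.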
Next steps. Get a solid ground truth for the job data. Merge the datasets for better anomaly results. Fine-tune the algorithms.
Questions? Contact: eyaabid@insat.u-carthage.tn LinkedIn: https://www.linkedin.com/in/eya-abid/