Efficient Anomaly Detection for Batch Systems Using Machine Learning

 
Batch Anomaly Detection
 
Lightning Talks Session 2
 
15 / 09 / 2022
 
Eya ABID
 
Supervisors: Martin ADAM & Jaroslava Schovancova
 
Topic

Problem

HTCondor batch system monitoring data is too “big” to be monitored for anomalies the traditional way.

Solution

Combine collectd metrics with HTCondor (HTC) job data to explore the options for anomaly detection.
 
Overview

Progress

MONIT’s raw historical data was challenging to work with.
The data is now collectable, clean, and consumable.
Some anomaly detection techniques have been applied to the metrics data.
 
Steps

Data collection and manipulation

PySpark within the SWAN services

Data mining

ML techniques are used to get a better view and understanding of the data.
 
Job Data

The procedure to create time series from JobEvent data failed.
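
For readers unfamiliar with the idea, the following is a minimal sketch of what turning job events into a time series can look like: counting events of one type per fixed-width time bucket. It is not the procedure attempted in the talk (which is not described), and the event tuples, the `to_time_series` helper, and the 5-minute bucket width are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical job events: (ISO timestamp, event type). Real JobEvent
# records have a different, richer schema; these fields are assumptions.
events = [
    ("2022-09-15T10:00:05", "submit"),
    ("2022-09-15T10:00:40", "execute"),
    ("2022-09-15T10:03:10", "submit"),
    ("2022-09-15T10:03:30", "terminate"),
    ("2022-09-15T10:11:00", "submit"),
]


def to_time_series(events, event_type, bucket=timedelta(minutes=5)):
    """Count events of one type per fixed-width bucket, filling gaps with 0."""
    stamps = [datetime.fromisoformat(ts) for ts, kind in events if kind == event_type]
    if not stamps:
        return []
    epoch = datetime(1970, 1, 1)
    # Integer division of two timedeltas gives the index of the bucket.
    counts = Counter((t - epoch) // bucket for t in stamps)
    first, last = min(counts), max(counts)
    # Emit every bucket between the first and last one, including empty ones.
    return [(epoch + i * bucket, counts.get(i, 0)) for i in range(first, last + 1)]


series = to_time_series(events, "submit")
for start, count in series:
    print(start.isoformat(), count)
```

Filling empty buckets with zero keeps the series regularly spaced, which most time-series and anomaly detection methods expect.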
 
HW Data

Network, CPU, memory, and other metrics usage across the different collectd plugins for each hostgroup.

We tried using the ADMON Python API, but it restricted our data manipulation process, so we went back to manual processing.
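
As a rough illustration of the kind of manual processing this implies, the sketch below reshapes long-format metric records into one feature row per hostgroup and timestamp using pandas. The column names, the hostgroup name `batch/a`, and the values are invented for the example; the real collectd/MONIT schema is not shown in the talk.

```python
import pandas as pd

# Hypothetical long-format records (one row per metric reading).
# Column names and values are assumptions, not the MONIT schema.
records = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2022-09-15 10:00"] * 3 + ["2022-09-15 10:05"] * 3
        ),
        "hostgroup": ["batch/a"] * 6,
        "plugin": ["cpu", "memory", "network"] * 2,
        "value": [0.71, 0.52, 120.0, 0.93, 0.55, 340.0],
    }
)

# One row per (hostgroup, timestamp), one column per plugin.
# Duplicate readings in the same cell are averaged.
features = records.pivot_table(
    index=["hostgroup", "timestamp"],
    columns="plugin",
    values="value",
    aggfunc="mean",
).sort_index()

print(features)
```

A wide table like this is the usual input shape for the detectors discussed later, with one column per metric.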
 
Data Samples
s
 
Algorithms and results

IsolationForest: predicted 17 anomalies in one day for a single hostgroup.

OneClass SVM: predicted more than 30 anomalies in one day for a single hostgroup.
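
The sketch below shows how these two detectors can be run side by side with scikit-learn. It uses synthetic data (288 rows, standing in for one day of 5-minute samples) rather than the talk's data, and the `contamination` and `nu` values are assumptions; the settings actually used in the talk are not given.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for three metrics of one hostgroup: 283 ordinary
# samples followed by 5 injected spikes. Not the talk's data.
normal = rng.normal(loc=[0.5, 0.4, 100.0], scale=[0.05, 0.05, 10.0], size=(283, 3))
spikes = rng.normal(loc=[0.95, 0.9, 900.0], scale=[0.02, 0.02, 50.0], size=(5, 3))
X = StandardScaler().fit_transform(np.vstack([normal, spikes]))

# contamination and nu are assumed values; both control roughly what
# fraction of the samples ends up flagged.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
svm = OneClassSVM(nu=0.05, gamma="scale").fit(X)

# Both estimators label outliers as -1 and inliers as +1.
iso_flags = iso.predict(X) == -1
svm_flags = svm.predict(X) == -1

print("IsolationForest flagged:", int(iso_flags.sum()))
print("OneClass SVM flagged:   ", int(svm_flags.sum()))
```

Because `contamination` and `nu` largely determine how many points are flagged, raw anomaly counts from two detectors are hard to compare without labeled ground truth.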
 
Results
 
Algorithms and results

K-means + PCA: predicted 1 anomaly in one day for a single hostgroup.
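
The talk does not say how K-means and PCA were combined. One common pattern, sketched here as an assumption, is to project the metrics with PCA, cluster the projection with K-means, and flag points that are either far from their nearest centroid or sit in a very small cluster. The data is synthetic, and the cluster count, the 1% cluster-size cutoff, and the 4-sigma distance threshold are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: 287 ordinary samples of 6 metrics plus one extreme row.
X = np.vstack([rng.normal(size=(287, 6)), np.full((1, 6), 8.0)])

# Standardize, then keep the two leading principal components.
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)

# Distance from each point to its nearest centroid.
dist = km.transform(Z).min(axis=1)
far = dist > dist.mean() + 4 * dist.std()

# An extreme point can capture a centroid of its own (distance 0), so
# members of very small clusters are treated as anomalous as well.
sizes = np.bincount(km.labels_, minlength=km.n_clusters)
small = sizes[km.labels_] < 0.01 * len(Z)

flags = far | small
print("flagged rows:", np.flatnonzero(flags))
```

The small-cluster check matters because plain distance-to-centroid scoring misses an outlier that K-means has assigned its own cluster.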
 
Algorithms and results

Autoencoders - TensorFlow

Since we don’t have solid labeling for our data, we cannot establish a good anomaly detection judgment.
Another problem was the uniformity of the data (overfitting).
Still, the model achieved good encoding-decoding accuracy.
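
To make the reconstruction-error idea concrete without labels, here is a small sketch. It deliberately swaps in scikit-learn's `MLPRegressor`, trained to reproduce its own input, as a lightweight stand-in for a TensorFlow autoencoder; it is not the model from the talk. The data is synthetic, and the layer sizes, the 99th-percentile threshold, and the `reconstruction_error` helper are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in: 6 metrics driven by 2 hidden factors plus noise.
latent = rng.normal(size=(600, 2))
raw = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(600, 6))

# Fit the scaler on the training part only, then split.
scaler = StandardScaler().fit(raw[:500])
train, held_out = scaler.transform(raw[:500]), scaler.transform(raw[500:])

# Stand-in autoencoder: the 2-unit middle layer is the bottleneck and
# the input doubles as the regression target.
ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                  max_iter=2000, random_state=0).fit(train, train)


def reconstruction_error(model, rows):
    """Mean squared reconstruction error per row."""
    return np.mean((model.predict(rows) - rows) ** 2, axis=1)


train_err = reconstruction_error(ae, train)
held_err = reconstruction_error(ae, held_out)

# Without labels, a percentile of the training error serves as threshold.
threshold = np.quantile(train_err, 0.99)

# A row that breaks the correlation structure the model has learned.
odd_row = held_out[:1] + np.array([[3.0, -3.0, 3.0, -3.0, 3.0, -3.0]])
odd_err = reconstruction_error(ae, odd_row)[0]

print("mean train error:   ", train_err.mean())
print("mean held-out error:", held_err.mean())
print("odd row error:", odd_err, "threshold:", threshold)
```

Comparing the mean training error with the mean held-out error is a label-free check for overfitting: a held-out error much larger than the training error suggests the model has memorized its training data.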
 
Next steps
 
Get a solid ground truth for the job data
Merge the datasets for better anomaly results
Fine-tune the algorithms
 
QUESTIONS?

eyaabid@insat.u-carthage.tn

LinkedIn: https://www.linkedin.com/in/eya-abid/

Explore a lightning talk session focusing on using Collectd metrics and job data in HTCondor batch systems for anomaly detection. Challenges with raw historical data are addressed through data collection, manipulation, and application of anomaly detection techniques using ML. Various algorithms such as Isolation Forest, OneClass SVM, K-means, and PCA are employed to predict anomalies in hostgroups effectively.

  • Anomaly Detection
  • Batch Systems
  • Machine Learning
  • HTCondor
  • Data Mining

Uploaded on Aug 12, 2024 | 0 Views



