Hardware Monitoring Evolution at CERN: Lemon vs. Collectd

Hardware monitoring with collectd
Luca Gardi - luca.gardi@cern.ch
Introduction
explain the differences between Lemon and collectd
summarize needed changes for hardware monitoring
explain the choices made during the process
provide a status update
explain current issues and proposed fixes
1 - Lemon and collectd
Lemon
  developed by CERN
  in production since 2006 (at least)
  old monitoring infrastructure has been replaced
  retirement efforts started mid-2017
collectd
  open source project
  collects system and service metrics
  optimized to handle thousands of metrics
  modular and portable with community plugins
  easy to develop new plugins in Python/Java/C/Perl
  continuously improving and well documented
2 - Why collectd?
Pros:
  community-driven and rich ecosystem
  alarms and plugins definitions are puppet-based
  better reusability, documentation
  easier to set up for quick metric collection
  easier metric dispatch in plugins
Cons:
  alarms generated on transition
  existing plugins require re-writing
  MONIT provides a lemon-sensor wrapper but is deprecated
3 - HW monitoring in the Lemon era
Agent sensors:
  lemon-sensor-smart: SMART logs monitoring
  lemon-sensor-tw: 3ware RAID controllers
  lemon-sensor-megaraidsas: LSI MegaRAID controllers
  lemon-sensor-adaptec: Adaptec RAID controllers
  lemon-sensor-sasarray: JBODs monitoring
  lemon-sensor-blockdevice-drives: log parser for SCSI errors
  lemon-sensor-ipmi: IPMI monitoring
On-behalf monitoring (centralized):
  pdu-xmas: centralized out-of-band PDU monitoring (SNMPv2)
4 - Moving towards a collectd era
very specific and complex needs
  heterogeneity of hardware and configurations
  hardware RAID controllers
  intense use of IPMI
no community sensors we could adopt
good news! code can be ported from lemon sensors
adopt TDD (Test-Driven Development)
compatibility with Python 2.4, 2.7, 3.4
continuous integration and delivery (CI/CD) using GitLab
5 - Plugin architecture
6 - The big migration
Collectd plugins:
  collectd-mdstat: in production (new)
  collectd-smart-tests: in production
  collectd-megaraidsas: in QA
  collectd-sasarray: in development
  collectd-blockdevices: in pipeline
  collectd-adaptec: in pipeline
  mcelog: from the community
Centralized monitoring:
  CINNAMON: in production (requires minor changes)
  PODIUM: in development
7 - Plugin development workflow
identify output metrics and write the tests
write the plugin
if tests.color == green: plugin.puppet_deploy()
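The test-first step above can be sketched as follows. This is only an illustration, not the actual collectd-mdstat code: the function name count_failed_members, the sample /proc/mdstat text and the choice of metric (failed members per array) are assumptions made for the example.

```python
import re
import unittest

# Hypothetical sample of /proc/mdstat content used as a test fixture.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      976630336 blocks [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+)\s*:")        # e.g. "md0 : active raid1 ..."
STATUS_RE = re.compile(r"\[([U_]+)\]\s*$")    # e.g. "[UU]" or "[U_]"


def count_failed_members(mdstat_text):
    """Return {array_name: failed_member_count} parsed from mdstat text."""
    failed = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            # Each "_" in the status string marks a missing or failed member.
            failed[current] = status.group(1).count("_")
            current = None
    return failed


class CountFailedMembersTest(unittest.TestCase):
    """Written first: it pins down the metrics the plugin must output."""

    def test_healthy_and_degraded_arrays(self):
        self.assertEqual(count_failed_members(SAMPLE), {"md0": 0, "md1": 1})

    def test_empty_input(self):
        self.assertEqual(count_failed_members(""), {})
```

Saved as a module, the tests can be run with python -m unittest; a CI pipeline would run the same command before any packaging step.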
8 - Plugin deployment workflow
RPM packaging and repositories using Koji
Collectd plugin definition on Puppet
  it-puppet-module-cerncollectd_contrib on GitLab
standard CERN CRM QA -> Production pipeline (1 week)
deployed on physical machines
  it-puppet-module-hardware: physical.pp
9 - Alarms
based on standard collectd Threshold plugin
checks local metrics against defined thresholds
states: OK, WARNING, FAILURE
puppet defined (metricmgr is already read-only)
Service Managers can override thresholds and SNOW targets
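The alarm behaviour described above (three states, notifications on transition, a hits counter) can be modelled roughly as below. This is a simplified sketch, not the Threshold plugin's implementation: the class TransitionAlarm and the function classify are hypothetical names, and the hits handling is an assumption about how consecutive breaches delay a notification.

```python
OK, WARNING, FAILURE = "OK", "WARNING", "FAILURE"


def classify(value, warning_max=None, failure_max=None):
    """Map a metric value to a state using upper bounds only."""
    if failure_max is not None and value > failure_max:
        return FAILURE
    if warning_max is not None and value > warning_max:
        return WARNING
    return OK


class TransitionAlarm:
    """Report a state only when it changes (alarms generated on transition)."""

    def __init__(self, failure_max, warning_max=None, hits=1):
        self.failure_max = failure_max
        self.warning_max = warning_max
        self.hits = hits          # consecutive breaches needed before notifying
        self.state = OK
        self.streak = 0           # current run of non-OK readings

    def feed(self, value):
        """Return the new state if a notification would fire, else None."""
        state = classify(value, self.warning_max, self.failure_max)
        if state == OK:
            self.streak = 0
        else:
            self.streak += 1
            if self.streak < self.hits:
                return None       # not enough consecutive breaches yet
        if state != self.state:
            self.state = state
            return state
        return None
```

With failure_max=0 and hits=2, a single bad reading stays silent, the second consecutive one raises FAILURE, further bad readings stay silent, and the first good reading reports OK again.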
10 - What’s left and current issues
finish porting of the sensors to collectd
start retirement of old lemon sensors
too many tickets:
  fine tuning of the alarms is necessary
  waiting for better SNOW tickets deduplication
tickets are not very descriptive:
  a pull request has been sent to the upstream community
no lemon-host-check
  do we need it?
  is collectdctl enough?
11 - Conclusions
collectd provides a mature environment for HW monitoring
using puppet for alarms definition is definitely a plus for versioning and maintenance, compared to metricmgr
after an initial series of delays, mainly due to our early adoption, we are now more than half-way there and progressing steadily
targeting end of the year for finishing the migration
a good occasion for collaboration with IT-CM-MM
Backup slides - Plugin definition
class cerncollectd_contrib::plugin::mdstat (
  Integer       $interval,
  String        $mdstat_path,
) {

  require ::cerncollectd_contrib

  package { 'collectd-mdstat':
    ensure => present,
  }

  collectd::plugin::python::module { 'collectd_mdstat':
    ensure  => present,
    config  => [{
      'INTERVAL'    => $interval,
      'MDSTAT_PATH' => $mdstat_path,
    }],
    require => Package['collectd-mdstat'],
  }
}
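On the Python side, a module configured this way might consume the INTERVAL and MDSTAT_PATH keys roughly as sketched below. This is not the real collectd_mdstat module: the helper names, the default values and the metric (number of degraded arrays) are assumptions, and the disk_error type is borrowed from the alarm definition in these slides. The collectd module is only importable inside the collectd daemon, so the sketch guards the import to stay testable elsewhere.

```python
import re

try:
    import collectd  # provided by collectd's embedded Python interpreter
except ImportError:  # running outside collectd, e.g. in unit tests
    collectd = None

# Assumed defaults; the real values come from the Puppet class parameters.
DEFAULTS = {"INTERVAL": 60, "MDSTAT_PATH": "/proc/mdstat"}
DEGRADED_RE = re.compile(r"\[U*_[U_]*\]")  # status such as [U_] or [_U]


def parse_config(config, defaults=DEFAULTS):
    """Merge the children of the module's config block into the defaults."""
    settings = dict(defaults)
    for node in config.children:
        if node.key in settings and node.values:
            settings[node.key] = node.values[0]
    settings["INTERVAL"] = int(settings["INTERVAL"])  # numbers arrive as floats
    return settings


def count_degraded(mdstat_text):
    """Count arrays whose status string shows at least one missing member."""
    return len(DEGRADED_RE.findall(mdstat_text))


def make_read_callback(settings):
    """Build the read callback that collectd calls once per interval."""
    def read_callback():
        with open(settings["MDSTAT_PATH"]) as handle:
            degraded = count_degraded(handle.read())
        metric = collectd.Values(plugin="mdstat", type="disk_error")
        metric.dispatch(values=[degraded])
    return read_callback


def configure(config):
    """Config callback: register the read callback with the chosen interval."""
    settings = parse_config(config)
    collectd.register_read(make_read_callback(settings), settings["INTERVAL"])


if collectd is not None:
    collectd.register_config(configure)
```

Keeping parse_config and count_degraded free of collectd calls is what lets them be unit-tested outside the daemon.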
Backup slides - Alarm definition
class cerncollectd_contrib::alarm::mdstat_wrong (
  Integer $failure_max,
  Integer $hits,
  Boolean $persist,
  Boolean $interesting,
  Optional[Hash] $custom_targets,
  Optional[String] $actuator,
) {
  ::cerncollectd::alarms::threshold::plugin {'mdstat_wrong':
    plugin      => 'mdstat',
    type        => 'disk_error',
    failure_max => $failure_max,
    hits        => $hits,
    persist     => $persist,
    interesting => $interesting,
  }

  ::cerncollectd::alarms::extra {'mdstat_wrong':
    ctd_namespace => 'mdstat',
    targets       => $custom_targets,
    actuator      => $actuator,
  }
}
Backup slides - Plugin deployment
if (versioncmp($::operatingsystemmajrelease,'6') >= 0) or (versioncmp($::operatingsystemmajrelease,'7') >= 0){

  # Software RAID failures (see target in YAML file data)
  include ::cerncollectd_contrib::alarm::mdstat_wrong
  include ::cerncollectd_contrib::plugin::mdstat

  # SMART attributes failures
  include ::cerncollectd_contrib::alarm::smart_wrong
  include ::cerncollectd_contrib::plugin::smart_tests

  # MegaRAID failures
  include ::cerncollectd_contrib::alarm::megaraidsas::bbu_status_wrong
  include ::cerncollectd_contrib::alarm::megaraidsas::controller_status_wrong
  include ::cerncollectd_contrib::alarm::megaraidsas::controller_correctable_errors
  include ::cerncollectd_contrib::alarm::megaraidsas::controller_uncorrectable_errors
  include ::cerncollectd_contrib::alarm::megaraidsas::cache_policy_on_faulty_bbu_wrong
  include ::cerncollectd_contrib::alarm::megaraidsas::cache_policy_on_raid_array_wrong
  include ::cerncollectd_contrib::alarm::megaraidsas::raid_array_status_wrong
  include ::cerncollectd_contrib::alarm::megaraidsas::missing_drives
  include ::cerncollectd_contrib::alarm::megaraidsas::unconfigured_good_drives
  include ::cerncollectd_contrib::alarm::megaraidsas::unconfigured_bad_drives
  include ::cerncollectd_contrib::alarm::megaraidsas::offline_drives

  if (versioncmp($::operatingsystemmajrelease,'6') >= 0){
    class {'::cerncollectd_contrib::plugin::megaraidsas' :
      lsmod_path => '/sbin/lsmod',
    }
  } else {
    include ::cerncollectd_contrib::plugin::megaraidsas
  }
}
Backup slides - collectdctl output
collectd namespace:
<hostname>/<plugin>-<plugin_instance>/<type>-<type_instance>
listing values:
[root@lxfsrd08c04 ~]# collectdctl listval
lxfsrd08c04.cern.ch/megaraidsas-bbu_status/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_cache_policy_on_faulty_bbu/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_cache_policy_wrong_on_raid_array/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_memory_correctable_errors/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_memory_uncorrectable_errors/count-c0
lxfsrd08c04.cern.ch/megaraidsas-controller_status/count-c0
lxfsrd08c04.cern.ch/megaraidsas-missing_drives/count
lxfsrd08c04.cern.ch/megaraidsas-offline_drives/count-c0
lxfsrd08c04.cern.ch/megaraidsas-raid_array_status/count-c0_vd0
lxfsrd08c04.cern.ch/megaraidsas-raid_array_status/count-c0_vd1
lxfsrd08c04.cern.ch/megaraidsas-unconfigured_bad_drives/count-c0
lxfsrd08c04.cern.ch/megaraidsas-unconfigured_good_drives/count-c0
getting values:
[root@lxfsrd08c04 ~]# collectdctl getval lxfsrd08c04.cern.ch/megaraidsas-bbu_status/count-c0
           value=0.000000e+00
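The identifier layout shown on this slide can be split back into its parts with a few lines of Python. The sketch below is illustrative only: the names Identifier and parse_identifier are made up for the example, and it assumes the plugin and type names themselves contain no hyphen, as is the case for the megaraidsas values listed above.

```python
from collections import namedtuple

Identifier = namedtuple(
    "Identifier", "host plugin plugin_instance type type_instance"
)


def _split_instance(part):
    """Split 'name-instance' at the first hyphen; the instance is optional."""
    name, _, instance = part.partition("-")
    return name, instance or None


def parse_identifier(text):
    """Parse <hostname>/<plugin>-<plugin_instance>/<type>-<type_instance>."""
    host, plugin_part, type_part = text.strip().split("/")
    plugin, plugin_instance = _split_instance(plugin_part)
    type_name, type_instance = _split_instance(type_part)
    return Identifier(host, plugin, plugin_instance, type_name, type_instance)
```

For example, parse_identifier("lxfsrd08c04.cern.ch/megaraidsas-missing_drives/count") yields a plugin_instance of "missing_drives" and a type_instance of None, since that value carries no controller suffix.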
Backup slides - GitLab CI/CD

Comparison between Lemon and Collectd for hardware monitoring at CERN, detailing the differences, necessary changes, choices made, status update, current issues, and proposed fixes in transitioning from Lemon to Collectd. Collectd's advantages, drawbacks, and the adaptation process are discussed, highlighting the complex hardware needs, test-driven development approach, and continuous integration/development using GitLab.

  • Hardware Monitoring
  • CERN
  • Lemon
  • Collectd
  • Evolution

Uploaded on Sep 08, 2024



Presentation Transcript


  1. CF Computing Facilities Hardware monitoring with collectd Luca Gardi - luca.gardi@cern.ch CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it

