Modernizing Monitoring Systems with NagMQ and ZeroMQ

Extensible Monitoring with Nagios and Messaging Middleware
 
LISA 2012
Jonathan Reams <jreams@columbia.edu>
 
Symon Says Nagios Project
 
Replace 12-year-old homegrown monitoring system
Very customized
Very engineered
Very unsupported
~17,000 checks
Mandate to move to Nagios
 
False Start
 
1. Installed Nagios
2. Ported checks from old system to new
3. Went out for coffee
4. Problems
   a. High check latency
   b. High load
 
 
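Stock Nagios
(Architecture diagram; labels: Nagios Host, Nagios Process, Nagios Reapers, Check Results, Status Data File, Check Processes, CGIs, Sysadmin)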
 
Nagios Problems
 
Trapped on one host:
Check results
Status data
Configuration data
Nagios isn’t a great executor
Forks 2 processes per check
Everything is basically synchronous – async achieved with multiple processes
Data format is simple but non-standard
 
Nagios Problems
 
Implementation is all in C – hard to customize
Can be I/O bound by reading/writing check result files
Cannot query data from status file/configuration without reading/parsing all of it
Input via FIFO gives no feedback and has a limited buffer size
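
To make the FIFO limitation concrete, here is a minimal sketch of what a stock Nagios external command looks like before it is written to the command pipe. The `[timestamp] COMMAND;arg;arg` line syntax and the `ACKNOWLEDGE_SVC_PROBLEM` argument order follow the Nagios external command documentation; the helper name `format_external_command` is hypothetical, and the host, service, author, and comment values are borrowed from later slides.

```python
import time


def format_external_command(name, args, timestamp=None):
    """Format one line in Nagios external command file syntax.

    The result looks like: [epoch] COMMAND_NAME;arg1;arg2;...
    """
    if timestamp is None:
        timestamp = int(time.time())
    fields = ";".join(str(arg) for arg in args)
    return f"[{timestamp}] {name};{fields}\n"


# Arguments: host, service, sticky, notify, persistent, author, comment.
line = format_external_command(
    "ACKNOWLEDGE_SVC_PROBLEM",
    ["localhost", "rotate-unix", 1, 0, 0, "jreams", "Stop alerting me!!"],
    timestamp=1355074576,
)

# Writing `line` to the command FIFO (its path depends on the installation)
# is a one-way operation: the writer never learns whether Nagios accepted it.
print(line, end="")
```

The only thing the caller gets back is whether the write itself succeeded, which is the "no feedback" problem noted above.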
 
Nagios Problems

Communication is hard!
 
My Solution

NagMQ
A ZeroMQ-based API for Nagios
 
Background on ZeroMQ
 
Broker-less messaging kernel in a single library
Emulates Berkeley socket API
Supports IPC/TCP/Multicast transports
Fanout, pub/sub, pipeline, and request/reply messaging patterns
All I/O is asynchronous after connections are established, with dedicated I/O threads
Bindings available for a large number of operating systems and languages
Agnostic of data being sent – no defined data format
 
NagMQ
 
Event Publisher & Commands
 
Host check result from publisher
host_check_processed localhost
{ "host_name": "localhost", "check_type": 0, "check_options": 0,
"scheduled_check": 1, "reschedule_check": 1, "current_attempt": 1,
"max_attempts": 1, "state": 0, "last_state": 0, "last_hard_state": 0,
"last_check": 1354996955, "last_state_change": 1337098090, "latency":
1.63600, "timeout": 60, "type": "host_check_processed", "start_time": {
"tv_sec": 1354996955, "tv_usec": 636453 }, "end_time": { "tv_sec":
1354996964, "tv_usec": 161965 }, "early_timeout": 0, "execution_time":
0.07324, "return_code": 0, "output": "Host up", "long_output": null,
"perf_data": null, "timestamp": { "tv_sec": 1354996964, "tv_usec": 161966 }
}
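
A minimal sketch of how a subscriber might handle the published host check result above, using only the Python standard library. It assumes the envelope line (`host_check_processed localhost`) and the JSON payload arrive as two separate strings; that framing, and the helper names `parse_event` and `timeval_to_seconds`, are assumptions for illustration rather than a description of NagMQ's wire format. The payload is abbreviated to a few of the fields shown above.

```python
import json


def parse_event(envelope, payload):
    """Split the envelope into event type and subject, and decode the payload."""
    event_type, _, subject = envelope.partition(" ")
    return event_type, subject, json.loads(payload)


def timeval_to_seconds(tv):
    """Convert a {"tv_sec": ..., "tv_usec": ...} object to float seconds."""
    return tv["tv_sec"] + tv.get("tv_usec", 0) / 1_000_000


envelope = "host_check_processed localhost"
payload = """{"host_name": "localhost", "state": 0, "output": "Host up",
  "start_time": {"tv_sec": 1354996955, "tv_usec": 636453},
  "end_time": {"tv_sec": 1354996964, "tv_usec": 161965}}"""

event_type, subject, data = parse_event(envelope, payload)

# Wall-clock time between the start_time and end_time stamps.
elapsed = timeval_to_seconds(data["end_time"]) - timeval_to_seconds(data["start_time"])
print(event_type, subject, data["output"], round(elapsed, 3))
```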
Command to add an acknowledgement to service problem
{'comment_data': 'Stop alerting me!!', 'notify_contacts': False,
'author_name': 'jreams', 'persistent_comment': False, 'host_name':
'localhost', 'service_description': 'rotate-unix', 'time_stamp': {'tv_sec':
1355074576}, 'type': 'acknowledgement'}
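
The acknowledgement command above is shown as a Python dict; a hedged sketch of building it and serializing it to JSON follows. The field names are taken from the slide, while the helper name `build_acknowledgement` is hypothetical, and sending the resulting string to NagMQ's command interface is left out.

```python
import json
import time


def build_acknowledgement(host_name, service_description, author_name,
                          comment_data, notify_contacts=False,
                          persistent_comment=False, now=None):
    """Build an acknowledgement command using the field names from the slide."""
    tv_sec = int(time.time() if now is None else now)
    return {
        "type": "acknowledgement",
        "host_name": host_name,
        "service_description": service_description,
        "author_name": author_name,
        "comment_data": comment_data,
        "notify_contacts": notify_contacts,
        "persistent_comment": persistent_comment,
        "time_stamp": {"tv_sec": tv_sec},
    }


command = build_acknowledgement(
    "localhost", "rotate-unix", "jreams", "Stop alerting me!!", now=1355074576
)

# JSON text that a client would hand to its messaging socket.
wire = json.dumps(command)
print(wire)
```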
 
State Data
 
Request
{'keys': ['host_name', 'services', 'hosts', 'service_description',
'current_state', 'members', 'type', 'name',
'problem_has_been_acknowledged', 'plugin_output', 'checks_enabled',
'notifications_enabled', 'event_handler_enabled'], 'include_services':
True, 'host_name': 'localhost'}
Response
[{'checks_enabled': True, 'notifications_enabled': True, 'current_state':
0, 'plugin_output': 'Host up', 'problem_has_been_acknowledged': 0,
'event_handler_enabled': True, 'host_name': 'localhost', 'services':
['rotate-unix'], 'type': 'host'}, {'checks_enabled': False,
'notifications_enabled': True, 'current_state': 1, 'plugin_output': 'You
are now on call', 'problem_has_been_acknowledged': False,
'event_handler_enabled': True, 'host_name': 'localhost',
'service_description': 'rotate-unix', 'type': 'service'}]
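
As a sketch of what a client can do with this request/response pair, the snippet below builds a request of the same shape and picks out unacknowledged problems from the response shown above. The helper names `build_state_request` and `unacknowledged_problems` are hypothetical, and the transport (sending the request and receiving the reply) is omitted.

```python
def build_state_request(host_name, keys, include_services=True):
    """Build a state request asking only for the listed keys."""
    return {
        "host_name": host_name,
        "include_services": include_services,
        "keys": list(keys),
    }


def unacknowledged_problems(objects):
    """Return objects in a non-zero state that nobody has acknowledged."""
    return [
        obj for obj in objects
        if obj.get("current_state", 0) != 0
        and not obj.get("problem_has_been_acknowledged")
    ]


request = build_state_request(
    "localhost",
    ["host_name", "service_description", "current_state",
     "problem_has_been_acknowledged", "plugin_output", "type"],
)

# Response copied from the slide (trimmed to the requested keys).
response = [
    {"current_state": 0, "plugin_output": "Host up",
     "problem_has_been_acknowledged": 0, "host_name": "localhost",
     "type": "host"},
    {"current_state": 1, "plugin_output": "You are now on call",
     "problem_has_been_acknowledged": False, "host_name": "localhost",
     "service_description": "rotate-unix", "type": "service"},
]

for problem in unacknowledged_problems(response):
    print(problem["host_name"], problem.get("service_description"),
          problem["plugin_output"])
```

Because the client names the keys it wants, it does not need to read or parse the full status file to answer a narrow question.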
 
Some examples
 
Distributed check execution (mqexec)
Custom user interfaces (nag.py, etc)
High availability (haagent.py, halib.py)
 
mqexec
 
mqexec
 
Asynchronous command executor
Subscribes to host_check_initiate, service_check_initiate, and event_handler_start messages, and executes command line specified
Can filter which commands to execute based on any attribute in message
Receives messages as
Fair-queued worker pool (pull from MQ broker)
Individual worker (subscribe directly to NagMQ)
Sends results back to command interface of NagMQ
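
The worker loop described above can be sketched in a few lines of Python. This is not mqexec itself (which is not shown in these slides): the `command_line` field name, the shape of the result dict, and the function names are assumptions, and receiving messages from and returning results to NagMQ are omitted.

```python
import re
import shlex
import subprocess
import sys


def matches_filters(message, filters):
    """True if every filtered attribute exists and matches its regex."""
    return all(
        key in message and re.search(pattern, str(message[key])) is not None
        for key, pattern in filters.items()
    )


def execute_check(message, timeout=60):
    """Run the command line from a check message and collect the result."""
    argv = shlex.split(message["command_line"])
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
        return_code, output = done.returncode, done.stdout.strip()
    except (subprocess.TimeoutExpired, OSError) as exc:
        # 3 is UNKNOWN in the Nagios plugin return code convention.
        return_code, output = 3, f"check could not complete: {exc}"
    result = {"host_name": message["host_name"],
              "return_code": return_code, "output": output}
    if "service_description" in message:
        result["service_description"] = message["service_description"]
    return result


# Hypothetical message; the command simply prints OK with this interpreter.
message = {
    "type": "service_check_initiate",
    "host_name": "localhost",
    "service_description": "rotate-unix",
    "command_line": f"{shlex.quote(sys.executable)} -c \"print('OK')\"",
}

result = None
if matches_filters(message, {"host_name": r"^local"}):
    result = execute_check(message)
print(result)
```

Filtering on message attributes is what lets different workers take different subsets of checks, for example by host name pattern.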
 
Performance: Stock Nagios
(Chart: check latency in seconds, 0 to 18, over 20 minutes; series: Max Host, Avg Host, Max Svc, Avg Svc)
 
Performance: NagMQ/mqexec
(Chart: check latency in seconds, 0 to 18, over 20 minutes; series: Max Host, Avg Host, Max Svc, Avg Svc)
 
User Interfaces
 
Command-line
$ nag.py -c 'Stop alerting me!!' add ack localhost
[localhost]: No problem found
[uptime@localhost]: Acknowledgement added
Python/JavaScript/Twitter Bootstrap web interface using NagMQ (see demo)
Interface to Twitter
 
High Availability - Stock Nagios
 
High Availability - NagMQ
 
High Availability - NagMQ
 
Use regular program_status to provide heartbeat
Retrieve active state from state interface to bring passive node into sync with active node on startup
Subscribe to and send check result messages, acknowledgements, downtimes, and adaptive changes to command interface
Passive host’s mqexec(s) run checks for whatever host is active
Use VIFs owned by the message broker to direct traffic to active host
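
The heartbeat idea in the first bullet can be sketched as a small timeout tracker on the passive node. How haagent.py actually decides when to take over is not described in the slides, so the class below, its name `HeartbeatMonitor`, and the 30-second timeout are purely illustrative assumptions.

```python
import time


class HeartbeatMonitor:
    """Track when the last heartbeat message from the active node arrived."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = None

    def record_heartbeat(self):
        """Call this whenever a program_status message is received."""
        self.last_seen = self.clock()

    def peer_is_alive(self):
        """True if a heartbeat arrived within the timeout window."""
        if self.last_seen is None:
            return False
        return self.clock() - self.last_seen <= self.timeout_seconds


# A fake clock keeps the example deterministic.
fake_now = [100.0]
monitor = HeartbeatMonitor(timeout_seconds=30, clock=lambda: fake_now[0])

monitor.record_heartbeat()      # heartbeat received at t=100
fake_now[0] = 120.0
alive_at_120 = monitor.peer_is_alive()
fake_now[0] = 131.0
alive_at_131 = monitor.peer_is_alive()
print(alive_at_120, alive_at_131)
```

A passive node would call `record_heartbeat` from its subscriber loop and consider taking over once `peer_is_alive` turns false.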
 
Why not use one of these?
 
LiveStatus – live state query module with check execution workers
Mod_gearman – distributed check execution based on gearman job queue
Merlin – database/distributed backend for Nagios
Ndoutils – database backend for Nagios
NSCA – allows check/command submission over network
NRPE – remote check executor
 
API, not a product
 
NagMQ is just an interface into Nagios, not a product
Better communication with clients comes from the larger ZeroMQ project – leaving NagMQ to focus on Nagios
Implement ad-hoc tools for Nagios without having to write any compiled code
Doing expensive data processing of monitoring data doesn’t have to create latency in the monitoring system
Re-use one interface for many tools
 
Future Work
 
Pluggable authentication/encryption for NagMQ
Pluggable parser/emitter for custom data formats (XML, YAML, etc.)
NDOutils database replacement
More user interfaces (Jabber, SMS, email gateway, REST API)
Nagios 4
 
NagMQ
 
https://github.com/jbreams/nagmq
 
 
 
 
Jonathan Reams
jbreams@gmail.com

Learn about the challenges faced with traditional Nagios monitoring systems and the innovative solution NagMQ, a ZeroMQ-based API, offers for improved monitoring efficiency and customization. ZeroMQ's asynchronous communication capabilities and flexibility provide a modern approach to monitoring, addressing issues such as customization limitations, communication difficulties, and performance bottlenecks. Explore how NagMQ empowers organizations to enhance their monitoring infrastructure effectively.


Uploaded on Sep 12, 2024



