Rethinking Network Monitoring: A Journey from Troubleshooting to Automation

 
Do we need to rethink
monitoring?
 
Kemal Sanjta
ThousandEyes
Nature of the troubleshooting
 
REACTIVE
PROACTIVE?
 
Troubleshooting life cycle
 
 
 
Issue
Troubleshooting
Conclusion based on the RCA
 
Troubleshooting tools
 
Ping and traceroute are a good starting point, but we realized we needed
something more
 
MTR
Paris traceroute
Dublin traceroute
NLNOG RING
… but we are still reactive and quite possibly late to
the party!
Back to alerting
 
 
Various sources (wrapper for end user reports)
SYSLOG
SNMP
More recently, streaming telemetry solutions
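As one illustration of turning a raw alert source into something automation can act on, here is a minimal sketch that decodes the priority field of a BSD-style syslog line (PRI = facility * 8 + severity). The helper names `parse_pri` and `is_alert_worthy`, the sample messages, and the choice to treat severity 3 ("err") and below as alert-worthy are assumptions for illustration, not something the talk prescribes.

```python
import re

# Syslog severity names, indexed by severity number (0 is most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# A syslog line starts with "<PRI>", where PRI is a 1-3 digit number.
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_pri(line):
    """Split a syslog line into (facility, severity, message).

    Returns None when the line has no valid <PRI> prefix.
    """
    match = PRI_RE.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # facility 23 * 8 + severity 7 is the largest valid PRI
        return None
    facility, severity = divmod(pri, 8)
    return facility, severity, match.group(2)


def is_alert_worthy(line, max_severity=3):
    """True when the line parses and its severity is at or below max_severity."""
    parsed = parse_pri(line)
    return parsed is not None and parsed[1] <= max_severity


if __name__ == "__main__":
    # Hypothetical message: PRI 187 = facility 23 (local7), severity 3 (err).
    sample = "<187>edge1: interface down"
    facility, severity, message = parse_pri(sample)
    print(facility, SEVERITIES[severity], message, is_alert_worthy(sample))
```

A filter like this is only the first step; the point is that the alert arrives in a form a program can route without a human reading it first.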
 
Now that we have alerts and the tools to
troubleshoot the problems…
 
WHAT IS THE PROBLEM?
What is the problem?
 
 
 
TIME
We are too slow to respond to alerts!
Improvement?
 
AUTOMATION
We discovered…
 
 
 
Python (and countless libraries)
The Go programming language (and its concurrency)
And a few frameworks along the way, like Ansible
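To make the response-time argument concrete, the sketch below probes many targets concurrently using only the Python standard library. The function names `tcp_connect_ms` and `probe_all`, the default port 443, and the worker count are hypothetical choices for illustration; a real deployment would pick probes that match its own network.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def tcp_connect_ms(host, port=443, timeout=2.0):
    """Return the TCP connect time to host:port in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:  # covers timeouts, refused connections, and DNS errors
        return None


def probe_all(targets, probe=tcp_connect_ms, max_workers=32):
    """Run probe(target) for every target concurrently; return {target: result}."""
    targets = list(targets)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return dict(zip(targets, pool.map(probe, targets)))
```

Calling `probe_all(["192.0.2.1", "192.0.2.2"])` (documentation addresses used as placeholders) checks both targets in parallel, so the total time is bounded by the slowest probe rather than the sum of all of them.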
Once automation provided results…
 
 
 
Are $vendors telling the full truth about
the performance of the networks?
How many times have you heard?
 
Linecards rebooting as a result of solar flares? (No root cause
analysis)
Counters for _exactly that_ issue are not exposed to users?
Counters exist, but you need to be a linecard-level wizard to get to
them? (This involves knowing a good deal about the architecture and
silicon/ASIC type)
The backplane was hit with a specially crafted packet that took your
fully redundant backplane down?
The control plane cannot handle it?
Automation gave us a product called…
 
VENDOR
DISTRUST
 
ACTIVE
NETWORK
MONITORING
Challenges with active network monitoring
 
 
 
Large-scale/enterprise networks moved to Clos
fabric designs
Clos fabric designs disaggregate large chassis and
depend on smaller-scale devices (to limit the “blast
radius”)
Smaller-scale devices, in turn, suffer from smaller
RIB/FIB sizes and weak control planes
 
Are they really smaller scale devices?
 
 
Juniper PTX1000: 24x100GbE, 72x40GbE, 288x10GbE = 2.88Tbps
Cisco NCS5000 series: 32x100GbE, 32x40GbE, 128x25GbE, 128x10GbE = 3.2Tbps
Arista 7170 series: 32x100GbE, 64x50GbE, 32x40GbE, 128x25GbE, 130x10GbE = 6.4Tbps
 
Depends on the angle… Better to lose 2.88Tbps to 6.4Tbps of capacity
than to have a fully loaded ASR 9922 take down 160Tbps
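The capacity figures above are port count times port speed. The sketch below works that arithmetic for the port options listed on the slide, assuming each option is an alternative way to populate the device and that capacity is counted in one direction. The names `capacity_tbps`, `PORT_OPTIONS`, and `max_capacity_tbps` are hypothetical.

```python
def capacity_tbps(ports, speed_gbps):
    """Aggregate capacity of `ports` ports at `speed_gbps` each, in Tbps."""
    return ports * speed_gbps / 1000


# Port options as listed on the slide: (port count, per-port speed in Gbps).
PORT_OPTIONS = {
    "Juniper PTX1000": [(24, 100), (72, 40), (288, 10)],
    "Cisco NCS5000 series": [(32, 100), (32, 40), (128, 25), (128, 10)],
    "Arista 7170 series": [(32, 100), (64, 50), (32, 40), (128, 25), (130, 10)],
}


def max_capacity_tbps(options):
    """Largest capacity among the alternative port configurations."""
    return max(capacity_tbps(ports, speed) for ports, speed in options)


if __name__ == "__main__":
    for device, options in PORT_OPTIONS.items():
        print(device, max_capacity_tbps(options), "Tbps")
```

Under these assumptions the Juniper and Cisco options reproduce the 2.88Tbps and 3.2Tbps figures. The listed Arista options come to 3.2Tbps, so the slide's 6.4Tbps presumably counts something the port list does not state (for example both directions, or a different configuration).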
Some more challenges…
 
 
Label-switched networks (backbone
networks) that use features like auto-bw are
not straightforward to cover with active
network monitoring
 
That implies…
 
NO 100% ACTIVE NETWORK
MONITORING COVERAGE
 
 
Did we forget about something?
 
THE
INTERNET
The Internet
 
Packet Loss
Latency
Jitter
BGP advertisements/withdrawals
Prefix hijacks
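The first three metrics in this list can be derived from a simple series of probe round-trip times. The sketch below is one way to do it, assuming lost probes are recorded as `None` and defining jitter as the mean absolute difference between consecutive received RTTs, which is one simple definition among several. The function name `summarize_probes` and the sample values are hypothetical.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize a probe series; None entries represent lost probes."""
    received = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    latency_ms = mean(received) if received else None
    # Differences between each received RTT and the next received RTT.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else None
    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}


if __name__ == "__main__":
    # Four probes sent, the third one lost.
    print(summarize_probes([10.0, 12.0, None, 11.0]))
```

Tracking these three numbers per Internet path over time gives a baseline against which alerts can be raised before customers notice.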
Some more challenges…
 
SERVICES
Don’t be that person that shunts the issue(s) to SREs and says:
“Not my problem”
Solution?
 
Learn how to code (as your job might depend on it)
Use research papers on data center and backbone design so you do not
repeat someone else’s mistakes
Use both active and passive network monitoring, regardless of how
hard that might be… or just buy an off-the-shelf solution that does it
Extend active network monitoring solutions to achieve 100% active
network monitoring coverage
Monitor the performance of your Internet paths, as the life of your packets
and the patience of your customers depend on it!
Know, monitor, and alert on your services, and don’t play the blame game!

Explore the evolution of network monitoring from reactive troubleshooting to proactive automation. Discover the importance of timely response, the role of tools like MTR and NLNOG RING, the need for alerts and automation, and the challenges in obtaining accurate network performance insights. Delve into the world of Python, Go Programming Language, and Ansible for automation, and question the transparency of vendors in reporting network performance metrics.


Uploaded on Jul 29, 2024 | 0 Views



