Vitrage Project Update: Root Cause Analysis Service in OpenStack


Vitrage is an OpenStack service for organizing, analyzing, and expanding alarms and events, providing a holistic view of the system. Founded during the Mitaka release cycle, Vitrage became an official OpenStack project in 2016, with a focus on Root Cause Analysis (RCA). It integrates with Mistral so that its insights can trigger corrective workflows, and it is taking first steps toward machine learning for alarm correlation and causality.


Uploaded on Aug 30, 2024



Presentation Transcript


  1. Vitrage Project Update, OpenStack Summit Sydney, November 2017. Ifat Afek (IRC: ifat_afek). Queens Virtual PTG: https://etherpad.openstack.org/p/vitrage-ptg-queens

  2. What is Vitrage? The OpenStack RCA (Root Cause Analysis) service. Vitrage is used for organizing, analyzing and expanding OpenStack alarms and events. Root Cause Analysis: understand what causes faults to occur. Deduced alarms and states: raise alarms and modify states based on system insights. Holistic and complete view of the system.

  3. Project Background. Founded during the Mitaka release of OpenStack. Became an official OpenStack project on June 1st, 2016. First official release: Newton. About 10 contributors in the last release.

  4. High Level Architecture (diagram labels): API, CLI, UI; Graph; Machine Learning; Notifications to External Systems; External Projects and Monitors; Logic Templates.

  5. Pike Features

  6. Vitrage Integration with Mistral. Vitrage provides insights about the system. Mistral is a workflow service. Vitrage + Mistral -> analysis and corrective actions. Example flow (diagram participants: Zabbix, Vitrage, Mistral, Nova): NIC is down; raise "VM unreachable" alarm; execute migrate_vm workflow; VM migrate; VM migrated to another host; clear "VM unreachable" alarm.

  7. Machine Learning First Steps. First steps of augmenting Vitrage with machine learning capabilities: implemented the infrastructure; implemented a basic Jaccard correlation algorithm. Today: evaluator templates are manually edited by the user. Tomorrow: automatically generate evaluator templates based on alarm history, by finding alarm correlation (B usually appears right after A) and alarm causality (is A the root cause of B?). Algorithm developed by Bell Labs. Diagram (events X and Y): event Y's real start time can be before or after X; event Y's real end time can be before or after X.
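The Jaccard correlation mentioned on this slide can be illustrated with a minimal sketch. It is not Vitrage's implementation: the function names, the `(alarm_name, timestamp)` history format, and the 60-second bucketing window are all assumptions made for illustration. Each alarm is mapped to the set of time buckets in which it fired, and two alarms are scored by the size of the intersection of their bucket sets divided by the size of the union.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def alarm_correlations(history, window=60):
    """Score every pair of alarm names by co-occurrence in time buckets.

    history: iterable of (alarm_name, timestamp_seconds); hypothetical format.
    window:  bucket width in seconds; an arbitrary choice for this sketch.
    """
    buckets = {}
    for name, timestamp in history:
        buckets.setdefault(name, set()).add(int(timestamp // window))
    return {
        (x, y): jaccard(buckets[x], buckets[y])
        for x, y in combinations(sorted(buckets), 2)
    }


# Made-up alarm history, for illustration only.
history = [
    ("nic_down", 5), ("vm_unreachable", 12),
    ("nic_down", 130), ("vm_unreachable", 140),
    ("disk_full", 400),
]
scores = alarm_correlations(history)
```

A high score only says that two alarms tend to appear together; it does not say which one is the root cause, which is the separate causality question raised on the slide.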

  8. Other Features: Vitrage template language extension (added "not" operator); SNMP notifications; Keycloak support; alarm equivalence.

  9. Queens Features

  10. High Availability and Alarm History. Improve Vitrage high availability support. Lay the ground for alarm and RCA history. Store alarm history using snapshots and events (event sourcing pattern). Implemented in stages: Pike: collector; Queens: persister and player; Queens/Rocky: alarm history.
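The event sourcing pattern named on this slide can be sketched as an append-only event log plus periodic snapshots. The class below is a hypothetical illustration, not Vitrage's persister or player: the name `AlarmStore`, the `raise`/`clear` event kinds, and the in-memory storage are assumptions. The state at any point in history is rebuilt by starting from the latest snapshot taken at or before that point and replaying the events that follow it.

```python
class AlarmStore:
    """Hypothetical in-memory store: append-only events plus snapshots."""

    def __init__(self):
        self.events = []     # (sequence number, kind, alarm id), never modified
        self.snapshots = []  # (sequence number, set of active alarm ids)
        self.seq = 0

    def record(self, kind, alarm_id):
        """Append a 'raise' or 'clear' event to the log."""
        self.seq += 1
        self.events.append((self.seq, kind, alarm_id))

    def snapshot(self):
        """Save the current set of active alarms so later replays start here."""
        self.snapshots.append((self.seq, self.state_at(self.seq)))

    def state_at(self, seq):
        """Return the set of active alarms right after event number `seq`."""
        base_seq, state = 0, set()
        for snap_seq, snap_state in self.snapshots:
            if snap_seq <= seq:
                base_seq, state = snap_seq, set(snap_state)
        for event_seq, kind, alarm_id in self.events:
            if base_seq < event_seq <= seq:
                if kind == "raise":
                    state.add(alarm_id)
                elif kind == "clear":
                    state.discard(alarm_id)
        return state


store = AlarmStore()
store.record("raise", "a")
store.record("raise", "b")
store.snapshot()
store.record("clear", "a")
store.record("raise", "c")
```

Because events are never overwritten, the same log answers both "what is active now?" and "what was active at an earlier point?", which is what an alarm history feature needs.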

  11. Configurable Notifications. Existing: dedicated notifiers (Nova, SNMP, Mistral). In Queens: API for registering on Vitrage alarms, by resource id, by alarm name, or by regular expression; HTTP callback upon alarm.
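The registration idea on this slide can be sketched as a small registry that matches alarms against filters and calls back over HTTP. This is not the Vitrage API: `WebhookRegistry`, its parameters, and the alarm dictionary keys (`name`, `resource_id`) are hypothetical, and applying the regular expression to the alarm name is an assumption, since the slide does not say which field it matches. The sender is injectable so the matching logic can be exercised without a network.

```python
import json
import re
import urllib.request


def http_post(url, payload):
    """Default sender: POST the alarm as JSON to the registered URL."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


class WebhookRegistry:
    """Hypothetical registry of HTTP callbacks with optional filters."""

    def __init__(self, send=http_post):
        self.send = send
        self.hooks = []

    def register(self, url, resource_id=None, alarm_name=None, name_regex=None):
        """Register a callback; filters left as None match every alarm."""
        pattern = re.compile(name_regex) if name_regex else None
        self.hooks.append((url, resource_id, alarm_name, pattern))

    def notify(self, alarm):
        """Call every hook whose filters all match; return the URLs called."""
        called = []
        for url, resource_id, alarm_name, pattern in self.hooks:
            if resource_id is not None and alarm.get("resource_id") != resource_id:
                continue
            if alarm_name is not None and alarm.get("name") != alarm_name:
                continue
            if pattern is not None and not pattern.search(alarm.get("name", "")):
                continue
            self.send(url, alarm)
            called.append(url)
        return called


# Usage with a fake sender that records calls instead of using the network.
sent = []
registry = WebhookRegistry(send=lambda url, alarm: sent.append((url, alarm["name"])))
registry.register("http://example.invalid/by-resource", resource_id="vm-1")
registry.register("http://example.invalid/by-name", alarm_name="host_down")
registry.register("http://example.invalid/by-regex", name_regex=r"^vm_")
called = registry.notify({"name": "vm_unreachable", "resource_id": "vm-1"})
```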

  12. Equivalence and Aggregation. Resource equivalence: two datasources report the same resource. How to indicate the equivalency? What if one datasource removes the resource? Aggregation: APIs should return a semi-merged resource. Design in progress. Diagram labels: Nova, Discovery agent, Aggregated display, Equivalent API call; states shown: compute-0 AVAILABLE, compute-0 SUBOPTIMAL, compute-0 SUBOPTIMAL.

  13. Proactive RCA. Host down alarm -> deduce that instance is down. Instance down alarm -> suspect that host is down. There could be more than one suspect. Still under design and requirement definition: How to verify that a suspect is a real alarm? When to clear a suspect alarm? Diagram: Host Down -> (deduce) -> Instance Down; Instance Down -> (suspect) -> Host Down; Run Diagnostics.
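The two directions of reasoning on this slide can be sketched over a small dependency map. The slide states that the feature is still under design, so everything here is a hypothetical illustration: the `DEPENDS_ON` topology (including the switch), the function names, and the diagnostics callback are assumptions. Going forward from a failed resource yields deduced alarms; going backward from a failed instance yields only suspects, which a diagnostics step then confirms or rejects.

```python
# Hypothetical topology: each instance lists the resources it depends on.
DEPENDS_ON = {
    "vm-1": ["host-a", "switch-1"],
    "vm-2": ["host-a", "switch-1"],
    "vm-3": ["host-b", "switch-1"],
}


def deduced_alarms(failed_resource, depends_on=DEPENDS_ON):
    """Forward direction: every instance that depends on the failed resource."""
    return sorted(vm for vm, deps in depends_on.items() if failed_resource in deps)


def suspect_alarms(failed_instance, depends_on=DEPENDS_ON):
    """Backward direction: each dependency is a suspect, not a confirmed fault."""
    return list(depends_on.get(failed_instance, []))


def confirm(suspects, run_diagnostics):
    """Keep only the suspects that the (caller-supplied) diagnostics confirm."""
    return [resource for resource in suspects if run_diagnostics(resource)]


deduced = deduced_alarms("host-a")
suspects = suspect_alarms("vm-3")
confirmed = confirm(suspects, run_diagnostics=lambda resource: resource == "host-b")
```

The sketch leaves the slide's open questions open: how diagnostics actually verify a suspect, and when an unconfirmed suspect should be cleared.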

  14. Other Features: parallel evaluation of Vitrage templates; integration with OPNFV Doctor; SNMP parsing service; templates CRUD; discovery agent.

  15. We are looking for contributors! Vitrage IRC channel: #openstack-vitrage. OpenStack mailing list: use the [vitrage] tag. Vitrage wiki page: https://wiki.openstack.org/wiki/Vitrage

  16. Q&A. Thank you!
