Understanding OpenStack Upgrades: A Comprehensive Overview

Delve into the world of OpenStack upgrades as we explore the reasons for upgrading, the process involved, and the key areas to focus on. Discover the importance of minimizing downtime, improving stability, and bringing in new features. Gain insights into upgrading packages, configuration files, databases, and deployment specifics. Explore a containerized environment for OpenStack services and the tools used for deployment within Docker containers.


Uploaded on Sep 26, 2024


Presentation Transcript


  1. OpenStack Upgrade: A Journey from Liberty to Ocata
     Ajay Kalambur, Technical Leader
     Shail Bhargava, Technical Leader
     Rich Winters, Senior Software Eng.
     November 8, 2017

  2. Agenda
     - OpenStack Upgrades: Why Should One Care?
     - What Does An Upgrade Entail?
     - Environment Overview
     - Key Areas
       - OS: RHEL Upgrade major releases
       - Storage: CEPH Upgrade from Hammer to Jewel
       - OpenStack Ver: Upgrade from Liberty to Ocata
       - Infra Services: Upgrade of rabbitmq, galera, haproxy
     - Rollback Of An Upgrade (Newton-Ocata)
     - Verification/Testing Of An Upgrade Process

  3. OpenStack Upgrade: Why Should One Care?
     - Minimize Any Production Cloud Downtime
       - Minimize data plane and control plane disruption
       - No need to recreate workloads on cloud
       - Having a secondary cloud to migrate workloads is not viable economically
     - Move At The Speed Of Open Source
       - Bring in new OpenStack features
       - Improved stability with each release (e.g. bug fixes)
       - Reduce unnecessary support (e.g. EOL code)

  4. OpenStack Upgrade: What Does It Entail?
     - General
       - Package upgrade: upgrade packages to bring in new software
       - Update configuration files: update configuration files with the latest parameters; configuration translations across releases
       - Sync databases: run a database sync to update schemas to the new structure
     - Deployment Specific
       - Containers vs Host vs VM
       - Operating system
       - CEPH
       - Infra services (rabbitmq, galera, haproxy, etc.)
       - OpenStack Services

  5. Our Environment

  6. Containerized OpenStack Services
     - Ansible playbooks used to deploy OpenStack services within Docker containers
     - Docker containers started via systemd
     - Haproxy used for load balancing
     [Diagram: node roles are Management Node, Control Node (3), Compute Node (n) and Storage Node (min of 3). Components shown include Elasticsearch, Kibana, Fluentd, Container Registry, Repo Mirror, VMTP, Mgmt APIs, ceilometer, haproxy, Horizon, Glance, Neutron, Nova, Cinder, Keystone, Rabbitmq, Galera, Memcached, Libvirt and CEPH.]

  7. Control Plane High Availability
     - OpenStack API load balanced via HAProxy and Keepalived [Active/Standby].
     - OpenStack services, message queue and database high availability implemented using three OpenStack control nodes.
     [Diagram: API calls arrive on the API network at one active HAProxy (the other two are standby) and are distributed over the management network to Control Nodes 1 to 3. Each control node runs Keystone, Horizon, Nova, Neutron Server and other services, plus a Galera member (one active, two standby).]

  8. Upgrade High Level View

  9. Key Upgrade Events
     [Timeline: Liberty -> sanity -> Mitaka -> Newton -> sanity -> Ocata]
     - Events shown on the timeline
       - Host packages and kernel upgrade (7.2 -> 7.4)
       - CEPH upgrade: Hammer to Jewel
       - Infra services upgrade, then OpenStack services upgrade, for each release hop (Liberty -> Mitaka, Mitaka -> Newton, Newton -> Ocata)
     - Key Service Events (During an Upgrade)
       - Cleanup old service: stop old service, remove old container, remove old image for container
       - Bootstrap container: DB sync for service, create any new services/users for keystone
       - Bring up service: bring up new container with new configuration, check service health
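
The per-service events on this slide can be read as an ordered checklist. The following is a minimal sketch of that ordering only; the phase and step names paraphrase the slide, while `upgrade_service` and the `run_step` callback are hypothetical and are not taken from the presenters' Ansible playbooks.

```python
from typing import Callable, Dict, List

# Phases and steps paraphrase the slide; the identifiers are illustrative.
SERVICE_UPGRADE_PHASES: Dict[str, List[str]] = {
    "cleanup_old_service": [
        "stop_old_service",
        "remove_old_container",
        "remove_old_image",
    ],
    "bootstrap_container": [
        "db_sync",
        "create_keystone_services_and_users",
    ],
    "bring_up_service": [
        "start_new_container",
        "check_service_health",
    ],
}


def upgrade_service(service: str, run_step: Callable[[str, str], bool]) -> List[str]:
    """Run every step in slide order and stop at the first failure.

    run_step(service, step) is a hypothetical hook that performs one step
    and returns True on success. Returns the list of completed steps.
    """
    completed: List[str] = []
    for phase, steps in SERVICE_UPGRADE_PHASES.items():
        for step in steps:
            if not run_step(service, step):
                raise RuntimeError(f"{service}: step '{step}' in phase '{phase}' failed")
            completed.append(step)
    return completed
```

Stopping at the first failed step keeps the remaining state easy to reason about before a rollback is attempted.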

  10. Upgrade Specifics

  11. Operating System and Host Package Details
     - Operating System: RHEL 7.2 -> 7.4
       - Liberty-Mitaka includes RHEL 7.2 -> 7.4
       - Same kernel Mitaka -> Ocata
       - Kexec upgrade
       - Optional delayed reboot of compute nodes
       - SELinux relabel
       - Mitaka-Ocata: minimal control and data plane disruption
     - Host packages
       - Docker 1.8.2 -> 1.10
       - Galera auto recovery after docker upgrade
       - Ansible and python-docker-py upgrade

  12. Ceph Upgrade Details (Hammer -> Jewel)
     - CEPH Monitors
       - Change ownership of /var/lib/ceph, /var/log/ceph, /etc/ceph to ceph:ceph
       - Upgrade to newer version
     - OSD Nodes
       - Change ownership of /var/lib/ceph/osd, /var/log/ceph to ceph:ceph
       - Enable ceph-osd.target and ceph-osd@<osdid> services
       - Upgrade to newer version
     - Post Upgrade tasks
       - ceph osd set require_jewel_osds
       - ceph osd crush tunables firefly

  13. Infrastructure Upgrade: Rabbit & Galera
     - Rabbitmq upgrade
       - Stop rabbitmq on all 3 nodes (major or minor version change)
       - Remove mnesia file
       - Bring up new containers
       - Set HA policy for mirrored queues
     - Galera upgrade
       - Disable galera backend in haproxy
       - Shut down services in the right order
       - Remove grastate.dat
       - Bootstrap new cluster: bring up primary node
       - Bring up other 2 members
       - Enable galera backend in haproxy

  14. OpenStack Services Upgrade
     - General flow
       - Liberty -> Mitaka -> Newton -> Ocata
       - Rolling upgrade of each service
       - Delete old container, bootstrap service (db sync), install new containers
       - Bring in new services (for example placement) as part of upgrade
       - Snapshot database using mysqldump to recover from an upgrade failure
       - Rollback support to recover from upgrade failure
     - Nova placement handling (Newton-Ocata)
       - Create nova cells database
       - Keystone changes for placement
       - Cell setup

  15. Upgrade and Rollback User Flow
     [Flow diagram]
     - Main path: Pre-Upgrade Validations, Galera backup, Upgrade Services
     - Decision: Upgrade success?
       - Yes: Commit
       - No (Rollback): Shutdown all OpenStack services, Rollback OpenStack Services, Restore Galera database, Post Rollback Validations
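
A minimal sketch of the decision flow in this diagram follows. Step names paraphrase the slide; `upgrade_with_rollback` and the `actions` mapping are hypothetical. Two orderings are assumptions, since the flattened diagram does not make them unambiguous: the Galera backup is taken before services are upgraded, and the database is restored before services are rolled back (the order used on the rollback demo slide).

```python
from typing import Callable, Dict, List

ROLLBACK_STEPS = (
    "shutdown_all_openstack_services",
    "restore_galera_database",       # assumed to precede the service rollback
    "rollback_openstack_services",
    "post_rollback_validations",
)


def upgrade_with_rollback(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Walk the upgrade flow and return the names of the steps that ran.

    actions maps a step name to a hypothetical callable returning True on
    success. "commit" is appended when the upgrade succeeds.
    """
    trace: List[str] = []

    def run(name: str) -> bool:
        trace.append(name)
        return actions[name]()

    if not run("pre_upgrade_validations"):
        return trace                 # nothing has been changed yet
    if not run("galera_backup"):
        return trace                 # do not upgrade without a backup
    if run("upgrade_services"):
        trace.append("commit")
        return trace
    for name in ROLLBACK_STEPS:      # upgrade failed: roll back
        run(name)
    return trace
```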

  16. Challenges Faced and Addressed
     - Galera cluster in a bad shape after a host package upgrade
       - Run automated galera recovery
     - Rabbitmq reconnections sometimes not working as expected
       - Restart rabbitmq servers post upgrade
     - Handle soft deleted records on upgrade
       - https://review.openstack.org/#/c/435620/
     - Handling a different network design for CEPH nodes between Liberty and Ocata
       - Keep backward compatibility
     - VMs are not reachable over floating IP post upgrade
       - Move the network namespace to host from container

  17. Upgrade Verification
     - Pre Upgrade Tests
       - Check health of OpenStack services
       - Check CEPH health
       - Check health of infra services (rabbitmq, galera, haproxy)
     - Post Upgrade Tests
       - Functional tests: Rally and tempest
       - Validate CEPH cluster functionality
       - Verify existing resources created before upgrade
       - Database schema comparison between upgraded setup and fresh deployment
       - Data plane and control plane downtime tracked through multi release upgrade
       - Validate health of infra services (rabbitmq, galera, haproxy)

  18. Post Upgrade Verification
     - Automated testing
       - Performs end-to-end installation of the Liberty release followed by upgrade to Mitaka/Newton/Ocata
       - End-to-end wrapper script which eases the intermediate Mitaka and Newton upgrades
       - Helps uncover timing issues (rabbitmq cluster does not respond intermittently)
       - Runs nightly to catch any regression

  19. Demos

  20. Demos
     - Neutron Upgrade (Mitaka-Newton)
       - Sample flow for upgrade of 1 service
       - Rolling upgrade of each component
       - Bring down old service
       - Bootstrap new service
       - Bring up new service
     - Rollback (Ocata -> Newton)
       - Preview of Ocata
       - Stop of Ocata services
       - Restore Newton mysql DB
       - Rollback of OpenStack services
       - Post rollback sanity of cloud
       - No control plane operations post upgrade

  21. Summary
     - An OpenStack multi release upgrade can be performed by internally triggering a step-by-step, in-sequence upgrade (every release)
     - An upgrade between releases normally also involves additional components, e.g. operating system + CEPH + infra services
     - Rollback support works
       - Control plane will be down during the rollback window
     - Containerized OpenStack deployments need to handle Docker upgrade
     - Repeatable automation to run upgrades every night is critical to flush out any hidden timing bugs

  22. Questions?
     - Ajay Kalambur: akalambu@cisco.com
     - Shail Bhargava: shabharg@cisco.com
     - Richard Winters: riwinter@cisco.com

  23. Backup/Details

  24. RHEL Version Upgrade
     - Ansible upgrade on management node is simple
       - Ansible: 1.9.4 -> 2.2.1
       - python-docker-py: 1.4.0 -> 1.9.0
     - RHEL version upgrade from 7.2 to 7.4
     - Docker upgrade from 1.8.2 -> 1.10
     - Option of delayed reboot of host operating system on compute nodes
     - All operating system and package updates performed in Liberty-Mitaka
     - Auto execute galera cluster recovery (see later)
     - Removed the oci-register-machine hook before docker upgrade
       - rm -rf /usr/libexec/oci/hooks.d/oci-register-machine

  25. RHEL Version Upgrade
     - SELinux relabeling done as part of OS upgrade
     - Mitaka-Ocata use the same kernel, so no node has downtime; it's a hitless upgrade
     - Handle kexec changes for new kernel
       - Install kexec loader based on new kernel
       - Install modified kexec unit file
       - Set up kexec kernel load for restart
       - Default to kexec restart
     - Patch libvirt systemd file to add a dependency on machine.slice
       - Needed for VMs to start automatically after system reboot
     - Automatically install any new packages added as part of newer release

  26. CEPH Upgrade: Hammer to Jewel
     - Upgrade CEPH mon nodes first
       - Change ownership of /var/lib/ceph, /var/log/ceph, /etc/ceph to ceph:ceph
       - We run ceph-mon in a docker container, so we replace it with the new container
       - Track ceph-mon docker container through systemd
     - CEPH OSD node upgrade
       - One node upgraded at a time
       - Stop all OSD services
       - Change ownership of /var/lib/ceph/osd, /var/log/ceph to ceph:ceph
       - Yum update all CEPH packages
       - Make sure to create mount entries for all ceph osd drives in /etc/fstab
       - systemctl enable ceph-osd.target, ceph-osd@<osdid> for all OSDs
       - touch /.autorelabel and reboot

  27. CEPH Upgrade: Post Upgrade Tasks
     - On Mon node: ceph osd set require_jewel_osds
     - On Mon node: ceph osd crush tunables firefly
     - Check status of ceph cluster with ceph -s to make sure health is OK

  28. Rabbitmq Cluster Upgrade Procedure
     - Involves a change to major or minor version
     - Stop all rabbitmq servers on all 3 controllers
     - Remove old containers and images
     - Remove the mnesia file: /var/lib/docker/volumes/rabbitmq/_data/mnesia
     - Bring up the new rabbitmq containers with new configs
     - Enable HA policy explicitly for mirrored queues:
       rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
     - Validate cluster state to make sure things came up fine: rabbitmqctl cluster_status
       - Cluster running with required number of members
       - No partitions
       [{nodes,[{disc,['rabbit@control-server-1','rabbit@control-server-2',
                       'rabbit@control-server-3']}]},
        {running_nodes,['rabbit@control-server-2','rabbit@control-server-1',
                        'rabbit@control-server-3']},
        {cluster_name,<<"rabbit@control-server-3">>},
        {partitions,[]}]
       ...done.
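
The two validation criteria on this slide (required number of running members, no partitions) can be checked mechanically. The sketch below assumes the Erlang-term output format shown on the slide; newer RabbitMQ releases print `cluster_status` differently, so the regular expressions would need adjusting. The function name `check_rabbitmq_cluster` and the `SAMPLE` constant are illustrative.

```python
import re

# Output copied from the slide; used here only as sample input.
SAMPLE = """[{nodes,[{disc,['rabbit@control-server-1','rabbit@control-server-2',
                'rabbit@control-server-3']}]},
 {running_nodes,['rabbit@control-server-2','rabbit@control-server-1',
                 'rabbit@control-server-3']},
 {cluster_name,<<"rabbit@control-server-3">>},
 {partitions,[]}]
...done."""


def check_rabbitmq_cluster(status_output: str, expected_nodes: int) -> bool:
    """Return True if all expected nodes are running and no partitions exist."""
    running = re.search(r"\{running_nodes,\[(.*?)\]\}", status_output, re.S)
    partitions = re.search(r"\{partitions,\[(.*?)\]\}", status_output, re.S)
    if running is None or partitions is None:
        return False  # unexpected format: treat as unhealthy
    nodes = re.findall(r"'([^']+)'", running.group(1))
    no_partitions = partitions.group(1).strip() == ""
    return len(nodes) == expected_nodes and no_partitions
```

In practice the text passed in would be the captured output of `rabbitmqctl cluster_status` from one controller.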

  29. Galera Cluster Upgrade Procedure
     - Disable galera backend in haproxy
       - touch /var/tmp/clustercheck.disabled
       - Wait a few seconds for pending transactions to sync
     - Shut down galera services on all 3 nodes in proper order
     - Remove grastate.dat
       - Avoids getting into higher transaction id issues
       - Moved to a new-cluster approach instead of a graceful highest-transaction-id shutdown (requests in transit did not matter to us)
     - Stop and remove old container and docker images
     - Start new container with new configs
     - Bootstrap one node as primary and start up

  30. Galera Cluster Post Upgrade Procedure
     - Restore galera cluster behind haproxy
       - Remove /var/tmp/clustercheck.disabled
     - Perform xinetd checks and cluster status check
     - Validate cluster status output
       - mysql -u root -p<password> -e "SHOW STATUS LIKE 'wsrep%'"
       - Expect wsrep_local_state_comment: Synced and wsrep_cluster_size: <# controllers>
     - Perform health check of <vip>:3306 to make sure things work end to end
     - Ability to trigger galera automated recovery if previous steps fail
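
The status validation on this slide comes down to two values. Here is a minimal sketch of that check, assuming the `SHOW STATUS` output was captured in the mysql client's batch format (one whitespace-separated name/value pair per line, as produced when output is not a terminal). The function name `galera_is_healthy` is hypothetical.

```python
def galera_is_healthy(status_output: str, expected_size: int) -> bool:
    """Check the two wsrep values named on the slide.

    status_output is assumed to be batch-mode output of
    SHOW STATUS LIKE 'wsrep%': "<name><whitespace><value>" per line.
    """
    status = {}
    for line in status_output.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            status[parts[0]] = parts[1].strip()
    return (
        status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_cluster_size") == str(expected_size)
    )
```

With three controllers, `expected_size` would be 3; any other state comment, or a smaller cluster size, is treated as a failed check.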

  31. Galera Cluster Automated Recovery
     - Handle the case of heuristic rollback, where galera was shut down in the middle of a transaction
     - If the cluster still cannot be recovered, perform complete failure recovery
       - Stop all existing galera services
       - Bootstrap all mariadb containers with the --wsrep-recover option
       - Check which node has the highest transaction number
       - Force this node as primary and bootstrap mariadb with the --wsrep-new-cluster option
       - Wait for the node to come online and respond to SQL queries
       - Start remaining nodes to join the cluster
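
The "check node with highest transaction number" step can be sketched as follows. When started with `--wsrep-recover`, mysqld logs a line of the form `WSREP: Recovered position: <uuid>:<seqno>`; the sketch parses that line from each node's log and picks the node with the highest seqno. How the logs are collected is outside the sketch, and the function names are illustrative rather than the presenters' implementation.

```python
import re
from typing import Dict, Optional

RECOVERED = re.compile(r"Recovered position:\s*([0-9a-fA-F-]+):(-?\d+)")


def recovered_seqno(log_text: str) -> Optional[int]:
    """Extract the seqno from a --wsrep-recover log, or None if absent."""
    match = RECOVERED.search(log_text)
    return int(match.group(2)) if match else None


def pick_bootstrap_node(logs_by_node: Dict[str, str]) -> str:
    """Return the node whose recovered seqno is highest.

    Ties are broken by node name so the choice is deterministic.
    """
    known = {}
    for node, log_text in logs_by_node.items():
        seqno = recovered_seqno(log_text)
        if seqno is not None:
            known[node] = seqno
    if not known:
        raise ValueError("no node reported a recovered position")
    return max(sorted(known), key=lambda node: known[node])
```

The selected node is the one to start with `--wsrep-new-cluster`; the others then join it.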

  32. ELK-EFK Components Upgrade
     - Liberty setup: ELK components at version 1.5
     - No requirement to persist logs and database between Liberty and Ocata
     - Moved to EFK 5.x with a rip and replace
       - Remove old containers and images
       - Install new EFK components
     - Old logs were lost, but EFK was set up properly going forward with Ocata

  33. OpenStack Release Upgrade
     - Customer initiates Liberty-Ocata upgrade; no extra nodes available
     - We internally perform a rolling upgrade 1 release at a time: Liberty -> Mitaka -> Newton -> Ocata
     - Test all resources before and after upgrade
     - Bring in any new services as part of upgrade, for example placement in Ocata
     - Snapshot database using mysqldump to recover from an upgrade failure
       - POC rollback implementation for Newton-Ocata
     - Upgrades of the rabbitmq and galera clusters have a different flow
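
The "one release at a time" rule can be expressed as a small helper that expands a requested upgrade into its individual hops. This is a minimal sketch; the release list covers only the releases named in the talk, and `upgrade_hops` is a hypothetical name.

```python
from typing import List, Tuple

# Releases in order, limited to those covered by the presentation.
RELEASES = ["liberty", "mitaka", "newton", "ocata"]


def upgrade_hops(source: str, target: str) -> List[Tuple[str, str]]:
    """Expand a multi-release upgrade into consecutive single-release hops."""
    start, end = RELEASES.index(source), RELEASES.index(target)
    if start >= end:
        raise ValueError(f"{target} is not newer than {source}")
    return list(zip(RELEASES[start:end], RELEASES[start + 1:end + 1]))
```

For a Liberty to Ocata request this yields three hops, each of which would run its own infra and OpenStack service upgrades.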

  34. Placement Handling (Newton-Ocata)
     - Create nova cells database (mandatory in Ocata)
     - Create the placement user in keystone
     - Create the placement service in keystone
     - Create the placement endpoints in keystone
     - Register the cell0 database: nova-manage cell_v2 map_cell0
     - Create the cell1 cell:
       - nova-manage cell_v2 create_cell --name=cell1
       - nova-manage cell_v2 simple_cell_setup
     - Perform nova api and nova db sync
     - Post nova upgrade, trigger a manual discover hosts: nova-manage cell_v2 discover_hosts (later on: config option discover_hosts_in_cells_interval 300)

  35. Verification: Functionality Testing
     - Comparison of upgraded and fresh installed setup
       - Database schema migration using mysqldump
         - Manual verification by comparing the two dumps using "diff"
         - Exploring automated tools for database schema comparison to put in CI/CD
       - Host RPM packages including kernel, ansible, docker, OpenStack clients, etc.
     - CEPH cluster functionality
     - Host configuration changes, e.g. reserving a TCP port for newly introduced service(s)
     - Host system services ordering and dependencies
     - Functionality of kexec reboot (to minimize downtime and avoid doing lengthy hardware POST)
     - Migration of custom configuration
     - TCP/UDP port scan to identify changes in open/closed ports
     - Existing OpenStack resources can be accessed, modified and deleted post upgrade
     - New OpenStack resources can be created post upgrade
     - OpenStack services running
     - Rally test for various components on upgraded setup
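
The manual "diff of two dumps" check lends itself to a small script. The sketch below uses Python's `difflib` as a stand-in for the `diff` tool mentioned on the slide and assumes both inputs are schema-only dumps (for example from `mysqldump --no-data`). Dropping comment lines and `AUTO_INCREMENT` counters is an assumption about which differences are noise; the function names are illustrative.

```python
import difflib
import re
from typing import List


def normalize_dump(dump: str) -> List[str]:
    """Drop lines and values that differ between installs without schema impact."""
    lines = []
    for line in dump.splitlines():
        # Skip SQL comments, versioned directives and blank lines.
        if line.startswith("--") or line.startswith("/*!") or not line.strip():
            continue
        # Row counters differ with data volume, not with schema.
        lines.append(re.sub(r" AUTO_INCREMENT=\d+", "", line.rstrip()))
    return lines


def schema_diff(fresh_dump: str, upgraded_dump: str) -> List[str]:
    """Return unified-diff lines; an empty list means the schemas match."""
    return list(
        difflib.unified_diff(
            normalize_dump(fresh_dump),
            normalize_dump(upgraded_dump),
            fromfile="fresh",
            tofile="upgraded",
            lineterm="",
        )
    )
```

An empty result is the pass condition, which makes the check straightforward to wire into a nightly job.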

  36. Verification: Downtime
     - Control plane downtime
       - Host RPM package update results in increased downtime (e.g. docker RPM)
       - Liberty to Mitaka has more downtime than the Mitaka to Newton or Newton to Ocata upgrades
     - Data plane downtime
       - Host RPM package update increases the downtime (e.g. kernel, iptables)
       - Increased downtime with the external (floating) network compared to the provider network
