OpenStack Upgrade from Mitaka to Ocata (across 2 releases!) with Mirantis Cloud Platform
Facing this problem as an OpenStack distribution vendor, however, is less common. But that's exactly what happened to us.
That's not to say it took us by surprise. When we released Mirantis Cloud Platform 1.0 in March of this year, we wanted to provide the easiest transition possible -- as well as feature parity -- from Mirantis OpenStack 9.2. MOS 9.2 is based on OpenStack Mitaka, which had the added bonus of enabling us to give customers the ability to move between Ubuntu Trusty and Ubuntu Xenial.
That's the upside. The downside is that it meant we were dealing with an OpenStack that was almost a year old, and that's certainly not what we want. Especially with a third release just three months away, we had to figure out how to get back on track, bypassing Newton and going straight to Ocata so our customers would be ready when Pike arrived.
This blog describes OpenStack upgrades in general, and how we performed a double upgrade. We'll also share a demo video of a real upgrade on a hardware cluster, taking a Mitaka-based OpenStack with an Ubuntu Trusty control plane to Ocata running on Xenial.
What is an OpenStack Upgrade?
There's a good reason that upgrades have been one of the hottest topics in OpenStack for the last few years. Like any system upgrade, an OpenStack upgrade is scary for operations people, and it has been a pain point in the community since the beginning. I've been around long enough to remember that upgrades from releases such as Grizzly with Neutron were almost impossible. Historically, the OpenStack community has supported only upgrades of the individual components, and even then, only from one version to the next. Considering the OpenStack project's complexity, plus the inevitable vendor plugins necessary for a truly enterprise-grade deployment, the reality is that you couldn't even guarantee that your setup was upgradable, at least not without further analysis.
Ultimately, in most cases "upgrade" really meant redeployment of the whole cloud and then migrating workloads to the new infrastructure. Projects such as CloudFerry were developed to ease the burden of resource migration between these clouds, but clearly that wasn't the answer.
Technically speaking, an OpenStack upgrade consists of three steps (sketched below for a single service):
- Upgrade packages - upgrade to the latest packages, with binaries and python code.
- Update config files - update configuration files with the latest parameters.
- Sync databases - run a database sync to update schemas to the new structure.
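For a single service such as Nova on an Ubuntu-based node, those three steps boil down to something like the following (a minimal, illustrative sketch assuming package-based installs; in MCP the equivalent work is driven by Salt formulas rather than typed by hand):
# 1. Upgrade packages - pull in the new binaries and python code
apt-get update && apt-get install --only-upgrade nova-common nova-api nova-conductor nova-scheduler
# 2. Update config files - merge new and deprecated options (if the package manager left a .dpkg-dist file)
vimdiff /etc/nova/nova.conf /etc/nova/nova.conf.dpkg-dist
# 3. Sync databases - migrate the schema to the new structure
nova-manage db sync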
In practice, though, there's more to a successful upgrade than those three steps. Before you start, you should also be:
- Verifying release notes to ensure that current features are available in the new release, especially considering architecture changes over time.
- Going through the list of your neutron, glance, and cinder backends and verifying compatibility for the target release. Neutron plugins are usually especially challenging, because the network is such a core part of the cloud. The last thing you want is for your workload to be disconnected during the upgrade.
- Adopting structure and configuration options from the service configuration files and merging them with existing configuration files. You'll need to go through the OpenStack Configuration Reference, as it will contain new, updated, and deprecated options for most services.
- Considering the approach to upgrading the environment. Some users will live migrate instances or relocate workloads. You must have a plan about what phases touch what part of the infrastructure in order to ensure that you do not cause downtime.
- Considering the impact of an upgrade to operation. The upgrade process interrupts management of the environment including the dashboard, but the upgrade should not touch significantly on existing/running instances. Instances might experience intermittent network interruptions, but it's important to keep these minimal.
- Developing an upgrade procedure and assessing it thoroughly by using a test environment similar to your production environment.
- Preparing for failure. As with all major system upgrades, your upgrade could fail for one or more reasons. You must have a procedure ready for this situation by having the ability to roll back the environment to the previous release -- including databases, configuration files, and packages, and not just containerized binaries.
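To make the "prepare for failure" point concrete, a bare-minimum snapshot of the state you'd need for a rollback might look like this (illustrative paths and a simple mysqldump shown here; MCP itself uses xtradb backups, as described later in this post):
# Dump all OpenStack databases (MySQL/Galera backend assumed)
mysqldump --all-databases --single-transaction > /backup/openstack-db-$(date +%F).sql
# Archive the current configuration files
tar czf /backup/openstack-etc-$(date +%F).tar.gz /etc/nova /etc/neutron /etc/glance /etc/cinder /etc/keystone
# Record the exact package versions currently installed
dpkg -l | grep -E 'nova|neutron|glance|cinder|keystone' > /backup/openstack-packages-$(date +%F).txt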
MCP OpenStack Upgrade from Mitaka to Ocata
Before we look at how this was done, let's start with defining what it was we actually did. MCP has a specific reference architecture, which has the advantage of limiting the number of different component versions we need to worry about. So let's start by defining the required component matrix, and what is and is not part of the upgrade. The core requirement was to do it as automatically as possible, with a minimum of manual steps. Here's where we started, and where we ended up:
Component | MCP Mitaka | MCP Ocata |
Control plane | VCP - Split into VMs | VCP - Split into VMs |
OS version | Ubuntu 14.04 VCP and 16.04 on computes | Ubuntu 16.04 |
OpenContrail | 3.1.1 | 3.1.1 |
OpenVSwitch | 2.6.1 | 2.6.1 |
QEMU/Libvirt | 2.5/1.3.1 | 2.5/1.3.1* |
Oslo messaging | 4.6.1 | 5.17.1 |
Keystone | 9.3.0 | 11.0.2 |
Glance | 12.0.0 | 14.0.0 |
Cinder | 8.1.1 | 10.0.3 |
Nova | 13.1.4 | 15.0.6 |
Neutron | 8.4.0 | 10.0.2 |
Heat | 6.1.2 | 8.0.1 |
Horizon | 9.1.2 | 11.0.2 |
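If you want to confirm which of these component versions you are actually running on a given node, before or after the upgrade, something as simple as this will do (package names shown are the standard Ubuntu cloud archive packages):
apt-cache policy nova-common neutron-common glance-common cinder-common keystone
dpkg -l | grep -E '^ii.*(nova|neutron|glance|cinder|keystone|heat)'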
Some things to keep in mind:
As I already mentioned, Neutron and the network in general are the most difficult parts to get right, and perhaps the most delicate. MCP supports 2 network backends:
- OpenVSwitch ML2 - If you're running OVS, you'll need to keep a particularly careful eye on this process; you will be upgrading neutron-openvswitch-agent as well as OVS itself, which can cause downtime on instance network connections if not done properly. On the other hand, MCP supports both standard DVR and non-DVR architectures, and thankfully there are no differences between Mitaka and Ocata from this point of view, so it is mostly a binary upgrade.
- OpenContrail - OpenContrail is quite independent of the OpenStack release itself, which is a huge benefit when it comes to upgrades. OpenContrail 3.1.1 can run with Kilo, Liberty, Mitaka, Newton and Ocata, so this upgrade does not really touch the data plane, which means zero outage for workloads.
- Glance Glare - The Glare Artifact Repository was added in the Newton release cycle, and Glance requires an extra service and Keystone endpoints to be configured for it.
- Nova Placement API - The Ocata release includes the new placement API as a mandatory service, so that must be added.
- Nova nova_cell0 - CellsV2 deployments are mandatory in Ocata, which requires an extra database in Galera and use of the online migration command.
- Cinder v3 endpoint - Ocata requires that the Cinder v3 endpoint be configured.
- Keystone v2 client support was removed entirely in Ocata, so only the openstack client can be used to manage endpoints, as sketched below.
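As an example of the endpoint work the last few points imply, here is roughly what registering the Placement and Cinder v3 endpoints looks like with the openstack client (hostnames, ports, and the region name are illustrative; in MCP these are generated from the reclass model rather than created by hand, and internal and admin endpoints are created the same way):
# New mandatory Placement API service and endpoint
openstack service create --name placement --description "Placement API" placement
openstack endpoint create --region RegionOne placement public http://ctl:8778
# Cinder v3 service and endpoint required by Ocata
openstack service create --name cinderv3 --description "Cinder Volume v3" volumev3
openstack endpoint create --region RegionOne volumev3 public "http://ctl:8776/v3/%(project_id)s"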
Nova db syncs
Doing an upgrade across two releases requires calling the Nova database synchronization in a specific order. Fortunately, DriveTrain is managed via Infrastructure as Code. That means that by editing Salt formulas, we can reliably control the way in which the cloud is built, and the way any changes are made. To make sure the changes are done properly, we first extended salt-formula-nova with modules that run the database syncs based on version; the full sequence is summarized in the sketch after this list. Using this procedure, we are able to skip the Newton release and jump directly to Ocata. These formulas make sure that we:
- Check the version of api_db and sync api_db sync to the latest version used in Newton (version 20):
- nova-manage api_db version 2>/dev/null
- nova-manage api_db sync --version 20
- Then run db sync to the latest version used in the Newton release (version 334):
- nova-manage db version 2>/dev/null
- nova-manage db sync --version 334
- Online data migrations to the latest versions of the Newton DB:
- nova-manage api_db version 2>/dev/null
- nova-manage db version 2>/dev/null
- nova-manage db online_data_migrations
- Create a cell mapping to the database connection for the cell0 database:
- nova-manage cell_v2 map_cell0
- Create the default cell:
- nova-manage cell_v2 create_cell --name=cell1
- Map hosts to the cell:
- nova-manage cell_v2 discover_hosts
- Map instances to the cell:
- nova-manage cell_v2 list_cells 2>&- | grep cell1 | tr -d "\n" | awk '{print $4}'
- nova-manage cell_v2 map_instances --cell_uuid <cell1_cell_uuid>
- Api_db sync to the Ocata version:
- nova-manage api_db sync
- Db sync to the Ocata version:
- nova-manage db sync
- Online data migrations to Ocata:
- nova-manage db online_data_migrations
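Putting those steps together, the sequence the formula drives boils down to the following (a condensed sketch; in practice each command is a separate Salt state with version checks between them, rather than one shell script):
# Bring both databases up to the last Newton schema versions first
nova-manage api_db sync --version 20
nova-manage db sync --version 334
nova-manage db online_data_migrations
# CellsV2 setup required by Ocata
nova-manage cell_v2 map_cell0
nova-manage cell_v2 create_cell --name=cell1
nova-manage cell_v2 discover_hosts
CELL_UUID=$(nova-manage cell_v2 list_cells 2>&- | grep cell1 | tr -d "\n" | awk '{print $4}')
nova-manage cell_v2 map_instances --cell_uuid "$CELL_UUID"
# Finally, sync both databases all the way to Ocata
nova-manage api_db sync
nova-manage db sync
nova-manage db online_data_migrations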
In general, we'll split the upgrade into 2 parts: the control plane and the data plane.
Ocata OVS DVR default route missing
One issue we hit was that the DVR FIP namespace was missing the default route after the upgrade to Ocata. This caused the disconnection of all instances using the SNAT router. To solve this problem, our Neutron team sent the following patch to upstream OpenStack: https://review.openstack.org/#/c/475108/ .
OK, now that we've got that out of the way, we can go ahead and start the actual upgrade process.
Upgrade the OpenStack control plane (controllers)
The control plane upgrade is mostly independent of the data plane upgrade, and it does not have to be done all at once. Moreover, as you will see in the demo video, we can upgrade just the control plane to Ocata and still boot instances on Mitaka-based compute nodes. The entire process is automated via the Jenkins pipeline, which has the following stages, but that doesn't mean that the process isn't human-controlled. Between each stage there is human judgment; approval is required to continue on to the next stage.
Let's look at what happens during each step.
1. Prerequisite - Reclass model preparation
The first step is to make sure that the cluster reclass model is changed to include the right configuration for Ocata. Customers will ultimately have access to detailed information about those steps in the MCP documentation, but basically it requires the following steps:
- Include the definition for the upgrade node, a VM called upg01 - this node is required for the Test Upgrade phase.
- Include classes to backup and restore all databases
- Include classes specific to Ocata, such as the Glare service, the Nova Placement API, the nova_cell0 database, and the Cinder v3 endpoint described above
2. Test the Upgrade - Single VM verification
This is the first stage of the actual pipeline, the goal of which is to verify cloud database consistency during the upgrade of the database schema. It automatically spins up a new single VM with an OpenStack controller and verifies API functionality against the dumped databases. DriveTrain will:
- Create a new VM called upg01 for a single node of OpenStack on one of the KVM foundation nodes.
- Automatically back up all databases via xtradb and rsync them to the backup node, which is by default the salt master.
- Create new databases in the existing galera cluster with an “_upgrade” suffix (as in nova_upgrade) and restore them from the previous backup. This clone of production databases will be used for schema API verification using the single VM node.
- Install a single openstack controller inside the upg01 VM
- Verify that the APIs are working correctly by calling the following commands:
openstack service list
openstack image list
openstack flavor list
openstack compute service list
openstack server list
openstack network list
openstack volume list
openstack orchestration service list
After this verification phase, we can be sure that the upgrade will not fail due to some specific database content or API configuration. It also verifies that the metadata model fits the Ocata deployment with the existing configuration.
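If you want to run the same sanity checks yourself outside of the pipeline, a trivial wrapper such as this (hypothetical, not part of DriveTrain) is enough:
for cmd in "service list" "image list" "flavor list" "compute service list" \
           "server list" "network list" "volume list" "orchestration service list"; do
  openstack $cmd > /dev/null || { echo "API check failed: openstack $cmd"; exit 1; }
done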
3. Real Upgrade - control plane rebuild
Once a human has verified that the test upgrade succeeded, he or she can approve moving on to the next stage, which does the actual upgrade of the OpenStack control plane. During this stage, nobody can provision a new workload, and the cloud dashboard is unavailable, but there will be zero downtime for existing workloads. During this phase, DriveTrain will:
- Back up the control and proxy node VM disks. It then stops the VMs of the control plane and copies their qcow2 disk files in case a rollback is needed.
- Create new VMs for control and proxy nodes based on Ubuntu 16.04. It automatically provisions these nodes and registers them back to the salt master.
- Execute the standard deployment phase for OpenStack control (ctl) and proxy (prx) nodes, performing redeployment from scratch on clean VMs with clean operating system installs.
- Verify that services are working correctly by testing all API commands and service lists.
4. (Optional) Rollback the Upgrade
The last stage is an optional rollback in case of any issues or failures. The automated pipeline stops and waits for rollback confirmation. During this time you should verify that your cloud is functioning properly. Should you need to roll the changes back, the procedure is:
- Change the source db data in the reclass model. (You will have configured this information when creating the backup/restore specification; now point to the original data.)
- Stop the VMs running your control plane nodes and restore the ctl and prx qcow2 disks.
- Restore the previous functional DB schemas via the xtradb backup.
- Start the controller VMs.
- Verify your services are working correctly.
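For reference, the disk-restore part of that rollback is conceptually nothing more than the following (a manual sketch assuming libvirt-managed VCP VMs, with illustrative VM names and paths; the pipeline automates this for every ctl and prx node):
virsh destroy ctl01                                          # stop the upgraded control plane VM
cp /backup/ctl01.qcow2 /var/lib/libvirt/images/ctl01.qcow2   # put back the saved Mitaka disk
virsh start ctl01                                            # boot the restored controller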
Demo Upgrade Video
All of that sounds like a lot of work, but in actuality there's really not much for you to do to achieve this upgrade. To see it in action, check out this video.
As you can see, we performed a real control plane upgrade from OpenStack Mitaka to OpenStack Ocata on a six-node bare metal hardware cluster with Open vSwitch. We also had the new Ocata controllers boot instances on the existing Mitaka computes.
The process looked like this:
- Check the MCP Mitaka-based cloud with existing instances running
- Launch the Upgrade control plane Jenkins pipeline
We then built the pipeline and kicked it off.
- When the real upgrade stage finished, we verified the control plane upgrade by booting an instance on the Mitaka-based compute nodes.
- We then confirmed the rollback action and reloaded the previous Mitaka control plane.
- Finally, we booted an instance to verify that the cluster works correctly.
Data plane upgrade (computes)
As you saw in the previous section, the compute node upgrade is almost completely independent of the control plane upgrade, which gives us a great deal of flexibility in scheduling upgrades across zones. As for upgrading the computes, the process will differ depending on the networking in use by the OpenStack cluster:
- OpenContrail - With OpenContrail used for networking, you only have to upgrade nova, qemu and libvirt, because OpenContrail is independent of the OpenStack release. However, since the versions of qemu and libvirt are the same, the only thing you actually have to upgrade is nova's python code. There is therefore zero touching of OpenStack instances, and zero downtime.
- OpenVSwitch - If you're using OpenVSwitch, the process is more complicated, because it involves upgrading the neutron-openvswitch-agent as well as OVS itself. However, MCP again ships the same version of OVS, so again all you're really upgrading is python code. When you upgrade OVS-based OpenStack compute nodes, instances may see a brief network interruption during the OVS reload and neutron synchronization, but there is effectively zero downtime, as we're just changing python code.
What's more, the pipeline enables the operator to review the output at each stage -- first on a test node, then on a sample subset of nodes, and so on -- before proceeding. Let's go through the steps:
- The operator defines a subset of test nodes, which will be used to show the difference between the two sets of packages.
To do that, you first specify the salt master credentials and URL so Jenkins knows where to find it. You can then specify the particular servers that you want to work with. For example, you can specify cmp* for all compute nodes whose names start with cmp, or I@nova:compute for all nodes carrying the nova compute pillar, as sketched after this list.
- The first stage updates the repository with the latest packages and shows packages waiting for upgrade on the test node.
- After approval, it runs the upgrade on the sample subset of nodes, upgrading packages and updating the appropriate configuration files. Then the operator can verify that the upgrade was successful, or wait for some time period before initiating the upgrade of the rest of the infrastructure.
- As the next step, the pipeline sets the default expiration for approving the upgrade of the rest of the infrastructure to two hours. If it's not confirmed, the upgrade is aborted. After confirmation, all nodes run the upgrade.
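For the targeting itself, the pipeline ultimately relies on standard Salt matching, so the node selection described above corresponds to commands like these (shown here run directly against the salt master for illustration):
salt 'cmp*' pkg.list_upgrades             # show packages waiting for upgrade on nodes named cmp*
salt -C 'I@nova:compute' test.ping        # target every node with the nova compute pillar
salt -C 'I@nova:compute' pkg.upgrade      # run the actual package upgrade on those nodes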
As you can see, this automated procedure gives you a very powerful way to manage the upgrade of all your hypervisors without accessing them manually, and with complete audit capabilities.
Conclusion
In the future, MCP will likely utilize a containerized control plane, which can speed up package installation by downloading docker images. This process can save as much as 30 minutes, cutting the control plane upgrade time to a total of only 15 minutes. However, even with these changes, we still need to perform all of the other steps, such as database backup, architecture changes, configuration changes, and so on. On the other hand, a containerized control plane adds extra layers such as Docker and Kubernetes, which can potentially increase operational complexity. We will have to monitor this and build additional tooling to provide a seamless upgrade of that layer as well.
In this blog, we demonstrated how easily we can take our customers from Mitaka-based to Ocata-based clouds with backward compatibility and without significant outages. With the right tooling, upgrading OpenStack is no problem.