Benchmarking OpenStack at megascale: How we tested Mirantis OpenStack at SoftLayer
It seems like one of the leading amateur sports in cloud is to ask if OpenStack is ready for Enterprise ‘prime time’ in production. Common water-cooler conversations ask about stability and performance of OpenStack-based clouds at scale. But what does ‘scale’ mean? What level of scale is relevant in the real world?
From our work with many enterprise-class OpenStack customers, we’ve observed that ‘large private cloud’ generally refers to something that runs into tens of thousands of VMs. Hundreds of users from hundreds of projects manipulate those VMs simultaneously. At this scale, I think it’s fair to say, a ‘private cloud’ essentially becomes a ‘virtual data center’.
A virtual data center has some meaningful advantages compared to a classic physical datacenter, not least in flexibility. The time between a customer's order and having the requested infrastructure available drops from days or hours to minutes or seconds. On the other hand, you need to be confident in your Infrastructure-as-a-Service platform's stability under load.
Scaling Mirantis OpenStack on SoftLayer
We set out to validate desired behaviors in Mirantis OpenStack under the loads you would expect to see in just such a virtual datacenter. Working with our friends at IBM SoftLayer, we carved out an infrastructure of 350 physical machines in two locations. And for starters, we wanted to establish a baseline SLA for virtual data centers powered by Mirantis OpenStack.
Obviously, size matters. But there were other questions about an OpenStack virtual datacenter at SoftLayer that we thought were important to answer right out of the gate. For example:
Could the VDC service hundreds of requests at the same time?
Would the user experience suffer as more virtual servers are added to the virtual datacenter?
To find the answer to the first question, we’d need a series of tests with increasing numbers of requests issued simultaneously. Our target concurrency for this phase of benchmarks was 500. We approached the second question by refining it to gauge how long it takes to provision a server in our virtual datacenter, and, more importantly, how duration depends on the number of existing servers.
Based on the expectations listed above, we designed a first set of benchmarks aimed at exercising the Mirantis OpenStack Compute infrastructure, the key component of the platform. The exercise targeted not only the Nova API and scheduler, but also the Compute agents on the hypervisor servers.
Because we're benchmarking deployment time, measuring the scalability of the OpenStack Compute service doesn't require creating any load inside the VMs, so we can use an OpenStack flavor with a small footprint, a small image to boot into this flavor and no tasks executed inside the VM. For default flavors (or instance types) defined in OpenStack, the size of the instance does not have a significant effect on provisioning time. However, we’d need to test on real hardware to have a clear understanding of what the Nova Compute agent could handle under load.
The baseline SLA parameters we wanted to verify were:
variance of virtual server instance boot time, including any dependency of boot time on the number of virtual instances already recorded in the cloud database;
success rate of the boot process for our instances; in other words, how many requests out of 100 we should expect to fail for one reason or another.
We set a target for this benchmark at 100,000 virtual instances running in a single OpenStack cluster. With such a small-footprint flavor, our experiment could be compared to a micro-site service for mobile applications. Another possible use case for this type of workload is a testing environment for a mobile application server with thousands of clients running on independent embedded systems. Our hypothesis was that we could figure this out with several hundred servers, generously provisioned by SoftLayer for a limited period of time, as our test bed.
Setting up an OpenStack benchmarking environment
“How much hardware do you need?” is a question we hear often, and we like to joke with customers that the answer is “More.” But it was clear that running a hundred thousand VMs required sufficient hardware firepower. IBM SoftLayer has a pretty broad variety of configurations available, and they were as curious as we were to see what OpenStack could deliver at this level.
To calculate the hardware resources needed, we first established a type of instance, or ‘flavor’ in OpenStack terms. Our flavor has 1 vCPU, 64 MB of RAM and 2 GB of disk, smaller even than an AWS t1.micro, for example. This is just enough to boot a Cirros test image.
Using a very small instance type doesn't skew the results, because for the default flavors (or instance types) defined in OpenStack, the size of the instance has no significant effect on provisioning time. A flavor only defines the number of vCPUs, the amount of RAM, and the size of the disk file, and those resources are provisioned with effectively zero variance with respect to instance size. Unlike an AWS instance, which may also involve provisioning additional services such as EBS, an OpenStack instance is self-contained; creating its ephemeral storage essentially amounts to creating an empty file. For this reason, we can use a very small instance type without distorting the test results.
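For reference, defining such a flavor programmatically is a one-liner against the Nova API. The snippet below is only an illustrative sketch using python-novaclient and keystoneauth1, with placeholder credentials, endpoint and flavor name; it is not the exact tooling we used against Havana:

# Sketch: creating the tiny benchmark flavor (1 vCPU, 64 MB RAM, 2 GB disk).
# The auth URL, credentials and flavor name below are placeholders.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client

auth = v3.Password(auth_url="http://controller:5000/v3",
                   username="admin", password="secret",
                   project_name="admin",
                   user_domain_name="Default",
                   project_domain_name="Default")
nova = client.Client("2", session=session.Session(auth=auth))

# ram is specified in MB, disk in GB
flavor = nova.flavors.create(name="m1.bench-tiny", ram=64, vcpus=1, disk=2)
print(flavor.id, flavor.name)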
Totals for resources are:
- 100,000 vCPUs,
- 6,250 GB of RAM,
- 200 TB of disk space.
So we concluded that a good representative Compute server configuration for this test would include 4 physical cores (plus 4 HyperThreading siblings), 16 GB of RAM and 250 GB of disk space. Because we're not going to run any actual load on the VMs, we can comfortably overcommit CPU at 1:32 (twice the default overcommit ratio for OpenStack) and RAM at 1:1.5, which gives us a density of roughly 250 virtual servers per physical node. With these non-standard overcommit ratios, we only needed approximately 400 hardware servers to hit the target of one hundred thousand VMs.
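The back-of-the-envelope arithmetic behind those numbers is simple enough to write down; here is a quick sketch in Python using the figures above:

# Fleet sizing from the figures above.
flavor_vcpus, flavor_ram_mb, flavor_disk_gb = 1, 64, 2
target_vms = 100000

total_vcpus = target_vms * flavor_vcpus                 # 100,000 vCPUs
total_ram_gb = target_vms * flavor_ram_mb / 1024.0      # 6,250 GB of RAM
total_disk_tb = target_vms * flavor_disk_gb / 1000.0    # 200 TB of disk

# Per-node density: 4 cores + 4 HT siblings, 16 GB RAM,
# with 1:32 CPU and 1:1.5 RAM overcommit.
by_cpu = 8 * 32                                         # 256 VMs by CPU
by_ram = int(16 * 1.5 * 1024 / flavor_ram_mb)           # 384 VMs by RAM
density = 250                                           # working figure, just under min(by_cpu, by_ram)

nodes_needed = target_vms / density                     # ~400 Compute servers
print(total_vcpus, total_ram_gb, total_disk_tb, nodes_needed)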
Knowing that our controllers would be handling the heaviest load, we spec’d servers for those with a more advanced configuration. Each controller had 128GB of RAM and SSD storage to accelerate the work of the OpenStack state database that runs on the controller.
One interesting note: due to differences in configuration between Compute and Controller nodes, they were provisioned in different SoftLayer datacenters, with 40 ms of latency between them; it's not unusual to expect exactly this kind of distributed configuration in a real cluster. Fortunately, the latency between the SoftLayer datacenters was low enough that the distributed architecture had no noticeable effect on the benchmark.
OpenStack details
We took release 4.0 of the Mirantis OpenStack product, which includes OpenStack 2013.2 (Havana), and deployed it to servers provided by SoftLayer. To do this, we had to tweak our Deployment Framework: because we could not boot off the Mirantis OpenStack ISO in SoftLayer, we copied the needed code over the standard operating system image via rsync. Once we'd deployed the cluster, we could use it for benchmarking.
Our configuration included only three OpenStack services: Compute (Nova), Image (Glance) and Identity (Keystone). This provides basic Compute functionality, which is sufficient for our use case (mobile platform testing).
For networking, we used the Nova Network service in FlatDHCP mode with the multi-host feature enabled. For our simple use case (a single private FlatDHCP network and no floating IPs), it has full feature parity with Neutron. (Nova Network was recently restored from deprecated status in OpenStack upstream.)
During the benchmark, we had to increase certain parameters from the default OpenStack configuration. The most important change was the size of the connection pool for the SQLAlchemy module (the max_pool_size and max_overflow parameters in the nova.conf file). In all other respects we leveraged the standard Mirantis OpenStack HA architecture, including a synchronously replicated multi-master database (MySQL+Galera), a software load balancer for the API services (haproxy), and Corosync/Pacemaker for IP-level HA.
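To make it clear what those two options control: they are passed down to SQLAlchemy's connection pool. The snippet below only illustrates the equivalent SQLAlchemy call, with a placeholder database URL; it is not a copy of our nova.conf:

# Illustration: what max_pool_size / max_overflow govern at the SQLAlchemy level.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql://nova:password@192.168.0.2/nova",  # placeholder connection URL
    pool_size=100,     # persistent connections kept open (SQLAlchemy default is 5)
    max_overflow=100,  # additional connections allowed during bursts
)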
Rally benchmark engine
To actually run the benchmark test, we needed a test engine; the logical choice was the Rally project for OpenStack. Rally automates test scenarios in OpenStack; that is, it lets you configure the steps to be performed for each test run, specify the number of runs, and record the time to complete each request. With these metrics, we could calculate the parameters that needed verification: boot-time variance and success rate.
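As a rough illustration of how numbers like those in the result tables below can be derived from raw timings, here is a small sketch of our own (not Rally's code); each result is a (duration, succeeded) pair:

# Sketch: summarizing raw boot results into the metrics reported below.
def summarize(results):
    ok = sorted(d for d, succeeded in results if succeeded)
    def pct(p):  # nearest-rank percentile over successful boots
        return ok[min(len(ok) - 1, int(p / 100.0 * len(ok)))]
    return {
        "total": len(results),
        "success_rate_pct": 100.0 * len(ok) / len(results),
        "min": ok[0], "max": ok[-1], "avg": sum(ok) / len(ok),
        "p90": pct(90), "p95": pct(95),
    }

# Example with four fake results (three successes, one failure):
print(summarize([(33.5, True), (66.1, True), (120.1, True), (95.0, False)]))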
We already had experience using Rally on SoftLayer, so that helped us a lot in this benchmark.
The benchmark configuration looks something like this:
{
    "execution": "continuous",
    "config": {
        "tenants": 500,
        "users_per_tenant": 1,
        "active_users": 250,
        "times": 25000
    },
    "args": {
        "image_id": "0aca7aff-b7d8-4176-a0b1-f498d9396c9f",
        "flavor_id": 1
    }
}
Parameters in this config are defined as follows:
execution is the mode of operation of the benchmark engine. In ‘continuous’ mode, each Rally worker thread issues its next request as soon as the previous one returns (or times out); there is a sketch of this behavior after this list.
tenants is the number of temporary Identity tenants (or projects, as OpenStack now calls them) created in the OpenStack Identity service. It reflects the number of projects which could correspond, for example, to versions of a tested mobile platform.
users_per_tenant is the number of temporary end-users created in every temporary project mentioned above. Rally threads issue actual requests on behalf of randomly selected users from this list.
active_users is the actual number of Rally worker threads. All these threads are started simultaneously and begin sending requests. This parameter represents the number of users who are working with the cluster (creating instances in our case) at any given moment in time.
times is the total number of requests sent during the benchmark run. The success rate is calculated based on this number.
image_id and flavor_id are cluster-specific parameters used to boot virtual servers. Rally uses the same image and flavor for all servers in the benchmark iteration.
nics defines the number of networks each virtual server will be connected to. It isn't set in the config above, so the default applies: each virtual server is connected to a single network (the default network for the project).
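To make the ‘continuous’ semantics concrete, here is a minimal sketch of the idea in plain Python: a pool of active_users worker threads keeps issuing boot requests, each thread starting its next request as soon as its previous one finishes, until times requests have been issued in total. This is a simplified illustration, not Rally's implementation, and boot_server is a hypothetical stand-in for the real Nova call:

# Simplified illustration of Rally's "continuous" execution mode.
import time
from concurrent.futures import ThreadPoolExecutor

ACTIVE_USERS = 250   # concurrent worker threads (the active_users parameter)
TIMES = 25000        # total number of requests to issue (the times parameter)

def boot_server(i):
    start = time.time()
    # ... here Rally would ask Nova to boot a server and wait for ACTIVE ...
    return time.time() - start       # duration of this request

with ThreadPoolExecutor(max_workers=ACTIVE_USERS) as pool:
    # Only ACTIVE_USERS requests are in flight at any moment; each worker
    # picks up the next one as soon as its previous request completes.
    durations = list(pool.map(boot_server, range(TIMES)))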
The benchmark process
As a first step, we established a sustainable rate of concurrent requests. To do so, we created a set of benchmark configurations that differed in only two parameters, tenants and active_users, which we varied from 100 to 500. The number of VMs to boot in these runs was 5,000 or 25,000. To economize in the early stages, we started with a smaller cluster of 200 Compute servers plus 3 Controllers.
The following tables show how the time to boot a given virtual server depends on the number of concurrent requests.
Run 1.1. 100 concurrent sessions, 5000 VMs
Total VMs | Success rate, % | Min (sec) | Max (sec) | Avg (sec) | 90th percentile (sec) | 95th percentile (sec) |
5000 | 98.96 | 33.54 | 120.08 | 66.13 | 78.78 | 83.27 |
Run 1.2. 250 concurrent sessions, 25000 VMs
Total VMs | Success rate, % | Min (sec) | Max (sec) | Avg (sec) | 90th percentile (sec) | 95th percentile (sec) |
25000 | 99.96 | 26.83 | 420.26 | 130.25 | 201.1 | 226.04 |
We encountered bottlenecks a couple of times during these benchmarks, but all of them were resolved by increasing the size of the connection pool in SQLAlchemy (from 5 to 100 connections) and in haproxy (from 4,000 to 16,000 connections). We also saw Pacemaker falsely detect haproxy failures several times. These false positives were caused by a depleted pool of haproxy connections and caused Pacemaker to switch its virtual IP over to another Controller node; in two cases, the load balancer became unresponsive after the failover. Raising the connection limit in haproxy solved these problems.
With that done, our next step was to increase the number of VMs in the cloud toward the target of 100,000. We had Rally run the test load with another set of configurations, with target populations of 10,000, 25,000, 40,000, 50,000, 75,000 and 100,000 simultaneous virtual servers; we found that 250 parallel requests was the highest stable level of concurrency. At this point we added another 150 physical Compute hosts to the cluster, for a total of 350 Compute servers plus 3 Controllers.
Based on what we’d learned, we revised our goals and lowered our target number from 100,000 VMs to 75,000. This allowed us to continue the experiment with a 350+3 cluster setup.
With the given level of concurrency, it took just over 8 hours to deploy 75,000 virtual servers with the instance configuration we chose. The following table summarizes the boot times we measured in that run.
Run 1.3. 500 concurrent sessions, 75000 VMs
Total VMs | Success rate, % | Min (sec) | Max (sec) | Avg (sec) | 90th percentile (sec) | 95th percentile (sec) |
75000 | 98.57 | 23.63 | 614.29 | 197.6 | 283.93 | 318.96 |
As you can see, with 500 concurrent connections, we saw an average time of just under 200 seconds per machine, or an average of 250 VMs every 98.8 seconds.
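Those throughput figures follow directly from the averages; here is a quick sanity check of the arithmetic:

# Sanity check on the throughput numbers from Run 1.3.
concurrency = 500
avg_boot_sec = 197.6
total_vms = 75000

vms_per_sec = concurrency / avg_boot_sec      # ~2.53 VMs booted per second
print(250 / vms_per_sec)                      # ~98.8 seconds per 250 VMs
print(total_vms / vms_per_sec / 3600)         # ~8.2 hours for the whole run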
Results: Mirantis OpenStack on SoftLayer
This first phase of the benchmarking effort allowed us to set baseline SLA expectations for a virtual datacenter tuned for compute resources (as opposed to, say, something that is storage- and I/O-intensive). Mirantis OpenStack is capable of serving 500 simultaneous requests, and can sustain 250 simultaneous requests while building up a population of 75,000 virtual server instances in a single cluster. In a ‘Compute’ architecture, virtual server instances are not expected to have permanent virtual block storage attached, and are not organized into complex network configurations. Note that, to be conservative, we didn’t “trick out” the cluster with any of OpenStack's scalability features, such as cells, a modified scheduler, or dedicated clusters for OpenStack services or platform services (such as the database or the message queue). This benchmark shows Mirantis OpenStack with its standard architecture; any of these modifications would have improved performance even further.
A critical success factor, of course, was the quality of the IBM SoftLayer hardware and network infrastructure used for the benchmark. Our results show that Mirantis OpenStack can be used to manage a large virtual datacenter on standard SoftLayer equipment. While we didn't instrument the network to measure the load on the connections, we observed no impact at all. In addition, the quality of the network connections between SoftLayer datacenters allowed us to easily build multi-DC OpenStack clusters.
Looking ahead: expanding the scope of OpenStack workloads
For the next phase of the benchmarking project, we plan to exercise SoftLayer hardware with benchmark tasks running inside the virtual server instances. We plan to establish service level thresholds for disk and network I/O based on common workload profiles (web hosting, virtual desktop, build/development environment), and gauge how Mirantis OpenStack behaves in each of those scenarios.