Nova Scheduler-Database Interactions: How to Nail Those Scalability Thwarters
OpenStack is designed to be a versatile and scalable solution for building clouds. On the face of it, this sounds like a good definition. Everybody knows that the main reason for building clouds is high scalability. But it's a generic description that doesn't quantify that scalability.
If you’re an OpenStack user, you might want to know how much you can scale OpenStack, or, in other words, when to expect it to quit scaling. On the other hand, if you happen to be a developer, you might want to know the same thing but for another reason--you may want to know how to push this boundary. So, you might need to have a tool that can measure the cloud performance as a whole and also target specific components.
While the development of such tools (for example, the Rally project, which is in progress and very promising) is already underway, it may take some time for them to become available. But you can still find the sore spots that stymie scalability and overall performance, and you can do so without dedicated testing frameworks. I will show you a technique that addresses one specific issue, along with the results it produced and a suggested path to a solution. While rather narrow, this approach can serve as a base for further hotspot research.
What has haunted me while tweaking the Nova scheduler for some time now is the way the compute node states are handled. The idea behind scheduling in Nova is simple: Nova has to find a node with enough resources to run an instance, and the scheduler must know the current state of all of the available nodes to find such a node. Therefore, compute node states are essential to scheduling and must be stored in a safe place that is accessible to an arbitrary number of schedulers. Yes, there can be many schedulers; remember the versatility and scalability from the very first line. Right now, that place is a database, which concerns me. The current model keeps node-specific data in several tables, and on each scheduling request these tables are fetched from the database. This works fine for small clouds, where these tables are relatively small, but it is important to figure out how cloud size affects database performance and the related scheduling lag.
I decided to do a little research, focusing on one aspect of Nova scheduler's behavior--database access. I wanted to know how long it would take to obtain all of the compute node-related data from a database because this delay would be present in all of the instance requests. To figure this out, I developed simple tools to measure the delay. I’ll share how I did it and what I found out.
Measuring clouds
In an ideal world, you would take a server farm with some tens of thousands of machines, deploy OpenStack on it, run it for several years, record the precise timings and load responses, and draw conclusions from this data. Be that as it may, we don't live in an ideal world, and we have no means to perform real-life experiments with complex systems like neutron stars, tectonic plates, and huge OpenStack clouds. Whether we like it or not, we have to stick to synthetic experiments and simulations, even when it comes to OpenStack performance.
To carry out this experiment, I used a dedicated server equipped with a Xeon E5-2620 CPU, 32 GB of RAM, and a 256-GB SSD. Naturally, your mileage may vary, and these results should not be taken as universal truth, but they provide a reference point and allow for rough estimates: simply compare the processing power of your own server with the one used in this experiment. I used a test MySQL database and created three tables--compute_nodes, compute_node_stats, and services. Then, I populated these tables with fake compute nodes. Each record in compute_nodes had one corresponding record in the services table and ten in the compute_node_stats table. I did this to account for the fact that you often want very finely grained host state specifications, including but not limited to the number, characteristics, and states of physical devices available at the node, as well as networking specifications. Currently, you can do this by adding extra entries to the compute_node_stats table. Ten entries per node seemed like a reasonable middle ground between what we have now and what we might get in the very near future (remember the plethora of proposals for increasing the scheduler's awareness of node states, all of which would lead to more entries in compute_node_stats).
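For reference, the population step can be sketched roughly as follows. This is not the exact script I used: it writes straight to the test database with MySQLdb, the column lists are trimmed for brevity, and the hostnames and values are made up, so treat it as an illustration of the table layout rather than a drop-in tool.

import MySQLdb

NODES = 10000          # number of fake compute nodes to create
STATS_PER_NODE = 10    # extra key/value state entries per node

conn = MySQLdb.connect(host='127.0.0.1', user='testuser',
                       passwd='testpwd', db='testdb')
cur = conn.cursor()

for i in range(NODES):
    # One service record per node (column list trimmed for brevity).
    cur.execute("INSERT INTO services (host, `binary`, topic, report_count) "
                "VALUES (%s, 'nova-compute', 'compute', 0)",
                ('node-%05d' % i,))
    service_id = cur.lastrowid

    # The corresponding compute_nodes record (again, heavily trimmed).
    cur.execute("INSERT INTO compute_nodes (service_id, vcpus, memory_mb, "
                "local_gb, free_ram_mb) VALUES (%s, 8, 32768, 256, 32768)",
                (service_id,))
    node_id = cur.lastrowid

    # Ten fine-grained state entries per node.
    cur.executemany("INSERT INTO compute_node_stats (compute_node_id, `key`, value) "
                    "VALUES (%s, %s, %s)",
                    [(node_id, 'stat_%d' % j, '0')
                     for j in range(STATS_PER_NODE)])

conn.commit()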
Having filled the database, I measured how long it took to collect the data from all of the tables. To do so, I called db.compute_node_get_all() for different table sizes, from 1K to 19K entries, in 1K increments. The normalized results are shown in Figure 1.
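The measurement itself is nothing fancy. A minimal sketch of the timing loop might look like the following; the connection string and the FakeContext stub mirror the load script shown later, and the averaging over several runs is my own simplification.

import time

from nova import db

db.CONF.sql_connection = 'mysql://testuser:testpwd@127.0.0.1/testdb'

# A bare class works as a stand-in for an admin request context.
ctxt = type('FakeContext', (object,), {'is_admin': True, 'read_deleted': "no"})

def time_get_all(runs=10):
    # Average duration of a single db.compute_node_get_all() call.
    total = 0.
    for _ in range(runs):
        start = time.time()
        db.compute_node_get_all(ctxt)
        total += time.time() - start
    return total / runs

print(time_get_all())  # repeat after repopulating the tables at each size step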
Figure 1. Length of a db.compute_node_get_all() call as a function of the number of nodes.
So far so good: the database response time grows linearly with the table size. A simple fitting (see the sketch below) shows that the access time grows as t(n) = 0.32*n, where n is the number of nodes in thousands and t is in seconds, which is very nice indeed. Does that speak well for MySQL? Yes. Should it be interpreted as a sign of no foreseeable trouble with the current database usage scheme? Not so fast. The problem with this approach is that I measured the access time for only one process exchanging static data with the database. A real cloud behaves in a more complex way: there can be several concurrent requests for database access from different services, not to mention the constant stream of node state updates. Let's see how that behavior can be taken into account and how it thwarts this perfect picture.
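Before moving on, here is the kind of fit I am talking about: numpy.polyfit does the job in a couple of lines. The data below is a synthetic stand-in, since the actual measurements live in the plot above.

import numpy as np

# Synthetic stand-in for the measured points: table sizes of 1K..19K entries
# and call durations in seconds, roughly following the linear trend above.
sizes = np.arange(1., 20.)                              # thousands of nodes
timings = 0.32 * sizes + 0.01 * np.random.randn(sizes.size)

slope, intercept = np.polyfit(sizes, timings, 1)        # degree-1 (linear) fit
print("t(n) ~= %.2f*n + %.2f" % (slope, intercept))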
Simulating OpenStack database behavior
To make matters a little simpler, suppose that compute nodes and the scheduler are the only sources of database load. Thus, we only have to emulate the compute nodes' requests for state updates and the scheduler's requests for all of the compute node related data. Consequently, we'll obtain lower-bound values; in a real cloud, the database response is likely to be even slower. Let us start with the load caused by updates.
In a healthy cloud, compute nodes send state updates at least once a minute. Therefore, the more nodes in a cloud, the more uniform the load distribution. In other words, it is safe to assume that every second the database receives N/60 updates, where N is the number of nodes in the cloud and is sufficiently large; a 9,000-node cloud, for example, yields 150 updates per second. So, to emulate a normal database load, you have to devise a way to produce N/60 updates per second. The following script does exactly that.
import multiprocessing
import os
import time
import random

from nova import db
from nova.db.sqlalchemy import models
from nova.openstack.common.db.sqlalchemy import session

hps_minimum = 5        # if we manage to produce less than this many hits per
                       # second, then we probably have a problem.
barrage_dur = 5.       # length of a single loading period.
hps_maximum = 150      # this is the number of hits we want to experience.
setup_duration = 300.  # time limit for running the test.

db.CONF.sql_connection = 'mysql://testuser:testpwd@127.0.0.1/testdb'

# A bare class works as a stand-in for an admin request context.
cc = type('FakeContext', (object,), {'is_admin': True, 'read_deleted': "no"})


def worker(e, q, data):
    # Hammer the database with compute node updates and report hits per second.
    while True:
        if e.is_set():
            break  # received poison and dying.
        tstart = time.time()
        hitcounter, delta = 0., 0.
        while delta < barrage_dur:
            fmb = random.randint(0, 2048)
            try:
                db.compute_node_update(cc, random.choice(data),
                                       {'free_ram_mb': fmb})
            except Exception:
                pass
            delta = time.time() - tstart
            hitcounter += 1.
        hitcounter /= delta
        q.put(hitcounter)
        if hitcounter < hps_minimum:
            e.set()  # This can produce some thrashing, but it may become
            break    # an incarnation of grinding halt otherwise.
    return


if __name__ == '__main__':

    def kill_all(lst):
        for job in lst:
            job[0].terminate()

    compute_nodes = db.compute_node_get_all(cc)
    cn_ids = [x.id for x in compute_nodes]
    jobs, hitlist = [], []
    hits_per_second = 0
    fb_queue = multiprocessing.Queue()
    zero_day, current_day = time.time(), time.time()

    while (current_day - zero_day) < setup_duration:
        # Spawn another worker while we are more than 5% below the target load.
        if hits_per_second < hps_maximum - 0.05 * hps_maximum:
            poison = multiprocessing.Event()
            p = multiprocessing.Process(target=worker,
                                        args=(poison, fb_queue, cn_ids))
            jobs.append((p, poison))
            p.start()
        # Retire a worker if we overshoot the target load by more than 10%.
        if hits_per_second > hps_maximum + hps_maximum * 0.1:
            job = jobs.pop()
            job[1].set()
        hitlist.append(hits_per_second)
        hits_per_second = 0
        time.sleep(0.2)
        jobs = [x for x in jobs if not x[1].is_set()]
        for _ in jobs:
            hits_per_second += fb_queue.get(timeout=60)
        current_day = time.time()

    if jobs:
        kill_all(jobs)
        for job in jobs:
            job[0].join(30)
This script spawns processes until the desired database load (hps_maximum hits per second) is produced, and then it maintains this load for setup_duration seconds. While there are hundreds of ways to improve it, this script does the job. With it up and running, it is now possible to measure the duration of a db.compute_node_get_all() call under load. Let's do it the same way we did it earlier.
Figure 2. Length of a db.compute_node_get_all() call as a function of the number of nodes, with continuous data updates.
Figure 2 shows the results of this experiment. At first glance, it is clear that a database experiencing constant table updates responds more slowly. The question that must be answered, however, is what law this growth follows. If it is some kind of linear law, then everything is just fine. Unfortunately, my setup didn't allow me to carry out experiments with bigger table sizes, but a simple fit of the obtained data points produces the following quadratic function:
t(n) = 0.02*n^2 + 0.21*n + 0.32
Are the results bad? If yes, then how bad is bad? Figure 3 answers both questions.
Figure 3. A fitted response function for a greater number of compute nodes. The red dot marks the point at which the tested setup crosses a one-minute delay.
When you have 10K nodes, you can still pretend that you have no database access issues, since the scheduling delay is only a few seconds. When you have 20K nodes, you have to convince yourself that 13+ second scheduling delays are nothing compared with the net profit of using OpenStack. But as you approach 49.7K nodes, you cross an important threshold: at that point you are definitely working with outdated compute node states, because acquiring them takes more than a minute, and a minute is the default interval between state updates. Should the inability to get accurate compute node states be considered a problem? I think so.
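The 49.7K figure is simply the point at which the fitted curve crosses the 60-second mark, so it is easy to double-check with the quadratic formula:

import math

# Fitted response t(n) = 0.02*n**2 + 0.21*n + 0.32, with n in thousands of
# nodes and t in seconds.
a, b, c = 0.02, 0.21, 0.32

def nodes_for_delay(t):
    # Node count (in thousands) at which the fit reaches t seconds.
    return (-b + math.sqrt(b * b - 4 * a * (c - t))) / (2 * a)

print(nodes_for_delay(60.))  # ~49.6, i.e. roughly the 49.7K-node crossing point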
More measurements
Scared yet? Yes? No? Hold on, there is more. It is possible, at least in theory, to have several schedulers in your cloud. (Hey, isn't that the primary reason for doing all of this song and dance with a database instead of simply storing everything in the scheduler's own memory, where it eventually winds up anyway?) So one question naturally arises: what happens to the access time if several schedulers try to talk to the database at the same time?
To see what happens when there are multiple schedulers, I repeated the earlier experiment with one modification: this time, several processes issued db.compute_node_get_all() calls almost simultaneously. I carried out this experiment for three and nine "schedulers" to get a trend. For three schedulers, I got almost the same fit as for a single one:
t(n) = 0.02*n^2 + 0.28*n + 0.17
This means that t(n) grows slightly faster than in the single-scheduler case, but stays close to it.
Okay, maybe this is not that bad after all? Let's see what we can get with nine schedulers!
t(n) = 0.04*n^2 + 0.44*n + 0.17
It is still a quadratic function, but now it grows much faster: with nine schedulers, we hit the one-minute limit at only 33K nodes. I haven't performed further experiments, but adding more schedulers clearly won't reduce the response time, and judging from what I have seen, it is safe to assume that it will only worsen the situation. To put the results in perspective, I plotted all of the data in a single graph, shown in Figure 4.
Figure 4. All of the measured data in one plot. The “no load” plot corresponds to the case where the database was static; all of the other cases assume that periodic updates are present.
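For reference, emulating several schedulers boils down to firing the same call from several processes at once. A simplified sketch (reusing the FakeContext stub from the load script above) could look like this:

import multiprocessing
import time

from nova import db

db.CONF.sql_connection = 'mysql://testuser:testpwd@127.0.0.1/testdb'
ctxt = type('FakeContext', (object,), {'is_admin': True, 'read_deleted': "no"})

def timed_get_all(q):
    # One "scheduler": time a single full fetch of the compute node data.
    start = time.time()
    db.compute_node_get_all(ctxt)
    q.put(time.time() - start)

def measure(schedulers=3):
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=timed_get_all, args=(q,))
             for _ in range(schedulers)]
    for p in procs:        # start everyone almost simultaneously
        p.start()
    for p in procs:
        p.join()
    return [q.get() for _ in procs]

print(measure(3))   # run with 9 to reproduce the nine-scheduler case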
A possible solution
So, now that we've seen the data, we have to find a way to deal with it. There are two options, as I see it.
One possible solution to this problem is to discredit this database test due to a lack of database tuning and the fact that no one will ever use OpenStack for more than <insert_your_estimate> nodes.
Another way to deal with it is to think about how this problem could be mitigated.
I prefer the second approach. I am convinced that the problem stems from the need to share compute node states among multiple schedulers and from using a database for scheduler synchronization. Therefore, the problem could be solved by devising a model for storing and sharing state that does not require shoveling huge piles of data to and from a database. For instance, you could share incremental state updates via a queue or, even better, via a distributed hash table. I did some preliminary experiments, and it appears that memcached works well as a means of synchronization. You can see the changes needed to implement one approach to this solution here. I am planning to get hard data on a memcached-based state exchange soon, but according to my preliminary study, even in the worst case, when all of the data must be fetched from memcached, it takes less than a second for tens of thousands of nodes.
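To give an idea of what I mean, here is a minimal sketch of such a state exchange built on python-memcached. The key scheme, the JSON payload, and the helper names are my own illustration, not the actual patch:

import json
import time

import memcache  # python-memcached

mc = memcache.Client(['127.0.0.1:11211'])

def publish_state(host, state, ttl=120):
    # Compute node side: push the latest host state instead of a DB update.
    mc.set('compute_state/%s' % host, json.dumps(state), time=ttl)

def fetch_states(hosts):
    # Scheduler side: one get_multi instead of dragging whole tables from the DB.
    raw = mc.get_multi(['compute_state/%s' % h for h in hosts])
    return dict((k, json.loads(v)) for k, v in raw.items())

# A node reports its state, and a scheduler reads it back.
publish_state('node-00001', {'free_ram_mb': 2048, 'vcpus_used': 3,
                             'updated_at': time.time()})
print(fetch_states(['node-00001']))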
It is also worth noting that the idea of a smarter scheduler has been floating around for some time, and here "smarter" means obtaining more compute node characteristics more frequently. In light of this trend, scheduler performance becomes a major issue that must be addressed before it's too late and you are forced to shovel a gazillion lines of code to fix it.
Alexey Ovchinnikov has successfully built brain-computer interfaces for rodents and studied the behavior of chaotic electronic generators before the complexity of OpenStack stole away his attention. Now he devotes much of his time to making OpenStack even better than it is.