Edge of the Stack: Improve Performance of Python Programs by Restricting Them to a Single CPU
Articles and tutorials in the "Edge of the Stack" series cover fundamental programming issues and concerns that might not come up when dealing with OpenStack directly, but are certainly relevant to the OpenStack ecosystem. For example, drivers are often written in C rather than Python, leading to concerns about memory leaks and other C-specific issues. The "Edge of the Stack" series is meant to provide information on these peripheral, but still important areas.
Sometimes the combination of CPython, the global interpreter lock (GIL), and the operating system (OS) scheduler can decrease the performance of a multithreaded program because of the extra context switches and thread migration across all available CPU cores that this combination produces. In this article, we will explain the problem and show how easy it can be to improve performance in some cases.
We will begin with an example of a simple Python TCP server and client. The server creates a thread pool and then waits for clients to connect. When a connection is established, the server passes it to the pool for processing. The processing function reads one line at a time from the socket and simulates a CPU load for each one. After receiving 'bye\n', the server closes the connection. The client creates N connections and sends a fixed-size load through them.
python:
# One message is 100 spaces terminated by '\n'
data = ' ' * 100 + '\x0A'

def client(th_count):
    sockets = []
    # Open one connection per server thread, retrying a few times
    # in case the server has not started listening yet
    for i in range(th_count):
        sock = socket.socket()
        for cnt in range(3):
            try:
                sock.connect(host_port)
                break
            except socket.error:
                if cnt == 2:
                    raise
                time.sleep(0.1)
        sockets.append(sock)

    # Generate a fixed-size load, spread randomly across the connections
    for i in range(NUM_PACKETS):
        sock = random.choice(sockets)
        sock.send(data)

    # Tell every server thread to finish
    for sock in sockets:
        sock.send('bye\x0A')
python:
def server(th_count):
    def process_client(sock):
        num = 0
        while True:
            # Read one '\n'-terminated line, one byte at a time
            msg = ""
            while not msg.endswith('\n'):
                msg += sock.recv(1)
            if msg == 'bye\n':
                break
            # Simulate a CPU load for every received message
            for i in range(serv_tout):
                pass
            num += 1

    s = socket.socket()
    s.bind(host_port)
    s.listen(5)

    # Accept one connection per worker thread and process them all in a pool
    with ThreadPoolExecutor(max_workers=th_count) as pool:
        fts = []
        for i in range(th_count):
            sock, _ = s.accept()
            fts.append(pool.submit(process_client, sock))
        for ft in fts:
            ft.result()
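The two listings above omit the module-level setup. For completeness, here is a minimal sketch of the missing glue; the imports match what the code actually uses, but the constant values and the command-line handling are assumptions made for illustration, not the article's originals (in the real mt_test.py the imports and constants would sit at the top of the file):
python:
import sys
import time
import random
import socket
from concurrent.futures import ThreadPoolExecutor  # 'futures' backport on Python 2

# Assumed example values -- the originals are not shown in the article
host_port = ('127.0.0.1', 33331)   # address the server listens on
NUM_PACKETS = 100000               # total messages sent by the client
serv_tout = 50000                  # busy-loop iterations per received message

if __name__ == '__main__':
    # Invoked as 'python mt_test.py server 4' or 'python mt_test.py client 4'
    role, th_count = sys.argv[1], int(sys.argv[2])
    if role == 'server':
        server(th_count)
    else:
        client(th_count)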
The following are the code timings for the four-thread case:
shell:
$ python mt_test.py client 4 & time python mt_test.py server 4
real 0m8.342s
user 0m7.789s
sys 0m6.587s
Here are the same timings, but this time the server is run under 'taskset 0x00000001', which restricts the OS to scheduling all of the Python threads on a single core (out of the eight available):
shell:
$ python mt_test.py client 4 & time taskset 0x00000001 python mt_test.py server 4
real 0m4.663s
user 0m3.186s
sys 0m0.762s
The result may come as a surprise: the four-thread program runs faster when confined to a single core than when all eight are available. The first thing to understand here is the GIL: no matter how many Python threads are ready to run, only one of them is allowed to execute Python code at any particular moment.
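To see the effect of the GIL in isolation, here is a small self-contained demonstration (it is not part of mt_test.py, and the workload size is an arbitrary choice): two threads doing pure-Python CPU work take about as long as doing the same work sequentially, and are often a little slower because of the switching overhead discussed below.
python:
import time
from threading import Thread

def burn(n=10 ** 7):
    # Pure-Python CPU work; the GIL prevents two such loops
    # from executing Python bytecode at the same time
    while n:
        n -= 1

start = time.time()
burn()
burn()
print('sequential : %.2fs' % (time.time() - start))

start = time.time()
threads = [Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('two threads: %.2fs' % (time.time() - start))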
So we could not have expected a performance improvement just from running this code on a multi-core computer in the first place. Pure-Python programs are concurrent, but not parallel, and their performance mostly does not depend on how many cores are available. That explains why performance does not degrade after processor affinity is turned on. But why does it improve? There are two main reasons:
When you have more than one processor, you have extra thread context switching.
Threads cannot use the CPU caches simultaneously, so they continuously compete with each other for cache space.
Let’s take a closer look at what happens in the two-thread case. While the first thread is processing some data, the second is waiting for data from its socket. Then the second thread’s socket receives some data and the thread becomes ready to continue execution. The OS scans the available CPU cores, sees that the first core is busy with the first thread, and schedules the second thread onto the second core. The second thread starts and first tries to acquire the GIL. It fails, because the GIL is held by the first thread, so it goes back to sleep, waiting for the GIL to be released.
As a result, the OS, which is clueless about GIL semantics, is doing a lot of extra work. Part of this work is being done by the second CPU core and should not really slow down the first thread, but it does create extra load for the memory bus and the CPU cache. If the processor has HyperThreading (HT), the situation may be even worse.
The real problem, meanwhile, is that the second thread is now scheduled on the second core. When the GIL is released, the OS does not migrate it back to the first core: the scheduler is cache-aware and tries not to move a thread away from its current core without a good reason. As a result, all of the Python threads, which together can produce at most a 100 percent load on a single core, end up producing about a 12.5 percent load on each of the eight available cores.
In this situation, Python threads continuously jump between cores. Data is moved in and out of the L1/L2 caches and LLC/RAM, and each cache miss can cost hundreds of CPU cycles spent on memory access.
By restricting the OS to scheduling all of the server's Python threads on a single core, we eliminate most of these context switches. Other threads also (mostly) stop being scheduled onto this core, which further decreases the frequency of cache misses.
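The measurements below use the external taskset utility, but the same pinning can also be done from inside the program. Here is a minimal sketch, assuming Linux and Python 3.3 or newer, where os.sched_setaffinity is available (on Python 2.7, as used in the benchmark, you would stick with taskset or a third-party module):
python:
import os

def pin_to_core(core_id=0):
    # Restrict the calling process (pid 0 means "this process") to a single
    # CPU core; threads created afterwards inherit the affinity mask, so one
    # call at startup covers the whole thread pool.
    os.sched_setaffinity(0, {core_id})

pin_to_core(0)   # equivalent to running under 'taskset 0x00000001'
# ... create the ThreadPoolExecutor and start serving here ...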
For the tests, all measurements were taken on a Core i7-2630QM @ 2.90 GHz, Python 2.7.5, x64, Ubuntu 13.10, averaged over seven runs. (NOTE: Python 3.3 shows the same behavior.) To eliminate the influence of Turbo Boost, the CPU clock was lowered to 800 MHz. Here are the raw results (all times in seconds):
Threads | SUM | SUM_AF | %DIFF | SYS | SYS_AF | %DIFF | USR | USR_AF | %DIFF |
---|---|---|---|---|---|---|---|---|---|
1 | 3.35 | 3.55 | -5 | 0.54 | 0.52 | 4 | 2.78 | 3.03 | -8 |
2 | 7.26 | 4.63 | 36 | 4.91 | 0.67 | 86 | 5.10 | 2.95 | 42 |
4 | 8.28 | 4.90 | 41 | 6.58 | 0.76 | 88 | 7.37 | 3.14 | 57 |
8 | 7.96 | 5.00 | 37 | 6.49 | 0.84 | 87 | 7.32 | 3.15 | 57 |
16 | 9.77 | 5.88 | 40 | 6.53 | 0.73 | 89 | 7.01 | 3.15 | 55 |
32 | 9.73 | 6.84 | 30 | 6.54 | 0.81 | 88 | 7.06 | 3.04 | 57 |
Where,
SUM is the execution time, as shown by the 'real' field of the 'time' utility’s output.
SYS is the OS time, as shown by the 'system' field.
USR is the user time, as shown by 'user' field.
XXX_AF is XXX when CPU affinity is turned on. (In other words, SUM_AF, SYS_AF, and USR_AF numbers are SUM, SYS, and USR with CPU affinity turned on.)
%DIFF is the difference between XXX and XXX_AF, expressed as a percentage of XXX (a positive value means that affinity makes the program faster).
Running the program under VTune shows that with processor affinity turned on, the number of cache misses drops by roughly a factor of five and the number of context switches declines by a factor of 40. During the experiments I also found another interesting thing: if the program is restricted to a single core, Turbo Boost can be left enabled; with only one core loaded, it raises that particular core's clock and speeds the program up even further.
When can we expect this kind of speedup? Our example is CPU-bound: data arrives faster than it can be processed. For I/O-bound programs the improvement would be smaller. In other words, the higher the CPU load, the bigger the boost from pinning.
We’d expect that processor affinity would slow down the program in the following cases:
If some of the calculations are done in a C/C++ library that releases the GIL while it runs.
If you use Jython or IronPython.
If you use multiprocessing or ProcessPoolExecutor, which use processes instead of threads and allow Python to sidestep the GIL. Since processor affinity is inherited by child processes, you should either give each child process its own dedicated core or not set affinity for the children at all (see the sketch after this list).
In some single-threaded programs, such as those that use gevent.
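To illustrate the multiprocessing point above: child processes inherit the parent's affinity mask, so a parent pinned to one core would drag all of its workers onto that same core. Here is a minimal sketch, assuming Linux and Python 3.7 or newer (for the initializer argument of ProcessPoolExecutor); the helper pin_worker and the toy workload are invented for the example:
python:
import os
from concurrent.futures import ProcessPoolExecutor

def pin_worker():
    # Each worker re-pins itself, undoing the single-core mask inherited
    # from the parent; distributing cores by PID is a simple heuristic
    os.sched_setaffinity(0, {os.getpid() % os.cpu_count()})

def cpu_task(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    os.sched_setaffinity(0, {0})   # pin the parent, as taskset 0x00000001 would
    with ProcessPoolExecutor(max_workers=4, initializer=pin_worker) as pool:
        print(list(pool.map(cpu_task, [10 ** 6] * 4)))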
For further consideration
If you run a multithreaded Python program on CPython, it's worth trying to restrict it to one or two cores and comparing the results.
For further information, please check out the following resources: