Have you ever wondered how processor affinity influences a single threaded process on a multiprocessor machine? Well, I have. Today nearly all new machines come with 2 or 4 cores. If you’re lucky, you have an 8 core machine, and if you are very lucky, you get 16 or more to play with. And no, virtual cores don’t count 🙂

So what does processor affinity do to a process on a multiprocessor architecture? This depends on the system architecture. The two common system architectures today are SMP and NUMA. SMP is the classic multiprocessing used by Intel up to the Core2 generation, and NUMA has been used for a long time by AMD Opterons and descendants, and lately also by the Intel Core i7 family. On SMP it does not matter which data are processed on which core; on NUMA it does, because memory can be local or remote relative to the core used.

While playing with my workstation (Server 2008 x64 on a quad core Core2 Q6600 @ 3GHz with 8GB of RAM), I noticed a funny thing. The process I was running, a Python script, was jumping from core to core without any recognizable pattern. No other CPU intensive processes were running, only the usual system stuff. The script does some data analysis on quite large datasets – tens of gigabytes, which have to be processed as a stream, because of Python’s incredible memory hunger.

Before you start complaining that the script should have been parallelized in the first place, yes, it is. Parts of it at least are. But certain parts simply must be run sequentially.

So my question was: wouldn’t it be more efficient if the single threaded process ran on one dedicated processor? Well, unless the process scheduler is über smart, I think it would.
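For the curious, here is a rough sketch of how a script could pin itself to one core. It is not how I set the affinity for the tests below, and it is written for a current Python 3 rather than the 2.6.2 I tested with. The helper name pin_to_core is made up for this example. It assumes os.sched_setaffinity on Linux and the Win32 SetProcessAffinityMask call (via ctypes) on Windows.

```python
import os
import sys


def pin_to_core(core_index):
    """Restrict the current process to one logical core (hypothetical helper)."""
    if hasattr(os, "sched_setaffinity"):
        # Linux: pid 0 means "the calling process".
        os.sched_setaffinity(0, {core_index})
    elif sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
        # The affinity mask carries one bit per logical core.
        mask = 1 << core_index
        if not kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), mask):
            raise ctypes.WinError(ctypes.get_last_error())
    else:
        raise NotImplementedError("no affinity API known for this platform")
```

Calling pin_to_core(0) at the top of a script should keep the whole run on the first core. Child processes normally inherit the setting, so a parallel section would need to widen the affinity again.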

See, the jumps from core to core probably cause some serious cache thrashing, so if the process runs on one dedicated core, the cache can be utilized much better. Then again (and here is where my doubts come in), Python is an interpreted scripting language. The interpreter is quite complex, with its own infrastructure, and takes a lot of CPU power for itself. What if the scheduler is smart enough to split the different parts of the interpreter and the running byte code across multiple cores, and dedicate each piece of code to a core to avoid cache thrashing? That would be nice! Not as nice as an auto parallelizing Python, but still, a smart feature to have.

I looked at some documentation about the Server 2008 process scheduler, but could not find anything that would confirm or disprove any of my theories. Apparently there are optimizations related to NUMA architecture that minimize the distance between a process’s core/CPU and its memory, but nothing about maximizing cache utilization by jumping cores.

Therefore I ran my own tests. I tested the 32 and 64 bit versions of Python 2.6.2, as well as Psyco, the Python accelerator. Unfortunately, Psyco supports only 32 bit Python :(. All tests were run on the workstation mentioned earlier. Times are in minutes and seconds.

Processor Affinity            All Cores    Dedicated Core
Python 2.6.2 32bit                10:03             10:03
Python 2.6.2 64bit                10:54             10:42
Python 2.6.2 32bit + Psyco        10:14              9:38

Processor Affinity            All Cores    Dedicated Core
Python 2.6.2 32bit                41:58             39:46
Python 2.6.2 64bit                40:09             40:00
Python 2.6.2 32bit + Psyco        30:12             29:38

I ran each test only once because of time constraints, but I made sure that all tests ran under equal conditions. As you can see, the affinity gain is minimal but measurable (between 0 and roughly 6 percent of the run time), and specific to the executed code.
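To put numbers on "minimal", the timings can be turned into relative gains with a few lines of Python. This is just arithmetic on the two tables above. The helper names and the "first"/"second" table labels are mine, chosen for this sketch.

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def gain_percent(all_cores, dedicated):
    """Time saved by the dedicated core, as a percentage of the all-cores time."""
    base = to_seconds(all_cores)
    return 100.0 * (base - to_seconds(dedicated)) / base


# (label, all cores, dedicated core), copied from the tables above.
timings = [
    ("first table, 32bit", "10:03", "10:03"),
    ("first table, 64bit", "10:54", "10:42"),
    ("first table, 32bit + Psyco", "10:14", "9:38"),
    ("second table, 32bit", "41:58", "39:46"),
    ("second table, 64bit", "40:09", "40:00"),
    ("second table, 32bit + Psyco", "30:12", "29:38"),
]

for label, all_cores, dedicated in timings:
    print("%-28s %5.2f%%" % (label, gain_percent(all_cores, dedicated)))
```

With only one run per configuration, differences this small may well sit inside the normal run-to-run noise, so I would not read too much into the individual percentages.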

So generally I can say, yes, it makes sense to pin a process to a dedicated core, but the gain depends on the code used and is probably negligible. Ultimately, there are better ways to speed up your code.