I have also noticed the bottleneck in updateQueuesRec() and, after some investigating, determined that it is caused by cache misses, not by the number of CPU instructions. Therefore, adding threads will not yield the 1.5x performance improvement you are hoping for. Furthermore, for scenes with a large scene graph, adding threads will yield even lower gains if your CPU has a shared cache (even the trusty Core 2 has a shared L2 cache), because the cache misses of one thread can pollute the cache for the other thread (i.e. lines prefetched for one thread may be evicted by a cache miss in another thread). Compound this with the overhead of synchronizing access to shared data, and you have to be very careful about when and how you use threads in order to actually get some benefit. As for which thread library to use, I would recommend looking into OpenMP as a better way of optimizing this function, since it is platform independent and wouldn't require additional source files (only a few #pragma directives here and there). Hope this helps.
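
P.S. To give an idea of what "a few #pragma directives" could look like, here is a minimal sketch. It is not the engine's actual code: Node, collectVisible() and collectVisibleParallel() are names I made up for illustration, and the visibility flag stands in for whatever culling test updateQueuesRec() really performs. Each top-level subtree is walked into its own buffer, so the threads need no locks, and the buffers are merged serially afterwards.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scene node; not the engine's real type.
struct Node {
    bool visible = true;
    std::vector<Node> children;
};

// Serial recursive walk: appends every visible node to a caller-owned queue.
void collectVisible(const Node& node, std::vector<const Node*>& queue) {
    if (!node.visible) return;  // a hidden node hides its whole subtree
    queue.push_back(&node);
    for (const Node& child : node.children)
        collectVisible(child, queue);
}

// Same result as collectVisible(), but the root's direct children are
// processed in parallel when the compiler has OpenMP enabled.
std::vector<const Node*> collectVisibleParallel(const Node& root) {
    std::vector<const Node*> result;
    if (!root.visible) return result;
    result.push_back(&root);

    const int n = static_cast<int>(root.children.size());
    // One private buffer per subtree, so no synchronization is needed
    // inside the parallel loop.
    std::vector<std::vector<const Node*>> partial(n);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
        collectVisible(root.children[i], partial[i]);

    // Serial merge in index order keeps the same order as the serial walk.
    for (const auto& part : partial)
        result.insert(result.end(), part.begin(), part.end());
    return result;
}
```

Built without OpenMP support (e.g. without -fopenmp on GCC), the pragma is ignored and the loop simply runs serially, which makes it easy to measure both variants. Given the cache behaviour described above, I would profile before assuming the parallel version is faster.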