Horde3D

Next-Generation Graphics Engine
It is currently 24.11.2024, 23:44

All times are UTC + 1 hour




Post new topic Reply to topic  [ 32 posts ]  Go to page 1, 2, 3  Next
Author Message
 Post subject: Experiments with threads
PostPosted: 24.10.2008, 22:29 
Offline

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
I've done some hacks to use my dual core inside of Horde3D.
Profiling revealed that SceneManager::updateNodes and SceneManager::updateQueuesRec are quite expensive and might benefit from parallelization.

Assumptions: Any node in the scene graph depends only on its parent and their parents etc. That means we can process the nodes in any order as long as parents are processed before their children.

Idea: Use a pool of worker threads to do updateQueuesRec / SceneNode::update in parallel.
I started with updateQueuesRec and tried to transform the scene graph into a linear structure - split according to depth level of the nodes. Generating this data every frame proved to be too expensive and data locality wast lost, so I used a simpler approach:
Processing only the children of the root node in parallel instead of a parallelism on every recursion level. Additional cost is scattering (very cheap) / gathering the data.

Goal: Speedup factor of at least 1.5 which would result in an overall benefit for Chicago of around 8% and should increase with larger scene graphs.

Results:

Chicago:
100 instances of man: -6% / -3FPS
500 instances of man: +10% / +1FPS
1000 instances of man: no measureable change
500 instances of sphere and no animation: no measureable change
2000 instances of sphere and no animation: +15% / +1FPS

Terrain:
no reproducible change

Knight:
nothing really reproducible, but it showed a tendency to be slower with threads.





- for small scene graphs there's a tendency to be slower
- huge dependency: boost::thread
- added complexity
- benefits for large scene graphs as long as "enough" nodes are direct children of the root node to allow equal load distribution. Otherwise fallback to old behavior
- better testcase is needed


If anyone is interested in the code I can cleanup it up and post a patch against community svn trunk.


Top
 Profile  
Reply with quote  
PostPosted: 25.10.2008, 00:40 
Offline

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
I would be interested to see the code, if you don't mind - I might be able to offer suggestions to speed it up.

_________________
Tristam MacDonald - [swiftcoding]


Top
 Profile  
Reply with quote  
PostPosted: 25.10.2008, 03:52 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Good job ! I can't wait no longer to see the source code. Perhaps we can gain better performance boosts on Quad cores and higher. It's great to have a benchmark on Quad core and old HTT enabled cpus too :wink:


Top
 Profile  
Reply with quote  
PostPosted: 25.10.2008, 16:10 
Offline

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
So here's the code. Still a bit of a mess with local functions ;)

In egScene.h is a new define which enables / disables threads. In egScene.cpp are two locations where a variable NUM_PARTS is defined. These must be tuned more carefully to get the best results. Too few and the last thread takes too long while the main thread waits, too many and the overhead gets to large.

You need boost 1.34 / 1.34.1


Attachments:
File comment: Parallel processing of SceneManager::updateQueuesRec and SceneNode::update
threads.patch.zip [4.24 KiB]
Downloaded 892 times
Top
 Profile  
Reply with quote  
PostPosted: 26.10.2008, 16:36 
Offline

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
Just read through the code, looks like you did a very nice job. I don't think that scenes as small as the ones in the samples will benefit any from multi-threading though. If you build a scene with several hundred nodes, we might see an increase in performance.

_________________
Tristam MacDonald - [swiftcoding]


Top
 Profile  
Reply with quote  
PostPosted: 27.10.2008, 01:55 
Offline

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia
Nice work :wink:

I think this is important though:
Codepoet wrote:
In egScene.h is a new define which enables / disables threads.
On a quad-core CPU with my engine, I use one thread to call Horde functions, and at the same time 3 other threads start updating all the game entities for the next frame - so I would actually prefer Horde to be single-threaded :oops:


Top
 Profile  
Reply with quote  
PostPosted: 27.10.2008, 09:34 
Offline

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
DarkAngel wrote:
On a quad-core CPU with my engine, I use one thread to call Horde functions, and at the same time 3 other threads start updating all the game entities for the next frame - so I would actually prefer Horde to be single-threaded :oops:

If we choose to integrate this into the core I'd like to make this runtime configureable with an engine option which sets the number of worker threads. If you choose 0, you get the old behavior. If we additionally integrate some kind of profiler we could let the engine choose what's best depending on runtime statistics.


Top
 Profile  
Reply with quote  
PostPosted: 28.10.2008, 01:12 
Offline

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia
I had another think about it, and if threading is a long-term goal of Horde3D then it might fit with horde's philosophy to allow the user to create threads for it.
For example, we could allow the user to specify a call-back function that Horde can use internally when scheduling "threaded work"/"tasks". Then we can leave it up to the call-back function to take responsibility for creating threads and dividing "tasks" between them.

Code:
typedef void (*H3DTaskMain)( void* );
typedef bool (*H3DNewTask)( H3DTaskMain, void* );

void SetNewTaskCallback( H3DNewTask );


Then internally, when we have a "task" to do, we call the user's call-back function, or if they haven't set a call-back then fall back to a default.
Code:
H3DNewThread g_fnNewTask = 0;
void SetNewTaskCallback( H3DNewTask f ) { g_fnNewTask = f; }
void RunTask( H3DTaskMain f, void* arg )
{
    bool ok = false;
    if( g_fnNewTask )
        ok = g_fnNewTask( f, arg );
    if( !ok )//there was an error or no callback was set, just run the task in this thread
        f( arg );
}


Then we could write something like the following in the utility layer to provide a default multi-threaded implementation:
Code:
typedef std::pair<H3DTaskMain, void*> Work;
typedef atomic::queue<Work> WorkQueue;//atomic::queue doesn't actually exist, we would need to write a thread-safe queue of our own (I can help with that)
WorkQueue g_Work;
bool g_Abort = false;

//put the work into a thread-safe queue
bool ThreadedNewTask( H3DTaskMain f, void* arg )
{
    taskQueue.push_back( std::make_pair( f, arg ) );
    return true;
}
void TaskProcessingThreadMain()
{
    Work work;
    while( !g_Abort )
    {
        if( g_Work.pop_front( work ) )
        {
            work.first( work.second );
        }
    }
}
void InitThreadingUtils()
{
    g_Abort = false;
    Horde3D::SetNewTaskCallback( &ThreadedNewTask );
    for( int i=0 ... )
        new boost::thread( &TaskProcessingThreadMain );
}
void ShutdownThreadingUtils()
{
    g_Abort = true;
}


Top
 Profile  
Reply with quote  
PostPosted: 28.10.2008, 04:26 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
If there is a lot of work with multithreading, I'll be glad to help you :wink: Optimizing the utmath is near to end and some little bugs must be fixed [I think this takes ~1-3 weeks], after that I'm going to be a bit friendly with boost lib and learning their usage from their online manual [anybody knows any offline resources?].

tip : I think it's better to place the patched files too and this makes the integrating process a bit faster :wink:

BTW, it's better to share the work between community members and don't leave the optimizers alone :idea:


Top
 Profile  
Reply with quote  
PostPosted: 28.10.2008, 13:48 
Offline

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
DarkAngel:
The idea to allow the user to create the threads is really good. That would avoid any unwanted dependencies like boost.
But I see a few problems:
We can implement a threadsafe queue with atomic operations but the pop operation would have to block if there is nothing to do. Implementing that blocking efficiently needs some OS support. The same is true for a waitUntilEmpy function - but here we could at least help processing the work.
Another tricky issue is signaling shutdown / g_Abort: the while(!g_Abort) could be optimized away. But some kind of memory barrier should help here, at least gcc provides primitives for that. No volatile tricks, please.

A way around this would be if the user not only creates the threads for us, but also a queue implementation with efficient waiting. But this again can be more complicated than it sounds: My current implementation supports having independent thread pools, so that fire and forget tasks don't interfere with other sections of the engine were we have to wait for the results.
I don't use this feature at the moment, but it could be useful later. I'll think about more about this.

Siavash:
Posting the real files instead of patches is suboptimal - what if someone changes the same files in svn? Just happened with the "no using in headers" patch. Maybe it's just to easy to generate these patches with git and keep them up to date...
Anyone can help. My proof of concept implementation needs some more testing, especially with large scenes to show if it's really useful. Before integrating it into the engine we should discuss some more possible designs and their respective dis-/advantages.


Top
 Profile  
Reply with quote  
PostPosted: 29.10.2008, 00:47 
Offline

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia
Thanks ;)
Quote:
Another tricky issue is signaling shutdown / g_Abort: the while(!g_Abort) could be optimized away. But some kind of memory barrier should help here, at least gcc provides primitives for that. No volatile tricks, please.
A memory barrier isn't needed as the CPU's ordering of reads/writes from/to the abort flag are unimportant - we just need to stop the compiler from re-ordering or optimising away the flag.
Making that variable volatile should ensure that the compiler actually reads the value of the flag each iteration.
Unfortunately, newer versions of MSVC actually do insert memory barriers automatically when using volatile variables - as this isn't required in our case it's overkill, but hopefully not a performance-killer.
Quote:
We can implement a threadsafe queue with atomic operations but the pop operation would have to block if there is nothing to do. Implementing that blocking efficiently needs some OS support. The same is true for a waitUntilEmpy function - but here we could at least help processing the work.
I don't like the idea of pop blocking - IMO it should just return false if there is nothing for it to pop. In my example, this would result in a tight loop, so you could write something like this to cause it to block:
Code:
        if( g_Work.pop_front( work ) )
        {
            work.first( work.second );
        }
        else
        {
            g_Work.wait_until_not_empty();
        }

In my engine, I wouldn't want it to ever block - if the queue becomes empty then I will immediately re-use the thread to perform other tasks.

As for the efficient pausing/signalling/blocking - we can probably 'borrow' some code from GLFW, which implements cross-platform pause/resume.
Quote:
A way around this would be if the user not only creates the threads for us, but also a queue implementation with efficient waiting.
In the example that I gave, the queue is implemented in the utility library, which means the user doesn't have to use it if they don't want to - they're free to use their own implementation.
Quote:
But this again can be more complicated than it sounds: My current implementation supports having independent thread pools, so that fire and forget tasks don't interfere with other sections of the engine were we have to wait for the results.
I don't use this feature at the moment, but it could be useful later. I'll think about more about this.
If the user has their own work-queuing system, they should be able to correctly schedule horde tasks by passing their own function to Horde3D::SetNewTaskCallback. For example, within your callback you could put the tasks into a high-priority queue so they don't end up waiting for F&F tasks.


Top
 Profile  
Reply with quote  
PostPosted: 29.10.2008, 14:49 
Offline

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
Quote:
A memory barrier isn't needed as the CPU's ordering of reads/writes from/to the abort flag are unimportant - we just need to stop the compiler from re-ordering or optimising away the flag.

The memory barrier is detected by the compiler and will restrict optimizations of the flag accordingly ;)

Quote:
Making that variable volatile should ensure that the compiler actually reads the value of the flag each iteration.
Unfortunately, newer versions of MSVC actually do insert memory barriers automatically when using volatile variables - as this isn't required in our case it's overkill, but hopefully not a performance-killer.

Search for "volatile considered harmful" or something like that. It's explained fairly good in the Linux kernel docs what volatile is and what it's not. It's not intended do be used for thread synchronization / event signaling.

Quote:
As for the efficient pausing/signalling/blocking - we can probably 'borrow' some code from GLFW, which implements cross-platform pause/resume.

And I have looked for a small library with portable threads to use instead of boost...


Top
 Profile  
Reply with quote  
PostPosted: 30.10.2008, 00:41 
Offline

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia
Codepoet wrote:
And I have looked for a small library with portable threads to use instead of boost...
Actually, is the util library already dependant on GLFW?
If so, I think it has threads too :D

[edit]
Codepoet wrote:
Search for "volatile considered harmful" or something like that. It's explained fairly good in the Linux kernel docs what volatile is and what it's not. It's not intended do be used for thread synchronization / event signaling.


I found that linux document - it just seems to say that you should use their pre-written synchronisation primitives instead of trying to do it yourself with volatile.
Quote:
Like volatile, the kernel primitives which make concurrent access to data safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent unwanted optimization. If they are being used properly, there will be no need to use volatile as well.

There's still nothing wrong with using volatile, it's just linux has higher-level alternatives. Internally, these alternatives make use of volatile themselves.

In the case of a simple abort flag, we have simple requirements:
- the flag starts as false. It will be set to true once. It will not be set to false.
- many threads read the flag many times - these reads must not be optimised away.
- one thread will set the flag to true, one time.

In this case, race conditions are not a concern as they can't occur (only one writer) and even if they did (if there were two writers), the written value would be the same anyway.
So given the simple requirements we can get away with avoiding any kind of synchronisation primitive (barriers, spin locks, etc...) just as long as we ensure the compiler doesn't optimise the multiple reads (which is what volatile does).


Top
 Profile  
Reply with quote  
PostPosted: 30.10.2008, 01:41 
Offline

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
Ok, then we'll use volatile.

The utils library does not depend on glfw, only the samples. Probably not ideal to change that, so lets default to disabled for the thread support for now.


Top
 Profile  
Reply with quote  
PostPosted: 30.10.2008, 02:16 
Offline

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia
Codepoet wrote:
Ok, then we'll use volatile.
probably a good idea to put a comment in there with my justification of why it's ok, otherwise other people might not trust it either ;)

Quote:
The utils library does not depend on glfw, only the samples. Probably not ideal to change that, so lets default to disabled for the thread support for now.
If we can't put thread-creation in the util library, then I'd be up for writing some a sample program for GLFW and one for Boost that show how to create the threads (perhaps license the sample code as BSD so people can use it directly with their own code?).


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 32 posts ]  Go to page 1, 2, 3  Next

All times are UTC + 1 hour


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group