Horde3D

Next-Generation Graphics Engine
PostPosted: 02.12.2008, 22:44 

Joined: 02.12.2008, 21:25
Posts: 8
First off, I want to say how much I love Horde3D: its architecture is clean and simple, its implementation is 100% human readable (god I hate spaghetti code...) and its configurable pipeline is just awesome! That being said, I was slightly disappointed in the performance of the Chicago sample. Sure, the scene is complex, but is 100 fps really the best I can do on a ~500k triangle scene with no intense shaders and only one light? I didn't think so, so I dug a little deeper. I looked at my system's CPU usage and saw one core running at 60% and another at 90%. This indicated to me that my GPU was being starved and that's why the fps was low. I quickly installed AMD CodeAnalyst and profiled the sample. Here are the results:

For process sample_chicago.exe
Horde3D.dll - 39.32%
nvoglv32.dll - 26.74%
...

For Horde3D.dll
SceneManager::updateQueuesRec - 40.51%
JointNode::onPostUpdate - 16.87%
SceneNode::update - 12.4%
...

Two things are immediately apparent from this:
1. The high percentage of time spent in the OpenGL driver (nvoglv32.dll) shows there is room for improvement in the way Horde3D uses OpenGL.
2. The majority of Horde3D.dll's time is spent traversing the node tree.

To the first problem: the solution requires quite a bit of effort, since it means optimizing how OpenGL is used and can be very hardware dependent (see the small sketch below):
1. Minimize state changes that invalidate cached state and force the driver to re-validate it.
2. Reduce the number of calls to the OpenGL API.
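To give a rough idea of point 1, here is a tiny untested sketch of redundant-state filtering; bindTexture2DCached() is a made-up name and this is not how Horde3D binds textures today:
Code:
#include <GL/gl.h>

// Skip the driver call entirely when the requested texture is already bound.
static GLuint g_lastTex2D = 0;

void bindTexture2DCached( GLuint tex )
{
   if( g_lastTex2D == tex ) return;        // redundant state change -> no GL call
   glBindTexture( GL_TEXTURE_2D, tex );
   g_lastTex2D = tex;
}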

The second problem is easier to solve. After toying around some more with CodeAnalyst, I realized the reason behind the enormous amount of time spent in updateQueuesRec(): cache misses. Trees by nature offer poor locality of reference, which is why iterating over the tree causes many cache misses. There are two approaches to fixing this problem:
1. Minimizing cache misses
2. Minimizing the number of nodes accessed

For the first approach, there are several solutions (see the sketch after this list):
1. Allocate the nodes of the tree in a vector and iterate over the vector rather than the tree.
2. Create a list of nodes and iterate over the list while prefetching nodes further ahead in it.
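Here is a minimal, untested sketch of solution 1; NodePool and NodeData are made-up names and only stand in for whatever per-node data the queue update really needs:
Code:
#include <cstddef>
#include <vector>

// Per-node data copied into contiguous storage; just a stand-in here.
struct NodeData
{
   bool  active;
   bool  renderable;
   float bBoxMin[3], bBoxMax[3];
};

class NodePool
{
public:
   size_t add( const NodeData &data ) { _nodes.push_back( data ); return _nodes.size() - 1; }

   template< typename Func >
   void forEach( Func f )
   {
      // Linear walk over contiguous memory: the hardware prefetcher can stream
      // this, so there are far fewer cache misses than when chasing pointers
      // through heap-allocated tree nodes.
      for( size_t i = 0, s = _nodes.size(); i < s; ++i ) f( _nodes[i] );
   }

private:
   std::vector< NodeData > _nodes;   // contiguous storage for all nodes
};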

For the second approach I was able to make some headway really quickly. I made this simple assumption: "If the parent isn't going to be added to the renderable queue then it's likely its children won't be either." My first change was the following:
Code:
void SceneManager::updateQueuesRec( const Frustum &frustum1, const Frustum *frustum2, bool sorted,
                                    SceneNode &node, bool lightQueue, bool renderableQueue )
{
   bool nodeVisible = false;

   if( !node._active ) return;

   if( node._type == SceneNodeTypes::Group )
   {
      // LOD
      Vec3f nodePos( node._absTrans.c[3][0], node._absTrans.c[3][1], node._absTrans.c[3][2] );
      float dist = (nodePos - frustum1.getOrigin()).length();

      GroupNode *gn = (GroupNode *)&node;
      if( dist < gn->_minDist || dist >= gn->_maxDist ) return;
      nodeVisible = true;
   }
   else if( lightQueue && node._type == SceneNodeTypes::Light )
   {
      nodeVisible = true;
      _lightQueue.push_back( &node );
   }
   else if( renderableQueue && node._renderable )
   {
      if( node._type == SceneNodeTypes::Emitter )
      {
         // Emitters are a special case since we have to use their local bounding box
         // If the emitter is transformed particle positions don't change
         if( !frustum1.cullBox( *node.getLocalBBox() ) &&
             (frustum2 == 0x0 || !frustum2->cullBox( *node.getLocalBBox() )) )
         {
            if( sorted )
            {
               node.tmpSortValue = nearestDistToAABB( frustum1.getOrigin(),
                  node.getLocalBBox()->getMinCoords(), node.getLocalBBox()->getMaxCoords() );
            }
            nodeVisible = true;
            _renderableQueue.push_back( &node );
         }
      }
      else
      {
         if( !frustum1.cullBox( node._bBox ) &&
             (frustum2 == 0x0 || !frustum2->cullBox( node._bBox )) )
         {
            if( sorted )
            {
               node.tmpSortValue = nearestDistToAABB( frustum1.getOrigin(),
                  node._bBox.getMinCoords(), node._bBox.getMaxCoords() );
            }
            nodeVisible = true;
            _renderableQueue.push_back( &node );
         }
      }
   }

   if( nodeVisible )
   {
      // Recurse over children only when the parent itself was visible
      for( uint32 i = 0, s = (uint32)node._children.size(); i < s; ++i )
      {
         updateQueuesRec( frustum1, frustum2, sorted, *node._children[i], lightQueue, renderableQueue );
      }
   }
}


I then ran the Chicago sample and bang! over 35% improvement in fps. I was stunned that this didn't break the sample (since I have very little understanding of the inner workings of the engine) and by how large a gain such a small change could make. After testing the Knight sample, I did notice that this made the particles disappear, so I had to make the following change to the recursion condition:
Code:
if( nodeVisible || !node._renderable )
{
   // Recurse over children
   for( uint32 i = 0, s = (uint32)node._children.size(); i < s; ++i )
   {
      updateQueuesRec( frustum1, frustum2, sorted, *node._children[i], lightQueue, renderableQueue );
   }
}


This lowered the fps gain in the Chicago sample to 25%, but it did fix the particle problem. I believe it is possible to reorganize the scene hierarchy to attain the performance of the first attempt while retaining particle functionality.

I hope this insight will allow the developers to improve the performance of this wonderful engine. Hopefully I have also only scratched the surface of the optimisations that can be done!


PostPosted: 03.12.2008, 00:15

Joined: 19.11.2007, 19:35
Posts: 218
That's a pretty substantial performance improvement. Is that just updateQueuesRec()'s speed increase, or is it an across-the-board speed increase? Given that the Chicago demo is one big open space, it doesn't surprise me that the queue updates are massive. That function is meat and potatoes.

Do you have percentages for other applications using nvoglv32.dll? Also, what's your hardware? Deferred or forward shading? Windowed or fullscreen? Release or debug build?

The joint node doesn't surprise me; there are loads of joints in the Chicago demo. Both JointNode::onPostUpdate() and SceneNode::update() have quite a few back-to-back conditionals that might not be very predictable, and there could be some places where SceneNode::update() could be memoized (rough sketch below).
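Something like this is what I mean by memoizing, purely as an untested sketch; DirtyNode and recomputeAbsTrans() are made-up placeholders, not Horde3D's actual members:
Code:
#include <cstddef>
#include <vector>

struct DirtyNode
{
   bool                      dirty;
   std::vector< DirtyNode* > children;

   void markDirty()
   {
      if( dirty ) return;
      dirty = true;
      // Children inherit the parent transform, so they must be flagged too
      for( size_t i = 0, s = children.size(); i < s; ++i ) children[i]->markDirty();
   }

   void update()
   {
      if( !dirty ) return;        // memoized: last frame's result is still valid
      recomputeAbsTrans();        // the expensive matrix concatenation
      dirty = false;
   }

   void recomputeAbsTrans() { /* multiply local transform with parent's, etc. */ }
};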


PostPosted: 03.12.2008, 07:04

Joined: 02.12.2008, 21:25
Posts: 8
Wow, that was a quick response! Well, to answer your questions: the test was done in release mode, using windowed mode, and the increase was a global fps increase (I went from ~100 to ~125 fps with forward lighting; with deferred shading, performance went from ~130 to ~150 fps).

The only other game using OpenGL I could think of was World of Warcraft with the -opengl flag. Its nvoglv32.dll usage seemed to hover between 23 and 24%.

My system is the following:
AMD Athlon 64 X2 4400
NVIDIA GeForce 8800 GTS 320
4 GB RAM
Vista 64

JointNode::onPostUpdate() seems to spend the majority of its time doing matrix operations (which VC80 has nicely optimized with SSE2), so other than reducing the animation rate, I don't really see how to make it any faster.

As for SceneManager::updateQueuesRec(), the first instruction to touch the node ( if (!node._active) ) triggers a ton of cache misses. If truly every node must be tested, then perhaps the scene nodes should be allocated from a lookaside (or pool) list based on a vector. The vector could then be iterated instead of the tree, resulting in significantly better locality of reference and fewer cache misses. If a tree must be used, perhaps this function could be converted to be iterative (rather than recursive): children would be added to the end of a work queue, and nodes would be taken from the front of the queue each iteration. This would not only eliminate the overhead of the recursive function calls but would also allow items in the queue to be prefetched so they are already in the cache when they are used.
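An untested sketch of that iterative version; processNode() and PREFETCH() are placeholders for the real culling/queue code and for whatever prefetch intrinsic would be used, not existing Horde3D functions:
Code:
#include <deque>

void SceneManager::updateQueuesIter( SceneNode &root )
{
   std::deque< SceneNode* > work;
   work.push_back( &root );

   while( !work.empty() )
   {
      // Warm the cache for a node we will process a few iterations from now
      if( work.size() > 4 ) PREFETCH( work[4] );

      SceneNode *node = work.front();
      work.pop_front();

      processNode( *node );   // frustum culling / queue insertion as before

      // Children go to the back of the queue instead of a recursive call
      for( size_t i = 0, s = node->_children.size(); i < s; ++i )
         work.push_back( node->_children[i] );
   }
}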

Hope this helps!


PostPosted: 03.12.2008, 13:11

Joined: 21.08.2008, 11:44
Posts: 354
philippec wrote:
JointNode::onPostUpdate() seems to spend the majority of its time doing matrix operations (which VC80 has nicely optimized with SSE2), so other than reducing the animation rate, I don't really see how to make it any faster.
Here NOS PACK comes in handy to fully utilize your AMD Athlon 64 X2's SSE units for matrix operations. We could gain even better results by combining your idea with a vectorized utmath.h :idea:
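Just to show the direction (this is not the actual NOS PACK code, and the column-major layout plus the function name are my assumptions), an SSE sketch of a 4x4 matrix * vector multiply like the scalar one in utmath.h:
Code:
#include <xmmintrin.h>

void mat4MulVec4_SSE( const float *m /* 16 floats, column-major */,
                      const float *v /* 4 floats */, float *out )
{
   __m128 col0 = _mm_loadu_ps( m +  0 );
   __m128 col1 = _mm_loadu_ps( m +  4 );
   __m128 col2 = _mm_loadu_ps( m +  8 );
   __m128 col3 = _mm_loadu_ps( m + 12 );

   // out = col0*v.x + col1*v.y + col2*v.z + col3*v.w
   __m128 r = _mm_mul_ps( col0, _mm_set1_ps( v[0] ) );
   r = _mm_add_ps( r, _mm_mul_ps( col1, _mm_set1_ps( v[1] ) ) );
   r = _mm_add_ps( r, _mm_mul_ps( col2, _mm_set1_ps( v[2] ) ) );
   r = _mm_add_ps( r, _mm_mul_ps( col3, _mm_set1_ps( v[3] ) ) );

   _mm_storeu_ps( out, r );
}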

Thanks for the help :wink:


PostPosted: 06.12.2008, 22:26
Engine Developer

Joined: 10.09.2006, 15:52
Posts: 1217
Philippe, thank you very much for your profiling efforts and insights.

You are completely right with your idea to patch the updateQueuesRec function. Of course we should exploit the AABB tree for culling; after all, that's why it exists (and it also costs performance to keep it updated). We will certainly improve that code section in the future.

Slightly off-topic: I tried out the new CodeAnalyst 2.8 and I am excited by how much more useful it is than the previous versions. It fully integrates with Visual Studio now (2005 and 2008) and adds support for Instruction-Based Sampling and Call Stack Sampling. I always missed these two sampling modes since they can give a lot of insight.


PostPosted: 09.12.2008, 02:57

Joined: 02.12.2008, 21:25
Posts: 8
Although I haven't had the time to implement anything (damn final exams! :P), I have had plenty of ideas on how to optimize Horde3D. I just have a few questions as to the philosophy of the coding standard in Horde3D:
  • Would prefetching instructions (wrapped in pretty macros) fit within the coding standard aimed at code clarity?
  • Would adding openmp tags be acceptable?

Thanks in advance!


PostPosted: 09.12.2008, 17:26

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
philippec wrote:
Would prefetching instructions (wrapped in pretty macros) fit within the coding standard aimed at code clarity?
Aren't those going to be very architecture dependent? The engine does need to maintain portability.

_________________
Tristam MacDonald - [swiftcoding]


PostPosted: 09.12.2008, 18:57

Joined: 21.08.2008, 11:44
Posts: 354
swiftcoder wrote:
philippec wrote:
Would prefetching instructions (wrapped in pretty macros) fit within the coding standard aimed at code clarity?
Aren't those going to be very architecture dependent? The engine does need to maintain portability.
I agree with philippec; IMHO it's better to have some optimized versions, as a Swiss Army knife for the engine :wink:


PostPosted: 10.12.2008, 20:18
Engine Developer

Joined: 10.09.2006, 15:52
Posts: 1217
philippec wrote:
I just have a few questions as to the philosophy of the coding standard in Horde3D:
  • Would prefetching instructions (wrapped in pretty macros) fit within the coding standard aimed at code clarity?
  • Would adding openmp tags be acceptable?

I think we should strike a good tradeoff between code clarity/beauty and top performance. I want to emphasize that we are talking about hardcore optimizations now; of course the code should always be efficient.
For the inner loops of the most expensive / most often called functions (e.g. SceneNode updates) it is good to have maximum performance, even if the function becomes more cluttered with macros and other trickery. But at other locations, like resource loading or rarely used functionality where the optimizations are not noticeable in practice, I would prefer clarity. So hardcore performance optimizations are definitely desired, but only where we get real value out of them.
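For illustration only: one way such a macro could stay portable is to compile down to a no-op on platforms without a known prefetch intrinsic. The macro name and the intrinsics chosen here are just an assumption, not a decided interface:
Code:
#if defined( __GNUC__ )
#  define H3D_PREFETCH( addr )  __builtin_prefetch( (addr) )
#elif defined( _MSC_VER ) && ( defined( _M_IX86 ) || defined( _M_X64 ) )
#  include <xmmintrin.h>
#  define H3D_PREFETCH( addr )  _mm_prefetch( (const char *)(addr), _MM_HINT_T0 )
#else
#  define H3D_PREFETCH( addr )  ((void)0)   // no-op on unknown platforms
#endif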

OpenMP is another thing. There has already been some discussion about it here. DarkAngel brought up a good idea, which was basically that Horde can generate independent tasks and hand them, together with a callback, to the task/threading system of the game engine. This is probably the best solution for professional-level middleware, but on the other hand most of our users don't have such a complex system. So it would be nice if there were some basic threading support in Horde, as long as it does not clutter the code. If there are a few loops that we can parallelize with OpenMP, I think that's a good thing. OpenMP integration code is very lightweight, does not add a dependency and can easily be made optional (we could offer an engine option UseThreading).
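A minimal sketch of what such an optional loop could look like; the function name and the useThreading flag are hypothetical, and the pragma is simply ignored by compilers without OpenMP support:
Code:
#include <vector>

// Hypothetical helper: updates a flat list of nodes, optionally in parallel.
// Without OpenMP the pragma is ignored and the loop runs serially.
void updateNodesParallel( std::vector< SceneNode* > &nodes, bool useThreading )
{
   #pragma omp parallel for if( useThreading )
   for( int i = 0; i < (int)nodes.size(); ++i )
   {
      nodes[i]->update();   // per-node work must be independent for this to be safe
   }
}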

So you are very welcome to propose/prototype your optimization ideas :)


PostPosted: 11.12.2008, 22:06

Joined: 02.12.2008, 21:25
Posts: 8
Here is a better patch that takes advantage of the hierarchical nature of AABB trees for frustum culling. I think those of you who are CPU bound in the Chicago sample will be very, very pleased :D Personally, I got an fps increase of 30% - 50% (the fps seems to fluctuate a lot). Additionally, the Knight sample (and any other scene that has renderable nodes as children of non-renderable nodes) will work correctly.

Patched scene.h and scene.cpp
Attachment:
File comment: Updated scene manager to take advantage of the hierarchical nature of aabb trees.
Horde3DSceneUpdate.zip [7.82 KiB]
Downloaded 911 times


PostPosted: 05.01.2009, 01:06

Joined: 02.01.2009, 20:45
Posts: 11
First of all, let me apologize for posting on the developer's forum, but my problem is related to this post, hence I'm posting here.

I'm developing a crowd rendering infrastructure for my research on crowd simulation, and am trying to use Horde for it. I came upon this post when I started experiencing problems with Horde's performance. I'm aiming to render crowds of the scale of 100k and above. But even at 2k agents I noticed the computation was highly CPU bound, and using lower-resolution meshes did not help (on a Core 2 Duo P8400, GeForce 9600GT mobile). After applying philippec's patch for the scene graph update, I only observed a minor performance improvement. In all, I'm getting a peak rate of ~20 fps.

To clarify other execution details: my simulation runs in parallel with the rendering and has a very minor footprint. In my timing and performance analysis, I noticed that the renderer spends most of its time in drawrenderables and drawparticles (using a deferred pipeline). But the fact that the mesh resolution does not affect performance even with the forward pipeline implies, to me, that the bottleneck is CPU processing only. In addition, when scaling up to 20k agents, the initialization routines consumed 400 seconds.

I wanted to ask whether there has been any further improvement to the Horde engine since 1.0.0 Beta2, even as an experimental version, which could help me. If not, could someone point me in the right direction to resolve these issues?

Thanks in advance,
Abhinav


PostPosted: 05.01.2009, 02:11

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
abhinavgolas wrote:
First of all, let me apologize for posting on the developer's forum, but my problem is related to this post, hence I'm posting here.
No need to apologise, deeply technical questions belong here too.
Quote:
I'm aiming to render crowds of the scale of 100k and above.
Even on a 9600, you are going to be vertex bound before you hit the 100k mark - even at a (very conservative) 100 polygons per model, you are looking at 10 million triangles per frame.

Quote:
In my timing and performance analysis, I noticed that the renderer spends most of its time in drawrenderables and drawparticles (using a deferred pipeline). But the fact that the mesh resolution does not affect performance even with the forward pipeline implies, to me, that the bottleneck is CPU processing only.
How complex is your scene (apart from the agents)? I meant to optimise particle rendering a while ago, but don't currently have a video card Horde will run on.

Quote:
In addition, when scaling up to 20k agents, the initialization routines consumed 400 seconds.
The engine isn't reusing resources quite as efficiently as it might in places. With 20k agents, how many distinct model/material combinations are we looking at?

_________________
Tristam MacDonald - [swiftcoding]


PostPosted: 05.01.2009, 02:37

Joined: 02.01.2009, 20:45
Posts: 11
Thanks for the prompt reply :)

Quote:
Quote:
I'm aiming to render crowds of the scale of 100k and above.
Even on a 9600, you are going to be vertex bound before you hit the 100k mark - even at a (very conservative) 100 polygons per model, you are looking at 10 million triangles per frame.

Yes, I'll be trying that range on a better configuration; the lab at my university has an SLI config with 8800 GTXs and even a machine with a GTX 280, so I can probably do something about it. Even if it is vertex bound, I can at least understand that it's a hardware limitation, but being CPU bound is something I did not expect with Horde :)

Quote:
Quote:
In my timing and performance analysis, I noticed that the renderer spends most of its time in drawrenderables and drawparticles (using a deferred pipeline). But the fact that the mesh resolution does not affect performance even with the forward pipeline implies, to me, that the bottleneck is CPU processing only.
How complex is your scene (apart from the agents)? I meant to optimise particle rendering a while ago, but don't currently have a video card Horde will run on.

I derived the scene from the Chicago example; I'm using the same skybox, platform and model mesh, and simply replaced the crowd simulation module with mine. Oddly enough, I don't have an emitter or particles in the scene, and I don't think the example has any either, so I don't understand why the renderer needs to call drawparticles; I assumed it's something the deferred rendering uses internally.

Quote:
Quote:
In addition, when scaling up to 20k agents, the initialization routines consumed 400 seconds.
The engine isn't reusing resources quite as efficiently as it might in places. With 20k agents, how many distinct model/material combinations are we looking at?

As I said, it's just one model, and one anim file, and a simple flat hierarchy of all nodes being children of the Root Node. The issue is that if I'm going to try 100k, I'm going to have to leave my machine for about half an hour before the rendering actually begins, which seems very steep.


PostPosted: 05.01.2009, 03:21

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
abhinavgolas wrote:
Oddly enough, I don't have an emitter or particles in the scene, and I don't think the example has either, so I don't understand why the renderer needs to call drawparticles, I assumed it's an internal use for the deferred rendering.
Can you check with a forward rendering setup as well? I am thinking that there may be an issue under the hood there.
Quote:
As I said, it's just one model, and one anim file, and a simple flat hierarchy of all nodes being children of the Root Node. The issue is that if I'm going to try 100k, I'm going to have to leave my machine for about half an hour before the rendering actually begins, which seems very steep.
With only a single model, that startup time sucks - I have noticed this as well with one of my projects; it seems likely that Horde's current resource system is doing way too much heavy lifting.

_________________
Tristam MacDonald - [swiftcoding]


PostPosted: 05.01.2009, 10:30

Joined: 02.01.2009, 20:45
Posts: 11
swiftcoder wrote:
Can you check with a forward rendering setup as well? I am thinking that there may be an issue under the hood there.

I ran the program again with the forward rendering setup. Oddly enough, the drawparticles calls are still there, even though I have whittled the pipeline down to a drawgeometry command and a drawoverlay command. The persistence of these calls is simply puzzling. I'm using VTune to profile the code, btw, compiled with Intel's compiler (all evaluation versions).

swiftcoder wrote:
With only a single model, that startup time sucks - I have noticed this as well with one of my projects; it seems likely that Horde's current resource system is doing way too much heavy lifting.

Do you know of any tweaks/forks that have been made for Horde which would help in this regard?

