First off, I want to say how much I love Horde3D: its architecture is clean and simple, its implementation is 100% human readable (god I hate spaghetti code...) and its configurable pipeline is just awesome! That being said, I was slightly disappointed in the performance of the Chicago sample. I mean sure the scene is complex but is 100fps really the best I can do on a ~500k triangle scene with no intense shaders and only 1 light? I didn't think so, so I dug a little deeper. I looked at my system cpu usage and saw one core running at 60% and another running at 90%. This indicated to me that my gpu was being starved and that's the reason the fps was low. I quickly installed amd code analyst and profiled the sample. Here are the results:
For process sample_chicago.exe
Hord3d.dll - 39.32%
nvoglv32.dll - 26.74%
...
for Horde3d.dll
SceneManager::updateQueueseRec - 40.51%
JointNode::onPostUpdate - 16.87%
SceneNode::update - 12.4%
...
Two things are immediately apparent from this:
1. The high % of on the opengl show there is room for improvement in the way Hord3D uses opengl.
2. The majority of Horde3D.dll's time is spent traversing the node tree.
To the first problem, the solution requires quite a bit of effort as it requires optimizing the use of opengl and can be very hardware dependant:
1. Minimize state changes that might invalidate data and cause driver to validate data.
2. Reduce calls to the OpenGl api.
The second problem is easier to solve. After toying around some more with code analyst, I realized the reason behind the enormous amount of time spent in updateQueueseRec(): cache misses. Trees by nature offer very poor locality of reference hence why iterating over the tree causes many cache misses. There are two approaches to fixing this problem.
1. Minimizing cache misses
2. minimizing the amount of nodes accessed.
To the first approach, there are several solutions:
1.Allocate the nodes of the tree in a vector and iterate through the vector rather than the tree.
2.Create a list of nodes and iterate through the list while prefetching nodes further ahead in the list.
For the second approach I was able to make some headway really quickly. I made this simple assumption: "If the parent isn't going to be added to the renderable queue then it's likely it's children won't either." My first change was the following:
Code:
void SceneManager::updateQueuesRec( const Frustum &frustum1, const Frustum *frustum2, bool sorted,
SceneNode &node, bool lightQueue, bool renderableQueue )
{
bool nodeVisible = false;
if (!node._active)
{
return;
}
if( node._type == SceneNodeTypes::Group )
{
// LOD
Vec3f nodePos( node._absTrans.c[3][0], node._absTrans.c[3][1], node._absTrans.c[3][2] );
float dist = (nodePos - frustum1.getOrigin()).length();
GroupNode *gn = (GroupNode *)&node;
if( dist < gn->_minDist || dist >= gn->_maxDist ) return;
nodeVisible = true;
}
else if( lightQueue && node._type == SceneNodeTypes::Light )
{
nodeVisible = true;
_lightQueue.push_back( &node );
}
else if( renderableQueue && node._renderable )
{
if( node._type == SceneNodeTypes::Emitter )
{
// Emitters are a special case since we have to use their local bounding box
// If the emitter is transformed particle positions don't change
if( !frustum1.cullBox( *node.getLocalBBox() ) &&
(frustum2 == 0x0 || !frustum2->cullBox( *node.getLocalBBox() )) )
{
if( sorted )
{
node.tmpSortValue = nearestDistToAABB( frustum1.getOrigin(),
node.getLocalBBox()->getMinCoords(), node.getLocalBBox()->getMaxCoords() );
}
nodeVisible = true;
_renderableQueue.push_back( &node );
}
}
else
{
if( !frustum1.cullBox( node._bBox ) &&
(frustum2 == 0x0 || !frustum2->cullBox( node._bBox )) )
{
if( sorted )
{
node.tmpSortValue = nearestDistToAABB( frustum1.getOrigin(),
node._bBox.getMinCoords(), node._bBox.getMaxCoords() );
}
nodeVisible = true;
_renderableQueue.push_back( &node );
}
}
}
if (nodeVisible)
{
// Recurse over children
for( uint32 i = 0, s = (uint32)node._children.size(); i < s; ++i )
{
updateQueuesRec( frustum1, frustum2, sorted, *node._children[i], lightQueue, renderableQueue );
}
}
}
I then ran the chicago sample and Bang! over 35% improvement in fps. I was stunned that this didn't break the sample (Since I have very little understanding of the inner workings of the engine) and by how much of a gain such a small change can make. After testing of the knight sample, I did notice that this made the particles disappear. I then had to make the following change to the recursion condition:
Code:
if (nodeVisible || !node._renderable)
{
// Recurse over children
for( uint32 i = 0, s = (uint32)node._children.size(); i < s; ++i )
{
updateQueuesRec( frustum1, frustum2, sorted, *node._children[i], lightQueue, renderableQueue );
}
}
This lowered the fps gain in the Chicago sample to 25% but did fix the particle problems. I do believe it is possible to reorganize the scene hierachy to attain the performance of the first attempt while retaining particle functionality.
I hope this insight will allow the developers to improve the performance of this wonderful engine. Hopefully I also just scratched the surface of all the optimisations that can be done!