Horde3D

Next-Generation Graphics Engine

All times are UTC + 1 hour




Posted: 03.04.2012, 11:28

Joined: 23.07.2009, 21:03
Posts: 51
Location: Germany
Hello everybody.

I'm currently investigating the possibility of a vision-based steering approach for characters using Horde3D. Basically, I'm using a paper from SIGGRAPH 2010 (YouTube video) as inspiration for what will hopefully turn into my master's thesis.

For now I'm stuck because I'm experiencing heavy performance problems.
My current application is based on the Chicago sample, with the following "changes":

  • Every character is the parent of an additional camera node and a cylinder geometry. Each camera renders its character's "view".
  • The character cameras all use the same special pipeline, which only renders simplified geometries of the other characters (the cylinders).
  • All character cameras render into the same texture; their views are scaled down and tiled next to each other, depending on the number of cameras.
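
A minimal sketch of how such a tiled layout could be computed, assuming a square target texture and a square grid of tiles. `TileRect`, `tileRect`, and the grid scheme are illustrative names and choices, not taken from the actual application:

```cpp
// Hypothetical helper type: the pixel rectangle reserved for one camera
// inside the shared square render target.
struct TileRect {
    int x;
    int y;
    int size;
};

// Places `cameraCount` square tiles on the smallest square grid that can hold
// them all, filling rows from left to right. Assumes cameraCount >= 1.
TileRect tileRect(int cameraIndex, int cameraCount, int textureSize) {
    int tilesPerRow = 1;
    while (tilesPerRow * tilesPerRow < cameraCount) {
        ++tilesPerRow;
    }
    const int tileSize = textureSize / tilesPerRow;
    return TileRect{ (cameraIndex % tilesPerRow) * tileSize,
                     (cameraIndex / tilesPerRow) * tileSize,
                     tileSize };
}
```

With 64 cameras and a 1024x1024 target, this gives an 8x8 grid of 128-pixel tiles.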

This is what it looks like using 64 characters:
Image
And the view of a single character in fullscreen (the colors encode values such as distance from the character, velocity, etc.):
Image

The performance problem:
As one can see from the first screenshot, the app is almost 100% CPU-bound. With 200 characters (= 201 cameras), the CPU time spent on one frame is about 150 ms (Core2Quad @ 2.83 GHz / GeForce GTS 250) or even 250 ms (Core i7 @ 2.8 GHz / AMD Radeon HD 7XXX), built with Visual Studio 2010 in Release mode.

Having so many cameras leads to about 10,000 batches per frame. So my initial thought was that reducing the batch count would increase my framerate.

Therefore I removed the cylinders (each character had one attached to it, all cylinders using the same material) and instead put a single mesh, made up of 200 cylinders, into the scene. The idea was that I could later write a "Dynamic Batching Extension" if this reduced the CPU workload.

The new scene:
Image

But this didn't result in the improvement I had hoped for. Now I have one static mesh (200 cylinders) in the scene, which each camera renders while it walks around. This leads to 1 batch per "walker-cam" (and a lot more triangles), but only a slight improvement in computation time (120 ms instead of 150 ms on the NVIDIA PC).

Here are the profiling results for the Core2Quad @2.83GHz / GeForce GTS 250 computer:
[Images: profiler screenshots, individual cylinders (left) vs. 1 static cylinder mesh (right)]

As expected, the speed improvement comes from faster drawGeometry calculations.

If you're interested in the profiles for the Core i7 @ 2.8 GHz / AMD Radeon HD computer:
[Images: profiler screenshots, individual cylinders (left) vs. 1 static cylinder mesh (right)]

No idea what to make of these...

Anyway, what I'm looking for are suggestions on how to improve my framerate. For comparison: in the paper mentioned above they achieve 25 fps with 200 walkers, including the data transfer from the GPU back to the CPU (which is described as the major bottleneck), on a laptop with an Intel T7800 @ 2.6 GHz CPU and a Quadro FX 3600M graphics card. I have no idea what rendering engine they used.
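
To put these numbers side by side, the following sketch converts them into a per-camera time budget. It uses only the figures quoted in this post; the function names are illustrative:

```cpp
// Frame time in milliseconds for a given frame rate.
double frameTimeMs(double fps) {
    return 1000.0 / fps;
}

// Average time spent (or available) per camera in milliseconds.
double msPerCamera(double frameMs, int cameraCount) {
    return frameMs / cameraCount;
}
```

At 150 ms per frame and 201 cameras, the current version spends roughly 0.75 ms of CPU time per camera. 25 fps corresponds to a 40 ms frame, which for 200 walkers leaves 0.2 ms each, so the gap is a bit less than a factor of four on the NVIDIA machine.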

How can I speed up rendering when there are hundreds of cameras that all use the same pipeline?

I'd appreciate any comments and suggestions you guys have. I'll gladly share more implementation details, but I don't want this first post to be longer than it has to be :)


Posted: 04.04.2012, 16:27

Joined: 16.05.2009, 12:43
Posts: 207
One thing you might want to try is to add a render command to the pipeline. Then batch up all your drawing for the cameras into a single object node (make a new node type) and draw the cylinders as billboarded quads. Basically, look at the particle system for an example of how to add new node types and render them.

Those 10,000 batches are probably quite a lot, although I've heard conflicting things about the overhead of batching in OpenGL. But it seems likely you're not using a special shader for the cylinders or anything. I doubt you're geometry-bound, so it's more likely that having to draw each object separately is what's causing the slowdown.

But as with all these kinds of things, you will have to optimize for your own use case. So get a profiler and check how things are affected in your real application, not just in some specific test case. I use a profiler lib called Shiny.


Posted: 04.04.2012, 17:44

Joined: 23.07.2009, 21:03
Posts: 51
Location: Germany
I didn't put it in my first post, but I guess it would make sense to explain my current pipeline:

Every agent camera uses the "syntheticViewDeferred" pipeline. It is deferred because, with a forward pipeline, the depth buffer does not work when rendering to a texture.

The pipeline for each camera looks like this:
  • stage1: target = GBUFFER; class = synthetic (all the cylinders); context: render all cylinders, color them according to depth and uniforms
  • stage2: almost the same as the standard deferred lighting stage, but without lighting
  • stage3: draw a quad to the texture at a specific position (depending on the nodeId uniform) -> all cameras draw to this one texture, each at its own position

---------------


@zoombapup:
I'm not quite sure I fully understand what you are suggesting, but you just gave me an idea:
zoombapup wrote:
Then batch up all your drawing for cameras into a single object node (make a new node type) and then draw the cylinders as billboarded quads.

Let's assume I have 200 agents in my scene, each of them rendering its own viewport every frame. I can safely assume that each agent (=> cylinder) will be in the viewport of at least one other agent (given the crowd scenario).
So how about, instead of having 200 render(agent.camNode) calls like I have now, I do the following:
  • Only use one camera,
  • upload all my agents' geometries to the GPU, no matter where they are positioned
  • render my scene 200 times by transforming the viewProjMat according to the agent positions on the GPU
This would basically move all the work that Horde3D has to do for each render call to the GPU.
I'm not sure how to construct the pipeline so that it still lets me do everything I want, though, or whether it would result in a big performance improvement.
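
The per-agent view transform in this idea could look roughly like the sketch below. It assumes agents only rotate about the vertical (Y) axis and look down the negative Z axis at zero yaw, as in the usual OpenGL convention. `Vec3`, `Agent`, and `worldToAgentView` are hypothetical names; on the GPU the same math would be folded into the view-projection matrix.

```cpp
#include <cmath>

struct Vec3 {
    float x;
    float y;
    float z;
};

// Hypothetical agent state: world position plus heading (yaw) in radians,
// measured counterclockwise about the Y axis when seen from above.
struct Agent {
    Vec3 position;
    float yaw;
};

// Transforms a world-space point into the agent's view space by undoing the
// agent's translation first and its yaw rotation second (the inverse of the
// agent's world transform).
Vec3 worldToAgentView(const Agent& agent, const Vec3& p) {
    const float dx = p.x - agent.position.x;
    const float dy = p.y - agent.position.y;
    const float dz = p.z - agent.position.z;
    const float c = std::cos(agent.yaw);
    const float s = std::sin(agent.yaw);
    // Rotation about the Y axis by -yaw.
    return Vec3{ c * dx - s * dz, dy, s * dx + c * dz };
}
```

A point straight ahead of an agent ends up on the negative Z axis in that agent's view space, regardless of where the agent stands or which way it faces.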

Is that sort of what you meant?

I'm away until next week, so I can't try it out right now.


Posted: 16.04.2012, 11:42

Joined: 23.07.2009, 21:03
Posts: 51
Location: Germany
Small update:

I have a new version running which looks like this:
  • One camera for all characters, instead of one for each.
  • For this camera I generate a pipeline on startup that contains two stages for each agent:
Code:
...
...
<Stage id="VertexDepth42">
      <SwitchTarget target="BUFFER"/>
      <ClearTarget depthBuf="true" colBuf0="true"/>
      <DrawGeometry context="DEPTH" class="synthetic"/>
</Stage>
<Stage id="compResults42">
      <SwitchTarget target="AGENTRESULTS"/>
      <BindBuffer sampler="view" sourceRT="BUFFER" bufIndex="0"/>
      <DrawQuad context="ONE_CAM_COMPUTATION" material="materials/syntheticComputation.material.xml"/>
</Stage>
...
...

With 200 agents this results in a 2,000-line pipeline :D
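
Generating the per-agent stages at startup could look roughly like this. The XML mirrors the two stages shown above; the function names are hypothetical, and the rest of the pipeline (render target setup, any further stages, and how each stage selects its agent's view) is left out:

```cpp
#include <string>

// Emits the two stages for one agent, following the stage layout shown above.
std::string makeAgentStages(int agentId) {
    const std::string id = std::to_string(agentId);
    std::string xml;
    xml += "<Stage id=\"VertexDepth" + id + "\">\n";
    xml += "      <SwitchTarget target=\"BUFFER\"/>\n";
    xml += "      <ClearTarget depthBuf=\"true\" colBuf0=\"true\"/>\n";
    xml += "      <DrawGeometry context=\"DEPTH\" class=\"synthetic\"/>\n";
    xml += "</Stage>\n";
    xml += "<Stage id=\"compResults" + id + "\">\n";
    xml += "      <SwitchTarget target=\"AGENTRESULTS\"/>\n";
    xml += "      <BindBuffer sampler=\"view\" sourceRT=\"BUFFER\" bufIndex=\"0\"/>\n";
    xml += "      <DrawQuad context=\"ONE_CAM_COMPUTATION\" "
           "material=\"materials/syntheticComputation.material.xml\"/>\n";
    xml += "</Stage>\n";
    return xml;
}

// Concatenates the stages for agents 0 .. agentCount - 1.
std::string makeAllAgentStages(int agentCount) {
    std::string xml;
    for (int i = 0; i < agentCount; ++i) {
        xml += makeAgentStages(i);
    }
    return xml;
}
```

Each agent contributes 10 lines, so 200 agents produce 200 x 10 = 2,000 lines of stage definitions.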
Stage VertexDepth42 renders the view of agent 42 to the BUFFER-rendertarget.
Stage compResults42 is supposed to analyze this image and write its results to the AGENTRESULTS target at a specific pixel position reserved for this agent.

The next stage in the pipeline is VertexDepth43, which overwrites the BUFFER target with the view of agent 43.
compResults43 writes its results to the AGENTRESULTS target next to the ones from the previous agent.
AGENTRESULTS is downloaded to the application at the end of the frame using h3dGetRenderTargetData.
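
Reading one agent's values out of the downloaded buffer could look roughly like this. The sketch assumes the buffer holds 32-bit floats in row-major order with `compCount` components per pixel, and that agent i owns pixel i counted row by row. That layout and the names `readPixel` and `readAgentResult` are assumptions, not details from the application:

```cpp
#include <cstddef>
#include <vector>

// Returns the components of pixel (x, y) from a row-major float buffer, or an
// empty vector if the pixel lies outside the buffer.
std::vector<float> readPixel(const std::vector<float>& data, int width,
                             int compCount, int x, int y) {
    if (x < 0 || y < 0 || x >= width || compCount <= 0) {
        return {};
    }
    const std::size_t offset =
        (static_cast<std::size_t>(y) * width + x) * compCount;
    if (offset + compCount > data.size()) {
        return {};
    }
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(offset);
    return std::vector<float>(first, first + compCount);
}

// Assumed layout: agent i owns pixel i, counted row by row.
std::vector<float> readAgentResult(const std::vector<float>& data, int width,
                                   int compCount, int agentIndex) {
    if (width <= 0 || agentIndex < 0) {
        return {};
    }
    return readPixel(data, width, compCount,
                     agentIndex % width, agentIndex / width);
}
```

Here `data` stands for the buffer filled by h3dGetRenderTargetData, together with the width and component count it reports.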

Using this setup I doubled my framerate (200 agents = 20 fps on the AMD machine, 30 fps on the NVIDIA machine).
It could be a lot higher, but I still have to use DrawGeometry for each agent, meaning I am still uploading my synthetic scene from the CPU to the GPU for each agent.

With the available pipeline commands I see no way to use the uploaded geometry for all my characters.
Does anyone have suggestions, or know whether there is a way to reuse the geometry already uploaded to the GPU multiple times, without any culling or clipping?
Could this be made possible using a geometry shader?

_________________________________________________________________

Regarding batching/billboarding: I have not yet implemented any batching of the cylinders, nor the billboarded quads that zoombapup suggested. I'm sure this would improve my framerate further, but the main bottleneck (DrawGeometry, DrawQuad, and all the OpenGL calls that result from them) will remain.

Another optimization opportunity could be the 10% redundant OpenGL state changes I am currently having according to gDEBugger, which supposedly can "reduce render performance" and "are not cheap". I'll have to look into this.
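
One common way to avoid redundant state changes is to cache the last value that was set and skip calls that would not change it. The sketch below shows the idea with a generic wrapper; `CachedState` and its callback are illustrative and not part of Horde3D or OpenGL:

```cpp
#include <functional>
#include <utility>

// Wraps a state-setting function and filters out calls that would set the
// value that is already active.
template <typename T>
class CachedState {
public:
    explicit CachedState(std::function<void(const T&)> apply)
        : apply_(std::move(apply)) {}

    // Forwards the value to the wrapped function only if it differs from the
    // last value set. Returns true if the call was forwarded.
    bool set(const T& value) {
        if (hasValue_ && value == current_) {
            return false;
        }
        apply_(value);
        current_ = value;
        hasValue_ = true;
        return true;
    }

private:
    std::function<void(const T&)> apply_;
    T current_{};
    bool hasValue_ = false;
};
```

In a renderer, the callback would wrap the actual state call (for example, binding a texture or a shader). Whether this pays off here would still have to be measured with the profiler.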

Any input would be appreciated. Cheers.

