Horde3D :: View topic - MSVC poor generated code performance

Author:	Siavash [ 26.09.2010, 15:04 ]
Post subject:	MSVC poor generated code performance
Here are the results of a few experiments I ran to see how much the /arch:SSE and /arch:SSE2 switches improve Horde3D performance. As you have already suggested, software skinning should benefit from such optimizations, so let's enable SWSkinning in the Chicago sample:

Code:
// Chicago Sample - crowd.cpp

void CrowdSim::init()
{
...

// Add characters
for( unsigned int i = 0; i < 200; ++i )
{
   Particle p;

   // Add character to scene and apply animation
   p.node = h3dAddNodes( H3DRootNode, characterRes );
   h3dSetNodeParamI( p.node, H3DModel::SWSkinningI, 1 );
   h3dSetupModelAnimStage( p.node, 0, characterWalkRes, 0, "", false );

   // Characters start in a circle formation
   p.px = sinf( (i / 100.0f) * 6.28f ) * 10.0f;
   p.pz = cosf( (i / 100.0f) * 6.28f ) * 10.0f;

   chooseDestination( p );

   h3dSetNodeTransform( p.node, p.px, 0.02f, p.pz, 0, 0, 0, 1, 1, 1 );

   _particles.push_back( p );
}
}

Now it's time to compile and compare the time spent on Geo Updates:

Code:
Normal code : about 60ms
/arch:SSE : about 60ms
/arch:SSE2 : about 130ms

Well, the results are quite interesting! There is no real difference between the normal and SSE code, but why is the SSE2 code 2x slower? The profiler says that most of the time is consumed by ModelNode::updateGeometry(), so I decided to compare the generated SSE and SSE2 assembly: Here is the SSE disassembly and here is the SSE2 disassembly.

So what? The first noticeable thing is that with /arch:SSE enabled, the compiler has failed to optimize the code and generated the normal code instead; that's why there is no difference between the SSE and non-SSE generated code. The second is that with /arch:SSE2 enabled, the compiler has done a horrible job. Why? If you look closely, the compiler is converting all of the floats to doubles to perform the SSE2 operations on them, and there are a lot more problems...

All of the tests were done with MSVC 2010.
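
A note on the float-to-double conversions mentioned above: one common way they end up in generated code is through the source itself, because a double literal or a double-precision math call inside a float expression promotes the whole expression to double. The sketch below only illustrates that C++ rule. It is not taken from Horde3D, the helper names (scaleMixed, scaleFloat, checkScale) are hypothetical, and whether this is what actually happens in ModelNode::updateGeometry() would have to be checked against the disassembly.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// Hypothetical helpers for illustration only (not Horde3D code).

// 0.5 is a double literal: x is promoted to double, multiplied, and the
// result is converted back to float on return.
inline float scaleMixed( float x )
{
   return static_cast<float>( x * 0.5 );
}

// 0.5f keeps the whole expression in single precision.
inline float scaleFloat( float x )
{
   return x * 0.5f;
}

// The expression types show where the promotion happens.
static_assert( std::is_same<decltype( 1.0f * 0.5 ), double>::value,
               "float * double literal is evaluated as double" );
static_assert( std::is_same<decltype( 1.0f * 0.5f ), float>::value,
               "float * float literal stays float" );
static_assert( std::is_same<decltype( std::sqrt( 1.0f ) ), float>::value,
               "std::sqrt has a float overload" );

// Both variants agree for values that are exact in float; only the
// intermediate precision (and the conversions it requires) differs.
inline void checkScale()
{
   assert( scaleMixed( 3.0f ) == scaleFloat( 3.0f ) );
}
```

If mixed expressions like scaleMixed() show up in a hot loop, switching the literals to the f suffix is a cheap thing to try before blaming the code generator.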

Author:	swiftcoder [ 26.09.2010, 16:08 ]
Post subject:	Re: MSVC poor generated code performance
What other compiler optimisation flags are set?

Author:	Siavash [ 26.09.2010, 16:25 ]
Post subject:	Re: MSVC poor generated code performance
Default options of a fresh CMake-generated project + /arch:SSE2:

Code:
/I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/." /I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/../Shared" /I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/../../Bindings/C++" /I"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Source/Horde3DEngine/../../.." /I"D:/Development/SDK/Horde3D/Horde3D SF.net/CM_BIN_x86" /Zi /nologo /W3 /WX- /O2 /Ob1 /Oy- /D "WIN32" /D "_WINDOWS" /D "NDEBUG" /D "CMAKE" /D "CMAKE_INTDIR=\"RelWithDebInfo\"" /D "Horde3D_EXPORTS" /D "_WINDLL" /D "_MBCS" /Gm- /EHsc /MD /GS /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /GR /Fp"Horde3D.dir\RelWithDebInfo\Horde3D.pch" /Fa"RelWithDebInfo" /Fo"Horde3D.dir\RelWithDebInfo\" /Fd"D:/Development/SDK/Horde3D/Horde3D SF.net/Horde3D/Binaries/RelWithDebInfo/Horde3D.pdb" /Gd /TP /analyze- /errorReport:queue

Author:	swiftcoder [ 26.09.2010, 17:11 ]
Post subject:	Re: MSVC poor generated code performance
I would try swapping /fp:precise for /fp:fast, as that should let the compiler generate significantly faster floating-point code. You generally only need /fp:precise if you need to make strict maximum accuracy guarantees.
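
To make the trade-off concrete: floating-point addition is not associative, so under /fp:precise the compiler has to keep the evaluation order written in the source, while /fp:fast is allowed to reorder and regroup operations. The following is a minimal sketch of why the order matters. The function names are hypothetical, and the volatile stores are only there to force each intermediate to be rounded to float.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only.
// The volatile stores force every intermediate result to be rounded to
// float, so the outcome does not depend on how the compiler evaluates it.

// Computes (a + b) + c.
inline float sumLeftToRight( float a, float b, float c )
{
   volatile float t = a + b;
   volatile float r = t + c;
   return r;
}

// Computes a + (b + c).
inline float sumRightToLeft( float a, float b, float c )
{
   volatile float t = b + c;
   volatile float r = a + t;
   return r;
}

// With a = 1e8, b = -1e8, c = 1 the two groupings disagree: the 1 is
// absorbed when it is added to -1e8 first, because floats near 1e8 are
// spaced 8 apart.
inline void checkGrouping()
{
   assert( sumLeftToRight( 1.0e8f, -1.0e8f, 1.0f ) == 1.0f );
   assert( sumRightToLeft( 1.0e8f, -1.0e8f, 1.0f ) == 0.0f );
}
```

For skinning-style math with well-scaled inputs this kind of difference is usually harmless, which is why /fp:fast tends to be acceptable there.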

Author:	Siavash [ 26.09.2010, 17:24 ]
Post subject:	Re: MSVC poor generated code performance
Now it's a bit faster:

Code:
Normal code : about 60ms
SSE code : about 53ms
SSE2 code : about 53ms

EDIT: The generated SSE and SSE2 code is pretty similar; most (maybe all?) of the operations are done on single values instead of being SIMD, and the results are still very far from hand-tuned code.

Horde3D http://horde3d.org/forums/

MSVC poor generated code performance http://horde3d.org/forums/viewtopic.php?f=1&t=1259	Page 1 of 1

Author:	Siavash [ 26.09.2010, 19:04 ]
Post subject:	Re: MSVC poor generated code performance
Round 2 of the experiments: 64-bit generated code

Code:
x64 code : about 53ms
SSE code : about 53ms
SSE2 code : about 53ms

The normal generated code is faster in 64-bit mode, but wait, why is there no difference between the x64, SSE and SSE2 generated code? The answer is here: MSVC generates exactly the same code in all three cases. SSE optimizations are done by default, and if you compare the x64 and x86 (SSE) code you will notice that it uses the movaps instruction to load 4 (aligned) floating-point values at once in x64 mode, and movss to load 1 floating-point value in x86 mode. BTW, IMHO it won't make much difference.
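
For comparison, this is roughly what the difference between one-float-at-a-time code and four-floats-at-a-time code looks like when written by hand with SSE intrinsics. It is a sketch only: the kernel (out = a * s + b) and the function names are made up for illustration and are not the Horde3D skinning code, and it builds only for x86/x64 targets that provide xmmintrin.h.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x64 only)

// Hypothetical kernel for illustration: out[i] = a[i] * s + b[i].

// Scalar version: as written, one float is processed per iteration.
inline void madScalar( const float *a, const float *b, float s, float *out, std::size_t n )
{
   for( std::size_t i = 0; i < n; ++i )
      out[i] = a[i] * s + b[i];
}

// Packed version: four floats per iteration. Unaligned loads/stores
// (movups) are used so the sketch makes no alignment assumptions;
// _mm_load_ps / _mm_store_ps (movaps) require 16-byte aligned pointers.
inline void madPacked( const float *a, const float *b, float s, float *out, std::size_t n )
{
   const __m128 vs = _mm_set1_ps( s );
   std::size_t i = 0;
   for( ; i + 4 <= n; i += 4 )
   {
      const __m128 va = _mm_loadu_ps( a + i );
      const __m128 vb = _mm_loadu_ps( b + i );
      _mm_storeu_ps( out + i, _mm_add_ps( _mm_mul_ps( va, vs ), vb ) );
   }
   // The remaining 0..3 elements are handled one at a time.
   for( ; i < n; ++i )
      out[i] = a[i] * s + b[i];
}

// Usage: both versions produce the same values for this input.
inline void checkMad()
{
   const float a[6] = { 1, 2, 3, 4, 5, 6 };
   const float b[6] = { 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f };
   float scalarOut[6], packedOut[6];
   madScalar( a, b, 2.0f, scalarOut, 6 );
   madPacked( a, b, 2.0f, packedOut, 6 );
   for( int i = 0; i < 6; ++i )
      assert( scalarOut[i] == packedOut[i] );
}
```

Seeing movaps in a disassembly does not by itself mean the arithmetic is packed; the mulps/addps (packed) versus mulss/addss (scalar) instructions are the better indicator.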

Author:	ZONER [ 27.09.2010, 07:26 ]
Post subject:	Re: MSVC poor generated code performance
Siavash, FPU, SSE and SSE2 are the same under x64 because of the MSVC compiler. It has limitations when compiling x64 code: no /arch:SSE switches and no inline ASM; these things are available only when you are compiling x86 code.

Also, don't use Maximize Speed (/O2) or Full Optimization (/Ox), just use Custom. I mean: do not always trust the MSVC optimizer. Another piece of advice is to use __forceinline instead of inline in code (not everywhere, but algebra/math is the best place to use __forceinline). My compiler options are: x86 code, /arch:SSE2, /fp:precise, Custom optimization, Favor Fast Code (/Ot).

Yeah, one more piece of advice: don't use the built-in memory allocator, use your own custom memory allocator instead (my allocator is the TBB allocator, and I am experimenting with NedAlloc).
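
Since __forceinline is MSVC-specific, code that also has to build with other compilers usually hides it behind a macro. A minimal sketch, assuming a hypothetical macro name H3D_FORCEINLINE and a made-up dot3() helper (neither is part of Horde3D):

```cpp
#include <cassert>

// H3D_FORCEINLINE is a hypothetical macro name, not part of Horde3D.
#if defined( _MSC_VER )
#   define H3D_FORCEINLINE __forceinline
#elif defined( __GNUC__ ) || defined( __clang__ )
#   define H3D_FORCEINLINE inline __attribute__( ( always_inline ) )
#else
#   define H3D_FORCEINLINE inline
#endif

// Small math helper of the kind where forced inlining is usually suggested.
H3D_FORCEINLINE float dot3( const float *a, const float *b )
{
   return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Usage: 1*4 + 2*5 + 3*6 = 32.
inline void checkDot3()
{
   const float x[3] = { 1, 2, 3 };
   const float y[3] = { 4, 5, 6 };
   assert( dot3( x, y ) == 32.0f );
}
```

Forced inlining is still a request that can backfire through code bloat, so it is worth measuring before and after, as with the timings earlier in this thread.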

Page 1 of 1	All times are UTC + 1 hour