Horde3D
http://horde3d.org/forums/

utMath elite
http://horde3d.org/forums/viewtopic.php?f=8&t=620

Author:  Siavash [ 22.01.2009, 17:35 ]
Post subject:  utMath elite

Hi, I've performed some high-school level optimizations [not SIMD] on utMath. Any ideas are welcome :wink:

Attachments:
File comment: outdated, only compatible with Horde3D SDK Beta2
utMath_elite.zip [6.23 KiB]
Downloaded 803 times

Author:  swiftcoder [ 22.01.2009, 18:53 ]
Post subject:  Re: utMath elite

Siavash wrote:
Hi, I've performed some high-school level optimizations [not SIMD] on utMath.
I am pretty sure that isn't what you meant, high-level would be the term ;)

Author:  Siavash [ 23.01.2009, 03:50 ]
Post subject:  Re: utMath elite

swiftcoder wrote:
Siavash wrote:
Hi, I've performed some high-school level optimizations [not SIMD] on utMath.
I am pretty sure that isn't what you meant, high-level would be the term ;)
High-level? Is there any performance increase? [there are only some precalculated values, factoring of polynomials and ...]

Author:  Siavash [ 04.04.2009, 15:03 ]
Post subject:  Re: utMath elite

Is there any noticeable performance difference between the Horde3D beta2 & beta3 utMath libs?

Author:  marciano [ 05.04.2009, 12:31 ]
Post subject:  Re: utMath elite

Siavash wrote:
Is there any noticeable performance difference between the Horde3D beta2 & beta3 utMath libs?

The most important thing in utMath Beta3 is an optimized float to int conversion.
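
For readers wondering what an optimized float to int conversion can look like, here is a minimal sketch of one well-known trick: add the constant 1.5 * 2^52 to a double and read the low 32 bits of the result. The post does not say which technique Beta3 actually uses, and the name `fastRound` is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not taken from utMath): rounds a double to the nearest
// 32-bit integer without a cast. Assumes IEEE-754 doubles, the default
// round-to-nearest-even mode, and |val| well inside the int32 range.
inline std::int32_t fastRound(double val)
{
    // 1.5 * 2^52: after the addition, the rounded integer part of val sits
    // in the low bits of the mantissa.
    const double magic = 6755399441055744.0;
    const double shifted = val + magic;

    std::uint64_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);  // well-defined type punning

    // Keep the low 32 bits; negative values wrap to two's complement.
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
}
```

Ties round to even here, so `fastRound(2.5)` gives 2 while `fastRound(3.5)` gives 4; a plain `static_cast<int>` truncates toward zero instead. Whether this beats a cast depends on the compiler and the target CPU, so it is worth measuring.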

Author:  fullmetalcoder [ 11.04.2009, 12:18 ]
Post subject:  Re: utMath elite

Not about the patch above, but on the same topic.

The determinant() function looked so... suboptimal that I couldn't restrain myself and factorized it. The resulting code is not that cute but the performance gain is there :

Code:
float fastDeterminant() const
   {
      /*
         factorization result :
            192 -> 20 ptr deref
            96 -> 28 fpmult
            24 -> 18 fpadd
            
            => solid 30% speed improvement
      */
      
      const float * const c0 = c[0];
      const float * const c1 = c[1];
      const float * const c2 = c[2];
      const float * const c3 = c[3];
      
      const float c00 = c0[0];
      const float c01 = c0[1];
      const float c02 = c0[2];
      const float c03 = c0[3];
      
      const float c10 = c1[0];
      const float c11 = c1[1];
      const float c12 = c1[2];
      const float c13 = c1[3];
      
      const float c0011 = c00*c11;
      const float c0012 = c00*c12;
      const float c0013 = c00*c13;
      
      const float c0110 = c01*c10;
      const float c0112 = c01*c12;
      const float c0113 = c01*c13;
      
      const float c0210 = c02*c10;
      const float c0211 = c02*c11;
      const float c0213 = c02*c13;
      
      const float c0310 = c03*c10;
      const float c0311 = c03*c11;
      const float c0312 = c03*c12;
      
      const float c03x12m02x13 = c0312 - c0213;
      const float c03x11m01x13 = c0311 - c0113;
      const float c02x11m01x12 = c0211 - c0112;
      const float c02x10m00x12 = c0210 - c0012;
      const float c03x10m00x13 = c0310 - c0013;
      const float c01x10m00x11 = c0110 - c0011;
      
      const float c20 = c2[0];
      const float c21 = c2[1];
      const float c22 = c2[2];
      const float c23 = c2[3];
      
      return
         c3[0] * ( c03x12m02x13*c21 - c03x11m01x13*c22 + c02x11m01x12*c23 )
         -
         c3[1] * ( c03x12m02x13*c20 - c03x10m00x13*c22 + c02x10m00x12*c23 )
         +
         c3[2] * ( c03x11m01x13*c20 - c03x10m00x13*c21 + c01x10m00x11*c23 )
         -
         c3[3] * ( c02x11m01x12*c20 - c02x10m00x12*c21 + c01x10m00x11*c22 );
   }


Now there is probably room for explicit SIMD vectorization here but I'm not familiar enough with this to do it myself atm. By the way, "implicit" vectorization (passing this to gcc : -msse -msse2 -msse3 -mfpmath=sse) does not change the benchmark results.
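
One way to sanity-check a factorization like the one above is to compare its output against a plain cofactor expansion. The sketch below makes some assumptions: `Mat4` is a hypothetical struct that only mirrors the `c[4][4]` layout used in the code, not the real utMath matrix class, and `minor3`, `referenceDeterminant` and `nearlyEqual` are illustrative helpers.

```cpp
#include <cmath>

// Hypothetical stand-in for the matrix type; only the c[4][4] layout is
// taken from the code above.
struct Mat4
{
    float c[4][4];
};

// Determinant of the 3x3 matrix left after deleting row `skipRow` and
// column `skipCol`.
inline float minor3(const Mat4 &m, int skipRow, int skipCol)
{
    float s[3][3];
    int r = 0;
    for (int i = 0; i < 4; ++i)
    {
        if (i == skipRow) continue;
        int k = 0;
        for (int j = 0; j < 4; ++j)
        {
            if (j == skipCol) continue;
            s[r][k++] = m.c[i][j];
        }
        ++r;
    }
    return s[0][0] * (s[1][1] * s[2][2] - s[1][2] * s[2][1])
         - s[0][1] * (s[1][0] * s[2][2] - s[1][2] * s[2][0])
         + s[0][2] * (s[1][0] * s[2][1] - s[1][1] * s[2][0]);
}

// Straightforward cofactor expansion along c[0]; slow but easy to verify.
inline float referenceDeterminant(const Mat4 &m)
{
    float det = 0.0f;
    float sign = 1.0f;
    for (int j = 0; j < 4; ++j)
    {
        det += sign * m.c[0][j] * minor3(m, 0, j);
        sign = -sign;
    }
    return det;
}

// True when two results agree within a relative tolerance.
inline bool nearlyEqual(float a, float b, float relTol = 1e-5f)
{
    const float scale = std::fmax(1.0f, std::fmax(std::fabs(a), std::fabs(b)));
    return std::fabs(a - b) <= relTol * scale;
}
```

Feeding the same matrices (random ones as well as identity and singular cases) to `referenceDeterminant` and to the factored version, then comparing with `nearlyEqual`, should catch sign or index slips.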

Also note that some profiling shows (valgrind, under linux 32 bit (arch : core 2) compiled with -O2) that the strategy used for += *= and /= operators is quite suboptimal (I'm not posting "fixes" here as they are extremely simple).
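
Since the fixes are not shown, the following is only a guess at the kind of change meant: implementing a compound operator in place instead of routing it through the binary operator and a temporary. `Vec3` is a hypothetical minimal vector type, not the utMath vector class, and `addViaTemporary` exists only to show the contrast.

```cpp
// Hypothetical minimal vector type for illustration only.
struct Vec3
{
    float x, y, z;

    Vec3 operator+(const Vec3 &v) const
    {
        return Vec3{x + v.x, y + v.y, z + v.z};
    }

    // Variant that goes through operator+ and copies a temporary back.
    Vec3 &addViaTemporary(const Vec3 &v)
    {
        return *this = *this + v;
    }

    // In-place variant: updates the members directly, no temporary.
    Vec3 &operator+=(const Vec3 &v)
    {
        x += v.x;
        y += v.y;
        z += v.z;
        return *this;
    }
};
```

Both variants produce the same values, and an optimizing compiler may well emit identical code for them, so any gain should be confirmed with a profiler on the target build.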

Author:  Siavash [ 11.04.2009, 13:40 ]
Post subject:  Re: utMath elite

Thanks a lot for the patch, fullmetalcoder. You prompted me to run some benchmarks using Intel VTune :wink:
[MSVC2008 Express : Debug]
Code:
////////Determinant() Benchmark///////////////
utMath beta3   : 33 ;
fullmetalcoder : 30 ;
utMath elite   : 28 ;
//////////////////////////////////////////////

////////Inverted() Benchmark//////////////////
utMath beta3   : 148 ;
fullmetalcoder : 148 ;
utMath elite   : 132 ;
//////////////////////////////////////////////
Looks like utMath elite rocks 8)

Author:  fullmetalcoder [ 11.04.2009, 13:57 ]
Post subject:  Re: utMath elite

Interesting benchmark figures. Could it be that MSVC does a better job at optimizing complex lookup + math ops? Here is what I get with the code from SVN:

Quote:
det test : 100000000
normal : result=-0.480000, elapsed=123
fast : result=-0.480000, elapsed=88

elapsed time in ms, 10**8 determinant computations. Code compiled with GCC ( -march=i686 -mtune=generic -O2 ) running on my laptop (core 2 T7700, 3GB DDR, Linux)

edit : just tried replacing regular determinant() with the one from the archive above. Here are the results :
Quote:
det test : 100000000
normal : result=-0.480000, elapsed=112
fast : result=-0.480000, elapsed=87


edit2 : benchmarking in debug mode is generally not a good idea for such simple operations. It is very likely that det = o(debug overhead)

Author:  Siavash [ 11.04.2009, 14:24 ]
Post subject:  Re: utMath elite

fullmetalcoder wrote:
interesting benchmark figures. Could it be that MSVC does a better job at optimizing complex lookup + math ops?
It's all the more interesting because the system I was using was an old Pentium III 750 MHz with 256 MB SDRAM on Windows XP SP3, without any optimizations.
fullmetalcoder wrote:
benchmarking in debug mode is generally not a good idea for such simple operations. It is very likely that det = o(debug overhead)
Yes I know, but if you have a look at disasm of code you will see that MSVC cheats in the release mode [precalculated values and ...].

Author:  fullmetalcoder [ 11.04.2009, 14:27 ]
Post subject:  Re: utMath elite

Siavash wrote:
Yes I know, but if you have a look at disasm of code you will see that MSVC cheats in the release mode [precalculated values and ...].

All compilers cheat. I was forced to add some extra code in the loop of the determinant computation to make sure the whole loop was not optimized away (turned into a single call)...
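
One common way to keep such a loop alive is to make every iteration feed a value the compiler cannot discard. The sketch below accumulates a checksum and writes it to a `volatile` variable; `timeLoop` and `BenchResult` are hypothetical names and not the benchmark code used for the figures in this thread.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical result record for the sketch.
struct BenchResult
{
    double checksum;         // sum of all results, keeps the work observable
    std::int64_t elapsedMs;  // wall-clock time for the whole loop
};

// Calls fn(i) `iterations` times and accumulates the results so the
// compiler cannot drop the calls as dead code.
template <typename Fn>
BenchResult timeLoop(Fn fn, std::int64_t iterations)
{
    volatile double sink = 0.0;  // the volatile store must actually happen
    double sum = 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iterations; ++i)
        sum += fn(i);  // pass i so the callee can vary its input per call
    sink = sum;
    const auto stop = std::chrono::steady_clock::now();

    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    return BenchResult{sink, static_cast<std::int64_t>(ms.count())};
}
```

A compiler can still simplify a trivial `fn`, so the callee should do real work on inputs that change with `i`, for example by perturbing one matrix element per iteration.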

Author:  Siavash [ 11.04.2009, 15:14 ]
Post subject:  Re: utMath elite

I've forced the compiler to print out the results outside of the loop [Release mode /O2]:
Code:
/////// det -  inv /////////
beta3 : 13  -  56
fast  : 9   -  49
elite : 9   -  49
////////////////////////////
fullmetalcoder's version and the elite version perform the same :wink:

Author:  Siavash [ 12.04.2009, 18:22 ]
Post subject:  Re: utMath elite

It would be a good idea to replace some parts of utMath beta3 with the elite version; there is a ~13% performance boost in the det & inv functions.

All times are UTC + 1 hour