Horde3D

Next-Generation Graphics Engine
All times are UTC + 1 hour

 Post subject: Re: NOS PACK
PostPosted: 05.10.2008, 19:24 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Thanks a lot, marciano!

A quick question: why is code compiled with MSVC 2005 slower than the same code compiled with MinGW 3.4.5? MinGW/GCC seems to do much better at optimizing it:

MinGW 3.4.5, optimized utmath_rc3: ~1s
MSVC 2005, optimized utmath_rc3: ~4s


 Post subject: Re: NOS PACK
PostPosted: 05.10.2008, 22:40 
Offline
Engine Developer

Joined: 10.09.2006, 15:52
Posts: 1217
Just to be on the safe side: you compiled in Release mode in VC, right?


 Post subject: Re: NOS PACK
PostPosted: 06.10.2008, 10:55 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
marciano wrote:
Just to be on the safe side: you compiled in Release mode in VC, right?

Yes, I compiled the code in Release mode with /Ox.


 Post subject: Re: NOS PACK
PostPosted: 06.10.2008, 16:46 
Offline

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
Siavash wrote:
marciano wrote:
Just to be on the safe side: you compiled in Release mode in VC, right?
Yes, I've compiled the code in release mode + /Ox
Different compilers implement different optimisations. Particularly if you are on an older machine (PIII or earlier), GCC may have more effective optimisations. I wouldn't necessarily expect GCC to beat VS2008 on a modern machine though.

_________________
Tristam MacDonald - [swiftcoding]


 Post subject: Re: NOS PACK
PostPosted: 06.10.2008, 17:11 
Offline
Tool Developer

Joined: 13.11.2007, 11:07
Posts: 1150
Location: Germany
Compiler optimizations may also depend on the complexity of your test code. If you use Visual Studio's PGO (profile-guided optimization), you may get big performance improvements on simple test code but far smaller ones on more complex code. At least in a short test with a few thousand matrix multiplications and no other code, the PGO-optimized version was about 100 times faster. In more complex code it made almost no difference.


 Post subject: Re: NOS PACK
PostPosted: 06.10.2008, 17:18 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Most of those extra memory copies have been removed in utMath_rc4, and it's a bit [~1.5x] faster than utMath_rc3.

There are a lot of things I still have to learn about SSE to use 100% of my old CPU's power :wink:


 Post subject: Re: NOS PACK
PostPosted: 06.10.2008, 19:14 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
swiftcoder wrote:
Siavash wrote:
marciano wrote:
Just to be on the safe side: you compiled in Release mode in VC, right?
Yes, I've compiled the code in release mode + /Ox
Different compilers implement different optimisations. Particularly if you are on an older machine (PIII or earlier), GCC may have more effective optimisations. I wouldn't necessarily expect GCC to beat VS2008 on a modern machine though.

According to this topic on the MSDN forums, MSVC is not good at SSE optimizations.


 Post subject: Re: NOS PACK
PostPosted: 08.10.2008, 04:24 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Hi, I've used a union to remove extra memory loads in the Vec3f class:
Code:
   union
   {
        __m128 m128;
   };

Code:
   // ------------
   // Constructors
   // ------------
   Vec3f() : x( 0.0f ), y( 0.0f ), z( 0.0f )
   {
       m128=_mm_setzero_ps();
   }

   explicit Vec3f( const float x, const float y, const float z ) : x( x ), y( y ), z( z )
   {
       m128=_mm_setr_ps(x, y, z, 0.0f);
   }

After benchmarking this code against the original utmath by adding two vectors, I get these results:
utmath_org : ~200ms
utmath_sse : ~500ms

So I decided to change this piece of code too:
Code:
   Vec3f operator+( const Vec3f &v ) const
   {
       SSE_ALIGNED( float out1[4] );
       // Plain scalar addition, one component at a time
       out1[0]=x+v.x; out1[1]=y+v.y; out1[2]=z+v.z;
       return Vec3f( out1[0], out1[1], out1[2] );
   }

Code:
   Vec3f operator+( const Vec3f &v ) const
   {
       SSE_ALIGNED( float out1[4] );
       _mm_store_ps(out1, _mm_add_ps(m128, v.m128));
      return Vec3f( out1[0], out1[1], out1[2] );
   }

After benchmarking both versions by adding two vectors, I get these results:
first code : ~550ms
second code : ~650ms

Does anybody know how to eliminate that SSE_ALIGNED( float out1[4] ) to make the code a bit faster?


 Post subject: Re: NOS PACK
PostPosted: 08.10.2008, 23:06 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
After having a look at fvec.h, I decided to eliminate those output arrays this way:
Code:
   Vec3f extractm128(__m128 myvec) const
   {
      float *fp = (float*)&myvec;
      return Vec3f( *(fp+0), *(fp+1), *(fp+2));
   }

And here is the benchmark result for this operator:
Code:
   Vec3f operator/( const float f ) const
   {
       __m128 a=_mm_mul_ps(m128, _mm_rcp_ps(_mm_set_ps1(f)));
      return extractm128(a);
   }

utmath_org : 190ms
utmath_rc4 : 591ms
utmath_rc4 : 731ms [using unions]
utmath_rc4 : 1012ms [using unions + extractm128]

But it seems that using unions is much slower than plain utmath_rc4. It's driving me mad :evil:


 Post subject: Re: NOS PACK
PostPosted: 09.10.2008, 14:22 
Offline

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
Siavash wrote:
After having a look at fvec.h I decided to eliminate that output arrays by this way:
Code:
Vec3f extractm128(__m128 myvec) const
{
   float *fp = (float*)&myvec;
   return Vec3f( *(fp+0), *(fp+1), *(fp+2));
}
That shouldn't help at all, and may hurt a lot. You are forcing the compiler to generate code that saves the SSE register to memory, loads it back into 3 regular registers, and then writes it out again. Why do you need to extract it at all for this sort of operation? Just initialise the new vector directly with the SSE value.
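
A minimal sketch of that suggestion, assuming a vector type that stores only the SSE register. The name Vec3fSSE, its constructors and the store() helper are made up for illustration and are not part of utMath:

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_setr_ps, _mm_storeu_ps

// Hypothetical SSE-backed vector (not the utMath Vec3f).
struct Vec3fSSE
{
    __m128 m128;  // lanes: x, y, z, unused (kept at 0)

    Vec3fSSE() : m128( _mm_setzero_ps() ) {}
    Vec3fSSE( float x, float y, float z ) : m128( _mm_setr_ps( x, y, z, 0.0f ) ) {}

    // Wrap an existing SSE value directly: no temporary float array needed.
    explicit Vec3fSSE( __m128 v ) : m128( v ) {}

    Vec3fSSE operator+( const Vec3fSSE &v ) const
    {
        // The sum stays in a register and is handed straight to the constructor.
        return Vec3fSSE( _mm_add_ps( m128, v.m128 ) );
    }

    // Copy all four lanes to memory only when scalar values are really needed.
    void store( float out[4] ) const { _mm_storeu_ps( out, m128 ); }
};
```

With this layout operator+ needs neither SSE_ALIGNED( float out1[4] ) nor a helper like extractm128; whether it is actually faster still has to be measured.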

_________________
Tristam MacDonald - [swiftcoding]


 Post subject: Re: NOS PACK
PostPosted: 09.10.2008, 16:03 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
:lol: I'm glad/sorry to tell you that my benchmarks were a bit wrong. I tested the code with MSVC too and the results were shocking: 0s for utmath_org. So I suspect that both MSVC and MinGW/GCC were "cheating" on the plain FPU code by optimizing away the calculations on predefined values, and I decided to use random values to defeat those optimizations :twisted:
I've changed the benchmark code to this:
Code:
#include <cstdlib>    // rand()
#include <iostream>
#include <windows.h>  // GetTickCount()
#include "utMath_org.h"

using namespace std;

int main()
{
    DWORD start,end;

    Vec3f vegas1;

    start=GetTickCount();

    // Random inputs stop the compiler from folding the math at compile time
    for (int i=0; i<9999999; i++)
    {
        vegas1.x=rand();
        vegas1.z=rand();
        vegas1.y=rand();
        vegas1=vegas1/rand();
    }

    end=GetTickCount();

    cout<<"time :"<<end-start<<endl;

    return 0;
}

Welcome to the desert of the truth :
Compiled with MinGW 3.4.5 @ PIII 750MHz + 256MB RAM + WinXP Pro SP2

utmath_org : ~490ms
utmath_rc4 : ~410ms [using unions]
utmath_rc4 : ~370ms [using DIVPS]
utmath_rc4 : ~330ms [using MULPS and RCPPS instead of DIVPS]

Hmmmm... Yummy! At least a 30% performance boost! IMHO it's better to forget about unions entirely, because the original utmath_rc4 is the fastest on memory loads [and 2x faster than utmath_rc3].
As for using MULPS and RCPPS instead of DIVPS, I must say that the results are not very accurate [the max diff is about ~0.01].
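
RCPPS is only an approximation (Intel documents a maximum relative error of 1.5 * 2^-12), which may explain the difference seen above. A common way to get most of the accuracy back is one Newton-Raphson step after the estimate. The sketch below is illustrative only: the function names are made up, and it is not taken from utMath_rc4.

```cpp
#include <cmath>        // std::fabs
#include <xmmintrin.h>  // SSE: _mm_rcp_ps, _mm_mul_ps, _mm_sub_ps, _mm_set1_ps

// Fast approximate 1/d, roughly 12 bits of precision.
inline __m128 rcpFast( __m128 d )
{
    return _mm_rcp_ps( d );
}

// Approximate 1/d refined by one Newton-Raphson step: x1 = x0 * (2 - d * x0).
inline __m128 rcpRefined( __m128 d )
{
    const __m128 x0 = _mm_rcp_ps( d );
    return _mm_mul_ps( x0, _mm_sub_ps( _mm_set1_ps( 2.0f ), _mm_mul_ps( d, x0 ) ) );
}

// Hypothetical helper: largest relative error of `approx` against the exact 1/d
// over the four lanes.
inline float maxRelError( __m128 approx, __m128 d )
{
    float a[4], v[4];
    _mm_storeu_ps( a, approx );
    _mm_storeu_ps( v, d );

    float worst = 0.0f;
    for( int i = 0; i < 4; ++i )
    {
        const float exact = 1.0f / v[i];
        const float err = std::fabs( a[i] - exact ) / std::fabs( exact );
        if( err > worst ) worst = err;
    }
    return worst;
}
```

The refinement costs two extra multiplies and one subtract per division, so it has to be benchmarked against plain DIVPS before assuming it is still a win.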

I'll post a detailed log of the performance boosts here in a few days; look forward to your fastest experience with the engine :wink:
BTW, Blue pill or Red pill ?
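
The benchmark above times four rand() calls per iteration along with the division, and its result is never read. A sketch of an alternative harness follows, assuming C++11 is available; V3, divide() and benchDivide() are stand-ins invented for this example, not utMath code. It generates the random inputs before the timed loop and keeps a checksum so the optimizer cannot discard the work.

```cpp
#include <chrono>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for Vec3f so the sketch is self-contained.
struct V3 { float x, y, z; };

inline V3 divide( const V3 &v, float f )
{
    return V3{ v.x / f, v.y / f, v.z / f };
}

struct BenchResult
{
    double milliseconds;  // wall-clock time of the timed loop
    float  checksum;      // sum of all results, keeps the loop alive
};

BenchResult benchDivide( int iterations )
{
    // Generate inputs up front so rand() is not part of the measurement.
    const int N = 1024;
    std::vector<float> inputs( N );
    for( int i = 0; i < N; ++i )
        inputs[i] = static_cast<float>( std::rand() ) + 1.0f;  // +1 avoids division by zero

    float checksum = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for( int i = 0; i < iterations; ++i )
    {
        const V3 v = { inputs[i % N], inputs[(i + 1) % N], inputs[(i + 2) % N] };
        const V3 r = divide( v, inputs[(i + 3) % N] );
        checksum += r.x + r.y + r.z;  // use the result so it cannot be optimized away
    }
    const auto end = std::chrono::steady_clock::now();

    return BenchResult{ std::chrono::duration<double, std::milli>( end - start ).count(), checksum };
}
```

Printing both fields of the returned BenchResult from main() makes the checksum observable; timings from this harness are not comparable with the GetTickCount() numbers above.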


 Post subject: Re: NOS PACK
PostPosted: 17.10.2008, 12:52 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Here are some useful links for code optimization enthusiasts :wink:

http://www.agner.org/optimize/
http://gruntthepeon.free.fr/ssemath/


 Post subject: Re: NOS PACK
PostPosted: 21.10.2008, 18:48 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Finally, the fastest SSE-optimized utMath that is fully compatible with the current structure of the engine has been released: utMath_rc5!
Most of the unnecessary memory loads have been removed and replaced by SHUFPS, and according to my benchmarks utMath_rc5 is ~10% faster than the original utMath.
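
How utMath_rc5 uses SHUFPS is not shown here, so the following is only a generic illustration of the instruction: a three-component dot product that rearranges lanes inside registers instead of going through memory. The function name dot3 is made up for this example.

```cpp
#include <xmmintrin.h>  // SSE: _mm_mul_ps, _mm_shuffle_ps, _mm_add_ss, _mm_cvtss_f32

// Dot product of the x, y, z lanes of two registers; the fourth lane is ignored.
inline float dot3( __m128 a, __m128 b )
{
    const __m128 p = _mm_mul_ps( a, b );                                 // (ax*bx, ay*by, az*bz, aw*bw)
    const __m128 y = _mm_shuffle_ps( p, p, _MM_SHUFFLE( 1, 1, 1, 1 ) );  // broadcast lane 1 (SHUFPS)
    const __m128 z = _mm_shuffle_ps( p, p, _MM_SHUFFLE( 2, 2, 2, 2 ) );  // broadcast lane 2 (SHUFPS)
    // Add the three products in lane 0 and return that lane as a scalar.
    return _mm_cvtss_f32( _mm_add_ss( _mm_add_ss( p, y ), z ) );
}
```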

Now I'm going to redesign utMath and make some changes to the engine so that it uses m128 variables instead of the plain floats in Vec3f and Matrix4f. This should remove the remaining unnecessary memory loads, so that everything is performed in the CPU cache and the real power of SSE is fully utilized.

Does anybody want to help me make the related changes to the engine [I'm not very familiar with its other sections] :?:


 Post subject: Re: NOS PACK
PostPosted: 21.10.2008, 23:50 
Offline

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
Siavash wrote:
Anybody wants to help me to perform the related changes on engine [because I'm not too friendly with other sections of engine] :?:
So the necessary changes would be to convert all accesses to x, y or z of a vector into function calls x(), y() and z(), and similarly for the Matrix, where the function calls extract the necessary float from the m128? I have a feeling that you won't get any better performance this way than by letting the compiler generate the same code automatically using unions.
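
To make the idea concrete, here is a rough sketch of what such accessor functions could look like. The struct name Vec3fAccessors is hypothetical and the layout (x, y, z in lanes 0 to 2) is an assumption:

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_cvtss_f32

// Hypothetical vector that stores only the SSE register and exposes
// its components through functions instead of public floats.
struct Vec3fAccessors
{
    __m128 m128;  // lanes: x, y, z, unused

    // Lane 0 can be read directly.
    float x() const { return _mm_cvtss_f32( m128 ); }

    // Other lanes are first moved into lane 0 with a shuffle.
    float y() const { return _mm_cvtss_f32( _mm_shuffle_ps( m128, m128, _MM_SHUFFLE( 1, 1, 1, 1 ) ) ); }
    float z() const { return _mm_cvtss_f32( _mm_shuffle_ps( m128, m128, _MM_SHUFFLE( 2, 2, 2, 2 ) ) ); }
};
```

Every y() or z() call costs an extra shuffle, so code that reads single components often is unlikely to gain anything from this layout.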

_________________
Tristam MacDonald - [swiftcoding]


 Post subject: Re: NOS PACK
PostPosted: 22.10.2008, 04:09 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
swiftcoder wrote:
So the necessary changes should be to convert all accesses to x, y or z of a vector to function calls x(), y() and z() - and similarly for the Matrix, where the function calls extract the necessary float from the m128?
Yes, I want to do exactly that: access the necessary floats from the m128 unions using getX(), getY(), getZ() in Vec3f, and getCol()/getRow() or an overloaded [] operator in Matrix4f. Then I want to change the current structure of the engine to:
Code:
myVec3f.m128=sse_sqrt(myVec3f.m128);
Instead of :
Code:
myVec3f.m128=sqrt(myVec3f.getX());
myVec3f.m128=sqrt(myVec3f.getY());
myVec3f.m128=sqrt(myVec3f.getZ());
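
As a sketch of the difference: sse_sqrt above is presumably a wrapper around something like the _mm_sqrt_ps intrinsic (that is an assumption). The function names sqrtPacked and sqrtPerComponent below are invented for the comparison.

```cpp
#include <cmath>        // std::sqrt
#include <xmmintrin.h>  // SSE: _mm_sqrt_ps

// Whole-register version: one SQRTPS computes the square root of all four lanes.
inline __m128 sqrtPacked( __m128 v )
{
    return _mm_sqrt_ps( v );
}

// Per-component version: three separate scalar square roots.
inline void sqrtPerComponent( float v[3] )
{
    for( int i = 0; i < 3; ++i )
        v[i] = std::sqrt( v[i] );
}
```

The packed version only pays off if the data is already in an SSE register and stays there afterwards; moving values in and out around each call can cancel the gain.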

swiftcoder wrote:
I have a feeling that you won't get any better performance this way than by letting the compiler generate the same code automatically using unions.
Can you explain your opinion a bit more? It's very interesting :wink:

I've upgraded my compiler from MinGW/GCC 3.4.5 to MinGW/GCC 4.3 alpha. Which flag enables the compiler's auto-vectorization feature? Currently I'm using the -fexpensive-optimizations flag.
Why is PIII-optimized code [-march=pentium3] slower than code built with -msse?

