Horde3D

Next-Generation Graphics Engine

All times are UTC + 1 hour




 Post subject: Re: NOS PACK
PostPosted: 04.04.2009, 05:27 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Apologies for the long delay; I'm currently a little busy with uni :?


 Post subject: Re: NOS PACK
PostPosted: 04.04.2009, 15:08 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
After some goooooogling, it seems there is no need to vectorize all of utMath: vectorizing the vector/quaternion classes has no effect because of the CPU/memory load/store overhead. It's better to focus on the Matrix class. IMHO this will give utMath its peak performance.


 Post subject: Re: NOS PACK
PostPosted: 07.04.2009, 14:03 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Hi, could anybody test this piece of code?
Code:
// Matrix4f class @ utMath.h
#include <xmmintrin.h>

   static void fastMult43( Matrix4f &dst, const Matrix4f &m1, const Matrix4f &m2 )
   {
      // Note: dst may not be the same as m1 or m2

      float *dstx = dst.x;
      const float *m1x = m1.x;
      const float *m2x = m2.x;

      __m128 temp1,temp2,temp3,temp4;

      __m128 m1x0=_mm_loadu_ps(m1x);
      __m128 m1x4=_mm_loadu_ps(m1x+4);
      __m128 m1x8=_mm_loadu_ps(m1x+8);
      __m128 m1x12=_mm_loadu_ps(m1x+12);

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[0]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[1]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[2]));
      _mm_storeu_ps(dstx,_mm_add_ps(temp1,_mm_add_ps(temp2,temp3)));

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[4]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[5]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[6]));
      _mm_storeu_ps(dstx+4,_mm_add_ps(temp1,_mm_add_ps(temp2,temp3)));

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[8]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[9]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[10]));
      _mm_storeu_ps(dstx+8,_mm_add_ps(temp1,_mm_add_ps(temp2,temp3)));

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[12]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[13]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[14]));
      temp4=_mm_mul_ps(m1x12,_mm_load1_ps(&m2x[15]));
      _mm_storeu_ps(dstx+12,_mm_add_ps(temp1,_mm_add_ps(temp2,_mm_add_ps(temp3,temp4))));

      dstx[3]=0.0f;
      dstx[7]=0.0f;
      dstx[11]=0.0f;
      dstx[15]=1.0f;
   }
I just wanted to know whether this is an efficient way of coding it.
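For anyone testing the routine above, a scalar reference to compare the results against might look like the sketch below. It assumes a minimal stand-in Matrix4f with a column-major float x[16] member (the real class in utMath.h has more to it); the names fastMult43Ref and nearlyEqual are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the engine's Matrix4f: 16 floats, column-major.
struct Matrix4f
{
    float x[16];
};

// Scalar reference: multiplies two affine matrices (bottom row 0,0,0,1) and
// writes the same element layout as the SSE routine posted above.
static void fastMult43Ref( Matrix4f &dst, const Matrix4f &m1, const Matrix4f &m2 )
{
    assert( &dst != &m1 && &dst != &m2 );  // dst must not alias the inputs

    for( int col = 0; col < 4; ++col )
    {
        for( int row = 0; row < 3; ++row )
        {
            float sum = 0.0f;
            for( int k = 0; k < 3; ++k )
                sum += m1.x[k * 4 + row] * m2.x[col * 4 + k];
            // Only the last column picks up m1's translation
            if( col == 3 ) sum += m1.x[12 + row] * m2.x[15];
            dst.x[col * 4 + row] = sum;
        }
        dst.x[col * 4 + 3] = ( col == 3 ) ? 1.0f : 0.0f;
    }
}

// Element-wise comparison with a small tolerance, since SSE and scalar code
// may round slightly differently.
static bool nearlyEqual( const Matrix4f &a, const Matrix4f &b, float eps = 1e-5f )
{
    for( int i = 0; i < 16; ++i )
        if( std::fabs( a.x[i] - b.x[i] ) > eps ) return false;
    return true;
}
```

Running both versions on the same inputs and checking nearlyEqual on the two results should catch indexing mistakes before any timing is done.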


 Post subject: Re: NOS PACK
PostPosted: 10.04.2009, 04:10 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
Hi, yesterday I benchmarked the utMath Beta3 and utMath SIMD versions using Intel VTune. Unfortunately, the utMath Beta2 SIMD version is ~2x slower than utMath Beta2/3. There isn't any way to make it faster without changing some parts of the engine.
Code:
/////////////   utMath beta3 // utMath SIMD      /////////////
determinant :      31            79
fastMult43  :      33            67
inverted    :      160            407
/////////////   Total time for a PentiumIII 750   /////////////
I'm sorry, I've failed to optimize utMath. It looks like utMath is not a good place to optimize the engine; as oldman said: "Just optimize the inner loops".

I'll prepare an optimized version of utMath that only uses the __m128 data type [NOS PACK]. [until summer]


 Post subject: Re: NOS PACK
PostPosted: 11.04.2009, 12:24 
Offline

Joined: 11.04.2009, 08:42
Posts: 14
Location: France
I think the biggest potential for explicit vectorization lies in matrix/vector operations (+, *, dot product, ...), which are not covered by your tests, so they do not help in figuring out the potential of explicit vectorization of utMath.h

It could be interesting to use Eigen either as-is or as a starting basis to craft an efficient vectorized version of utMath (the interesting point is that Eigen is mature, very efficient and extremely portable so it would reduce the hassle). By the way, Eigen is licensed under LGPL so licensing isn't a problem.


 Post subject: Re: NOS PACK
PostPosted: 11.04.2009, 14:05 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
fullmetalcoder wrote:
I think the biggest potential for explicit vectorization lies in matrix/vector operations (+, *, dot product, ...)
Yes, you are right and I agree with you.
fullmetalcoder wrote:
which are not covered by your tests, so they do not help in figuring out the potential of explicit vectorization of utMath.h
I've run full benchmarks with utMath_rc6.2 [SIMD] and in most cases it looks slower than the original, because it uses unaligned data to work around the problem with HP's <vector.h> [included in MSVC], plus extra CPU/memory load/stores.
fullmetalcoder wrote:
It could be interesting to use Eigen either as-is or as a starting basis to craft an efficient vectorized version of utMath (the interesting point is that Eigen is mature, very efficient and extremely portable so it would reduce the hassle). By the way, Eigen is licensed under LGPL so licensing isn't a problem.
Eigen looks a little bloated and I don't have any ideas, but I'll have a look 8)


 Post subject: Re: NOS PACK
PostPosted: 11.04.2009, 14:22 
Offline

Joined: 11.04.2009, 08:42
Posts: 14
Location: France
Siavash wrote:
I've run full benchmarks with utMath_rc6.2 [SIMD] and in most cases it looks slower than the original, because it uses unaligned data to work around the problem with HP's <vector.h> [included in MSVC], plus extra CPU/memory load/stores.

Alignment is a requirement for getting a performance gain through vectorization, I guess. It should not be an issue though: wrapping an alignment macro and putting it in front of the Matrix and Vec definitions should do. Another important point is the NEED to reduce the number of loads/stores as much as possible in favor of shuffles, because SSE register I/O is slow...
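A minimal sketch of what such an alignment macro could look like. The macro name UT_ALIGN16 and the stand-in Matrix4f are assumptions for illustration, not code from utMath.h. The attribute is placed between the struct keyword and the type name because that position is accepted by MSVC, GCC and standard alignas alike.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical alignment macro; the name UT_ALIGN16 is made up for this sketch.
#if defined( _MSC_VER )
    #define UT_ALIGN16 __declspec( align( 16 ) )
#elif defined( __GNUC__ )
    #define UT_ALIGN16 __attribute__(( aligned( 16 ) ))
#else
    #define UT_ALIGN16 alignas( 16 )   // standard C++11 fallback
#endif

// Stand-in for the engine's Matrix4f. 16 floats are already a multiple of
// 16 bytes, so alignment does not change the size of this type.
struct UT_ALIGN16 Matrix4f
{
    float x[16];
};

// True if the pointer can be used with aligned SSE loads/stores
static bool isAligned16( const void *p )
{
    return ( reinterpret_cast< std::uintptr_t >( p ) & 15u ) == 0;
}
```

With such a type, stack and static instances are 16-byte aligned; heap storage (including containers) is a separate matter and depends on the allocator.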

Siavash wrote:
Eigen looks a little bloated and I don't have any ideas, but I'll have a look 8)

Well, using it as-is would probably not be that interesting but there are interesting things. For instance it should be possible to pick the two (SSE and Altivec) SIMD-abstracting files to have a robust and cross-platform working base. Then some investigation of the internals could give good insight into how to properly accelerate the bits of linear algebra used in utMath.h


 Post subject: Re: NOS PACK
PostPosted: 11.04.2009, 14:37 
Offline

Joined: 21.08.2008, 11:44
Posts: 354
fullmetalcoder wrote:
Alignment is a requirement for getting a performance gain through vectorization, I guess. It should not be an issue though: wrapping an alignment macro and putting it in front of the Matrix and Vec definitions should do.
Yes, you are right. There aren't any alignment problems with GCC's <vector.h>; MSVC is the problematic one here [because it uses HP's old vector.h]. There is a helpful topic on ompf.org about how to solve the problem with aligned data and the vector header [std::vector] by using btAlignedObjectArray from the Bullet physics library. It's a good idea to support both SSE and AltiVec.


 Post subject: Re: NOS PACK
PostPosted: 12.04.2009, 22:26 
Offline

Joined: 11.04.2009, 08:42
Posts: 14
Location: France
I have been trying to get some vectorization to work (as in "be as accurate and faster than scalar equivalent") in utMath.h and it is very educational indeed.

Some results so far (comparisons between Matrix4f::operator *(float) and Matrix4f::operator *=(float)):

All tests with GCC
* standard optimizations make it very hard to run benchmarks, as the loops are often optimized out if the matrix isn't altered by the iterations
* with no optimizations at all the * operator is approximately twice as fast as the *= operator (yeah, it shocked me too)
* with default optimizations ( -O2 ) the *= operator is approximately 80% faster than the * operator (as expected)
* SSE vectorization leads to worse performance in the * operator, scalar version being 65% faster than vectorized one
* SSE vectorization leads to better performance in the *= operator, vectorized version being almost 5 times as fast as scalar one
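One common way to keep the optimizer from removing such benchmark loops is to feed the inputs and results through volatile variables. The sketch below is only an illustration of that idea, not the harness used for the numbers above; the stand-in Matrix4f, g_sink and timeLoop are hypothetical names.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in with scalar versions of the two operators compared above
struct Matrix4f
{
    float x[16];

    Matrix4f operator*( float f ) const
    {
        Matrix4f m( *this );
        for( int i = 0; i < 16; ++i ) m.x[i] *= f;
        return m;
    }

    Matrix4f &operator*=( float f )
    {
        for( int i = 0; i < 16; ++i ) x[i] *= f;
        return *this;
    }
};

// Writes to a volatile are observable, so the compiler has to keep the work
static volatile float g_sink;

// Calls op( m, scale ) 'iterations' times and returns the elapsed seconds
template< class Op >
static double timeLoop( Matrix4f &m, int iterations, float scale, Op op )
{
    volatile float s = scale;  // read at run time each iteration, not constant-folded
    auto start = std::chrono::steady_clock::now();
    for( int i = 0; i < iterations; ++i )
    {
        op( m, s );
        g_sink = m.x[i & 15];  // side effect that depends on the result
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration< double >( stop - start ).count();
}
```

Usage would be along the lines of timeLoop( a, 1000000, 1.0f, []( Matrix4f &m, float f ) { m = m * f; } ) versus the same call with m *= f. The volatile accesses add a small constant cost to both variants, so only the relative numbers are meaningful.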

As a side note, I have crafted a small vectorize.h which wraps SSE and Altivec, provides a generic alignment macro and detects vectorization capability (based on compiler macros though, no run-time checks).

I would like to do some more investigations but unfortunately exams are getting real close so that will probably be postponed...


 Post subject: Re: NOS PACK
PostPosted: 13.04.2009, 02:29 
Offline

Joined: 11.04.2009, 08:42
Posts: 14
Location: France
After some tests it appears that any work on vectorization for Horde3D will be useless as long as the engine cannot cope with aligned vectors/matrices. I have tried replacing std::vector<> with an equivalent container which works properly with aligned structures but it does not solve anything: rendering is badly broken. The cause probably lies in egGeometry and egRenderer (the way vertex data is stored and passed to OpenGL is likely to be the biggest issue).

Quite sad, because I had been able to speed many things up in utMath.h... Vectorization is on hold until someone (probably not me, given my schedule for the weeks to come) solves these alignment issues...


 Post subject: Re: NOS PACK
PostPosted: 13.04.2009, 14:12 
Offline
Engine Developer

Joined: 10.09.2006, 15:52
Posts: 1217
Fullmetalcoder and Siavash, thanks for your promising efforts on optimizing our low-level math.

fullmetalcoder wrote:
* with no optimizations at all the * operator is approximately twice as fast as the *= operator (yeah, it shocked me too)
* with default optimizations ( -O2 ) the *= operator is approximately 80% faster than the * operator (as expected)
* SSE vectorization leads to worse performance in the * operator, scalar version being 65% faster than vectorized one
* SSE vectorization leads to better performance in the *= operator, vectorized version being almost 5 times as fast as scalar one

These are very interesting numbers. I wonder how a function like
Code:
static void fastMult43( Matrix4f &dst, const Matrix4f &m1, const Matrix4f &m2 )
compares to the operator-based multiplication functions. Also, to give the compiler more flexibility when optimizing the functions, it would make sense to declare the pointers to the matrices as restricted (__restrict keyword in VC).
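As a rough illustration of the restrict idea, here is one building block of a matrix multiply written on raw pointers. The helper name combineColumns is made up for this sketch; __restrict is a compiler extension spelled the same way in VC and GCC, and it is a promise from the caller, not something the compiler checks.

```cpp
#include <cassert>

// dst = c0*s0 + c1*s1 + c2*s2 for one 4-float column.
// __restrict tells the compiler that dst does not overlap any of the inputs,
// so it may keep the inputs in registers and vectorize the loop freely.
static void combineColumns( float * __restrict dst,
                            const float * __restrict c0,
                            const float * __restrict c1,
                            const float * __restrict c2,
                            float s0, float s1, float s2 )
{
    for( int i = 0; i < 4; ++i )
        dst[i] = c0[i] * s0 + c1[i] * s1 + c2[i] * s2;
}
```

This matches the existing "dst may not be the same as m1 or m2" note in fastMult43: the restriction is already part of the contract, the keyword just makes it visible to the optimizer.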

fullmetalcoder wrote:
As a side note, I have crafted a small vectorize.h which wraps SSE and Altivec, provides a generic alignment macro and detects vectorization capability (based on compiler macros though, no run-time checks).

That sounds great. We could add a new file utFastMath.h that contains highly optimized (vectorized) versions of common math functions (e.g. fastMult43 would be moved there). If you want to share some of your code it could be a good start.

fullmetalcoder wrote:
After some tests it appears that any work on vectorization for Horde3D will be useless as long as the engine cannot cope with aligned vectors/matrices. I have tried replacing std::vector<> with an equivalent container which works properly with aligned structures but it does not solve anything: rendering is badly broken. The cause probably lies in egGeometry and egRenderer (the way vertex data is stored and passed to OpenGL is likely to be the biggest issue).

I guess the problem here is that we use 3-component vectors so far and if you align the vectors for SIMD (16 byte alignment) you get 4-component vectors.
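The size change can be shown in a few lines. This is only a sketch with stand-in types (Vec3fAligned and byteOffset are hypothetical names, and standard alignas is used in place of a compiler-specific macro): any code that hard-codes 12 bytes per vector computes wrong offsets once the type is padded to 16.

```cpp
#include <cassert>
#include <cstddef>

// Plain 3-component vector: 3 * 4 = 12 bytes
struct Vec3f
{
    float x, y, z;
};

// Hypothetical SIMD-friendly variant: 16-byte alignment pads the size to 16,
// which is effectively a 4-component vector in memory
struct alignas( 16 ) Vec3fAligned
{
    float x, y, z;
};

static_assert( sizeof( Vec3f ) == 12, "three packed floats" );
static_assert( sizeof( Vec3fAligned ) == 16, "padded up to the alignment" );

// Byte offset of element i in a tightly packed array of T
template< class T >
static std::size_t byteOffset( std::size_t i )
{
    return i * sizeof( T );
}
```

For example, element 10 starts at byte 120 in a Vec3f array but at byte 160 in a Vec3fAligned array.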


 Post subject: Re: NOS PACK
PostPosted: 13.04.2009, 16:30 
Offline

Joined: 11.04.2009, 08:42
Posts: 14
Location: France
marciano wrote:
These are very interesting numbers. I wonder how a function like
Code:
static void fastMult43( Matrix4f &dst, const Matrix4f &m1, const Matrix4f &m2 )
compares to the operator-based multiplication functions. Also, to give the compiler more flexibility when optimizing the functions, it would make sense to declare the pointers to the matrices as restricted (__restrict keyword in VC).

Before hitting the "alignment wall" I had tested vectorization of the following operations:
* Vec3f cross product : scalar version 50% faster (mostly due to expensive shuffling so the numbers could be very different with AltiVec)
* Vec3f dot product : no noticeable difference between scalar and vectorized code
* Matrix4f * Vec3f : vectorization makes it twice as fast
* Matrix4f * Vec3f (33) : same result
* Matrix4f * Matrix4f : the speedup was so big (1000%) that I suspect benchmark inaccuracy

marciano wrote:
If you want to share some of your code it could be a good start.

I'll make the vectorize.h available under the DWTFYW license and the tweaked utMath.h under LGPL but first I'd like to do some more tests to check that it does not break anything and I'd like to make sure the alignment issue is solved or it will be pretty much useless...

marciano wrote:
I guess the problem here is that we use 3-component vectors so far and if you align the vectors for SIMD (16 byte alignment) you get 4-component vectors.

Indeed. I actually tried many things to figure out what detail triggered the rendering hell caused by alignment enforcing policy of Vec3f and it turns out that simply adding an extra float field leads to the exact same results, which means two things :
* it should be relatively easy (especially for you and other people familiar with Horde internals) to fix the spots where this change brings side effects
* parts of the code base are not as robust as they should be. I have noticed that you used VBOs intensively and most of the troubles are likely to come from there but I have been unable to find the right offset/stride tweaking to fix egRenderer.cpp


 Post subject: Re: NOS PACK
PostPosted: 13.04.2009, 17:28 
Offline

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia
fullmetalcoder wrote:
* parts of the code base are not as robust as they should be. I have noticed that you used VBOs intensively and most of the troubles are likely to come from there but I have been unable to find the right offset/stride tweaking to fix egRenderer.cpp
It would be best to re-align (un-align?) the data before packing into a VBO. Anything that puts data into a VBO would need an "Aligned Vec4 to Unaligned Vec3/4" conversion function.
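Such a conversion function could be as simple as the sketch below. It assumes a hypothetical 16-byte aligned CPU-side type (Vec3fAligned) and a made-up helper name packVec3; the actual upload to the VBO is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side type: three floats padded to 16 bytes
struct alignas( 16 ) Vec3fAligned
{
    float x, y, z;
};

// Copies aligned vectors into a tightly packed float array (3 floats per
// vertex, no padding), which is the layout a 12-byte stride expects.
static std::vector< float > packVec3( const Vec3fAligned *src, std::size_t count )
{
    std::vector< float > packed;
    packed.reserve( count * 3 );
    for( std::size_t i = 0; i < count; ++i )
    {
        packed.push_back( src[i].x );
        packed.push_back( src[i].y );
        packed.push_back( src[i].z );
    }
    return packed;
}
```

The packed array (packed.size() * sizeof( float ) bytes) is what would then be handed to the buffer upload. The cost is one extra copy per upload, which matters mostly for data that changes every frame.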


 Post subject: Re: NOS PACK
PostPosted: 13.04.2009, 18:06 
Offline

Joined: 11.04.2009, 08:42
Posts: 14
Location: France
DarkAngel wrote:
It would be best to re-align (un-align?) the data before packing into a VBO. Anything that puts data into a VBO would need an "Aligned Vec4 to Unaligned Vec3/4" conversion function.

I'm not sure alignment is an issue in VBOs. I'm quite sure, however, that the way vector data is handled is one: the offsets/strides used in egRenderer.cpp do not take structure size into account but instead make the (unsafe) assumption that the compiler will pack the data exactly the way it is declared. The code does not even check the size of Vec3 and makes the (even more unsafe) assumption that it is that of three floats. So anyone with knowledge of the internals should be able to solve this with a couple of sizeof() calls.

Now of course it would be possible to convert the data when sending it to (or reading it from) the GPU, but it would probably end up wasting a lot of cycles with no obvious advantage.

edit : I may have fixed it at last. I'll do some more tests and keep you informed.

In egRenderer.cpp, somewhere around line 1630, modify the code to look like this (no additions/deletions, just some parameter changes):

Code:

         glVertexAttribPointer( 1, 3, GL_FLOAT, GL_FALSE, sizeof(Vec3f), (char *)0 + vertCount * sizeof(Vec3f) );
         glVertexAttribPointer( 2, 3, GL_FLOAT, GL_FALSE, sizeof(Vec3f), (char *)0 + vertCount * sizeof(Vec3f) * 2 );
         glVertexAttribPointer( 3, 3, GL_FLOAT, GL_FALSE, sizeof(Vec3f), (char *)0 + vertCount * sizeof(Vec3f) * 3 );
         
         glVertexAttribPointer( 4, 4, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 + 8 );
         glVertexAttribPointer( 5, 4, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 + 24 );
         glVertexAttribPointer( 6, 2, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 );
         glVertexAttribPointer( 7, 2, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 + 40 );
         


At least it now runs fine with a Vec3f having four float fields. I'll see if alignment breaks things.

edit 2 : alignment is apparently not a concern, but particle rendering is still broken and I have trouble figuring out how to fix that (looks like glUniform does not take a stride argument :( ). Any hints?


 Post subject: Re: NOS PACK
PostPosted: 13.04.2009, 19:01 
Offline
Engine Developer

Joined: 10.09.2006, 15:52
Posts: 1217
I quickly fixed the problem before I saw that you had come up with the same solution :)
But the particles are also working in the svn version...

