Horde3D • View topic

All times are UTC + 1 hour

NOS PACK

Siavash

Post subject: Re: NOS PACK

Posted: 04.04.2009, 05:27

Joined: 21.08.2008, 11:44
Posts: 354

Apologies for the long delay; I'm currently a little busy with uni.

Top

Siavash

Post subject: Re: NOS PACK

Posted: 04.04.2009, 15:08

Joined: 21.08.2008, 11:44
Posts: 354

After some googling, it seems there is no need to vectorize all of utMath: vectorizing the vector/quaternion classes brings no gain because of the extra CPU/memory loads and stores. It's better to focus on the Matrix class. IMHO this is where utMath can gain the most performance.

Top

Siavash

Post subject: Re: NOS PACK

Posted: 07.04.2009, 14:03

Joined: 21.08.2008, 11:44
Posts: 354

Hi, would anybody be willing to test this piece of code?

Code:

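// Matrix4f class @ utMath.h
#include <xmmintrin.h>   // SSE intrinsics; belongs at the top of utMath.h, outside the class

   static void fastMult43( Matrix4f &dst, const Matrix4f &m1, const Matrix4f &m2 )
   {
      // Note: dst must not be the same matrix as m1 or m2
      // Each block below computes four consecutive floats of dst from the
      // preloaded 4-float groups of m1 and one broadcast scalar of m2 per term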

      float *dstx = dst.x;
      const float *m1x = m1.x;
      const float *m2x = m2.x;

      __m128 temp1,temp2,temp3,temp4;

      __m128 m1x0=_mm_loadu_ps(m1x);
      __m128 m1x4=_mm_loadu_ps(m1x+4);
      __m128 m1x8=_mm_loadu_ps(m1x+8);
      __m128 m1x12=_mm_loadu_ps(m1x+12);

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[0]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[1]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[2]));
      _mm_storeu_ps(dstx,_mm_add_ps(temp1,_mm_add_ps(temp2,temp3)));

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[4]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[5]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[6]));
      _mm_storeu_ps(dstx+4,_mm_add_ps(temp1,_mm_add_ps(temp2,temp3)));

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[8]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[9]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[10]));
      _mm_storeu_ps(dstx+8,_mm_add_ps(temp1,_mm_add_ps(temp2,temp3)));

      temp1=_mm_mul_ps(m1x0,_mm_load1_ps(&m2x[12]));
      temp2=_mm_mul_ps(m1x4,_mm_load1_ps(&m2x[13]));
      temp3=_mm_mul_ps(m1x8,_mm_load1_ps(&m2x[14]));
      temp4=_mm_mul_ps(m1x12,_mm_load1_ps(&m2x[15]));
      _mm_storeu_ps(dstx+12,_mm_add_ps(temp1,_mm_add_ps(temp2,_mm_add_ps(temp3,temp4))));

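      // The fourth lane of each store above is not part of the 4x3 product;
      // overwrite those entries with 0, 0, 0, 1
      dstx[3]=0.0f;
      dstx[7]=0.0f;
      dstx[11]=0.0f;
      dstx[15]=1.0f;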
   }

I just wanted to know whether this is an efficient way of coding it.
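One way to test a routine like this is to compare it against a plain scalar version on the same inputs. The sketch below is only an illustration: `Mat4` is a stand-in struct with a `float x[16]` array (not the real Horde3D `Matrix4f`), and `mult43Scalar`, `mult43Simd` and `sameResult` are hypothetical names. When SSE is not available at compile time, the SIMD path simply falls back to the scalar one.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#endif

// Stand-in for Matrix4f: 16 floats, indexed like the x[] array in the post above.
struct Mat4 { float x[16]; };

// Scalar reference: same arithmetic as the SIMD version, one float at a time.
void mult43Scalar( Mat4 &dst, const Mat4 &m1, const Mat4 &m2 )
{
   assert( &dst != &m1 && &dst != &m2 );   // dst must not alias the inputs
   for( int col = 0; col < 4; ++col )
   {
      const int terms = ( col == 3 ) ? 4 : 3;   // only the last group uses m1.x[12..14]
      for( int row = 0; row < 3; ++row )
      {
         float sum = 0.0f;
         for( int k = 0; k < terms; ++k )
            sum += m1.x[k * 4 + row] * m2.x[col * 4 + k];
         dst.x[col * 4 + row] = sum;
      }
      dst.x[col * 4 + 3] = ( col == 3 ) ? 1.0f : 0.0f;
   }
}

// SIMD version, written as a loop over the four groups of dst.
void mult43Simd( Mat4 &dst, const Mat4 &m1, const Mat4 &m2 )
{
#ifdef HAVE_SSE
   const __m128 c0 = _mm_loadu_ps( m1.x );
   const __m128 c1 = _mm_loadu_ps( m1.x + 4 );
   const __m128 c2 = _mm_loadu_ps( m1.x + 8 );
   const __m128 c3 = _mm_loadu_ps( m1.x + 12 );
   for( int col = 0; col < 4; ++col )
   {
      const float *m = m2.x + col * 4;
      __m128 r = _mm_add_ps( _mm_mul_ps( c0, _mm_set1_ps( m[0] ) ),
                 _mm_add_ps( _mm_mul_ps( c1, _mm_set1_ps( m[1] ) ),
                             _mm_mul_ps( c2, _mm_set1_ps( m[2] ) ) ) );
      if( col == 3 ) r = _mm_add_ps( r, _mm_mul_ps( c3, _mm_set1_ps( m[3] ) ) );
      _mm_storeu_ps( dst.x + col * 4, r );
   }
   dst.x[3] = dst.x[7] = dst.x[11] = 0.0f;
   dst.x[15] = 1.0f;
#else
   mult43Scalar( dst, m1, m2 );
#endif
}

// Returns true when both versions agree on a simple test input.
bool sameResult()
{
   Mat4 a, b, s, v;
   for( int i = 0; i < 16; ++i )
   {
      a.x[i] = 0.25f * float( i + 1 );
      b.x[i] = 0.5f * float( 16 - i );
   }
   mult43Scalar( s, a, b );
   mult43Simd( v, a, b );
   for( int i = 0; i < 16; ++i )
      if( std::fabs( s.x[i] - v.x[i] ) > 1e-4f ) return false;
   return true;
}
```

A correctness check like this says nothing about speed, but it makes it safer to rearrange the intrinsics while benchmarking.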

Top

Siavash

Post subject: Re: NOS PACK

Posted: 10.04.2009, 04:10

Joined: 21.08.2008, 11:44
Posts: 354

Hi, yesterday I benchmarked the utMath Beta3 and utMath SIMD versions using Intel VTune. Unfortunately, the utMath Beta2 SIMD version is ~2x slower than utMath Beta2/3. There isn't any way to make it faster without changing some parts of the engine.

Code:

// Total time on a Pentium III 750
//              utMath beta3    utMath SIMD
determinant :        31              79
fastMult43  :        33              67
inverted    :       160             407

I'm sorry, I've failed to optimize utMath. It looks like utMath is not a good place to optimize the engine; as oldman said: "Just optimize the inner loops".

I'll prepare an optimized version of utMath that only uses the __m128 data type [NOS PACK]. [until summer]

Top

fullmetalcoder

Post subject: Re: NOS PACK

Posted: 11.04.2009, 12:24

Joined: 11.04.2009, 08:42
Posts: 14
Location: France

I think the biggest potential for explicit vectorization lies in matrix/vector operations (+, *, dot product, ...), which are not covered by your tests, so the tests do not help in figuring out the potential of explicit vectorization of utMath.h

It could be interesting to use Eigen either as-is or as a starting basis to craft an efficient vectorized version of utMath (the interesting point is that Eigen is mature, very efficient and extremely portable so it would reduce the hassle). By the way, Eigen is licensed under LGPL so licensing isn't a problem.

Top

Siavash

Post subject: Re: NOS PACK

Posted: 11.04.2009, 14:05

Joined: 21.08.2008, 11:44
Posts: 354

fullmetalcoder wrote:

I think the biggest potential for explicit vectorization lies in matrix/vector operations (+, *, dot product, ...)

Yes, you are right and I agree with you.

fullmetalcoder wrote:

which is not covered by your tests so they do not offer help figuring out the potential of explicit vectorization of utMath.h

I've performed full benchmarks with utMath_rc6.2 [SIMD], and in most cases it looks slower than the original one, because of the unaligned data used to work around the problem with HP's <vector.h> [included in MSVC] and the extra CPU/memory loads and stores.

fullmetalcoder wrote:

It could be interesting to use Eigen either as-is or as a starting basis to craft an efficient vectorized version of utMath (the interesting point is that Eigen is mature, very efficient and extremely portable so it would reduce the hassle). By the way, Eigen is licensed under LGPL so licensing isn't a problem.

Eigen looks a little bloated and I don't have any ideas, but I'll have a look

Top

fullmetalcoder

Post subject: Re: NOS PACK

Posted: 11.04.2009, 14:22

Joined: 11.04.2009, 08:42
Posts: 14
Location: France

Siavash wrote:

Alignment is a requirement for obtaining a performance gain through vectorization, I guess. It should not be an issue though: wrapping an alignment macro and putting it in front of the Matrix and Vec definitions should do. Another important point is the NEED to reduce the number of loads/stores as much as possible in favor of shuffles, because SSE register I/O is slow...
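A minimal sketch of such an alignment macro follows. The macro name `UT_ALIGN16` and the types `Vec4A` and `Mat4A` are made up for this example and are not part of utMath.h. On compilers with C++11 support the macro expands to `alignas`; the other two branches are the compiler-specific spellings that were common before that.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical alignment macro, placed between "struct" and the type name.
#if __cplusplus >= 201103L || ( defined( _MSC_VER ) && _MSC_VER >= 1900 )
#  define UT_ALIGN16 alignas( 16 )
#elif defined( _MSC_VER )
#  define UT_ALIGN16 __declspec( align( 16 ) )
#else
#  define UT_ALIGN16 __attribute__(( aligned( 16 ) ))
#endif

// Stand-in types, not the real utMath definitions.
struct UT_ALIGN16 Vec4A { float x, y, z, w; };
struct UT_ALIGN16 Mat4A { float x[16]; };

// True when the address is a multiple of 16 bytes.
bool isAligned16( const void *p )
{
   return reinterpret_cast< std::uintptr_t >( p ) % 16 == 0;
}

// Stack objects of the aligned types should satisfy the aligned SSE load/store requirement.
bool alignedTypesAreAligned()
{
   Vec4A v = { 1.0f, 2.0f, 3.0f, 4.0f };
   Mat4A m = {};
   assert( sizeof( Vec4A ) == 16 );
   return isAligned16( &v ) && isAligned16( &m );
}
```

Note that this only covers objects the compiler places itself (stack, globals, members); heap storage such as the buffer of a container is a separate problem.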

Siavash wrote:

Eigen looks a little bloated and I don't have any ideas, but I'll have a look

Well, using it as-is would probably not be that interesting, but there are interesting things in it. For instance, it should be possible to pick the two (SSE and AltiVec) SIMD-abstracting files to have a robust and cross-platform working base. Then some investigation of the internals could give good insight into how to properly accelerate the bits of linear algebra used in utMath.h

Top

Siavash

Post subject: Re: NOS PACK

Posted: 11.04.2009, 14:37

Joined: 21.08.2008, 11:44
Posts: 354

fullmetalcoder wrote:

Yes, you are right. There aren't any alignment problems with GCC's <vector.h>, but MSVC is problematic here [because it uses HP's old vector.h]. There is a helpful topic on ompf.org about how to solve the problem with aligned data and the vector header [std::vector] by using btAlignedObjectArray from the Bullet physics library. It's a good idea to support both SSE and AltiVec.
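For illustration, here is a sketch of one other common way to get 16-byte-aligned storage out of std::vector: a custom allocator that over-allocates and rounds the pointer up. This is not Bullet's btAlignedObjectArray and not code from the ompf.org topic; `Aligned16Allocator`, `Vec4` and `vectorStorageIsAligned` are hypothetical names, and the sketch assumes a C++11 standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Minimal allocator that hands out 16-byte-aligned blocks.
template< class T >
struct Aligned16Allocator
{
   typedef T value_type;

   Aligned16Allocator() {}
   template< class U > Aligned16Allocator( const Aligned16Allocator< U > & ) {}

   T *allocate( std::size_t n )
   {
      // Over-allocate, round up to a 16-byte boundary, and remember the raw
      // pointer in the slot just before the aligned block.
      void *raw = std::malloc( n * sizeof( T ) + 16 + sizeof( void * ) );
      if( !raw ) throw std::bad_alloc();
      const std::uintptr_t start = reinterpret_cast< std::uintptr_t >( raw ) + sizeof( void * );
      const std::uintptr_t aligned = ( start + 15 ) & ~static_cast< std::uintptr_t >( 15 );
      assert( aligned % 16 == 0 );
      reinterpret_cast< void ** >( aligned )[-1] = raw;
      return reinterpret_cast< T * >( aligned );
   }

   void deallocate( T *p, std::size_t )
   {
      std::free( reinterpret_cast< void ** >( p )[-1] );   // free the raw block
   }
};

template< class T, class U >
bool operator==( const Aligned16Allocator< T > &, const Aligned16Allocator< U > & ) { return true; }
template< class T, class U >
bool operator!=( const Aligned16Allocator< T > &, const Aligned16Allocator< U > & ) { return false; }

struct Vec4 { float x, y, z, w; };

// Checks that the vector's element storage starts on a 16-byte boundary.
bool vectorStorageIsAligned()
{
   std::vector< Vec4, Aligned16Allocator< Vec4 > > v( 100 );
   return reinterpret_cast< std::uintptr_t >( v.data() ) % 16 == 0;
}
```

The allocator only fixes where the buffer starts; the elements stay aligned only if `sizeof` of the element type is itself a multiple of 16.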

Top

fullmetalcoder

Post subject: Re: NOS PACK

Posted: 12.04.2009, 22:26

Joined: 11.04.2009, 08:42
Posts: 14
Location: France

I have been trying to get some vectorization to work (as in "be as accurate and faster than scalar equivalent") in utMath.h and it is very educational indeed.

Some results thus far (comparisons between Matrix4f::operator *(float) and Matrix4f::operator *= (float) ) :

All tests with GCC
* standard optimizations make it very hard to run benchmarks, as the loops are often optimized out if the matrix isn't altered by the iterations
* with no optimizations at all the * operator is approximately twice as fast as the *= operator (yeah, it shocked me too)
* with default optimizations ( -O2 ) the *= operator is approximately 80% faster than the * operator (as expected)
* SSE vectorization leads to worse performance in the * operator, scalar version being 65% faster than vectorized one
* SSE vectorization leads to better performance in the *= operator, vectorized version being almost 5 times as fast as scalar one
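As a sketch of how a micro-benchmark can keep an optimizer from discarding its loop: make every iteration depend on the previous one, and publish the final result through a `volatile` variable. The names `Mat4`, `scaleInPlace`, `timeScale` and `g_sink` are invented for this example, and the loop body is only the scalar shape of an in-place scale, not the real `Matrix4f::operator*=`. A volatile sink makes the result observable, but it does not by itself guarantee that the compiler leaves the work between the two timer reads, so numbers from a harness like this still need a sanity check.

```cpp
#include <cassert>
#include <chrono>

struct Mat4 { float x[16]; };   // stand-in for Matrix4f

// In-place scale of all 16 elements.
inline void scaleInPlace( Mat4 &m, float f )
{
   for( int i = 0; i < 16; ++i ) m.x[i] *= f;
}

volatile float g_sink = 0.0f;   // writes to a volatile object cannot be dropped

// Runs `iterations` scalings and returns the elapsed time in seconds.
double timeScale( int iterations )
{
   assert( iterations >= 0 );
   Mat4 m;
   for( int i = 0; i < 16; ++i ) m.x[i] = 1.0f;

   const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
   for( int it = 0; it < iterations; ++it )
   {
      // Alternate between 2 and 0.5: values stay bounded, and each
      // iteration consumes the output of the previous one.
      scaleInPlace( m, ( it & 1 ) ? 0.5f : 2.0f );
   }
   const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

   float sum = 0.0f;
   for( int i = 0; i < 16; ++i ) sum += m.x[i];
   g_sink = sum;   // publish the result so the loop has an observable effect

   return std::chrono::duration< double >( t1 - t0 ).count();
}
```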

As a side note, I have crafted a small vectorize.h which wraps SSE and AltiVec, provides a generic alignment macro and detects vectorization capability (based on compiler macros though, no run-time checks).
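The vectorize.h file itself is not shown in the thread. Purely as an illustration of what compile-time detection based on compiler macros can look like, here is a small sketch; every `VEC_*` name, `Float4`, `add4` and `backendName` are hypothetical, and the AltiVec branch only records the detection and uses the scalar path.

```cpp
#include <cassert>

// Compile-time detection from predefined compiler macros (no run-time CPU check).
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define VEC_SSE 1
#  define VEC_NAME "SSE"
#elif defined(__ALTIVEC__)
#  define VEC_ALTIVEC 1          // a real header would include <altivec.h> here
#  define VEC_NAME "AltiVec"
#else
#  define VEC_NAME "scalar"
#endif

// Generic alignment macro (C++11 spelling).
#define VEC_ALIGN16 alignas( 16 )

struct VEC_ALIGN16 Float4 { float v[4]; };

// Adds two 4-float vectors with whichever backend was detected.
inline Float4 add4( const Float4 &a, const Float4 &b )
{
   Float4 r;
#if defined( VEC_SSE )
   // Aligned loads/stores are valid because Float4 is 16-byte aligned
   _mm_store_ps( r.v, _mm_add_ps( _mm_load_ps( a.v ), _mm_load_ps( b.v ) ) );
#else
   for( int i = 0; i < 4; ++i ) r.v[i] = a.v[i] + b.v[i];
#endif
   return r;
}

// Name of the backend selected at compile time.
inline const char *backendName()
{
   const char *name = VEC_NAME;
   assert( name[0] != '\0' );
   return name;
}
```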

I would like to do some more investigations but unfortunately exams are getting real close so that will probably be postponed...

Top

fullmetalcoder

Post subject: Re: NOS PACK

Posted: 13.04.2009, 02:29

Joined: 11.04.2009, 08:42
Posts: 14
Location: France

After some tests it appears that any work on vectorization for Horde3D will be useless as long as the engine cannot cope with aligned vectors/matrices. I have tried replacing std::vector<> with an equivalent container which works properly with aligned structures, but it does not solve anything: rendering is badly broken. The cause probably lies in egGeometry and egRenderer (the way vertex data is stored and passed to OpenGL is likely to be the biggest issue).

Quite sad, because I had been able to speed many things up in utMath.h ... Vectorization is on hold until someone (probably not me, given my schedule for the weeks to come) solves these alignment issues...

Top

marciano

Post subject: Re: NOS PACK

Posted: 13.04.2009, 14:12

Engine Developer

Joined: 10.09.2006, 15:52
Posts: 1217

Fullmetalcoder and Siavash, thanks for your promising efforts on optimizing our low-level math.

fullmetalcoder wrote:

* with no optimizations at all the * operator is approximately twice as fast as the *= operator (yeah, it shocked me too)
* with default optimizations ( -O2 ) the *= operator is approximately 80% faster than the * operator (as expected)
* SSE vectorization leads to worse performance in the * operator, scalar version being 65% faster than vectorized one
* SSE vectorization leads to better performance in the *= operator, vectorized version being almost 5 times as fast as scalar one

These are very interesting numbers. I wonder how a function like

Code:

static void fastMult43( Matrix4f &dst, const Matrix4f &m1, const Matrix4f &m2 )

compares to the operator-based multiplication functions. Also, to give the compiler more flexibility when optimizing the functions, it would make sense to declare the pointers to the matrices as restricted (__restrict keyword in VC).
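A small sketch of what restrict-qualified parameters look like. `__restrict` is accepted as an extension by MSVC, GCC and Clang; the function `scale16` is a made-up example and not part of utMath.h.

```cpp
#include <cassert>

// The __restrict qualifiers promise that dst and src never overlap, so the
// compiler does not have to reload src after each store to dst.
void scale16( float *__restrict dst, const float *__restrict src, float f )
{
   assert( dst != src );   // calling this with aliased pointers breaks the promise
   for( int i = 0; i < 16; ++i )
      dst[i] = src[i] * f;
}
```

The qualifier is a promise made by the caller, not something the compiler verifies, which fits the existing "dst must differ from m1 and m2" precondition of fastMult43.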

fullmetalcoder wrote:

As a side note, I have crafted a small vectorize.h which wraps SSE and AltiVec, provides a generic alignment macro and detects vectorization capability (based on compiler macros though, no run-time checks).

That sounds great. We could add a new file utFastMath.h that contains highly optimized (vectorized) versions of common math functions (e.g. fastMult43 would be moved there). If you want to share some of your code it could be a good start.

fullmetalcoder wrote:

I guess the problem here is that we use 3-component vectors so far and if you align the vectors for SIMD (16 byte alignment) you get 4-component vectors.
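The size change can be seen directly with `sizeof`. In this sketch `Vec3Packed` and `Vec3Aligned` are stand-ins for the two possible Vec3f layouts, not the engine's types, and `byteOffset` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

struct Vec3Packed { float x, y, z; };                 // three floats, 12 bytes
struct alignas( 16 ) Vec3Aligned { float x, y, z; };  // padded up to 16 bytes

// Byte offset of element i in an array of T. A hard-coded "i * 12" is only
// right for the packed layout.
template< class T >
std::size_t byteOffset( std::size_t i )
{
   return i * sizeof( T );
}

// Sanity check of the two layouts (assumes 4-byte floats).
void checkStrides()
{
   assert( sizeof( Vec3Packed ) == 3 * sizeof( float ) );
   assert( sizeof( Vec3Aligned ) == 4 * sizeof( float ) );
}
```

Any code that walks an array of vectors in steps of three floats will therefore drift by four bytes per element once the type is aligned.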

Top

fullmetalcoder

Post subject: Re: NOS PACK

Posted: 13.04.2009, 16:30

Joined: 11.04.2009, 08:42
Posts: 14
Location: France

marciano wrote:

These are very interesting numbers. I wonder how a function like

Code:

static void fastMult43( Matrix4f &dst, const Matrix4f &m1, const Matrix4f &m2 )

Before hitting the "alignment wall" I had tested vectorization of the following operations :
* Vec3f cross product : scalar version 50% faster (mostly due to expensive shuffling so the numbers could be very different with AltiVec)
* Vec3f dot product : no noticeable difference between scalar and vectorized code
* Matrix4f * Vec3f : vectorization makes it twice as fast
* Matrix4f * Vec3f (33) : idem
* Matrix4f * Matrix4f : the speedup was so big (1000%) that I suspect benchmark inaccuracy
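To show where the shuffle cost in the cross product comes from, here is a sketch of one common SSE formulation next to the scalar version. It is not the code that was benchmarked above; `Vec3`, `crossScalar` and `crossSimd` are hypothetical names, and the SIMD path falls back to the scalar one when SSE is not available at compile time.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#endif

struct Vec3 { float x, y, z; };   // stand-in for Vec3f

Vec3 crossScalar( const Vec3 &a, const Vec3 &b )
{
   Vec3 r = { a.y * b.z - a.z * b.y,
              a.z * b.x - a.x * b.z,
              a.x * b.y - a.y * b.x };
   return r;
}

Vec3 crossSimd( const Vec3 &a, const Vec3 &b )
{
#ifdef HAVE_SSE
   // Vec3 is only 12 bytes, so the registers are built lane by lane
   // instead of with a 16-byte load.
   const __m128 va = _mm_set_ps( 0.0f, a.z, a.y, a.x );
   const __m128 vb = _mm_set_ps( 0.0f, b.z, b.y, b.x );

   // Four shuffles: (y, z, x) and (z, x, y) permutations of both inputs
   const __m128 a_yzx = _mm_shuffle_ps( va, va, _MM_SHUFFLE( 3, 0, 2, 1 ) );
   const __m128 a_zxy = _mm_shuffle_ps( va, va, _MM_SHUFFLE( 3, 1, 0, 2 ) );
   const __m128 b_yzx = _mm_shuffle_ps( vb, vb, _MM_SHUFFLE( 3, 0, 2, 1 ) );
   const __m128 b_zxy = _mm_shuffle_ps( vb, vb, _MM_SHUFFLE( 3, 1, 0, 2 ) );

   // a_yzx * b_zxy - a_zxy * b_yzx gives the cross product in lanes 0..2
   const __m128 r = _mm_sub_ps( _mm_mul_ps( a_yzx, b_zxy ), _mm_mul_ps( a_zxy, b_yzx ) );

   float out[4];
   _mm_storeu_ps( out, r );
   Vec3 res = { out[0], out[1], out[2] };
   return res;
#else
   return crossScalar( a, b );
#endif
}

// Compares both versions on one input.
bool crossVersionsAgree()
{
   const Vec3 a = { 1.0f, 2.0f, 3.0f };
   const Vec3 b = { 4.0f, 5.0f, 6.0f };
   const Vec3 s = crossScalar( a, b );
   const Vec3 v = crossSimd( a, b );
   assert( s.x == -3.0f && s.y == 6.0f && s.z == -3.0f );
   return s.x == v.x && s.y == v.y && s.z == v.z;
}
```

Two multiplies and one subtract are surrounded by four shuffles plus the packing and unpacking of the 3-float vectors, which is consistent with the scalar version holding its own in this case.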

marciano wrote:

If you want to share some of your code it could be a good start.

I'll make the vectorize.h available under the DWTFYW license and the tweaked utMath.h under LGPL but first I'd like to do some more tests to check that it does not break anything and I'd like to make sure the alignment issue is solved or it will be pretty much useless...

marciano wrote:

I guess the problem here is that we use 3-component vectors so far and if you align the vectors for SIMD (16 byte alignment) you get 4-component vectors.

Indeed. I actually tried many things to figure out what detail triggered the rendering hell caused by alignment enforcing policy of Vec3f and it turns out that simply adding an extra float field leads to the exact same results, which means two things :
* it should be relatively easy (especially for you and other people familiar with Horde internals) to fix the spots where this change brings side effects
* parts of the code base are not as robust as they should be. I have noticed that you used VBOs intensively and most of the troubles are likely to come from there but I have been unable to find the right offset/stride tweaking to fix egRenderer.cpp

Top

DarkAngel

Post subject: Re: NOS PACK

Posted: 13.04.2009, 17:28

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia

fullmetalcoder wrote:

* parts of the code base are not as robust as they should be. I have noticed that you used VBOs intensively and most of the troubles are likely to come from there but I have been unable to find the right offset/stride tweaking to fix egRenderer.cpp

It would be best to re-align (un-align?) the data before packing into a VBO. Anything that puts data into a VBO would need an "Aligned Vec4 to Unaligned Vec3/4" conversion function.
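A sketch of what such a conversion function could look like, assuming a hypothetical 16-byte-aligned `Vec4Aligned` type on the CPU side and a tightly packed array of three floats per vertex as the upload format. `packXYZ` is an invented name, and the actual upload call is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct alignas( 16 ) Vec4Aligned { float x, y, z, w; };   // hypothetical SIMD-friendly type

// Copies x, y, z of each aligned vector into a packed float array
// (3 floats per vertex, no padding), ready to be handed to a buffer upload.
std::vector< float > packXYZ( const Vec4Aligned *src, std::size_t count )
{
   assert( src != 0 || count == 0 );
   std::vector< float > packed;
   packed.reserve( count * 3 );
   for( std::size_t i = 0; i < count; ++i )
   {
      packed.push_back( src[i].x );
      packed.push_back( src[i].y );
      packed.push_back( src[i].z );
   }
   return packed;
}
```

The cost is one extra pass over the vertex data per upload, in exchange for keeping the GPU-side layout unchanged.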

Top

fullmetalcoder

Post subject: Re: NOS PACK

Posted: 13.04.2009, 18:06

Joined: 11.04.2009, 08:42
Posts: 14
Location: France

DarkAngel wrote:

It would be best to re-align (un-align?) the data before packing into a VBO. Anything that puts data into a VBO would need an "Aligned Vec4 to Unaligned Vec3/4" conversion function.

I'm not sure alignment is an issue in VBOs. I'm quite sure, however, that the way vector data is handled is one: the offsets/strides used in egRenderer.cpp do not take structure size into account but instead make the (unsafe) assumption that the compiler will gently pack the data the way it is declared. The code does not even bother checking the size of Vec3 and makes the (even more unsafe) assumption that it will be that of three floats. So anyone with knowledge of the internals should be able to solve this using a couple of sizeof().

Now of course it would be possible to convert the data when sending it to (or reading it from) the GPU, but it would probably end up wasting a lot of cycles with no obvious advantage.

edit : I may have fixed it at last. I'll do some more tests and keep you informed.

In egRenderer.cpp, somewhere around line 1630, modify the code to look like this (no additions/deletions, just some parameter changes):

Code:

         glVertexAttribPointer( 1, 3, GL_FLOAT, GL_FALSE, sizeof(Vec3f), (char *)0 + vertCount * sizeof(Vec3f) );
         glVertexAttribPointer( 2, 3, GL_FLOAT, GL_FALSE, sizeof(Vec3f), (char *)0 + vertCount * sizeof(Vec3f) * 2 );
         glVertexAttribPointer( 3, 3, GL_FLOAT, GL_FALSE, sizeof(Vec3f), (char *)0 + vertCount * sizeof(Vec3f) * 3 );
         
         glVertexAttribPointer( 4, 4, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 + 8 );
         glVertexAttribPointer( 5, 4, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 + 24 );
         glVertexAttribPointer( 6, 2, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 );
         glVertexAttribPointer( 7, 2, GL_FLOAT, GL_FALSE,
                                sizeof( VertexDataStatic ), (char *)0 + vertCount * sizeof(Vec3f) * 4 + 40 );
         

At least it now runs fine with a Vec3f having four float fields. I'll see if alignment breaks things.
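The remaining hard-coded byte offsets (+ 8, + 24, + 40) could in principle be derived the same way with `offsetof`. The sketch below only illustrates the idea: the `VertexDataStatic` layout shown here is an assumption chosen to match those three offsets, not a copy of the engine's definition, and the helper names are hypothetical. No OpenGL calls are made; the functions just compute the byte offsets that would be passed as the last argument.

```cpp
#include <cassert>
#include <cstddef>

struct Vec3f { float x, y, z; };   // stand-in; may have a fourth float in an aligned build

// Assumed layout, consistent with the + 8 / + 24 / + 40 offsets in the post.
struct VertexDataStatic
{
   float u0, v0;          // offset 0
   float jointVec[4];     // offset 8
   float weightVec[4];    // offset 24
   float u1, v1;          // offset 40
};

// Start of the interleaved static data: it follows four Vec3f streams of
// vertCount elements each, as in the glVertexAttribPointer calls above.
std::size_t staticDataBase( std::size_t vertCount )
{
   return vertCount * sizeof( Vec3f ) * 4;
}

std::size_t jointOffset( std::size_t vertCount )
{
   return staticDataBase( vertCount ) + offsetof( VertexDataStatic, jointVec );
}

std::size_t weightOffset( std::size_t vertCount )
{
   return staticDataBase( vertCount ) + offsetof( VertexDataStatic, weightVec );
}

std::size_t texCoord1Offset( std::size_t vertCount )
{
   return staticDataBase( vertCount ) + offsetof( VertexDataStatic, u1 );
}

// Checks that the assumed layout reproduces the literal offsets (4-byte floats).
void checkAssumedLayout()
{
   assert( offsetof( VertexDataStatic, jointVec ) == 8 );
   assert( offsetof( VertexDataStatic, weightVec ) == 24 );
   assert( offsetof( VertexDataStatic, u1 ) == 40 );
}
```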

edit 2 : alignment is not a concern apparently, but regardless, particle rendering is broken and I am having trouble figuring out how to fix that (looks like glUniform does not take a stride argument). Any hints?

Top

marciano

Post subject: Re: NOS PACK

Posted: 13.04.2009, 19:01

Engine Developer

Joined: 10.09.2006, 15:52
Posts: 1217

I had quickly fixed the problem before I saw that you came up with the same solution.

But the particles are also working in the svn version...

Top
