Horde3D

Next-Generation Graphics Engine

All times are UTC + 1 hour




Posted: 11.10.2008, 03:59

Joined: 21.08.2008, 11:44
Posts: 354
Hi, according to the benchmarks I've run, utmath_rc4 is a bit slower than the original utmath, so I've optimized utmath_rc4. It is now roughly 1:1 with the original utmath [~5% slower] in the low-level functions [add, sub, mul] and at least 10% faster in the other functions.

I have a small question: is there any performance difference between the SSE units of my old PIII and those of the new Core2Duo series? I want to know whether my SSE code will run faster than the original utmath [FPU] on new CPUs, or whether the benchmark results will stay at 1:1. [I've googled a bit about this and it seems that AMD's SSE units are a bit slower than Intel's.]

Thanks for your time :wink:


Posted: 11.10.2008, 19:48
Tool Developer

Joined: 13.11.2007, 11:07
Posts: 1150
Location: Germany
I don't have a PIII available, but in a short test with the Chicago demo and software skinning, your current utMath_rc4 runs at about 4.5 fps; with the /arch:SSE2 parameter enabled in MSVC it even drops to a maximum of 4 fps. The original utMath runs at ~7 fps with /arch:SSE2 disabled and drops to ~5.5 fps with /arch:SSE2 enabled. All tests were run on an Asus A8JS with a 2 GHz Core2 Duo 7200 and a Geforce7700.

I don't have any experience with SSE optimizations, so I can't tell you more about the differences or about further optimizations based on SSE/SSE2.


Posted: 11.10.2008, 22:34

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
Siavash wrote:
I have a small question: is there any performance difference between the SSE units of my old PIII and those of the new Core2Duo series? I want to know whether my SSE code will run faster than the original utmath [FPU] on new CPUs, or whether the benchmark results will stay at 1:1.
Individual SSE instructions tend to maintain fairly steady relative performance between chip releases. However, you do have access to a much larger set of SSE instructions on newer chips, some of which might prove useful.

_________________
Tristam MacDonald - [swiftcoding]


Posted: 12.10.2008, 13:47

Joined: 21.08.2008, 11:44
Posts: 354
Volker wrote:
I don't have a PIII available, but in a short test with the Chicago demo and software skinning, your current utMath_rc4 runs at about 4.5 fps; with the /arch:SSE2 parameter enabled in MSVC it even drops to a maximum of 4 fps. The original utMath runs at ~7 fps with /arch:SSE2 disabled and drops to ~5.5 fps with /arch:SSE2 enabled. All tests were run on an Asus A8JS with a 2 GHz Core2 Duo 7200 and a Geforce7700.

I don't want to compare SSE with SSE2. For example, the PIII only supports MMX and SSE, while the newer Pentium 4 series supports MMX, SSE, SSE2 [and SSE3, I think] and HT technology. I want to compare the performance of the P3's SSE unit with that of the SSE units in the newer P4 and Core2Duo series [not SSE2 or SSE3].
I want to know whether the CPU manufacturers have made any improvements to the SSE units of newer CPUs. There are a few benchmarks in the Intel manuals showing that HT-enabled SIMD code runs 44% faster than normal SIMD code, and so on.
Btw, I want to know whether my code will run faster than the FPU code or stay at ~1:1 as it does on my old P3.
swiftcoder wrote:
Individual SSE instructions tend to maintain fairly steady relative performance between chip releases. However, you do have access to a much larger set of SSE instructions on newer chips, some of which might prove useful.

SSE2 brings no benefit to the engine, because most variables are float [not double] and SSE2 only adds ~140 new instructions that operate on double and integer data [__m128d and __m128i]. SSE3, however, adds 13 new instructions, some of which are useful for the engine too, for example ADDSUBPS and the horizontal add and subtract (HADDPS, HSUBPS) on float data [__m128].
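To make the point about horizontal operations concrete, here is a minimal sketch of summing the four lanes of a vector, once with SSE3's `_mm_hadd_ps` and once with plain SSE1 shuffles. The function name `hsum4` and the `EXAMPLE_HAS_SSE` macro are made up for this illustration and are not part of utMath; the scalar branch is only there so the snippet also builds on non-x86 targets.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   // SSE1 intrinsics
#define EXAMPLE_HAS_SSE 1
#endif
#if defined(__SSE3__)
#include <pmmintrin.h>   // SSE3 intrinsics (_mm_hadd_ps)
#endif

// Returns p[0] + p[1] + p[2] + p[3]; p does not have to be 16-byte aligned.
float hsum4(const float* p) {
#if defined(__SSE3__)
    __m128 v = _mm_loadu_ps(p);          // (a, b, c, d)
    v = _mm_hadd_ps(v, v);               // (a+b, c+d, a+b, c+d)
    v = _mm_hadd_ps(v, v);               // every lane now holds a+b+c+d
    return _mm_cvtss_f32(v);
#elif defined(EXAMPLE_HAS_SSE)
    __m128 v  = _mm_loadu_ps(p);         // (a, b, c, d)
    __m128 hi = _mm_movehl_ps(v, v);     // (c, d, c, d)
    __m128 s  = _mm_add_ps(v, hi);       // (a+c, b+d, ...)
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)); // (b+d, ...)
    return _mm_cvtss_f32(_mm_add_ss(s, s1));                   // a+c+b+d
#else
    return (p[0] + p[1]) + (p[2] + p[3]); // portable fallback
#endif
}
```

The SSE1 path needs only a couple of extra shuffles, so whether the SSE3 path is actually faster on a given CPU is something to measure rather than assume.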
I'm not sure, but I think that because of the engine's hardware requirements [PCI-Express mainboards], the target CPUs support SSE3 and HTT well. Are there any hardware exceptions?
As for SSE4.x, it is only supported by the newer Core2Duo series [not the old LGA Pentium 4 CPUs], and IMHO relying on it would only add to the engine's limitations :?

I'm optimizing utmath_rc4 and it is at least 1:1 with the FPU code, and in most functions the new utmath will compete very well with the compiler's auto-vectorized FPU code. The job is nearly finished and I'll post it on the forum in the next few days :wink:


Posted: 12.10.2008, 18:44

Joined: 22.11.2007, 17:05
Posts: 707
Location: Boston, MA
Siavash wrote:
I'm not sure, but I think that because of the engine's hardware requirements [PCI-Express mainboards], the target CPUs support SSE3 and HTT well. Are there any hardware exceptions?
As for SSE4.x, it is only supported by the newer Core2Duo series [not the old LGA Pentium 4 CPUs], and IMHO relying on it would only add to the engine's limitations :?
If you are going to go to the trouble of using SSE, you need to be able to conditionally enable the various optimisations, and at that point you can support all SSE versions. Don't forget, it is entirely possible that some very recent processors (especially budget models) might have no SIMD support at all, as a cost and power saving measure.

_________________
Tristam MacDonald - [swiftcoding]


Posted: 12.10.2008, 19:04

Joined: 21.08.2008, 11:44
Posts: 354
swiftcoder wrote:
If you are going to go to the trouble of using SSE, you need to be able to conditionally enable the various optimisations, and at that point you can support all SSE versions. Don't forget, it is entirely possible that some very recent processors (especially budget models) might have no SIMD support at all, as a cost and power saving measure.

Currently I'm trying to optimize the engine without damaging the whole project; to make use of the other SSE versions we would need to change the engine's structure.

As for SSE3, it comes in handy in a few sections of the engine, but SSE[1] is essential for the engine.

As for SIMD support, most of the engine's target CPUs are the Pentium4 D, Pentium4 M, Core2Duo, Core2Quad and Extreme series from Intel and the Sempron, AthlonX2 and AthlonFX series from AMD, and most of these fully support SSE and SSE3 [I'm not too sure about SSE3 on Sempron and Celeron].

BTW, we can select this at runtime by using CPUID :wink:
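The runtime selection mentioned above could look roughly like the sketch below: one function pointer is chosen once at startup and every caller goes through it. This is only an illustration under stated assumptions. It relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which wrap CPUID; on MSVC the `__cpuid` intrinsic from `<intrin.h>` would be queried instead. The names `Add4Fn`, `add4_scalar`, `add4_sse`, `select_add4` and `EXAMPLE_X86_GNU` are hypothetical and not part of Horde3D.

```cpp
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <xmmintrin.h>   // SSE1 intrinsics
#define EXAMPLE_X86_GNU 1
#endif

// Every backend shares this signature: out[i] = a[i] + b[i] for four floats.
using Add4Fn = void (*)(const float* a, const float* b, float* out);

// Plain C++ version, valid on every CPU.
void add4_scalar(const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
}

#if defined(EXAMPLE_X86_GNU)
// SSE version; the target attribute lets it compile even if the rest of
// the translation unit is built without SSE enabled.
__attribute__((target("sse")))
void add4_sse(const float* a, const float* b, float* out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
#endif

// Asks the CPU once which backend it can run and returns that function.
Add4Fn select_add4() {
#if defined(EXAMPLE_X86_GNU)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse")) return add4_sse;
#endif
    return add4_scalar;
}
```

Dispatching per four-float add costs an indirect call each time, so in practice the selection would more likely be made per batch (for example per skinned mesh) than per vector operation.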


Posted: 12.10.2008, 19:16
Tool Developer

Joined: 13.11.2007, 11:07
Posts: 1150
Location: Germany
I don't want to dampen your enthusiasm, but I think that before we change the structure of the engine significantly, we should see a significant performance improvement. Currently it seems to me that the SSE version is slower than the original version. And I think it is not enough to have a very small test application that just does some matrix multiplications or vector additions. I guess the best test is to use the optimized version in the Chicago sample with software skinning enabled. This is quite processor-intensive and should benefit noticeably from the optimizations. I know you currently don't have the hardware to run the samples, but comparing the different utMath versions based only on very small test applications might not show results that we can later reproduce within the engine code. Just my two cents.


Posted: 12.10.2008, 19:25

Joined: 21.08.2008, 11:44
Posts: 354
Volker wrote:
I don't want to dampen your enthusiasm, but I think that before we change the structure of the engine significantly, we should see a significant performance improvement. Currently it seems to me that the SSE version is slower than the original version. And I think it is not enough to have a very small test application that just does some matrix multiplications or vector additions. I guess the best test is to use the optimized version in the Chicago sample with software skinning enabled. This is quite processor-intensive and should benefit noticeably from the optimizations. I know you currently don't have the hardware to run the samples, but comparing the different utMath versions based only on very small test applications might not show results that we can later reproduce within the engine code. Just my two cents.

I agree 100% with you, dear Volker. The main thing that makes utmath_rc3 and utmath_rc4 slower than the original is poor management of arrays [there are a lot of memory loads]. I'm reducing those and replacing the DIVPS and SQRTPS instructions with RCPPS and RSQRTPS to make the code as fast as possible :wink:
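One caveat with that replacement: RCPPS and RSQRTPS return approximations (roughly 12 bits of precision), so a Newton-Raphson refinement step is commonly added to get close to full float precision. The sketch below shows the idea for the reciprocal square root. The function name `inv_sqrt4` and the `EXAMPLE_HAS_SSE` macro are invented for this example, and the scalar branch only keeps the snippet buildable on non-x86 targets.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   // SSE1 intrinsics
#define EXAMPLE_HAS_SSE 1
#endif

// Writes 1/sqrt(in[i]) to out[i] for four positive floats.
// Inputs of 0 are not handled: the refinement step would produce NaN.
void inv_sqrt4(const float* in, float* out) {
#if defined(EXAMPLE_HAS_SSE)
    const __m128 x   = _mm_loadu_ps(in);
    const __m128 y   = _mm_rsqrt_ps(x);                  // RSQRTPS estimate
    const __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));  // x * y^2
    // One Newton-Raphson step: y' = 0.5 * y * (3 - x * y^2)
    const __m128 refined = _mm_mul_ps(
        _mm_mul_ps(_mm_set1_ps(0.5f), y),
        _mm_sub_ps(_mm_set1_ps(3.0f), xyy));
    _mm_storeu_ps(out, refined);
#else
    for (int i = 0; i < 4; ++i) out[i] = 1.0f / std::sqrt(in[i]);
#endif
}
```

The refinement costs a few extra multiplies, so the gain over SQRTPS plus DIVPS depends on the CPU and should be checked in the skinning test rather than taken for granted.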


Posted: 16.10.2008, 19:08

Joined: 21.08.2008, 11:44
Posts: 354
Is there anybody who can answer the main question?


Posted: 16.10.2008, 20:11

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
Siavash wrote:
I have a small question: is there any performance difference between the SSE units of my old PIII and those of the new Core2Duo series? I want to know whether my SSE code will run faster than the original utmath [FPU] on new CPUs, or whether the benchmark results will stay at 1:1. [I've googled a bit about this and it seems that AMD's SSE units are a bit slower than Intel's.]

Yes, there are major differences: many CPUs, especially old or power-saving ones, have only one SSE math unit and must do a 128-bit operation in (at least) two 64-bit steps. The Core2Duo has a real 128-bit unit with some other improvements, so it can do things like a packed multiply, packed add, packed load and packed store together with a conditional jump in ONE cycle.
Other factors to consider:
- much improved latency and transfer rates to / from RAM / caches, bigger caches
- sometimes higher CPU frequency
- better branch prediction / speculative execution
So most of the time your code should be faster. But to really get peak performance you must write different code for different CPUs and benchmark it in several real-world situations.
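For the benchmarking part, a small timing helper is often enough to compare two implementations inside a real workload. The sketch below uses only the standard library; the name `best_time_ms` is made up for this example. It takes the best of several runs to reduce noise from other processes, which is one common convention rather than the only valid one.

```cpp
#include <chrono>
#include <limits>

// Runs work() `repeats` times (repeats >= 1) and returns the fastest
// run in milliseconds.
template <typename Work>
double best_time_ms(Work&& work, int repeats) {
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (ms < best) best = ms;
    }
    return best;
}
```

The work passed in should be something representative, such as one frame of software skinning, and its result should be used afterwards so that the compiler cannot optimise the whole call away.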


Another idea: try compiling Horde3D with the Intel compiler and compare that to your optimizations. Intel also has a nice profiler that helps avoid pipeline stalls, cache misses, etc.


Have you thought about the problem that a PowerPC does not have an SSE unit but uses the AltiVec instruction set instead?


Posted: 17.10.2008, 00:26

Joined: 08.11.2006, 03:10
Posts: 384
Location: Australia
Seeing this is all about optimisation, you might be interested in this series of articles. It is *very* long, but it goes into great detail about the CPU's cache memory.

At one point he uses matrix multiplication as an example of how to optimise cache usage. Measured against the original run time, he saves 76.6% just by rearranging the order in which the data is accessed, another 6.1 percentage points by getting rid of some copying, and another 7.8 percentage points by using SSE-type instructions:
Quote:
Code:
          Original          Transposed       Sub-Matrix       Vectorized
Cycles    16,765,297,870    3,922,373,010    2,895,041,480    1,588,711,750
Relative  100%              23.4%            17.3%            9.47%

Table 6.2: Matrix Multiplication Timing
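The "Transposed" column of that table corresponds to a simple change in access pattern, which can be sketched as follows. This is an illustrative reconstruction, not the article's code: the `Matrix` alias and the function names are made up here, and the matrices are assumed to be square, row-major and stored in flat vectors.

```cpp
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector.
using Matrix = std::vector<float>;

// Naive version: the inner loop reads b column-wise, i.e. with a stride
// of n floats, which touches a new cache line on almost every step.
void mul_naive(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    c.assign(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Transposed version: b is copied once into bt = b^T, after which both
// operands of the inner loop are read sequentially.
void mul_transposed(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    Matrix bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    c.assign(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
}
```

The benefit only shows up for matrices that do not fit in cache, so for the 4x4 matrices of a typical math library the effect is likely to be negligible; the access-pattern lesson applies more to arrays of vertices or joints.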


Codepoet wrote:
Have you thought about the problem that a PowerPC does not have an SSE unit but uses the AltiVec instruction set instead?
Yes, the math code would have to be written 3 times - plain C++, SSE optimised C++, Altivec optimised C++...
I wonder if there are any high-performance math libraries out there that already do this?


Posted: 17.10.2008, 09:05

Joined: 21.08.2008, 11:44
Posts: 354
Thanks a lot, dear Codepoet and DarkAngel, for the detailed information and articles :wink:

Unfortunately I'm not familiar with AltiVec, so I can't help with AltiVec-optimized code, but it would be great to be the first next-gen engine to provide a high-performance math library that supports both SSE and AltiVec.
According to Wikipedia, GCC 4.x's auto-vectorisation feature can produce AltiVec code too:
Wikipedia wrote:
Recent versions of the GNU Compiler Collection, IBM Visual Age Compiler and other compilers provide intrinsics to access AltiVec instructions directly from C and C++ programs. As of version 4, the GCC also includes auto-vectorisation capabilities that attempt to intelligently create Altivec accelerated binaries without the need for the programmer to use intrinsics directly.
And I've heard somewhere that AltiVec-optimized code is ~20% faster than SSE code :o

I'm going to have a closer look at what the Intel manuals say about cacheability and prefetching instructions, and then do some memory optimizations.


Posted: 17.10.2008, 14:22

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
Maybe you should take a look at the math library used by bullet physics and compare its performance to your own code: http://www.bulletphysics.com/Bullet/php ... &f=&t=1322
They provide several implementations using plain C / C++, SSE, ...


Posted: 17.10.2008, 16:20

Joined: 21.08.2008, 11:44
Posts: 354
WOW! It's really the fastest math library and the best learning resource that I've seen in my life :idea:


Posted: 17.10.2008, 17:39

Joined: 14.04.2008, 15:06
Posts: 183
Location: Germany
What do you think about integrating that library - or parts of it - into Horde after doing real world tests instead of writing your own version?

