PentiumIII SSE unit vs. PentiumIV and Core2Duo SSE units

Siavash · **Joined:** 21.08.2008, 11:44 **Posts:** 354

Hi, according to the benchmarks I've run, the original utmath_rc4 is a bit slower than the original utmath, so I've optimized utmath_rc4. It is now ~1:1 [~5% slower] with the original utmath in low-level functions [add, sub, mul] and at least 10% faster in other functions.

I have a small question: is there any performance difference between the SSE units of my old PIII and those of the new Core2Duo series? I want to know whether my code [SSE] will run faster than the original utmath [FPU] on new CPUs, or whether the benchmark results will stay 1:1 again. [I've googled a bit about this and it seems that AMD SSE units are a bit slower than Intel's.]

Thanks for your time :wink:

Volker · **Posted:** 11.10.2008, 19:48

I don't have a PIII available, but I can tell you that in a short test with the Chicago Demo and software skinning, your current utMath_rc4 runs at about 4.5 fps; with the /arch:SSE2 parameter enabled in MSVC it even drops to a maximum of 4 fps. The original utMath runs at ~7 fps with /arch:SSE2 disabled and drops to ~5.5 fps with /arch:SSE2 enabled. All tests were run on an Asus A8JS with a Core2 Duo 7200 at 2 GHz and a GeForce 7700.

I don't have any experience with SSE optimizations, so I can't tell you more about the differences or about further optimizations based on SSE/SSE2.

swiftcoder · **Posted:** 11.10.2008, 22:34

Siavash wrote:

I have a small question: is there any performance difference between the SSE units of my old PIII and those of the new Core2Duo series? I want to know whether my code [SSE] will run faster than the original utmath [FPU] on new CPUs, or whether the benchmark results will stay 1:1 again.

Individual SSE instructions tend to maintain fairly steady relative performance between chip releases. However, you do have access to a much larger set of SSE instructions on newer chips, some of which might prove useful.

Siavash · **Joined:** 21.08.2008, 11:44 **Posts:** 354

Volker wrote:

I don't have a PIII available, but I can tell you that in a short test with the Chicago Demo and software skinning, your current utMath_rc4 runs at about 4.5 fps; with the /arch:SSE2 parameter enabled in MSVC it even drops to a maximum of 4 fps. The original utMath runs at ~7 fps with /arch:SSE2 disabled and drops to ~5.5 fps with /arch:SSE2 enabled. All tests were run on an Asus A8JS with a Core2 Duo 7200 at 2 GHz and a GeForce 7700.

I don't want to compare SSE with SSE2. For example, the PIII only supports MMX and SSE, while the newer PentiumIV series supports MMX, SSE, SSE2 [and SSE3, I think] and HT technology. I want to compare the performance of the P3 SSE unit with that of the newer P4 and Core2Duo SSE units [not SSE2 or SSE3].
I want to know whether CPU manufacturers have improved the SSE units in newer CPUs. There are a few benchmarks in the Intel manuals showing that HT-enabled SIMD code runs 44% faster than normal SIMD code, and so on.
Btw, I want to know whether my code will run faster than the FPU code or remain ~1:1, as on my old P3.

swiftcoder wrote:

Individual SSE instructions tend to maintain fairly steady relative performance between chip releases. However, you do have access to a much larger set of SSE instructions on newer chips, some of which might prove useful.

The engine gets no benefit from SSE2, because most variables are float [not double] and SSE2 only adds ~140 new instructions that operate on double and integer data [__m128d and __m128i]. SSE3, however, adds 13 new instructions, some of which are useful for the engine too, for example the horizontal ADD, horizontal SUB and ADD-SUB operations on float variables [__m128].
I'm not sure, but I think that because of the engine's hardware requirements [PCI-Express mainboards], the target CPUs all support SSE3 and HTT. Are there any hardware exceptions?
As for SSE4.x, it is only supported by the new Core2Duo series [not by the old LGA Pentium4 CPUs], and IMHO requiring it would only add to the engine's limitations.
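To illustrate the horizontal operations mentioned above, here is a minimal sketch of a 4-component dot product. It is not taken from utMath; the function name `dot4` and the `SKETCH_*` macros are made up for this example. When the compiler targets SSE3 it uses the horizontal add (`_mm_hadd_ps`), otherwise it falls back to SSE1 shuffles, and to plain C++ when no SSE is available.

```cpp
#include <cassert>

#if defined(__SSE3__)
  #include <pmmintrin.h>   // SSE3: _mm_hadd_ps
  #define SKETCH_HAVE_SSE3 1
#elif defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>   // SSE1 only
  #define SKETCH_HAVE_SSE1 1
#endif

// Hypothetical helper: dot product of two 4-float vectors.
inline float dot4(const float* a, const float* b) {
    assert(a != nullptr && b != nullptr);
#if defined(SKETCH_HAVE_SSE3)
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    p = _mm_hadd_ps(p, p);   // [p0+p1, p2+p3, p0+p1, p2+p3]
    p = _mm_hadd_ps(p, p);   // every lane now holds the full sum
    return _mm_cvtss_f32(p);
#elif defined(SKETCH_HAVE_SSE1)
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    // Swap neighbouring lanes and add: lane 0 = p0+p1, lane 2 = p2+p3.
    __m128 shuf = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(p, shuf);
    // Move the upper pair sum down to lane 0 and add it.
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
#else
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

Whether the SSE3 path is actually faster than the shuffle version depends on the CPU, so it would need to be measured inside the engine rather than assumed.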

I'm optimizing utmath_rc4 and it's at least 1:1 with the FPU code, and in most functions the new utmath will compete very well with the compiler's auto-vectorized FPU code. The job is nearly finished and I'll post it on the forum in the next few days :wink:

swiftcoder · **Posted:** 12.10.2008, 18:44

Siavash wrote:

I'm not sure, but I think that because of the engine's hardware requirements [PCI-Express mainboards], the target CPUs all support SSE3 and HTT. Are there any hardware exceptions?
As for SSE4.x, it is only supported by the new Core2Duo series [not by the old LGA Pentium4 CPUs], and IMHO requiring it would only add to the engine's limitations.

If you are going to go to the trouble of using SSE, you need to be able to conditionally enable the various optimisations, and at that point you can support all SSE versions. Don't forget, it is entirely possible that some very recent processors (especially budget models) might have no SIMD support at all, as a cost and power saving measure.

Siavash · **Joined:** 21.08.2008, 11:44 **Posts:** 354

swiftcoder wrote:

If you are going to go to the trouble of using SSE, you need to be able to conditionally enable the various optimisations, and at that point you can support all SSE versions. Don't forget, it is entirely possible that some very recent processors (especially budget models) might have no SIMD support at all, as a cost and power saving measure.

Currently I'm trying to optimize the engine without damaging the whole project, and we would need to change the engine's structure to make use of the other SSE versions.

As for SSE3, it comes in handy in a few sections of the engine, but SSE[1] is the one that is essential for the engine.

As for SIMD support, most of the engine's target CPUs are the Pentium4 D, Pentium4 M, Core2Duo, Core2Quad and Extreme series from Intel and the Sempron, AthlonX2 and AthlonFX series from AMD, and most of these fully support SSE and SSE3 [I'm not too sure about SSE3 on Sempron and Celeron].

BTW, we can select the code path at runtime by using CPUID :wink:
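A rough sketch of what such runtime selection could look like, assuming GCC or Clang on x86, where the `__builtin_cpu_init` and `__builtin_cpu_supports` built-ins wrap the CPUID query (MSVC would need its `__cpuid` intrinsic instead, which is not shown). The names `AddFn`, `addScalar`, `addSSE` and `selectAdd` are invented for this example and are not part of utMath.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  #include <xmmintrin.h>
  #define SKETCH_X86_DISPATCH 1
#endif

// Signature shared by every implementation of the array add.
using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Plain C++ fallback that runs on any CPU.
void addScalar(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#ifdef SKETCH_X86_DISPATCH
// SSE version: four floats per iteration, scalar loop for the remainder.
__attribute__((target("sse")))
void addSSE(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
#endif

// Ask the CPU once which path it supports and return that function.
AddFn selectAdd() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse")) return addSSE;
#endif
    return addScalar;
}
```

The idea is to call `selectAdd()` once at start-up and keep the returned pointer. Dispatching per call on something as small as a single vector add would probably cost more than the SSE path saves, so in practice the switch belongs around larger batches such as a whole skinning loop.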

Volker · **Posted:** 12.10.2008, 19:16

I don't want to dampen your enthusiasm, but I think that before we change the structure of the engine significantly, we should see a significant performance improvement. Currently it seems to me that the SSE version is slower than the original version. And I think it is not enough to have a very small test application that just does some matrix multiplications or vector additions. I guess the best test is to use the optimized version in the Chicago sample with software skinning enabled. This is quite processor intensive and should benefit noticeably from the optimizations. I know you currently don't have the hardware to run the samples, but comparing the different utMath versions based only on very small test applications might not give results that we can later reproduce within the engine code. Just my two cents.

Siavash · **Joined:** 21.08.2008, 11:44 **Posts:** 354

Volker wrote:

I don't want to dampen your enthusiasm, but I think that before we change the structure of the engine significantly, we should see a significant performance improvement. Currently it seems to me that the SSE version is slower than the original version. And I think it is not enough to have a very small test application that just does some matrix multiplications or vector additions. I guess the best test is to use the optimized version in the Chicago sample with software skinning enabled. This is quite processor intensive and should benefit noticeably from the optimizations. I know you currently don't have the hardware to run the samples, but comparing the different utMath versions based only on very small test applications might not give results that we can later reproduce within the engine code. Just my two cents.

I agree 100% with you, dear Volker. The main thing that makes utmath_rc3 and utmath_rc4 slower than the original is poor management of arrays [there are a lot of memory loads]. I'm reducing those and replacing the DIVPS and SQRTPS instructions with RCPPS and RSQRTPS to make the code as fast as possible :wink:
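One caveat with this replacement: RCPPS and RSQRTPS only return approximations (roughly 12 bits of precision), so they are usually paired with a Newton-Raphson step when near-float accuracy is needed. The sketch below shows that pattern for the reciprocal square root. It is an illustration only, not the utMath code; the function name `invSqrt4` and the `SKETCH_HAVE_SSE` macro are made up here.

```cpp
#include <cassert>
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define SKETCH_HAVE_SSE 1
#endif

// Hypothetical helper: out[i] = 1 / sqrt(in[i]) for four positive floats.
inline void invSqrt4(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
#ifdef SKETCH_HAVE_SSE
    const __m128 x  = _mm_loadu_ps(in);
    const __m128 y0 = _mm_rsqrt_ps(x);            // RSQRTPS: fast approximation
    // One Newton-Raphson step: y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
    const __m128 half        = _mm_set1_ps(0.5f);
    const __m128 threeHalves = _mm_set1_ps(1.5f);
    const __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y0), y0);
    const __m128 y1  = _mm_mul_ps(y0, _mm_sub_ps(threeHalves, _mm_mul_ps(half, xyy)));
    _mm_storeu_ps(out, y1);
#else
    // Portable fallback: exact division and square root.
    for (int i = 0; i < 4; ++i) out[i] = 1.0f / std::sqrt(in[i]);
#endif
}
```

The refinement costs a few extra multiplies, so the gain over SQRTPS plus DIVPS is smaller than the raw instruction latencies suggest. For code such as normalizing skinning normals, the unrefined approximation may already be accurate enough, but that is a judgment call per call site.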

Siavash · **Joined:** 21.08.2008, 11:44 **Posts:** 354

Can anybody answer the main question?

Codepoet · **Posted:** 16.10.2008, 20:11

Siavash wrote:

I have a small question: is there any performance difference between the SSE units of my old PIII and those of the new Core2Duo series? I want to know whether my code [SSE] will run faster than the original utmath [FPU] on new CPUs, or whether the benchmark results will stay 1:1 again. [I've googled a bit about this and it seems that AMD SSE units are a bit slower than Intel's.]

Yes, there are major differences: many CPUs, especially old or power-conserving ones, have only one SSE math unit and must do a 128-bit operation in (at least) two 64-bit steps. The Core2Duo has a real 128-bit unit, with some other improvements that let it do things like a packed multiply, packed add, packed load and packed store together with a conditional jump in ONE cycle.
Other factors to consider:
- much improved latencies and transfer rates to/from RAM and caches, bigger caches
- sometimes higher CPU frequency
- better branch prediction / speculative execution
So most of the time your code should be faster. But to really get peak performance you must write different code for different CPUs and benchmark it in several real world situations.

Another idea: try compiling Horde3D with the Intel compiler and compare that to your optimizations. Intel also has a nice profiler that helps to avoid pipeline stalls, cache misses, etc.

Have you considered the problem that a PowerPC does not have an SSE unit but uses the AltiVec instruction set instead?

DarkAngel · **Posted:** 17.10.2008, 00:26

Seeing as this is all about optimisation, you might be interested in this series of articles. It is *very* long, but it goes into great detail about the CPU's cache memory.

At one point he uses matrix multiplication as an example of how to optimise cache usage. Measured against the original run time, he saves 76.6% just by rearranging the order in which the data is accessed, another 6.1% by getting rid of some copying, and another 7.8% by using SSE-type instructions.

Quote:

Code:

          Original          Transposed       Sub-Matrix       Vectorized
Cycles    16,765,297,870    3,922,373,010    2,895,041,480    1,588,711,750
Relative  100%              23.4%            17.3%            9.47%

Table 6.2: Matrix Multiplication Timing 
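The "Transposed" column of the table is the easiest one to reproduce. The sketch below is a reconstruction of the general idea, not the article's code, and the names `Matrix`, `mulNaive` and `mulTransposed` are made up here. The naive version walks the second matrix column by column, jumping `n` floats on every step of the inner loop. The second version transposes that matrix first, so that both operands are read sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Square n x n matrix stored row-major in a flat array.
using Matrix = std::vector<float>;

// Naive version: b is read with a stride of n floats in the inner loop.
Matrix mulNaive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Transposed version: pay for one transpose up front, then read both
// operands row by row, which is friendlier to the cache.
Matrix mulTransposed(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix bt(n * n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + k] = b[k * n + j];

    Matrix c(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    return c;
}
```

The difference only shows up for matrices that are much larger than the cache. For the 4x4 matrices a renderer mostly deals with, everything fits in a couple of cache lines, so the lesson that carries over to the engine is about how arrays of vertices and matrices are laid out and traversed rather than about a single multiplication.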

Codepoet wrote:

Have you thought about the problem, that a PowerPC does not have a SSE unit but use the Altivec instruction set?

Yes, the math code would have to be written 3 times - plain C++, SSE optimised C++, Altivec optimised C++...
I wonder if there are any high-performance math libraries out there that already do this?
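One common way to keep the three versions manageable is to hide them behind a single function and let the preprocessor pick the backend at compile time. The following is only a sketch of that structure. The names `UTMATH_BACKEND_SSE`, `UTMATH_BACKEND_ALTIVEC`, `backendName` and `add4` are hypothetical, and the AltiVec branch is written from the documented `vec_ld` / `vec_add` / `vec_st` intrinsics and has not been tested on PowerPC hardware.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define UTMATH_BACKEND_SSE 1
#elif defined(__ALTIVEC__)
  #include <altivec.h>
  // altivec.h may define these as macros, which clashes with C++ code.
  #undef vector
  #undef pixel
  #undef bool
  #define UTMATH_BACKEND_ALTIVEC 1
#endif

// Reports which backend this translation unit was compiled with.
inline const char* backendName() {
#if defined(UTMATH_BACKEND_SSE)
    return "SSE";
#elif defined(UTMATH_BACKEND_ALTIVEC)
    return "AltiVec";
#else
    return "plain C++";
#endif
}

// Adds two 4-float vectors. All three pointers must be 16-byte aligned,
// because the aligned load/store forms are used on both SIMD backends.
inline void add4(const float* a, const float* b, float* out) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
#if defined(UTMATH_BACKEND_SSE)
    _mm_store_ps(out, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
#elif defined(UTMATH_BACKEND_ALTIVEC)
    vec_st(vec_add(vec_ld(0, a), vec_ld(0, b)), 0, out);
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}
```

Callers would only ever see `add4`, with buffers declared as `alignas(16) float v[4]`. The plain C++ branch doubles as the reference implementation that the SIMD branches can be tested against.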

Siavash · **Joined:** 21.08.2008, 11:44 **Posts:** 354

Thanks a lot, dear Codepoet and DarkAngel, for the detailed information and the articles :wink:

Unfortunately I'm not familiar with AltiVec, so I can't help with AltiVec-optimized code, but it would be great to be the first next-gen engine to provide a high-performance math library that supports both SSE and AltiVec.
According to Wikipedia, GCC 4.x's auto-vectorisation feature can produce AltiVec code too:

Wikipedia wrote:

Recent versions of the GNU Compiler Collection, IBM Visual Age Compiler and other compilers provide intrinsics to access AltiVec instructions directly from C and C++ programs. As of version 4, the GCC also includes auto-vectorisation capabilities that attempt to intelligently create Altivec accelerated binaries without the need for the programmer to use intrinsics directly.

And I've heard somewhere that AltiVec-optimized code is ~20% faster than the SSE equivalent.

I'm going to have a closer look at what the Intel manuals say about the cacheability and prefetch instructions, and then do some memory optimizations.

Codepoet · **Posted:** 17.10.2008, 14:22

Maybe you should take a look at the math library used by bullet physics and compare its performance to your own code: http://www.bulletphysics.com/Bullet/php ... &f=&t=1322
They provide several implementations using plain C / C++, SSE, ...

Siavash · **Joined:** 21.08.2008, 11:44 **Posts:** 354

WOW ! It's really the fastest math library and learning resource that I've seen in my life :idea:

Codepoet · **Posted:** 17.10.2008, 17:39

What do you think about integrating that library - or parts of it - into Horde after doing real world tests instead of writing your own version?
