Rauy wrote:
I think SIMD only really works in heavily compute-intensive parts; otherwise, it is often memory limited. The software skinning is an obvious possibility, but I think it is already SSEd, isn't it?
It isn't vectorized yet, but some time ago I wrote an optimized version that uses SSE and SSE3, and it was about 30% faster than the normal code on GCC [note that GCC automatically made the data 16-byte aligned there, which made it faster; that is impossible in MSVC]. For me, just 30% isn't satisfying. To gain much more speed, the engine needs to use aligned SoA data structures, and that requires a lot of work and motivation.
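To illustrate what "aligned SoA" buys you, here is a minimal sketch of transforming four positions per iteration with plain SSE. It is not the engine's code: `PositionsSoA`, `transformSoA`, and the row-major 3x4 matrix layout are assumptions made up for the example, and real skinning would also blend several joint matrices per vertex.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; x86/x86-64 only

// Hypothetical SoA vertex stream (not Horde3D's actual layout): one array per
// component, each 16-byte aligned, with count padded to a multiple of 4.
struct PositionsSoA {
    float* x;
    float* y;
    float* z;
    std::size_t count;
};

// Transforms four positions per iteration by a row-major 3x4 matrix m
// (12 floats: rotation/scale in columns 0-2, translation in column 3).
void transformSoA(const PositionsSoA& in, const PositionsSoA& out, const float* m)
{
    assert(in.count == out.count && in.count % 4 == 0);
    for (std::size_t i = 0; i < in.count; i += 4) {
        // Aligned loads: these fault if the arrays are not 16-byte aligned.
        const __m128 x = _mm_load_ps(in.x + i);
        const __m128 y = _mm_load_ps(in.y + i);
        const __m128 z = _mm_load_ps(in.z + i);

        for (int row = 0; row < 3; ++row) {
            const float* r = m + row * 4;
            __m128 acc = _mm_set1_ps(r[3]);  // start from the translation term
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(r[0]), x));
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(r[1]), y));
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(r[2]), z));
            float* dst = row == 0 ? out.x : (row == 1 ? out.y : out.z);
            _mm_store_ps(dst + i, acc);      // aligned store
        }
    }
}
```

With this layout every load fills a whole register with useful data and no shuffles are needed, which is what an AoS stream of x, y, z triples cannot offer without extra work.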
Another simple solution is to make the code parallel, which IMHO requires less effort. I did a few experiments making the software skinning parallel with Intel TBB here, which was easier and took less time to code than the vectorized version. Again, the results weren't satisfying: just a 30% speedup on a quad core, and the cores weren't utilized very well. After playing with Intel Parallel Studio, I found a spot in the code [can't remember where ATM] that was introducing a delay; I just commented out that line and compiled again. This time CPU usage was at 100% and the frame rate was pretty smooth. The only drawback was that the overlays weren't there anymore.
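The idea of splitting the vertex range across cores can be sketched roughly like this. Note that this is not the TBB experiment itself: it swaps in plain `std::thread` so the snippet has no external dependency, and `parallelFor`, `skinPositions`, and the scale-plus-offset "skinning" are placeholders invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Splits [0, count) into one contiguous chunk per worker and runs
// body(begin, end) on each chunk in its own thread, then waits for all.
void parallelFor(std::size_t count, unsigned workers,
                 const std::function<void(std::size_t, std::size_t)>& body)
{
    assert(static_cast<bool>(body));
    workers = std::max(1u, workers);
    const std::size_t chunk = (count + workers - 1) / workers;  // ceil division
    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < count; begin += chunk) {
        const std::size_t end = std::min(count, begin + chunk);
        threads.emplace_back([&body, begin, end] { body(begin, end); });
    }
    for (std::thread& t : threads) t.join();
}

// Placeholder for the per-vertex skinning work (assumption: each output
// element depends only on its own input, so chunks never overlap).
void skinPositions(const std::vector<float>& src, std::vector<float>& dst,
                   float scale, float offset, unsigned workers)
{
    dst.resize(src.size());
    parallelFor(src.size(), workers, [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            dst[i] = src[i] * scale + offset;
    });
}
```

Because every thread writes a disjoint range, no locking is needed. A task scheduler such as TBB would additionally balance uneven chunks, which this fixed split does not.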
Rauy wrote:
Perhaps the calcCropMatrix could also profit, as it does not use too much memory and all those mins, maxs and clamps (currently jumps) can be SSEd quite well.
I doubt it would help with the current unaligned AoS data structure, because IMHO it isn't a heavy computing task, and most of the CPU cycles would be wasted on loading and storing the data between memory and CPU registers.
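For reference, the branchless clamp Rauy has in mind would look something like the sketch below. `clampSSE` is a made-up helper, not anything from calcCropMatrix; it uses unaligned loads and stores on purpose, since that is what the current data would force, and those are exactly where the cycles would go.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; x86/x86-64 only

// Clamps every element of v to [lo, hi] in place, four floats at a time,
// using min/max instructions instead of compare-and-jump.
void clampSSE(float* v, std::size_t count, float lo, float hi)
{
    assert(lo <= hi);
    const __m128 vlo = _mm_set1_ps(lo);
    const __m128 vhi = _mm_set1_ps(hi);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128 x = _mm_loadu_ps(v + i);           // unaligned load
        x = _mm_min_ps(_mm_max_ps(x, vlo), vhi);  // max with lo, then min with hi
        _mm_storeu_ps(v + i, x);                  // unaligned store
    }
    for (; i < count; ++i)                        // scalar tail for leftovers
        v[i] = v[i] < lo ? lo : (v[i] > hi ? hi : v[i]);
}
```

The arithmetic is two instructions per four values, so for a handful of values the load and store traffic dominates.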
Multi-threaded Horde3D FTW!!!