NEON matrix palette skinning

Introduction

What is ARM NEON? The ARM® NEON™ general-purpose SIMD engine … In other words, it is an extended instruction set, similar to SSE, SSE2 and so on for x86 CPUs.

Why?

From time to time a friend of mine asked me: "What do you think about ARM NEON optimization for your 3D math functions?"
My answers were:

  • FPS in my project is in the normal range
  • The profiler doesn’t show any hot spots in the math functions
  • Data has to be aligned to 16 bytes, and my code was not ready for this

A few weeks ago I added FSAA (full-screen antialiasing) to the game and the FPS immediately fell below 20. That was a problem. After a week of optimizations the FPS was back up to 25. FSAA ate all of my GPU power, so the only way left to improve performance was to optimize the CPU code.

Usually, when I ran the Xcode profiler, I saw ~10% of CPU time spent inside the matrix palette skinning block. That code already looked well optimized, so my attention shifted to other places. A week ago my friend came to me and said: “Hey, yesterday I spent a lot of time learning the ARM NEON assembly instructions and I feel like I can help you write that code. Let’s try to optimize your matrix palette skinning block.”
We sat down together at my laptop and got started.

C/C++

Plain C++ code for matrix palette skinning:
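
For orientation, here is a minimal sketch of what a scalar matrix palette skinning loop generally looks like. It is not the code from my project: the types `Vec3`, `Mat3x4` and `SkinVertex`, the function names, and the choice of four bones per vertex with 3x4 row-major bone matrices are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical types for illustration only; not taken from the original project.
struct Vec3 { float x, y, z; };

// Affine bone matrix, row-major: 3 rows of (rotation/scale | translation).
struct Mat3x4 { float m[3][4]; };

struct SkinVertex {
    Vec3         position;   // position in bind pose
    std::uint8_t bone[4];    // indices into the matrix palette
    float        weight[4];  // blend weights, expected to sum to 1
};

// Transform a point by an affine 3x4 matrix (implicit w = 1).
inline Vec3 transformPoint(const Mat3x4& b, const Vec3& p) {
    return {
        b.m[0][0] * p.x + b.m[0][1] * p.y + b.m[0][2] * p.z + b.m[0][3],
        b.m[1][0] * p.x + b.m[1][1] * p.y + b.m[1][2] * p.z + b.m[1][3],
        b.m[2][0] * p.x + b.m[2][1] * p.y + b.m[2][2] * p.z + b.m[2][3]
    };
}

// For every vertex: blend the position transformed by each influencing bone,
// weighted by that bone's weight.
void skinVertices(const SkinVertex* in, Vec3* out, std::size_t count,
                  const Mat3x4* palette) {
    for (std::size_t i = 0; i < count; ++i) {
        Vec3 acc = {0.0f, 0.0f, 0.0f};
        for (int k = 0; k < 4; ++k) {
            const float w = in[i].weight[k];
            if (w == 0.0f) continue;  // skip unused bone slots
            const Vec3 t = transformPoint(palette[in[i].bone[k]], in[i].position);
            acc.x += w * t.x;
            acc.y += w * t.y;
            acc.z += w * t.z;
        }
        out[i] = acc;
    }
}
```

The inner loop is a chain of multiply-adds on groups of four floats, which is the kind of work a SIMD unit such as NEON is built for.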
