SIMD
Posted: Sun Oct 18, 2009 2:46 pm
Hi there,
Has anyone investigated the viability of adding SIMD computation to Chipmunk?
4-way SIMD is frequently used to process XYZW in one go, which of course is not really applicable to a 2D system.
However, a better use of 4-way SIMD is the SoA (structure of arrays) approach.
In a 2D particle system, for example, you would store the x-coords of all the particles in one array, and do the same for the y-coords, x-velocities, y-velocities, etc.
A single SIMD instruction can then process 4 particles at once.
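To make that concrete, here is a minimal sketch of an SoA layout for a 2D particle system with a position update that walks the arrays four particles at a time. It is plain portable C rather than real SIMD intrinsics: the inner 4-lane loop only marks the work a single 4-way instruction would cover on SSE, NEON or AltiVec. The names particles_soa, integrate and NUM_PARTICLES are made up for illustration and are not part of Chipmunk.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARTICLES 1024  /* hypothetical capacity, a multiple of 4 */

/* Structure of arrays: one contiguous array per component. */
struct particles_soa {
    float x[NUM_PARTICLES];
    float y[NUM_PARTICLES];
    float vel_x[NUM_PARTICLES];
    float vel_y[NUM_PARTICLES];
};

/* Advance the first n particles by dt, four at a time.
 * Assumes n is a multiple of 4 so no scalar tail loop is needed. */
static void integrate(struct particles_soa *p, size_t n, float dt)
{
    assert(n % 4 == 0 && n <= NUM_PARTICLES);
    for (size_t i = 0; i < n; i += 4) {
        /* These four lanes read and write adjacent floats, so they map
         * onto one 4-wide load, multiply-add and store per component. */
        for (size_t lane = 0; lane < 4; ++lane) {
            p->x[i + lane] += p->vel_x[i + lane] * dt;
            p->y[i + lane] += p->vel_y[i + lane] * dt;
        }
    }
}
```

Because each component is contiguous, no shuffling is needed to get four x values into one register, which is what an interleaved layout would force.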
To apply this to Chipmunk:
When I profile my app with Shark for iPhone, I see that 50% of the time is spent in cpArbiterApplyImpulse.
I expect numContacts is typically one or two here, but if it were frequently 4 or more, a good tactic would be to process 4 contacts in one go.
Bram
PS: A good intro to SoA versus AoS is in IBM's CBE programming tutorial:
https://www-01.ibm.com/chips/techlib/te ... A80061F788
In short, it comes down to this:
struct particle { float x, y, z, vel_x, vel_y, vel_z; };
struct particle particles[1024];
This array of structures is much slower to process than:
float x[1024];
float y[1024];
float z[1024];
float vel_x[1024];
float vel_y[1024];
float vel_z[1024];
because in the latter, you can always process 4 particles simultaneously using 4-way SIMD.
E.g., testing whether particles are below the ground plane z=0 can be done in parallel with a vector compare.
Packing XYZW into one SIMD register will not help you in that case, but SoA does.
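As a rough sketch of that ground-plane test, the function below takes the SoA z array and produces one 4-bit mask per group of four particles, with bit k set when particle i+k is below z=0. It is written in portable C so it runs anywhere. On SSE the inner loop would correspond to a vector compare such as _mm_cmplt_ps followed by _mm_movemask_ps. The name below_ground_masks is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For every group of four z values, write a 4-bit mask into masks[]:
 * bit k is set when z[i + k] < 0, i.e. that particle is below the plane.
 * Assumes n is a multiple of 4 and masks has room for n / 4 entries. */
static void below_ground_masks(const float *z, size_t n, uint8_t *masks)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint8_t m = 0;
        /* One 4-wide compare against zero covers these four lanes. */
        for (size_t lane = 0; lane < 4; ++lane)
            m |= (uint8_t)((z[i + lane] < 0.0f) << lane);
        masks[i / 4] = m;
    }
}
```

A mask of 0 means the whole group can be skipped, so the common case of no particle touching the ground costs one compare per four particles.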
Bram