cpHastySpace parallelization, data race?

winkle · Post by **winkle** » Mon Jun 27, 2016 9:58 am

Chipmunk habitat:

I am interested in parallelizing the chipmunk impulse solver by using cpHastySpace in my code and wanted to clarify its use and a potential data race condition.

1) The cpHastySpace implementation is based on dividing iterations amongst threads, thus the comment in the code that >~50 iterations would be a minimum to suggest using a cpHastySpace. Could you briefly outline in which scenarios one would want to use >50 iterations? I am using a space with >5k bodies, which in pairs have 4 constraints (2 bodies comprise a "cell" that wants to act as one rigid body via 4 constraints between them). For example, could one increase iterations rather than decrease dt time step to squelch overlap?

2) While looking at the work function for a thread in cpHastySpace.c, it seems that more than one thread could access the same body->v parameter via the cpArbiterApplyImpulse() -> apply_impulse() function call sequence, where the latter is non-atomic read-modify-write on body->v. Is this an "innocent" race condition if at worst it does not add an impulse added from another thread? The race is bound to happen, but perhaps not frequently and thus you only miss an iteration or two, on only a few bodies each time step? (I can see this as a good compromise vs. an expensive mutex).

3) I have not yet tried to profile the impulse solver within cpSpaceStep(). Would you estimate, that for a large number (>5k) bodies (=> >10k constraints), all of which are "crowded" together (ie, very few not touching any other cells, thus a large number of arbiters?), that most time is spent in the iterations of the solver, even for the default 10 iterations?

Thanks for any info!
--winkle

Post by **slembcke** » Mon Jun 27, 2016 11:59 am

1) It's the number of constraints to solve, not the iterations. If there is too little work to do, then splitting up threads was slower. That said, it was a number picked based on minimal testing on an iPhone 4... So don't read too much into it.

2) Yes, there is definitely a race. You had the right idea though that it doesn't matter much since error from a single iteration is quite small. That said... (oof, a second time) I don't have any formal proof that it's "safe" other than it's ran for hundreds of hours without any issue on several different OSs and CPUs without any issues with one exception. It fails rather catastrophically and immediately on the iPhone simulator. Adding extremely fine grained locking (i.e. per body) would be very very slow.

3) The solver is *definitely* the most expensive part of any physics system unless you have a lot of bodies and very few interactions. While parallel collision detection algorithms do exist, my impression is that they are generally somewhat brute force and require a lot of CPUs before you can break even against a single CPU running an efficient, though highly serial spatial data structure such as a BVH.

Anyway. Profile, adjust your simulation if you can, and then invoke the name of multi-threading. Remember: "I solved a problem with threads, and now I have 3 problems."

winkle · Post by **winkle** » Mon Jun 27, 2016 12:39 pm

Thanks for the prompt reply. I agree the race condition appears safe since it only writes-back over body->v, which would only miss an occasional impulse (small, as you say) in the reduction.

1) It's the number of constraints to solve, not the iterations.

Could you clarify what you mean here? Each iteration seems to see the same arbiter and constraint arrays and these arrays are not divided amongst threads. I am referring to the per-thread work function:

Code: Select all

 static void Solver(...)
 ... 
	unsigned long iterations = (space->iterations + worker_count - 1)/worker_count;
	for(unsigned long i=0; i<iterations; i++){
		for(int j=0; j<arbiters->num; j++){
...
		for(int j=0; j<constraints->num; j++){

Thus, the number of total iterations is divided amongst threads, where each thread sees every arbiter and constraint in each iteration. Do you mean that walking the constraints array is what takes the most time in an iteration?

I will be testing cpHastySpace in the coming weeks (I am running code on linux, on dept. compute nodes, compiled with g++) and will respond here with any info on any speedup for the large number of bodies, constraints, and arbiters that I have in my simulations. Thx!
--winkle

Post by **slembcke** » Mon Jun 27, 2016 4:06 pm

winkle wrote:Could you clarify what you mean here?

If there are fewer than 50 constraints (including contact pairs), then it runs the single threaded solver otherwise it runs several solver threads.

Say your space is set up to use 10 iterations and 4 CPUs. If there are less than 50 constraints, it will run all 10 iterations on a single CPU. If you have 1000 constraints, it will run 3 iterations (10 iterations / 4 CPUs, rounded up), on each of the 4 threads. All constraints are run every thread, but the iteration count is split up between the threads.

The "correct" thing to do is to split up all the bodies into distinct groups by analyzing the contact graph. For instance, when you have 4 piles of objects, run each pile on a separate CPU. The problem is that now you need to schedule potentially many small tasks, like splitting up 1000 piles onto 2 CPUs. In the worst case, like yours, you have one huge pile that can't be split up.

winkle · Post by **winkle** » Mon Jun 27, 2016 5:33 pm

Ah yes. I see now you were referring to the conditional to start multiple threads. For reference to others, this is in the function: void cpHastySpaceStep(...) in cpHastySpace.c:

Code: Select all

...
		if((unsigned long)(arbiters->num + constraints->num) > hasty->constraint_count_threshold){
			RunWorkers(hasty, Solver);
		} else {
			Solver(space, 0, 1);
		}
...

Thus, use of the parallelization in my case of > 10k constraints is likely to give significant speedup (even for a small number of iterations). I will post results back when this gets tested. Thanks again for the clarification.
--winkle

Chipmunk Physics

cpHastySpace parallelization, data race?

cpHastySpace parallelization, data race?

Re: cpHastySpace parallelization, data race?

Re: cpHastySpace parallelization, data race?

Re: cpHastySpace parallelization, data race?

Re: cpHastySpace parallelization, data race?

Who is online