1) It's the number of constraints to solve, not the iterations. If there is too little work to do, then splitting it across threads is slower than solving it on one. That said, the number was picked based on minimal testing on an iPhone 4, so don't read too much into it.
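As a rough illustration of that kind of cutoff, here is a minimal sketch in Python. The names `MIN_CONSTRAINTS_FOR_THREADS`, `solve_one`, and `solve_all` are hypothetical, the cutoff value is made up, and `solve_one` is only a stand-in for real constraint work. Python threads will not actually speed up CPU-bound work like this, so the sketch only shows the control flow.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cutoff; not the value used by any real library.
MIN_CONSTRAINTS_FOR_THREADS = 100


def solve_one(constraint):
    """Stand-in for solving one constraint; here it just doubles a number."""
    return constraint * 2


def solve_chunk(chunk):
    """Solve a contiguous slice of the constraint list on the calling thread."""
    return [solve_one(c) for c in chunk]


def solve_all(constraints, thread_count=4):
    """Return (results, used_threads), going serial when there is too little work."""
    if thread_count <= 1 or len(constraints) < MIN_CONSTRAINTS_FOR_THREADS:
        return solve_chunk(constraints), False

    # Split into contiguous chunks, roughly one per worker.
    chunk_size = -(-len(constraints) // thread_count)  # ceiling division
    chunks = [constraints[i:i + chunk_size]
              for i in range(0, len(constraints), chunk_size)]
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        parts = list(pool.map(solve_chunk, chunks))  # map preserves chunk order
    return [r for part in parts for r in part], True
```

The point is only that the decision depends on `len(constraints)`, not on the iteration count. Any real cutoff should come from profiling on the target hardware.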
2) Yes, there is definitely a race. You had the right idea, though: it doesn't matter much, since the error from a single iteration is quite small. That said (oof, a second time), I don't have any formal proof that it's "safe" other than that it has run for hundreds of hours on several different OSs and CPUs without any issues, with one exception: it fails rather catastrophically and immediately on the iPhone simulator. Adding extremely fine-grained locking (i.e. per body) would be very, very slow.
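For reference, per-body locking might look something like the sketch below. This is a toy in Python, not the actual solver: `Body`, `solve_constraint`, and `solve` are hypothetical names, the bodies are 1-D with equal mass, and each "constraint" just pulls two velocities toward each other. Two constraints share the middle body, which is exactly where an unlocked solver would race.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Body:
    """Toy 1-D body with unit mass and its own lock (hypothetical type)."""

    def __init__(self, velocity):
        self.velocity = velocity
        self.lock = threading.Lock()


def solve_constraint(bodies, i, j):
    """Apply equal and opposite impulses so bodies i and j share one velocity."""
    first, second = sorted((i, j))  # fixed lock order avoids deadlock
    with bodies[first].lock, bodies[second].lock:
        impulse = 0.5 * (bodies[j].velocity - bodies[i].velocity)
        bodies[i].velocity += impulse
        bodies[j].velocity -= impulse


def solve(bodies, constraints, iterations=50, thread_count=2):
    """Run every constraint once per iteration, spread across worker threads."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        for _ in range(iterations):
            # list() waits for all constraints before the next iteration starts.
            list(pool.map(lambda c: solve_constraint(bodies, *c), constraints))


bodies = [Body(3.0), Body(0.0), Body(-1.5)]
constraints = [(0, 1), (1, 2)]  # both constraints touch body 1
solve(bodies, constraints)
```

Every impulse now pays for two lock acquisitions, and constraints that share a body serialize on its lock, which is the overhead that makes this approach unattractive for a solver's inner loop.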
3) The solver is *definitely* the most expensive part of any physics system unless you have a lot of bodies and very few interactions. While parallel collision detection algorithms do exist, my impression is that they are generally somewhat brute force and require a lot of CPUs before you can break even against a single CPU running an efficient, though highly serial, spatial data structure such as a BVH.
Anyway. Profile, adjust your simulation if you can, and then invoke the name of multi-threading. Remember: "I solved a problem with threads, and now I have 3 problems."