Race Conditions Can Be Useful for Parallelism (shwestrick.github.io)

85 points by g0xA52A2A 3 years ago | 68 comments

Slight nitpick - the definition of "race condition" on Wikipedia [0] is:

    [...] the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events

If we take the first example - Parallel BFS - the correctness of the output could be considered the "system's substantive behavior". Properly implemented atomic checks (as demonstrated) still guarantee correct output under every possible interleaving of events. Therefore, the system's "substantive behavior" does not depend on the sequence or timing of other uncontrollable events, and so there is no "race condition" involved.

Of course, the term "race condition" here is taken colloquially, for the sake of familiarity to the reader - the article has correctly recognized that the appropriate term for this kind of behavior is "non-determinism".

[0] https://en.wikipedia.org/wiki/Race_condition

chrisseaton 3 years ago

I don't know where Wikipedia got this 'substantive behaviour' requirement from, but for example it explicitly isn't part of the definition in the industry's definitive reference for parallelism - Padua.

> A race condition occurs in a parallel program execution when two or more threads access a common resource, e.g., a variable in shared memory, and the order of the accesses depends on the timing, i.e., the progress of individual threads.

> The disposition for a race condition is in the parallel program. In different executions of the same program on the same input access events that constitute a race can occur in different order, which may but does not generally result in different program behaviors (non-determinacy).

Sometimes you deliberately program a full-on data race (which isn't a bug by definition, as the article says) for performance reasons.

> Data races that are not manifestations of program bugs are called benign data races.

ajross 3 years ago

That definition seems to conflate "determinism" (something largely impossible to achieve in asynchronously parallel systems) with "correctness" (an abstracted property of a system that doesn't have anything to do with determinism per se). It just doesn't seem useful.

No, in overwhelmingly common usage, programmers use the term "race condition" as a category of software bug. We mean it in the correctness sense, not the one used in the linked article nor your reference. You'd be met with some very weird stares if you tried to explain how arbitrary SMP ordering of log entries or whatever was a "race condition".

User23 3 years ago

It’s worth noting that algorithms can be designed to be correct under nondeterministic execution. For example, quicksort is correct with a randomly selected pivot. And for associative and commutative functions, it doesn’t matter what order they’re executed in: the final result is always the same.
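A minimal sketch of both claims, assuming nothing beyond the Python standard library. The helper names `quicksort` and `reduce_in_random_order` are made up for illustration. The random number generator is deliberately unseeded, so every run takes a different path, yet the observable results stay the same.

```python
import random
from functools import reduce


def quicksort(items, rng):
    """Quicksort with a randomly selected pivot (illustrative helper)."""
    if len(items) <= 1:
        return list(items)
    pivot = rng.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller, rng) + equal + quicksort(larger, rng)


def reduce_in_random_order(op, items, rng):
    """Apply a binary operation over the items in a shuffled order."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return reduce(op, shuffled)


data = [5, 3, 9, 1, 5, 7, 2]
rng = random.Random()  # unseeded on purpose: execution differs per run

# Collect the distinct outcomes seen over many nondeterministic executions.
sorted_runs = {tuple(quicksort(data, rng)) for _ in range(20)}
# Integer addition and max are both associative and commutative.
sum_runs = {reduce_in_random_order(lambda a, b: a + b, data, rng) for _ in range(20)}
max_runs = {reduce_in_random_order(max, data, rng) for _ in range(20)}

print(sorted_runs, sum_runs, max_runs)
```

Each set ends up with a single element, which is the sense in which the nondeterminism is invisible in the result. Integers are used here because machine floating-point addition is not associative.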

Dijkstra’s guarded commands don’t specify an order for the conditional. The semantics is that the process is free to execute any one of the cases that has a true guard. Nevertheless he found them useful for developing and describing many algorithms.
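As a rough illustration of that semantics, here is a sketch in the spirit of a guarded-command loop that sorts a small sequence. Each guard is "this adjacent pair is out of order", and the loop is free to run any guard that is currently true. The function name `guarded_sort` and the use of `random.choice` to model the free choice are assumptions made for this example, not Dijkstra's notation.

```python
import random


def guarded_sort(values, rng):
    """Repeat: pick ANY enabled guard q[i] > q[i+1] and swap that pair."""
    q = list(values)
    while True:
        # A guard is enabled when its adjacent pair is out of order.
        enabled = [i for i in range(len(q) - 1) if q[i] > q[i + 1]]
        if not enabled:
            # No guard is true, so the loop terminates with q sorted.
            return q
        # Nondeterministic choice among the true guards.
        i = rng.choice(enabled)
        q[i], q[i + 1] = q[i + 1], q[i]


rng = random.Random()  # unseeded: a different choice sequence on each run
results = {tuple(guarded_sort([4, 1, 3, 2], rng)) for _ in range(50)}
print(results)
```

Every swap removes exactly one inversion, so the loop terminates whichever guard is picked, and the only possible final state is the sorted sequence.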

remram 3 years ago

By this definition, every access to a shared resource is a race condition, e.g. even when properly acquiring a lock. It is common knowledge that you introduce locks to remove race conditions so I would say something is definitely missing from the definition.
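To make the point concrete, here is a small sketch (the names `run_once` and `worker` are made up for this example). Every access to the shared list is protected by a lock, yet the order in which threads acquire that lock still depends on timing.

```python
import threading


def run_once(num_threads=8):
    """Each thread appends its id to a shared list while holding a lock."""
    order = []
    lock = threading.Lock()

    def worker(tid):
        with lock:  # properly synchronized access to the shared list
            order.append(tid)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


order = run_once()
# The sequence of ids may differ from run to run; the set of ids never does.
print(order)
```

Under the quoted definition, the varying order would already count as a race condition, even though no update is ever lost.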

bheadmaster 3 years ago

Hey, I've just downloaded PADUA (http://dx.doi.org/10.1145/2633685), and skimming through it, I can't find a single mention of the phrase "race condition".

Is this the paper you're referring to? If not, could you please provide a reference to which PADUA you're referring to? I'd really like to read more on the subject, especially if the source is, as you claim, an industry reference.

shwestrick 3 years ago

Author here. I think it would be very strange to say that this code does not have a race condition. The whole point of the term is to identify circumstances where non-deterministic timing of events influences how you reason about correctness, which is exactly what we're doing here.

bheadmaster 3 years ago

I've personally only heard the term "race condition" used to refer to bugs that have their source in non-deterministic execution of programs. In most cases, it refers to a specific sequence of events that the programmer did not foresee and that leads to incorrect computation.

Using the term "race condition" in the context of correct programs would make it cover exactly the same universe of programs as the term "non-determinism". I think that the distinction, however trivial (race condition = incorrect behavior, non-determinism = correct behavior), is still useful.

Great article, by the way. I did not mean to criticize it in any way. My "slight nitpick" about meaning of the words is really just that - a nitpick :)

kazinator 3 years ago

If we put on a cynical hat, your page reads like "I've never heard of lock-free algorithms based on atomic operations. I therefore must have just invented it and I get to name it: how about beneficial use of race conditions?"

worewood 3 years ago

Why not use the right term, then? "Colloquial" shouldn't be a thing with technical terms.

Race conditions are hard enough to explain to people and misusing the term just makes it more difficult.

mizzao 3 years ago

Interesting, it's almost like one could call this a "randomized algorithm" (which we know can be faster than deterministic algorithms) but the non-determinism comes from the input data and code execution rather than a RNG.

squeaky-clean 3 years ago

> This article needs additional citations for verification. (July 2010).

The definition you quote has no linked citation on Wikipedia. Usually a good sign that you should not treat those statements as definitive. A good Wikipedia article should not state any "facts" without a direct means of verification. Otherwise it's considered "original research" and against the wiki policy for a high quality article.

https://en.m.wikipedia.org/wiki/Wikipedia:No_original_resear...

bheadmaster 3 years ago

I used Wikipedia in order to have some kind of reference, but I was fairly sure of the meaning beforehand.

Searching the internet for "race condition definition" and taking the top few results brings several definitions that all agree in spirit with the Wikipedia one (see below).

If you know of any more reliable source that doesn't agree with Wikipedia on the definition of "race condition", please post it here. This is an honest request - I am always grateful to those who correct my mistakes (in good faith).

    wordnik [0]:  A flaw in a system or process whereby the output or result is unexpectedly and critically dependent on the sequence or timing of other events.

    techtarget [1]: A race condition is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly.

    techterms [2]: A race condition occurs when a software program depends on the timing of one or more processes to function correctly.

    javatpoint [3]: When the output of the system or program depends on the sequence or timing of other uncontrolled events, this condition is called Race Condition.

    techopedia [4]: A race condition is a behavior which occurs in software applications or electronic systems, such as logic systems, where the output is dependent on the timing or sequence of other uncontrollable events.

[0] https://www.wordnik.com/words/race%20condition

[1] https://www.techtarget.com/searchstorage/definition/race-con...

[2] https://techterms.com/definition/race_condition

[3] https://www.javatpoint.com/what-is-race-condition

[4] https://www.techopedia.com/definition/10313/race-condition

compressedgas 3 years ago

This reminds me of Kuper and Newton's LVars paper that introduces lattice variables in Haskell: https://users.soe.ucsc.edu/~lkuper/papers/lvars-fhpc13.pdf http://dx.doi.org/10.1145/2502323.2502326

shwestrick 3 years ago

Yes! It's a very similar idea. If I remember correctly, LVars are restricted enough to enforce determinism statically, which is quite nice.

layer8 3 years ago

> I’m not talking about data races. Data races are typically bugs, by definition.

One notable exception is the Racy Single-Check Idiom: http://javaagile.blogspot.com/2013/05/the-racy-single-check-...

It is particularly suitable for lazy initialization in code that is typically (but not necessarily) executed single-threaded, and is famously used in Java’s String.hashCode() implementation.
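The shape of the idiom can be sketched in Python, with the caveat that Python's threading and memory model differ from Java's, so this only illustrates the structure rather than Java's guarantees. The class `CachedHashString` is hypothetical. The hash is the usual 31-based polynomial, truncated here to an unsigned 32-bit value (Java's `int` is signed).

```python
import threading


class CachedHashString:
    """Lazily caches a hash with no lock: racing threads may each compute
    it, but they all compute and store the same value."""

    def __init__(self, text):
        self._text = text
        self._hash = 0  # 0 means "not computed yet"

    def hash_code(self):
        h = self._hash  # the single read of the shared field
        if h == 0:
            for ch in self._text:
                h = (31 * h + ord(ch)) & 0xFFFFFFFF
            self._hash = h  # idempotent write: every thread stores the same h
        return h


s = CachedHashString("abc")
results = []


def worker():
    results.append(s.hash_code())


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(set(results))
```

The key points are that the shared field is read exactly once into a local, and that the computation is deterministic, so a duplicated computation costs time but never changes the answer. A string whose hash really is 0 would simply be recomputed on every call in this sketch.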

shwestrick 3 years ago

That's a nice example. It seems that data races in Java don't "catch fire"; is that correct? The catch-fire problem is pretty bad for languages like C/C++, which have undefined behavior for data races, and in this sense data races are "bugs by definition" in those languages.

kaba0 3 years ago

Java’s references and primitives (apart from non-volatile long and double, which the JLS allows to be written as two separate 32-bit halves) are guaranteed to be “tear-free”, and the memory model also rules out “out-of-thin-air” values. So a field initialized to 1 and then written by several threads with 2 and 3 can only ever be observed as 1, 2, or 3, never any other value. Is that what you mean by not catching fire?

heydenberk 3 years ago

I appreciate a provocative title, but I think the practical lesson in almost all cases is the inverse: fixing race conditions can introduce performance bottlenecks.

shwestrick 3 years ago

Author here. It all depends on what the goal is. If performance is the goal, then perhaps race conditions can be considered acceptable, if the gains are significant enough.

I would hope that the primary takeaway from this post is that race conditions are not necessarily bugs. Race conditions are not necessarily something that needs to be "fixed".

touisteur 3 years ago

Tell me about the non-guaranteed order of operations in GPU reductions and floating-point results changing slightly between two runs. Yes, it's useful and you get the goddamn FP32 TFLOPS, but damn, it makes testing, validating, and qualifying systems harder. And yes, I know one shouldn't rely on or test for exact equality, but not knowing the actual order of FP operations makes numerical analysis of the actual error harder (just take the worst case of every reduction, ugh).
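A tiny CPU-side illustration of the underlying effect, using Python's 64-bit floats rather than a GPU. The helper `naive_sum` and the chosen values are assumptions for this sketch. It adds left to right with no compensation, which is why the built-in `sum` is avoided (newer Python versions make it more accurate for floats).

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]  # same numbers, different order

order_a = naive_sum(values)     # the first 1.0 is absorbed by 1e16 and lost
order_b = naive_sum(reordered)  # the large terms cancel first, nothing is lost
exact = math.fsum(values)       # correctly rounded reference sum

print(order_a, order_b, exact)  # 1.0 2.0 2.0
```

A reduction whose order changes from run to run can land on either answer, which is why an error bound has to assume the worst ordering.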

EDIT: and don't get me started on tensor cores and clever tricks to have them do 'fp32-alike' accuracy. Yes, wonderful magic, but how do you reason about these new objects without a whole new slew of tools?

hegelstoleit 3 years ago

If you're just doing BFS, why do you care who the parent is? Why not just choose the parent to be the predecessor? I.e., if you visit 4 from 1, then 1 is the parent. Why do you need to check a list of potential parents?

shwestrick 3 years ago

When visiting vertices in parallel, there might be multiple potential parents that all attempt to visit the same vertex simultaneously. So, we need a way of picking which parent "wins".
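A rough sketch of how a winner might be picked with a compare-and-swap, written in Python for readability. This is not the article's code: `AtomicCell`, `compare_and_swap`, and `parallel_bfs` are hypothetical names, and since Python's standard library has no atomic CAS, the cell emulates one with a lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

UNVISITED = -1


class AtomicCell:
    """Hypothetical stand-in for a hardware compare-and-swap cell."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Emulated CAS: succeeds only if the cell still holds `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def parallel_bfs(graph, source, workers=4):
    parent = [AtomicCell(UNVISITED) for _ in graph]
    parent[source].compare_and_swap(UNVISITED, source)
    frontier = [source]

    def expand(u):
        # Try to claim each neighbor; only the first successful CAS wins.
        return [v for v in graph[u] if parent[v].compare_and_swap(UNVISITED, u)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Expand the whole frontier in parallel, then build the next one.
            frontier = [v for claimed in pool.map(expand, frontier) for v in claimed]
    return [cell.load() for cell in parent]


# Diamond graph: vertex 3 is reachable from both 1 and 2 at the same level.
graph = [[1, 2], [3], [3], []]
parents = parallel_bfs(graph, source=0)
print(parents)  # parents[3] may be 1 or 2 depending on timing
```

Either outcome for vertex 3 is a valid BFS tree, and because exactly one CAS succeeds, vertex 3 is added to the next frontier only once.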