The Broken Promises of MRI/REE/YARV

The Broken Promises of MRI/REE/YARV (timetobleed.com)

149 points by ice799 15 years ago | 65 comments

iam 15 years ago |

I think this is a problem that exists across any VM that implements a GC, not just Ruby.

.NET CLR has the exact same problem (perhaps a harder one, since CLR has a moving GC), so anytime they touch GC references (pointers to objects that are collectible) it's always wrapped in an explicit GC stack frame (think GC struct that lives on the stack). Furthermore, all reads/writes are carefully done with macros (which of course expand to volatile + some other stuff) to make sure the compiler doesn't optimize them away.

On the one hand, this is nice because they don't need to scan the C-stack (it scans the VM stack and the fake GC frame stacks -- well it's one stack but you skip the native C frames), on the other hand this means that any time a GC object is used in C code (ok, actually it's C++) they have to be real careful to guard it.

Of course bugs crop up all the time where an object gets collected when it shouldn't have been. It happens so often that there is a name for it: "GC Hole".

Astute readers and users of p/invoke may remark that they don't have to set up any "GC frames" -- that is because this complicated scheme is not exposed outside of the CLR source. Regular users of .NET who want to marshal pointers between native/managed can simply request that a GC reference gets pinned, at which point I'm mostly sure it won't get collected until it's unpinned.

The bad news is I'm almost positive there is nothing you can do with just C here to make this problem go away. You'd want stuff to magically just happen under the hood, and C++ is the right way to go for that.

It's probably possible to create an RAII-style C++ GC smart pointer that would be 99% foolproof at the expense of some performance. It gets a little bit trickier with a moving collector. I am thinking it could ref/unref at creation/destruction, and disallow any direct raw pointer usage so you can't shoot yourself in the foot.

Of course the people writing the GC still need to worry about this.

tsuyoshi 15 years ago | |

Anyone who has written an extension to a garbage-collected language in C will have run into this issue. Personally I've written extensions for Guile, OCaml, Ruby, MLton, and Java, and all of them have tricky rules for making your C code safe for garbage collection. Using volatile is the wrong way to do this though... this tells me that the people figuring this stuff out for Ruby don't really know C that well.

tptacek 15 years ago | |

A very similar pattern bit me in the ass with the ObjC GC and libevent.

fanf2 15 years ago | |

There are other ways to structure the VM's API so that all VM objects are connected to VM data structures at all times. A good example is Lua, where you manipulate Lua objects on the Lua stack - they are never referred to by a raw C pointer.
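
As a rough illustration of the idea (a toy sketch, not Lua's actual API; `ToyVM`, `toy_push_string`, `toy_length`, and `toy_pop` are hypothetical names), C code can be handed stack indices instead of object pointers, so every live object stays reachable from a VM-owned structure:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy VM: every live object is reachable from vm->slots,
   so a collector would never need to scan the C stack or registers. */
enum { TOY_STACK_MAX = 16 };

typedef struct {
    char *slots[TOY_STACK_MAX]; /* objects owned by the VM */
    int   top;                  /* number of slots in use */
} ToyVM;

/* Copy a C string into a VM-owned object, push it, and return its index. */
static int toy_push_string(ToyVM *vm, const char *s) {
    assert(vm->top < TOY_STACK_MAX);
    size_t n = strlen(s) + 1;
    char *obj = malloc(n);
    assert(obj != NULL);
    memcpy(obj, s, n);
    vm->slots[vm->top] = obj;
    return vm->top++;
}

/* Length of the string at a stack index; the caller never sees the pointer. */
static size_t toy_length(const ToyVM *vm, int idx) {
    assert(idx >= 0 && idx < vm->top);
    return strlen(vm->slots[idx]);
}

/* Pop n values; only at this point may the objects be reclaimed. */
static void toy_pop(ToyVM *vm, int n) {
    assert(n >= 0 && n <= vm->top);
    while (n-- > 0) {
        free(vm->slots[--vm->top]);
        vm->slots[vm->top] = NULL;
    }
}
```

A caller would write something like `int i = toy_push_string(&vm, "hello");` and later `toy_pop(&vm, 1);`, never storing the object's address in a C local that the collector would have to discover.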

thibaut_barrere 15 years ago |

I do appreciate the technicality of the article, but I'm not sure I agree with the first point of the conclusion: how does this make MRI (and related) 'fatally flawed'? (Real question.)

What makes it 1/ irreversible and 2/ bad for today's users?

EDIT: also, I wouldn't stop using Ruby because of that; I would use JRuby or Rubinius or IronRuby (if I understand correctly, these are not affected?)

jjore 15 years ago | |

A fairly commonsensical approach is to just require all extension authors to annotate their code properly. At some basic level, this happens with Perl with its oft-maligned DSL for generating C code that happens to do all the right declarations. You might then end up writing your code using more macros. It's certainly not pretty but it is sound.

A plausible rewrite of that function in an XS for Ruby would leave the function declaration and wrapper code up to your equivalent of xsubpp, which would execute your DSL and transform the wrapped code into fully functional C. If you build a C-using extension from Perl, you'll find an XS file like http://cpansearch.perl.org/src/SIMON/Devel-Pointer-1.00/Poin... which during the `perl Makefile.PL && make` step is transformed via `xsubpp Pointer.xs > Pointer.c` and then compiled as normal C.

phillmv 15 years ago | |

It's a bit hysterical.

Shit! MRI/YARV/REE are inherently fatally flawed! All that code I have running in production must be a FIGMENT OF MY IMAGINATION! SAVE YOURSELVES

benblack 15 years ago | | |

I am running this code in production, hence it cannot have bugs. QED.

Yours in perpetual bogglement,

Lil' B

dlikhten 15 years ago | |

I fail to see how this is an all-hands-abandon-ship issue. If it's a critical issue in all 3 interpreters, it should be fixed ASAP if possible. At worst with a flag.

If Rubinius/IronRuby/JRuby have no issues, this may become moot eventually, as Rubinius has been gaining lots of traction recently and is getting faster with each release, outperforming the standard Ruby VMs in many cases.

evanphx 15 years ago | | |

Neither Rubinius nor JRuby (and probably not IronRuby either) has this issue, because they all use accurate garbage collection rather than conservative. Accurate requires much more bookkeeping since all pointers must always be properly identified, but if you start writing a system with accurate GC, it's pretty easy. Bugs like this are a direct result of a conservative GC strategy (and these bugs, as I'm sure you got reading Joe's post, really really suck to find).

jleader 15 years ago | | |

I think the author has a valid point that the "conservative" garbage collection approach has a flaw in its assumptions about the behavior of C compiler optimizations, and it doesn't sound like something easy to fix without a rewrite (i.e. switching to "accurate" GC). This sort of flaw will continue producing new surprising bugs, potentially any time the code is changed, or any time the compiler's optimizations change. These sorts of bugs are frustrating to track down, because they depend both on details of code optimization, and on details of memory allocation/deallocation history. If you compile with debugging options, you may change what optimizations are used; if you insert debug prints for some old-school log-based analysis, you may change the allocation/deallocation history, so the GC gets triggered in a different place.

davesims 15 years ago |

This post is a weird mix of careful technical analysis and douchey, Zed Shaw-style hysterical overstatement.

However, I would like to see Matz's response to the recommended steps for a fix at the end. Sounds like a reasonable goal to add for Ruby 2.0.

Note to self: Listening to Papoose while writing a technical blog post turns your otherwise important observations into a Chicken Littleish, end-of-the world rant.

Nelson69 15 years ago | |

I kind of branded it a bit "douchey" at first too, but as I thought about it, it seemed remarkably restrained considering he debugged this issue. It's not like this happened all the time; he had to get kind of lucky and build and calibrate a system just right to capture it.

I don't intend this to be an inflammatory question. I'm sort of a perpetual Ruby novice: it's never been my day job and I've never managed to catch up with the community; as soon as I feel pretty good with something, I find it's been obsoleted a couple of times. I like it, but how does the community at large deal with stuff like this? This guy found a real bug and invested some time in it. Do other Rubyists just deal with crashes and restart their stuff? Do they just consider it part of "being on the cutting edge"? Or do they not even notice?

msbarnett 15 years ago | | |

In practice crashes due to this issue simply do not occur very often. I think I've had the VM segfault twice in the last two or three years.

That's what makes the hyperbolic tone of this article so douchey; he wrote up an interesting dissection of an edge case issue as though it were an ongoing catastrophe, mostly just to inject a bunch of chest-thumping rock-star bravado that added nothing of value to the discussion.

kingkilr 15 years ago |

I think this goes to a pretty simple point: anything you have to do by hand you will eventually get wrong. Thus, to a first approximation, anything that can be automated probably ought to be. To show off this principle I'm going to show off some of the PyPy source code: https://bitbucket.org/pypy/pypy/src/default/pypy/module/sele...

This is the implementation of `select.epoll`. Some things you'll notice: there are no GC details (allocations outside the GC of C-level structs are handled nicely with a context manager), and we have a declarative (rather than imperative) mechanism for specifying argument parsing for Python-level methods, which ensures consistency in readability as well as in error handling, etc.

wingo 15 years ago |

Cute. The Boehm-Demers-Weiser collector has GC_reachable_here for this reason. Guile has scm_remember_upto_here since before it switched to libgc. I'm sure other systems have their things too.

That said, I like Handle, the RAII thing that V8 uses. It also allows for compacting collection. Too bad C doesn't do RAII.

thibaut_barrere 15 years ago | |

.NET has GCHandle [1], and I believe JNI calls on the JVM have a similar mechanism (GetXXCritical [2]).

[1] http://www.shafqatahmed.com/2008/05/memory-control.html

[2] http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index....

onedognight 15 years ago | |

While C doesn't support RAII, gcc does: https://secure.wikimedia.org/wikipedia/en/wiki/Resource_Acqu...
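
For illustration, the GCC feature in question is the `cleanup` variable attribute (also accepted by Clang), which runs a function when a variable goes out of scope. A minimal sketch, assuming a hypothetical pin/unpin API (`toy_pin`, `toy_unpin`, and `pinned_count` are made up for this example):

```c
#include <assert.h>

/* Hypothetical counter standing in for a VM's table of pinned objects. */
static int pinned_count = 0;

typedef struct { int id; } ToyHandle;

/* Pin an object and return a handle to it. */
static ToyHandle toy_pin(int id) {
    ToyHandle h = { id };
    pinned_count++;
    return h;
}

/* Cleanup functions receive a pointer to the variable leaving scope. */
static void toy_unpin(ToyHandle *h) {
    (void)h;
    pinned_count--;
}

/* GCC/Clang extension: call toy_unpin(&var) when var goes out of scope. */
#define SCOPED_HANDLE __attribute__((cleanup(toy_unpin))) ToyHandle

/* Returns how many objects were pinned while the handle was in scope. */
static int use_object(int id) {
    int pinned_inside;
    {
        SCOPED_HANDLE h = toy_pin(id);
        assert(h.id == id);
        pinned_inside = pinned_count;
    } /* toy_unpin(&h) runs here, on every exit path from the block */
    return pinned_inside;
}
```

This is not standard C, so it only helps if the extension is always built with a compiler that supports the attribute.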

wonnage 15 years ago |

Can someone dissect this a little more? My understanding is the pointer to str never gets written to the stack, and so str on the heap might get freed before zstream_append_input makes use of it. But how could the GC see this/what is the faulty assumption?

eonwe 15 years ago | |

My understanding is that Ruby GC just runs through its heap of Ruby objects and sees which of them are reachable based on other objects in the Ruby heap and C-stack/registers.

The faulty assumption seems to be that counting references only to RVALUEs (Ruby objects in the heap) is enough to determine whether a part of memory can be freed. This breaks down in C extensions where macros extract some part of the object, or something pointed to by it, for use. In this case RSTRING_PTR extracts the C char array used by str for zstream_append_input to use (let's call it arr).

If zstream_append_input or any call underneath it tries to allocate a new Ruby object, GC may get called and str (and thus arr) may get freed because there are no references left to it anymore (no heap/stack/register because the register value was overwritten).

And this seems to require all Ruby C-extension writers to lock the objects they're using through macros with RB_GC_GUARD.

Edit: note that there are no references left to str
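
The failure mode described above can be modeled with a deterministic toy (not MRI's collector; `ToyString`, `toy_collect`, and `survives` are hypothetical). The `roots` array stands in for whatever a conservative scan of the stack and registers would find, and the collector only recognizes pointers to the object itself, not to its buffer:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy object: a header that refers to a separate byte buffer. */
typedef struct {
    const char *buf;   /* stands in for the char array behind RSTRING_PTR */
    int         alive; /* cleared when the toy collector reclaims the object */
} ToyString;

/* Toy collector: any heap object with no matching entry in roots[] is
   reclaimed. Pointers into an object's buffer do not count as references. */
static void toy_collect(ToyString *heap, size_t heap_len,
                        ToyString *const *roots, size_t nroots) {
    for (size_t i = 0; i < heap_len; i++) {
        int reachable = 0;
        for (size_t j = 0; j < nroots; j++)
            if (roots[j] == &heap[i]) reachable = 1;
        if (!reachable) heap[i].alive = 0;
    }
}

/* Returns 1 if the object survived a collection that ran while only the
   interior pointer was still in use by the caller. */
static int survives(int keep_reference) {
    ToyString heap[1] = { { "data", 1 } };
    ToyString *str = &heap[0];
    const char *arr = str->buf;              /* like RSTRING_PTR(str) */
    ToyString *roots[1] = { str };
    /* nroots == 0 models the optimizer having dropped every copy of str. */
    size_t nroots = keep_reference ? 1 : 0;
    toy_collect(heap, 1, roots, nroots);
    (void)arr;                               /* arr is still needed here */
    return heap[0].alive;
}
```

With `keep_reference` set, the object survives; without it, the object is reclaimed even though `arr` is still in use, which is the situation RB_GC_GUARD is meant to prevent.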

fhars 15 years ago | |

The point is that the GC cannot see that and so assumes that the object is no longer referenced and can be freed. A conservative collector works by scanning the live memory of the process for things that look like pointers into the same live memory and then assumes that all objects that are not the target of any of these pointers are garbage. Tough luck if the only reference to a live object lives in a register.

ice799 15 years ago | | |

registers are scanned, too. the bug is not that the ref is in a register. the bug is that there are no refs anywhere. not on the stack and not in any register.

xpaulbettsx 15 years ago |

So, what this really seems to boil down to is:

The Ruby C API is returning objects that are not correctly reference-counted for a short period of time and are incorrectly subject to GC.

This doesn't seem fatal to me, just not reasonably fixable from the GC side. It might be true that a new API is needed to hold refs on the C side.

benblack 15 years ago | |

I am apparently in that foolish minority that believes language runtimes should not segfault/corrupt themselves while running correct code. That this problem requires significant effort just to hack around, while actually fixing it would take a major architectural change, is what elevates this from mere "lolwut?" to fatally flawed. There are good alternative runtimes for Ruby, such as the JVM and the CLR, that do not suffer from this problem. Y'all should use them.

Funktacularly yours,

Lil' B

pmjordan 15 years ago | | |

I can crash a JVM or CLR program instantly by calling out to some careless C code. This bug is exactly such an instance: the C code for one of the library functions is flawed. The only way you can stay safe is by (a) having a flawless VM and (b) never calling out of it. The former is extremely unlikely, the latter extremely impractical as it inhibits any kind of I/O.

davesims 15 years ago | | |

If edge case segfaults were fatal flaws Windows should never have shipped. I say 'edge case' because obviously there are millions of lines of Ruby code running for years on MRI/YARV/REE that have not encountered this error often enough to cause the kind of breathless panic you seem to think is appropriate.

BTW the CLR is not a good alternative runtime for Ruby, might not ever be: http://www.zdnet.com/blog/microsoft/whats-next-for-microsoft...

You did good work here -- don't hurt your credibility with overstatement.

CPlatypus 15 years ago |

"Very few people out there know that the volatile type qualifier exists"? Only if there are "very few" kernel programmers, embedded programmers, and others who have used C for anything low-level and/or multi-threaded. Otherwise, no. Sorry, but knowing about it doesn't make you special.

"Volatile" is the wrong fix, by the way. That's just depending on yet another non-required behavior. There is in fact no further reference to "str" between the function call and the reassignment at the start of the next iteration, so there's nothing for "volatile" to chew on. This particular version of this particular compiler just happens to add an extra pair of stack operations in this case, but it's not truly required to. A real fix would not only mark the variable as volatile but also add a reference after the function call. The same "(void)str;" type of statement that's often used to suppress "unused argument/variable" warnings should count as a reference to force correct behavior here.

softbuilder 15 years ago |

Well plus one for a blog post with a theme song, anyway.