Ruby 2.6.0-preview2 released with JIT (ruby-lang.org)
Speaking of optimizing method calls: now that it's been a few years, I wonder what Ruby folks think about refinements. Are you using them? Are they helpful? Horrible?
I remember reading from the JRuby folks that refinements would make Ruby method calls slower---and not just refined calls, but all method calls [1], although it sounds like that changed some before they were released [2]. It seems like people stopped talking about this after they came out, so I'm wondering if refinements are still a challenge when optimizing Ruby? I guess MRI JIT will face the same challenges as the Java implementation?
It might seem strange that they lead with this new feature, but `yield_self` can greatly improve the Ruby chainsaw, and `then` makes it more accessible.
The style of writing a Ruby method as a series of chained calls has a non-trivial effect on readability and conciseness. `then` lets you stick an arbitrary function anywhere in the chain. Chains become more flexible and composable, with less need to interrupt them with intermediate variables that break the flow.
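To make that concrete, here is a minimal sketch of a chain that uses `then` (Ruby 2.6+; `yield_self` on 2.5) to splice in functions that are not methods on the receiver. The JSON payload and the `average` helper are made up for illustration.

```ruby
require "json"

# Hypothetical helper: a plain method, not something defined on Array.
def average(numbers)
  numbers.sum.to_f / numbers.size
end

# Hypothetical input data.
report = '{"latencies_ms": [12, 18, 9, 21]}'
  .then { |text| JSON.parse(text) }      # module function in the middle of a chain
  .fetch("latencies_ms")
  .then { |values| average(values) }     # ordinary method, no intermediate variable
  .round(1)
  .then { |avg| "average latency: #{avg} ms" }

puts report
```

Without `then`, each of those steps would need a temporary variable or an inside-out nesting of calls.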
I've been using `ergo` from Ruby Facets for years ... largely the same thing ... and the more I used it the more readable I find my old code now. Funny how adding one very simple method can have more effect than so many other complex, high effort changes.
An order of magnitude as in .. 10x? This seems too good to be true. Half the arguments against Rails melt away like butter if that's truly the case.
Anyone with a better understanding of the details care to comment on the likelihood of these performance gains being actually realised, and if not, what we might realistically expect?
Once he finished his doctorate at the EPFL, off to Stripe he went, bye bye Scala. Tough industry, on the one hand Scala benefits from a revolving door of high level EPFL doctoral students, and on the other the talent pool shifts around as students come and go.
Money talks, companies like Stripe have a leg up in that they can fund full-time engineers to work on projects, whereas institution backed projects typically have a much smaller pool of long-term engineers to rely on (JetBrains, for example, has something like 40 full-time engineers working on Kotlin/KotlinJS/Kotlin Native).
[0] https://github.com/DarkDimius [1] https://github.com/lampepfl/dotty
From my uneducated perspective, seems like Graal VM could become the de facto Ruby deployment stack.
If the community doesn't like where things are going, they could fork the whole thing and call it something else, like Coffee.
Graal is EPL, GPLv2, LGPL licensed.
> Compatibility of the structure of AST nodes are not guaranteed.
Not sure if that means it's going to be any more stable/complete than ruby_parser / ruby2ruby
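For anyone who wants to poke at it, a rough sketch of walking the tree is below. It assumes MRI and the constant name `RubyVM::AbstractSyntaxTree` from the released 2.6 (the preview may expose it under a different name), and since the node structure is explicitly not guaranteed, the `walk` helper (a hypothetical name) only prints node types rather than relying on any particular shape.

```ruby
# MRI-specific and experimental: node layout can change between versions.
root = RubyVM::AbstractSyntaxTree.parse("x = 1 + 2")

# Hypothetical helper: collect an indented list of node types.
def walk(node, depth = 0, out = [])
  # Children can also be symbols, integers, arrays or nil; skip those.
  return out unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)

  out << "#{'  ' * depth}#{node.type}"
  node.children.each { |child| walk(child, depth + 1, out) }
  out
end

lines = walk(root)
puts lines
```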
Rubex - A Ruby-like language for writing Ruby C extensions.
Oh dear god.
This isn't even remotely true. Crystal syntax looks like Ruby. Crystal's semantics (the bit that matters) are not like Ruby.
Oracle's plan for world domination via JVM is completely changing the performance landscape for dynamic languages.
Between Ruby 1.8 and 2.5, performance has improved around 13x in tight loops[2]. The Rails performance issue has been massively overblown since 1.9 was released.
Ruby 1.8 was a tree walking interpreter, so the move to a bytecode VM in 1.9 was a huge leap in performance. Twitter bailed to the JVM before moving to 1.9. A lot of those 10-100x performance differences to the JVM are gone thanks to the bytecode VM and generational GC.
Bytecode VMs all have the same fundamental problem of instruction dispatch overhead: they're basically executing different C functions depending on input.
Doing _anything_ to reduce this improves performance dramatically, even just spitting out the instruction source code into a giant C function, compiling it, and calling that in place of the original method. Another 10x improvement on tight loops should not be a problem.
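A toy model of that idea, written in Ruby rather than C so it runs anywhere: the interpreter pays for a `case` dispatch on every instruction, while the "compiler" emits straight-line source once and evaluates it into a lambda. The instruction set and names here are invented for illustration and have nothing to do with MRI's real bytecode.

```ruby
# Hypothetical stack-machine program: computes (arg + 3) * 2.
program = [[:push_arg], [:push, 3], [:add], [:push, 2], [:mul], [:ret]]

# Interpreter: one dispatch (the case/when) per instruction, on every call.
def interpret(program, arg)
  stack = []
  program.each do |op, operand|
    case op
    when :push_arg
      stack.push(arg)
    when :push
      stack.push(operand)
    when :add
      b = stack.pop
      a = stack.pop
      stack.push(a + b)
    when :mul
      b = stack.pop
      a = stack.pop
      stack.push(a * b)
    when :ret
      return stack.pop
    else
      raise ArgumentError, "unknown opcode #{op}"
    end
  end
  nil
end

# "Compiler": dispatch happens once, at compile time. The result is
# straight-line code with no per-instruction branching left in it.
def compile(program)
  lines = program.map do |op, operand|
    case op
    when :push_arg then "stack.push(arg)"
    when :push     then "stack.push(#{Integer(operand)})"
    when :add      then "b = stack.pop; a = stack.pop; stack.push(a + b)"
    when :mul      then "b = stack.pop; a = stack.pop; stack.push(a * b)"
    when :ret      then "return stack.pop"
    else raise ArgumentError, "unknown opcode #{op}"
    end
  end
  source = "lambda do |arg|\n  stack = []\n  #{lines.join("\n  ")}\nend"
  eval(source)
end

compiled = compile(program)
interpreted_result = interpret(program, 4)
compiled_result = compiled.call(4)
puts interpreted_result, compiled_result
```

The generated code still pushes and pops a stack, so it is about as dumb as generated code gets; the only thing removed is the dispatch, which is the point being made above.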
[1] https://www.techempower.com/benchmarks/#section=data-r15&hw=...
[2] https://github.com/mame/optcarrot/blob/master/doc/benchmark....
Nor did I know that Twitter jumped off Rails before Ruby got performant, which means the argument that Twitter outgrew Rails isn't so correct anymore.
Still, thanks for this insightful comment.
Yeah no kidding.. https://samsaffron.com/archive/2018/06/01/an-analysis-of-mem...
With 2.6 and sorbet [1] coming down the line, it's exciting to be a Rubyist again!
It does if you ignore the overhead of JIT compilation itself. However, my understanding is that writing a JIT implementation that performs better than a good interpreter is surprisingly difficult. You have to have a lot of complicated logic for tracking hotspots and using JIT judiciously in short-running scripts.
https://www.techempower.com/benchmarks/#section=test&runid=a...
Another big win is the bootsnap gem, which is a cache of previous VM runs that loads faster than parsing all invariant pieces of code again.
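As far as I understand it, the core trick is caching compiled bytecode so the parser can be skipped on the next boot. MRI exposes the building blocks directly; the sketch below round-trips a snippet through `RubyVM::InstructionSequence#to_binary` and `load_from_binary`. It keeps the bytes in memory, whereas a real cache would write them to disk keyed by file path and mtime, and none of this is bootsnap's actual code.

```ruby
# MRI-specific: compile source to an instruction sequence once...
source = "[1, 2, 3].map { |n| n * 2 }.sum"
iseq = RubyVM::InstructionSequence.compile(source)

# ...serialize it (this String is what a cache would persist)...
binary = iseq.to_binary

# ...and later load it back without parsing the source again.
restored = RubyVM::InstructionSequence.load_from_binary(binary)
result = restored.eval

puts result
```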
Golang plus gin, sure. However there are other Go frameworks on the charts that blast the Ruby competition out of the water. Ruby isn't really on the podium at all with C, C++, Rust, Golang, C#, and Java about an order of magnitude out in the lead on fortunes.
Martini isn't much of a framework itself either, so let's forget the full-featured nonsense. Almost none of the ecosystem is in play with these benchmarks. You could build a system up around fasthttp just as well as net/http, and ASP.NET certainly can't be accused of being a for-purpose contender.
The most impressive thing IMHO is how well Ruby is doing on maximum latency. I can't quite reconcile that, considering fasthttp is pretty much zero-allocation and Go's stop-the-world pauses are in the microseconds. Pretty impressive.
Ruby (MRI) will have to reinvent the wheel in order to get a panoply of optimizations that some very smart people have already baked in: like the ability to target almost any platform from the same library.. GCC requires cross-compiling per target.
Test suite still passes on it though, so upgrading shouldn't be a huge deal at least へ‿(ツ)‿ㄏ
But that doesn't mean you can't use a conventional compiler stack like LLVM as a JIT and get excellent code - it's just going to take its own sweet time doing so.
Can anyone think of any reasonably common stacks using LLVM as a JIT? There's Mono, but that's a non-default mode; not sure if it's typically used. The Python Unladen Swallow experiment failed. WebKit had a short-lived FTL JavaScript optimization pass, but that was replaced by B3.
Which is just a long-winded way to suggest that LLVM is not likely to be ideal as a JIT, at least based on what past projects have done.
(Not trying to imply that writing C to disk is better, but it may well be simpler & more flexible - not worthless qualities for an initial implementation).
I know very little about Ruby specifically but IME for this kind of dynamic language you get most of the initial gains by:
- removing (by analysis or speculation) dynamic dispatch
- unboxing / avoiding allocations in the easy cases
Once you've done that, you can generate pretty dumb assembly and still come out way ahead of your interpreter (and avoid very costly optimization / instruction selection / regalloc / scheduling).
Most of what LLVM / GCC do only makes sense when you've got your code down close to whatever you would actually write in C.
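The "speculation" part can be modelled in a few lines: guess the operand types, check the guess with a cheap guard, and fall back to the fully generic path when the guess is wrong. This is only a model with invented method names. In real Ruby the fast path below still goes through a method call, whereas a JIT would emit a plain machine add behind the same guard.

```ruby
# Generic path: full dynamic dispatch, works for any receiver.
def generic_add(a, b)
  a.public_send(:+, b)
end

# Speculated path: assume both operands are Integers, guard the
# assumption, and "deoptimize" to the generic path when it fails.
def speculative_add(a, b, stats)
  if a.is_a?(Integer) && b.is_a?(Integer)
    stats[:fast] += 1
    a + b # stand-in for an unboxed machine add in generated code
  else
    stats[:fallback] += 1
    generic_add(a, b)
  end
end

stats = Hash.new(0)
int_sum = speculative_add(2, 3, stats)      # guard holds
str_sum = speculative_add("a", "b", stats)  # guard fails, generic path
puts int_sum, str_sum, stats.inspect
```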
> The main purpose of this JIT release is to provide a chance to check if it works for your platform and to find out security risks before the 2.6 release
Performance is disappointing, though.
Care to elaborate?
> Unstable interfaces. An LLVM JIT is already used by Rubicon. A lot of efforts in preparation of code used by RTL insns (an environment)
https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch#a-few...
The *nix philosophy has long been towards trying to provide choice wherever possible, so that people can use the tool that best meets their needs.
Fast GC is critical to Ruby performance, so a ton of work went into it. Ruby 2.2+ has a very short STW phase thanks to generational GC + incremental marking.
CRuby's GIL doesn't really matter for serving web requests since it's run with one process per core like NodeJS is. It's less memory efficient but doesn't really affect throughput so much. Also, JRuby has no GIL.
Charles Nutter's early tests using JRuby on the GraalVM sound like there's another big step in performance coming without a huge amount of work.
I haven't had a chance to use Bootsnap yet but it sounds really promising.
They don't "dump to disk", if you mean an actual storage device. By default they store data to a "file system in memory" (a tmpfs), so it never gets written to a long-term storage device (not even an SSD). Even if you do "dump to disk", on a modern OS storing things in a file just puts it in memory and schedules it for eventual long-term storage. Of course, doing things this way has overheads, but it may not be so bad.
The C frontend has to parse things, of course, but it looks like they're heavily optimizing this. "To simplify JIT implementation the environment (C code header needed to C code generated by MJIT) is just an vm.c file. A special Ruby script minimize the environment (Removing about 90% of the declarations). One worker prepares a precompiled code of the minimized header, which starts at the MRI execution start".
Their current results are that "No Ruby program real time execution slow down because of MJIT" and "The compilation of small ISEQ takes about 50-70 ms on modern x86-64 CPUs". You're of course using more CPU (to do the compilations in parallel), and you have to have a compilation suite available at runtime, but in many circumstances that is perfectly reasonable.
IIRC, the gcc C compiler doesn't generate machine code itself either; it generates assembly code, which is then farmed out to a separate assembly process (using the GNU assembler, aka GAS). Farming out compilation work to other processes is not new.
It seems to me that this is a really plausible trade. This approach means that they can add a just-in-time compiler "relatively" quickly, and one that should produce pretty decent code once they add some actual optimizations (because it's building on very mature C compilers). The trade-off is that this approach requires more run-time CPU and time to create each compiled component (what you term as overhead). For many systems, this is probably an appropriate trade. As I posted earlier, I'm very interested in seeing how well this works - I think it's promising.
Twitter, even back in those days would have still outgrown today's Rails. It was Ruby that has gotten a lot faster. Not necessarily Rails.
Now Android has Google's own J++, limiting what kind of Java libraries are portable to the platform.
At the same time, some OEMs are adopting Android instead of Embedded Java, thus increasing the fragmentation about what Java libraries are actually portable.
Google just thought they could let Sun close its doors and get away with how they created their own J++.
Having said that, Graal and its related projects are all open source, with a license listing available in its README:
It's the instruction dispatch overhead that's the real unavoidable problem. LuaJIT, for example, uses a bunch of tricks to minimize it in the bytecode VM, and it's significantly faster than the standard Lua VM but still far, far slower than basic JIT compilation.
Lua JIT is one of the most sophisticated dynamic language JITs out there, so it's hardly evidence that a simple implementation of a JIT will perform better than a good bytecode interpreter.
The problem is less acute for server side apps because the programs run for a long time, so that the initial compilation overhead is insignificant. However, there's a reason that you need a JIT to make Ruby fast rather than an ahead of time compiler. Ruby has so few compile-time guarantees that you need to do a lot of dynamic specialization to get really significant performance improvements. So compilation might still be triggered even after a script has been running for a long time.
I'd add that PyPy, which is also very sophisticated, is often not much faster than CPython, and in fact is slower for some types of code. Writing good JIT-based implementations for dynamic languages is really a tough problem. See e.g. the following post for some explanation of why:
Yes.
> Lua JIT is one of the most sophisticated dynamic language JITs out there, so it's hardly evidence that a simple implementation of a JIT will perform better than a good bytecode interpreter.
I meant that even a basic JIT can offer the same speedup as LuaJIT's interpreter, and a lot more work went into the latter.
> The problem is less acute for server side apps because the programs run for a long time, so that the initial compilation overhead is insignificant. However, there's a reason that you need a JIT to make Ruby fast rather than an ahead of time compiler. Ruby has so few compile-time guarantees that you need to do a lot of dynamic specialization to get really significant performance improvements. So compilation might still be triggered even after a script has been running for a long time.
The initial results of MJIT for simply removing the instruction dispatch overhead and doing some basic optimizations are a 30-230% performance increase on a small but real-world benchmark. No type specialization or speculative optimization required.
> I'd add that PyPy, which is also very sophisticated, is often not much faster than CPython, and in fact is slower for some types of code. Writing good JIT-based implementations for dynamic languages is really a tough problem. See e.g. the following post for some explanation of why:
Most of the discussion about PyPy is completely irrelevant for the discussion about MJIT. PyPy isn't a method JIT. PyPy traces the interpreter itself and tries to produce a specialized interpreter. It works even worse at optimizing Ruby code via Topaz.
Ideally, you'd serialize directly from the database, bypassing the application entirely. Easily doable in ActiveRecord, but it's an explicit action, not the default. Not even sure if it's available in other databases besides PostgreSQL.
It's faster to hand-generate machine code straight from an interpreter than to invoke a C compiler. But that is not the only issue. As with everything else, this is a trade-off, and I'm eager to see how it works out. I can see some positive reasons to do this:
1. The Ruby developers get highly-optimized machine code, with relatively little effort on their part. Many, many man-years have been spent to make C compilers generate highly optimal code.
2. The C language, as an interface, is extremely stable, so once it works it should just keep working. Compare that to the constantly-changing interfaces of many alternatives.
3. Debugging is WAY easier. If there's a problem in generated code, it's way easier to read intermediate C code (especially after going through a pretty-printer) than many other kinds of intermediate formats, and millions of people already know it.
In short, this approach means that they can very rapidly produce a system that can run tight loops very quickly, one that resists interface instability (so the approach should keep working), and one that's easy to debug (so it should be reliable). For many applications, the fact that it takes a little more time to do the compilation may be unimportant, especially since that work is embarrassingly parallelizable.
I'm very interested in seeing how this plays out. If this works well for Ruby, I suspect some other language implementations will start considering using this approach. I'm sure it's not the best approach in all circumstances, but it might work very well for Ruby - and maybe for some other languages like it.
"If it works, it isn't stupid".
Not for machine-generated code. C compilers work well on human-written code, and not as well on Ruby -> C "translations".
That depends on the machine generated code. C compilers are optimized for whatever the C compiler authors perceive as a common construct. If the generated C code uses constructs similar to what humans do, it's often quite good. If not, you can change the code that generates C, or in some cases you can convince the C compiler authors to optimize that situation as well.
Generating C is part of the bootstrapping process, it isn't used at runtime, the JIT generates the usual machine code directly.
There are Common Lisp implementations that support a similar mechanism of generating C code (ECL, Kyoto CL...), but I don't think any of them compiles C into a .so which then gets dlopened right away as a poor man's JIT.
See here, starting on P. 36: http://www.softwarepreservation.org/projects/LISP/kcl/doc/kc...
When KCL compiles a lambda expression, it generates a C file called "gazonk.lsp" and compiles that.
(The above report is a little confusing; in some places it claims that an object file has a .o suffix, but then with regard to this gazonk implicit name, it claims that the fasl file is gazonk.fasl.)
>(defun foo (a) (* a 42))
FOO
>(compile 'foo)
Compiling /tmp/gazonk_24158_0.lsp.
End of Pass 1.
End of Pass 2.
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling /tmp/gazonk_24158_0.lsp.
Loading /tmp/gazonk_24158_0.o
start address -T 0x888488 Finished loading /tmp/gazonk_24158_0.o
#<compiled-function FOO>
NIL
NIL
Sun did what they could to save face.
"Triangulation 245: James Gosling"
https://www.youtube.com/watch?v=ZYw3X4RZv6Y&feature=youtu.be...
It also doesn't change the fact that even with Android 8.1, I, as a Java developer, cannot take a random jar from Maven Central and be certain it won't crash and burn on Android, regardless of the version.
But I don't see how you can tolerate that contradiction. Either you agree with Oracle that the Java APIs were copyrighted and Google should not have been allowed to reconstruct them. Or you worry about fragmentation coming from an incompatible Java implementation. Doing both is nonsensical.
Google should have paid Sun instead of pulling a Microsoft-style move that fostered Sun's downfall, period.
And in doing so, Android would have been JavaSE compliant plus whatever additional libraries they would think to drop on top of it.
Topaz was easily the fastest Ruby JIT before TruffleRuby, beating the JRuby and Rubinius JITs. It was very impressive.
It's a shame Topaz was never really "finished".
So, this amounts to a small improvement for some types of code. Indeed, it is "easy" to get that by "just" using some basic JIT techniques. The trick is to get consistently better performance across the board. Relevant tweet at https://medium.com/@k0kubun/the-method-jit-compiler-for-ruby...:
>I've just committed the initial JIT compiler for Ruby. It's not still so fast yet (especially it's performing badly with Rails for now), but we have much time to improve it until Ruby 2.6 (or 3.0) release.
This will come with the rest of the optimizations Takashi has planned for Ruby 2.6. Ruby-Ruby method inlining, which is almost finished, is a huge one for improving Rails performance. IMHO there's no real point talking about Rails until it's working in some form.
> >I've just committed the initial JIT compiler for Ruby. It's not still so fast yet (especially it's performing badly with Rails for now), but we have much time to improve it until Ruby 2.6 (or 3.0) release.
It turned out this wasn't even testing MJIT with Rails because https://twitter.com/samsaffron/status/963219086833434624
The main place where LLVM bites you is compatibility. There simply is none. This is a constant drain on your resources and a lot of projects can't afford to keep up. There is even a project on LLVM's own home page which was on 3.4 for a long time and has just recently upgraded to 3.8 [2].
But if the alternative is shelling out to a C compiler? I'll take LLVM any day. The issue is not just the overhead of a call to an external program, it's all the extra complexity that comes along with that. It is very, very easy for this approach to break, especially when you consider the breadth of C compilers that exist, and all the possible ways they can be configured. In contrast, LLVM is "just" a library that you link to.
If anything, I'd bet plain C is much simpler because it hasn't changed much, and is very unlikely ever to do anything very surprising on any future platform - which cannot be said of raw LLVM.
And of course shelling out is a bit of a hassle, but hey; it's a well-trodden path on Unix. It's not the fastest, greatest interop in the world, but it's good enough for a lot of things.
(and wow- terra sounds impressive!)
I'll just say that my views come mainly from experience, specifically ECL (Embeddable Common Lisp, a CL implementation) and (this was further back, so my memory is fuzzy) a tool for generating executables from Perl scripts. I don't think I'm using an especially unusual setup, or unusual compilers, and I would guess that these tools probably target a very narrow subset of C. Despite this, my experience with these sorts of tools has been anything but "works out of the box". On the contrary, there appear to be a great number of degrees of freedom, even with standard-ish setups, that can trip up these tools. Because of the additional layers of abstraction, the error messages you get are very poor. Some header file is missing or in an unexpected place, or worse some generated code fails to compile. As an end-user, it's basically impossible to debug these in a reasonable way.
You can certainly have internal errors using LLVM, but in my experience fewer of them are platform-dependent. Therefore there is a greater chance that something that works for the developer will work for the user. Also, if error handling is done properly, a failure that does occur can often be mapped back to the original source program. This is much better as far as usability goes, since the user almost never wants to debug some compiler's generated code.
Yea, it's annoying. For PostgreSQL I've decided to focus on the C API wherever possible for exactly that reason. A bit more painful to write, but not even remotely as quickly moving. Obviously there are parts where that's not possible - but even there I've decided to localize that as much as possible.
[1]: https://www.khronos.org/registry/spir-v/specs/1.0/SPIRV.pdf
We just added LLVM based JIT to PostgreSQL. Don't think we have quite the same issues as JITing generic interpreted languages though, because the planner gives us much more information about the likely cost of executing a query. So the need for a super-fast baseline JIT isn't as big.
> But that doesn't mean you can't use a conventional compiler stack like LLVM as a JIT and get excellent code - it' just going to take its own sweet time doing so.
I think that's partially due to people using the expensive default pipeline when enabling optimization. A lot of those passes either don't make sense for the source language, or don't make sense for the first, foreground JIT pass.
The biggest issue I have with LLVM with respect to JITing is that its error handling isn't really good enough. It's fine to just fatal-error if you're in an AOT compiler world, but that's much less acceptable inside a database. There are moves to make at least parts of LLVM exception safe, but ...
PostgreSQL - although I doubt that's the sort of thing you had in mind!
After LLVM 3.4 or so with the forcible move to “MCJIT” (now ORCJIT maybe?) it suddenly got even more painful though. While the Module system in LLVM was always abused by the JIT, it was a sad day for many of us who instead pinned to 3.4 for a while. I haven’t followed up in a while to see how the newer JITs have progressed, but I believe the last-layer JIT for Safari uses LLVM as well.
tl;dr: for the right time versus execution speed trade-off, LLVM is still awesome.
Since you have some experience - do you think shelling out would have been much more painful?
Shelling out (which I’ve also done) is okay, but you never get to really teach the backend what you know. That is, no matter how hard you try, you can’t teach gcc, icc, or clang that you know it’s safe to just fetch this function pointer off a struct and that it’s stable. Writing a simple pass in LLVM though is incredibly straightforward. You can even do a simple inliner, that knows how to inline just the runtime callsites you care about.
Like the WebKit folks and the HHVM folks before them: dynamic languages have enough complexity that you often get most of the win from a “basic compilation” (compared to say C/C++) so after you’ve proven out what you need, you roll your own.
Shelling out though would be strictly worse than the LLVM in-memory approach, since it gets you no additional benefit (in some ways it’s harder, since you can’t just say “jump to this address”), you lose a lot of upside (custom passes, letting you tune optimizations and instruction selection beyond simply -O0, -O1, etc.), and then you get to require users to have a compiler on their box.
I’d personally look at nanojit or the other JIT libraries before shelling out to a regular compiler.
On top of that, I don't see what Google would have had an obligation to pay for.
Contradiction: Google broke some imaginary copyright by re-implementing APIs, but Google is bad because the re-implementation was not 100% equal to the original, causing fragmentation. Either the fragmentation was harmful, in which case the API copyright was the problem. Or the API copyright violation was the problem, in which case fragmentation was the explicit goal and Google's attempt to minimize it was the problem. Both can't be true at the same time outside lawyer la-la land.
1 - Google did not pay for Java licenses when it should have. Even Andy Rubin admits that in his emails.
2 - To this day Android is not Java SE compliant, thus creating fragmentation between Android Java and Java, just the kind Sun managed to prevent with J++
3 - Being a Java licensee, which Google avoided becoming and still isn't (many Java APIs are not yet available on Android), would have required Android to be fully Java SE compliant
So to conclude, Google tricked Sun and fragmented the Java eco-system.
They should pay and provide a 100% Java SE compliant implementation, or be honest about it and fully migrate to Kotlin, Dart or whatever they feel like.
Code doesn't even need to be "hot" to make it worth it. WebKit switches from the interpreter to cheap baseline compilation, without any speculative optimizations or type information, after only 6 calls of a function: https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/
This article literally describes how baseline JIT is worth it simply to remove the bytecode VM dispatch overhead.
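The tier-up mechanism itself is tiny. Here is a sketch, with an invented `TieredFunction` class, of counting calls and swapping in a "compiled" version once a threshold is reached. The threshold of 6 is borrowed from the article; the lambdas stand in for an interpreter and a baseline compiler and are purely illustrative.

```ruby
# Hypothetical wrapper: runs the slow tier until the call count reaches
# the threshold, then switches to whatever the compiler callback returns.
class TieredFunction
  attr_reader :tier, :interpreted_calls

  def initialize(threshold:, interpreter:, compiler:)
    @threshold = threshold
    @interpreter = interpreter
    @compiler = compiler
    @compiled = nil
    @interpreted_calls = 0
    @tier = :interpreter
  end

  def call(*args)
    return @compiled.call(*args) if @compiled

    result = @interpreter.call(*args)
    @interpreted_calls += 1
    if @interpreted_calls >= @threshold
      @compiled = @compiler.call # compile once, no profiling needed
      @tier = :baseline
    end
    result
  end
end

square = TieredFunction.new(
  threshold: 6,
  interpreter: ->(n) { n * n },        # stand-in for bytecode interpretation
  compiler: -> { ->(n) { n * n } }     # stand-in for baseline compilation
)

results = (1..8).map { |n| square.call(n) }
puts results.inspect, square.tier
```

Note that nothing here inspects types or collects a profile; the only input to the decision is a counter, which is why this tier is so cheap to enter.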
Relevant post here:
https://rfk.id.au/blog/entry/pypy-js-faster-than-cpython/
A simple JIT can get you to the point where you reliably outperform a bytecode interpreter for certain types of code. What takes a lot more engineering effort is reliably performing at least as fast as a bytecode VM for all types of code.
PyPy is a much more ambitious design, completely replacing CPython, and using an unusual JIT scheme of tracing the interpreter itself and trying to produce an interpreter optimized for particular traces of your code.
It was much harder for that approach to reach the same level of general performance than it seems to have been for CRuby & MJIT.