The cost of dynamic vs. static dispatch in C++(eli.thegreenplace.net) |
The cost of dynamic vs. static dispatch in C++(eli.thegreenplace.net) |
1) The virtual function address lookup requires a load from an address which is itself loaded. If neither location is cached, this has the unavoidable latency of two uncached memory accesses. Even at best, this incurs two cached L1 accesses, which is about 8-16 cycles on modern architectures.
2) The function call itself is dependent on the final address loaded above. None of that can proceed until the branch address is known. If cached, all is good and the core correctly predicts execution of a large number of instructions. Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses access.
In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases you can't inline, so simple transformations which could eliminate the function call altogether can't happen.
You seem very familiar with these issues, but this doesn't sound right to me. Maybe I'm not understanding your terminology, but don't all modern processors support speculative execution? All instructions (including dependent) are executed, but the results are held in the Reorder Buffer until the branch choice is confirmed. If this is still a large issue, why don't Eli's measurements show it to be?
The reason the measurements don't show it is the micro-benchmark will be predicting very well. In fact it's quite difficult to defeat prediction even for giant codebases, and you probably have bigger issues with L1 thrashing at that point. The more subtle problem is even with prediction, there's a (quite high) limit to the number of unretired speculated instructions. Again, a micro-benchmark won't show that up - you'd need a large function in the inner loop.
I'm making it sound like there's no cost to virtual functions in real applications, but it's there, usually measurable and every little adds up. If anything, I think a better reason to not simply spray "virtual" everywhere is it demonstrates that the author didn't understand the data structures they created.
I'm not overly experienced with complicated OO systems, but sometimes it seems the OO is just an abstraction for convenience, but runtime will always take a particular path.
Does anybody know why JIT isn't done in classically AOT compilers? Is JIT overhead generally higher than cost savings of the optimizations?
mov r1, n(bp) ; get vtable
mov r2, n(r2) ; get method pointer
call (r2) ; call
that's a really bad pipe break a double indirect load and a call - but branch prediction may be your friend ...However some of the code we saw (I think it came from a Borland compiler)
mov r1, n(bp) ; get vtable
push n(r2) ; get method pointer
ret ; call
an extra memory write/read but always caught in L1 and on the register poor x86 it saves a register right> ... but on most CPUs of the time you're screwed for the branch prediction - CPUs had a return cache, a cheap way to predict the branch target of a return - by doing a return without a call you've popped the return cache leaving it in a bad state - EVERY return in an enclosing method is going to mispredict as well - the code will run, but slowlyIn the end it can't hurt to generate a bad jump prediction off of the return cache, it's no worse than being idle - the effect of messing with the cache though can cause it to always fail so as a result you get no advantage from it
for (unsigned i = 0; i < N; ++i) {
for (unsigned j = 0; j < i; ++j) {
obj->tick(j);
}
}
I wouldn't go quite so far as to say that benchmarks with tight inner loops like this are completely useless, but they are nearly so.The author is clearly aware that the real world of performance is much bigger & more complex than his simple Petri dish. Credit to him for mentioning that. It's also really refreshing to see him analysing the optimised assembly.
The trouble with this approach is that it's tempting to draw simple conclusions. In this case, you might be tempted to conclude "CRTP always faster than virtual dispatch", when the truth is likely to be much more situation dependent.
I have seen a biggish project go though a lot of effort to switch to CRTP, only to see a negligible performance impact.
I've seen plenty of software (especially systems software) that does spend much of it's time in tight inner loops. Pulling out all the optimization stops there can give measurable gains. I've personally seen measurable gains on real applications from tricks like reordering branches so that the more predictable branches go first.
This is explicit support for confirmation bias.
See Feynman's discussion of measuring the charge of the electron in Cargo Cult Science:
"Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that..."
Virtual dispatch is useful for type erasure, when using abstract types from plugins, DLLs or generally "somebody elses code". IMHO, the valid use cases within a standalone program are actually fairly small.
http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-function-...
And this was with billions of calls to the functions...
String.charAt(int index)
was a virtual call inside of strlen().From which my takeaway would be: In inner-loopy code for which an extra nanosecond or so per call is critical, you should avoid virtual function calls. For anything else, don't worry about it.
while(...) {
(obj->vtable[0])(...)
}
we could have void(*fn)(...) = obj->vtable[0]
while(...) {
fn(...)
}
which would avoid two redirections per inner loop! Actually, I'm almost sure that is what LuaJIT would do, and many other high-level programming languages could perform this optimization as well. However, maybe C is too low-level to be able to do that, and I don't know about C++.A key thing here is that inlining is what enables zero-cost abstractions in C++. A virtual call is slower than a regular call, but the main problem is that it builds a barrier that effectively stops inlining.
It'll be interesting to see how devirtualization in GCC will do for real world programs.
did you try the intel compiler? for raw low level optimisation it sometimes massively out performs the ms, gcc or clang versions...
i'd imagine these problems are worse on ARM chips, and dynamic dispatch is even less effective there - certainly on PPC architectures I've seen much worse performance than on similarly powered Intels in precisely this situation. the caches are less and slower...
i'm not 100% but i think i've seen virtual calls 'devirtualised' by the MS compiler a couple of years ago... I might be thinking of something else though, it was a while back now. I was unpicking some CRTP mess in something that /was not performance critical in anyway/...
The other case where the dynamic type is known is in the constructor itself of course.
Dynamic linking has more indirection than you might expect because the function addresses can't always just be put at the call site during the library load (the places where you would want to write the address can be in code that is read-only mmapped to aid in sharing memory between processes and to avoid loading unused stuff from disk).
The danger is that benchmarks like this encourage naïve programmers to use complex constructs as a matter of course, when simpler would usually be better. "Premature optimisation is the root of all evil" and all that...
Are there any open source projects amongst your examples?
In the end, it's just another tool which is the right one in particular circumstances, and the wrong one in all others.
http://www.azulsystems.com/blog/cliff/2010-04-08-inline-cach...
An old blog post by one of the CLR engineers[1] states:
"We don't inline across virtual calls. The reason for not doing this is that we don't know the final target of the call. We could potentially do better here (for example, if 99% of calls end up in the same target, you can generate code that does a check on the method table of the object the virtual call is going to execute on, if it's not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support, are not virtual, so we're not forced to be so aggressive about optimizing this case."
I guess things haven't changed. My testing with the CLR indicates that for best performance, you should make sure your IL is already inlined. The CLR does much better with huge function bodies.
1: http://pastebin.com/98c7Bt7f 2: http://blogs.msdn.com/b/davidnotario/archive/2004/11/01/2503...
Additionally, it's not always so easy to drop to a low-level language. If your architecture is enormous and complicated, it might be totally unfeasible to change languages for hot parts.
Regarding the size of the ROB, I was wondering about the size a while ago and found an interesting post from someone who measured it for modern Intel processors: Ivy Bridge (168), Sandy Bridge (168), Lynnfield (128), Northwood (126), Yorkfield (96), Palermo (72), and Coppermine (40). http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
I agree with you on the virtual part. I'm actually a C programmer more interested in how to implement efficient dispatch for interpreters. Eli (the original author) has some good posts on that as well: http://eli.thegreenplace.net/2012/07/12/computed-goto-for-ef...
So it's situational, but IME it's pretty predictable where you'll see this kind of optimization help. By all means profile and use whatever tools at hand to help you along, and don't apply the optimization blindly - but despite the whole "black art" label optimization sometimes gets this kind of thing really is pretty straightforward.
One (admittedly incomplete) answer is that AOT compilers try to replicate many of the wins that JIT compilers get from runtime specialization by including a profile-guided optimization pass instead, which specializes ahead of time, using data logged from what you hope is a representative example of runtime.
Good JIT compilers can do things like optimizing fast paths, discovering latent static classes in highly dynamic languages, etc. These kinds of optimizations can also be done AOT, if you have good profile data and suitable analysis & optimization passes.
The pros/cons of each approach are not entirely resolved, and you will find varying opinions. Part of the problem with making a direct comparison is that there are large infrastructural inconveniences with switching from one approach to the other. A good JIT is a quite pervasive beast, not something you can just tack on as a nice-to-have. PGO is somewhat infrastructurally easier to add to an existing AOT compiler. Therefore, if you can do most of what JIT does via PGO, you would prefer to do that, were you the maintainer of an existing AOT compiler. Whether you really can is afaik a bit of an open question.
And then the point is that most optimizations a JIT can do that an AOT cannot are particulaly important where the language semantics are "too" flexible. If your code has lots and lot of virtual calls; lots of exceptions with unpredictable code flow - well, sure, it's really important to elide that flexibility where it's not actually used. That's kind of like JS Vm's nowadays speculatively type their untyped objects - it's a huge win, and not possible statically.
But the point is - these optimizations are critical because the languages don't allow (or encourage) code to disable these dynamic features. In C++ this can be helpful; but how often is dynamic devirtualization really going to matter? I mean, you can statically devirtualize certain instances (e.g. whole-program optimization reveals only two implementations and replaces a virtual call with an if), but the real code-could by any subtype but actually isn't scenario just isn't one that comes up often.
The consequence is that C++ gets most of the benefits of a JIT simply because the JIT is solving problems C++ compilers don't need to solve. The cost is that the compiler wastes inordinate amounts of time compiling your entire program as optimally as it can, even though it only has a few hotspots.
But, I can't think of a way to get an unknown struct type into a generic function without boxing (which makes it no longer a struct type). So for an AOT compiler it shouldn't be an issue. And there's probably some clever way to emit generic code for structs, too.
But generics have their own implementation difficulties and Mono didn't support them for a while in AOT.
See [1] for an example where the CLR fully inlines a virtual call (through an interface, specifically)
The call is most definitely virtual (or dynamic if you prefer that term), not statically-dispatched. It just happens to be performed through an interface. I suspect the CLR optimizes this because interfaces are incredibly common (IEnumerable, etc.)
Do you have any actual examples of making an interface call that gets inlined? This post[1], dated 2004 (later than the MSDN article you referenced) from Eric Gunnerson says:
"all the compiler knows is that it has an IProcessor reference, which could be pointing to any instance of an type that implementes IProcessor. There is therefore no way for the JIT to inline this call - the fact that there is a level of indirection in interfaces prevents it. That's the source of the slowdown."
He goes on to say that Sun does do something since Java makes everything virtual, and the CLR could do it in theory, but doesn't.
I skimmed through the linked article you provided but didn't find any mention of inlining interface method calls. On the excellent performance of virtual/interface calls, it says:
"the combination of caching the virtual method and interface method dispatch mechanisms (the method table and interface map pointers and entries) and spectacularly provident branch prediction enables the processor to do an unrealistically effective job"
1: http://blogs.msdn.com/b/ericgu/archive/2004/03/19/92911.aspx...
Or are you saying the interface call is 5x slower than a virtual call? That definitely isn't right.
In the vast majority of cases where a virtual function call is actually monomorphic or bimorphic at runtime, the JVM JIT can observe that and potentially inline the method (with an if statement in the bimorphic case). It puts guards around the inlined method and deoptimizes in the event that a newly loaded class renders the optimization incorrect.
In my simple program doing a loop, calling an Add function on an interface, it is definitely making a function call each time. It unrolls 4 times, and loads the function pointer once per iteration - I'd have though it would only load it once overall. Loop is 89 bytes. There is no conditional inside the loop to check for the type.[1]
If I change it to not use the interface (don't cast to the interface type), it's unrolled and inlined. Loop is 34 bytes.[2]
It's the same on 32-bit, except there's no unrolling. The non-virtual loop body is 2 instructions (inc, add). The interface has a push, 3 movs and a call. The virtual one requires two extra movs (to load the function pointer - with an interface the address is embedded as a literal).
Shrug. Maybe it still doesn't work with value types? I started it without VS then broke in with the debugger to get the disassembly.
The loop is doing "y = x.Add(y, i)" where y is a local.
Edit: Aha! Using an interface method (not virtual) and strings, I was able to get inlining. I guess the CLR is still weak in dealing with value types.
1: Start of the loop using an interface:
lea r8d,[rdi-1]
mov rbx,qword ptr [FFEEFE60h]
mov edx,eax
mov rcx,rsi ; rsi is the object pointer
lea r11,[FFEEFE60h] ; I am embarrassed to admit I don't know what r11 is doing
call rbx ; just does lea eax[rdx+r8], ret
; similarly 3 more times then loop
2: Without using the interface, the loop body: lea eax,[r8-1] ; r8's the counter
add ecx,eax
lea edx,[rcx+r8]
lea ecx,[rdx+rax]
lea eax,[r8+2]
add ecx,eax
; then loop