The cost of dynamic vs. static dispatch in C++

The cost of dynamic vs. static dispatch in C++ (eli.thegreenplace.net)

125 points by mekishizufu 12 years ago | 70 comments

pslam 12 years ago |

A big extra cost of virtual functions in the underlying CPU not mentioned in the article: they effectively create a branch target dependency on a pointer chase. Put another way:

1) The virtual function address lookup requires a load from an address which is itself loaded. If neither location is cached, this has the unavoidable latency of two uncached memory accesses. Even at best, this incurs two cached L1 accesses, which is about 8-16 cycles on modern architectures.

2) The function call itself is dependent on the final address loaded above. None of that can proceed until the branch address is known. If cached, all is good and the core correctly predicts execution of a large number of instructions. Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.

In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases you can't inline, so simple transformations which could eliminate the function call altogether can't happen.
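To make the two dependent loads concrete, here is a minimal sketch that emulates a virtual call by hand. It assumes a typical vtable-based implementation; the names (Object, VTable, counter_tick, call_tick) are hypothetical and not taken from the article.

```cpp
#include <cassert>

// Hand-rolled emulation of a vtable-based virtual call.
// All names here are hypothetical, chosen only for illustration.
struct Object;

struct VTable {
    int (*tick)(Object* self, int x);  // slot 0: the "virtual" method
};

struct Object {
    const VTable* vptr;  // stands in for the hidden vtable pointer
    int counter;
};

int counter_tick(Object* self, int x) {
    self->counter += x;
    return self->counter;
}

const VTable counter_vtable = {&counter_tick};

Object make_counter() {
    return Object{&counter_vtable, 0};
}

// Roughly what `obj->tick(x)` has to do, spelled out step by step.
int call_tick(Object* obj, int x) {
    assert(obj != nullptr && obj->vptr != nullptr);
    const VTable* vt = obj->vptr;        // load 1: vtable pointer from the object
    int (*fn)(Object*, int) = vt->tick;  // load 2: its address depends on load 1
    return fn(obj, x);                   // indirect call: target depends on load 2
}
```

Because the call target is only known after the second load, the compiler also cannot see which function body to inline at the call site, which is the larger cost mentioned above.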

nkurz 12 years ago | |

Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.

You seem very familiar with these issues, but this doesn't sound right to me. Maybe I'm not understanding your terminology, but don't all modern processors support speculative execution? All instructions (including dependent) are executed, but the results are held in the Reorder Buffer until the branch choice is confirmed. If this is still a large issue, why don't Eli's measurements show it to be?

pslam 12 years ago | | |

If the branch target is an address loaded from memory, and there is no cached result for the branch instruction, then there's no way it can predict which instruction to execute next. The target could be anywhere in valid memory.

The reason the measurements don't show it is the micro-benchmark will be predicting very well. In fact it's quite difficult to defeat prediction even for giant codebases, and you probably have bigger issues with L1 thrashing at that point. The more subtle problem is even with prediction, there's a (quite high) limit to the number of unretired speculated instructions. Again, a micro-benchmark won't show that up - you'd need a large function in the inner loop.

I'm making it sound like there's no cost to virtual functions in real applications, but it's there, usually measurable and every little adds up. If anything, I think a better reason to not simply spray "virtual" everywhere is it demonstrates that the author didn't understand the data structures they created.

MichaelGG 12 years ago | |

Can profile-guided optimization realise that a certain virtual function almost always resolves to a specific implementation and have a conditional check to inline or optimize when needed?

I'm not overly experienced with complicated OO systems, but sometimes it seems the OO is just an abstraction for convenience, but runtime will always take a particular path.

seanmcdirmid 12 years ago | | |

Note you've just described inline caching [1]. A big research topic in the 90s, I'm sure this is pretty much a non-issue these days.

[1] http://en.wikipedia.org/wiki/Inline_caching

froydnj 12 years ago | | |

Microsoft's PGO/LTCG implementation does just this. GCC can do something similar as well.
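As a rough illustration of the guarded call being discussed, here is a hand-written C++ sketch. It is only an assumption about the general shape of the transformation, not what any particular compiler emits; a real optimizer would more likely compare the vtable pointer or function address than use typeid. Shape, Square, Rect and area_guarded are hypothetical names.

```cpp
#include <cassert>
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

struct Square final : Shape {
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
    int side;
};

struct Rect final : Shape {
    Rect(int w_, int h_) : w(w_), h(h_) {}
    int area() const override { return w * h; }
    int w, h;
};

// Guarded call: test for the type the profile says is most common,
// and fall back to ordinary virtual dispatch otherwise.
int area_guarded(const Shape& s) {
    if (typeid(s) == typeid(Square)) {
        // Fast path: the qualified call is non-virtual, so it can be inlined.
        return static_cast<const Square&>(s).Square::area();
    }
    return s.area();  // slow path: regular virtual call
}

// Small usage example.
inline void area_guarded_demo() {
    Square sq(3);
    Rect r(2, 5);
    assert(area_guarded(sq) == 9);  // takes the fast path
    assert(area_guarded(r) == 10);  // takes the fallback
}
```

The check only pays off when the guessed type really dominates at that call site, which is the information profiling is meant to supply.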

adamtj 12 years ago | | |

My understanding is that good virtual machines basically do this sort of profiling and optimization at runtime and JIT compile specializations as necessary.

Does anybody know why JIT isn't done in classically AOT compilers? Is JIT overhead generally higher than cost savings of the optimizations?

Taniwha 12 years ago |

I worked on a serious x86 clone once - we took a lot of real-world traces and ran them through our various microarchitectures to see how they would fly. Dynamic C++ dispatch was interesting: normally you expect something like

   mov r1, n(bp) ; get vtable
   mov r2, n(r1) ; get method pointer
   call (r2)     ; call

that's a really bad pipe break: a double indirect load and a call - but branch prediction may be your friend ...

However, some of the code we saw (I think it came from a Borland compiler) did this instead:

   mov r1, n(bp) ; get vtable
   push n(r1)    ; push method pointer
   ret           ; "call" by returning to it

an extra memory write/read, but always caught in L1, and on the register-poor x86 it saves a register, right? ... but on most CPUs of the time you're screwed for the branch prediction - CPUs had a return cache, a cheap way to predict the branch target of a return - by doing a return without a call you've popped the return cache, leaving it in a bad state - EVERY return in an enclosing method is going to mispredict as well - the code will run, but slowly

mappu 12 years ago | |

I use the push/ret idiom all the time to stdcall off the stack... did not realise there was a return cache, that's very interesting.

Taniwha 12 years ago | | |

depends on the CPU - but it's a relatively trivial thing to build (especially because, unlike other caches, it's a stack). On x86s a return is nominally ALWAYS a bad pipe bubble: a pop followed by an indirect jump - the pop gets resolved at the end of its micro-op, and the jump wants to be resolved early on so as to start decoding the next instruction

In the end it can't hurt to generate a bad jump prediction off of the return cache, it's no worse than being idle - the effect of messing with the cache though can cause it to always fail so as a result you get no advantage from it

Taniwha 12 years ago | | |

(I should add - it's an x86, you're really register poor - sometimes you do have to do stuff like that - but if you have a register "mov reg, a;jmp (reg)" is better than "push a;ret")

alextingle 12 years ago |

    for (unsigned i = 0; i < N; ++i) {
      for (unsigned j = 0; j < i; ++j) {
        obj->tick(j);
      }
    }

I wouldn't go quite so far as to say that benchmarks with tight inner loops like this are completely useless, but they are nearly so.

The author is clearly aware that the real world of performance is much bigger & more complex than his simple Petri dish. Credit to him for mentioning that. It's also really refreshing to see him analysing the optimised assembly.

The trouble with this approach is that it's tempting to draw simple conclusions. In this case, you might be tempted to conclude "CRTP always faster than virtual dispatch", when the truth is likely to be much more situation dependent.

I have seen a biggish project go through a lot of effort to switch to CRTP, only to see a negligible performance impact.

eliben 12 years ago | |

And I have seen projects whose performance was crippled by layers upon layers of endless virtual calls. YMMV ;-)

army 12 years ago | | |

Agreed, for almost all code it doesn't matter, but for the remaining small fraction it's worth thinking about these things. It sounds pretty insane to go with a blanket approach of removing virtual calls throughout an entire codebase without understanding which ones are the problematic ones. Especially since some ways of solving the problem could potentially lead to other problems like increased compiled code size.

I've seen plenty of software (especially systems software) that does spend much of its time in tight inner loops. Pulling out all the optimization stops there can give measurable gains. I've personally seen measurable gains on real applications from tricks like reordering branches so that the more predictable branches go first.

kbutler 12 years ago |

"If anything doesn’t feel right, or just to make (3) more careful, use low-level counters to make sure that the amount of instructions executed and other such details makes sense given (2)."

This is explicit support for confirmation bias.

See Feynman's discussion of measuring the charge of the electron in Cargo Cult Science:

"Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that..."

http://neurotheory.columbia.edu/~ken/cargo_cult.html

nkurz 12 years ago | |

And as an alternative, would you suggest laboriously using low level counters to verify that every measurement you think is correct is indeed correct? Given finite resources, what's a better approach than concentrating on the apparent anomalous measurements? I'm not sure I see the parallel.

tezka 12 years ago | |

it was funny you felt the need to post your wisdom both here and under the actual post.

nly 12 years ago |

When you think you can use CRTP instead of virtual dispatch in your program, you didn't need virtual dispatch to begin with... you needed a generic algorithm to operate over your object classes. That's exactly what run_crtp() is, the CRTPInterface class is completely redundant except that it provides some degree of compile-time concept checking (which we'll hopefully get in C++17)

Virtual dispatch is useful for type erasure, when using abstract types from plugins, DLLs or generally "somebody else's code". IMHO, the valid use cases within a standalone program are actually fairly small.
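A minimal sketch of the "generic algorithm" alternative described here, assuming a class that simply exposes a tick method. Counter and run_generic are hypothetical names; the loop shape mirrors the benchmark loop quoted earlier in the thread.

```cpp
#include <cassert>

// A plain class: no base class, no virtual functions, no CRTP interface.
struct Counter {
    void tick(unsigned n) { total += n; }
    unsigned long long total = 0;
};

// Generic algorithm: works with any type that has a compatible tick().
// The call is resolved at compile time, so it can be inlined.
template <typename T>
void run_generic(T& obj, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < i; ++j) {
            obj.tick(j);
        }
    }
}

// Small usage example.
inline void run_generic_demo() {
    Counter c;
    run_generic(c, 4);     // ticks with j = 0; then 0,1; then 0,1,2
    assert(c.total == 4);  // 0 + (0+1) + (0+1+2)
}
```

The trade-off is that a type missing tick() is only reported when the template is instantiated, which is the compile-time checking role the CRTP interface was playing.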

jamesaguilar 12 years ago | |

Unit testing is my #1 use for virtual functions. "somebody else's code" a.k.a. standard ML modules is a distant second.
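For readers unfamiliar with this use, here is a small sketch of a virtual interface acting as a seam for a test double. Logger, RecordingLogger and checked_divide are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical interface that production code depends on.
struct Logger {
    virtual ~Logger() = default;
    virtual void log(const std::string& msg) = 0;
};

// Code under test: it only knows about the abstract interface.
int checked_divide(int a, int b, Logger& logger) {
    if (b == 0) {
        logger.log("division by zero");
        return 0;
    }
    return a / b;
}

// Test double: records messages instead of writing them anywhere.
struct RecordingLogger : Logger {
    void log(const std::string& msg) override { messages.push_back(msg); }
    std::vector<std::string> messages;
};

// A unit test written against the test double.
inline void test_checked_divide() {
    RecordingLogger logger;
    assert(checked_divide(6, 3, logger) == 2);
    assert(logger.messages.empty());
    assert(checked_divide(1, 0, logger) == 0);
    assert(logger.messages.size() == 1);
    assert(logger.messages[0] == "division by zero");
}
```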

berkut 12 years ago |

I've done benchmarks on this fairly recently, and with the functions actually doing a lot of work (ray intersection for a raytracer), I saw practically no difference between CRTP and Virtual Functions:

http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-function-...

And this was with billions of calls to the functions...

blt 12 years ago | |

Yes, the penalty is most glaring for calls that do a tiny amount of work. Imagine if

  String.charAt(int index)

was a virtual call inside of strlen().

gjm11 12 years ago |

So he found that dynamic dispatch was a lot more expensive. Fair enough and not very surprising. But let's quantify it a bit in absolute terms. The dynamic version of the code took 1.25s to run, during which time it performed approximately 8 x 10^8 virtual function calls. That translates to a cost per call of 1.5 nanoseconds.

From which my takeaway would be: In inner-loopy code for which an extra nanosecond or so per call is critical, you should avoid virtual function calls. For anything else, don't worry about it.

MichaelGG 12 years ago | |

1.5 nanoseconds per call in the best case. In some huge monstrosity where you've got to go chase down object headers not in the cache, things may be quite different.

tomp 12 years ago |

Instead of devirtualization, a simpler optimization, which would additionally also help in the dynamic case, is simple loop hoisting of the method pointer fetch. Instead of doing

    while(...) {
      (obj->vtable[0])(...);
    }

we could have

    void(*fn)(...) = obj->vtable[0];
    while(...) {
      fn(...);
    }

which would avoid two redirections per inner loop! Actually, I'm almost sure that is what LuaJIT would do, and many other high-level programming languages could perform this optimization as well. However, maybe C is too low-level to be able to do that, and I don't know about C++.
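The hoisting idea can be written out as compilable code if the vtable is emulated by hand. This is a sketch under that assumption, since standard C++ gives no portable way to extract the resolved address of a virtual function; Obj, TickFn, add_tick and the run_* functions are hypothetical names.

```cpp
#include <cassert>

struct Obj;
using TickFn = void (*)(Obj* self, unsigned x);

struct Obj {
    const TickFn* vtable;  // emulated vtable: a table of function pointers
    unsigned long long total;
};

void add_tick(Obj* self, unsigned x) { self->total += x; }

const TickFn add_vtable[] = {&add_tick};

// Lookup repeated on every iteration, as in the first snippet.
void run_unhoisted(Obj* obj, unsigned n) {
    for (unsigned j = 0; j < n; ++j) {
        (obj->vtable[0])(obj, j);
    }
}

// Lookup hoisted out of the loop, as in the second snippet.
// Only valid if nothing called in the loop changes obj->vtable.
void run_hoisted(Obj* obj, unsigned n) {
    TickFn fn = obj->vtable[0];
    for (unsigned j = 0; j < n; ++j) {
        fn(obj, j);
    }
}

// Both versions compute the same result.
inline void hoisting_demo() {
    Obj a{add_vtable, 0};
    Obj b{add_vtable, 0};
    run_unhoisted(&a, 5);
    run_hoisted(&b, 5);
    assert(a.total == 10 && b.total == 10);  // 0+1+2+3+4
}
```

The call through fn is still indirect, so this removes the repeated loads but does not by itself make the callee visible for inlining.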

eliben 12 years ago | |

That would save the indirection, but I hope the article shows that by far the biggest cost comes from the lack of inlining. The latter would not be solved by your function pointer.

jheriko 12 years ago | |

this is not about C... C++ certainly. C has no such thing as a virtual function or dynamic dispatch (unless you implement it yourself).

vinkelhake 12 years ago |

This is a nice article and props for including and dissecting generated assembler!

A key thing here is that inlining is what enables zero-cost abstractions in C++. A virtual call is slower than a regular call, but the main problem is that it builds a barrier that effectively stops inlining.

It'll be interesting to see how devirtualization in GCC will do for real world programs.

namuol 12 years ago |

Observation: the intricacies of our technologies are growing to such complexity that analysis of the things we once had a direct hand in the design of plays out much like the analysis of some sort of mysterious natural phenomenon.

jheriko 12 years ago |

it's interesting to see a break down of this - especially using modern compilers on the intel platform.

did you try the intel compiler? for raw low level optimisation it sometimes massively outperforms the ms, gcc or clang versions...

i'd imagine these problems are worse on ARM chips, and dynamic dispatch is even less effective there - certainly on PPC architectures I've seen much worse performance than on similarly powered Intels in precisely this situation. the caches are smaller and slower...

i'm not 100% sure but i think i've seen virtual calls 'devirtualised' by the MS compiler a couple of years ago... I might be thinking of something else though, it was a while back now. I was unpicking some CRTP mess in something that /was not performance critical in any way/...

pmjordan 12 years ago | |

You may be thinking of this: IIRC the standard recommends that compilers omit dynamic dispatch when the dynamic type is known at compile time - this essentially boils down to the case where a virtual method call follows creation of the object with 'new' or as an automatic variable. In my experience, this is commonly implemented correctly in compilers.

The other case where the dynamic type is known is in the constructor itself of course.
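The cases mentioned here can be illustrated with a short sketch. Whether a given compiler actually emits a direct call is an assumption that depends on the compiler and optimization level; the observable results below are what the language guarantees. Base, Derived, Tracked and TrackedChild are hypothetical names.

```cpp
#include <cassert>

struct Base {
    virtual ~Base() = default;
    virtual int value() const { return 1; }
};

struct Derived : Base {
    int value() const override { return 2; }
};

// Dynamic type is visible: the object is an automatic variable, so the
// compiler may bind the call to Derived::value directly and inline it.
int known_type() {
    Derived d;
    return d.value();
}

// Dynamic type is unknown here: in general this needs the vtable lookup.
int unknown_type(const Base& b) {
    return b.value();
}

// Inside a constructor the dynamic type is the class being constructed,
// so id() resolves to Tracked::id even while building a TrackedChild.
struct Tracked {
    Tracked() : seen_in_ctor(id()) {}
    virtual ~Tracked() = default;
    virtual int id() const { return 10; }
    int seen_in_ctor;
};

struct TrackedChild : Tracked {
    int id() const override { return 20; }
};

// Small usage example.
inline void devirtualization_demo() {
    assert(known_type() == 2);
    Derived d;
    assert(unknown_type(d) == 2);
    TrackedChild c;
    assert(c.seen_in_ctor == 10);  // constructor saw Tracked::id
    assert(c.id() == 20);          // after construction, the override is used
}
```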

eliasmacpherson 12 years ago | |

thanks again for your comment the other day about memcpy(), I am after finding a deep and rich seam of optimisation out of it!

cma 12 years ago |

I'd like to see a comparison of a dynamically linked function call vs a non-dynamically linked virtual call.

Dynamic linking has more indirection than you might expect because the function addresses can't always just be put at the call site during the library load (the places where you would want to write the address can be in code that is read-only mmapped to aid in sharing memory between processes and to avoid loading unused stuff from disk).

zwieback 12 years ago | |

In an ideal world the OS could still replace the call sites with straight calls to the loaded library, circumventing a jump table altogether. I don't remember what this is called, maybe something like a thunk, but I've seen it happen in the debugger where the first call causes a fault which rewrites the call site with the target address and subsequent calls are straight to the lib. This can work even if the chunk of code containing the call sites is shared and readonly, as long as the OS can override that.

pmjordan 12 years ago | |

The calls into jump tables are generally static, so the jump table itself can be prefetched. The jump table code is then a regular function pointer call, which is also monomorphic and so can be reliably predicted. I'd expect the impact to be small compared to a regular monomorphic function pointer call.

vicaya 12 years ago |

This _could_ be another case of premature optimization, as gcc 4.9+ could automagically devirtualize non-overridden virtual functions. icc could do that for years.

nkurz 12 years ago | |

That's not the way the phrase 'premature optimization' is usually used. Usually, it means spending time optimizing something that is not a limiting factor, or that otherwise will not make a difference in the final result. Keeping your code simple in the hope that eventually it will become fast is something else, probably falling closer to 'Sufficiently Smart Compiler': http://c2.com/cgi/wiki?SufficientlySmartCompiler

pmjordan 12 years ago | |

Presumably this needs to be done at link-time? (And you'd have to disable it if you're planning to load code dynamically)

simfoo 12 years ago |

I really like the "Mandatory precaution about benchmarks", it's spot on

rottyguy 12 years ago |

anything similar for higher level languages (C# or the like)?