How much does Rust's bounds checking cost?(blog.readyset.io) |
How much does Rust's bounds checking cost?(blog.readyset.io) |
If you're not in a HPC or heavily resource constrained context you can safely ignore the performance implications of choosing whatever programming language you like.
The benchmark went from 28.5ms to 32.9ms.
That as a percentage is 15% and is huge, it’s not noise.
The test is flawed in some way, the article is disappointing in that the author didn’t investigate further.
Benchmarking that needs special care, and planning for whatever it is you want to measure. A million trivial queries and a dozen very heavy queries are going to do significantly different things, and have different tradeoffs and performance characteristics.
go build -gcflags=-B
and see if it helps. Generally the assembly looks better, but it doesn't really run faster on a modern chip.Do your own test, and keep the results in mind next time somebody on Hacker News dismisses Go because of the "overwhelming cost of bounds checking".
That’s certainly one criticism I don’t remember ever seeing.
I agree completely that benchmarks need care, hence my point that the article is disappointing.
The author missed the chance to investigate why removing bounds checks seemed to regress performance by 15%, and instead wrote it off as “close enough.”
It would have been really interesting to find out why, even if it did end up being measurement noise.
"Around Easter 1961, a course on ALGOL 60 was offered … After the ALGOL course in Brighton, Roger Cook was driving me and my colleagues back to London when he suddenly asked, "Instead of designing a new language, why don't we just implement ALGOL60?" We all instantly agreed--in retrospect, a very lucky decision for me. But we knew we did not have the skill or experience at that time to implement the whole language, so I was commissioned to design a modest subset. In that design I adopted certain basic principles which I believe to be as valid today as they were then.
"(1) The first principle was security: The principle that every syntactically incorrect program should be rejected by the compiler and that every syntactically correct program should give a result or an error message that was predictable and comprehensible in terms of the source language program itself. Thus no core dumps should ever be necessary. It was logically impossible for any source language program to cause the computer to run wild, either at compile time or at run time. A consequence of this principle is that every occurrence of every subscript of every subscripted variable was on every occasion checked at run time against both the upper and the lower declared bounds of the array. Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to -- they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980, language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law."
-Tony Hoare, 1980 Turing Award Lecture (https://www.cs.fsu.edu/~engelen/courses/COP4610/hoare.pdf)
Here we are, 42 years later, and bounds checks are still not the default in some languages. Because performance, or something. And our computers are literally 1000x as fast as they were in 1980. So instead of paying 2% in bounds checks and getting a merge 980x faster, we get 2-3x more CVEs, costing the economy billions upon billions of dollars a year.
You can remove bounds checks when you can prove that the index won't ever get out of bounds; this is possible in many cases, such as iteration with known bounds.
This is the big one. You pay a 50% penalty for actual CPU bound, iteration heavy code with bounds checking enabled.
If the compiler can't do that by itself, a library should do it.
The real issue is whether the information about the true size of the memory region involved is available at the point where it is needed. This may come down to how good the language is at capturing desired semantics in a library. Rust still has a long way to go to catch up with C++ on this axis, and C++ is not waiting around.
Rust claims responsibility for enforcing safety in the compiler, with libraries using "unsafe" to delegate some of that to themselves. Users then trust the compiler and libraries to get it right. In C++, the compiler provides base semantics while libraries take up the whole responsibility for safety. Users can trust libraries similarly as in Rust, to similar effect.
Modern C++ code typically does no visible operations with pointers at all, and most often does not index directly in arrays, preferring range notation, as in Rust, achieving correctness by construction. A correct program is implicitly a safe program.
Bounds checking should be the default, and then only when someone has proved through benchmarking and profiling that it's actually a problem for their application, should they even consider turning it off.
If your conclusion is “no signal, just noise” boost the input until the signal becomes apparent. If that means writing such a massive loop that the program takes an hour to run, fine.
...which has evolved its own, much worse, horrors instead.
Before that, there was Java.
Do not want.
IIRC it was in the List.Add method, a very commonly used function in the C# core libs. First one programmer refactored it to very slightly reduce how many instructions were output when compiled. Then a second programmer working on the jit compiler optimizations which also affected this Add method making it a little smaller as well.
Alone, each change was hard to even measure, but seemed like they should be a net win at least in theory. Combined, the two changes made the Add method small enough to be an in-lining candidate! Which meant in real programs sometimes very measurable performance improvements result.
As others in this post have noted, a removed bounds check might also unblock vectorization optimizations in a few cases. One might be able to construct a test case where removing the check speeds thing up by a factor of 16!
This got improved in Rust 1.65 just this month, but the point stands.
edit: ARM64 compilation is even sillier. https://rust.godbolt.org/z/PEsbeGxWP
Right. The myth that bounds checking is expensive may have come from some terrible compilers in the early days. Berkeley Pascal was a notable example. Each bounds check was a subroutine call.
The common cases for bounds checks are:
- It's in an inner loop iterating over arrays. That's the case where the highest percentage of the time goes into the bounds check. It's also the case likely to be optimized out. This is the case people worry about.
- It's in code that doesn't do a lot of subscript operations. So it doesn't matter.
- It's in non-optimizable code that does a lot of subscript operations. That's unusual, but it does come up. An modern case might be Unreal Engine's Nanite meshes, which have lots of small offsets within the data stream. On the other hand, if you don't check that stuff, it's a great attack vector.
... that being said, I'd argue that the most beneficial memory-safety feature of Rust is about temporal things (i.e. prevents UAF etc) instead of spatial ones.
news_app/ranges_and_joins/cached
time: [28.583 ms 29.001 ms 29.526 ms]
thrpt: [277.45 Kelem/s 282.48 Kelem/s 286.61 Kelem/s]
news_app/ranges_and_joins/cached
time: [33.271 ms 33.836 ms 34.418 ms]
thrpt: [238.01 Kelem/s 242.11 Kelem/s 246.22 Kelem/s]
Given that 33.836/(1000/(242.11*1000)) ~= 8192, my understanding is the time reported here is how long it takes to do 8192 queries. Also it reports three metrics (should be min, median and max). All these means the benchmark harness did run the test for a lot of times and the 5 ms different is not random at all.This kind of bounds check are normally not ever violated (in well formed code) so branch prediction predicts them correctly nearly always.
It also is (normally) just jumping in the bad case, which means with a correct branch predictions thy can be really cheap.
And then cpu "magic" tends to be optimized for that kind of checks at they appear in a lot of languages (e.g. Java).
Then in many cases the compiler can eliminate the checks partially.
For example any many kinds of for-each element iterations the compiler can infer that the result of the conditionally loop continuation check implies the bounds check. Combine that with loop unrolling which can reduce the number of continuation checks and you might end up with even less.
Also bounds checks tend to be an emergency guard, so you tend to sometimes do checks yourself before indexing and the compiler can often use that to eliminate the bounds check.
And even if you ignore all optimizations it's (assuming in bounds) "just" at most one int/pointer cmp (cheap) followed by a conditional branch which doesn't branch (cheap).
The unchecked version is fully unrolled and vectorized using multiple registers. The checked version must use a loop.
Part of what's going on here is that panics are "recoverable." If the out-of-bounds write occurs at index 61, this will panic, but the writes to lower indexes must have gone through. This means the panic cannot be hoisted out of the loop.
This can make a big difference if you can hoist bounds checks out of an inner loop. You get the performance without adding any unsafe {}.
We've learned to accept that when you turn on optimizations, you lose some lines and variables from your debug info. This is a pretty similar trade-off.
I would think a Rust compiler could hoist the check outside of the loop at least sometimes (it might not want to do it if the loop made changes that are visible from outside the loop, such as changing a volatile variable or doing I/O)
You also do not really care about flushing the pipe on an out of bounds index, since very likely normal operations can not go on and you move over to handling/reporting the error, which likely has no need for significant throughput.
Also I would just like to note that safe arrays aren't a unique rust feature. Even writing your own in C++ is not hard.
These days, C++ really should be compiled with bounds-checked indexing and iterators by default. Unfortunately, this is still not a scenario that is well-supported by tooling.
https://learn.microsoft.com/en-us/cpp/standard-library/check...
The hard part is changing the mentality from whoever sits at the keyboard.
Note that you often only branch in the "bad" case, which means even on systems without branch prediction it tends to be not very expensive, and compilers can also eliminate a lot of bounds checks.
So it matters whether the code generator produces dead branches that can be retired cheaply. Probably, optimizers take this into account for built-in operations, but they know less about the happy path in libraries.
This is a motivation for the "likely" annotations compilers support. The likely path can then be made the one where the branch is not taken. Code on the unhappy path can be stuck off in some other cache line, or even another MMU page, never fetched in normal operation.
The cost seen here is likely from something else, though. Keeping array size in a register costs register pressure, or comparing to a stack word uses up cache bandwidth. Doing the comparison burns an ALU unit, and propagating the result to a branch instruction via the status register constrains instruction order.
Even those might not be at fault, because they might not add any extra cycles. Modern processors spend most of their time waiting for words from memory: just a few cycles for L1 cache, many more for L2 or L3, an eternity for actual RAM. They can get a fair bit done when everything fits in registers and L1 cache, and loops fit in the micro-op cache. Blow any of those, and performance goes to hell. So depending how close your code is to such an edge, extra operations might have zero effect, or might tank you.
Results of measurements don't generalize. Change something that looks like it ought to make no difference, and your performance goes up or down by 25%. In that sense, the 10% seen here is noise just because it is hard to know what might earn or cost you 10%.
Also branch paths which lead guaranteed to an panic tend to be treated as "unlikely" but not sure how far this is guaranteed.
1. if (x >= 0) && (x < arr_len(arr))
2. get element from array index x
3. else
4. throw exception
5. do more stuff
The compiler deduces that at line 5 0 <= x < arr_len(arr). From that it can deduce that abs(x) is a no op, that 2*x won't overflow (cause arrays can only have 2^32 elements), etc. Without bounds checking the compiler emits: 1. get element from array index x
2. do more stuff
So the compiler doesn't know anything about x, which is bad. The solution which apparently is not implemented in Rust (or LLVM, idk) is to emit code like the following: 1. assert that 0 <= x < arr_len(arr)
2. get element from array index x
3. do more stuff 1. if (x >= 0) && (x < arr_len(arr))
2. get element from array index x
3. else
4. core::hint::unreachable_unchecked
5. do more stuff
Where unreachable_unchecked transmits precisely such information to the optimizer: https://doc.rust-lang.org/stable/std/hint/fn.unreachable_unc...The solution to what? You appear to be suggesting that compilers should insert bounds checks even when the programmer doesn't ask for them, which, sure, I'm down for that, but the whole point of this discussion is that the bounds checking that currently gets done has a negligible cost in practice.
like, why bother? CPUs in next 2 years will win that performance anyway
and your software will be safer
Even very good static analysis tools have a hard time doing this. In a language like C++ this would effectively mean that very few index operations can be done naively and compile times are significantly increased. Performance is likely reduced as well over the trivial alternative of using a safe array.
https://github.com/google/wuffs#what-does-compile-time-check...
and using a range check manually before an index will normally optimize the internal bounds check away
A check performed in a library, or a condition ensured in a library, is wholly as good as the same work done in the compiler. Compilers are not magic, they are just programs.
One issue on that front is a question of reliability / consistency: on a small benchmark chances are the compiler will always trigger to its full potential because there’s relatively little code, codegen could be dodgier in a context where code is more complicated or larger.
Then again the impact of the bounds check would also likely be lower on the non-trivial code (on the other hand there are also threshold effects, like branch predictor slots, icache sizes, …).
How about you do something about that? Pin it to a single core? Run it a few thousand times?
Or maybe the unsafe access acts like volatile in C and disables any optimization/reordering because the compiler thinks it’s accessing a register.
Secondly, even assuming you want runtime bounds checking everywhere, then this is still a useful analysis because if you learn that bounds-checking has no relevant overhead - great! No need to look at that if you need to optimize. But if you learn that it _does_ have an overhead, then you have the knowledge to guide your next choices - is it enough to be worth spending any attention on? If you want the safety, perhaps there's specific code paths you can restructure to make it easier for the compiler to elide the checks, or the branch predictor to make em smaller? Perhaps you can do fewer indexing operations altogether? Or perhaps there's some very specific small hot-path you feel you can make an exception for; use bounds-checking 99% of the time, but not in that spot? All of these avenues are only worth even exploring if there's anything to gain here in the first place.
And then there's the simple fact that having a good intuition for where machines spend their time makes it easier to write performant code right off the bat, and it makes it easier to guess where to look first when you're trying to eek out better perf.
Even if you like or even need a technique like bounds checking, knowing the typical overheads can be useful.
I think the most preferable solution (although not always possible) would be to use iterators as much as possible. This would allow rustc to "know" the entire range of possible indexes used at runtime, which makes runtime bounds checking redundant.
Some old benchmarks here: https://parallel-rust-cpp.github.io/v0.html#rustc
But, occasionally, you have some loop that can't be done with an iterator, AND its part of a tight loop where removing a single conditional jump matters to you. When that matters it is a real easy thing to use an unsafe block to index into the array without the check. The good news is then at least in your 1 million line program, the unsafe parts are only a few lines that you are responsible for being sure are correct.
Bound checks prevent some optimizations, since they're a branch with a significant side effect that the compiler must preserve.
And there’s folks who do exactly that.
Because it is faster. Worst case you are triggering a branch miss, which is quite expensive.
>It's like talking about how much faster a car would go if it didn't have to carry the extra weight of brakes.
So? Every wasted CPU cycle costs money and energy. Especially for very high performance applications these costs can be very high. Not every car needs brakes, if it doesn't need to stop by itself and crashing hurts nobody they are just waste.
Because of this you might find some rust code which opts out of bounds check for such code using unsafe code.
But this code tends to be fairly limited and often encapsulated into libraries.
So I agree that for most code doing so is just plain stupid. In turn I believe doing it on a program or compilation unit level (instead of a case by case basis) is (nearly) always a bad idea.
Other then that, I doubt that any reasonably pragmatic and experienced C programmer will ever argue against runtime bounds checking from a performance point of view. Even in hot loops one can usually move the bounds checking to a place outside the loop.
If you care about security of processing outside input the are many other options (or you can use sanitizers or safe practices in C). For significant part of the C programming world it's just not important though.
better algorithms, better data structures, multi-threading, branchless programming (except safety), data-oriented design
and then elimination of checks, not first.
Running this with 1.65 on an Intel 12400 gets a nearly 4x speedup when bounds checking is not needed. Just wow.
Bounds checking avoidance is important when it becomes a significant chunk of your hot-path.
Although, if one is building their own libstd anyway, then I believe compiling with `--build-std-features panic_immediate_abort` should also have the same effect.
The purpose of "volatile" is to mark MMIO so that the memory reads and writes don't get optimised out because they actually perform I/O. Everywhere you see volatile abused to do something else (yes including Unix signal handlers) that's because C has a hammer and so now everything looks like a nail to C programmers. In a few cases this abuse is guaranteed to work (and results in a lot of heavy lifting elsewhere to achieve that) in most cases it's just luck.
Rust has generic intrinsics which perform memory read/write that won't get optimised out, which is the actual thing C needed when volatile was invented.
I think the issue is more the side effects than the panic message. I have tried making a side effect free loop like for i in 0..44 { v[i]; } but it compiled down to a "if array length is larger than limit X, then call panic_bounds_check". On the other hand, if you replace the v[i] with soon-stable black_box(v[i]), you see that the loop remains. It doesn't know what black_box is doing so it has to run the code. The optimizer is in this case very happy if you have a check before the loop.
The tests aren't bunk. There are a variety of reasons why the assertions generated by array index checks can be useful for LLVM, and there is also a fair amount of noise in any end to end test like this. The main point is that it clearly isn't a primary bottleneck (which should be pretty obvious in a test that takes 30 ms and performs under 2000 bounds checks).
In C++, I kind of always strived to use the higher level abstractions, with support for bounds checking. First the compiler provided frameworks from pre-C++98, then configuring builds so that STL types also bounds check in release.
In C is harder, but a mix of asserts, using TU as if they were modules exposing ADTs (Modula-2/Pascal units style), can also be a way to help mitigate issues. Kudos to the Code Complete book for some of the ideas[0].
However, trying to sell this experience always feels like quixotic in those environments, so kudos for actually doing it.
[0] - "Code Complete. A Practical Handbook of Software Construction"
I wonder if there's some pass that's missing or not done at the lower opt level.
In my experience with them, the performance hit is far more substantial that a few percent, except in situations where the compiler can elide the checks altogether. For example, simply iterating over std::vector with _ITERATOR_DEBUG_LEVEL=1 is twice as slow if you work with iterators explicitly instead of writing it as a range-for.
And I'm not sure if it can be substantially better, given that C++ as designed simply needs to do more checks to ensure validity - to catch cases like comparing iterators belonging to different containers, or iterators getting invalidated when containers get resized or when the corresponding element is deleted outright. This all can't be done with simple ranges or slices, which is why VC checked iterators maintain a reference to the parent container.
The performance hit from bounds checking was never an issue for most applications I have done in C++.
Why C++? WinDev likes it a lot, alongside COM, and not everything is exposed to .NET.
(OTOH for cases where you would notice, given the perf penalty, you might as well just write it in C# then.)
And that's the rub, isn't it? Getting that last bit of perf in the inner hot loop has traditionally often required people to write the entire thing in languages that are unsafe (ffi overhead often being enough that you can't just wrap up the hot loop and call the bit that needs to be performant from elsewhere).
I wonder how much good interop could get us.
Big problem as I see it is compiler writers using C++ with it's completely broken language and compilation model. Of course they think speed is the only important metric because C++ compilers compiling C++ compilers is very very slow. And they're dealing with trusted data 100% of the time and not the guys desperately trying to patch a zero day vulnerability at 2am.
Edit:
Based on the benchmark code linked to in dahfizz 's comment, it seems that Rust does the above for vectors, but there are situations where the bounds can't be proved, so boundary checking can't be removed. How common is this case in practice?
We trust the library author to get it right, despite (in Rust) wrapping accesses in "unsafe" or (in C++) not. Compilers are not particularly better at everything than library authors.
Some compilers have pretty sophisticated analyses aimed at just that: determining affine relations to statically bound indexed accesses. Failing that, some compilers will resort to loop versioning, generating two versions of the loop and then partitioning the iteration space into the definitely-in-bounds range from possibly-out-of-bounds range, then selecting which portions of which loop to execute by prefixing both with dynamic checks. Couple all of that with peeling and unrolling, and bounds checks start disappearing and getting amortized away.
The norms there, from what I gather, are that you compile with runtime checks enabled unless you've used the SPARK prover tools to verify the absence of runtime errors, in which case you can safely disable runtime checks in your builds.
[0] https://docs.adacore.com/spark2014-docs/html/ug/en/usage_sce...
That said, this is unavoidable in C/C++ too.
In C/C++, the array can change. It might be moved, de-allocated or resized in the current or a synchronous thread. So the pointer that iterates until it is equal to the end pointer, might point to invalid data if the size, location or existence of the vector changes.
for i in 0..v.len() {
println!("{:?}", v[i]);
}
you check the length at the top (the `v.len()`) and also for each `v[i]`. The first is unavoidable, but the second can be skipped when using an iterator instead, because it can be statically guaranteed that, even if you don't know at compile time what concretely the length is, whatever it ends up being, the index will never exceed it. Rust specifically differs from C++ in this respect, because nothing in that language prevents the underlying vector's length from changing while the iterator exists, so without per-access bounds checks it's still possible for an iterator to walk past the end.Yes, some architectures (including x86) have instructions that hint to the branch predictor, but I think they still end up influencing branch predictor state.
There are four versions of a conditional branch: to be predicted, almost always taken, almost never taken, and too random to waste a predictor on. Compilers nowadays have "likely" and "unlikely" intrinsics, but offer, yet, none for the last. I think x86 now ignores its "likelihood" branch prefix because back when introduced it was too often wrong; and now compilers don't emit it because it they know it is ignored.
The "too random to predict" would be good for sorting, where it could provide a 2x speedup. To get the effect today you need to use conditional-move instructions, which have become unfashionable. Getting your compiler to emit a cmov is tricky.
Intel added a special instruction (maybe in skylake?) to use for spinning on an atomic flag update, that just halts until the cache line being watched gets clobbered by a message from some other cache. Compilers don't emit it, unfortunately, even when spinning on what they know is an atomic flag.
There are zillions of such instructions that were good ideas but were never taken up by programmers who they were meant for.
Wouldn’t vectors have the same advantage in Rust? If we’re iterating over a vector, it’s proveable at compile time that the length is not being modified during the iteration.
let arr = [1, 2, 3, 4];
// Will not compile a bound check
let x = arr [3];
let v = vec! [1, 2, 3, 4];
// Imagine other code separates these two lines
// Might compile a bound check
let x = v [3]; fn foo_arr(v: &[u8; 5]) -> u8 {
v[2]
}
fn foo_vec(v: &Vec<u8>) -> u8 {
v[2]
}
In the foo_arr case, the index lookup can be optimized out. In the foo_vec case, it can't be, because theoretically you might pass something with only 2 elements or less to foo_vec, and access to the third element will fail.Same goes for e.g. the slice::windows vs slice::array_windows functions. In windows, you get a slice in the closure, and while the size is guaranteed by the implementation, without there being inlining the optimizer doesn't know about this guarantee. With array_windows, this size guarantee is communicated.
What catch up does Rust need to do?
Rust has slice that know the size of its data built in the language, while C++ doesn't. And Rust has stricter const and mutability rules that facilitates optimizations.
As for the implementation, Rust use LLVM which is also the backend used by one of the popular C++ compiler.
LLVM sometimes does this, but when it doesn't, you may insert asserts to guide the optimizer, as explained here https://news.ycombinator.com/item?id=33808853
I think this technique works in C and C++ too (if you use clang or gcc)
Most of the stuff I see at work, is quite far from this ideal reality, starting with Android's codebase, or the various ways C++ gets used in Microsoft frameworks.
let mut it = vec.iter();
println!(it.next());
println!(it.next());
println!(it.next());
This needs to do bounds checking on each call to next() to either return Some(a) or None (assuming the length of vec is unknown at compile time). (hhttps://doc.rust-lang.org/beta/src/core/slice/iter/macros.rs....)You are right that theoretically a range-based for loop that uses iterators does not need to do bounds checking because a compiler can infer the invariant that the iterator is always valid. In practice I don't know enough about LLVM or rustc to know whether this optimization is actually happening.
The check that can (and is) elided by the iterator here is the one that would be in the body of the loop, when you try to use i to index into the array. The nice thing about it is that it doesn't require the compiler to be smart at all.
Relevant part: > for-in-loops, or to be more precise, iterator loops, are a simple syntactic sugar over a common practice within Rust, which is to loop over anything that implements IntoIterator until the iterator returned by .into_iter() returns None (or the loop body uses break).
(The other uses of the 'for' keyword it refers to are unrelated to loops)
The library author has certain knowledge of what the library is meant to achieve, where the compiler is obliged to guess according to whatever tea leaves it can find to descry.
In particular, the library author knows that the container won't be changing size over the duration of the loop, something the compiler would have difficulty proving.
A prediction slot is where statistics on past behavior of a branch at a particular address accumulate. If no past behavior is needed to predict whether the branch will be taken, the predictor needs no statistics to decide, and statistics on some other branch instruction may accumulate in that slot instead.
If my understanding is correct, slots aren't really "taken up", but rather avoiding the branch predictor reduces the probability of hash collisions for other branches.
I also have a question for the crowd: for indirect branch target prediction, are the BTB slots actually tagged with the address? If you have BTB miss, do you pre-emptively stall the pipeline? It's more power-efficient to do so, but maybe it's better to avoid tags and tag-checking and spend that transistor budget on more BTB slots, and just go ahead and mispredict if you get a collision on BTB slots.
You don't stall on branches you have no history of, but apply heuristics that don't depend on history. E.g., backward branches are usually taken, forward branches less so. I think newer chips can try both alternatives, to some degree, although this cannot be carried far as it blows up exponentially.
Using only information resident in the architectural registers and the decoded instruction stream, there's no heuristic that can accurately tell you where printf (or this->do_my_thing()) lives in the address space. If the address for printf isn't in the BTB, do you just stall the pipeline while loading the proper GOT entry? I thought that in general, they just hashed the instruction address to get the BTB entry and blindly assumed that BTB entry would be the target address.. (and of course, issue the GOT read and check the assumption as soon as the real GOT entry is read). The alternative would be to stall the whole pipeline until the GOT could be read anyway, so might as well do something... unless you're really concerned about power consumption. Even in extremely power limited situations, assuming hash collisions in the BTB are rare might be less power-hungry than checking the source address in the BTB entry on each and every indirect branch.
Storing both the source and destination addresses in the BTB entry means twice as much space used, and extra latency and power usage in checking the source address. Does anyone here have good knowledge of when it's worth it to store both the source and destination address in the branch target buffer, instead of just storing the target address and presuming hash collisions are rare?
On another side note: has anyone heard of an architecture that supports IP-relative indirect function calls? Such an addressing mode would allow making a function call through the GOT entry without having to load the PLT entry into the instruction cache. I'm guessing that since there's a fair overhead for sharded library calls, the most performance-sensitive code already avoids dynamic library calls, so there's not much to be gained in exchange for the added complexity of the extra call addressing mode.
Attention seems to be reserved for events that occur in the sorts of tight loops that affect benchmark results. Thus, the first encounter with any branch may stall indefinitely long without (market) penalty.
Game coders have learned to batch their indirect calls according to actual destination, to throw the branch predictor a bone.
What's special about libraries? Every programmer has such knowledge, and every programmer writes buggy code.
Library authors know things about what their code is meant to be doing that compilers cannot deduce, so cannot act on. But the library author can. A library, according to how heavily it is used, benefits from more thorough testing than generic application code gets.
> Library authors know things about what their code is meant to be doing that compilers cannot deduce, so cannot act on. But the library author can.
I don't see your point here.
> A library, according to how heavily it is used, benefits from more thorough testing than generic application code gets
This doesn't generalise. There's plenty of very widely used application-specific code, and there's plenty of little used library code. Also, widespread use does not imply a high level of scrutiny, even if we're talking only about Free and Open Source software.
Anyway, that's all a sidetrack. The benefits of memory-safe languages aren't up for debate, even for well-scrutinised codebases. We continue to suffer a stream of serious security vulnerabilities arising from memory-safety issues in code written in unsafe languages. The go-to example is Chromium, where 70% of serious security issues are due to memory safety. [0]
[0] https://www.chromium.org/Home/chromium-security/memory-safet...
But the topic was not CVEs. It was optimization. An optimization opportunity that would be missed by a compiler can be explicitly spelled out for it by a library author. Whether the library author is obliged by the compiler to write "unsafe" next to it has exactly zero effect on the correctness of the code written in that place: you can easily write incorrect code there. If you don't, it was not because the compiler provided any help.
Not so. C++ is not a safe language, and never will be. Even if you avoid raw pointers you aren't safe from use-after-free bugs as the standard library makes them possible even then, with std::string_view (and perhaps other functionality). [0][1][2]
There is no safe subset of C++. People have tried, e.g. the MISRA folks, but they're unable to find a subset of C++ which is both safe and usable. The only way to guarantee the absence of undefined behaviour in a C++ codebase (or a C codebase) is to use formal analysis tools, which are a tremendous burden.
If it were possible to get decent safety guarantees out of C++, Mozilla wouldn't have bothered inventing a whole new language in the hope of improving Firefox.
I do agree though that modern C++ code is likely to have fewer memory-safety issues than 'C-style' C++ code.
> OpenSSL is old C code, so of even less relevance here.
It isn't irrelevant, our conversation wasn't specifically about the C++ language. I was responding to your suggestion that using well-known libraries written in unsafe languages, is a reliable way to avoid memory-safety issues. We know this isn't the case.
> An optimization opportunity that would be missed by a compiler can be explicitly spelled out for it by a library author.
Sure, but this whole thread is discussing that bounds checks are in practice generally inexpensive on modern hardware, except in cases like SIMD optimisations being precluded by the need for checks. I suspect this extends to other runtime safety checks too, but I don't have hard numbers to hand.
> An optimization opportunity that would be missed by a compiler can be explicitly spelled out for it by a library author.
Sure, that's an advantage of low-level languages. It doesn't negate the importance of memory-safety though.
Runtime checks are unlikely ever to have zero performance cost, sure, but the cost can be close to zero, and the fallout of removing checks from buggy code can be considerable.
> Whether the library author is obliged by the compiler to write "unsafe" next to it has exactly zero effect on the correctness of the code written in that place: you can easily write incorrect code there.
If it were a simple boolean matter of correct vs incorrect, then sure, but it often isn't. In practice, it can mean the difference between an exception being thrown, and undefined behaviour running riot, possibly leading to serious security issues.
> If you don't, it was not because the compiler provided any help.
Runtime checks are very helpful during development.
[0] https://alexgaynor.net/2019/apr/21/modern-c++-wont-save-us/
But the actual topic was not not that. The actual topic you have tried to steer away from is optimization. The point I made was that the author of a library can take up responsibilities that some people insist only the the language, via the compiler, can perform. The library author can perform optimizations the compiler fails to, and the library author can define interfaces that can only be used correctly, and safely. To the programmer using a library, it makes no difference, except that they may be unable to use some new, immature language, but can easily pick up and use a good library.