Rust is now overall faster than C in benchmarks(benchmarksgame-team.pages.debian.net) |
Rust is now overall faster than C in benchmarks(benchmarksgame-team.pages.debian.net) |
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Why is julia using 250x the memory? (and it`s still fast)
If I wrap things in a function and run it a second time in the repl, this is what I get:
$ JULIA_LLVM_ARGS="-unroll-threshold=500" ~/julia-b00e9f0bac/bin/julia -O3 --check-bounds=no
julia> include("nbody.jl")
run (generic function with 1 method)
julia> @time run(50000000)
-0.169075164
-0.169059907
2.044478 seconds (294.25 k allocations: 12.599 MiB, 8.89% compilation time)
julia> @time run(50000000)
-0.169075164
-0.169059907
1.867936 seconds (12 allocations: 1.750 KiB)
So, the second time there's no compilation overhead, just like in Rust, C, C++.If you are willing to spend the time to perform at the highest level, Rust can bring you there.
I don't get those. What's the difference?
Basically, it's toy programs written in each of these languages. They measure how fast each one is executed. This methodology does have it's limitations, which that page is upfront about.
Also is there also a way to filter out posts with the word "Rust" in it so I don't see them?
You are correct that the C feature is often banned. Rust doesn’t even support it at all.
There are finite limits to the complexity cost developers are willing to pay for performance. Because the cost in each of these three languages is different to express some thing, sometimes there will be a threshold where in one or more of these languages most developers will choose a less optimal design. This manifests as practical performance differences in real software even though in theory they are equally expressive with enough effort.
This comes with another tradeoff. Efficient expressiveness relative to software performance comes at a cost of language complexity. C is a simple language that has enough efficient expressiveness for simple software architectures. C++ is at the extreme opposite; you can do mind-boggling magic with its metaprogramming facilities that can express almost anything optimally but god help you if you are trying to learn how to do this yourself. Rust sits in the middle; much more capable than C, not as expressive as C++.
This suggests the appropriate language is partly a function of the investment a developer is willing to make in learning a language and what they need to do with it. Today, I use modern C++ even though it is a very (unnecessarily) complex language and write very complex software. Once you pay the steep price of learning C++ well, you acutely feel the limitations of what other languages can't express easily when it comes to high performance software design.
I used to write a lot of C. It still has critical niches but not for high-performance code in 2021, giving up far too much expressiveness. C++17/20 is incredibly powerful but few developers really learn how to wield that power, though usage of it has been growing rapidly in industry as scale and efficiency have become more important. Rust is in many ways an heir apparent to C and/or Java, for different reasons. Or at least, that is how I loosely categorize them in my head, having a moderate amount of contact with all three. They all have use cases where you probably wouldn't want to use the others.
Yep. Rust's safety means that in some cases, you can be more aggressive because you know the compiler has your back. And in other cases, it makes harder things tractable. The example of Stylo is instructive; Mozilla tried multiple times to pull the architecture off in C++, but couldn't manage to do it until Rust.
Heh, well. The downvotes are clearly trying to tell me something, though I'm not sure what. I can guess. I assumed that this was something widely agreed upon at this point but clearly I assumed incorrectly.
I think ATS is in that category too but it was removed from the benchmarks game at some point. It's also vastly more esoteric and complicated than Rust is from my limited knowledge. Perhaps someday Zig will get there as well. I wouldn't have expected that this niche would ever see so much exploration, as C and C++ were the only game in town for so long.
(*) of course part of the output is also looking at the Rust code and evaluating style and `unsafe`-wise what the price for winning was.
I work on database engines and the impact of language choice on the design, architecture, and implementation of database engines is very evident. In the specific case of database engines, these differences in design can have a very large performance impact.
I doubt that ATS is more complicated than Rust at on-par feature comparison. It gets more complex only when you reach for a larger set of static constraints you can express in ATS, that are unavailable in Rust: things like dependent types and complete formal proofs that you don't need a separate language for. It feels esoteric only because those features in such low level languages are extremely rare. Also, consider the amount of effort you need to spend on learning a new toolchain (Rust), whereas ATS seems like a plugin/extra compilation step on top of your regular and familiar C toolchain.
Rust avoids those from the start by making slices idiomatic.
Another thing I commonly see is the usage of suboptimal containers (like arrays with linear search) - just because it’s there’s no better alternative at hand (standard library doesn’t offer ones and dependency management is messy). Which also makes it less surprising that code in higher level languages might perform better.
The reason that C programs often don’t perform as well as an equivalent rust program[1] is that it’s so incredibly hard to do anything at all in C (especially something reliable) that one can usually only do the simplest thing possible and this typically means simple data structures, simple algorithms and arrays
[1] suppose the programs are independently created at the same time by programmers of equal ability and are idiomatic in their languages. If a C program is rewritten in rust, it is likely to be faster, but that would likely also be true if it were rewritten in C.
Brian Cantrill talks[1] about exactly this: in his C version he was using AVL trees because they are easy to implement, while his Rust version was using B-trees just because he could. And as a result, his naive first attempt in Rust outperformed his long-optimized C code.
Reliability is very often the name of the game with C, and part of the reason you might see it written in such a simplistic fashion. In embedded systems, we often follow very strict code convention that has strict requirements on how a C program is to be written. This includes everything from how memory is to be allocated, the maximum number of local variables, the maximum number of the arguments, various limits on a function, constraints to handle errors, etc. We do this because it minimizes potential hazards especially when working with limited memory and system resources.
C is an unforgiving language in that mistakes can occur silently and on mission critical hardware these mistakes can cost more than just your project milestones. These programs are very specialized and often have complex algorithms associated with them. Most of the systems I've worked with include various kinds of feedback control systems, and accompanying algorithms. If you're familiar with control systems, you'll know these algorithms are certainly not trivial by any means.
The challenge with C is more as a developer your C code needs to be perfect. Bugs just aren't an option like they are in other environments. Once a specialized piece of hardware ships it needs to work as intended under a myriad of conditions without powering off for the next 30 years (not always the case but more often than you might think). Sometimes this kind of software is going to be put a position it may have never been designed for. The best we can do is try to add various check/support mechanisms both in software and in hardware, and maintain as safe and hazard free software as possible.
In terms of performance, again coming from an embedded environment, there is almost always a spec we are aiming for. It needs to do X tasks in N time for example. Of course high level design questions like whether to poll or wait for interrupt, or how to broker data from shared resources is decided long before you get to the question of how should I, or if I even should, compare two strings. It is nevertheless advised to go with a trivial solution if the ends satisfy the needs.
Is C time-consuming to write? Is C a "hard" language? These are subjective and depend on the nature of the project. I would certainly never use C with only libc to write a webserver that's going to be serving SaaS Co's backend API.
I assume most C++ programs heavily use more of these advanced data structures, correct?
In many programs in many domains the sizes of these data structures will rarely exceed this limit.
I have no idea whether that matters or even easy to measure...
Doesn't that mean that null terminated strings is idiomatic C? That is, my understanding of the term idiomatic is that it is defined by whatever is most natural to users of a language regardless of whether it is the most performant.
It would be the equivalent of teaching people to write books without encouraging them to read anything.
To break myself of the habit I started reading some well regarded programs for fun. And oh boy, have I learned a lot from doing so. One of my first discoveries was this beauty in the Redis source code:
https://github.com/redis/redis/blob/3.0/src/sds.h
The idea is to have a string struct that stores its length and content. But the pointer passed around is a pointer to the (null terminated) contents field in the struct. The string is efficient for internal calls (the length can be queried by subtracting from the pointer). But the pointer is also an idiomatic null-terminated C string pointer, compatible with the standard library and everything else. (typedef char *sds;)
Dovecot is also a gem to read if you're looking for inspiration. The way it manages memory pools is delightful - and I'm sure much more performant than idiomatic rust. (That is, without reaching for arena allocator crates and alternate implementations of Box and Vec).
When I once tested std::sort against qsort (sorting 4-byte integer) I measure a 2x difference. So yes, definitely non-trivial, but it won't get much worse than that.
Have you ever seen a program that was slow because of a slow sorting routine?
If you ever need a fast sort (~ never) then the last thing you should do is use std::sort anyway. You should figure out what your data looks like and hand roll an implementation. For example, a radix sort is often possible to use, easy to implement, and much faster than std::sort.
What is meant here? Fast? Reliable? Secure? Memory efficient? Power efficient? Easy to write? Easy to maintain? Quick to compile? Easy to debug?
I have no illusions about the reliability/security/correctness of real-world C code, especially since it's usually not compiled with a memory-safe compiler or run with memory-safe libraries and runtime environments (though it's often sandboxed to limit the damage.) It's relatively easy to introduce memory errors which are not detected by the compiler or runtime.
Certainly many algorithms and data structures (in C and other languages) exhibit tradeoffs including things like speed vs. memory use vs. code size vs. complexity, etc..
But C compilers are pretty fast, which I really like. Then there are/were environments like Turbo Pascal or Think C, which seem to have been amazingly compact while offering a rapid edit-compile-debug cycle as well as decent runtime speed and code size.
I agree, C is a really poor choice for string or other human readable content handling.
But if you have to read-shift-mask-write bytes to hardware control registers, it is a pretty easy language to use, and faster than assembly language in 90% of the cases, with modern compilers. That last note was less true in the mid 1980s, but by the mid-1990s, compiler optimization had gotten to be quite good.
Cellular modem code that runs on DSPs is written in C. Maybe someday that will be written in Rust. We'll see.
for (int i=0; i < 10; i++)
We all instantly know this means loop 10 times, from 0 through 9. But on first read it takes a little work to figure out what's happening.There is a culture of bad C code from the 90s - especially C code written in OOP-y ways where that was never necessary. Like, calling malloc() + free() for every little thing instead of more structured memory management. A lot of that code is written in C not with a efficiency or elegance mindset but simply because C was the language that you wrote programs in.
Do not tie computational operations on string operations in C.
Pick one of many string manipulation libraries to do any serious work with them.
[1] https://github.com/rust-lang/rust/issues/54878#issuecomment-...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
On a quick inspection:
- The Rust code is about twice as long.
- The Rust code has CPU feature detection and SSE intrinsics, while the C code is more idiomatic.
- The lookup table is larger in the Rust code.
Some years ago I submitted an n-body implementation that used the crunchy crate to unroll the inner loop. This was rejected as being non-idiomatic and obscuring what compilers are/aren't capable of. Rust is currently leading the benchmark because someone added the flag -C llvm-args='-unroll-threshold=500' which achieves the same effect. Why one of these is acceptable and the other isn't is beyond me, and all of this makes the whole project very discouraging.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
What impresses me is that the Rust version didn't do any of that stuff, just wrote very boring, straightforward code -- and got the same speed anyway. Some impressive compilation there!
1. https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
An example of simpler version of this is Fortran that can be faster for numerical loads due to the fact that Fortran disallows aliasing of function arguments. C on the other hand, must pay the price of having to be conservative with how it treats arguments just in case they overlap.
Rust is a good lang though. I am glad something else is pushing up there for the top spots. And more competition in performance is a good thing.
You don't always have to pick your sides. I just want to be able to write good code and be happy writing it :)
OpenMP is only available for C, C++, and Fortran so that is why most programs won't use it. However most of the other programming languages have their own ways for doing multi-threading and many of the programs for this benchmark do make use of them.
The rules at https://benchmarksgame-team.pages.debian.net/benchmarksgame/... request that submitters use the same algorithms as existing programs and I try to follow that rule. I believe all the binary-trees programs are using recursive functions so naturally I did the same to avoid breaking the rule about using different algorithms.
But yes, some of these do provide hints for the compiler, e.g. constexpr, ownership semantics per unique_ptr. There's nothing stopping a human from writing equivalent C, so my suspicion is that the performance gap is primarily due to the benchmark implementation.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
I know that using PCRE2 in a Python program isn't typical but the Benchmarks Game does allow using other libraries and many of the previously submitted programs have been doing this for a long time already. The pidigits benchmark is the other benchmark that is heavily dependent on the libraries being used as is illustrated by their being a nearly ten way tie for second place by a bunch of programs that all use GMP.
(Looking at just C gcc vs Rust) https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
I tried running the C code with 2 and 4 threads, wall-time and CPU time don't change much in either case which is strange (this is cygwin with gcc 10.2.0):
2 threads:
$ /usr/bin/gcc -pipe -Wall -O3 -fomit-frame-pointer -march=ivybridge -fopenmp fasta.c -o fasta.gcc-2.gcc_run && time ./fasta.gcc-2.gcc_run 25000000 | md5sum
fd55b9e8011c781131046b6dd87511e1 *-
real 0m0.724s
user 0m1.468s
sys 0m0.108s
4 threads: $ /usr/bin/gcc -pipe -Wall -O3 -fomit-frame-pointer -march=ivybridge -fopenmp fasta.c -o fasta.gcc-2.gcc_run && time ./fasta.gcc-2.gcc_run 25000000 | md5sum
fd55b9e8011c781131046b6dd87511e1 *-
real 0m0.670s
user 0m1.514s
sys 0m0.046sI know that Python 3 can do some runtime stuff that Nodejs can't, but I wonder whether that's worth so much performance. Maybe if the answer is that you would include C modules in Python if you need the speed, but I don't know if that's a good answer to the problem.
Examples? This is too interesting to just throw in then hand wave away.
There are probably others as well, but this is the advantage I'm familiar with. C's ambiguity makes it harder to achieve some optimizations that can really matter on modern CPUs.
The tracking bug is https://github.com/rust-lang/rust/issues/54878
I suspect surprising factors are in play.
For instance, in real-world usage, you might do a bit more redundant copying in C in places you would avoid it with the borrow checker in Rust.
Conversely, enforcing RAII in Rust might cost performance in some scenarios relative to C.
The smallest "hello world" binary rustc has ever produced was 137 bytes. https://github.com/tormol/tiny-rust-executable
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
2) The home page (click the banner) has links which compare benchmarks across two language implementations.
3) Each of those comparison pages has links which compare all the programs, for all the language implementations, for each benchmark.
If Rust implementation is father than C's, kudos goes to the compiler, not to the language
Both C and Rust are capable of just inlining optimal assembly, so comparing "pure speed potential" is pointless.
Space.
Still, real world performance will probably benefit from this fix so it's a positive change regardless.
Where I expect Rust to really shine performance-wise is at the larger scale of real code, where it affords code that copies less often because the programmer isn't sure in this particular function whether or not they own this so they just have to take a copy, or the compiler can't work out aliasing, etc. Ensuring at scale that you don't take extra copies, or have an aliasing problem, or that you don't have to take copies of things just to "be sure" in multithreading situations, is hard, and drains a lot of performance.
Most C++ STL algorithms take iterators, that in many cases are just pointers in benchmarks because these often only deal with arrays. Hell most C standard APIs (e.g. memcpy) take multiple pointers.
I've done it when we really needed that performance for an inner loop(particle system). It can be a real bastard to keep the non-alias constraint held constant in a large, multi-person codebase and the error cases are really gnarly to chase down.
Compare that to Rust which has this knowledge built in since it naturally falls out of the ownership model.
LLVM implements only coarse per-function aliasing information needed by C, and doesn't properly preserve fine-grained aliasing information that Rust can provide.
And how do you know this? Have you actually measured the difference?
My CPU-heavy program (which takes ~30 seconds to run on a 32-core Threadripper) gains ~15% extra performance if I turn it on.
That's like saying C programmers could just write memory safe code if they felt like this would help, so clearly memory safety is not important.
I wouldn't be surprised if there's something similar happening here.
I like PyPy as an example: on the surface, implementing a Python runtime in Python and expecting performance gains seems crazy. PyPy manages to outperform CPython because although a C implementation should theoretically be faster, realistically the increased expressiveness of Python lets the PyPy devs opt into optimizations the CPython devs find out of reach.
I don't know C or Rust well enough to comment on these specific scenarios, but if two technologies can be fast, and one makes that speed accessible while the other encourages implementors to leave performance on the table, that's much more useful information to me than seeing a back-and-forth where folks compete to implement optimizations I'll never make in my own projects.
1. http://www.ats-lang.org/ 2. http://web.archive.org/web/20121218042116/http://shootout.al... 3. https://stackoverflow.com/questions/26958969/why-was-the-ats...
In this case it seems like benchmark code is allowed to use intrinsics, which can degenerate into a situation where a benchmark in language X is more "glorified x86 Assembly code" than actual code in language X.
This is not very useful for comparing languages IMO. Especially since all of Rust, C, C++ can use this strategy and become almost identical in both code and performance.
It is not idiomatic Haskell at all and loses all of the touted benefits.
Is there a separate benchmark that only accepts idiomatic code?
Am I missing something?, is there a list somewhere of the people who have contributed to this?
These benchmarks are almost always either useless or a scam - you either end up writing rewriting the same implementation n times or you don't utilize the capabilities of the language, either way you're not really measuring much of anything intrinsic to the language itself - Rust and C both have the same backends and if you really care about performance you're going to take it to the max anyway, so inference by the compiler isn't that important.
• https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
• https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
• https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
It's possible that someone has already submitted both algorithms for both languages, and different approaches won for language-specific reasons.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
None of them have SSE intrinsics or are quite as long as the Rust version.
I find it doubtful that SSE intrinsics wouldn't help the C version, if they are indeed helping the Rust version. This seems fairly easy to check as the Rust version has a non-SSE fallback code path - I'd do it myself but am not able to at the moment.
Well, it looks like the submission process[0] and the maintainer[1] do, actually.
[0]: https://www.reddit.com/r/rust/comments/kpqmrh/rust_is_now_ov... [1]: https://www.reddit.com/r/rust/comments/kpqmrh/rust_is_now_ov...
In addition to what you've already mentioned, if I'm not mistaken, the programs also take different multi-threading approaches. One of the tasks in this benchmark involves reversing three very large strings. My C program took a fairly simple approach and does that reversal process for each string serially but processes multiple strings in parallel. The Rust program on the other hand does the opposite and does the reversal process for each string in a parallel manner but only processes strings one at a time. The algorithm the Rust program uses is more complicated which results in a considerably larger program size that also is a bit less flexible regarding the input (it assumes each line of input is 60 characters whereas my program will work with lines of any length) but it results in better CPU utilization, less clock time, and less memory use. I've been meaning to submit a faster C program using this faster algorithm but have simply been too busy with other things.
If I had to write a performant library at work, I too might rely on CPU-specific assembly wrappers in my code. But IMO, such code has no place in a general-purpose cross-language benchmark site.
My guess is that other people would take your "performant library at work" premise as justification for including that code.
C code that pretends it does not need to care about it's platform is not idiomatic, it's just suboptimal.
- Code that most C books/courses would teach you how to write
- Portable C code (arguably portability is one of C's biggest successes!)
- Code that you'd expect to find in the K&R book
These are the kind of things that make working in one language different from another.
I'd actually like to see the SSE version in C so I can compare the two implementations and how much grief you have to go through.
What I don't fully understand is: "GCC has the option -fstrict-aliasing which enables aliasing optimizations globally and expects you to ensure that nothing gets illegally aliased. This optimization is enabled for -O2 and -O3 I believe." (source: https://stackoverflow.com/a/7298596)
Doesn't this mean that C++ programs compiled in release mode behave as if all pointers are marked with __restrict?
I don't think there is a fair comparison between Rust and C: C is just an higher level assembler, if the programmer knows what he's doing he can use the hardware 100% of its potential. That is the reason why C is still used in all the embedded applications where you have ridiculous low power microcontrollers and you must squeeze out the best performance.
That is the difference between C and Rust to me: for each fast Rust program you are guaranteed that you can write an equivalent performant program in C (or assembly). Worst case scenario you use inline assembly in C and you get that.
Thus the contrary cannot be true for Rust: if I give you a heavily optimized C program not always you can produce an equivalent version in Rust.
Also not always these optimizations are what you want. In C you can choose the level of optimizations, and most of the time, at least on the program that I write, I choose a low level of optimization. The reason is that a lot of time performance is not the only thing that matters, but it maybe matters most the stability of the code (and a code compiler with optimizations is more likely to contain bugs) or the ability to debug (and thus the readability of the assembly output of the compiler).
Rust gives out an horrible assembly code, that is impossible to debug, or to check for correctness. You just have to hope that the compiler doesn't contains bugs. For the same reason Rust is the ideal language to write viruses, since it's difficult to reverse engineer.
[1]: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
And then there are the whole LLVM and GCC based eco-systems.
It is gratifying to see C++ identified, here, as the hands-down fastest implementation language, but odd to see Rust performance still compared, in the headline, to C, as if that were the goal. The headline should say that Rust speed is approaching C++ speed. In principle, Rust speed should someday exceed C++'s, in some cases, because it leaves behind some boat anchors C++ must retain for backward compatibility. In particular, if compiler optimizers could act on what they have been told about language semantics, they should be able to do optimizations they could not do on C++ code. As it is, Rust relies on optimizers coded for C and C++.
Rust may never reliably beat C++, because Rust has itself committed to details that interfere with optimization, principally in its standard library: The C++ Standard Library offers more knobs for tuning, while the Rust libraries are simpler to use.
Some of the reasons that C++ is so reliably faster than C are subtle and, to some, surprising. As noted elsewhere in this thread, the compiler knows more about what the language is doing, and can act on that knowledge, but C and C++ compilers share their optimizer, so that makes less difference than one might guess. Mainly, the C++ code does many things you might do in C, but in exactly one way that the optimizer can recognize easily.
The biggest reason why C++ is so much faster than C is that C++ can capture optimizations in libraries and reliably deliver those optimizations to library users. The result is that people who write C++ libraries pay a great deal of attention to performance because it pays, and that attention directly benefits all users of the library, including other libraries.
Because you can capture more semantics in C++ libraries, C++ programs naturally use better algorithms than C programs can. C programs much more frequently use pointer-chasing data structures because those are easier to put into a C library, or quicker to open-code, where the corresponding C++ program will use a library that has been hand-tuned for performance without giving up anything else.
Rust gets to claim many of the same benefits, because it is also more expressive than C, and Rust libraries are often as carefully tuned. Rust is not as expressive as C++, yet, and has many, many fewer people working to produce optimal libraries, but it is doing well with what it has.
One example: C re-write of a popular chess engine Stockfish (written in C++) is significantly faster (10%-30% depending on who measures it at which time and on what hardware). This is a piece of code which was already heavily optimized for many years as speed is critical for the engine's performance. One guy (although very talented one) re-wrote it in C and got the speed gains. Another guy (also very talented one) re-wrote it in assembly and got even bigger speed gains (see CFish and asmFish projects)
It looks to me that you are comparing badly written C to decently written C++ and claiming speed gains. I use C for my own software(solver for a popular game) and I wouldn't dream of using stuff like "pointer chasing structures". If you need to use stuff like hash tables in your performance critical code you have to code them yourself anyway as using some generic hash and memory handling algorithm is a recipe for being slow.
"Less performant" seems like a less clear (so to speak) way of saying "slower" (or maybe "less efficient" or simply "worse" in some instances.)
// Contributed by Mark C. Lewis.
// Modified slightly by Chad Whipkey.
// Converted from Java to C++ and added SSE support by Branimir Maksimovic.
// Converted from C++ to C by Alexey Medvedchikov.
// Modified by Jeremy Zerfas.
It sounds like no-one bothered to actually write a from-scratch version of many of these things :)
Those two features make for big speed improvements.
That said, there is a series of patches in-flight to get actual working full restrict support: https://reviews.llvm.org/D69542
C++: https://godbolt.org/z/dTPsbh Rust: https://rust.godbolt.org/z/Mdx87h
In C and C++, compilers are allowed to infer "noalias" based on pointer types; two pointers of different type are not allowed to alias. This is known as type-based alias analysis. But char * is given special treatment, because of its use as a generic pointer type; so if one of your arguments is char * or even signed char * , that disables any optimizations which were relying on type-based alias analysis.
This provides both for performance and undefined-behavior footguns. If you ever try to use some pointer type other than char * or signed char * to refer to data of another type, you may inadvertently cause type-based alias analysis to kick in, causing invalid optimizations and miscompilation. On the other hand, if you have a function which takes both a char * and another pointer, the compiler may not apply optimizations that it otherwise could because char * is allowed to alias anything.
In Rust, there is no such undefined behavior footgun. Because of the LLVM bug, noalias isn't applied to &mut pointers so there are still some cases which could be better optimized, though it sounds like there is some progress being made on the LLVM front so it should be fixed at some point, and there are already places where the compiler can do better optimizations with better safety due to the stronger semantics of &T.
https://gist.github.com/ityonemo/f34e0d3b4891f481863980a3142...
Note it could also be platform dependent. I'll give it a whirl on my own (new) machine.
Additionally, most compilers produce something that's useful for development and debugging by default, and that means including extra stuff for that purpose.
That Rust hello world is just hand-crafted to never be linkable to anything else at runtime and invoke a system call with a buffer. This isn't really how we'd like to do most things.
For example, an assembly program that just calls write and then exits is much smaller than even an unlinked version of it that uses printf, because the ELF binary doesn't need to contain information for the linker.
if you want to know something more than that, like how performance tends to end up after inputs of identical effort and talent you will need to recruit a lot of people and money to run really large scale experiments and when you are done people on HN will just endlessly find nits to pick with your results whenever they don’t agree with their preconceived notions
C is a tricky beast to tame.
"Registration with @gmail.com addresses is blocked due to abuse, please register with another address. You can change it later, once signed up and verified."
However, apparently, you can register using a Google, Bitbucket or GitLab.com sign-in.
Also C++ and Fortran don't have to spend much effort to achieve that.
For what it's worth I did say "can be" not "is" because I wasn't sure of the current state of this feature. I was just passing on the theory.
There are certainly other potential reasons, for instance constant expressions and generics. And, of course, the prohibited undefined behavior makes other optimizations possible, and potentially better.
Out of interest though, does he mention somewhere which actual implementation he used?
If you really need a Btree (like, if you want to make a fair benchmark for a presentation) then you'll absolutely find a reasonable implementation in C. After all, why would you need dependency management for something that should have 0 dependencies?
As an example of library data structures implemented in C - if you want, check out my Red-black tree (not a Btree): https://github.com/jstimpfle/rb3ptr/blob/master/rb3ptr.h . It's really easy to integrate into your project. I think the API (found in rb3ptr.h) is totally usable, and it might also be faster than what you can get with safe Rust - unless you can easily use intrinsically linked data structures in safe Rust.
One library I have exploits the fact that D templates are embarrassingly better than C++'s, so you can actually benchmark a template against it's parameters in a clean manner without overhead - that could be anything from a size_t parameter for a sort or a datastructure for example.
enum cpuidRange = iota(1, 10).map!(ctfeRepeater).array;
@TemplateBenchmark!(0, cpuidRange)
@FunctionBenchmark!("Measure", iota(1, 10), (_) => [1, 2, 3, 4])(meas)
static int sum(string asmLine)(inout int[] input)
{
int tmp;
foreach (i; input)
{
tmp += i;
mixin("asm { ", asmLine, ";}");
}
return tmp;
}
This made-up (pointless) benchmark measures how insert a number of cpuid instructions into the loop of a summing function affects it's runtime. My library writes the code from your specification as above to generate the instantiations and loop to measure the performance. As you might guess, the answer is a lot (CPUID is slow and serializing).edit: https://github.com/maxhaton/chimpfella - I haven't bothered to add pmc support yet
What is "real theorem-proving language" in this context? ATS has ATS/LF that is designed for writing formal proofs in the language.
http://ats-lang.sourceforge.net/DOCUMENT/INT2PROGINATS/HTML/...
One can, with sufficient effort, essentially write C code in Haskell using various unsafe primitives. We would argue that this is not true to the spirit and goals of Haskell, and we have attempted in this paper to remain within the space of “reasonably idiomatic” Haskell. However, we have made abundant use of strictness annotations, explicit strictness, and unboxed vectors. We have, more controversially perhaps, used unsafe array subscripting in places. Are our choices reasonable?
http://www.cs.ru.nl/P.Achten/IFL2013/symposium_proceedings_I...
But when using that variant all the correctness guarantees that Haskell is known for are no longer on the table.
Also C was known for not being a fast language on 80's home hardware, ruled by Assembly.
So low level coding for OS isn't coding like C and is just the way for mechanical sympathy, regardless of the language.
(The program I linked is an example of exactly what you're talking about)
Because of that, it is usually pretty easy to substitute an optimized algorithm or a container for another one with minimal code changes. In C, because there is no standard for how to do containers and algorithms, it is likely that every library does something a little different in terms of conventions making it harder to swap in implementations.
Every large scale C code base I have worked on has had these things, and they were used as you say all over the place. If I were to encounter a code base that didn't have these things I would wonder why it was written in C in the first place.
I have no problem doing that if I need to. But it feels like I'm fighting against the grain of rust more than I'd like.
I'd either start at a query entry point, or find the email delivery path or something. Or skim through the files and see if anything catches your eye. Or invent a question for yourself - how does X work (for some X), and see if you can find that part of the code.
Wherever you start, explore around a bit then go deep. See if you can understand for yourself how some small, interesting part works (by tracing out the various structs and function calls). Understanding how the whole program fits together is a separate skill - but don't worry too much about it.
Personally I fell in love with the code in src/lib/memarea and mempool. The way dovecot handles memory pools is super clever. There's also lots of src/lib-X directories, and they're reasonably self contained if you want to skim that list and find something more focussed.
If you haven't read much code before and dovecot feels too big and scary, start with something smaller. I can also recommend reading the sourcecode for git or the code for redis. Oh, and if you do, do yourself a favour and checkout one of the earlier versions of those projects. The lines added early in a project's life are usually more succinct and important compared to lines added later on. So earlier versions are usually better reads. When I read redis, I think I read the code for redis 2.0 or something.
You could do some macro magic instead of templates e.g. `DEFINE_QSORT(int, intcmp)` which could stamp out `qsort_int` but that's not a part of the stdlib.
C++ arguably gets this right since sort<int> and sort<string> will be separate functions, although templates are of course a footgun. And of course duping the logic for std::sort<T> for a bunch of different T impls increases the binary size.
When this is done, the compiler can inline qsort, and replace the indirect function call with an inlined version of intcmp, and then things are equivalent.
In that case, the compiler could make a specialized qsort using intcmp automatically.
In my similar experiments with std::sort with a function pointer only gcc does this with -O3.
It is a great example of how the Rust compiler can auto-vectorize code.
Are you sure the rust performance data isn't for one of the other implementations that use the same crufty tricks as the C version?
e.g.: https://benchmarksgame-team.pages.debian.net/benchmarksgame/..., https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
It looks like the autovectorizer did a really good job on this one.
As an aside: this made me notice this n-body simulation only has 5 bodies! This is a pretty strange case that makes these O(n^2) optimizations practical. The n-body simulations I've been familiar with in the past had thousands of bodies, where this approach probably isn't a good idea.
This blog post shows how to write simple idiomatic Rust code that will allow the compiler to auto-vectorize:
So maybe it's just the C benchmark being a cargo cult.
At least you can audit only those places though.
That's why Rust developers don't take kindly to unnecessary unsafe usage, because the audit surface area (and interaction complexity) increases.
Therefore, the optimizer can always assume mutations don't alias. Even UnsafeCell isn't allowed to break the alias rules, it just provides flexibility going from immutable to mutable.
For example consider Slice::swap [1]. This does consider the possibility that the pointers alias, and correctly, because they might point at the same object.
https://doc.rust-lang.org/src/core/slice/mod.rs.html#541-553
What prevents this?
I trust you on on this, but where is the remaining work? Language semantics, compiler, third choice?
The remaining work _on this particular benchmark_ is the regex algorithm itself. I'm on mobile so I can't do a deep dive, but I haven't yet figured out how to easily improve on this particular case. It has to do with the fact that the benchmark has a high match count and the finite automata approach in the regex crate has a bit higher overhead than the typical backtracking solution used in PCRE2 (which is also JIT'd in this case).
It's not the language, compiler, inlining or any other such thing. It's algorithms.
But this is one single benchmark. Before regex-redux there was regex-dna, and Rust's regex crate was #1 there. Why? Same reason. Algorithms.
You can't judge regex performance by a single benchmark. Two won't do it. Not even ten. It's one of the many problems with the Benchmark Game. This would be fine if everyone was circumspect and careful with their interpretation of the data provided, but they aren't. And the Benchmark Game doesn't really do much to alleviate this other than some perfunctory blurbs in some corners of the web site.
With that said, running a benchmark is hard work. It's easy to criticize.
Indeed.
Especially easy if it's just fault-finding without any suggestion as to what might be done to "alleviate this" :-)
I know because I run a userland that uses length-prefixed strings as far as possible: https://github.com/akkartik/mu
You get a null terminated char array but the length is actually available since arrays (as long as they haven't decayed into pointers) can still share their length at compile time.
If your new C program is faster than your old C++ program, then you may simply rename files from ".c" to ".cc". Then, you have two C++ programs, one faster than the other.
If your custom hash table really is faster than any library you find, congratulations! Recoding its interfaces, without giving up any performance, you can make it available for use in other C++ programs. As one may deduce, this has already happened, and has produced the better hash table libraries you appear not to know about. Available C++ hash table libraries have benefit of overwhelmingly more optimization attention than could be afforded on behalf of a single program, and they deliver that performance to all dependent programs.
Every programming project is an exercise in practical economics: your strictly-limited available attention goes where you choose. The greater productivity of coding in a more powerful language frees up attention that may then be allocated to areas that would otherwise suffer neglect. Whatever amount of attention you devote to making code in a poor language work at all, you may spend a fraction of coding in a better language, and the balance on other beneficial uses, such as better performance.
That the new program is faster than the old program reliably demonstrates that the old program was not so well optimized as you suggest. (Your comment elsewhere, that "the whole project uses only [a] minimal set of C++ features" reveals perhaps more than you intended.) One may surmise that the old program's authors spent more of their limited attention on its effectiveness at playing chess than the latter program's author needed to.
That the third, assembly-language program is faster still demonstrates that the previous programs left performance on the table. My experience is that it is not hard to double the speed of a typical program just by paying attention to cache and pipeline effects. Most likely, the asm coder just happens to know more about those effects, knowledge that could as well have been applied to the others. Most programs seem fast enough exactly until another, faster one comes along; then they are instantly slow.
While C++ isn't an exact superset of C it's close enough but it's not the point. The debate is about how the languages are used. Otherwise you could always say "yeah, use inline assembly and don't use any abstractions with the exception of structs, pointers and arrays".
>>Available C++ hash table libraries have benefit of overwhelmingly more optimization attention than could be afforded on behalf of a single program, and they deliver that performance to all dependent programs.
They are also general purpose. When you have something specific to optimize you will always beat general solutions.
>>Every programming project is an exercise in practical economics: your strictly-limited available attention goes where you choose. The greater productivity of coding in a more powerful language frees up attention that may then be allocated to areas that would otherwise suffer neglect.
Sure. The debate is about what is faster though and then C++ is not faster than C but C is often faster than even slightly idiomatic C++ and much faster than C++ with all the modern features used frequently. You will get stuff done faster in a higher level language of course but that besides the point of debate which language is faster where performance matters.
>>Whatever amount of attention you devote to making code in a poor language work at all, you may spend a fraction of coding in a better language, and the balance on other beneficial uses, such as better performance.
I don't agree C is a poor language. It's quite a common view as well. A lot of people who are good at low level stuff prefer C to C++ because the language is simple, easy to read and easy to reason about. C is poor at some things, it's fantastic for others. I personally love the language and it's my choice for many weekend projects.
>>That the new program is faster than the old program reliably demonstrates that the old program was not so well optimized as you suggest. (Your comment elsewhere, that "the whole project uses only [a] minimal set of C++ features" reveals perhaps more than you intended.) One may surmise that the old program's authors spent more of their limited attention on its effectiveness at playing chess than the latter program's author needed to.
Stockfish is a big decade+ old and still heavily developed community project with tens of programmers contributing to it. A lot of attention was given to it and to optimizing specific parts of it. Still, it's written in C++ in usual (although still very minimal) style and there is cost to it.
>>That the third, assembly-language program is faster still demonstrates that the previous programs left performance on the table.
Well, my experience is that people good at assembly run circles around very good C/C++ programmers. I would go as far as to say that it's hard to take anyone who doesn't realize it seriously. There is just so many things you can't do in C/C++ even with today very smart compilers.
>>My experience is that it is not hard to double the speed of a typical program just by paying attention to cache and pipeline effects. Most likely, the asm coder just happens to know more about those effects, knowledge that could as well have been applied to the others. Most programs seem fast enough exactly until another, faster one comes along; then they are instantly slow.
You just don't get it. Stockfish is a program which depends on being fast. Making it fast is the number one priority. A lot of very smart people spent hundreds of hours optimizing small parts of it like the best way to generate chess moves in as few CPU cycles as possible, designing the hash table to minimize cache misses etc.
If you can make Stockfish 2x faster you would be considered the prophet of programming and your statues would be raised in cities around the world. Seriously, it's not some average random code which you can make faster applying concept you happen to see on front page of Hacker News. We are talking about the code where performance matters. All the people working on such code know a lot about all the concepts you mentioned because they apply them day to day.
Anyway, CFish, asmFish and Stockfish are all open source projects. You can find them on Github. You can ask the authors why they think their stuff is faster if you are curious about it. I mean maybe it's worthwhile to understand why someone chooses to do all the work to re-write a popular community project to C or asm. It's a lot of difficult work. They might know something you don't.
That does not mean C++ programs are necessarily fast: it has always been easier to write slow programs, in any language. It doesn't mean C programs must be slow: making a fast C program just takes a great deal more work, and time, than a fast program in modern C++, so the extra time taken has been wasted.
You are always free to waste your own time. Wasting your employer's time is often less defensible. In the fintech world, suggesting C as a way to speed up a C++ program would get you laughed out of the room, at best.
The common experience noted in the original page is that C++ programs are faster than C programs. We know why.
That the chess program was easy to transcribe to C demonstrates that it was not good C++ code -- which you also revealed yourself. Making a faster C program while transcribing an almost-C C++ program is no big trick.
I routinely speed up programs by 2x by changing only a few lines. A 10-20% improvement in a whole rewrite is a great waste of time. Translating to asm, moreso. I would call a 10-20% improvement negligible, and crediting it to C delusional.
All high level languages are portable, including some older than C.
It helps when one drags the OS the language was born on, along via parallel standards like POSIX.
Theoretically portable, maybe. But C is actually portable in practice - you can compile C code for the vast majority of CPUs and OSs.
Every “implementation defined” feature (https://en.wikipedia.org/wiki/Unspecified_behavior#Implement...) creates two different languages (examples are sizes of basic numeric types, signedness of the char type, and evaluation order of arguments)
Also, the standard library is so limited that, historically, you needed lots of stuff that wasn’t standardized between systems (https://en.wikipedia.org/wiki/GNU_Autotools: “It can be difficult to make a software program portable: the C compiler differs from system to system; certain library functions are missing on some systems; header files may have different names.”)
But yes, bringing up C on a system is a lot easier than doing the same for, to pick another extreme, Ada, but that’s because only half the job is done. Autotools does a bit of the rest, but still requires help from program writers.
Original C was created after UNIX already existed, used for UNIX's first rewrite and until UNIX source code left Bell Labs, it only worked on PDP-11 computers.
Practically all non-trivial C codebases have platform specific segments. They can be either isolated to their own libraries or wrapped around tons of preprocessor switches.
C code is portable in the sense that it's not too much work to get a C compiler to work on a new hardware platform. On the other hand, compiling a large real-world C codebase on the said platform will probably require porting, and the effort may or may not be trivial.
Most C++ code I see has few vtables, and what vtables exist are mostly removed by the compiler. Java-style inheritance hierarchies are not idiomatic in C++. The code gen for C++ looks a lot like hyper-optimized C code, but without having to write hyper-optimized code. C++ naturally provides much more information about intent to the compiler than C does, and modern compilers are excellent at using that information to provide nearly optimal code gen.
FWIW I feel like this is a fairly recent (good!) development. Up through TR/TR1 I think at least half the C++ code I saw was pretty heavy on the "Java envy" and not at all concerned about the cost of dynamic dispatch or memory allocation. Avoiding vtables was derided as part of "C with classes" - today that usually refers more to templates and exceptions but STL implementations were not mature enough and too much code not exception-safe for those to be universally viable back then. Something I heard more than once was that if you had a destructor it should always be virtual just in case someone wanted to subclass it later.
Naturally we have to ignore that wxWidgets, Gtkmm, MFC, ATL, Qt, COM, DirectX, IO Kit, DriverKit, Skia are still around.
I mean, as much of a Qt user I am, that's definitely the case. All the libs you listed are still around, but no one writes new libraries in that style (sometimes at a performance cost, e.g. using a ton of std:: function instead of virtuals)
My experience is that C++ that's written for performance will often prefer the use of templates over inheritance since the cost is then paid upfront by the compiler. What's stopping a C programmer from hand-coding template instantiations (via macro or otherwise)?
In my experience most of the time you need arenas you’re using your own data structure anyway, but YMMV.
That makes sense for video games. Recently I was goofing with cyrus-imap. I wanted to parse the emails out of an mbox file into JSON (JMAP). Parsing an email with cyrus currently does about 5-10k calls to malloc, but the objects are all extremely short lived - they just have to live long enough to parse and then convert to JSON. This is a perfect case for a bump allocator - I'd love to allocate all the parsed email fields into an arena and then clear the whole thing when we move on to the next message.
Yes, Cyrus uses a ton of its own internal structs for emails, and they're littered with strings and vectors. (Eg for email headers, lists of email recipients, plain text / HTML message content, etc).
Looks like bumpalo will do the job, since it implements its own Box, Vector and String. I understand why, but it seems jarring that I'd need to replace the data types in order to change out the allocator like this. I'm definitely keen for GAT landing if it means bumpalo and friends don't need to reinvent the world to be able to change the allocation strategy.
Edit: Oooh Vec::new_in is in nightly! Exciting! https://doc.rust-lang.org/beta/std/vec/struct.Vec.html#metho...
https://rust-lang.github.io/rfcs/1398-kinds-of-allocators.ht...
void foo(/*restrict*/ bool* x, int* y) {
if (*x) {
printf("foo\n");
*y = 0;
}
if (*x) {
printf("bar\n");
}
}
Enabling strict aliasing is effectively an assertion that pointers of incompatible types will never point to the same data, so a write to y will never touch *x. restrict is an assertion to the compiler on that specific pointer that no other pointer aliases to it.I wish there was a C/C++ compiler flag for treating all pointers as __restrict__. However I guess that C/C++ standard libraries wouldn't work with this compiler option (and therefore this compiler option wouldn't be useful in practice).
So while it is true that by default Rust has a theoretical performance advantage (compared to C/C++) because it forbids aliasing pointers I wonder (doubt) whether this will cause Rust binaries to generally run faster than C/C++ binaries.
On the other hand Rust puts security first, so there are lots of array bounds checks in Rust programs (and not all of these bounds checks can be eliminated). Personally I think this feature deteriorates the performance of Rust programs much more than you gain by forbidding aliasing pointers.
Does that also hold for a "uint8_t"--which is often just a renamed unsigned char rather than being a genuine type of its own?
Being around long enough, the meme writing Java in C++ just rubs me the wrong way for something that was already common several years before Gosling even thought of Oak.
As if C++, alongside Smalltalk, and all those enterprise object modelling methodologies and books, weren't to blame for the widespread of that way of programming.
When Java came into the scene, those C++ and Smalltalk developers (and enterprise architects) migrated to it, and kept doing what were already considered "best practices" by then.
https://microsoft.github.io/microsoft-ui-xaml/
https://developer.apple.com/documentation/driverkit
https://www.khronos.org/registry/SYCL/
https://developer.nvidia.com/optix
https://docs.unrealengine.com/en-US/ProgrammingAndScripting/...
I could keep adding a couple more, but I think you got the point by now.
I kinda want this on a mug or something.
``` bool b; if (b) printf("1 "); if (!b) printf("2 "); ```
might print `1 2 `.
To my knowledge there is no C compiler that promises to always handle all forms of UB consistently. The closest you could get would probably be something like Valgrind.
Use your brain and use your tools. C is not really that hard for a large number of problem sets.
If their salaries cannot find them, where are they?
Do we have a candidate to prove their secure reports as being wrong?
In an embedded context, such bugs may be extremely costly to fix, if it's even possible.
> modern compilers will flag uninitialized variables
Reading uninitialized variables is one of the more easily prevented forms of undefined behaviour. As pjmlp points out, undefined behaviour in C/C++ programs is one of the major sources of security vulnerabilities in today's software.
It is reasonably easy to measure, and the GP is about right. I've measured a crossover point of around a few hundred items too. (Though I'm sure it'll vary depending on use case and whatnot.)
I made a rope data structure a few years ago in C. Its a fancy string data structure which supports inserts and deletes of characters at arbitrary offsets. (Designed for text editors). The implementation uses a skip list (which performs similarly to a b-tree). At every node we store an array of characters. To insert or delete, we traverse the structure to find the node at the requested offset, then (usually) memmove a bunch of characters at that node.
Q: How large should that per-node array be? A small number would put more burden on the skip list structure and the allocator, and incur more cache misses. A large number will be linearly slower because of all the time spent in memmove.
Benchmarking shows the ideal number is in the ballpark of 100-200, depending on CPU and some specifics of the benchmark itself. Cache misses are extremely expensive. Storing only a single character at each node (like the SGI C++ rope structure does) makes it run several times slower. (!!)
Code: https://github.com/josephg/librope
This is the constant to change if you want to experiment yourself:
https://github.com/josephg/librope/blob/81e1938e45561b0856d4...
In my opinion, hash tables, btrees and the like in the standard library should probably swap to flat lists internally when the number of items in the collection is small. I'm surprised more libraries don't do that.
If I recall correctly, the STL provides guarantees that prevents it from taking advantage of flat lists. I think some containers (not arrays) guarantee that they don't move the address of whatever they're containing, even if you insert or remove elements. Even if they switched to flat lists, it would be a flat list of pointers, with all the incurred overhead.
Likewise, I believe I have heard of small string optimisation being impossible with std::string for similar reasons.
not impossible, SSO is implemented to some extent in most mainstream c++ compilers.
When I changed the hardcoded node size from 136 to 1024 bytes, time went down from 3.6 secs to 2.6 secs on my laptop. It kind of plateaued at 1024 bytes. I didn't do more extensive testing.
What's the rationale for the choice of 136? A cache line is usually 64, so the CPU will always end up loading a multiple of that in any case.
When I did a rope implementation myself (I don't know how it compares in terms of performance) I think I ended up with nodes of 4096 or 8192 bytes size. That was based on a Red-black tree. When I load ~1Gig of data, even with a node size of 4096, there will still be 2^18 nodes, meaning each access requires sth. like 17 child traversals, which seems a lot to me. So I can't see myself going down to 200 bytes or less.
https://ridiculousfish.com/blog/posts/array.html
In the sense of swapping the underlying implementation as the number of items in the collection increases.
I suspect for this reason it'd be better in Rust to use a Go-like hash map implementation [1] that keeps all the key/value information (size, Hash and Eq implementations) in a vtable-like form rather than be monomorphized, except in really hot inner loops where the specialization is worth it. There was an interesting article and discussion on reddit relating to this [2] where someone made a toy type-erased map (though not as nice as Go's) to measure the difference in compilation times. Maybe some day I'll make a more production-ready attempt...
[1] https://dave.cheney.net/2018/05/29/how-the-go-runtime-implem...
[2] https://www.reddit.com/r/rust/comments/g1skz4/an_experiment_...
L1 and L2 are per core. A cacheline is 64 bytes and per-core cache size is on the order of 32kb and 256kb for L1/L2.
Reading data from L1 is on the order of 1 nanosecond (or less) and RAM on the order of 50 nanoseconds.
If you’re scanning an array and load a dozen cachelines that’s almost certainly preferable to several cache-misses (and lines).
Memory access is very often an application’s bottleneck. The answer is almost always more arrays and fewer pointers.
The number of people who dismiss the lowly array is way too high. Arrays are fast. Keep your data in the cache and they're 10x or more faster than normal, and plain flat arrays are almost always faster than literally any OOP nonsense. By "almost always" I mean that I've never once encountered a situation where flat arrays weren't the fastest solution attempted, and I haven't seen everything, so I can't claim they're always fastest.
People really don't understand how fast their computers really are, thanks to developers not caring how fast their code is.
Linear arrays are often faster, not because they require fewer memory loads, but because the cache has intelligence built in (the prefetcher) that loads some memory in advance even before the code requests it. That intelligence works way better with linear array scans than with pointer chasing. (I'm not sure if these loads will count as cache misses or not).
Yes, for large enough N, a BST easily beats linear search.
Would you maybe if there's any source discussing how this is solved in different languages and performance consequences of it?
* In many languages, monomorphization is impractical. Eg, in C it requires macros or external code generators. Java and Go don't even have macros. So it's rarely done. Instead, stuff uses type erasure, dynamic dispatch, more heap allocations, and tables like the ones described in that dave.cheney.net link. Maybe some JITs or even some ahead-of-time compilers do monomorphization internally in some cases, but I'm speculating rather than speaking from knowledge.
* In C++ and Rust, monomorphization is easy. You don't have to do it, but it's easy, so many people do. The containers in the standard library are monomorphized. This code can perform very well in any particular case, but in aggregate it's large enough that I have doubts about whether it's always worth it for the whole program.
Depends on how those programs were written. Iterators should avoid bounds checking for example.
It says that they can't overlap, but yes, you can get the compiler to optimize based on this if you provide the aliasing information and remember to keep it accurate. You probably won't do that, though, for anything except the most performance-critical of inner loops. A compiler that can infer more about aliasing can provide more optimization, safely, in the >99% of code that doesn't have explicit aliasing annotations, and that's probably worth some decent speedups in practice.
There are two main things you might be talking about when you call a programming language fast:
1. Given some really performance-critical code and enough time to hand-optimize it, how close can you get the compiler's output to the optimal machine code?
2. If you write code normally, not making any effort to micro-optimize, how fast will it usually be?
Both of these matter. #1 matters when you've got some bottlenecks that you really care about, and #2 matters when you've got a flatter profile -- and both situations are common.
Another illustrative example of the #2 type of performance with Rust is the default collections compared to the ones in C++. Rust's default HashMap is faster than C++'s std::unordered_map because of some ill-advised API constraints in the C++ STL. You can get similar performance by using a custom C++ hash table implementation, and in fact Rust's HashMap is a port of a C++ hash table that Google wrote for that purpose, but most people probably won't bother.
So, a semantic question: if you can get the same speed in one language as you can in another, but in practice usually don't, is one language faster than the other?
Most people who don't bother with that probably also don't bother with the performance of their hashmap.
Also in C you cannot retrofit restrict in existing programs, specially they depend on existing binary libraries.
Do you have some evidence here? Or something more specific?
(The rest of your comment is opinions, which you can of course have, but does not match my personal experience, FWIW.)
So probably you can give me any C program and I'll be able to give you an equivalent Rust program. It'll probably perform about the same, too.
Rust also allows inlining assembly.
> Thus the contrary cannot be true for Rust: if I give you a heavily optimized C program not always you can produce an equivalent version in Rust.
This is incorrect for the reason above. Worst case scenario you just inline assembly in Rust to reproduce exactly the same performance of any arbitrarily optimized C program.
Eh, not really.
Add analysis to each benchmark. Require submissions to come with analysis. Or more minimally, make the existing disclaimers on the web site more discoverable through one of any number of means, up to taste.
Perhaps you'd like to take on that task?
> Require submissions to come with analysis.
Which raises the barrier for program contributors and presumably would require me to judge whether their analysis was acceptable? Me? Really? ;-)
> disclaimers on the web site more discoverable
The problem is that no one wants to "discover" disclaimers.
We want to see something that supports whatever it is we already believe — and we're very good at ignoring anything else.
> The problem is that no one wants to "discover" disclaimers.
Again repeating something I already said. It is indeed a problem. There are ways to address it. I personally think you do the absolute minimum.
Right, with C, you may have to fix a few bugs to get a program working properly in a new system due to the reasons you mention.
But with many other languages, you'd have to port a compiler and the standard library first, which is probably harder :)
Unfortunately, I'm still of the opinion that "idiomatic Rust" is quite a lot harder to write than idiomatic Go or C# or etc, and many applications absolutely index on developer velocity and performance and quality are "nice-to-haves". Many companies are making money hand over fist with Python and JavaScript, and Go and C# are quite a lot more performant and typesafe than those languages already; Rust is better still in these areas, but returns diminish. If C# and Go preclude 95% of the errors found in Python and JS, it's not worth trading tens or hundreds of percents of developer velocity for that extra 4% or so improvement in quality (a Go developer could recoup a bit more of that 4% by writing tests in the time they save over Rust).
Of course, this is all subjective, and I've met smart people who argue with a straight face that Rust is better for developer velocity than languages like Go, so YMMV.
Well Go and C#[1] still suffer from the billion dollar mistake (null pointers), which represents at least 1/3 of errors I've witnessed in JavaScript code, so I'd say they at best removes 70% of errors. And there's also logic errors, for which neither Go's or C#'s type system helps either, so maybe we're at 50% error reductions with Go and C# compared to JS and Python. I don't even think that Rust, with its really helpfull type system (ADT + Affine types) achieves 95% error reductions though.
[1]: for C#, at least last time I used it, which was 2011
Since then, C# added nullable reference types:
https://docs.microsoft.com/en-us/dotnet/csharp/language-refe...
In projects with complete NRT coverage, it's extremely rare I ever get a null ref, as the compiler will warn/error me if I'm using a possibly null value in an unsafe way.
-- How much usage do you expect to have? The more usage, the more important performance and correctness are.
-- How critical is your application? The more critical it is, the more important correctness is.
Also, the argument that you can spend developer time to increase correctness just by writing more tests works up to a point, but then it doesn't, because of diminishing returns. With Rust you can eliminate certain entire classes of bugs which tests will never reach.
Agreed. This is what I was alluding to by "many applications absolutely index on developer velocity and performance". Note that it's even a bit more nuanced--within an application there are bits that are more sensitive than others. For example, the UI widgets are typically much less sensitive than the security, data privacy, and data integrity bits. Even still, I haven't noticed a frenzied rewriting of these systems from Go/C#/Java/etc into Rust, and I'm willing to bet that there's a fair amount of this kind of code implemented in JavaScript or Python.
> How much usage do you expect to have? The more usage, the more important performance and correctness are.
True, but this is a very low-value concern. Notably, hugely popular apps like Reddit can break altogether on a nearly daily basis (never mind more minor bugs, like some UI widget breaking) and they're still content to write in a completely dynamic language.
> How critical is your application? The more critical it is, the more important correctness is.
I think this is a much more important driver than usage, but relatively little software is so critical. As previously mentioned, there's lots of software that governs privacy and security systems that isn't being frenzily rewritten in Rust. Even OpenSSL, an application in which correctness is far more important than velocity, isn't considering a rewrite to Rust despite that it's written in a language with worse static verifiability than the Go/C#/Java tier. This is probably the kind of application that I would want to see in Rust--it's very sensitive to performance and correctness; which is to say, I think OpenSSL is very near the boundary at which writing in Rust makes economic sense (something that is very stable and very sensitive to correctness and performance). It's almost certainly not your web services.
> Also, the argument that you can spend developer time to increase correctness just by writing more tests works up to a point, but then it doesn't, because of diminishing returns. With Rust you can eliminate certain entire classes of bugs which tests will never reach.
Yeah, sorry, my brain was thinking one thing and my fingers typed another. I should have said "writing additional tests would recoup a bit less than the additional 4% that Rust would get you"; instead I typed "more". I missed the edit window, so I can't update my post.
I agree with your post generally though.
One potentially important difference is integer indexes vs pointers. Compared to C, Rust uses int indexes less often within a function (iterators, not for loops) but more often between functions: in C you might store a pointer somewhere, while in Rust it's easier to store an index into a collection. But a ton of work has been poured into LLVM's induction variable analysis so this is swimming upstream.
The aliasing could be really important; there's a ton of pessimizations for e.g. uint8 pointers. Part of the maturing is fully specifying Rust's UB and aliasing model.
However, one advantage for C and C++ for such projects is that they make it far more pleasant to do bulk allocations. With C, you just need to group your mallocs and cast the memory and C++ offers placement new where you pass in the memory explicitly. With Rust, you need to reach into unsafe Rust (rightfully so), but unsafe Rust is very unpleasant to write.
They are definitely not easier to write. The last time I checked, the error reporting were still so intractable that would be hard to take this argument in good faith. Maybe it improved?
>In practice templates are embraced and used extensively, not "avoided at all costs" (like C++ exceptions.)
I think many would agree that this is largely because the package management system for C++ is so awful that header-only libraries are popular since you just need to copy paste a single file into your project.
C++ is a large language and there are some wildly different coding standards. Some do avoid templates (but you're right that GP overstated this). Why? Templates introduce a lot of code bloat because there is little opportunity for eliding copies of the same code at link time. Aside from bloating the global offset table in position independent executables (if the GOT doesn't fit in cache, perf drops), the compile time hit is also a non trivial issue. But this is a really exceptional scenario where you are shaving the monolith and need to get the ELF executable down (e.g. server side ball-of-mud, small embedded systems).
Unfortunately we may need an entirely new generation of compiler infrastructure to exploit these opportunities because LLVM is really a C/C++ compiler at heart and may not be happy about taking big changes that can't benefit C/C++.
You are shifting goalposts. What the parent poster said is that a linear scan will read more memory than a binary search, therefore it will cause more churn in the cache (assuming same cache policies, which I don't know is a safe assumption), therefore linear scan may actually be less cache friendly - given that you're not interested in the data in any other way besides for the scan - by way of putting other parts of the program whose data was pushed out of the cache at a disadvantage.
Implied was also that linear scan might actually result in a slower program, even if the scan itself might be faster than a binary search.
What you replied is "The CPU will load less cache data with linear searches", which I assume to be false, although I will be happy to learn that it's actually the case (for example because the CPU has a clever cache eviction policy, executing linear scans by streaming in memory into a small part of the cache instead of thrashing the whole cache).
Let's say we have an array of 200 4-byte ints. Let's do a linear array scan and on average will touch 100 of them (or about 7 cache lines). Now let's say we have each integer in its own linked node in a BST in a totally random memory location. We can expect to need about 7 steps as well (2^8 = 256 > 200), which is 7 cache lines. So about 200 is already a lower bound where BST is more efficient in terms of visited cache lines, even for a pessimistic inefficient data structure like this 4-byte integer example.
Pypy outperforms cpython for the simple reason that pypy is a jit where cpython is a basic bytecode interpreter. Anything else is icing on the cake.
The only reason python seems slow as an implementation language is because it was traditionally slow; the reason it was traditionally slow is that cpython, the only major implementation, is slow. Common lisp, for instance, is similarly dynamic (moreso, in fact), also is also generally natively compiled by itself, and is quite performant.
One quick example: standard-class objects in Lisp are mostly more dynamic than python objects[1] but structure-class objects are quite static, and compilers will use this to their advantage.
On top of this, functions in python are in fact objects; in python you can do this just fine:
def foo():
foo.bar = 3
def bar():
print(foo.bar)
There are obviously other ways to accomplish this in lisp, but the fact that someone might do something like this can inhibit optimizations.If you read the Lisp specification you will see dozens of places where they allow for opting out of dynamism in ways that help the compiler optimize. Type declarations are one and the fact that a function defined in a single file is considered to have a fixed definition for the scope of the file is another.
1: It's not quite as easy to add new slots to Lisp standard-class instances on the fly as it is to python class instances, but it's doable, and Lisp, of course, supports class redefinition in a way that few other languages (including Python) do.
https://morepypy.blogspot.com/2011/04/tutorial-part-2-adding...
To put it another way, there's a reason PyPy trails CPython language features on the scale of literal years. This is not explained by the affordances of the language they're written in! They're just projects with very different requirements which, on some platforms, happen to run a lot of similar-looking code.
I have nothing against Rust personally, but it's ultimately not an apples-to-apples comparison if they're not implementing the same algorithm, or even if they're not using the same mechanisms (e.g. Rust explicitly using SSE intrinsics, which are certainly available to C in just as idiomatic a fashion).
and you can always compare “equivalent” algorithms as some languages may not be able to efficiently express the same algorithm as another. i know what you are after, but trying to have some benchmark that is “fair” according to some spec that is important to your needs will just be seen as pointless to others. benchmark game at least lets us see what the performance ceiling is of various languages, and anytime you object to a language having a worse implementation you may go fix it.
i think what anyone with a clue understands is that C C++ and Rust all have roughly equivalent performance ceilings. Nobody really thinks Rust is faster than C now.
The submission title would beg to differ.
(I know, it explicitly calls out "benchmarks" as the context.)
I think languages like Rust or Swift have significant advantages around safety over C/C++, while not sacrificing much in terms of performance. But if one language's benchmark contributors are willing to put in more effort than another's to eke out additional performance, then you're going to see skewed results in favor of whichever has the more fervent evangelists or whichever language has more to prove.
If the goal is to compare performance of two languages which can express the same optimization in exactly the same way, and only one uses it, then the benchmarks fail in that respect.
Yep. There are good reasons why this set of benchmarks is called The Computer Language Benchmarks Game.
Not if it includes a JIT compiler, as PyPy does. I don't know much about PyPy, but if most of the time is spent in JITted machine code, the fact that it's written in Python may not affect performance much.
Not really: It's the compilation strategy (that whole meta-tracing JIT compiler thing, compared to a simple bytecode interpreter) that makes the difference, not the surface syntax of the implementation language - which is actually not 'real' Python, but a restricted, less dynamic subset known as RPython.
Also note that the CPython is deliberately kept 'dumb'.
http://cm.bell-labs.co/who/dmr/chist.html
> Thompson was faced with a hardware environment cramped and spartan even for the time: the DEC PDP-7 on which he started in 1968 was a machine with 8K 18-bit words of memory and no software useful to him. While wanting to use a higher-level language, he wrote the original Unix system in PDP-7 assembler. At the start, he did not even program on the PDP-7 itself, but instead used a set of macros for the GEMAP assembler on a GE-635 machine. A postprocessor generated a paper tape readable by the PDP-7.
> Although we entertained occasional thoughts about implementing one of the major languages of the time like Fortran, PL/I, or Algol 68, such a project seemed hopelessly large for our resources: much simpler and smaller tools were called for. All these languages influenced our work, but it was more fun to do things on our own.
> By 1970, the Unix project had shown enough promise that we were able to acquire the new DEC PDP-11. The processor was among the first of its line delivered by DEC, and three months passed before its disk arrived. Making B programs run on it using the threaded technique required only writing the code fragments for the operators, and a simple assembler which I coded in B; soon, dc became the first interesting program to be tested, before any operating system, on our PDP-11. Almost as rapidly, still waiting for the disk, Thompson recoded the Unix kernel and some basic commands in PDP-11 assembly language. Of the 24K bytes of memory on the machine, the earliest PDP-11 Unix system used 12K bytes for the operating system, a tiny space for user programs, and the remainder as a RAM disk. This version was only for testing, not for real work; the machine marked time by enumerating closed knight's tours on chess boards of various sizes. Once its disk appeared, we quickly migrated to it after transliterating assembly-language commands to the PDP-11 dialect, and porting those already in B.
> By 1971, our miniature computer center was beginning to have users. We all wanted to create interesting software more easily. Using assembler was dreary enough that B, despite its performance problems, had been supplemented by a small library of useful service routines and was being used for more and more new programs. Among the more notable results of this period was Steve Johnson's first version of the yacc parser-generator [Johnson 79a].
> By early 1973, the essentials of modern C were complete. The language and compiler were strong enough to permit us to rewrite the Unix kernel for the PDP-11 in C during the summer of that year. (Thompson had made a brief attempt to produce a system coded in an early version of C—before structures—in 1972, but gave up the effort.)
What failed was the initial port attempt to C, not UNIX running on PDP-7.
While I am quite critic of C, I take the effort to learn and criticise properly, because regardless of my opinion it isn't going away on my lifetime and bills need to be paid.
Yes, but that is the key bit right, the fact that C didn't have structures yet is what caused the effort to be aborted, and once those were added they succeeded.
the reason is under-funding, and the lack of developers to the cause as a consequence, not the portability complexities of jit.
> Notably, hugely popular apps like Reddit can break altogether on a nearly daily basis (never mind more minor bugs, like some UI widget breaking) and they're still content to write in a completely dynamic language.
Yes, but I wonder if people are sometimes being irrational here. You pay for those breaks out of the devops budget. Moving breakage to earlier in the developer cycle typically reduces the cost of fixing it.
Regarding OpenSSL, there are a few reasons that isn't going to be rewritten in Rust anytime soon. The developers probably don't know Rust. OpenSSL runs in lots of places that don't have a Rust toolchain. Even if you rewrite it in Rust you're stuck with a bad API so why bother. In general there are more reasons against rewrite code in Rust than against writing new code in Rust ... and more reasons against writing new code in Rust that interfaces with old code than writing new greenfield code in Rust.
Only for organizations where "devops" means "ops". If your organization practices continuous deployment, then getting a bug fix out is no big deal (merge your PR to master and make sure the deployment doesn't fail); certainly no devops time is required.
> Moving breakage to earlier in the developer cycle typically reduces the cost of fixing it.
This axiom originates in the days of waterfall development and annual software deployments shipped to customers on shrink-wrapped physical media. It's not nearly as costly as our dated CompSci or SoftEng educations make it out to be.
Your average web service isn't going to benefit on balance from moving from Go to Rust because any gains in finding/preventing bugs sooner are going to be dwarfed by the slowed pace of development.
Of course, someone will point out that data races of the sort that Rust precludes are really hard to debug, which is true, but they're also very rare because request handling rarely requires parallel access to local resources (mostly things like database connections, where the locking is handled by the database library). The total cost of dealing with these for your average web application is still going to be dwarfed by the smaller tax Rust imposes on every change you make.
Note that Rust has improved significantly, both by accepting more correct programs (laxer rules such as non-lexical lifetimes) and by improving tooling (rust analyzer). I'm not sure if they'll ever close the gap completely such that using Go/C#/etc are relegated to certain niche use cases, but I think it will continue to steal more and more of the application space that leans toward performance- or correctness-sensitive.
> Original C was created after UNIX already existed, used for UNIX's first rewrite and until UNIX source code left Bell Labs, it only worked on PDP-11 computers.
Regardless of failed porting effort, the original UNIX, implemented in Assembly and running on the PDP-7, was already being used by a small community.
are you maybe thinking of iterators? you have to assume iterators are invalid after a move.
EDIT: seems I was wrong, and SSO is allowed for std::string. A similar optimization is illegal for std::vector, though, for the reasons I gave above.
If you are accessing data via some mechanism of IPC across processes, e.g. database access to same data rows without proper transactions, accessing shared files without locks, GPU memory data,.., affine types won't save you.
Please make a specific suggestion.
For analysis, it increases review time and increases the burden on contributors. The extent to which any one person can review all analyses isn't clear to me, and they would likely need to rely on contributors to self regulate them. But I would expect the maintainers to review them for some minimal level of quality. Being more specific than this would require a more thorough proposal. Given my time constraints and how much I dislike interacting with you, it's not something I can do right now.
As for increasing contributor barriers, yes, adding an analysis requirement would do that. However, I think you could compensate for that partially by reducing existing barriers. I've submitted code to your game before, and I'm unlikely to ever do it again. One reason is that interacting with you was unpleasant for me. But I found the interface difficult to navigate (that has since changed) and the rules difficult to discover. I think barriers could be reduced by switching to github (one of the reasons I use github) with more automated checks on submissions and better documented rules. (At the time when I worked on a submission it was unclear for example how to submit a Rust program that used external crates.)
Invariably though, providing an analysis is still more work. IMO, I think it would add rnough value to be worth it. I imagine many amalyses would build on top of others, similar to how code submissions do that today. As can be seen in any thread about the Benchmark Game, there are tons of comments that misunderstand things or make wild guesses. (Followed up invariably a day later with coy quips from you.) Some of those comments are from people that are unlikely to be helped by my analysis idea. But a lot aren't. I think a lot of people would benefit from an explanation of the performance characteristics of the program.
I generally try not to publish benchmarks without analysis. Here's one example.[1] I care a lot about it and I think it has immense benefits in terms of educating others. But reasonable people can disagree. You might have different values than me, which influences whether my suggestions are applicable to you or not.
As for the disclaimer, I don't really see the point of being more specific. There isn't much to this one? A simple suggestion might be to put it at the top of each page or even in the description of each benchmark. I don't know how much this would help things. Maybe people would still just ignore it.
People regularly misinterpret benchmarks and especially those in your Benchmark Game. From what I can tell, you as the maintainer have done very little to address that. Maybe you don't care. Maybe you think it isn't your problem to solve. Maybe you think it can't be solved or even improved, and therefore it isn't worth doing anything. I don't know. We might just disagree, which is fine.
Invariably, you can just respond and tell me to go do it. I think that's a cheap response, but I know you're fond of it. I've been thinking a lot about it and considering starting my own benchmark. But it will likely be a few years before I'll get to it. It is a trying daunting amount of work.
For sure!
And that's way beyond the modest aims of the benchmarks game.
> People regularly misinterpret benchmarks … From what I can tell, you as the maintainer have done very little to address that.
Seems to me that misinterpretation is not something that is effectively addressed by website content; it's something that is effectively addressed by responding to what people say when they discuss benchmarks in forums like HN and proggit and … one person at a time.
> … you can just respond and tell me to go do it. I think that's a cheap response…
It isn't intended as a brush-off.
otoh as you show on proggit, it's when you "Just sit down and actually try to plan out what you would do." that you start to understand what is involved.
otoh I'd like to see others do all that stuff the benchmarks game does not do, in whatever way they choose https://pybenchmarks.org/
I really was asking for a specific suggestion for "disclaimers on the web site more discoverable".
> A simple suggestion might be to put it at the top of each page or even in the description of each benchmark. I don't know how much this would help things. Maybe people would still just ignore it.
https://www.reddit.com/r/rust/comments/kpqmrh/rust_is_now_ov...
People see what they are looking for.
no single person can maintain optimally implemented solutions over dozens of languages. its up to we the people to help out. you want perfectly equivalent comparisons, make it happen
You can have "soft" or you can have "warm" — which do you want?
You can have an apple or you can have an orange — which do you want?