Making Rust as Fast as Go

Making Rust as Fast as Go (christianfscott.com)

304 points by chrfrasco 6 years ago | 200 comments

matklad 6 years ago |

Note that Rust and Go programs differ in a seemingly insignificant, but actually important detail.

Rust does this

    next_dist = std::cmp::min(
        dist_if_substitute,
        std::cmp::min(dist_if_insert, dist_if_delete),
    );

Go does this

    nextDist = min(
        distIfDelete,
        min(distIfInsert, distIfSubstitute),
    )

The order of the minimums is important for this dynamic programming loop. If I change the Rust version to take the minimums in the same order as Go (swapping substitute and delete), the runtime drops from 1.878696288 to 1.579639363.
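To make the two orderings concrete, here is a small Rust sketch. The function names `min3_rust_order` and `min3_go_order` are made up for illustration and do not appear in the benchmark. For integers both always return the same value, so any timing difference has to come from the instructions the compiler generates, not from the result.

```rust
use std::cmp::min;

/// Ordering quoted above from the original Rust program.
fn min3_rust_order(substitute: usize, insert: usize, delete: usize) -> usize {
    min(substitute, min(insert, delete))
}

/// Ordering quoted above from the Go program.
fn min3_go_order(substitute: usize, insert: usize, delete: usize) -> usize {
    min(delete, min(insert, substitute))
}
```

Swapping one for the other inside the inner loop is the whole change being described; the linked posts are the place to look for why the generated code differs.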

I haven't investigated this, but I would guess that this is the same effect I've observed in

* https://matklad.github.io/2017/03/12/min-of-three.html

* https://matklad.github.io/2017/03/18/min-of-three-part-2.htm...

(reposting my comment from reddit, as it's a rather unexpected observation)

ttd 6 years ago | |

Those are two really well written articles -- taking a complicated topic and making it very accessible. Thanks! FWIW I think a companion article on how to effectively use perf for tasks like this would be a great addition, since it can be a bit novice-unfriendly.

ishanjain28 6 years ago | |

I think they have since made the Rust version the same as the Go one, because I cloned the repo just now and both are the same. Also, thank you so much for the blog posts! :)

_____smurf_____ 6 years ago | |

Given this information, and for general parsing functionalities, which one is faster, Go or Rust?

arcticbull 6 years ago | | |

As always, it depends on what your goals are. String processing is usually not the long pole when you're building something that consumes the output of a parser. Based on micro and macro benchmarks, Rust is typically substantially faster than Go and pretty much always uses less RAM [1].

But again, depends what you're doing with the output, and if these deltas even matter in your context.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

fluffything 6 years ago | | |

You can write a parser as far as the limits of your imagination allow in Rust, or C, or any other baremetal language without a thick run-time.

In Go, beyond the limits of your imagination, you'll also hit other limits, like those of the garbage collector.

drfuchs 6 years ago | |

Very few things have ever been measured accurately to ten significant digits. How much did these numbers change run to run? How many measurements did you take? Were the caches warmed up similarly? Still, point taken.
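One way to answer the run-to-run question is to collect several timings and report their mean and spread rather than a single ten-digit number. A minimal sketch, assuming the timings are already available as seconds in a slice (the helper name `mean_and_std_dev` is hypothetical):

```rust
/// Returns (mean, sample standard deviation) of a set of timings in seconds.
/// Returns None when fewer than two samples are given, since the spread is
/// undefined in that case.
fn mean_and_std_dev(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample variance: divide by n - 1 rather than n.
    let variance = samples
        .iter()
        .map(|x| (x - mean).powi(2))
        .sum::<f64>()
        / (n - 1.0);
    Some((mean, variance.sqrt()))
}
```

If the difference between two variants is smaller than the standard deviation, it is probably not worth reading much into.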

matklad 6 years ago | | |

The excessive "precision" is because I've just copy-pasted what the original bench printed.

As for "is this a reliable result", I believe I've performed diligence, appropriate for a HN comment, to make sure that this is not a completely random result. As I've said, I did not investigate this particular bit of code thoroughly. You are welcome to read the linked blog posts, which study the issue in depth.

londons_explore 6 years ago | |

Since min(a, min(b,c)) == min(b, min(a,c)), perhaps the compiler should be smart enough to swap the comparisons around if it makes it quicker?

dan-robertson 6 years ago | | |

I suspect that statement is not true for floats. Possibly you don’t get the same float from min(0,-0) as min(-0,0), and similarly with NaNs. Rust specifies that if one input is NaN then the other is returned but doesn’t say what happens if both are NaN.
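A short Rust sketch of the NaN case. The benchmark itself works on `usize`, where none of this applies; `naive_min` is a hypothetical hand-rolled helper used only to show how argument order can matter for floats.

```rust
/// A hand-rolled min built on `<`. Any comparison with NaN is false,
/// so the result depends on which argument the NaN is passed as.
fn naive_min(a: f64, b: f64) -> f64 {
    if a < b {
        a
    } else {
        b
    }
}
```

With this helper, `naive_min(1.0, f64::NAN)` is NaN while `naive_min(f64::NAN, 1.0)` is `1.0`, so reordering the calls would change the result. The standard `f64::min` returns the non-NaN argument in both orders.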

arcticbull 6 years ago | | |

Part of the problem may be they re-implemented `std::cmp::min` at the bottom of the file, I wonder if there's a more optimized version in the stdlib.

chrfrasco 6 years ago |

Hey all, as some keen-eyed commenters have pointed out, it looks like the Rust program is not actually equivalent to the Go program. The Go program parses the string once, while the Rust program parses it repeatedly inside every loop. It's quite late in Sydney as I write this so I'm not up for a fix right now, but this post is probably Fake News. The perf gains from jemalloc are real, but it's probably not the allocator's fault. I've updated the post with this message as well.

The one-two combo of 1) better performance on linux & 2) jemalloc seeming to fix the issue lured me into believing that the allocator was to blame. I’m not sure what the lesson here is – perhaps more proof of Cunningham’s law? https://en.wikipedia.org/wiki/Ward_Cunningham#Cunningham's_L...

molf 6 years ago |

The Rust version uses `target.chars().count()` to initialise the cache, while the Go version counts up to `len(target)`. These are not equivalent: the Rust version counts Unicode code points, the Go version counts bytes.

I am confused by the implementations, although I have not spent any time testing them. Both versions contain a mix of code that counts bytes (`.len()` and `len(...)`) and Unicode code points (`chars()` and `[]rune(...)`). My guess is that the implementation might not work correctly for certain non-ASCII strings, but I have not verified this.

Of course, if only ASCII strings are valid as input for this implementation then both versions will be a lot faster if they exclusively operate on bytes instead.
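The mismatch between the two ways of counting is easy to see in Rust. A minimal sketch; `byte_and_char_len` is a made-up helper name, and the string is written with escapes so the exact code points are unambiguous.

```rust
/// Returns (UTF-8 byte length, number of Unicode scalar values).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For `"f\u{f6}\u{f6}"` (that is, "föö") this gives `(5, 3)`, because each "ö" takes two bytes in UTF-8. For plain ASCII such as `"foo"` the two counts agree, which is why the mix of byte and code point lengths goes unnoticed on ASCII input.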

ishanjain28 6 years ago |

I tried to benchmark Go/Rust versions as well.

I made 4 changes in Rust version.

1. Moved up the line that gets a value from cache[j+1] before any calls are made to cache[j]. This removes 1 bound check. (Improvement from 182,747ns down to 176,xyzns +-4800)

2. Moved from .chars().enumerate() to .as_bytes() and manually tracking current position with i/j variables. (Improvement from 176,xyz ns down to 140,xyz ns)

3. Moved to the standard benchmark suite from main + handrolled benchmark system.(File read + load + parse into lines was kept out of benchmark)

4. Replaced hand rolled min with std::cmp::min. (improvement from 140,xyz down to 139,xyz but the std deviation was about the same. So Could just be a fluke. Don't know)

In Go version, I made three changes.

1. Doing the same thing from #1 in Rust actually increased the runtime from 190,xyz to 232,xyz, and quite consistently too. (I ran it 10+ times to confirm.)

2. Replaced []rune(source), []rune(target) to []byte(source), []byte(target). (Improvement from 214817ns to 190152 ns)

3. Replaced hand rolled bench mark system with a proper bench mark system in Go. (Again, File read + load + parse into lines was kept out of benchmark)

So, At the end of it, Rust version was about 50k ns faster than Go version.

Edit #1:

In the Rust version, I had also replaced the cache initialization with (0..=target.len()).collect() before doing anything else. This also gave a good perf boost but I forgot to note down the exact value.
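For reference, a byte-based version along the lines described above might look like the following. This is a sketch under assumptions, not the code that was actually benchmarked: the function name `levenshtein_bytes` is hypothetical, and it reads `cache[j + 1]` first in each inner iteration and initializes the row with `(0..=target.len()).collect()`.

```rust
use std::cmp::min;

/// Single-row Levenshtein distance over raw bytes.
/// Only meaningful as "edit distance in characters" for ASCII input.
fn levenshtein_bytes(source: &str, target: &str) -> usize {
    let source = source.as_bytes();
    let target = target.as_bytes();
    // Row 0 of the DP table: distance from "" to each prefix of target.
    let mut cache: Vec<usize> = (0..=target.len()).collect();
    for (i, &s) in source.iter().enumerate() {
        let mut diagonal = cache[0]; // value from the previous row, column 0
        cache[0] = i + 1;
        for (j, &t) in target.iter().enumerate() {
            let above = cache[j + 1]; // read the higher index first
            let cost = if s == t { 0 } else { 1 };
            let next = min(diagonal + cost, min(cache[j] + 1, above + 1));
            diagonal = above;
            cache[j + 1] = next;
        }
    }
    cache[target.len()]
}
```

`levenshtein_bytes("kitten", "sitting")` is 3, as expected. `levenshtein_bytes("a", "\u{f6}")` is 2 rather than 1, because "ö" is two bytes, which is the trade-off of working on bytes.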

blablabla123 6 years ago | |

I'd be really surprised to hear that Go is supposed to be faster than Rust. Of course Rust is a bit newer but to me it always sounded like Go is fast because it's static but it doesn't have to be high-speed if that would sacrifice conciseness. Given that this is an artificial example, this here looks more realistic: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

arcticbull 6 years ago | | |

Rust and Go are contemporaries. Rust started in 2006 at Mozilla, and the first Go public release from Google was in 2009, meaning it probably started at the same time.

Except of course all the Plan 9 garbage (like Go's hand-rolled assembler) brought in to underpin Go from the 80s ;)

burntsushi 6 years ago | |

Note that using bytes is a fundamentally different implementation that will produce different results on non-ASCII input. Using codepoints (or "runes") will better approximate edit distance based on visual characters. (And grapheme clusters would be even better. Although one could put the text in composed normal form to get more mileage out of the rune based algorithm.)
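The point about composed normal form can be illustrated with a single accented letter. A small sketch using only the standard library (the constant and function names are made up for illustration; actual normalization would need an external crate and is not shown):

```rust
/// "é" as one precomposed code point (U+00E9).
const COMPOSED: &str = "\u{e9}";

/// "é" as "e" followed by a combining acute accent (U+0301).
const DECOMPOSED: &str = "e\u{301}";

/// Number of code points ("runes" in Go terms) in a string.
fn code_points(s: &str) -> usize {
    s.chars().count()
}
```

Both strings render as the same visual character, yet `code_points(COMPOSED)` is 1 and `code_points(DECOMPOSED)` is 2, and their byte lengths are 2 and 3. A code-point-based edit distance therefore treats them differently unless the input is normalized first.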

codeflo 6 years ago |

I recently did some experiments with creating small static Rust binaries, custom linking, no_std et cetera. A lot of stuff around that kind of thing is unstable or unfinished, which might be somewhat expected. But I’ve also come to the conclusion that Rust relies on libc way too much. That might be fine on Linux, where GNU’s libc is well-maintained, but it is a bit questionable on macOS (as seen in this article) and is a complete distribution nightmare on Windows (in no small part due to a series of very questionable decisions by Microsoft).

My understanding is that Go doesn’t use the libc at all and makes system calls directly, which IMO is the correct decision in a modern systems programming language that doesn’t want to be limited by 40 years of cruft.

devit 6 years ago |

The main problem is that allocating a vector for each evaluation is completely wrong: instead, it needs to be allocated once by making the function a method on a struct containing a Vec; which makes the allocator moot.

The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result, or even better interning the characters and having versions of the inner loop for 32-bit chars and 8-bit and 16-bit interned indexes.

Furthermore the code rereads cache[j] instead of storing the previous value, and doesn't do anything to make sure that bound checks are elided in the inner loop (although perhaps the compiler can optimize that).
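A sketch of what the first three suggestions could look like together, assuming a hypothetical `Levenshtein` struct that owns its buffers. It allocates once and reuses the vectors across calls, decodes the target to `char`s once per call, and keeps the previous values in locals instead of rereading the row. This is an illustration, not the benchmarked code.

```rust
/// Reusable edit-distance calculator; buffers live as long as the struct.
struct Levenshtein {
    cache: Vec<usize>,
    target: Vec<char>,
}

impl Levenshtein {
    fn new() -> Self {
        Self { cache: Vec::new(), target: Vec::new() }
    }

    fn distance(&mut self, source: &str, target: &str) -> usize {
        // Decode the target once; clear() keeps the existing capacity.
        self.target.clear();
        self.target.extend(target.chars());
        let n = self.target.len();
        self.cache.clear();
        self.cache.extend(0..=n);

        for (i, s) in source.chars().enumerate() {
            let mut diagonal = self.cache[0];
            let mut left = i + 1; // value just written to the left in this row
            self.cache[0] = left;
            for (j, &t) in self.target.iter().enumerate() {
                let above = self.cache[j + 1];
                let cost = usize::from(s != t);
                let next = (diagonal + cost).min(left + 1).min(above + 1);
                diagonal = above;
                left = next;
                self.cache[j + 1] = next;
            }
        }
        self.cache[n]
    }
}
```

A caller would create one `Levenshtein::new()` outside the benchmark loop and call `distance` repeatedly; after the first few calls the vectors no longer need to grow, so the allocator drops out of the picture.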

The code for computing the min seems to have been written mindlessly rather than putting serious thought towards whether to have branches or not and in what order (depending on an analysis of what the branch directions rates would be).

Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

tromp 6 years ago |

Missed chance to shorten title to "Making Rust Go Fast" :-)

katktv 6 years ago | |

Making Rust As Fast As It Can Go

codegladiator 6 years ago | | |

Making the code rust as fast as it can go

kreetx 6 years ago | | |

Go Rust!

mqus 6 years ago |

Is it intentional that the benchmarks include not only running the program itself but also compiling it? e.g. in the linked source code, the bench.sh includes the compilation step which is known to be slow in rust:

    #!/usr/bin/env bash

    set -e
    run() {
      cargo build --release 2> /dev/null
      ./target/release/rust
    }

    run;

Sure, if you run it many times in succession the compiler won't do much but the benchmarking script (run.js) doesn't really indicate that and the blog post also doesn't mention that.

EDIT: I was just being stupid, don't mind me. The times were taken within each language and not externally.

chrfrasco 6 years ago | |

run.js is not doing the benchmarking. If you look at the source for each of the programs being benchmarked, you'll see that the programs themselves are responsible for benchmarking
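In other words, the timing happens inside each program, so compile time is excluded. A minimal sketch of that pattern in Rust, assuming a hypothetical helper `time_iterations` (the actual benchmark code may be structured differently):

```rust
use std::time::{Duration, Instant};

/// Runs `work` `iterations` times and returns the total elapsed wall-clock time.
/// The clock starts after the program is already compiled and running.
fn time_iterations<F: FnMut()>(iterations: u32, mut work: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        work();
    }
    start.elapsed()
}
```

A program would call this around the workload and print `elapsed.as_secs_f64()`, which is consistent with the long decimal figures quoted elsewhere in the thread.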

alvarelle 6 years ago |

Could also try to use the smallvec crate in this case, which puts small allocations on the stack: https://docs.rs/smallvec/

arcticbull 6 years ago |

There's a bunch of issues with the Rust implementation. Not least: in the initial condition, where the source or target length is zero, it returns the number of UTF-8 bytes of the other string, but all other computations are performed in terms of Unicode chars -- except at the end, where `cache[target.len()]` will return the wrong value if the target contains any non-ASCII characters.

Further, each time you call `.chars().count()` the entire string is re-enumerated at Unicode character boundaries, which is O(n) and hardly cheap, hence wrapping it in an iterator over char view.

Also, re-implementing std::cmp::min at the bottom there may well lead to a missed optimization.

Anyways, I cleaned it up here in case the author is curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

hu3 6 years ago |

I'm surprised that a naive implementation in Go can outperform a naive implementation in Rust.

empath75 6 years ago | |

I’m not. Hell, when I first started learning rust i frequently wrote code that ran slower than _python_.

virtualritz 6 years ago |

I tried this on a spare time project[1]. Runtime in a quick test went down from 14.5 to 12.2 secs on macOS!

So a solid ~15% by changing the allocator to jemalloc.

However, I now have a segfault w/o a stack trace when the data gets written at the end of the process.

Possibly something fishy in some `unsafe{}` code of a dependent crate of mine that the different allocator exposed. :]

Still – no stack trace at all is very strange in Rust when one runs a debug build with RUST_BACKTRACE=full.

[1] https://github.com/virtualritz/rust-diffusion-limited-aggreg...

saagarjha 6 years ago | |

I have found that jemallocator is currently broken on macOS Catalina, so that might be the problem. If you can reproduce this issue reliably, I'd love to hear about it because I can't myself unless I use a very specific toolchain that produces -O3 binaries that are a real pain to work with.

virtualritz 6 years ago | | |

It's 100% reproducible. Just check out the previous to last commit on master on the github repo I linked to and run the tool with any command that invokes the nsi crate.

Eg.:

   > rdla dump foo.nsi

should produce the segfault before exiting the process.

Is there a jemallocator ticket where I can attach a report for this?

submeta 6 years ago |

Impressed to see four posts about Rust on the front page of HN simultaneously.

anderskaseorg 6 years ago |

I’ve found that Microsoft’s mimalloc (available for Rust via https://crates.io/crates/mimalloc) typically provides even better performance than jemalloc. Of course, allocator performance can vary a lot depending on workload, which is why it’s good to have options.

savaki 6 years ago |

This discussion seems to me like a microcosm of the differences in philosophies between Rust and Go.

With Rust, you have much more control, but you also need a deep understanding of the language to get the most out of it. With Go, the way you think it should work is usually Good Enough™.

jeffdavis 6 years ago | |

I wouldn't put it that way. Both languages are fast at nice straight-line code.

The main area I'd expect to see performance benefits for rust (though I don't have experience here) is larger rust programs. Rust's zero-cost abstractions have more benefits as the abstractions nest more deeply. For a small program, you don't really have a lot of abstractions, so Go will do just fine.

I think Go has a number of nice performance tricks up its sleeve, though, so I wouldn't rule out Go on performance grounds too quickly.

novocaine 6 years ago |

It may be that the system allocator is making an excessive number of syscalls to do work, whereas most custom allocators will allocate in slabs to avoid this. You could try using dtruss or strace to compare the differences.

savaki 6 years ago |

A few folks have commented that there were logic errors in the Go version. Specifically that

  len("föö") = 5

should instead have returned

  len("föö") = 3

I submitted a pull request, https://github.com/christianscott/levenshtein-distance-bench..., that fixes these issues in the Go implementation.

Interestingly enough, when I re-ran the benchmark, the Go version is roughly 19% faster than it was previously:

  old: 1.747889s
  new: 1.409262s (-19.3%)

fortran77 6 years ago |

loeg 6 years ago |

FreeBSD's system allocator is jemalloc :-).

goranmoomin 6 years ago |

TLDR for people who didn't read:

The speed difference came from the allocator.

Rust switched from jemalloc to the system allocator per ticket #36963[0] for various reasons (like binary bloat, valgrind incompatibility, etc...).

Go uses a custom allocator[1] instead.

To make 'Rust Go fast' (pun intended), one can use the '#[global_allocator]' to use a custom allocator (in this case, with the jemallocator crate) to make allocations fast again.

[0]: https://github.com/rust-lang/rust/issues/36963

[1]: https://golang.org/src/runtime/malloc.go
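For readers who have not used `#[global_allocator]`, here is a self-contained sketch of the mechanism using only the standard library. `CountingAllocator` is a hypothetical wrapper around the system allocator that counts allocations; it is a stand-in so the example builds without extra crates. The commented-out lines show the form documented by the jemallocator crate, which requires adding that crate as a dependency.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of allocations made so far through the global allocator.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper: forwards to the system allocator and counts calls.
struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// With the jemallocator crate as a dependency, one would instead write:
// #[global_allocator]
// static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Reads the allocation counter.
fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}
```

Counting allocations this way is also a cheap check on how much the benchmark loop allocates at all, which bears on whether the allocator is really the bottleneck.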

k__ 6 years ago | |

The comments of Rust programmers here also suggest that the Rust implementation is, indeed, different from the Go implementation.

pcr910303 6 years ago | | |

It was just a summary of the post contents - the post suggests that the biggest difference comes from the allocator.

maoeurk 6 years ago |

Assuming this was run on a 64bit system, the Rust version seems to be allocating and zeroing twice as much memory as the Go version.

edit: this has been pointed out as incorrect, Go ints are 8 bytes on 64bit systems -- thanks for the correction!

  let mut cache: Vec<usize> = (0..=target.chars().count()).collect();

which can be simplified as

  let mut cache: Vec<usize> = vec![0; target.len()];

versus the Go version


  cache := make([]int, len(target)+1)
  for i := 0; i < len(target)+1; i++ {
    cache[i] = i
  }

Rust usize being 8 bytes and Go int being 4 bytes as I understand it.

So between doing more work and worse cache usage, it wouldn't be surprising if the Rust version was slower even with the faster allocator.
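A quick Rust check of the two initializations quoted above shows they are not interchangeable: the range form builds `n + 1` entries holding `0..=n`, while `vec![0; n]` builds `n` zeros. The helper names below are hypothetical. The element size is also easy to confirm, since `usize` is pointer-sized.

```rust
use std::mem::size_of;

/// What the benchmark's Rust line builds: n + 1 entries holding 0, 1, ..., n.
fn identity_row(n: usize) -> Vec<usize> {
    (0..=n).collect()
}

/// What `vec![0; n]` builds: n entries, all zero.
fn zeroed_row(n: usize) -> Vec<usize> {
    vec![0; n]
}

/// Bytes per element of a Vec<usize> on the current target.
fn bytes_per_entry() -> usize {
    size_of::<usize>()
}
```

On a 64-bit target `bytes_per_entry()` is 8, which, per the correction above, matches Go's `int` on the same platform, so the memory footprint of the two rows is the same.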

rossmohax 6 years ago | |

Go int is 8 bytes

eis 6 years ago | | |

It can be either depending on the system.

https://golang.org/ref/spec#Numeric_types

    linux-vdso.so.1 (0x00007ffe3d7f0000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc7ac05a000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc7abc69000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fc7ac279000)

    let target_chars: Vec<char> = target.chars().collect();
    for (i, source_char) in source.chars().enumerate() {
        let mut next_dist = i + 1;
        for (j, target_char) in target_chars.iter().enumerate() {