Why Discord is switching from Go to Rust (2020)

Previous Discussion

https://news.ycombinator.com/item?id=22238335

verdagon 4 years ago |

I suspect that GC'd languages could mitigate this problem by introducing regions: separate areas of memory that cannot point at each other. Pony actors [0] have them, and Cone [1] and Vale [2] are trying new things with them.

If golang had this, then it might not ever need to run its GC because it could just fire up a new region for every request. The request will likely end and blast away its memory before it needs to collect, or it could choose to collect only when that particular goroutine/region is blocked.

Extra benefit: if there's an error in one region, we can blast it away and the rest of the program continues!

[0] https://tutorial.ponylang.io/types/actors.html#concurrent

[1] https://cone.jondgoodwin.com/fast.html

[2] https://verdagon.dev/blog/seamless-fearless-structured-concu...
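
To make the region-per-request idea above concrete, here is a minimal sketch in Go (1.18+ for generics). It is only an approximation: Go has no regions, so `Region`, `NewRegion`, `Alloc`, `Release`, `Len`, `Message`, and `handleRequest` are all hypothetical names invented for illustration, and the memory is still owned and eventually reclaimed by Go's GC. What the sketch shows is the shape of the idea: each request allocates from its own area and drops the whole area at once when it ends.

```go
package main

import "fmt"

// Region is a hypothetical per-request bump allocator for values of type T.
// It hands out pointers into pre-allocated chunks, so a request performs a
// few large allocations instead of many small ones.
type Region[T any] struct {
	chunks    [][]T
	chunkSize int
}

// NewRegion creates an empty region whose chunks hold chunkSize values each.
func NewRegion[T any](chunkSize int) *Region[T] {
	if chunkSize < 1 {
		chunkSize = 1
	}
	return &Region[T]{chunkSize: chunkSize}
}

// Alloc returns a pointer to a zeroed T that lives inside the region.
// Chunks are never grown past their capacity, so earlier pointers stay valid.
func (r *Region[T]) Alloc() *T {
	n := len(r.chunks)
	if n == 0 || len(r.chunks[n-1]) == cap(r.chunks[n-1]) {
		r.chunks = append(r.chunks, make([]T, 0, r.chunkSize))
		n++
	}
	c := r.chunks[n-1]
	c = c[:len(c)+1]
	r.chunks[n-1] = c
	return &c[len(c)-1]
}

// Len reports how many values have been allocated from the region.
func (r *Region[T]) Len() int {
	total := 0
	for _, c := range r.chunks {
		total += len(c)
	}
	return total
}

// Release drops every chunk at once. In Go the memory is still reclaimed by
// the GC later; a real region would free it here without any tracing.
func (r *Region[T]) Release() {
	r.chunks = nil
}

// Message is a stand-in for per-request data.
type Message struct {
	ID   int
	Body string
}

// handleRequest allocates all of its temporary data from one region and
// releases the region when the request ends.
func handleRequest(n int) int {
	region := NewRegion[Message](64)
	defer region.Release()

	sum := 0
	for i := 0; i < n; i++ {
		m := region.Alloc()
		m.ID = i
		sum += m.ID
	}
	return sum
}

func main() {
	fmt.Println(handleRequest(100)) // prints 4950
}
```

Note that nothing here stops a pointer from escaping the region, which is exactly the guarantee that the languages linked above try to provide at the type-system level.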

jatone 4 years ago | |

or, you know, just pace the GC's mark-and-sweep algorithm, which is what go is doing now.

verdagon 4 years ago | | |

Correct me if I'm wrong, but IIRC pacing would still cause a latency spike, it would just be a more strategically-timed latency spike.

jatone 4 years ago | | |

sure - https://github.com/golang/go/issues/44167. in the new design you'll see that GC CPU utilization only increases when you're actually allocating heavily, which makes sense: you're doing more work. this should completely resolve the problem discord had, since their system was in a steady state.

abeltensor 4 years ago | | |

You are always going to have some kind of latency spike with a sweeping GC; even if that spike is tiny.

jatone 4 years ago | | |

not really. if you're in a steady state (like discord was) you wouldn't see any spikes. you'd have a consistent utilization. if you start allocating heavily then you would potentially see an increase. which makes sense, you're increasing your workload, utilization needs to increase. but still not necessarily a 'spike'

erikbye 4 years ago |

Not discrediting Rust, but I've noticed you rarely hear "we improved performance by rewriting our implementation in the same language"... although this, too, can yield similar performance improvements.

stonewareslord 4 years ago | |

In the article they describe their attempts at tuning, and they weren't happy with the results.

abeltensor 4 years ago | |

The main issue with Go in this regard is that it generally doesn't have multiple ways of doing the same thing, at least idiomatically. I mean, sure, there are architectural choices you could make in a rewrite, but the actual structure of the code itself is going to be very similar.

ceeplusplus 4 years ago | |

It does look like with some GC tuning (e.g. manually triggering GCs at a smaller interval than the Go automatic GC threshold) they might've mitigated the spikes, although I don't think they would have gotten the level of perf improvement they did. Golang assembly code IME is not very optimized compared to Rust/C++.

edit: reading comprehension skills are lacking, please see comment below for why I'm wrong

jholman 4 years ago | | |

I don't understand.... isn't this idea (triggering GC more often) explicitly discussed in the article?

ceeplusplus 4 years ago | | |

My understanding is they tried to tune the GC percent to make the automatic heuristic do GC sooner, but they didn't allocate enough to have that make a difference. However, Go has a way to manually trigger a GC which they could've set on a timer on a goroutine. If they weren't actually generating that much garbage then pause times should theoretically be pretty short if you're doing a GC every 5 seconds or something like that.

That being said it's not something that 100% is guaranteed to fix the issue so maybe they did test this and just didn't mention it in the blog.

jholman 4 years ago | | |

Okay, I see what you're saying about the timer. That isn't in TFA, agreed.

But I still don't understand, because....

(NB: I'm not a GC expert, just a curious amateur, so my apologies if there are errors in the following, and the opportunity to be corrected in these errors is part of why I'm posting this.)

Regarding the "not much garbage => theoretically times would be shorter", my understanding is that this is actually not how GC works. The GC time is a function of the size of the GC pool, because GC works by walking ("tracing") the tree of live references. So the only way to make GC faster is to have not less garbage, but less stuff allocated at all.

Multi-generational GC works by dividing the whole pool into smaller pools, so that most GC passes only visit the high-churn nursery, but even then some GC passes need to read the

TFA mentions this, where they say "the spikes were huge not because of a massive amount of ready-to-free memory, but because the garbage collector needed to scan the entire [thing we were keeping track of]".

That is, they had virtually no garbage to collect, and that wasn't speeding up the GC. Which is consistent with how all tracing GC works, as far as I know.

Comments/corrections/clarifications are requested!!
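
One way to poke at this claim is a small experiment, sketched here under some assumptions: `node`, `buildList`, `countNodes`, and `timeGC` are hypothetical helpers, and the 5,000,000-node list is an arbitrary size. The program times `runtime.GC()` once with almost nothing live and once with a large, pointer-heavy structure that is entirely live (so there is almost no garbage). Note that this measures the wall time of a whole collection cycle, not the stop-the-world pause; Go's collector does most of its marking concurrently, so the pause an application sees can be much shorter than these numbers.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// node is a small pointer-carrying object the collector has to trace.
type node struct {
	next *node
	val  int
}

// buildList allocates n linked nodes that all stay reachable from the head.
func buildList(n int) *node {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{next: head, val: i}
	}
	return head
}

// countNodes walks the list and returns its length.
func countNodes(head *node) int {
	n := 0
	for p := head; p != nil; p = p.next {
		n++
	}
	return n
}

// timeGC runs one full collection and reports how long the call took.
func timeGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	runtime.GC() // start from a settled heap
	small := timeGC()

	live := buildList(5_000_000) // large live set, almost no garbage
	runtime.GC()                 // clear any garbage left over from setup
	large := timeGC()

	fmt.Printf("GC cycle with small live heap: %v\n", small)
	fmt.Printf("GC cycle with large live heap: %v\n", large)
	fmt.Printf("live nodes: %d\n", countNodes(live))
}
```

If the reasoning above holds, the second timing should come out noticeably larger than the first even though nothing is being freed, since the collector still has to trace every live node.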

mohanmcgeek 4 years ago | | |

I remember when this article came out, everybody was pointing out the fact that they used a Go version that was several releases old.

Perhaps if the intent wasn't to convince their managers to let them write it in Rust, they would have tried using the latest Go version at the time?

tbillington 4 years ago | | |

https://news.ycombinator.com/item?id=31021719

mohanmcgeek 4 years ago | | |

Then why publish it at all?

Not to mention, the article made no effort to establish that it's describing the world 2 years prior to this being written

mountainriver 4 years ago | | |

They actually would have mitigated it by upgrading their Go version, because the latest release at the time had a fix in the runtime that would have basically solved this.

Turns out no one on the team actually looked into issues in the Go repo to see if it was being addressed. Looks like they just wanted to write Rust, which is fine, Rust is cool, but let’s not deceive ourselves.

steveklabnik 4 years ago | | |

That is not what happened. They did not publish the blog post immediately after the transition, and those changes to the GC did not happen until after the port happened. Some people made assumptions about the timeline that were incorrect, and those assumptions were then repeated.

The discussion at the time on Reddit [1] mentions this. The general discussion also covered whether the improvements, which were big in many cases, would even have helped in this particular case. We’ll never truly know.

That said it is important to recognize that Go’s GC has received significant upgrades over the years, and remember that what’s true in the past may not be true today.

1: https://www.reddit.com/r/programming/comments/eyuebc/why_dis...

mountainriver 4 years ago | | |

The point is that they could have found the issue and seen that a fix was about to be released. That would be good engineering. Bad engineering is when you don't find the root cause of your problem and see if it's being worked on.

steveklabnik 4 years ago | | |

Good engineering is when you solve the problems you have. Sometimes there are multiple ways to solve a problem. Just because they did not choose the solution you prefer (which, again, we're only speculating would actually solve the issue here; we don't have proof of that) does not make it poor engineering.

pfraze 4 years ago |

I’m told the Go GC has gotten better in recent years. Has anybody run a similar program in Go lately that can confirm that?

phendrenad2 4 years ago |

(2020)

(Anyone know if they're still using Rust?)

jhgg 4 years ago | |

Yes, we are using Rust in a big way. We have multiple teams now working full time on Rust. It is being used on both the client and server, as native modules, WebAssembly, and also native Rust services and NIFs that embed themselves in our Elixir services.

It has been an incredible success. I plan to blog more about it in the coming months. Our usage of Rust is continuing to grow, and if you check out our jobs page, you might notice all backend / infra jobs list Rust in them now :)

I think probably 40% of requests are handled directly by Rust services now, with the rest involving one or more Rust services called from our Python API layer.

SirGiggles 4 years ago | | |

Bit of a late reply, but how does Elixir fit into the overall strategy? Is it still like how it is described in previous engineering blogs where it acts as a kind of orchestrator for guilds?

abeltensor 4 years ago | | |

I love Rust Elixir NIFs. Gives you the best of both worlds to be honest. Highly fault tolerant code with fast computation. Only downside is that it can't really handle extreme crashes like a native process can.

wut42 4 years ago | | |

Erlang/Elixir + Rust is an awesome couple. For the downside you mentioned, depending on the use case, it could be interesting to use Rust as a node: https://github.com/sile/erl_dist

abeltensor 4 years ago | | |

Very nice. Nodes are great if you want a long running system alongside your Elixir app.

eatonphil 4 years ago | |

Their blog doesn't list all articles on a single page (so you could ctrl-f), doesn't have its own search, and doesn't have its own domain (so googling `site:discord.com rust` returns a mix of Discord communities and blog posts).

Makes it pretty hard to find stuff!

ajot 4 years ago | | |

You can search "site:discord.com/blog/ rust", it appears to work for me on DuckDuckGo or Google. It seems TFA is the latest article mentioning Rust.

eatonphil 4 years ago | | |

Woah I didn't realize you could filter on paths. I thought `site:` was domain only.

tedunangst 4 years ago | | |

HN site listing is also useless because they host the blog on the same domain as everything else.

https://news.ycombinator.com/from?site=discord.com

faitswulff 4 years ago | |

Looks like they're still hiring for it: https://www.google.com/search?q=rust+site%3Adiscord.com%2Fjo...

mc4ndr3 4 years ago |

How illuminating. From Cloudflare posts, I had been under the impression that Go's GC was incredibly unintrusive, with near-real-time performance for applications operating in increments of a few hundred milliseconds. For example, Cloudflare uses Go to analyze network traffic.

Yes, Rust provides a more predictable, faster memory management model than Go. At the expense of unpredictable, expensive memory leaks triggering application termination.

Curious how much time and effort was dedicated to improving gc, which is a useful endeavor in its own right.

alberth 4 years ago |

Isn’t this a function of them being such heavy Erlang users and writing NIFs (via Rustler) in Rust?

midrus 4 years ago |

Or they just got bored and wanted to try some shinier toy. I've seen this happen dozens of times. All the bullshit for justifying it is just that, bullshit.

Not saying this is the case here but highly likely.

butterisgood 4 years ago |

Well... is this still true? Go's had a lot of perf improvements in the last two years.

mohanmcgeek 4 years ago | |

It wasn't true even when they wrote the blog. Realistically this should read 2018 because apparently they waited two years before writing this blogpost.

Andys 4 years ago | |

There was a major performance boost in Go GC just after this happened

sys_64738 4 years ago |

C with coroutines?

loudtieblahblah 4 years ago |

meh. I'm switching from Discord to Guilded.

boxingrock 4 years ago |

isn't the tldr on this that Go let them scale up for years before it became the bottleneck? a natural progression for any successful project...