Making a Go program faster with a one-character change (hmarr.com)

511 points by hcm 3 years ago | 236 comments

ludiludi 3 years ago |

> If you read the title and thought “well, you were probably just doing something silly beforehand”, you’re right!

Don't feel too silly. Russ Cox, one of the technical leads on the Go language, made the same mistake in the regexp package of the standard library.

https://go-review.googlesource.com/c/go/+/355789

sakras 3 years ago |

A while ago at my company we switched from GCC to Clang, and noticed a couple of massive regressions (on the order of 50%?) in performance having to do with floating point.

After profiling for a bit, I discovered that suddenly a lot of time was spent in isinf on Clang and no time in GCC… Clang was emitting a function call where GCC wasn’t. I happened to randomly change isinf to std::isinf (it’s a random habit of mine to put std:: in front of these C functions). Suddenly the regression disappeared! I guess on Clang only std::isinf was a compiler intrinsic while GCC recognized both? Anyway, that’s my small-change optimization story.

10000truths 3 years ago | |

C defines isinf as a macro, whereas C++’s std::isinf is a function. Perhaps the discrepancy has to do with differences in how they’re evaluated?

jeffrallen 3 years ago | |

> random habit of mine to put std:: in front of these C functions

And did you learn your lesson about making random changes that "shouldn't matter" without proving they don't matter? :)

I find that once I spend the time to make these changes correctly, they are not worth the time to make correctly.

asim 3 years ago |

If you want a solid understanding and need to get it in just a few hours, here are a few things to review.

- The Go programming language spec https://go.dev/ref/spec

- Effective Go https://go.dev/doc/effective_go

- Advanced Go concurrency patterns https://go.dev/talks/2013/advconc.slide#1

- Plus many more talks/slides https://go.dev/talks/

erdaniels 3 years ago | |

I created this video on concurrency (maybe advanced) patterns a while back that some may find helpful but it's pretty long https://www.youtube.com/watch?v=U3_2xiPxyA8.

bogomipz 3 years ago | | |

This was really good. You should do more of these. Cheers.

lairv 3 years ago | |

The "How to write Go code" article https://go.dev/doc/code is also very useful to actually know how to structure a codebase

karmakaze 3 years ago |

> I did consider two other approaches:

  Changing Ruleset from being []Rule to []*Rule, which would mean we no longer need to explicitly take a reference to the rule.
  Returning a Rule rather than a *Rule. This would still copy the Rule, but it should stay on the stack instead of moving to the heap.

> However, both of these would have resulted in a breaking change as this method is part of the public API.

The problem with heap allocated objects could be due to the incorrect public API.

The change that improves performance also gives out pointers to the actual elements of Ruleset itself, permitting the caller to change the contents of Ruleset, which wasn't possible before the speed-up. Perhaps you're already aware, since a change to []*Rule was being considered.

mook 3 years ago | |

The best part is, there's a subtle API breakage here: the returned *Rule is now a reference to an element of the []Rule, so if the caller was previously modifying the returned value it'll change the slice.

It's debatable what API guarantees existed around this though; most of the time this would be unspecified.
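The aliasing that karmakaze and mook describe can be sketched as follows. `Rule`, `Ruleset`, `matchCopy` and `matchAlias` are simplified stand-ins invented for illustration, not the library's real types or methods.

```go
package main

import "fmt"

// Rule and Ruleset are simplified, hypothetical stand-ins.
type Rule struct {
	Pattern string
	Owner   string
}

type Ruleset []Rule

// matchCopy mirrors the old behaviour: copy the element, return a pointer
// to the copy.
func (rs Ruleset) matchCopy(pattern string) *Rule {
	for _, rule := range rs {
		if rule.Pattern == pattern {
			return &rule
		}
	}
	return nil
}

// matchAlias mirrors the new behaviour: return a pointer into the slice's
// backing array.
func (rs Ruleset) matchAlias(pattern string) *Rule {
	for i := range rs {
		if rs[i].Pattern == pattern {
			return &rs[i]
		}
	}
	return nil
}

func main() {
	rs := Ruleset{{Pattern: "*.go", Owner: "alice"}}

	rs.matchCopy("*.go").Owner = "bob"
	fmt.Println(rs[0].Owner) // alice: only the copy was modified

	rs.matchAlias("*.go").Owner = "bob"
	fmt.Println(rs[0].Owner) // bob: the caller changed the Ruleset itself
}
```

A caller that treated the returned `*Rule` as scratch space would silently start editing shared state after the change.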

ok_dad 3 years ago | |

Sometimes it doesn't matter if a public API is incorrect, because it's set in stone for whatever reason, and you just need to fix the problem internally.

spockz 3 years ago | | |

This is why I like https://github.com/openrewrite so much. One gets to tell users how to rewrite code automatically. It makes refactoring almost as easy as in a mono repo.

karmakaze 3 years ago | | |

The way to fix it in that manner here is to undo the 42% speedup and return the heap allocated object for the caller to mangle.

coder543 3 years ago |

There is potentially another option: use the midstack inliner to move the allocation from the heap to the stack of the calling function: https://words.filippo.io/efficient-go-apis-with-the-inliner/

As long as the global slice is never mutated, the current approach is probably fine, but it is definitely a semantic change to the code.
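A rough sketch of the inliner technique from the linked article, applied to a made-up `Ruleset`: the exported method is a tiny wrapper that declares the result variable and hands a pointer to an unexported worker. If the compiler inlines the wrapper, the variable can live in the caller's stack frame while the copy semantics are preserved. All names here are hypothetical, and whether the heap allocation actually disappears depends on the compiler's inlining and escape decisions in the caller.

```go
package main

import "fmt"

// Rule and Ruleset are hypothetical stand-ins for illustration.
type Rule struct {
	Pattern string
	Owner   string
}

type Ruleset []Rule

// Match is kept trivially small so that it is a candidate for inlining.
// Once inlined, out belongs to the calling function.
func (rs Ruleset) Match(path string) (*Rule, error) {
	var out Rule
	return rs.match(&out, path)
}

// match does the real work and writes the matching rule into out.
func (rs Ruleset) match(out *Rule, path string) (*Rule, error) {
	for i := range rs {
		if rs[i].Pattern == path {
			*out = rs[i] // copy into caller-provided storage
			return out, nil
		}
	}
	return nil, nil
}

func main() {
	rs := Ruleset{{Pattern: "*.go", Owner: "alice"}}
	rule, err := rs.Match("*.go")
	if err != nil || rule == nil {
		fmt.Println("no match")
		return
	}
	rule.Owner = "bob"                   // modifies the caller's copy only
	fmt.Println(rule.Owner, rs[0].Owner) // bob alice
}
```

Unlike returning `&rs[i]`, this keeps the original "caller gets its own copy" behaviour, which is why it is not a semantic change.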

ploxiln 3 years ago | |

That seems like overkill for this particular case, but it's a very interesting technique, thanks for the link!

throwaway232iuu 3 years ago | |

Why doesn't go use RVO like C++ and Rust?

https://en.wikipedia.org/wiki/Copy_elision#Background

coder543 3 years ago | | |

I don’t think we’re on the same page about what midstack inlining is being used for in my suggestion. This discussion is about eliminating a heap allocation, which as far as I understand, RVO never does. Please read the article I linked if you want to discuss this further. I don’t want to repeat the article pointlessly.

I’m also fairly sure Go uses RVO here too, which cuts down on the number of times the object is copied around, but again, it’s irrelevant to the discussion of heap allocations. Copying the object isn’t the performance problem here, needlessly allocating a very short-lived object on the heap over and over is.

morelisp 3 years ago | |

Packages should be exposing an API with destination slices more often to begin with. The stdlib is pretty good about this (there are a few missing, though 1.19 closed the most obvious absences), but most third-party code is awful. Or worse, it only takes strings.
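For readers unfamiliar with the "destination slice" style, here is a minimal sketch. `appendPair` is a hypothetical function written for this example; `strconv.AppendInt` is the standard-library call it builds on.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendPair is a hypothetical API in the destination-slice style: it
// appends "key=value" to dst and returns the extended slice, so the caller
// decides where the bytes live.
func appendPair(dst []byte, key string, value int64) []byte {
	dst = append(dst, key...)
	dst = append(dst, '=')
	return strconv.AppendInt(dst, value, 10)
}

func main() {
	buf := make([]byte, 0, 64) // backing array is reused across iterations
	for i, key := range []string{"a", "b", "c"} {
		buf = appendPair(buf[:0], key, int64(i))
		fmt.Println(string(buf))
	}
}
```

The caller can reuse one buffer across many calls instead of receiving a freshly allocated string or slice each time.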

tuetuopay 3 years ago |

Aaaaaand that's why I love Rust's decision to make copies explicit with `.clone()`. Annoying as hell when you're not used to it but overall worth it.

BlackFly 3 years ago | |

Except a lot of structs also derive and prefer `Copy` and a lot of rust code also avoids heap allocation which requires `Clone`. The `Copy` trait can be used implicitly like in the example here. On the other hand, due to the lack of garbage collector, you wouldn't be able to return the reference to the copy which might lead you to find your accidental copy.

tuetuopay 3 years ago | | |

I have yet to come across structs that implement `Copy` while being expensive to actually copy. The largest I can think of is `Uuid` from the `uuid` crate, which is 128 bits in size. This is a single word copy for most machines since modern hardware has 128-bit support for cases like this. Still, two 64-bit words to copy is definitely negligible: that's the equivalent of copying two pointers.

I agree with you for the garbage collector. By design, a GC allows you to willy-nilly copy without thinking about the consequences.

kangalioo 3 years ago | | |

The Copy trait can only be used for bitwise copies. Expensive copying with heap allocations will never happen implicitly

assbuttbuttass 3 years ago |

Returning a pointer to a local variable is convenient, but can be a source of hidden allocations.

It's best to treat each struct as a "value" or "pointer" type, and use one or the other consistently for each type. This mostly avoids the need to use & in the first place
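One way to see the hidden allocation is to count allocations directly. This is a sketch with invented names (`Rule`, `findCopy`, `findInPlace`); it uses `testing.AllocsPerRun` and the `//go:noinline` directive so the two functions are measured as written. Exact counts can vary by Go version, so no output is shown.

```go
package main

import (
	"fmt"
	"testing"
)

// Rule is a hypothetical struct used only for this measurement.
type Rule struct {
	Pattern string
	Owners  [4]string
}

var rules = []Rule{{Pattern: "a"}, {Pattern: "b"}, {Pattern: "c"}}

// sink keeps each result reachable so the pointer is really needed.
var sink *Rule

//go:noinline
func findCopy(pattern string) *Rule {
	for _, rule := range rules {
		if rule.Pattern == pattern {
			return &rule // address of the loop copy: the copy must outlive the call
		}
	}
	return nil
}

//go:noinline
func findInPlace(pattern string) *Rule {
	for i := range rules {
		if rules[i].Pattern == pattern {
			return &rules[i] // address of an existing element: nothing new to allocate
		}
	}
	return nil
}

func main() {
	copyAllocs := testing.AllocsPerRun(1000, func() { sink = findCopy("c") })
	inPlaceAllocs := testing.AllocsPerRun(1000, func() { sink = findInPlace("c") })
	fmt.Printf("copy: %.0f allocs/op, in place: %.0f allocs/op\n", copyAllocs, inPlaceAllocs)
}
```

Building with `-gcflags=-m`, as the article suggests, is another way to see which variables the compiler decides to move to the heap.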

cbsmith 3 years ago |

As an old C/C++ programmer, I'm always surprised by how often software developers are surprised by the performance costs of inopportune value semantics (C, and C++ even more so, punish you severely for using value semantics when you shouldn't). I increasingly see the wisdom of languages with implicit reference semantics.

It's not that value semantics can't be better (they most assuredly can be), or that reference semantics don't cause their own complexity problems, but rather that so often we thoughtlessly imply/impose value semantics through interfaces in ways that negatively impact performance; getting interfaces wrong is a much tougher bell to unring.

The vast majority of my mental energy when I define an interface in C++ is carefully thinking through a combination of ownership contracts and value vs. reference semantics that I can mostly ignore in languages with implicit reference semantics. While occasionally ignoring those contracts while developing in Java/Python/whatever comes back to bite me, the problem isn't nearly as common or problematic as when I unintentionally impose value semantics in a language that allows me to.

amluto 3 years ago |

Somewhat off topic, but I find a different part of this to be quite ugly:

    if match || err != nil {
        return rule, err
    }

Translating this code to actual logic takes too much thought and is too fragile. Is that an error path or a success path? It’s both! The logic is “if we found a rule or if there was an error then return a tuple that hopefully indicates the outcome”. If any further code were to be added in this block, it would have to be validated for the success and the error case.

But this only makes any sense at all if one is okay with reading Go result returns in their full generality. A canonical Go function returns either Success(value) or Error(err not nil, meaningless auxiliary value). And this code has “meaningless auxiliary value” != nil! In fact, it’s a pointer that likely escapes further into unrelated error handling code and thus complicates any kind of lifetime or escape analysis.

I don’t use Go, but if I did, I think this part of the language would be my biggest peeve. Go has very little explicit error handling; fine, that’s a reasonable design decision. But Go’s error handling is incorrectly typed, and that is IMO not a reasonable design.

enedil 3 years ago |

Went from 4.139s to 2.413s. I fail to see how it is 70%. I think it is explained as 4.139/2.413 = 1.7 which of course doesn't make sense here.

CodesInChaos 3 years ago | |

I do think saying it's 71% faster makes sense here, since "x% faster" and "speed increased by x%" mean the same thing. This reduces the runtime by 42%, but that doesn't mean it's just 42% faster.

bmicraft 3 years ago | |

It would probably be more accurate to say it can do 70% more stuff in the same time. Or that it takes 42% less runtime

xnorswap 3 years ago | | |

But 70% more stuff in the same time is 70% faster.

morelisp 3 years ago | |

This is an extremely common mistake in reporting performance numbers. That the old version is 70% slower does not make the new version 70% faster.

wizofaus 3 years ago | | |

70% slower is a bit ambiguous though - it could mean 70% extra runtime or it could mean 30% of the new speed. Whereas 70% faster would always suggest to me that it can do 70% more work in the same amount of time, i.e. a 1.7x increase in speed.

hcm 3 years ago | |

Doh! Thanks for pointing out another silly mistake – I'll fix that.

xnorswap 3 years ago | | |

You had it right the first time, 1.7x speed is 70% faster.

If something previously took 4s and now takes 2s then it's 100% faster.

Think of driving 10 miles. If you drive at 20mph then it takes 30 minutes. If you drive twice as fast, 40mph, it takes 15 minutes.

40mph is 100% faster than 20mph.

Half the time is twice as fast!

wizofaus 3 years ago | | |

I disagree, I think the 70% is right, and matches what you still describe as a 1.7x speed increase. If it originally took 4 seconds and now takes 2, I'd call that a 100% speed increase, i.e. twice as fast.

markoman 3 years ago | | |

@hcm: Would have loved to see the 'after' flamegraph just for comparison purposes! I'm still trying to get used to grokking flamegraphs when optimizing. They're a somewhat new tool, IMO.

barbegal 3 years ago | |

It's gone from 14.5 operations per minute to 24.9 operations per minute so a 70% speedup.

gp 3 years ago |

I was trying to debug and improve the performance of some parallelized C++ code over the weekend for parsing CSV files. What would happen was parsing each file (~24k lines, 8 columns) would take 100ms with one execution context, but when split across many threads, the execution time of each thread would slow down proportionally and the throughput of the whole program would strictly decrease as thread count increased.

I tried all of the obvious things, but the offender ended up being a call to allocate and fill a `struct tm` object from a string representation of a date. There's no obvious reason (to me) why this would cause cache invalidation, etc., so I'm a little in the dark.

Still, replacing this four line block improved single threaded performance by 5x, and fixed the threaded behavior, so on the whole it is now ~70x faster and parses about 400 MB of CSV per second.

Thorrez 3 years ago | |

Maybe date related code calls out to the operating system to find your time zone, and maybe that can't be done in parallel.

Quekid5 3 years ago | |

False sharing, maybe?

gp 3 years ago | | |

Not a bad suggestion - thanks for the idea

lanstin 3 years ago |

That seems like a potential for compiler optimization. It should already know that the rule value is only used one time, as the target of a & and this must be somewhat common in managing return values.

jimsmart 3 years ago |

From the headline alone, I guessed this was to do with pointers/references to values vs values themselves.

Yep, with values that take a lot of memory, it's faster to pass pointers/references around than it is to pass the values around, because there are fewer bytes to copy.

Of course there is more to such a decision than just performance, because if the code makes changes to the value which are not meant to be persisted, then one wants to be working with a copy of the value, not a pointer to the value. So one should take care if simply switching some code from values to pointers-to-values.

All of these things are things that coders with more experience of languages that use such semantics kinda know already, almost as second nature, since the first day they got caught out by them. But everyone is learning, to various degrees, and we all have to start somewhere (i.e. knowing little to nothing).

hoosieree 3 years ago |

> You can see these decisions being made by passing -gcflags=-m to go build:

That's a very nice feature! I wonder if compilers for other languages have something similar.

Beltalowda 3 years ago |

The deeper lesson here is "don't use pointers unless you're sure you need them". I've seen quite a few people use pointers for no reason in particular, or there's simply the assumption it's faster (and have done this myself, too), but it puts a lot more pressure on the GC than simple local stack variables.

Of course sometimes pointers are faster, or much more convenient. But as a rule of thumb: don't use pointers unless you've got a specific reason for them. This applies even more so if you're creating a lot of pointers (like in a loop, or a function that gets called very frequently).
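The GC pressure from creating many pointers can be made visible by counting allocations. This is a sketch with an invented `Point` type; results are stored in package-level variables so both slices genuinely escape, and exact counts may differ between Go versions.

```go
package main

import (
	"fmt"
	"testing"
)

// Point is a small hypothetical struct.
type Point struct{ X, Y int }

const n = 100

var (
	valueSink   []Point
	pointerSink []*Point
)

// buildValues stores the structs inline in one backing array.
func buildValues() []Point {
	out := make([]Point, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, Point{X: i, Y: i})
	}
	return out
}

// buildPointers allocates the slice and, in addition, each Point it points to.
func buildPointers() []*Point {
	out := make([]*Point, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, &Point{X: i, Y: i})
	}
	return out
}

func main() {
	values := testing.AllocsPerRun(100, func() { valueSink = buildValues() })
	pointers := testing.AllocsPerRun(100, func() { pointerSink = buildPointers() })
	fmt.Printf("[]Point: %.0f allocs/op, []*Point: %.0f allocs/op\n", values, pointers)
}
```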

chubot 3 years ago |

FWIW, to prevent the bug where a = b is slow for big types, Google's C++ style guide used to mandate DISALLOW_COPY_AND_ASSIGN (which used to be DISALLOW_EVIL_CONSTRUCTORS I think) on all types (most types?)

Looks like that's been gone for a while in favor of C++11 stuff, which I don't really like:

https://google.github.io/styleguide/cppguide.html#Copyable_M...

A lot of good software was written in that style, but it has grown bureaucratic over time, and as the C++ language evolved

sendfoods 3 years ago |

1 character, in 2 places ;) I did not know profiling support for go was so seamless, thank you!

May I ask, is that theme custom or available somewhere? I really enjoyed it

gwd 3 years ago | |

> 1 character, in 2 places ;)

Moving a single character from one place to another. :-)

A good explanation of why "fire the developers with the lowest 50% of lines added" is an idiotic thing to do: this sort of deep analysis takes a lot of time and expertise, and frequently results in tiny changes.

hcm 3 years ago | |

Thanks! It's just a few dozen lines of CSS. The body font is Inter and the monospaced font is JetBrains Mono.

blowski 3 years ago | |

> is that theme custom or available somewhere

Looks a bit like https://newcss.net/ or Water CSS

is_taken 3 years ago |

Would be interesting to see the performance difference if you undo that move-&-change and change the function signature from:

  func (r Ruleset) Match(path string) (*Rule, error)

to:

  func (r *Ruleset) Match(path string) (*Rule, error)

masklinn 3 years ago | |

Likely none: Ruleset is

    type Ruleset []Rule

The original code creates a local copy of a rule and explicitly returns a pointer to that. Taking the ruleset by address wouldn't change that issue.

lxe 3 years ago |

This is the kind of stuff that the compiler needs to really understand. If all this de-referencing and referencing magic is at the control of the user, it needs to have meaningful effect on what the code does. Otherwise we might as well just write C.

silisili 3 years ago | |

The compiler does understand it and did what was asked - it was just written rather poorly.

There are valid use cases for wanting to take a copy, and then pass along a pointer of the copy. Perhaps to go through a series of modification methods that don't touch the original. I'd sure hate it if the compiler tried to outsmart me on that and changed the behavior away from what I'd written.

tonymet 3 years ago |

Overall good review of profiling tactics. But there’s nothing egregious about Golang here. Pass by value vs reference is a common performance issue.

masklinn 3 years ago | |

> But there’s nothing egregious about Golang here. Pass by value vs reference is a common performance issue.

The trap here is that everything is passed by reference (pointer), but the intermediate local value is, well, a value (a copy).

Rule is not a gigantic monster struct (it's 72 bytes), chances are returning it by value would not have been an issue.

Anyway I would say there is an issue with Go here: it's way too easy to copy out of a slice.
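The "too easy to copy out of a slice" point shows up most often with `range`, where the loop variable is a copy of each element. A minimal sketch with made-up names:

```go
package main

import "fmt"

// Rule is a hypothetical struct for illustration.
type Rule struct {
	Pattern string
	Hits    int
}

// bumpByValue increments a copy of each element; the slice is unchanged.
func bumpByValue(rules []Rule) {
	for _, rule := range rules {
		rule.Hits++
	}
}

// bumpByIndex increments the elements in place.
func bumpByIndex(rules []Rule) {
	for i := range rules {
		rules[i].Hits++
	}
}

func main() {
	rules := []Rule{{Pattern: "a"}, {Pattern: "b"}}

	bumpByValue(rules)
	fmt.Println(rules[0].Hits, rules[1].Hits) // 0 0

	bumpByIndex(rules)
	fmt.Println(rules[0].Hits, rules[1].Hits) // 1 1
}
```

Nothing in the syntax of the first loop signals that a struct copy is being made on every iteration.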

infamousclyde 3 years ago |

Thank you for sharing. I'm curious if you would recommend any good resources for profiling with Go. I enjoyed your code snippets and methodology.

stephen123 3 years ago |

Great post. I always feel smart when I find these kinds of optimisations. Then I wonder why the compiler isn't smarter, so I don't have to be.

erdaniels 3 years ago |

Is there any nice tooling / static analysis for golang that instruments the build process to add all the gcflags with verbose output and gives you hints as to what can be optimized?

renewiltord 3 years ago |

Clear tutorial of how to go about identifying this. Good blog post. Since the problem was constrained and real, it helps someone know when to use these tools. Thank you for sharing.

cratermoon 3 years ago |

This has made me go back to look at all the Go I've written recently and look at the & uses.

amtamt 3 years ago |

It falls in those 3% of code lines one should think of while not optimizing prematurely.

notpushkin 3 years ago |

Well, technically it's either a 2-character or 0-character change! :-)

AtNightWeCode 3 years ago |

So, this is very basic Go design, and you could write something about how it works in C and Go and why an older lang like C doesn't have this prob, but at the end of the day the Go fanclub will downvote the hell out of you no matter what.

AtNightWeCode 3 years ago | |

The Go compiler is garbage by design. A 20-year-old C compiler does not have this prob. This is also why Go has declined so much during the last couple of years. The benefits of Go have not increased and most of the quirks are still there, like the error handling, the naive compiler and the syntax sugar that somewhat hides the diff between pointers and direct heap allocs.

kosherhurricane 3 years ago | | |

I work on a code base that is a mixture of Go and C.

It's IO, CPU and Memory hungry, and it's distributed.

C is fast because it's close to how CPU and memory actually work. Go gives you 95+% of that plus an easy-to-learn, easy-to-use language. A new person could start contributing useful features and bug fixes immediately. A senior person could get C-level performance.

More and more of our code is moved from C to Go, with very little performance penalty, but with a lot more safety and ease of use.

Our customers benefit, and our company makes more money.

In the end, that's what software is about.

YesThatTom2 3 years ago | | |

C is not a good match for modern cpus. https://queue.acm.org/detail.cfm?id=3212479

fear91 3 years ago | | |

Can you show any examples of "go compiler being garbage"? In my experience, it often generates much smarter code than C# or Java.

    func alwaysError() (*int, error) {
        var dummy int
        // We return &dummy even though it's a meaningless value.
        // Will this cause a memory leak?
        return &dummy, fmt.Errorf("An error")
    }

    func caller() {
        val, err := alwaysError()
        if err != nil {
            fmt.Printf("Error\n")
            // It won't! Because the value pointed to by 'val'
            // can be GCed from this point on.
            return
        }
        // never get here
        fmt.Printf("Value %v\n", val)
        //
        // ...lots of code that uses *val...
        //
    }

    pub fn match_(&self, path: &str) -> Result<&Rule, Error> {
        for rule in self.0.iter() {
            if rule.match_(path)? {
                return Ok(rule);
            }
        }
        Err(Error)
    }

    pub fn match_(&self, path: &str) -> Result<Box<Rule>, Error> {
        for rule in self.0.iter() {
            if rule.match_(path)? {
                return Ok(Box::new(rule.clone()));
            }
        }
        Err(Error)
    }