C++ patterns for low-latency applications including high-frequency trading (arxiv.org)

389 points by chris_overseas 1 year ago | 231 comments

nickelpro 1 year ago |

Fairly trivial base introduction to the subject.

In my experience teaching undergrads they mostly get this stuff already. Their CompArch class has taught them the basics of branch prediction, cache coherence, and instruction caches; the trivial elements of performance.

I'm somewhat surprised the piece doesn't deal at all with a classic performance killer, false sharing, although it seems mostly concerned with single-threaded latency. The total lack of "free" optimization tricks like fat LTO, PGO, or even the standardized hinting attributes ([[likely]], [[unlikely]]) for optimizing icache layout was also surprising.
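For readers who have not met false sharing, here is a minimal sketch of the usual fix: give each independently written counter its own cache line. The 64-byte line size and the names `PackedCounters`, `PaddedCounters`, `same_cache_line`, and `check_layout` are assumptions made for illustration; none of them come from the paper or from this comment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical on x86-64; verify for your target).
constexpr std::size_t kCacheLine = 64;

// Both counters will usually share one cache line, so a producer thread
// writing `produced` and a consumer thread writing `consumed` keep
// invalidating each other's copy of that line.
struct PackedCounters {
    std::atomic<std::uint64_t> produced{0};
    std::atomic<std::uint64_t> consumed{0};
};

// Each counter is aligned to its own cache line, trading memory for isolation.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<std::uint64_t> produced{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> consumed{0};
};

static_assert(alignof(PaddedCounters) == kCacheLine, "struct starts on a line boundary");
static_assert(sizeof(PaddedCounters) == 2 * kCacheLine, "one line per counter");

// True when both addresses fall inside the same kCacheLine-sized block.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}

// Debug-build sanity check a caller might run once at startup.
inline void check_layout(const PaddedCounters& c) {
    assert(!same_cache_line(&c.produced, &c.consumed));
}
```

Whether the padding helps in practice depends on the access pattern, so it is worth benchmarking both layouts under real contention.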

Neither this piece, nor my undergraduates, deal with the more nitty-gritty elements of performance. These mostly get into the usage specifics of particular IO APIs, synchronization primitives, IPC mechanisms, and some of the more esoteric compiler builtins.

Besides all that, what the nascent low-latency programmer almost always lacks, and the hardest thing to instill in them, is a certain paranoia. A genuine fear, hate, and anger, towards unnecessary allocations, copies, and other performance killers. A creeping feeling that causes them to compulsively run the benchmarks through callgrind looking for calls into the object cache that miss and go to an allocator in the middle of the hot loop.

I think a formative moment for me was when I was writing a low-latency server and I realized that constructing a vector I/O operation ended up being overall slower than just copying the small objects I was dealing with into a contiguous buffer and performing a single write. There's no such thing as a free copy, and that includes fat pointers.
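The two approaches being compared can be sketched roughly as follows, assuming a POSIX system. `send_vectored`, `send_coalesced`, `kMaxParts`, and `kBufSize` are hypothetical names and limits chosen for illustration, not the code from that server. Which variant wins depends on fragment count and size, so treat this as something to benchmark rather than a conclusion.

```cpp
#include <sys/uio.h>   // writev, struct iovec (POSIX)
#include <unistd.h>    // write
#include <cassert>
#include <cstddef>
#include <cstring>     // std::memcpy
#include <string_view>

constexpr std::size_t kMaxParts = 16;    // hypothetical cap on fragments
constexpr std::size_t kBufSize  = 4096;  // hypothetical staging buffer size

// Option A: vectored I/O. No user-space copy of the payload, but one iovec
// must be filled in per fragment and the kernel walks the whole array.
inline ssize_t send_vectored(int fd, const std::string_view* parts, std::size_t n) {
    assert(parts != nullptr || n == 0);
    if (n > kMaxParts) return -1;
    iovec iov[kMaxParts];
    for (std::size_t i = 0; i < n; ++i) {
        // iov_base is non-const in the POSIX struct; writev only reads from it.
        iov[i].iov_base = const_cast<char*>(parts[i].data());
        iov[i].iov_len  = parts[i].size();
    }
    return ::writev(fd, iov, static_cast<int>(n));
}

// Option B: copy the small fragments into one contiguous stack buffer and
// issue a single plain write.
inline ssize_t send_coalesced(int fd, const std::string_view* parts, std::size_t n) {
    assert(parts != nullptr || n == 0);
    char buf[kBufSize];
    std::size_t used = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (parts[i].empty()) continue;
        if (parts[i].size() > kBufSize - used) return -1;  // would overflow
        std::memcpy(buf + used, parts[i].data(), parts[i].size());
        used += parts[i].size();
    }
    return ::write(fd, buf, used);
}
```

Both functions return the byte count reported by the kernel, so a caller still has to handle short writes.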

chipdart 1 year ago | |

> Fairly trivial base introduction to the subject.

Might be, but low-latency C++, in spite of being a field on its own, is a desert of information.

The best resources available at the moment on low-latency C++ are a handful of lectures from C++ conferences, which leave much to be desired.

Putting aside the temptation to grandstand, this document is an outstanding contribution to the field and perhaps the first authoritative reference on the subject. Vague claims that you can piece together similar info from other courses do not count as a contribution, and help no one.

Aurornis 1 year ago | | |

> this document is an outstanding contribution to the field and perhaps the first authoritative reference on the subject.

I don't know how you arrive at this conclusion. The document really is an introduction to the same basic performance techniques that have been covered over and over. Loop unrolling, inlining, and the other techniques have appeared in countless textbooks and blog posts already.

I was disappointed to read the paper because they spent so much time covering really basic micro techniques but then didn't cover any of the more complicated issues mentioned in the parent comment.

I don't understand why you'd think this is an "outstanding contribution to the field" when it's basically a recap of simple techniques that have been covered countless times in textbooks and other works already. This paper may seem profound if someone has never, ever read anything about performance optimization before, but it's likely mundane to anyone who has worked on performance before or even wondered what inlining or -funroll-loops does while reading some other code.

radarsat1 1 year ago | |

> A creeping feeling that causes them to compulsively run the benchmarks through callgrind

I'm happy I don't deal with such things these days, but I feel the real paranoia lies in the Heisenberg feeling of not being able to trust even these tools: the sneaking suspicion that the program is doing something different when I'm not measuring it.

mbo 1 year ago | |

Out of interest, do you have any literature that you'd recommend instead?

nickelpro 1 year ago | | |

On the software side I don't think HFT is as special a space as this paper makes it out to be.[1] Each year at CppCon there's another half-dozen talks going in depth on different elements of performance that cover more ground collectively than any single paper will.

Similarly, there's an immense amount of formal literature and textbooks out of the game development space that can be very useful to newcomers looking for structural approaches to high performance compute and IO loops. Games care a lot about local and network latency, the problem spaces aren't that far apart (and writing games is a very fun way to learn).

I don't have specific recommendations for holistic introductions to the field. I learn new techniques primarily through building things, watching conference talks, reading source code of other low latency projects, and discussion with coworkers.

[1]: HFT is quite special on the hardware side, which is discussed in the paper. The NICs, network stacks, and extensive integration of FPGAs do heavily differentiate the industry and I don't want to insinuate otherwise.

You will not find a lot of SystemVerilog programmers at a typical video game studio.

_aavaa_ 1 year ago | | |

I'd recommend: https://www.computerenhance.com

The author has a strong game development (engine and tooling) background and I have found it incredibly useful.

It also satisfies the requirement for "A genuine fear, hate, and anger, towards unnecessary allocations, copies, and other performance killers."

contingencies 1 year ago | |

How I might approach it. Interested in feedback from people closer to the space.

First, split the load in to simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here as the friction is too high for iteration, people, supply chain, etc. Input may be a FIX stream or similar, output is a series of asset-specific binary event streams along low-latency buses, split in to asset-specific segments of a scalable cluster of low-end MCUs. Second, get rid of the general purpose operating system assumption on your asset-specific MCU-based execution platform to enable faster turnaround using low-level code you can actually find people to write on hardware you can actually purchase. Third, profit? In such a setup you'd need to monitor the overall state with a general purpose OS based governor which could pause or change strategies by reprogramming the individual elements as required.

Just how low are the latencies involved? At a certain point you're better off paying to get the hardware closer to the core than bothering with engineering, right? I guess that heavily depends on the rules and available DCs / link infrastructure offered by the exchanges or pools in question. I would guess a number of profitable operations probably don't disclose which pools they connect to and make a business of front-running, regulations or terms of service be damned. In such cases, the relative network geographic latency between two points of execution is more powerful than the absolute latency to one.

nickelpro 1 year ago | | |

The work I do is all in the single-to-low-double-digit microsecond range to give you an idea of timing constraints. I'm peripheral to HFT as a field though.

> First, split the load in to simple asset-specific data streams with a front-end FPGA for raw speed. Resist the temptation to actually execute here as the friction is too high for iteration, people, supply chain, etc.

This is largely incorrect, or more generously out-of-date, and it influences everything downstream of your explanation. Think of FPGAs as far more flexible GPUs and you're in the right arena. Input parsing and filtering are the obvious applications, but this is by no means the end state.

A wide variety of sanity checks and monitoring features are pushed to the FPGAs, along with fixed calculation tasks and output generation. It is possible for the entire stack for some (or most, or all) transactions to be implemented at the FPGA layer. For such transactions the time magnitudes are mid-to-high triple-digit nanoseconds. The stacks I've seen with my own two eyeballs talked to supervision algorithms over PCIe (which themselves must be not-slow, but not in the same realm as <10us work), but otherwise nothing crazy fancy. This is well covered in the older academic work on the subject [1], which is why I'm fairly certain it's long out of date by now.

HRT has some public information on the pipeline they use for testing and verifying trading components implemented in HDL.[2] With the modern tooling, namely Verilator, development isn't significantly different than modern software development. If anything, SystemVerilog components are much easier to unit test than typical C++ code.

Beyond that it gets way too firm-specific to really comment on anything, and I'm certainly not the one to comment. There's maybe three dozen HFT firms in the entire United States? It's not a huge field with widely acknowledged industry norms.

[1]: https://ieeexplore.ieee.org/document/6299067

[2]: https://www.hudsonrivertrading.com/hrtbeat/verify-custom-har...

PoignardAzur 1 year ago | |

> optimization tricks like fat LTO, PGO, or even the standardized hinting attributes ([[likely]], [[unlikely]]) for optimizing icache layout

If you do PGO, aren't hinting attributes counter-productive?

In fact, the common wisdom I mostly see compiler people express is that most of the time they're counter-productive even without PGO, and modern compilers trust their own analysis passes more than they trust these hints and will usually ignore them.

FWIW, the only times I've seen these hints in the wild were in places where the compiler could easily insert them, eg the null check after a malloc call.

nickelpro 1 year ago | | |

I said "or even"; if you're regularly using PGO they're irrelevant, but not everyone regularly uses PGO in a way that covers all their workloads.

The hinting attributes are exceptional for lone conditionals (not if/else trees) where the compiler has no obvious context for whether the branch will usually be taken or skipped. Compilers are frequently conservative with such things and keep the code in the hot path.

The [[likely]] attribute then doesn't matter so much, but [[unlikely]] is absolutely respected and gets the code out of the hot path, especially when inlined into a large section. Godbolt is useful to verify this but obviously there's no substitute for benchmarking the performance impact.
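A minimal sketch of the kind of lone conditional being described, using the C++20 attribute. `Order`, `notional`, and the reject counter are hypothetical names made up for illustration. Whether the cold block actually moves out of line depends on the compiler and flags, so check the generated assembly and benchmark.

```cpp
#include <cassert>
#include <cstdint>

struct Order {
    std::int64_t price;
    std::int64_t qty;
};

// A single guard with no else branch. The attribute tells the compiler the
// rejection path is rare, so it may place that block away from the
// straight-line code when this function is inlined into a larger hot loop.
inline std::int64_t notional(const Order& o, std::uint64_t& rejects) {
    assert(o.price >= 0);  // hypothetical precondition, checked in debug builds
    if (o.qty <= 0) [[unlikely]] {
        ++rejects;  // cold path: bookkeeping for a malformed order
        return 0;
    }
    return o.price * o.qty;  // hot path
}
```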

matheusmoreira 1 year ago | |

> allocations, copies, and other performance killers

Please elaborate on those other performance killers.

twic 1 year ago |

My emphasis:

> The output of this test is a test statistic (t-statistic) and an associated p-value. The t-statistic, also known as the score, is the result of the unit-root test on the residuals. A more negative t-statistic suggests that the residuals are more likely to be stationary. The p-value provides a measure of the probability that the null hypothesis of the test (no cointegration) is true. The results of your test yielded a p-value of approximately 0.0149 and a t-statistic of -3.7684.

I think they used an LLM to write this bit.

It's also a really weird example. They look at correlation of once-a-day close prices over five years, and then write code to calculate the spread with 65 microsecond latency. That doesn't actually make any sense as something to do. And you wouldn't be calculating statistics on the spread in your inner loop. And 65 microseconds is far too slow for an inner loop. I suppose the point is just to exercise some optimisation techniques - but this is a rather unrepresentative thing to optimise!

sneilan1 1 year ago |

I've got an implementation of a stock exchange that uses the LMAX disruptor pattern in C++ https://github.com/sneilan/stock-exchange

And a basic implementation of the LMAX disruptor as a couple C++ files https://github.com/sneilan/lmax-disruptor-tutorial

However, I've been looking to rebuild this in Rust. I reached the point where I had implemented my own websocket protocol, authentication system, SSL, etc. Then I realized that memory management and dependencies are a lot easier in Rust, especially for a one-man software project.

jeffreygoesto 1 year ago |

Reminds me of https://github.com/CppCon/CppCon2017/blob/master/Presentatio...

munificent 1 year ago | |

This is an excellent slideshow.

The slide on measuring by having a fake server replaying order data, a second server calculating runtimes, the server under test, and a hardware switch to let you measure packet times is so delightfully hardcore.

I don't have any interest in working in finance, but it must be fun working on something so performance critical that buying a rack of hardware just for benchmarking is economically feasible.

nine_k 1 year ago | | |

Delightfully hardcore indeed!

But of course you don't have to buy a rack of servers for testing, you can rent it. Servers are a quickly depreciating asset, why invest in them?

a_t48 1 year ago | | |

The self driving space does this :)

winternewt 1 year ago |

I made a C++ logging library [1] that has many similarities to the LMAX disruptor. It appears to have found some use among the HFT community.

The original intent was to enable highly detailed logging without performance degradation for "post-mortem" debugging in production environments. I had coworkers who would refuse to include logging of certain important information for troubleshooting, because they were scared that it would impact performance. This put an end to that argument.

[1] https://github.com/mattiasflodin/reckless

munificent 1 year ago |

> The noted efficiency in compile-time dispatch is due to decisions about function calls being made during the compilation phase. By bypassing the decision-making overhead present in runtime dispatch, programs can execute more swiftly, thus boosting performance.

The other benefit with compile-time dispatch is that when the compiler can statically determine which function is being called, it may be able to inline the called function's code directly at the callsite. That eliminates all of the function call overhead and may also enable further optimizations (dead code elimination, constant propagation, etc.).
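To make the contrast concrete, here is a small sketch of the same loop written with runtime dispatch and with compile-time dispatch. All type and function names (`Handler`, `DoubleVirtual`, `DoubleStatic`, `run_dynamic`, `run_static`) are invented for illustration. Compilers can sometimes devirtualize the first form too, so this shows the mechanism and makes no guarantee about codegen.

```cpp
#include <cassert>

// Runtime dispatch: the call target is looked up through the vtable unless
// the compiler can prove the dynamic type.
struct Handler {
    virtual ~Handler() = default;
    virtual int on_tick(int px) = 0;
};

struct DoubleVirtual final : Handler {
    int on_tick(int px) override { return 2 * px; }
};

inline int run_dynamic(Handler& h, int n) {
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += h.on_tick(i);  // indirect call
    return sum;
}

// Compile-time dispatch: H is known when the template is instantiated, so
// the call is direct and on_tick can be inlined into the loop body, which
// in turn exposes it to constant propagation and dead code elimination.
struct DoubleStatic {
    int on_tick(int px) { return 2 * px; }
};

template <class H>
int run_static(H& h, int n) {
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += h.on_tick(i);  // direct call
    return sum;
}
```

For example, `run_static(handler, 4)` with a `DoubleStatic` handler sums 0, 2, 4, and 6, as does `run_dynamic` with a `DoubleVirtual`.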

foobazgt 1 year ago | |

> That eliminates all of the function call overhead and may also enable further optimizations (dead code elimination, constant propagation, etc.).

AFAIK, the speedup is almost never function call overhead. As you mention at the tail end, it's all about the compiler optimizations being able to see past the dynamic branch. Good JITs support polymorphic inlining. My (somewhat dated) experience for C++ is that PGO is the solve for this, but it's not widely used. Instead people tend to avoid dynamic dispatch altogether in performance sensitive code.

I think the more general moral of the story is to avoid all kinds of unnecessary dynamic branching in hot sections of code in any language unless you have strong confidence your compiler/JIT is seeing through it.

binary132 1 year ago | |

The real performance depends on the runtime behavior of the machine as well as compiler optimizations. I thought this talk was very interesting on this subject.

https://youtu.be/i5MAXAxp_Tw

xxpor 1 year ago | |

OTOH, it might be a net negative in latency if you're icache limited. Depends on the access pattern among other things, of course.

munificent 1 year ago | | |

Yup, you always have to measure.

Though my impression is that compilers tend to be fairly conservative about inlining so that they don't risk the inlining being a pessimization.

globular-toast 1 year ago |

Is there any good reason for high-frequency trading to exist? People often complain about bitcoin wasting energy, but oddly this gets a free pass despite this being a definite net negative to society as far as I can tell.

astromaniak 1 year ago |

Just in case you are a pro developer, the whole thing is worth looking at:

https://github.com/CppCon/CppCon2017/tree/master/Presentatio...

and up

ykonstant 1 year ago |

I am curious: why does this field use (or why did it use) C++ instead of C for the logic? What benefits does C++ have over C in the domain? I am proficient in C/assembly but completely ignorant of the practices in HFT so please go easy on the explanations!

jqmp 1 year ago | |

C++ is more expressive and allows much more abstraction than C. For a long time C++ was the only mainstream language that provided C-level performance as well as rich abstractions, which is why it became popular in fields that require complex domain modeling, like HFT, gamedev, and graphics. (Of course one can debate whether this expressivity is worth the enormous complexity of the language, but in practice people have empirically chosen C++.)

ibeff 1 year ago |

The structure and tone of this text reeks of LLM.

poulpy123 1 year ago |

the irony being that if something should not be high frequency, it is trading

apantel 1 year ago |

Anyone know of resources like this for Java?

Hixon10 1 year ago | |

https://www.reddit.com/r/java/comments/1ctpebe/low_latency/

apantel 1 year ago | | |

Thanks! Looks like a great list of resources.

gedanziger 1 year ago |

Very cool intro to the subject!

    T *item = &this->shared_mem_region->entities[this->shared_mem_region->consumer_position];
    this->shared_mem_region->consumer_position++;
    this->shared_mem_region->consumer_position %= this->slots;
