After almost six months, I finally found a spot where I could monkey patch a function to wrap it with a short circuit if the coordinates were out of bounds. Not only fixed the bug but made drag and drop several times faster. Couldn’t share this with the world because they weren’t accepting PRs against the old widgets.
I’ve worked harder on bug fixes, but I think that’s the longest I’ve worked on one.
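For anyone who hasn't done this kind of patch, a minimal sketch of the shape of it, in Python rather than the original toolkit. `Widget`, `on_drag`, and the bounds check are hypothetical stand-ins, not the real widget API:

```python
import functools


class Widget:
    """Hypothetical stand-in for a third-party widget we cannot modify."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.calls = 0  # counts how often the expensive original runs

    def on_drag(self, x, y):
        self.calls += 1
        return (x, y)


def short_circuit_out_of_bounds(method):
    """Wrap a method so out-of-bounds coordinates never reach the original."""

    @functools.wraps(method)
    def wrapper(self, x, y):
        if not (0 <= x < self.width and 0 <= y < self.height):
            return None  # skip the original entirely
        return method(self, x, y)

    return wrapper


# The monkey patch: replace the method on the class at runtime.
Widget.on_drag = short_circuit_out_of_bounds(Widget.on_drag)
```

The speedup in a case like this comes from the original never running for events it can't handle.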
Debugging errors in JS crypto and compression implementations that only occurred at random, after some ten thousand iterations or more, on a mobile browser (back when those were awful), and only with the debugger closed or detached because opening it disabled the JIT, was not fun.
It taught me to go into debugging with no assumptions about what can and cannot be to blame, which has been very useful later in even trickier scenarios.
This is why D, by default, initializes all variables. Note that the optimizer removes dead assignments, so this is runtime cost-free. D's implementation of C, ImportC, also default initializes all locals. Why let that stupid C bug continue?
Another that repeatedly bit me was adding a field, and neglecting to add initialization of it to all the constructors.
This is why D guarantees that all fields are initialized.
If native code calls back into Java, and the GC kicks in, all the objects the native code can see can be compacted and moved. So my implementation worked fine for all of the smaller test fixtures, and blew up half the time with the largest. Because I skipped a line to make it “go faster”.
I finally realized I was seeing raw Java objects in the middle of my “array” and changing the value of final fields into illegal pairings which blew everything the fuck up.
Level 2 systems programmer: "oh no, my memory allocator is a garbage collector"
As painful as the debugging story was, I have spent vastly more time working around garbage collectors to ship performant code.
( https://github.com/jemalloc/jemalloc/issues/1317 Unlike what the title says, it's not Windows-specific.)
(*): The application uses libc malloc normally, but at some places it allocates pages using `mmap(non_anonymous_tempfile)` and then uses jemalloc to partition them. jemalloc has a feature called "extent hooks" where you can customize how jemalloc gets underlying pages for its allocations, which we use to give it pages via such mmap's. Then the higher layers of the code that just want to allocate don't have to care whether those allocations came from libc malloc or mmap-backed disk file.
If there were 20 million rooms in the world with a price for each day of the year, we’d be looking at around 7 billion prices per year. That’d be, say, 4 TB of storage without indexes.
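A quick back-of-the-envelope check of those numbers, as a sketch. The 8 bytes per bare price is my assumption and is not from the estimate above:

```python
rooms = 20_000_000
days_per_year = 365

prices_per_year = rooms * days_per_year  # 7.3 billion

# The quoted 4 TB (decimal terabytes) implies this many bytes per price row.
quoted_bytes = 4 * 10**12
implied_bytes_per_price = quoted_bytes / prices_per_year

# Assumption: a bare price stored as one 8-byte value, no keys, no indexes.
ASSUMED_BYTES_PER_PRICE = 8
raw_gigabytes = prices_per_year * ASSUMED_BYTES_PER_PRICE / 10**9

print(f"{prices_per_year:,} prices per year")
print(f"{implied_bytes_per_price:.0f} bytes per price implied by 4 TB")
print(f"{raw_gigabytes:.1f} GB if each price were only 8 bytes")
```

So the 4 TB figure leaves room for roughly half a kilobyte per price, and the real footprint depends almost entirely on what is stored alongside each one.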
The problem space seems to have a bunch of options to partition - by locality, by date etc.
I’m curious if there’s a commonly understood match for this problem?
FWIW with that dataset size, my first experiments would be with SQL Server because that data will fit in RAM. I don’t know if that’s where I’d end up, but I’m pretty sure it’s where I’d start my performance testing when grappling with this problem.
[1]: https://github.com/microsoft/mimalloc/blob/dev/src/heap.c#L1...
The underlying sys crate provides the binding for mimalloc API like `mi_collect`: https://docs.rs/libmimalloc-sys/0.1.39/libmimalloc_sys/fn.mi...
“C programmers think memory management is too important to be left to the computer. LISP programmers think memory management is too important to be left to the user.”
But far better to just use integer cents.
Does your project correctly calculate $300,000.00 + $0.01, (or even just correctly represent the value $300,000.01) and if so, how?
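To make the question concrete, a small sketch assuming the amounts are held in 32-bit floats (that is my assumption; the thread doesn't say which float type is in play). Python's own floats are 64-bit, so `to_f32` round-trips through `struct` to emulate single precision, and `format_cents` is a hypothetical helper for non-negative amounts:

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Near 300,000 adjacent 32-bit floats are 0.03125 apart, so one cent vanishes.
stored = to_f32(300_000.01)
summed = to_f32(to_f32(300_000.00) + to_f32(0.01))

# Integer cents: exact, because it is just integer addition.
total_cents = 300_000_00 + 1


def format_cents(cents):
    """Format a non-negative integer number of cents as dollars."""
    return f"${cents // 100:,}.{cents % 100:02d}"


print(stored, summed, format_cents(total_cents))
```

A 64-bit double has much finer spacing at this magnitude, but it still cannot represent 0.01 exactly, which is why integer cents stay the simpler answer.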
The concept of memory that is allocated by a thread and can only be deallocated by that thread is useful and valid, but as TFA demonstrates, can also cause problems if you're not careful with your overall architecture. If the language you're using even allows you to use this concept, it almost certainly will not protect you from having to get the architecture correct.
Interestingly, it would seem that Java programmers play with garbage collectors while Rust programmers play with memory allocators.
*system
Every OS will provide some mechanism to get more pages. But it turns out that managing the use of those pages requires specialized handling, depending on the use case, as well as a bunch of boilerplate. Hence, we also have malloc and its many, many cousins to allocate arbitrary size objects.
You're always welcome to use brk(2) or your OS's equivalent if you just want pages. The question is, what are you going to do with each page once you have it? That's where the next level comes in ...
For high performance stuff where you need low, predictable latency, you're probably not going to want to use dynamic memory at all.
The downside is that it makes things like "print" a pain in the ass.
The upside is that you can have multiple memory allocators with hugely different characteristics (arena for per frame resources, bump allocator for network resources, etc.).
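For readers who haven't written one, a toy sketch of the arena / bump allocator idea, done in Python over a `bytearray` purely for illustration. `BumpArena` and its methods are made-up names, and a real one would hand out raw pointers rather than `memoryview` slices:

```python
class BumpArena:
    """Toy bump allocator: carve slices out of one preallocated buffer."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity)
        self._top = 0  # offset of the first free byte

    @property
    def used(self):
        return self._top

    def alloc(self, size, align=8):
        # Round the current offset up to the requested alignment.
        start = (self._top + align - 1) // align * align
        end = start + size
        if end > len(self._buf):
            raise MemoryError("arena exhausted")
        self._top = end
        return memoryview(self._buf)[start:end]

    def reset(self):
        # "Free" everything at once, e.g. at the end of a frame.
        self._top = 0
```

Allocation is one add and one compare, and there is no per-object free at all, which is why it suits per-frame or per-request lifetimes and nothing else.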
Generally, page size isn't something you know at compile time (or even install time); it can vary between restarts and range anywhere from ~4 KiB to 1 GiB, while most natural memory objects are much smaller than 4 KiB and some are potentially much larger than 1 GiB. So you kind of don't want to leak anything related to page sizes into your business logic if it can be helped. If you still need to, most languages have memory/allocation pools you can use to get a bit more control over memory allocation, freeing and reuse.
Also, the performance issues mentioned don't have much to do with memory pages or anything like that; _instead they are rooted in concurrency controls of a global resource (memory)_. I.e. thread-local concurrency synchronization vs. process-wide concurrency synchronization.
Mainly, instead of using a fully general-purpose allocator they used an allocator which is still general purpose but has a design bias that improves same-thread (de)allocation performance at the cost of cross-thread (de)allocation performance. And they were doing a ton of cross-thread (de)allocations, leading to noticeable performance degradation.
The thing is, even if you hypothetically only had allocations at sizes that are multiples of a memory page, or used a ton of manual mmap, you would still want to use an allocator and not always return freed memory directly to the OS, as doing so (and doing a syscall on every allocation) tends to lead to major performance degradation in many use cases. So you still need concurrency controls, but they come at a cost, especially for cross-thread synchronization. Even lock-free controls based on atomics have a cost over thread-local controls, often caused largely by cache invalidation/synchronization.
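A toy model of that same-thread vs. cross-thread asymmetry, as a sketch only. This is not how mimalloc or any real allocator is implemented; `Block`, `ToyAllocator`, and `demo` are invented names, and the lock stands in for whatever synchronization a real allocator would use:

```python
import threading


class Block:
    """Stand-in for an allocation; it remembers which thread owns it."""

    def __init__(self, owner):
        self.owner = owner


class ToyAllocator:
    def __init__(self):
        self._tls = threading.local()  # per-thread free lists, no locking
        self._lock = threading.Lock()  # guards the shared remote-free lists
        self._remote = {}              # owner thread id -> blocks freed elsewhere
        self.locked_frees = 0          # frees that had to take the shared lock

    def _free_list(self):
        if not hasattr(self._tls, "free"):
            self._tls.free = []
        return self._tls.free

    def alloc(self):
        free = self._free_list()
        return free.pop() if free else Block(threading.get_ident())

    def free(self, block):
        if block.owner == threading.get_ident():
            self._free_list().append(block)  # fast path: thread-local only
        else:
            with self._lock:                 # slow path: cross-thread free
                self._remote.setdefault(block.owner, []).append(block)
                self.locked_frees += 1


def demo(n=1000):
    alloc = ToyAllocator()
    for block in [alloc.alloc() for _ in range(n)]:
        alloc.free(block)                    # freed by the allocating thread
    same_thread = alloc.locked_frees
    blocks = [alloc.alloc() for _ in range(n)]
    worker = threading.Thread(target=lambda: [alloc.free(b) for b in blocks])
    worker.start()
    worker.join()
    return same_thread, alloc.locked_frees
```

The toy never drains the remote lists back to their owners, which is also roughly the situation where one thread frees everything another thread allocated: the memory sits in a queue until the owner gets around to reclaiming it.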
A lot of my opinions on code and the human brain started in college. My roommate was washing out and didn’t know it yet. The rules about helping other people were very clear. I was a boy scout but also a grade-A bargainer and rationalizer, so I created a protocol for helping him without getting us expelled. Other kids in the lab started using me the same way.
There were so many people who couldn’t grasp that your code can have three bugs at once, and fixing one won’t make your code behave. Some of those must have washed out too.
But applying the scientific method as you say above is something that I came to later and it’s how I mentor people. If all of your assumptions say the answer should be 3 but it’s 4, or “4” or “Spain”, one of your assumptions is wrong and you need to test them. Odds of being the flaw / difficulty of rechecking. Prioritize and work the problem.
(Hidden variable: how embarrassed you’ll be if this turns out to be the problem)
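The odds-over-difficulty heuristic fits in a few lines. A sketch with made-up suspects and made-up numbers; `triage` is a hypothetical helper, and the odds and costs are guesses you would supply yourself:

```python
def triage(suspects):
    """Order suspects by (odds of being the flaw) / (cost of rechecking)."""
    return sorted(suspects, key=lambda s: s[1] / s[2], reverse=True)


# (assumption to recheck, estimated odds it is wrong, hours to recheck)
suspects = [
    ("compiler bug", 0.01, 8.0),
    ("wrong config file loaded", 0.20, 0.1),
    ("off-by-one in my loop", 0.50, 1.0),
]

for name, odds, hours in triage(suspects):
    print(f"{odds / hours:7.3f}  {name}")
```

Note that the likeliest suspect doesn't come first here: the config check is less likely to be the flaw but so cheap to rule out that it wins.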
Edit: In your case, that's where I start print debugging LOL
[0] https://perldoc.perl.org/functions/sort
If the subroutine's prototype is ($$), the elements to be compared are passed by reference in @_, as for a normal subroutine. This is slower than unprototyped subroutines, where the elements to be compared are passed into the subroutine as the package global variables $a and $b (see example below).
…
println("2");
…
println("wtf");
Default initialization, on the other hand, gives 100% coverage. Experience with it in D is a satisfying success.
If you can't reproduce a bug, you cannot in my opinion say that it is fixed. If you have to reproduce it via local debugging and changing a value, or hard coding a value, I think you're possibly close, but there's a chance it might not be the case!
If he didn’t know he would just say. But he says he does.
Still a language design issue: C++ and Rust don't put allocation concerns front and center, when they very much should be. Not encouraging thinking about these things is very bad for systems languages.
It's about the idea that you are using per-thread allocators, and one of your threads allocates a lot of memory, then goes to sleep for a long time.
Per-thread allocators are orthogonal to per-structure allocators.
Why should memory be different?
Go for instance bills itself as a systems language and that's true for domains where bounded, predictable memory consumption / CPU trade-offs are not necessary _because_ the runtime GC is bundled and non-negotiable. Its behavior also shifts with releases. A systems program relying on an allocator alone can choose to ignore the allocator until it's a problem and swap the implementation out for one -- perhaps custom made -- that tailors to the domain.
So many serious applications end up reimplementing their own custom user-space / process-level filesystem for specific tasks because of how SLOW OS filesystems can be, though.
Unfortunately science only evolves one funeral at a time.
But there were / are also plenty of trading shops that paid Azul for their pauseless C4 GC. Nowadays there's also ZGC and Shenandoah, so if you want to both allocate a lot and also not have pauses, that tech is no longer expensive.
Well, I just trivialized it. However, in one case in mid 00s, I saw it disabled completely to avoid any pauses during trading hours.
> or a design that does not respect how GC works in the first place
It’s called shipping a 90 Hz VR game without dropping frames.
(if that is the case, I understand where the GC PTSD comes from)
To be fair, there are about 4 completely independent bad decisions that tend to be made together in a given language. GC is just one of them, and not necessarily the worst (possibly the least bad, even).
The decisions, in rough order of importance according to some guy on the Internet:
1. The static-vs-dynamic axis. This is not a binary decision, things like "functions tend to accept interfaces rather than concrete types" and "all methods are virtual unless marked final" still penalize you even if you appear to have static types. C++'s "static duck typing" in templates theoretically counts here, but damages programmer sanity rather than runtime performance. Expressivity of the type system (higher-kinded types, generics) also matters. Thus Java-like languages don't actually do particularly great here.
2. The AOT-vs-JIT axis. Again, this is not a binary decision, nor is it fully independent of other axes - in particular, dynamic languages with optimistic tracing JITs are worse than Java-style JITs. A notable compromise is "cached JIT" aka "AOT at startup" (in particular, this deals with -march=native), though this can fail badly in "rebuild the container every startup" workflows. Theoretically some degree of runtime JIT can help too since PGO is hard, but it's usually lost in the noise. Note that if your language understands what "relocations" are you can win a lot. Java-like languages can lose badly for some workflows (e.g. tools intended to be launched from bash interactively) here, but other workflows can ignore this.
3. The inline-vs-indirect-object axis - that is, are all objects (effectively) forced to be separate allocations, or can they be subobjects (value types)? If local variables can avoid allocation that only counts for a little bit. Java loses very badly here outside of purely numerical code (Project Valhalla has been promising a solution for a decade now, and given their unwieldy proposals it's not clear they actually understand the problem), but C# is tolerable, though still far behind C++ (note the "fat shared" implications with #4). In other words - yes, usually the problem isn't the GC, it's the fact that the language forces you to generate garbage in the first place.
4. The intended-vs-uncontrollable-memory-ownership axis. GC-by-default is an automatic loss here; the bare minimum is to support the widely-intended (unique, shared, weak, borrowed) quartet without much programmer overhead (barring the bug below, you can write unique-like logic by hand, and implement the others in terms of it; particularly, many languages have poor support for weak), but I have a much bigger list [1] and some require language support to implement. try-with-resources (= Python-style with) is worth a little here but nowhere near enough to count as a win; try-finally is assumed even in the worst case but worth nothing due to being very ugly. Note that many languages are unavoidably buggy if they allow an exception to occur between the construction of an object and its assignment to a variable; the only way to avoid this is to write extensions in native code.
[1] https://gist.github.com/o11c/dee52f11428b3d70914c4ed5652d43f... - a list of intended memory ownership policies. Generalized GC has never found a theoretical use; it only shows up as a workaround.
re 4. There is an understanding gap in the programming community about the kind of constraints that lifetime analysis imposes on the dynamism allowed by JIT compilation. The tradeoff is the ability to invalidate previous assertions about when an object or struct is truly no longer referenced, whether it escapes, and so on: you may no longer be able to re-JIT the method, attach a debugger or introduce some other change. There is also still a lack of understanding of where the cost of GC comes from and how it compares to other memory management techniques, or how it interacts with escape analysis (which in many ways resembles static lifetime analysis for linear and affine types), particularly when it is inter-procedural. I am saying this in response to "GC-by-default is an automatic loss", which sounds like the overly generalized "GC bad" you get used to hearing from an audience that has never looked at it with a profiler.
And lastly, latency-sensitive gamedev with its need for predictability tends to come with a completely different set of constraints from regular application code, and tends to require comparable techniques regardless of the language of choice, provided it has capable compiler and GC implementations. It greatly favours GCs with low or schedulable STW pauses (pauseless-like and especially non-moving designs tend to come with very ARC-like synchronization cost and low throughput (Go), or significantly higher heap sizes over the actively used set (JVM pauseless GC impls. like Azul, maybe ZGC?)), ideally with some or most collection phases being concurrent, which performs best at moderate allocation rates. In the Unity case, there are quite a few poor-quality libraries, as well as constraints specific to Unity in regards to its rudimentary non-moving GC, which did receive upgrades for incremental per-frame collection but would still cause issues in scenarios where it cannot keep up. This is likely why the author of the parent comment is so up in arms about GC.
However, for complex frequently allocated and deallocated object graphs that do not have immediately observed lifetime constrained to a single thread, good GC is vastly superior to RC+malloc/free and can be matched by manually managing various arenas at much greater complexity cost, which is still an option in a GC-based language like C# (and is a popular technique in this domain).
That particular project was Unity. Which, as you know, has a notoriously poor GC implementation.
It sure seems like there are a whole lot more bad GC implementations than good. And good ones are seemingly never available in my domains! Which makes their supposed existence irrelevant to my decision tree.
> good GC is vastly superior to RC+malloc/free
Ehhh. To be honest memory management is kind of easy. Memory leaks are easy to track. Double frees are kind of a non-issue. Use after free requires a modicum of care and planning.
> and can be matched by manually managing various arenas at much greater complexity cost, which is still an option in a GC-based language like C# (and is a popular technique in this domain).
Not 100% sure what you mean here.
I really really hate having to fight the GC and go out of my way to pool objects in C#. Sure it works. But it defeats the whole purpose of having a GC and is much much more painful than if it were just C.
It is easy to understand how it has grown historically, but the fact that every process still manages its own memory is a little absurd.
If your program __wants__ to manage its own memory, then that is simple: allocate a large (gc'd) blob of memory and run an allocator in it.
The problem is that the current view has it backwards.
This is like saying to an OS all file descriptors are just integers.
It's just that programs tend to want to manage objects with sub-page granularity (as well as on separate threads in parallel), and at that level there are infinitely many possible access patterns and reachability criteria that a GC might want to optimize for.
When a process requests additional pages be added to its address space, they remain in that address space until the process explicitly releases them or the process exits. At that time they go back on the free list to be re-used.
GC implies "finding" unused stuff among something other than a free list.
The job of the OS is to virtualize resources, which it does (including memory).
I doubt GC would work on file descriptors either. How could an OS tell when scanning through memory if every 4 bytes is a file descriptor it must keep alive, or an integer that just happens to have the same value?
Not to mention that file descriptors (and pointers!) may not be stored by value. A program might have a set of fds and only store the first one, since it has some way to calculate the others, eg by adding one.
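Both failure modes show up in a few lines. A sketch of a hypothetical conservative scan over a process's memory, where `conservative_fd_scan` and the memory layout are invented for illustration:

```python
import struct


def conservative_fd_scan(memory, open_fds):
    """Treat every 4-byte little-endian word equal to an open fd as a live reference."""
    found = set()
    for (word,) in struct.iter_unpack("<i", memory):
        if word in open_fds:
            found.add(word)
    return found


open_fds = {3, 4, 5}

# The program stores only its base fd (3) and computes fd 4 as base + 1.
# It no longer uses fd 5, but an unrelated counter happens to hold the value 5.
memory = struct.pack("<4i", 3, 5, 1000, -1)

kept_alive = conservative_fd_scan(memory, open_fds)
print("kept alive:", sorted(kept_alive))
print("would be closed:", sorted(open_fds - kept_alive))
```

Here fd 5 is kept alive by a coincidence, which is merely a leak, while fd 4 would be closed out from under the program, which is a correctness bug.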