The Windows malloc() implementation from MSVCRT is slow (erikmcclure.com)

211 points by blackhole 4 years ago | 182 comments

TonyTrapp 4 years ago |

Not that it helps here, but Microsoft never considered the MSVCRT that ships with Windows to be public API. This is not the "Windows allocator", this is the (very) old MSVC runtime library's allocator. Of course that doesn't keep anyone from using this library because it's present on any Windows system, unlike the newer MSVC versions' runtime library. Using the allocator from a later MSVC's runtime library would provide much better results, as would writing a custom allocator on top of Windows' heap implementation.

MSVCRT basically just exists for backwards compatibility. It's impossible to improve this library at this point.

FreakLegion 4 years ago | |

There's also UCRT, which ships with the OS since Windows 10. The logic of this rant was a real head-scratcher. If you must blame one side, it's LLVM. Fragmentation of C runtimes is annoying but inescapable. Glibc for example isn't any better.

mananaysiempre 4 years ago | | |

> Fragmentation of C runtimes is annoying but inescapable. Glibc for example isn't any better.

Glibc very much is better.

There cannot be more than a single version of a (tightly coupled) ld.so+libc.so pair in a given address space any more than there can be more than a single KERNEL32 version in a given address space, and given that some system services are exclusively accessible via dynamic linking (accelerated graphics, for one), this forces essentially everything on a conventional Linux desktop to use a single version of a single C runtime (whether Glibc or Musl). No need to guess whether you need CRTDLL or MSVCRT or MSVCR70 or MSVCR71 or ... or UCRT (plus which compiler-version-specific overlay?): you need a sufficiently recent libc.so.6.

I cannot say I like this design (in some respects I actually like the Windows one more), but it does force Glibc to have very strong back-compat mechanisms and guarantees; e.g. it includes a bug-compatible version of memcpy() for old binaries that depended on it working byte by byte. As far as I’m aware, this applies from the point GNU stopped pretending Linux did not exist and the “Linux libc” fork died, that is 1998 or thereabouts.

(There are a myriad reasons why old Linux binaries won’t run on a new machine, but Glibc isn’t one of them.)

This is not to say that Glibc is perfect. Actually building a backwards-compatible binary that references old symbol versions is a gigantic pain: you either have to use a whole old environment, apply symbol versioning hacks on top of a new one and pray the resulting chimera works, or patch antique Glibc sources for new compilers and build a cross toolchain. But if you already have an old binary, it is better.

ChrisSD 4 years ago | | |

The UCRT has even been present since Windows 7, if users keep up with updates. Or if applications bundle the UCRT installer with their own.

leajkinUnk 4 years ago | | |

Could you elaborate why Glibc isn't any better?

I remember some funny problems with Glibc, like, 20 years ago, but it's been invisible to me (as a user) since then. You get a new Glibc, old binaries still work, it's fine.

cryptonector 4 years ago | | |

Eh, no, this is strictly a Windows problem. On Windows every DLL can be statically linked each with its very private copy of the MSVCRT, which means -for example- that you'd better never ever pass one DLL's malloc()'ed memory pointers to another DLL's free().

On Unix systems (and Linux) this sort of thing can only ever happen if you have a statically-linked application that is linked with libdl and then it dlopen()s some ELF -- then you need the application's C library to be truly heroic. (Solaris stopped supporting that insanity in Solaris 10, though glibc apparently still supports it, I think?)

Shadonototra 4 years ago | | |

what has LLVM to do with software development on windows?

it's because developing for windows is cancer that people fall back to LLVM and others

it's all on microsoft for not cleaning their mess

even apple offers a better story than the platform for "developers, developers, developers, developers"

a shame, a well deserved shame, a hall of shame

jart 4 years ago | |

It's effectively mandatory. Microsoft provides about twelve different C Runtimes. But if you're building something like an open source library, you can't link two different C runtimes where you might accidentally malloc() memory with one and then free() with the other. If you want to be able to pass pointers around your dynamic link libraries, you have to link the one C runtime everyone else uses, which is MSVCRT. Also worth mentioning that on Windows 10 last time I checked ADVAPI32 links MSVCRT. So it's pretty much impossible to not link.

TonyTrapp 4 years ago | | |

It isn't mandatory. I have never actively linked against MSVCRT on Windows. From my experience it's mostly software that isn't built with Visual Studio that uses MSVCRT, or software that takes extreme care of its binary size (e.g. 64k intros). MSVCRT is not even an up-to-date C runtime library. You wouldn't be able to use it for writing software requiring C11 library features without implementing them somewhere on top of it.

It's true that you cannot just happily pass pointers around and expect someone else to be able to safely delete your pointer - but that is why any serious library with a C interface provides its own function to free objects you obtained from the library. Saying that this is impossible without MSVCRT implies that every software needs to be built with it, which is not even remotely the case. If I wanted, I could build all the C libraries I use with LLVM and still link against them in my application compiled with the latest MSVC runtime or UCRT.

The much bigger problem is mixing C++ runtimes in the same piece of software, there you effectively must guarantee that each library uses the same runtime, or chaos ensues.
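The "library provides its own function to free objects" pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from any real library: `mylib_buffer`, `mylib_buffer_create` and `mylib_buffer_destroy` are hypothetical names. The point is only that allocation and deallocation both happen inside the library, so both use whichever C runtime the library was linked against.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library type; callers only ever hold a pointer to it. */
typedef struct mylib_buffer {
    size_t len;
    char  *data;
} mylib_buffer;

/* Allocates with the library's own C runtime. Returns NULL on failure. */
mylib_buffer *mylib_buffer_create(const char *text)
{
    assert(text != NULL);
    mylib_buffer *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->len = strlen(text);
    b->data = malloc(b->len + 1);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    memcpy(b->data, text, b->len + 1);  /* copies the terminating NUL too */
    return b;
}

/* Frees with the same runtime that allocated. Safe to call with NULL. */
void mylib_buffer_destroy(mylib_buffer *b)
{
    if (b == NULL)
        return;
    free(b->data);
    free(b);
}
```

A caller would pair every `mylib_buffer_create` with a `mylib_buffer_destroy` and never pass the pointer to its own `free()`, which is what makes it irrelevant which runtime the application itself uses.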

kazinator 4 years ago | | |

C applications targeting Windows must provide their own C library with malloc and free (if they are using the "hosted implementation" features of C).

MSVCRT.DLL isn't the library "everyone" uses; just Microsoft programs, and some misguided freeware built with MinGW.

Even if ADVAPI32.DLL uses MSVCRT.DLL, it's not going to mistakenly call the malloc that you provide in your application; Windows DLLs don't even have that sort of global symbol resolution power.

I would be very surprised if any public API in ADVAPI32 returns a pointer that the application is required to directly free, or accept a pointer that the application must malloc. If that were the case, you'd have to attach to MSVCRT.DLL with LoadLibrary, look up those functions with GetProcAddress and call them that way.

Windows has non-malloc allocators for sharing memory that way among DLLs: the "Heap API" in KERNEL32. One component can HeapAlloc something which another can HeapFree: they have to agree on the same heap handle, though. You can use GetProcessHeap to get the default heap for the process.

It may be that the MSVCRT.DLL malloc uses this; or else it's based on VirtualAlloc directly.

plonk 4 years ago | | |

Our programs ship their DLL dependencies in their own installer anyway, like most others on Windows. Just ship your FOSS library with a CMake configuration and let the users build it with whatever runtime they want.

garaetjjte 4 years ago | |

>but Microsoft never considered the MSVCRT that ships with Windows to be public API

It was in the past. At first msvcrt.dll was the runtime library used up to Visual C++ 6. Later, VC++ moved to their own separate dlls, but you could still link with system msvcrt.dll using corresponding DDK/WDK up to Windows 7.

I'm also not sure that this is just an ancient library left for compatibility: some system components still link to it, and msvcrt.dll itself seems to link with UCRT libraries.

TonyTrapp 4 years ago | | |

> It was in the past. At first msvcrt.dll was the runtime library used up to Visual C++ 6.

At that time it was already a big mess, because at first it was the runtime library of Visual C++ 4 in fact! The gory details are here: https://devblogs.microsoft.com/oldnewthing/20140411-00/?p=12...

> some system components still link to it

Some system components themselves are very much ancient and unmaintained and only exist for backwards compatibility as well.

ComputerGuru 4 years ago | | |

I don’t think msvcrt is exposed to link against in the DDK anymore. I maintain this, with the caveat that you really need to know what you’re doing: https://github.com/neosmart/msvcrt.lib

Sesse__ 4 years ago | |

Win32 has an allocator (HeapAlloc), and it is similarly slow and scales poorly under concurrency, even if you enable the newer stuff like LFH.

bjourne 4 years ago |

Well... Who told you to link to MSVCRT (the one in System32)? Not Microsoft that's for sure. New software is supposed to link to the Visual Studio C runtime it was compiled with and then ship that library alongside the application itself. Even if you don't compile with VS you can distribute the runtime library (freely downloadable from some page on microsoft.com). Ostensibly, that library contains an efficient malloc. If you willingly link to the MSVCRT that Microsoft has for over a decade stated is deprecated and should be avoided, you are shooting yourself in the foot.

"Windows is not a Microsoft Visual C/C++ Run-Time delivery channel" https://devblogs.microsoft.com/oldnewthing/20140411-00/

rayiner 4 years ago |

I wonder how much of this is the development culture at MS. https://www.theregister.com/2022/05/10/jeffrey_snover_said_m... (“When I was doing the prototype for what became PowerShell, a friend cautioned me saying that was the sort of thing that got people fired.”)

In that environment I can imagine nobody wants to be on the hook for messing with something fundamental like malloc().

The complete trash fire that is O365 and Teams—for some reason the new Outlook kicks you out to a web app just to manage your todos—suggests to me that Microsoft may be suffering from a development culture that’s more focused on people protecting fiefdoms than delivering the best product. I saw this with Nortel before it went under. It was so sclerotic that they would outsource software development for their own products to third party development shops because there was too much internal politics to execute them in house.

barrkel 4 years ago |

Windows doesn't have a malloc. The API isn't libc like conventional Unix and shared libraries on Windows don't generally expect to be able to mutually allocate one another's memory. Msvcrt as shipped is effectively a compatibility library and a dependency for people who want to ship a small exe.

qsdf38100 4 years ago | |

Note that Windows has HeapAlloc and HeapFree, which provide all the functionality to trivially implement malloc and free.

The C runtime is doing exactly that, except it adds a bit of bookkeeping on top of it IIRC. And in debug builds it adds support for tracking allocations.
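As a rough illustration of "malloc on top of HeapAlloc plus a bit of bookkeeping", here is a minimal sketch. It is not how any Microsoft C runtime is actually implemented; the names `my_malloc`, `my_free` and `my_live_bytes` are hypothetical, the bookkeeping (a size header and a live-byte counter) is an assumed example, and the counter is not thread-safe. On Windows the backend uses `HeapAlloc`/`HeapFree` on the default process heap; elsewhere a stand-in backend is used so the sketch still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
/* Backend on Windows: the default process heap from KERNEL32. */
static void *backend_alloc(size_t n) { return HeapAlloc(GetProcessHeap(), 0, n); }
static void  backend_free(void *p)   { HeapFree(GetProcessHeap(), 0, p); }
#else
#include <stdlib.h>
/* Stand-in backend (assumption) so the sketch builds on other systems. */
static void *backend_alloc(size_t n) { return malloc(n); }
static void  backend_free(void *p)   { free(p); }
#endif

/* Bookkeeping header placed in front of every block. It is padded to
 * 16 bytes so the pointer handed to the caller keeps 16-byte alignment
 * whenever the backend returns 16-byte-aligned memory. */
typedef union block_header {
    size_t        size;
    unsigned char pad[16];
} block_header;

static size_t live_bytes;  /* illustrative counter; not thread-safe */

void *my_malloc(size_t n)
{
    if (n > SIZE_MAX - sizeof(block_header))
        return NULL;                       /* size computation would overflow */
    block_header *h = backend_alloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->size = n;
    live_bytes += n;
    return h + 1;                          /* user memory starts after the header */
}

void my_free(void *p)
{
    if (p == NULL)
        return;
    block_header *h = (block_header *)p - 1;
    assert(live_bytes >= h->size);
    live_bytes -= h->size;
    backend_free(h);
}

size_t my_live_bytes(void) { return live_bytes; }
```

A debug build could extend the header with a file/line tag to track where each allocation came from, which is the kind of tracking the comment above alludes to.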

barrkel 4 years ago | | |

VirtualAlloc is a better base for a custom memory allocator. It's closer to mmap + mprotect in functionality.

There's also CoTaskMemAlloc (aka IMalloc::Alloc). And COM automation has a bunch of methods which allocate memory for dynamically sized data, which could be abused for memory allocation - SafeArrayCreate, SysAllocString.

evmar 4 years ago |

The other inaccuracies in this article have already been covered. I noticed there was also a weird rant about mimalloc in there ("For some insane reason, mimalloc is not shipped in Visual Studio").

My understanding is mimalloc is basically a one-person project[1] from an MSR researcher in support of his research programming languages. It sounds like it's pretty nice, but I also wouldn't expect it to be somehow pushed as the default choice for Windows allocators.

[1]: https://github.com/microsoft/mimalloc/graphs/contributors

bcbrown 4 years ago |

Seeing someone refer to any piece of software technology as a "trash fire" makes it harder for me to view them as credible. It's unnecessarily divisive and insulting, and it means it's unlikely they will have any appreciation of the tradeoffs present during initial design and implementation.

dang 4 years ago | |

We've replaced the baity wording with more representative language from the article, in keeping with the HN guideline: "Please use the original title, unless it is misleading or linkbait; don't editorialize."

https://news.ycombinator.com/newsguidelines.html

trollied 4 years ago |

The Factorio team were looking at a performance bug recently & tracked it down to similar: https://forums.factorio.com/viewtopic.php?f=7&t=102388

https://developercommunity.visualstudio.com/t/mallocfree-dra...

InfiniteRand 4 years ago | |

The thread on the second link gives some clue as to why things are the way they are

eska 4 years ago | |

So Microsoft changed the malloc behavior for UWP apps, but not desktop apps. In other words they saw it as problematic enough to change it but then say it’s not a bug for the other case. Schizophrenic.

DHowett 4 years ago |

I'm curious whether the "new"(ish) segment heap would address some of the author's issues.

It's poorly documented, so I can't find a reference explaining what it is on MSDN save for a snippet on the page about the app manifests[1]. There's some better third-party "documentation"[2] that gets into some specifics of how it works, but even that is light on the real-world operational details that would be helpful here.

Chrome tried it out and found[3] it to be less than suitable due to its increased CPU cost, which might presage what Erik would see if they enabled it.

[1] https://docs.microsoft.com/en-us/windows/win32/sbscs/applica...

[2] (PDF warning) https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Wi...

[3] https://bugs.chromium.org/p/chromium/issues/detail?id=110228...

MarkSweep 4 years ago | |

The one other piece of “documentation” that I know of is this blog post:

https://blogs.windows.com/windowsexperience/2020/05/27/whats...

It mentions that the segment heap is used by default for UWP apps and reduces memory usage of Edge.

denkshom 4 years ago |

This rant was rather devoid of relevant technical detail.

I mean, why exactly is the malloc of the compatibility msvcrt so slow compared to newer allocators? What is it doing?

An analysis of that would have been some actual content of interest.

chrisseaton 4 years ago |

So why is it a trash fire? It's just slow? Or is there something else wrong with it? I thought the author was going to say it did something insane or was buggy somehow.

Someone 4 years ago | |

Also, is it slow because it’s badly implemented, or is it better than other mallocs in some other respect? Maybe, dating from decades ago, it’s better in the memory usage front?

softwaredoug 4 years ago |

My knowledge is like 10 years old - For a long time, Microsoft's stl implementation was based on their licensing of dinkumware's STL (https://www.dinkumware.com/). Not something maintained in house. It seemed to work OK'ish - giving lowest common denominator functionality. However, it was pretty easy to create higher performing specialized data structures for your use case than what seemed like simple uses of dinkumware STL.

garaetjjte 4 years ago | |

malloc is not related to the STL. As for the STL itself, a big issue with Microsoft's implementation is that it is atrociously slow in debug builds.

Sesse__ 4 years ago |

Just wait until you try to use it from multiple threads at the same time!

eps 4 years ago | |

Not sure what your usage was exactly, but the Heap API works really well in this context.

So much so that beating it with a custom allocator is a real challenge.

Sesse__ 4 years ago | | |

I had a system that was sped up by 30%+ on Windows by switching from HeapAlloc to jemalloc. Profiling showed that HeapAlloc was largely stuck in a single giant lock. (This was on Windows Server 2016, IIRC.) And that wasn't even that allocation-heavy in the large scale of it; most of memory was done through arena allocations, but a few larger buffers were not.

shaggie76 4 years ago |

I wonder if he was running with the debugger attached; we also saw atrocious performance with MSVCRT malloc until we set _NO_DEBUG_HEAP=1 in our environment.

moonchild 4 years ago |

> it basically represents control flow as a gigantic DAG

Control flow is not a DAG.

spatulon 4 years ago | |

You're not wrong.

I guess they're just trying to say that LLVM's control-flow graph is implemented as individually heap-allocated objects for nodes, and pointers for edges. (I haven't looked at the LLVM code, but that sounds plausible).

Even if those allocations are fast on Linux/Mac, I wonder whether there are other downsides of that representation, for example in terms of performance issues from cache misses when walking the graph. Could you do better, e.g. with a bump allocator instead of malloc? But who knows, maybe graph algorithms are just inherently cache-unfriendly, no matter the representation.
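For readers unfamiliar with the term, a bump allocator for graph nodes could look something like the sketch below. This says nothing about how LLVM actually allocates its nodes; `arena`, `arena_alloc` and `cfg_node` are hypothetical names, and the node layout (an id plus up to two successors) is an assumption made for illustration. Each allocation only advances an offset into one contiguous buffer, and everything is released at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One contiguous buffer; allocation just advances `used`. */
typedef struct arena {
    unsigned char *base;
    size_t         cap;
    size_t         used;
} arena;

/* Returns nonzero on success. */
int arena_init(arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap  = a->base ? cap : 0;
    a->used = 0;
    return a->base != NULL;
}

/* `align` must be a power of two. Returns NULL when the arena is full. */
void *arena_alloc(arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + (align - 1)) & ~(align - 1);
    if (start < a->used || start > a->cap || size > a->cap - start)
        return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Frees every node at once; individual nodes are never freed. */
void arena_release(arena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}

/* Hypothetical control-flow-graph node with up to two successors. */
typedef struct cfg_node {
    int              id;
    struct cfg_node *succ[2];
} cfg_node;

cfg_node *cfg_node_new(arena *a, int id)
{
    cfg_node *n = arena_alloc(a, sizeof *n, _Alignof(cfg_node));
    if (n != NULL) {
        n->id = id;
        n->succ[0] = n->succ[1] = NULL;
    }
    return n;
}
```

Nodes created back to back end up adjacent in memory, which is the property that might help with cache behaviour when walking the graph; whether it helps in practice depends on the traversal order.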

3836293648 4 years ago | |

Pretty sure they mean the AST is a DAG

tick_tock_tick 4 years ago | |

For it to be a DAG you'd have to solve the halting problem, wouldn't you?

remram 4 years ago | | |

No. Knowing whether programs end in finite time (dependent on input) doesn't mean all programs end or that all programs end in constant time.

The halting problem is also not considered unsolved (though P=NP is unsolved).

pshirshov 4 years ago | |

Well, why?

AshamedCaptain 4 years ago | | |

Directed _Acyclic_ Graph. Control flow graph has loops.
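To make that concrete, here is a small sketch that checks a control-flow graph for cycles with a depth-first search. The `cfg` type and its adjacency-matrix layout are assumptions chosen for brevity, not a representation any compiler is known to use. A straight-line program has no cycle, while a `while` loop introduces a back edge from the loop body to the loop header.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 8

/* Tiny CFG: edge[a][b] is true when block a can jump to block b. */
typedef struct cfg {
    int  nblocks;
    bool edge[MAX_BLOCKS][MAX_BLOCKS];
} cfg;

enum { UNSEEN, ON_STACK, DONE };

/* Depth-first search; an edge into a block still on the DFS stack
 * is a back edge, which means the graph contains a cycle. */
static bool dfs_finds_back_edge(const cfg *g, int b, int *state)
{
    state[b] = ON_STACK;
    for (int s = 0; s < g->nblocks; s++) {
        if (!g->edge[b][s])
            continue;
        if (state[s] == ON_STACK)
            return true;
        if (state[s] == UNSEEN && dfs_finds_back_edge(g, s, state))
            return true;
    }
    state[b] = DONE;
    return false;
}

bool cfg_has_cycle(const cfg *g)
{
    assert(g->nblocks >= 0 && g->nblocks <= MAX_BLOCKS);
    int state[MAX_BLOCKS] = { UNSEEN };  /* UNSEEN is 0, so all entries start unseen */
    for (int b = 0; b < g->nblocks; b++)
        if (state[b] == UNSEEN && dfs_finds_back_edge(g, b, state))
            return true;
    return false;
}
```

With blocks entry(0) -> header(1) -> body(2) -> header(1) and header(1) -> exit(3), `cfg_has_cycle` reports a cycle, so that graph is directed but not acyclic.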

somerando7 4 years ago |

> I was taught that to allocate memory was to summon death itself to ruin your performance. A single call to malloc() during any frame is likely to render your game unplayable. Any sort of allocations that needed to happen with any regularity required writing a custom, purpose-built allocator, usually either a fixed-size block allocator using a freelist, or a greedy allocator freed after the level ended.

Where do people get their opinions from? It seems like opinions now spread like memes - someone you respect/has done something in the world says it, you repeat it without verifying any of their points. It seems like gamedev has the highest "C++ bad and we should all program in C" community out there.

If you want a good malloc impl just use tcmalloc or jemalloc and be done with it
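For context, the "fixed-size block allocator using a freelist" from the quoted passage is a small amount of code. The sketch below is one possible shape, with hypothetical names (`pool`, `pool_alloc`, `pool_free`) and arbitrarily chosen sizes; it is single-threaded, and blocks are only guaranteed to be aligned for pointer-sized data.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCK_SIZE  64   /* bytes per block (arbitrary for this sketch) */
#define POOL_BLOCK_COUNT 128  /* number of blocks (arbitrary for this sketch) */

/* A free block stores the link to the next free block in its own memory. */
typedef union pool_block {
    union pool_block *next;
    unsigned char     payload[POOL_BLOCK_SIZE];
} pool_block;

typedef struct pool {
    pool_block  blocks[POOL_BLOCK_COUNT];
    pool_block *free_list;
} pool;

/* Threads every block onto the free list, lowest index first. */
void pool_init(pool *p)
{
    p->free_list = NULL;
    for (size_t i = POOL_BLOCK_COUNT; i-- > 0; ) {
        p->blocks[i].next = p->free_list;
        p->free_list = &p->blocks[i];
    }
}

/* O(1): pops the head of the free list. Returns NULL when exhausted. */
void *pool_alloc(pool *p)
{
    pool_block *b = p->free_list;
    if (b == NULL)
        return NULL;
    p->free_list = b->next;
    return b;
}

/* O(1): pushes the block back onto the free list. */
void pool_free(pool *p, void *ptr)
{
    if (ptr == NULL)
        return;
    pool_block *b = ptr;
    assert(b >= p->blocks && b < p->blocks + POOL_BLOCK_COUNT);
    b->next = p->free_list;
    p->free_list = b;
}
```

Both operations are a couple of pointer writes with no locking and no search, which is why this design was popular for per-frame allocations; the trade-off is that every object must fit in one block size.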

KerrAvon 4 years ago |

Has everyone forgotten that Unix is the common ancestor of Linux and every other Unixlike? I’m seeing an uptick of people writing nonsensical comments like “this was written for Linux (or Mac OS X, which implements POSIX and is therefore really Linux in drag)”.

jchw 4 years ago | |

No... That's why they had the parenthetical. The problem is, your computer probably doesn't boot the common ancestor. If you're writing UNIX-like stuff, most likely it boots macOS or Linux. If you're cool maybe it's one of the other modern BSD variants aside from macOS. In practice there's a pretty low probability that your code also runs on all POSIX-compliant operating systems, and more honest/experienced people often don't kid themselves into thinking that they're seriously targeting that. Even if you believe it, you probably have some dependency somewhere that doesn't care, like Qt for example. Saying something like "Linux (or macOS, which is similar)" is a realization that you're significantly more likely to be targeting both Linux and macOS than you are to even test on BSD. And to solidify that point, note that lots of modern CI platforms don't even have great BSD support to begin with.

Of course, there is a semantic point here. macOS nominally really is UNIX, except for when someone finds out it's not actually POSIX compliant due to a bug somewhere every year or so. Still, it IS UNIX. But what people mostly run with that capability, is stuff that mostly targets Linux. So... yeah.

Of course it is true that some people really think macOS is actually Linux, but that misunderstanding is quite old by this point.

addendum: I feel like I haven't really done a good job putting my point across. What I'm really saying is, I believe most developers targeting macOS or Linux today only care about POSIX or UNIX insofar as they result in similarities between macOS and Linux. That macOS is truly UNIX makes little difference; if it happened to differ in some way, developers would happily adjust to handle it, just like they do for Linux which definitely isn't UNIX.

copperx 4 years ago | |

Apparently yes, because all I ever hear is "macOS is like Linux" and even "macOS is really Linux behind the scenes" from less enlightened people.

naniwaduni 4 years ago | |

Well, a pretty big part of the point of Linux is that it's not a Unix-descendant, just a Unix-clone.

pcl 4 years ago | | |

What’s the distinction between those two?

avgcorrection 4 years ago | |

I just call Linux+Mac+BSDs Unix (not “Unix-like” and certainly not that “*nix” nonsense). I don’t respect Unix enough to be perfectly precise with it.

pjmlp 4 years ago |

There is no Windows malloc(). Only UNIXes have the C API as part of the OS API.

spc476 4 years ago | |

malloc() is defined by the C Standard. If you want to claim your compiler is ANSI or ISO certified, you need to support malloc() (as well as the rest of the C Standard library).

pjmlp 4 years ago | | |

Quite right, except we are then talking about compilers and not OS APIs.

UNIXes are the only OSes where there is an overlap between OS APIs and libc due to C's origin.

jart 4 years ago | |

malloc() isn't part of the Linux API which provides mmap().

plorkyeran 4 years ago | | |

Libc being just a library is indeed one of the ways that Linux is unlike Unix.

pjmlp 4 years ago | | |

Since we are getting pedantic, Linux isn't a UNIX.

fguerraz 4 years ago |

"Don't use spinlocks in user-land."

eska 4 years ago | |

He only did it as a workaround for a performance issue in the mutex.

oddity 4 years ago |

If you're depending on the performance of malloc, you're either using the language incorrectly or using the wrong language. There is no such thing as a general purpose anything when you care about performance, there's only good enough. If you are 1) determined to stick with malloc and 2) want something predictable and better, then you are necessarily on the market for one of the alternatives to the system malloc anyway.

mwcampbell 4 years ago | |

The whole point of the article, though, was that the system malloc was good enough on Linux and Darwin.

oddity 4 years ago | | |

This misses the point of my comment. When you put faith in malloc, you're putting hope in a lot of heuristics that may or may not degenerate for your particular workload. Windows is an outlier with how bad it is, but that should largely be irrelevant because the code should have already been insulated from the system allocator anyway.

An over-dependence on malloc is one of the first places I look when optimizing old C++ codebases, even on Linux and Darwin. Degradation on Linux + macOS is still there, but more insidious because the default is so good that simple apps don't see it.

jeffbee 4 years ago | | |

There isn't really a "system malloc on Linux". Many distributions come with the GNU allocator based on ptmalloc2, but there is no particular reason that a distro could not come out of the box with any other allocator. The world's most widespread Linux distribution uses LLVM's Scudo allocator. Alpine Linux comes with musl's (unbelievably slow) allocator, although it is possible to rebuild it with mimalloc.