Threadripper 3990X: The Quest To Compile 1B Lines Of C++ On 64 Cores

Threadripper 3990X: The Quest To Compile 1B Lines Of C++ On 64 Cores (blogs.embarcadero.com)

230 points by fmxexpress 5 years ago | 178 comments

Fun experiment.

The more pedestrian 5950X or the now bargain 3950X are great for anyone doing a lot of compiling. With the right motherboard they even have ECC RAM support. Game changer for workstations in the $1000–$2000 range.

The more expensive Threadripper parts really shine when memory bandwidth becomes a bottleneck. In my experience, compiling code hasn’t been very memory bandwidth limited. However, some of my simulation tools don’t benefit much going from 8 to 16 cores with regular Ryzen CPUs because they’re memory constrained. Threadripper has much higher memory bandwidth.

ska 5 years ago | |

I suspect the biggest build-time benefit to most C++ workflows and toolchains was the move to ubiquitous SSDs. Before that, in my experience, and excepting dedicated build machines with expensive RAID arrays, it was really easy to build a system that would always be IO bound on builds. There were of course tricks to improve things, but you still tended to hit that wall unless your CPUs were really under spec.

edit: to be clearer, I'm not thinking of dedicated build machines here (hence the RAID comment) but of the overall impact on dev time from getting local builds a lot faster.

PragmaticPulp 5 years ago | | |

SSDs help, but nothing beats core count X clock speed when compiling.

Source code files are relatively small and modern OSes are very good at caching. I ran out of SSD on my build server a while ago and had to use a mechanical HDD. To my surprise, it didn’t impact build times as much as I thought it would.

lumost 5 years ago | | |

You know I wonder how much of an impact this has had on the recent move back to statically typed and compiled languages vs. interpreted languages. I had assumed most of the compilation speedups were due to enhancements to the compiler toolchain - but my local laptop moving from 100 IOPS to > 100k IOPS and 3GB/s throughput may have more to do with it.

trhway 5 years ago | | |

I (and some teammates) actually put HDDs in some workstations, because SSDs just die after 2-3 years of active builds on them. With modern HDDs you have practically unlimited storage, while you can only fit a limited number of 400 GB builds on an SSD (the org has psychological barriers to having more than a 1-2 TB SSD in a machine), and SSDs start to have perf issues at 70-80% capacity. With HDDs the build time didn't change much: the machines have enough memory (256-512 GB RAM) for the system to cache a lot.

fctorial 5 years ago | | |

Don't SSDs have a finite TBW? 50GB of writes every day (possible on large projects) will consume that in a couple of months.

KingOfCoders 5 years ago | | |

We had ultra fast HDDs as developers with sound proof housings because they were so loud. Glad for SSDs.

masklinn 5 years ago | |

> The more expensive Threadripper parts really shine when memory bandwidth becomes a bottleneck.

Threadripper can be useful for IO, especially for C++ (which is famously quite IO intensive): owing to its 128 PCIe lanes, you can RAID0 a bunch of drives and get absolutely ridiculous IO.

bmurphy1976 5 years ago | |

Where can you get decent ECC ram for a reasonable price? I was on the hunt recently for ECC RAM for my new desktop and I gave up and pulled the trigger on low latency non-ECC RAM. Availability seems to be pretty terrible at the moment.

55873445216111 5 years ago | | |

You can get ECC UDIMMs from Supermicro. They are rebranded Micron DIMMs. ECC memory is not going to reach frequencies as high as you might be looking for. It will only go up to the officially validated speed of the CPUs. https://store.supermicro.com/16gb-ddr4-mem-dr416l-cv02-eu26....

mng2 5 years ago | | |

Kingston has some, I got a couple of KSM32ED8/16HD recently. They are 3200 CL22, though they probably have some room to tighten up timings.

m463 5 years ago | | |

same for everything. 5950? where would you even get one?

wait - seems you can get one, just pay 2x list price.

PartiallyTyped 5 years ago | |

IIRC AMD has ECC support in all x70/x50 motherboard and CPU combinations. If I may, what kind of simulations are you running?

I am trying to build a system for Reinforcement Learning research, and seeing as many things depend on Python, I am not certain how best to optimise the system.

peferron 5 years ago | |

Yep, with permanent WFH due to the pandemic I started working on a desktop with 5950X + 64 GB memory and it's been a huge upgrade over my work laptop (and probably any laptop available at the moment).

It's much quieter under load as well.

ahepp 5 years ago |

>C++Builder with TwineCompile is a powerful productivity solution for multi-core machines compiling 1 million lines of code very quickly and can work better than the MAKE/GCC parallel compilation Jobs feature due to it’s deep IDE integration

You're claiming this plugin has deeper IDE integration than `make`? I find that really, really difficult to believe. And if it's true, it seems like the solution is to either use a better IDE, or improve IDE support for the de facto standard tools that already exist, as opposed to writing a plugin for the -j flag.

fmxexpress 5 years ago | |

Yes, it could be as simple as having Dev-C++ run a build every time a file is saved. Currently it does not do this. Remember, Dev-C++ didn't have -j support at all until I added it. TwineCompile does do this (background compile). Therefore the IDE is providing this functionality and it has nothing really to do with make or the compiler.

TwineCompile is not a plugin wrapping the -j flag. It is a separate thing entirely unique to C++Builder. It does offer integration with MSBuild though.

The second part of that was the falloff. With the 1 million size files it only ever used half of the cores, and with each successive round of compiles it would use even fewer cores. TwineCompile didn't seem to have that problem, but this post was not about TwineCompile vs. MAKE -j, so I did not investigate this further.

I was expecting MAKE/GCC to blow me away and use all 64 cores full bore until complete and it did not do this.

klodolph 5 years ago | |

Make forces you to choose between being able to do full parallel builds and using recursive make; you can’t do both.

StillBored 5 years ago | | |

I might be misunderstanding something, but the common gnumake does no such thing.

https://www.gnu.org/software/make/manual/html_node/Job-Slots...

zajio1am 5 years ago | | |

Why would you do recursive make? That setup has been discouraged for decades ...

formerly_proven 5 years ago |

[Not "real" C++ code, benchmark is for compiling 14492754 copies of a fairly simple C function]
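For scale, the copy count implies a function of roughly 69 lines. A quick sketch of that arithmetic, assuming the target was exactly 1 billion lines (the thread does not state the function's real length):

```python
# Figures from the thread: a 1-billion-line target and 14,492,754 copies.
# Assumption: the target was exactly 1e9 lines.
target_lines = 1_000_000_000
copies = 14_492_754

lines_per_copy = target_lines / copies
print(f"{lines_per_copy:.2f} lines per copy")
```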

blt 5 years ago | |

Yeah, I would say this title is a little misleading. The example doesn't use any of the C++ features that cause long compile times, like templates and the STL.

jandrese 5 years ago | |

Seems like he tried more complex examples too, but ran into roadblocks like a 2GB limit on executables and running into a commandline length limit restriction that dates back to early DOS days which made it impossible to link.

Both of those problems seemed solvable if he was willing to chunk up his application into libraries, maybe 1024 files per library then linked to the main application.

simcop2387 5 years ago | | |

I believe this is one of the reasons for object libraries (or archives, the foo.a files on linux/unix), you can then link in all of the object files from one of those at link time without having to list them all at once. That won't get past the 2GB limit on executables but it will get past the command line length.

account42 5 years ago | | |

> a commandline length limit restriction that dates back to early DOS days which made it impossible to link.

MinGW's linker supports passing the list of objects as a file for this reason and CMake will use that by default.
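As a rough illustration of the response-file idea: GCC documents an `@file` argument that reads further command-line options from a file. The sketch below only builds the command and does not run it. The object names, `objects.rsp`, and `app.exe` are hypothetical, and paths containing spaces would need quoting inside the file.

```python
from pathlib import Path
import tempfile


def write_response_file(objects, path):
    """Write one object path per line and return the @file argument."""
    Path(path).write_text("\n".join(objects) + "\n")
    return f"@{path}"


# Hypothetical object names, for illustration only.
objects = [f"obj/unit_{i:05d}.o" for i in range(100_000)]

with tempfile.TemporaryDirectory() as tmp:
    rsp = Path(tmp) / "objects.rsp"
    arg = write_response_file(objects, rsp)
    # The link command stays short no matter how many objects are listed.
    link_cmd = ["g++", "-o", "app.exe", arg]
    inline_len = sum(len(o) + 1 for o in objects)
    short_len = sum(len(a) + 1 for a in link_cmd)
    print(f"inline: {inline_len} chars, with response file: {short_len} chars")
```

Listing the objects inline would need well over a million characters here, while the response-file form is a single short argument.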

jpaul23 5 years ago | |

Does there exist some kind of random C code generator?

gm 5 years ago |

That article mentioned Delphi and Object Pascal, and it brought back many fond memories. I absolutely LOVED Delphi and Object Pascal back in the day. So clean and so fun to program in. If Borland hadn't f-ed it up and had stayed around until now, I'd be the biggest Delphi fanboy.

Alas, that was not to be. Modern languages are fun and all, but not Delphi-back-in-the-day level fun :-).

nick__m 5 years ago | |

Two actively maintained versions of Delphi still exist: the original one maintained by Embarcadero, and an open-source one available at https://www.lazarus-ide.org/ .

dboat 5 years ago |

After liking this article, I wanted to check out others on the site, and am shocked at the terrible usability of their front page. I can't finish reading the titles of their articles before the page just keeps moving things around on me. It is so frustrating, which is unfortunate because I would otherwise have been interested to see more of their content. Experience completely ruined by awful design judgment.

peter_d_sherman 5 years ago |

This seems to be a little bit related to this quest for fast compilation:

The "mold" linker:

https://github.com/rui314/mold

>"Concretely speaking, I wanted to use the linker to link a Chromium executable with full debug info (~2 GiB in size) just in 1 second. LLVM's lld, the fastest open-source linker which I originally created a few years ago, takes about 12 seconds to link Chromium on my machine. So the goal is 12x performance bump over lld. Compared to GNU gold, it's more than 50x."

trhway 5 years ago |

Lucky sons of guns. We are stuck with Xeons. Have to wait 3 hours for our 20M lines of C/C++ on the 2x14-core Xeon machine after a pull/rebase. Ryzen/TR would probably be 2-3x faster for the same money, yet it is a BigCo, so no such luck (and our product is certified only for Xeons, so our customers can't run AMD either; thus we're de facto part of the Great Enterprise Wall blocking AMD from on-premise datacenters).

maccard 5 years ago | |

I upgraded from 2x 12-core Xeons to a 64-core Threadripper. Compile times dropped from 45m to 12m.

AshamedCaptain 5 years ago | |

Industrial software will always grow in size to use all available compilation time ... I have seen large Xeon distcc farms and the total build walltime was still measured in hours...

barkingcat 5 years ago |

There's a much easier way to bring 64 cores to their knees: Chromium takes a loooong time to compile.

mrlonglong 5 years ago | |

Two and a half hours on my trusty Threadripper 2920X. Firefox only takes 20 mins.

robinei 5 years ago |

This shows that if you are making a not-very-fast compiler (most compilers these days), then the much maligned C compilation model has some serious advantages on modern and future hardware, due to its embarrassingly parallel nature.

ianhanschen 5 years ago |

Great read. I wonder if the make -j modification wasn’t scaling things across all cores because it was using the physical core count (number of cores) versus the logical core count (number of core threads).

Or perhaps the code wasn’t modified to spread the work across all processor core groups (a Windows thing to support more than 64 logical cores).

https://bitsum.com/general/the-64-core-threshold-processor-g...

dboreham 5 years ago |

They finally got around to reusing mainframe model numbers.

andy_ppp 5 years ago |

Has anyone tried this on their JS test suite? The quicker the tests run, the better my life. I have around 2000 quite slow tests: 76s on a 2016 MacBook 15”, 30s on an M1 Apple Silicon Mac Mini. What should I expect with loads more cores like this?

nevi-me 5 years ago | |

How parallel do the tests run? The Threadrippers have a massive number of cores, but their per-core performance is lower than that of, say, a Ryzen 9.

Tade0 5 years ago |

The images remind me of "Bad Apple!" as displayed on a CPU load graph of an 896-core machine:

https://youtu.be/RY5_gutA_Vw

Yuioup 5 years ago |

Embarcadero? Are they still around?

colejohnson66 5 years ago | |

They’re still selling Delphi, for what it’s worth

tester756 5 years ago |

Just try to compile LLVM - maybe not 1b of LoC, but that's definitely going to be challenging

Daho0n 5 years ago |

Great article, but for the love of god don't use Passmark. They are extremely bad on AMD scores. Luckily these are two CPUs from AMD, so it isn't bad here, but it is a bad comparison site as they heavily favour Intel.

zelly 5 years ago |

On Linux I would just use Bazel. It can burn through 1B lines of code on all cores.

renewiltord 5 years ago |

Hahaha, fuck me, CPUs are fast. That's wicked. 15 mins. A billion lines of C. Insane. Wonder if there's some IO speed to be gained from ramdisking the inputs.

titzer 5 years ago | |

17,300 lines/sec per core. That's embarrassingly slow IMHO.
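For reference, that figure follows from the numbers quoted in this thread (1 billion lines, about 15 minutes, 64 cores). A quick sketch of the arithmetic, treating the 15 minutes as exact, which it almost certainly is not:

```python
# Figures quoted in the thread; the 15-minute wall time is approximate.
total_lines = 1_000_000_000
wall_seconds = 15 * 60
cores = 64

lines_per_second = total_lines / wall_seconds
lines_per_core_second = lines_per_second / cores

print(f"{lines_per_second:,.0f} lines/s overall")
print(f"{lines_per_core_second:,.0f} lines/s per core")
```

That lands a little above 17,000 lines per second per core, in line with the number above.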

bjoli 5 years ago | | |

That depends completely on what optimizations are being done.

But alas, I have said for some time that a fast compiler should be able to compile about 1 MLOC/s with some basic optimization work.

Macha 5 years ago | | |

It is pointed out that the Threadripper does worse per core under full load than even high core count consumer CPUs like the 3950X/5950X. That's the tradeoff you make for huge core count CPUs. 4x 3950X might do better, but then you need to build 3 other PCs and, for actual processing tasks, coordinate stuff to run across multiple systems.

bserge 5 years ago | | |

What can perform better?

muststopmyths 5 years ago |

Interesting. It would be cool to compare this against Visual Studio + Incredibuild, in my experience the most solid distributed C++ compilation tool.

bullen 5 years ago |

In my experience multi-core compilation does not work.

make -j>3 just locks the process and fails.

jcelerier 5 years ago | |

You just need more RAM. I never compile at less than -j$(ncpu). Hard with less than 32 GB though: a single clang instance can easily eat upwards of 1 GB of RAM.
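One way to act on this is to cap the job count by memory as well as by cores. A minimal sketch, assuming about 1 GB per compiler process (taken from the figure above as a floor; real peaks vary by codebase). `pick_jobs` is a hypothetical helper, not part of make or any build tool.

```python
import os


def pick_jobs(ram_gb, gb_per_job=1.0, cores=None):
    """Cap parallel jobs by both core count and available memory."""
    cores = cores or os.cpu_count() or 1
    # Assumption: every job needs roughly gb_per_job at its peak.
    by_memory = max(1, int(ram_gb // gb_per_job))
    return min(cores, by_memory)


print(pick_jobs(ram_gb=4, cores=8))    # limited by memory
print(pick_jobs(ram_gb=64, cores=16))  # limited by cores
```

The result would be passed as the value for `make -j`. Leaving some headroom for the OS and the linker would argue for an even lower number on small machines.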

bullen 5 years ago | | |

Aha, I only compile on ARM so I got no room to increase RAM...

Is it the same with g++? I have 4GB so I should be able to compile with 4 cores, but the processes only fill 2-3 cores even when I try make -j8 on an 8-core machine, and then it locks the entire OS until it craps out?!

Something is fishy...

coliveira 5 years ago |

It is a good thing that Embarcadero is keeping alive this technology to create desktop apps from the early 2000s that was abandoned by MS and other large companies in favor of complex Web-based apps.

dvfjsdhgfv 5 years ago | |

If only they had made Delphi Community Edition available a decade earlier...

cosmotic 5 years ago | |

Someone has to build the native electron wrapper

dvfjsdhgfv 5 years ago | | |

If you mean wrapping native widgets, this wouldn't solve much - you would still need some language to take care of the logic, like a JavaScript engine. At this point just using Electron is simply easier for devs, and as much as we hate it, realistically speaking it's still better than nothing.

phendrenad2 5 years ago | |

Qt is still around and doing well. Some people still need desktop apps.

solinent 5 years ago |

> 1B Lines of C++

Seems like our code is inflating quite rapidly. I remember when 1M was the biggest project. /snark

throwaway81523 5 years ago |

How many times are they going to repeat the search phrases like "one billion lines"? It's reached the point where SEO obstructs human readability. It was cool that Object Pascal (maybe a descendant of Turbo Pascal) compiled 1e9 lines of Pascal in 5 minutes on the 64 core box. Scrolling way through the article, it looks like they had enough trouble setting up their parallel Windows C++ build environment on 64 cores that they ended up running 4 instances on 16 cores each, and splitting the source files among the instances. The build then took about 15 minutes on 64 cores, which is faster than I'd have expected.
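The splitting step itself is simple to picture. A minimal sketch of dealing source files out to 4 build instances round-robin; this thread does not say how the article actually divided its files, so the helper and the file names here are hypothetical.

```python
def split_round_robin(files, instances):
    """Deal files into `instances` buckets whose sizes differ by at most one."""
    buckets = [[] for _ in range(instances)]
    for i, name in enumerate(files):
        buckets[i % instances].append(name)
    return buckets


# Hypothetical file names; the article's real layout is not shown here.
files = [f"src/file_{i:04d}.c" for i in range(10)]
buckets = split_round_robin(files, 4)
for n, bucket in enumerate(buckets):
    print(f"instance {n}: {len(bucket)} files")
```

Each bucket would then be handed to its own build process. Round-robin only balances file counts, so it suits the case here where every file is about the same size.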

This all seems kind of pointless since distributed C++ compilation has been a thing for decades, so they could have used a cluster of Ryzens instead of "zowie look at our huge expensive single box".

czbond 5 years ago |

1B Lines? And this is just from a "rails new" command. Had to for some levity.

einpoklum 5 years ago |

A Billion lines, eh?

  int
  main
  ()
  {
    /* 
     _______ _     _       _               _                                                               
    |__   __| |   (_)     (_)             | |                                                              
       | |  | |__  _ ___   _ ___    __ _  | | ___  _ __   __ _   _ __  _ __ ___   __ _ _ __ __ _ _ __ ___  
       | |  | '_ \| / __| | / __|  / _` | | |/ _ \| '_ \ / _` | | '_ \| '__/ _ \ / _` | '__/ _` | '_ ` _ \ 
       | |  | | | | \__ \ | \__ \ | (_| | | | (_) | | | | (_| | | |_) | | | (_) | (_| | | | (_| | | | | | |
       |_|  |_| |_|_|___/ |_|___/  \__,_| |_|\___/|_| |_|\__, | | .__/|_|  \___/ \__, |_|  \__,_|_| |_| |_|
                                                          __/ | | |               __/ |                    
                                                         |___/  |_|              |___/                   
    */
    return 0;
  }