What scientists must know about hardware to write fast code (2020) (viralinstruction.com)

226 points by goerz 2 years ago | 74 comments

morcus 2 years ago |

In college (for a time) I was a double major in CS and Physics. I found a job as a programmer at a Physics lab, which fit my interests very well. The previous person roughly showed me the ropes for just a few days before she left to go to grad school.

The PI started asking me to run some analyses on a raw dataset. Since I was so new at it, I often messed up and had to rerun the whole thing after looking at the output; this was painful because the entire script took a few hours to run.

I started poking around to see whether it could be optimized at all. The raw data was divided up into hundreds of files from different runs, sensors, etc., that were each processed independently in sequence, and the results were all combined together into a big array for the final result. Seems reasonable enough.

Except this code was all written by scientists, and the combination was done in the "naive" way - after each of the data files was processed, a new array was created and the previous results were copied into the new array, as were the results from the current data file. This meant that for the iterations at the end, we roughly needed to have Memory = 2 * Size of final data, which eventually exceeded the amount of physical memory on the machine (and because there were so many data files, it was doing this allocation and copying dozens of times after it used all the RAM).

I updated this to pre-allocate the required size at the beginning for a very very easy 3-4 fold improvement in the overall runtime and felt rather proud of myself.
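
A minimal sketch of the two combining strategies described above, written in Python rather than the lab's actual language. The function names, the use of plain lists, and the sample data are assumptions for illustration; it also assumes the size of each file's result is known before the final array is allocated.

```python
def combine_naive(chunks):
    """Grow the result by building a brand-new list after every chunk."""
    result = []
    for chunk in chunks:
        # Allocates a new list and copies everything accumulated so far,
        # so the old and new results coexist in memory during the copy.
        result = result + chunk
    return result


def combine_preallocated(chunks):
    """Allocate the final list once, then write each chunk into its slot."""
    total = sum(len(chunk) for chunk in chunks)
    result = [0.0] * total  # single allocation of the final size
    offset = 0
    for chunk in chunks:
        # Same-length slice assignment overwrites in place without resizing.
        result[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    return result


# Hypothetical stand-in for per-file results: 200 "files" of 1000 values each.
chunks = [[float(i)] * 1000 for i in range(200)]
assert combine_naive(chunks) == combine_preallocated(chunks)
```

With k chunks, the naive version re-copies the accumulated data k times, so the total copying grows roughly quadratically with the number of files, while the pre-allocated version copies each value once.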

scottLobster 2 years ago | |

Yeah, back in college I worked with a Biochemistry grad student on a group project that involved some coding (I was Computer Engineering). To iterate over a matrix, he used three nested loops with an if-statement to switch between rows and columns. Technically it worked but was wildly inefficient, and he was proud of it...

To his credit, once I (as nicely as possible) showed him how to do it with two nested for-loops he clearly felt stupid and conceded the point. He was otherwise a very smart guy and good to work with, but it goes to show how we can take our training for granted. Even freshman-level stuff goes over the heads of PhDs, and I'm sure the same would be true if I were to drop into a biochem lab.

Rayhem 2 years ago | | |

Similar story - a PI had written some code to convert from (row, column) indices of the upper triangle of a matrix (made somewhat tricky by excluding the main diagonal) to a linear index. He used a for loop to start from the beginning and count up, an O(n^2) algorithm - I was able to give him an O(1) constant-time formula to do the same thing for a rather dramatic speedup.
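
One way such a formula can look, as a sketch rather than the formula actually used: it assumes 0-based indices, an n x n matrix, and row-major ordering of the strict upper triangle, none of which the comment specifies. Both function names are made up for illustration.

```python
def index_by_counting(i, j, n):
    """O(n^2): walk the strict upper triangle in row-major order until (i, j)."""
    k = 0
    for row in range(n):
        for col in range(row + 1, n):
            if (row, col) == (i, j):
                return k
            k += 1
    raise ValueError("expected 0 <= i < j < n")


def index_closed_form(i, j, n):
    """O(1): count the entries in the rows above i, then the offset within row i."""
    if not 0 <= i < j < n:
        raise ValueError("expected 0 <= i < j < n")
    # Row r holds n - 1 - r entries, so rows 0..i-1 hold i*n - i*(i+1)/2 in total.
    return i * n - i * (i + 1) // 2 + (j - i - 1)


# The two versions agree on every valid (i, j) pair of a 6 x 6 matrix.
n = 6
for i in range(n):
    for j in range(i + 1, n):
        assert index_by_counting(i, j, n) == index_closed_form(i, j, n)
```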

meow_cat 2 years ago | | |

During my masters thesis in a chemistry lab, I got a side task to look at a data analysis script and make it run faster. It was a "C/C++" code (i.e. procedural C-style code using C++ stdlib for convenience) that read a file line by line and then fed it to a slow processing function, then aggregated the results. It took over a day to run.

Without even looking at the processing function, which I considered some sciency science, I set up pthreads and mutexes on the result array and such to reap almost perfectly linear scaling. So far, so good.

Then I ran a profiler to see what was actually taking so long.

... Uh, why are you spending all this time copying strings back and forth?

Turns out they passed all strings by value. Sprinkling in a few const & here and there got a 1000-fold speedup or such. I felt pretty stupid for my multithreading antics after that.

tetris11 2 years ago | |

Having continuity from your previous analysis is a feature though. You can load up one of your objects and a good library should have the exact parameters used to generate it stored there.

Also, H5 data formats[0] have been a god-send for scientific computing, due to their inherent ability to make sense of how to store your data. You can have your previous results carried over into your new analysis without doubling your data.

0: https://en.wikipedia.org/wiki/Hierarchical_Data_Format

ericmcer 2 years ago | |

Could you have achieved the same thing by just calling delete/free() on the old arrays after copying them? I suck at manually allocating memory, and have always worked in garbage collected languages where this wouldn't really be an issue.

morcus 2 years ago | | |

No, it was actually in an (obscure scientific) garbage collected language. The syntax was roughly: `allOrbitFiles = allOrbitFiles + currentOrbitFiles`.

I believe what was roughly happening under the hood was:

1. Allocate an array `tmp` of size `length of allOrbitFiles` + `length of currentOrbitFiles`.
2. Copy data from `allOrbitFiles` over to `tmp`.
3. Copy data from `currentOrbitFiles` to `tmp`.
4. Reassign `allOrbitFiles` to the new array `tmp`.
5. Garbage collect the old `allOrbitFiles`.

So the doubling of memory usage comes after Step 1. I would imagine (but don't know for sure) that this would actually occur in any garbage collected language I'm familiar with as well (Java, Python, Javascript).
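
For Python lists, at least, the `+` form does behave as guessed. The sketch below uses hypothetical variable names and checks object identity to show that `a = a + b` binds the name to a newly allocated list, while `a += b` extends the existing one.

```python
all_results = [1, 2, 3]
current = [4, 5]

before = all_results
all_results = all_results + current  # new list; both inputs are copied into it
rebound_to_new_list = all_results is not before
# 'before' still holds the old data at this point. It can only be
# reclaimed once the last reference to it goes away.

in_place = [1, 2, 3]
alias = in_place
in_place += current                  # list.__iadd__ extends the existing list
extended_in_place = in_place is alias

assert rebound_to_new_list
assert extended_in_place
```

The in-place form can still reallocate its internal storage as the list grows, but it does not build a full second copy on every append step the way the `+` form does.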

havercosine 2 years ago |

Solid post. It also shows how powerful Julia is: it lets you operate at different levels of abstraction (down to seeing the assembly) using the same set of tools.

a1o 2 years ago |

I don't know what is used to render this post, but on mobile the floating table-of-contents icon would work better near the bottom left. At the right there is a scrollbar floating on top of it that makes it hard to tap, and the eyes mostly rest on the top of the screen anyway.

SinePost 2 years ago |

It is quite refreshing to see software optimization be explained so simply and elegantly.

dang 2 years ago |

What scientists must know about hardware to write fast code (2020) - https://news.ycombinator.com/item?id=29601342 - Dec 2021 (29 comments)

kylepdm 2 years ago | |

FYI the underlying link in that previous discussion post seems to be defunct and kind of suspicious.

dang 2 years ago | | |

Ok, I've disabled the link at the top and posted https://news.ycombinator.com/item?id=37759249. Thanks!

cpach 2 years ago | | |

Looks like biojulia.net got taken over by spammers :(

OnlyMortal 2 years ago |

Having been asked to port CERN C++ code to the Mac, I can tell you that some scientists don’t know or even care about performance.

For those folks, getting the output they need is much more important than the CPU cycles - as it should be.

As a C++ programmer, I posed the question as to why they don’t hire coders to do this for them. The answer was cost which rather surprised me given the cost of the LHC.

amadio 2 years ago | |

This is not true. Maybe a PhD student doesn't care much (or doesn't know), but we care deeply about software performance at CERN. I've worked myself on optimizations in detector simulation and data analysis software (Geant4 and ROOT) for a few years. Later in this decade, when HL-LHC comes online, the only way to be able to cope with the 10x increase in data rate from experiments and a matching increase in simulation requirements will be to optimize as much as we can the software we have, because we will not have the money to just buy 10x the hardware we have now.

OnlyMortal 2 years ago | | |

It came via Bristol University. Make of that what you will.

prhcbsc 2 years ago |

Adding onto multithreading, other parallelization models such as OpenMP or OmpSs take sequential code and parallelise it. They delegate the efficient execution of the code to a runtime system, to achieve a balance between programmer productivity and code performance.

But for large problems the article falls short. Scientific applications may need to use several computers at a time. COMP Superscalar (COMPSs) is a task-based programming model which aims to ease the development of applications for distributed infrastructures. COMPSs programmers do not need to deal with the typical duties of parallelization and distribution, such as thread creation and synchronization, data distribution, messaging or fault tolerance. Instead, the model is based on sequential programming, which makes it appealing to users that either lack parallel programming expertise or are looking for better programmability. Other popular frameworks such as LEGION offer a lower-level interface.
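
To give a flavour of "sequential-looking code, runtime does the scheduling", here is a small sketch using Python's standard `concurrent.futures`. This is not the COMPSs or OpenMP API, and it runs on a single machine only; `process_file` and `run_all` are hypothetical names standing in for a per-file analysis step.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(file_id):
    """Hypothetical stand-in for the per-file analysis step."""
    return sum(range(file_id * 1000))


def run_all(file_ids, workers=4):
    # The body reads like a sequential map. The executor (the "runtime")
    # creates the worker threads, hands out tasks, and returns the
    # results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, file_ids))


results = run_all(range(8))
```

In standard CPython builds, threads like these mainly help with I/O-bound tasks because of the global interpreter lock; the point of the sketch is the programming model, not the speedup.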

Delk 2 years ago |

That's a good writeup with a lot of general knowledge on program optimization. It might get a bit dense at times with the details of x86 assembly, but I suppose it might be worth it if performance is important enough that understanding e.g. data dependencies between subsequent instructions pays off.

A minor detail I find a bit confusing, though, is explaining the potential benefits of SMT/hyperthreading with an example where threads are spending some of their time idle (or sleeping).

I don't know Julia so I don't know if sleep is implemented with busy-waiting or something there, but generally if a thread is put to sleep, the thread gets blocked from being run until the timer expires or the sleep is interrupted. The operating system doesn't schedule the blocked thread for running on the CPU in the first place, so a thread that's sleeping is not sharing a CPU core with another thread that's being executed.

So the example does not finish 8 jobs almost as fast as 4 or 1 jobs using 4 cores due to SMT; it's rather that half of the time each of the threads is not even being scheduled for running. A total of eight concurrent jobs/threads works out to approximately four of them being eligible to run at a time, matching the four physical cores available.

If there are only four concurrent jobs/threads, each sleeping half of the time, you end up not utilizing the four cores fully because on average two of the cores will be idle with no thread scheduled.

AFAIK SMT should only really be beneficial in cases of stalls due to CPU internal reasons such as cache misses or branch mispredictions, not in cases of threads being blocked for I/O (or sleeping).

The post is of course correct in that the example computation benefits from a higher number of concurrent jobs because of each thread being blocked half of the time. However, that's unrelated to SMT.

Considering how meticulous and detailed the post generally is, I think it would make sense to more clearly separate SMT from the benefits of multithreading in case of partially I/O-bound work.
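
The effect is easy to reproduce outside Julia. A rough sketch in Python, with a made-up function name: jobs that only sleep are blocked rather than running, so many of them finish in about the time of one, independent of core count or SMT.

```python
import threading
import time


def run_sleeping_jobs(n_jobs, sleep_seconds):
    """Start n_jobs threads that only sleep; return the wall-clock time taken."""
    threads = [threading.Thread(target=time.sleep, args=(sleep_seconds,))
               for _ in range(n_jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


one = run_sleeping_jobs(1, 0.1)
eight = run_sleeping_jobs(8, 0.1)
# Compare 'one' and 'eight': the sleeping threads overlap almost
# completely, because a blocked thread does not occupy a core.
```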

jakobnissen 2 years ago | |

Author here. You are right - in this case Julia's scheduler would only run half the threads. The example is poor and I will find another.

Thanks for the heads up!

eande 2 years ago |

Is there a similar aggregation website example for C++?

Narishma 2 years ago | |

Most of this applies to C++, or any other compiled language.

jschveibinz 2 years ago |

This subject is taught in undergrad computer architecture courses along with machine coding. As an EE, I learned it in grad school.

eesmith 2 years ago | |

Congratulations. Studying one field means you know that field.

This link is not meant for you. It is meant for a scientist, and most scientists do not also have an EE degree or CS degree.

How much graduate level biology, oceanography, physics, geology, chemistry, meteorology, or other scientific field do you know?

All of those have subfields where computational performance is important. My experience is scientists are more likely to pick up the software skills than EEs are willing to pick up the science background. (In part because scientific software development generally pays less well than commercial software development.)

mjan22640 2 years ago | | |

Math was always a must for a scientist; today, computer science is also a must. The study programmes should reflect that.

aeturnum 2 years ago | |

While I could certainly put most of this together from my undergrad CS education, I would not say this "subject [was] taught" to me during undergrad. Instead, as with much of undergrad, you get pieces of it along the way - but collecting it together and writing for a somewhat-lay audience has a lot of value. This is also more up to date than my undergrad education from ~14 years ago! It has clear explanations for things like Hyperthreading, which existed at the time I was in undergrad, but hadn't really made its way into the curriculum yet.

jschveibinz 2 years ago | |

Well, apparently I offended a lot of people by merely providing a piece of information about computer architecture curriculum. I apologize for commenting.

willprice89 2 years ago | | |

With no context about your emotional state, it came off as a bit brusque, and I think most people read comments like that in a snarky, condescending tone.

If you had prepended the comment with something like "I love this topic!" to show enthusiasm or approval, you probably would have gotten a much different response.

tempodox 2 years ago | |

Not everyone does.