The Parallelism Blues: when faster code is slower(pythonspeed.com) |
The Parallelism Blues: when faster code is slower(pythonspeed.com) |
Julia has composible multithreading, and using that model fixed composing FFTW threads with Julia's. This can be done to OpenBLAS as well, and IIRC there is a PR open for it.
My laptop's 9980HK will boost to ~4.5 GHz when only loaded to a single core.
However, when I load up all 8 cores, it might only sustain ~3.5 GHz.
Therefore the 8 cores might not actually result in the work being completed 8 times as fast, only 6.2x (8*[3.5/4.5]) real-time due to the lowered clock rate of each individual core.
This will show up as additional user time, since each individual core is able to do less work for each unit of time (seconds) compared to the single-core case.
Basically impossible by Ahmdal's law.
The only thing I can think of would be that the additional threads would kick the CPU into using a higher frequency, but a single thread using 100% of the CPU should already do that.
The title is also misleading; it suggests that the wall clock time might be longer for parallel code in certain cases. While not impossible, that isn't what the article covers.
That is exactly the case, if CPU is the bottleneck in your already-parallel application. It's a case where we really shouldn't be layering different parallel bits together in one codebase, but might be doing it naively.
Parallelism is specifically the stuff that actually does happen completely independently on all processing units, that actually goes Nx as fast on N units (clock depression aside). Concurrency refers to the overhead of coordinating activity of those units, that keeps you from getting your Nx. It is overhead on top of any actually serial parts of the computation, which Amdahl's law addresses.
In other words: Parallelism giveth, and concurrency taketh away.
The distinction gets more useful the more you think about the subject.
I'm on my phone, so rather than trying to type out an explanation, I'm going to link to Wikipedia: https://en.wikipedia.org/wiki/Hyper-threading
You can also encounter a funny software engineering effect. Refactoring an array-based algorithm to work in parallel often involves the introduction of block-decomposition. This change allows sub-problems to be spread to different workers. Sometimes, this change will also accelerate the sequential code, as the new block-oriented access patterns improve cache efficiency!
Basically, given two small arrays A and B (such that both fit in the CPU caches), if the computations to do are sum(A) * sum(B) and sum(B) / sum(A), having two threads that do the computations will be more than twice as fast as a single thread. In the two thread case, A and B will be fetched in parallel (assuming left-to-right evaluation), incurring only one round of pipeline stalls for data fetching, whereas the single threaded case will have to incur two pipeline stalls from memory fetching.