AMD-powered Frontier supercomputer breaks the exascale barrier(tomshardware.com) |
It kind of shows the difference in priority spending, when nuclear labs get >1000 petaflop super computers, and the weather service (that helps with disasters that affect many Americans each year) gets a new one that is 1.2% of the speed.
https://www.noaa.gov/media-release/us-to-triple-operational-....
Research spending is based on the potential for discovery. As a species we have studied weather since the beginning of time. How long have we been doing nuclear research? A century?
Is there even an opportunity cost here? Or is it an economy of scale? As we build more supercomputers the costs go down. So NOAA and ORNL both get what they need for less.
The US is way behind on weather modelling, in part due to lack of computing power available to do the grids at sufficiently small cells compared to Europe and other parts of the world. That means less accurate predictions and less advance notice of impending disasters, which means more risk of loss of life and impact on infrastructure and the economy (and vice versa, inaccuracy can lead to more caution than is necessary, which has economic impact too). The US has to lean on Europe etc. for predictions.
https://cliffmass.blogspot.com/2020/02/smartphone-weather-ap...
Talks about the fact that IBM / Weather.com actually uses a more accurate system than the NWS uses, because the NWS is still stuck on GFS (been several years now since congress passed an act to force NOAA to update away from it, and unfortunately it takes time)
Even at a 3-day lead time, GFS was still suggesting landfall for hurricane Sandy outside the New York region; the longer lead times provided by other centers (with more computing power) were very important for preparation [1].
Even on the science side, increased computing power enables a host of new discoveries. Even storing the locations for all the droplets in a small cloud would require an excessive amount of memory, let alone doing any processing [2]. Increased computer power enables us to better understand how clouds respond to their environment, which is a key uncertainty in predicting climate change.
Many disciplines of meteorology are also much newer than nuclear physics. Cloud physics (for example) only really got started with the advent of weather radar (so the 1940s). Before that, even simple questions (such as can a cloud without any ice in it produce rain?) were unknown.
Even today, we still have difficulty seeing into the most intense storms. You cannot fly an aircraft in there, and radar has difficulty distinguishing different types of particle (ice, liquid, mushy ice, ice with liquid on the surface, snow) and is not good at counting the number of particles either.
Even after thousands of years, we are only just now getting the tools to understand it. There is a lot left to discover about the weather!
[1] - https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/201...
[2] - https://www.cloudsandclimate.com/blog/clouds_and_climate/#id...
"An estimate of future HPC needs should be both demand-based and reasonable. From an operational NWP perspective, a four-fold increase in model resolution in the next ten years (sufficient for convection-permitting global NWP and kilometer-scale regional NWP) requires on the order of 100 times the current operational computing capacity. Such an increase would imply NOAA needs a few exaflops of operational computing by 2031. Exascale computing systems are already being installed at Oak Ridge National Laboratory (1.5 exa floating point operations per second (EF)) and Argonne Labs (1.0 EF) and it is likely that these national HPC laboratories will approach 100 EF by 2031. Because HPC resources are essential to achieving the outcomes discussed in this report, it is reasonable for NOAA to aspire to a few percent of the computing capacity of these other national labs at a minimum. Substantial investments are also needed in weather research computing. To achieve a 3:1 ratio of research to operational HPC, NOAA will need an additional 5 to 10 EF of weather research and development computing by 2031. Since research computing generally does not require high-availability HPC, it should cost substantially less than operational HPC and should be able to leverage a hybrid of outsourced, cloud and excess compute resources."[1]
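As a rough sanity check on the "four-fold resolution, on the order of 100 times the compute" figure quoted above, here is a minimal Python sketch. It assumes a common rule of thumb, which is not necessarily how the report arrived at its number: cost grows with the number of grid points in each refined dimension, times one more factor because the time step has to shrink along with the grid spacing. The function name and parameters are hypothetical.

```python
def compute_multiplier(resolution_factor, refined_dims=2, refine_timestep=True):
    """Rule-of-thumb cost growth when grid spacing shrinks by resolution_factor.

    Assumption: cost scales with the grid-point count in each refined
    dimension, plus one extra factor if the time step shrinks proportionally.
    """
    exponent = refined_dims + (1 if refine_timestep else 0)
    return resolution_factor ** exponent


# Refining only the two horizontal dimensions by 4x
horizontal_only = compute_multiplier(4)               # 4**3 = 64
# Refining the vertical as well
with_vertical = compute_multiplier(4, refined_dims=3)  # 4**4 = 256

print(horizontal_only, with_vertical)
```

Under these assumptions the multiplier lands between 64 and 256, which is at least consistent with "on the order of 100".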
[1]https://sab.noaa.gov/wp-content/uploads/2021/11/PWR-Report_2...
Cynicism is unwarranted, but it fits the current zeitgeist, biases and feels good.
Would you prefer the research being performed based on empirical testing instead of running simulations?
I can't speak for NOAA, but my experience with supercomputing has been that there is no abstraction of computation, your workload is very much tied to hardware assumptions.
They aren't just used for global-scale geophysical processes like weather and climate or complex physics simulations. For example, oil companies rent time to analytically reconstruct the 3-dimensional structure of what's underneath the surface of the Earth from seismic recordings.
In fact, such use accounts for the vast majority of the compute use.
Also, yesterday Tom's hardware had a detailed article: https://www.tomshardware.com/news/amd-powered-frontier-super... 29 MW total, 400 kW per rack(!)
And, is anyone else like me and wants to see actual pictures or videos of the supercomputer, instead of a rendering like in the venturebeat article? Well, head here, ORNL has a very short video: https://www.youtube.com/watch?v=etVzy1z_Ptg We can see, among other things, that it's water-cooled (the blue and red tubing), and at 0m3s we see a PCB labelled "Cray Inc Proprietary ... Sawtooth NIC Mezzanine Card"
Surely the people at these labs will want to run ordinary DL frameworks at some point - or do they have the money and time to always build entirely custom stacks?
[1] AMD Instinct MI250x in this case.
And there are another 2 (3?) faster systems coming online in the next year or so.
The AMD GPUs with the CDNA ISA have surpassed in energy efficiency both the NVIDIA A100 GPUs and the Fujitsu ARM with SVE CPUs, which had been the best previously.
Unfortunately, AMD has stopped selling at retail such GPUs suitable for double-precision computations.
Until 5 or 6 years ago, the AMD GPUs were neither the fastest nor the most energy-efficient, but they had by far the best performance per dollar of any devices that could be used for double-precision floating-point computations.
However, when they made the transition to RDNA, they separated their gaming and datacenter GPUs. The former are useless for DP computations and the latter cannot be bought by individuals or small companies.
Looking at “double-precision GFlops” columns there [1] they don’t seem terribly bad, more than twice as fast compared to similar nVidia chips [2]
While specialized extremely expensive GPUs from both vendors are way faster with many TFlops of FP64 compute throughput, I wouldn’t call high-end consumer GPUs useless for FP64 workloads.
The compute speed is not terribly bad, and due to some architectural features (ridiculously high RAM bandwidth, RAM latency hiding by switching threads) in my experience they can still deliver a large win compared to CPUs of comparable prices, even in FP64 tasks.
[1] https://en.wikipedia.org/wiki/Radeon_RX_6000_series#Desktop
[2] https://en.wikipedia.org/wiki/GeForce_30_series#GeForce_30_(...
Is it feasible to run eight models on one supercomputer, or is that inefficient?
https://status.alcf.anl.gov/theta/activity
And I believe it is more efficient to have a single large cluster, as there are large overhead costs of power, cooling, and having a physical space to put the machine in, plus a personnel cost to maintain the machines.
[1] - https://www.hpcwire.com/off-the-wire/nsf-announces-upcoming-...
Intel was supposed to build the first Exascale system for ANL [1] [2], to be installed by 2018. They completely and utterly messed up the execution, partly driven by the 10nm failure, went back to the drawing board multiple times, and now Raja has switched the whole thing to GPUs, a technology that Intel has no previous success with, and rebased it to 2 ExaFlops peak, meaning they probably expect 1 EF sustained performance, a 50% efficiency. No other facility would ever consider Intel as a prime contractor again. ANL hitched their wagon to the wrong horse.
1. https://www.alcf.anl.gov/aurora 2. https://insidehpc.com/2020/08/exascale-exasperation-why-doe-...
This is like the time the Athlon64 and its on-die memory controller were kicking the Pentiums around.
https://www.olcf.ornl.gov/frontier/
There's also detailed architecture specs on Crusher, an identical (but smaller) system:
https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide...
https://www.nextplatform.com/2021/10/26/china-has-already-re...
Why not? While I don't remember which US x86 cluster last topped the Top500 list (RoadRunner in 2009?), China's Tianhe-3 and OceanLight are direct successors of Tianhe-2A and TaihuLight, which were once the fastest and are still in the top 10. These seem more promising to me.
I once had a terrible experience with AMD ~10 years ago that made me swear off them for good. Had something to do with software but I remember it taking several days of work/solutions.
Willing to give them another try soon though. I never seem to even use the full power of whatever CPU I get, lol.
I should point out that there were significant USB problems on AMD B550, X570 chipsets (eventually addressed via BIOS updates).
Unfortunately some professional audio gear is only certified for use with Intel chipsets and I have experienced some deal-breaking latency issues with ASIO drivers. For gaming I will be happy to continue using AMD - but for music I will probably switch back to Intel for my next rig.
Also sums up my AMD experience 10 years ago. Stuff just wasn't working :/
Onward to a zettaflop around 2037?
Does this No. 1 position have something to do with the ban on exporting advanced technology to China?
You have a blend of very specific domain specific knowledge (e.g. they know the hardware - the interconnects more than the CPUs) and old skool Unix system administration.
Thinking about it, the most powerful supercomputer in the world is pretty much a million consumer processors, working in parallel. That's going to stay pretty constant, since cost scales roughly linearly.
If X is the processing power of $1k of consumer hardware, the bigger X gets, the less there is a difference in the class of problems that you can solve with X or X * 1e6 processing power.
Basically I'm estimating the benefit ratio to be (log SupercomputerSize - log ConsumerSize)/log ConsumerSize, and that keeps decreasing.
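To make the estimate above concrete, here is a small Python sketch of that ratio. The function name is hypothetical, and the factor of one million between consumer and supercomputer hardware is taken from the comment's own assumption; the sample FLOPS values are purely illustrative.

```python
import math


def benefit_ratio(consumer_flops, scale=1e6):
    """(log SupercomputerSize - log ConsumerSize) / log ConsumerSize.

    Assumes the supercomputer is always `scale` times the consumer machine.
    """
    supercomputer_flops = consumer_flops * scale
    return (math.log10(supercomputer_flops) - math.log10(consumer_flops)) / math.log10(consumer_flops)


# Illustrative consumer performance levels (FLOPS); not measurements.
for consumer in (1e9, 1e12, 1e15):
    print(f"{consumer:.0e}: {benefit_ratio(consumer):.3f}")
```

Because the numerator stays fixed at log10(1e6) = 6 while the denominator grows, the ratio falls as consumer hardware gets faster, which is the point being made.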
I understand you are joking, but it's a legitimate benchmark, one which I've seen at least Anandtech using. For instance, a quick web search found an article from last year (https://www.anandtech.com/show/16478/64-cores-of-rendering-m...) which shows an AMD CPU (a Ryzen 9) running Crysis without hardware acceleration at 1080p at nearly 20 FPS. As that article says, it's hard to go much higher than that, due to limitations of the Crysis engine.
- For practical use, and non overclocked, the EC12 at 5.5 Ghz: https://www.redbooks.ibm.com/redbooks/pdfs/sg248049.pdf
or
- An AMD FX-8370 floating in Liquid Nitrogen at 8.7 Ghz: https://hwbot.org/benchmark/cpu_frequency/rankings#start=0#i...
The real pain for us is that there’s no decent consumer grade chips with ROCm compatibility for us to do development on. AMD have made it very clear they only care about the data centre hardware when it comes to ROCm, but I have no idea what kind of developer workflow they’re expecting there.
(I work for a DOE lab but views are my own, etc.)
[1] As an example, see the approach in: https://github.com/flatironinstitute/cufinufft/pull/116
Hopefully AMD gets the Rx 6800xt working with ROCm consistently, but even then, the 6800xt is RDNA2, while the supercomputer's MI250x is closer to the Vega64 in more ways.
So all in all, you probably want a Vega64, Radeon VII, or maybe an older MI50 for development purposes.
I don't know about that. A lot of these labs are doing physics simulations and are probably happy to stick with their dense-matrix multiply / BLAS routines.
Deep learning is a newer thing. These national labs can run them of course, but these national labs have existed for many decades and have plenty of work to do without deep learning.
> or do they have the money and time to always build entirely custom stacks?
Given all the talk about OpenMP compatibility and Fortran... my guess is that they're largely running legacy code in Fortran.
Perhaps some new researchers will come in and try to get some deep-learning cycles in the lab and try something new.
The biggest challenge the national labs face is that there's not really any budget (or appetite) to rewrite software to take advantage of hardware features (particularly the GPU-based accelerator that's all the rage nowadays). You might be able to get a code rewritten once, but an era where every major HPC hardware vendor wants you to rewrite your code into their custom language for their custom hardware results in code that will not take advantage of the power of that custom hardware. OpenMP, being already fairly widespread, ends up becoming the easiest avenue to take advantage of that hardware with minimal rewriting of code (tuning a pragma doesn't really count as rewriting).
The most used linear algebra library is written in Fortran. There's nothing "legacy" about it; it's just that nobody was able to replicate its speed in C.
[1]: https://github.com/ECP-CANDLE/Benchmarks
[2]: https://www.exascaleproject.org/research-project/candle/
I quit after getting vaccinated for COVID, only stayed because of the pandemic.
The biggest problem was that Intel simply couldn't execute. They couldn't design and manufacture hardware in a timely manner without too many bugs. I think this was due to poor management practices. My direct manager was amazing, but my skiplevel was always dealing with fires. It felt like instead of the effort being orchestrated that someone approached a crowd of engineers and used a bullhorn to tell them the big goal and that was it. The left hand had no idea what the right hand was doing.
I often called Intel an 'ant hill', because the engineers would swarm a project just like ants do a meal. Some would get there and pull the project forward, some would get on top and uselessly pull upward, and more than I'd like would get behind the project and pull it backwards. Just a mindless swarm of effort, which generally inefficiently kinda did the right thing sometimes.
The inability to execute started to affect my work. When I got a ticket to complete something, I just wouldn't. There was a very good chance that I'd have an extra few weeks (due to slippage) or the task would never need to get done, because the hardware would never appear. Planning was impossible.
Conversely, sometimes hardware CAME OUT OF NOWHERE, not simple stuff, but stuff like laptops made by partners. Just randomly my manager would ask me to support a product we were told directly wouldn't exist, but now did. I needed to help our partner with support right now. Our partners were starting to hate us and it was palpable in meetings.
I'm so glad I quit, I was being worked to the bone on a project which will probably fail and be a massive liability. Even if the economy crashes, and I can't get a job for years, and end up broke, it'll still have been worth it. I also only made 110K/yr base.
Do you know if anything has changed at Intel? Is it reasonable to expect changes within a year and a half of starting on the job given the size of the company and the changes needed?
https://physics.stackexchange.com/questions/348854/parker-so...
When people talk about a supercomputer being 'fast' they generally mean FLOPS - floating point operations per second, which isn't clock-speed.
Multiplying the number of processors by the clock speed of the processors, and then multiplying that product by the number of floating-point operations the processors can perform per clock cycle, as is done for supercomputer FLOPS, does not help me :-)
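For reference, the usual theoretical-peak estimate multiplies core count, clock rate, and floating-point operations per core per cycle. The Python sketch below shows that arithmetic and why it says little about a serial task. All hardware numbers are hypothetical and chosen only for illustration; the function names are not from any library.

```python
def per_core_flops(clock_hz, flops_per_cycle):
    """Theoretical peak of a single core."""
    return clock_hz * flops_per_cycle


def peak_flops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak of the whole machine, assuming perfect parallelism."""
    return cores * per_core_flops(clock_hz, flops_per_cycle)


# Hypothetical desktop: 8 cores at 3.0 GHz, 16 FLOPs per cycle per core.
desktop_peak = peak_flops(8, 3.0e9, 16)

# A strictly serial job only ever sees one core, so a hypothetical
# 5.0 GHz desktop core beats a hypothetical 2.0 GHz cluster core
# no matter how many cores the cluster has.
desktop_core = per_core_flops(5.0e9, 16)
cluster_core = per_core_flops(2.0e9, 16)

print(desktop_peak, desktop_core > cluster_core)
```

So both views in the thread can hold at once: the aggregate number measures parallel throughput, while single-task latency depends on per-core speed.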
The video below compares 8150 against CPUs from 2020 (i.e. no 5900x or 12900KS), includes data from 8370.
The set of problems that fit into a single node is growing. At least in some fields where the added benefit of more data is less important than, say, more precise measurements.
1. As an undergraduate, join a research group that needs to run simulations on a supercomputer.
2. As a grad student, join a research group that works with supercomputers.
3. As a software engineer or IT person, join a research group at a university. They need people too, but fair warning: the pay is...subpar.
4. Join a national laboratory in some capacity. This route necessitates working for your country's government or military, which may or may not be palatable to you depending on how you feel about your gov't/military.
5. Join a giant multinational company that has supercomputers and uses them. Exxon is a good example. They have massive supercomputing power.
Unless you're an undergrad, I'm afraid all the ways I know of suck in some way or another. I did 1 & 3. As for the rest, I think 2 would make the most sense if you have BS, because you can go get a masters in a year or so while getting the experience.
Although keep in mind they have a lot of momentum that is going largely in the right way to begin with. They have among the best logic designers, circuit designers, EDA, silicon research and manufacturing process and technologies, and software division in the world. Despite Intel having had a > 5 year train wreck in their 10nm manufacturing technology, they're able to release CPUs which are for many cases among the best if not the best in the market which goes to show how far ahead they were and how good their design capability still is.
So I think the problem is both bigger and smaller than people think (i.e., they've not completely crashed and burned, but it won't be a matter of just wiping the slate clean and ordering the engineers to deliver on the next product).
I haven't sold the $INTC I got as comp, and that probably speaks louder than whatever I say here.
The money pouring into research disproves this.
In fact, it makes no sense at all to claim that our collective understanding of a phenomenon is already satisfactory and all research was already done within a few years of the first real-world test.
For context, the Oklahoma City bombing was a few decades into the past but it still motivates a great deal of research in multiple research paths, even though none of it is rocket science or involves cutting-edge physics.
Yea, because money always goes to the most important and most useful research! /s
I'm not entirely sure how it compared to the ECMWF model during last year's hurricane season, but I do think it's improved substantially.
Edit: I should say that their recommendation is to write the kernels in ‘hip’ which is supposed to be their cross device wrapper for both cuda or ROCm. I’m writing in Julia however so that’s not possible.
Ultimately, however, no system is measured by its clock rate- or by its cache size- or by its MIPS. Because no real workload is truly determined by a simple linear function of those variables.
Time to completion will depend on task.
One word: antitrust. The discrete GPU market these days consists of Nvidia and AMD, with Intel only just now dipping its toes into the market (I don't think there's anything saleable to retail customers yet). Nvidia buying AMD would make it a true monopoly in that market, and there's no way that would pass antitrust regulators. Nvidia recently tried to buy ARM, and even that transaction was enough for antitrust regulators to say no.
Nvidia actually used to develop chipsets for AMD processors, including onboard GPUs. They did for Intel as well, but they had a much more serious relationship with AMD in my estimation. This stopped with the ATI purchase: since ATI was Nvidia's main competitor, the two companies stopped working together. Intel later killed all 3rd-party chipsets altogether, and AMD had to do a lot of chipset work they weren't doing before.
I sometimes wonder what would have happened if they had merged back then. I personally think a Jensen Huang-run AMD would have done much better than AMD+ATI did in that era. I could easily see ATI having collapsed. What would the consoles use now? Would Nvidia have been as aggressive as it has been without the strategic weakness of not controlling the platform its products run on?
Sure, the new owners could re-negotiate with Intel, and maybe nothing would change. But who knows? A combined AMD/nVidia might be a sufficient threat to Intel they might pull some desperate moves.
(In some timeline, this turns out to be the boost that makes RISC-V the new "standard" ISA, but I am not so optimistic it is the one we live in.)
There really used to be a lot of intra-generational tweaking and refinement, like if you look back at Maxwell there were really at least 3 and I suspect 4 total steppings of the maxwell architecture (GM107, GM204/GM200, and GM206 - and I suspect GM200 was a separate "stepping" too due to how much higher it clocks than GM204 - which is the opposite of what you'd expect from a big chip). Kepler had at least 4 major versions (GK1xx, GK110B, GK2xx, GK210), Fermi had at least 2 (although that's where I'm no longer super familiar with the exact details).
Anyway point is there used to be a lot more intra-generational refinement, and I think that has largely stopped, it's just thrown over the wall and done. And I think the reason for that is that if NVIDIA really cranked full-steam ahead they'd be getting far enough ahead of AMD to potentially start raising antitrust concerns. We are now in the era of "metered performance release", just enough to stay ahead of AMD but not enough to actually raise problems and get attention from antitrust regulators.
Same thing for the choice of Samsung 8nm for Ampere and TSMC 12nm for Turing, while AMD was on TSMC 7nm for both of those. Sure, volume was a large part of that decision, but they're already matching AMD with a 1-node deficit (Samsung 8nm is a 10+, and the gap between 10 and TSMC 7 is huge to begin with) and they were matching with a 1.5 node deficit during the Turing generation (12FFN is a TSMC 16+ node - that is almost 2 full nodes to TSMC 7nm). They cannot just make arbitrarily fast processors that dump on AMD, or regulators will get mad, so in that case they might as well optimize for cost and volume instead. If they had done a TSMC 7nm against RDNA1 they probably would be starting to get in that danger zone - I'm sure they were watching it carefully during the Maxwell era too.
(The people who imagined some giant falling-out between NVIDIA and TSMC are pretty funny in hindsight. (A) NVIDIA still had parts at TSMC anyway, and (B) TSMC obviously couldn't have provided the same volume as Samsung did, certainly not at the same price, and volume ended up being a godsend during the pandemic shortages and mining. Yeah, shortages sucked, but they could still have been worse if NVIDIA was on TSMC and shipping half or 2/3rds of their current volume.)
Of course now we may see that dynamic flip with AMD moving to MCM products earlier, or maybe that won't be for another year or so yet; rumors are suggesting monolithic midrange chips will be AMD's first product. Or perhaps "monolithic", being technically MCM but with cache dies/IO dies rather than multiple compute dies. But with RDNA3 AMD is potentially poised to push NVIDIA a little bit, rather than just the controlled opposition we've seen for the past few generations, hence NVIDIA reportedly moving to TSMC N5P and going quite large with a monolithic chip to compete.
I am a maintainer for rocSOLVER (the ROCm LAPACK implementation) and I personally own an RX 6800 XT. It is very similar to the officially supported W6800. Are there any specific issues you're concerned about?
I know the software and I have the hardware. I'd be happy to help track down any issues.
I might be operating off of old news. But IIRC, the 6800 wasn't well supported when it first came out, and AMD constantly has been applying patches to get it up-to-speed.
I wasn't sure what the state of the 6800 was (I don't own it myself), so I might be operating under old news. As I said a bit earlier, I use the Vega64 with no issues (for 256-thread workgroups. I do think there's some obscure bug for 1024-thread workgroups, but I haven't really been able to track it down. And sticking with 256-threads is better for my performance anyway, so I never really bothered trying to figure this one out)
With respect to your issue running 1024 threads per block, if you're running out of VGPRs, you may want to try explicitly specifying the max threads per block as 1024 and see if that helps. I recall that at one point the compiler was defaulting to 256 despite the default being documented as 1024.
Like, it's always seemed like there's a certain amount of fatalism around Undefined Behavior in C/C++, like this is somehow how it has to be to write fast code but... it's not. You can just declare things as actually forbidden rather than just letting the compiler identify a boo-boo and silently do whatever the hell it wants.
Of course it's not the right tool for every task, I don't think you'd write bit-twiddling microcontroller stuff in fortran, or systems programming. But for the HPC space, and other "scientific" code? Fortran is a good match and very popular despite having an ancient legacy even by C/C++ standards (both have, of course, been updated through time). Little less flexible/general, but that allows less-skilled programmers (scientists are not good programmers) to write fast code without arcane knowledge of the gotchas of C/C++ compiler magic.
For a crude approximation, Fortran is somewhat equivalent to C code where all pointer function arguments are marked with the restrict keyword.
> Like, it's always seemed like there's a certain amount of fatalism around Undefined Behavior in C/C++, like this is somehow how it has to be to write fast code but... it's not. You can just declare things as actually forbidden rather than just letting the compiler identify a boo-boo and silently do whatever the hell it wants.
Well, it's kind of more dangerous than C in this respect. The aliasing restriction is a restriction on the Fortran programmer; the compiler or runtime is not required to diagnose it, meaning that the Fortran compiler is allowed to optimize assuming that two pointers don't alias.
That being said, in general I'd say Fortran has fewer footguns than C or C++, and is thus often a better choice for a domain expert who just wants to crunch numbers.
My understanding is that most supercomputers have the vendor provide their implementation of BLAS (e.g., if it's Intel-based, you're getting MKL) that's specifically tuned for that hardware. And these implementations stand a decent chance of being written in assembly, not Fortran.
The clearest form of this is in BLIS, which is a C framework you can drop your assembly kernel into, and then it makes a BLAS (along with some other stuff) for you. But the idea is also present in OpenBlas.
Lots of this is due to the legacy of gotoBlas (which was forked into OpenBlas, and partially inspired BLIS), written by the somewhat famous (in HPC circles at least) Kazushige Goto. He works at Intel now, so probably they are doing something similar.
Presumably that old Fortran code has survived many generations of ports: Connection Machine, DEC Alpha, Intel Itanium, SPARC and finally today's GPU heavy systems. The BLAS layer keeps getting rewritten but otherwise the bulk of the simulators still works.
The best BLAS libraries use C and Assembly. This is because BLAS is the de-facto standard interface for Linear Algebra code, and so it is worthwhile to optimize it to an extreme degree (given infinite programmer-hours, C can beat any language, because you can embed assembly in C).
But for those numerical codes which aren't incredibly hand-optimized, Fortran makes nice assumptions, it should be able to optimize the output of a moderately skilled programmer pretty well (hey we aren't all experts, right?).
The opposite relationship between many AMD GPUs and the available CPUs was true until 5-6 years ago, while NVIDIA had reduced the DP computation abilities of their non-datacenter GPUs many years before AMD, despite their previous aggressive claims about GPGPU being the future of computation, which eventually proved to be true only for companies and governments with exceedingly deep pockets.
The memory in graphics cards is an order of magnitude faster; my current one has 480 GB/sec of bandwidth. For this reason, even gaming GPUs can be much faster than CPUs on some workloads, even though the theoretical peak FP64 GFlops number is about the same.
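One way to see why bandwidth can matter more than peak FP64 is a simple roofline-style bound: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The Python sketch below is only illustrative. The 480 GB/s figure comes from the comment above; the 50 GB/s CPU bandwidth, the shared 500 GFLOPS peak, and the 0.125 FLOPs-per-byte intensity are assumptions picked for the example.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style upper bound: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


PEAK_GFLOPS = 500.0      # assumed identical FP64 peak for both devices
INTENSITY = 0.125        # assumed FLOPs per byte for a memory-bound kernel

gpu = attainable_gflops(PEAK_GFLOPS, 480.0, INTENSITY)  # 480 GB/s from the comment
cpu = attainable_gflops(PEAK_GFLOPS, 50.0, INTENSITY)   # assumed CPU bandwidth

print(gpu, cpu, gpu / cpu)
```

With these assumed numbers neither device gets near its compute peak, and the gap between them is set entirely by memory bandwidth.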
Nevertheless, many of the problems of this kind require more memory than the 8 GB or 16 GB that are available on cheap GPUs, so the CPUs remain better for those.
On the other hand, there are a lot of problems whose time-consuming part can be reduced to multiplications of dense matrices. During the solution of all such problems, the CPUs will reach a large fraction of their maximum computational speed, regardless whether the operands fit in the caches or not (when they do not fit, the operations can be decomposed into sub-operations on cache-sized blocks, and in such algorithms the cache lines are reused enough times so that the time used for transfers does not matter).
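The decomposition into cache-sized blocks can be sketched in plain Python as below. This is only an illustration of the loop structure under the assumption of row-major lists of lists; a tuned BLAS does the same tiling with hand-written kernels, and the function name and default block size here are hypothetical.

```python
def blocked_matmul(a, b, block=2):
    """Multiply dense matrices a (n x k) and b (k x m) tile by tile.

    Each (i0, j0, k0) tile touches only block-sized pieces of a, b and c,
    so in a compiled implementation those pieces can stay in cache and
    be reused many times before being evicted.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(blocked_matmul(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

The result is identical to the naive triple loop; only the order of memory accesses changes.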
I vaguely remember a consumer card having 1/4 the fp64 units of a similar data center one so that would get the 16x on paper.
Memory bandwidth / register file size would suggest another 2x from moving less data. My working heuristic on these things is compute is free because I fail to saturate the memory bus but no doubt some applications do actually run into that slowdown in practice. Matrix multiply probably does.
One typical problem is multiplying dense vector by a sparse matrix. Unlike multiplication of two dense matrices, I don’t think it’s possible to decompose into manageable pieces which would fit into caches to saturate the FP64 math of the CPU cores.
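For contrast, here is a minimal sparse matrix times dense vector product in Python, using the common CSR (compressed sparse row) layout. The layout choice and the function name are assumptions for illustration; the comment does not say which format its software uses.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    Each stored nonzero is read once and used for a single multiply-add,
    and x is gathered at irregular positions given by col_idx, so there
    is little data reuse for a cache to exploit.
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y


# CSR encoding of [[10, 0, 0], [0, 0, 20], [30, 0, 40]]
values = [10.0, 20.0, 30.0, 40.0]
col_idx = [0, 2, 0, 2]
row_ptr = [0, 1, 2, 4]

print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [10.0, 60.0, 150.0]
```

Unlike the dense case, every matrix entry is touched exactly once per product, which is why this kind of kernel tends to be limited by memory bandwidth rather than FP64 throughput.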
We have tested our software on nVidia Teslas in a cloud (the expensive ones with many theoretical TFlops of FP64 compute), the performance wasn’t too impressive.
Granted, RDNA and CDNA still have largely the same assembly language, so it's still better than using say... NVidia GPUs. But I have to imagine that the 32-wide vs 64-wide difference is big in some use cases. In particular: low-level programs that use warp-level primitives, like DPP, shared-memory details and such.
I assume the super-computer programmers want a cheap system to have under their desk to prototype code that's similar to the big MI250x system. Vega56/64 is several generations old, while 6800 xt is pretty different architecturally. It seems weird that they'd have to buy MI200 GPUs for this purpose, especially in light of NVidia's strategy (where A2000 nvidia could serve as a close replacement. Maybe not perfect, but closer to the A100 big-daddy than the 6800xt is to the big daddy MI250x).
--------
EDIT: That being said: this is probably completely moot for my own purposes. I can't afford an MI250x system at all. At best I'd make some kind of hand-built consumer rig for my own personal purposes. So 6800 xt would be all I personally need. VRAM-constraints feel quite real, so the 16GBs of VRAM at that price makes 6800xt a very pragmatic system for personal use and study.
And why should your algorithm be the benchmark for supercomputer performance, rather than something that is at least somewhat related [1] to the workloads those machines run?
[1] We can of course argue endlessly that HPL is no longer a very representative benchmark for supercomputer workloads, but I digress.
Nobody but you is confused about this.
I was trolling the people who downvoted my measure of speed a little bit :-) because the millions of FLOPS of a supercomputer will help for parallel tasks but will not be "faster" for a common use case.
So fastest computer is one thing, most powerful is another.
Your argument is a bit like saying the fastest land speed vehicle isn't really the fastest because you can't go to the grocery store with it.