Intel Gaudi 3 AI Accelerator(intel.com) |
Intel Gaudi 3 AI Accelerator(intel.com) |
How much does a single 200Gbit active (or inactive) fiber cable cost? Probably thousands of dollars.. making even the cabling for each card Very Expensive. Nevermind the network switches themselves..
Simultaneously impressive and disappointing.
If you're going fiber instead of twinax it's another order of magnitude and a bit for trancievers, but cables are pretty cheap still.
You seem to be loading negative energy into this release from the get-go
If you search for "AOC Fiber", lots of resources will pop up. FS.com is one helpful resource.
https://community.fs.com/article/active-optical-cable-aoc-ri...
> Active optical cable (AOC) can be defined as an optical fiber jumper cable terminated with optical transceivers on both ends. It uses electrical-to-optical conversion on the cable ends to improve speed and distance performance of the cable without sacrificing compatibility with standard electrical interfaces.
Now, connecting the leaf switches to the spine is a different story...
You do read the licenses of SDKs you use, right?
Nothing but Nvidia hardware will ever "support" CUDA.
With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100's and V100's have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now obsolete chips (this has been done by one person).
There are thousands, maybe tens of thousands of P100's you could pickup for literally <$50 apiece these days which technically give you more Tflops/$ than anything on the market, but they are useless because their interface was not ever made open and has not been reverse engineered openly and the OEM baseboards (Dell, Supermicro mainly) are still hideously expensive outside China.
I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.
See flash attention, triton, etc as core enabling libraries. Not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of them...
Unfortunately there is famously "GPU poor vs GPU rich". Pascal puts you at "GPU destitute" (regardless of assembled VRAM) and outside of implementations like llama.cpp that go incredible and impressive lengths to support these old archs you will very quickly run into show-stopping issues that make you wish you just handed over the money for >= 7.0.
I support any use of old hardware but this kind of reminds me of my "ancient" X5690 that has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.
But my work/fun is in CFD. One of the main codes I use for work was written to be supported primarily at the time of Pascal. Other HPC stuff too that can be run via OpenCL, and is still plenty compatible. Things compiled back then will still run today; It's not a moving target like ML has been.
> I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.
Very cool hobby. It’s also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware to be scrapped. And how the remainder is typically pulled apart to the board / module level for spares. Makes it very unlikely to stumble over more or less complete-ish systems.
Yes, it has a decent memory bandwidth (~750 GB/s) and it runs CUDA. But it only has 16 GB and doesn't support tensor cores or low precision floats. It's in a weird place.
Now that video gaming is secondary (tertiary?) to Nvidia’s revenue stream, they could give a shit which brand gamers prefer. It’s small time now. All that matters is who companies are buying their GPUs from for AI stuff. Break down that CUDA wall and it’s open-season. I wonder how they plan to stave that off. It’s only a matter of time before people get tired of writing C++ code to interface with CUDA.
A while ago NVIDIA and the GraalVM team demoed grCUDA which makes it easy to share memory with CUDA kernels and invoke them from any managed language that runs on GraalVM (which includes JIT compiled Python). Because it's integrated with the compiler the invocation overhead is low:
https://developer.nvidia.com/blog/grcuda-a-polyglot-language...
And TornadoVM lets you write kernels in JVM langs that are compiled through to CUDA:
There are similar technologies for other languages/runtimes too. So I don't think that will cause NVIDIA to lose ground.
But maybe there was more I missed. I'll take another look.
I think this is literally the impetus for their OAM spec which makes the pinout open and shareable. Up until that, they had to keep the actual designs of the baseboards out of the public due to that part being still controlled Nvidia IP.
Or, these salvage operations just stripped racks and kept the small stuff and e-waste the racks because they think it's the more efficient use of their storage space and would be easier to sell, without thinking correctly.
The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.
This is one of the secret recipes of Intel. They can use older tech and push it a little further to catch/surpass current gen tech until current gen becomes easier/cheaper to produce/acquire/integrate.
They have done it with their first quad core processors by merging two dual core processors (Q6xxx series), or by creating absurdly clocked single core processors aimed at very niche market segments.
We have not seen it until now, because they were sleeping at the wheel, and knocked unconscious by AMD.
Any other examples of this? I remember the secret sauce being a process advantage over the competition, exactly the opposite of making old tech outperform the state of the art.
Would you say this means Intel is "back," or just not completely dead?
I truly do hope it is successful so we can have some alternative accelerators.
WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imaging stringing 24 Cat 8 cables between servers in a single rack? Wait: How do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit ethernet to each other?
Also: If you've got that many Cat 8 cables coming out the back of the thing how do you even access it? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in the rack so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense but not help in the density department.
I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.
Seems like an okay $8,000 - $30,000 investment, and bare metal server maintenance isn’t that complicated these days.
I didn't know "terabytes (TB)" was a unit of memory bandwidth...
"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"
https://www.semianalysis.com/p/is-intel-back-foundry-and-pro...
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
Think prototype consumer product with total cost preferably < $500, definitely less than $1000.
I can't speak to the ease of configuration but know that some people have used these successfully.
That's amusing. :D
what is the programming interface here ? this is not CUDA right ...so how is this being done ?
https://www.intel.com/content/dam/www/central-libraries/us/e...
Not the best of times for stuff that doesn't fit matrix processing units.
Also, hardware has a lifecycle. At some point the old hardware isn't worth running in a large scale operation because it consumes more in electricity to run 24/7 than it would cost to replace with newer hardware. But then it falls into the hands of people who aren't going to run it 24/7, like hobbyists and students, which as a manufacturer you still want to support because that's how you get people to invest their time in your stuff instead of a competitor's.
Yesterday I thought about installing and trying to use https://news.ycombinator.com/item?id=39372159 (Reor is an open-source AI note-taking app that runs models locally.) and feed it my markdown folder but I stop midway, asking myself "don't I need some kind of powerful GPU for that ?". And now I am thinking "wait, should I wait for `standard` pluggable AI computing hardware device ? Is that Intel Gaudi 3 something like that ?".
> The Gaudi 3 accelerators inside of the nodes are connected using the same OSFP links to the outside world as happened with the Gaudi 2 designs, but in this case the doubling of the speed means that Intel has had to add retimers between the Ethernet ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports that come out of the back of the system board. Of the 24 ports on each Gaudi 3, 21 of them are used to make a high-bandwidth all-to-all network linking those Gaudi 3 devices tightly to each other. Like this:
> As you scale, you build a sub-cluster with sixteen of these eight-way Gaudi 3 nodes, with three leaf switches – generally based on the 51.2 Tb/sec “Tomahawk 5” StrataXGS switch ASICs from Broadcom, according to Medina – that have half of their 64 ports running at 800 GB/sec pointing down to the servers and half of their ports pointing up to the spine network. You need three leaf switches to do the trick:
> To get to 4,096 Gaudi 3 accelerators across 512 server nodes, you build 32 sub-clusters and you cross link the 96 leaf switches with a three banks of sixteen spine switches, which will give you three different paths to link any Gaudi 3 to any other Gaudi 3 through two layers of network. Like this:
The cabling works out neatly in the rack configurations they envision. The idea here is to use standard Ethernet instead of proprietary Infiniband (which Nvidia got from acquiring Mellanox). Because each accelerator can reach other accelerators via multiple paths that will (ideally) not be over-utilized, you will be able to perform large operations across them efficiently without needing to get especially optimized about how your software manages communication.
https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
> RDMA over Converged Ethernet (RoCE) or InfiniBand over Ethernet (IBoE)[1] is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet.
Sounds very cool.O.5-1.5/2m copper cables are easily available and cheap and 4-8m (and even longer) is also possible with copper but tends to be more expensive and harder to get by.
Even 800gb is possible with copper cables these days but you’ll end up spending just as much if not more on cabling as the rest of your kit…https://www.fibermall.com/sale-460634-800g-osfp-acc-3m-flt.h...
100GBe is only supported on twinax anyway, so Cat8 is irrelevant here. The other 3 ports are probably QSFP or something.
But I'm not how big and how expensive a 24 channel cat 8 snake would be (!).
Althought at least for Python JITs, Intel seems to also be doing something.
And then there is the graphical debugging experience for GPGPU on CUDA, that feels like doing CPU debugging.
This experience is largely a misalignment between what AMD thinks their product is and what the Linux world thinks software is. My pet theory is it's a holdover from the GPU being primarily a games console product as that's what kept the company alive through the recent dark times. There's money now but some of the best practices are sticky.
In games dev, you ship a SDK. Speaking with personal experience here as I was on the playstation dev tools team. That's a compiler, debugger, profiler, language runtimes, bunch of math libs etc all packaged together with a single version number for the whole thing. A games studio downloads that and uses it for the entire dev cycle of the game. They've noticed that compiler bugs move so each game is essentially dependent on the "characteristics" of that toolchain and persuading them to gamble on a toolchain upgrade mid cycle requires some feature they really badly want.
HPC has some things in common with this. You "module load rocm-5.2" or whatever and now your whole environment is that particular toolchain release. That's where the math libraries are and where the compiler is.
With that context, the internal testing process makes a lot of sense. At some point AMD picks a target OS. I think it's literally "LTS Ubuntu" or a RedHat release or similar. Something that is already available anyway. That gets installed on a lot of CI machines, test machines, developer machines. Most of the boxes I can ssh into have Ubuntu on them. The userspace details don't matter much but what this does do is fix the kernel version for a given release number. Possibly to one of two similar kernel versions. Then there's a multiple month dev and testing process, all on that kernel.
Testing involves some largish number of programs that customers care about. Whatever they're running on the clusters, or some AI things these days. It also involves a lot of performance testing where things getting slower is a bug. The release team are very clear on things not going out the door if things are broken or slower and it's not a fun time to have your commit from months ago pulled out of the bisection as the root cause. That as-shipped configuration - kernel 5.whatever, the driver you build yourself as opposed to the one that kernel shipped with, the ROCm userspace version 4.1 or so - taken together is pretty solid. It sometimes falls over in the field anyway when running applications that aren't in the internal testing set but users of it don't seem anything like as cross as the HN crowd.
This pretty much gives you the discrepancy in user experience. If you've got a rocm release running on one of the HPC machines, or you've got a gaming SDK on a specific console version, things work fairly well and because it's a fixed point things that don't work can be patched around.
In contrast, you can take whatever linux kernel you like and use the amdkfd driver in that, combined with whatever ROCm packages your distribution has bundled. Last I looked it was ROCm 5.2 in debian, lightly patched. A colleague runs Arch which I think is more recent. Gentoo will be different again. I don't know about the others. That kernel probably isn't from the magic list of hammered on under testing. The driver definitely isn't. The driver people work largely upstream but the gitlab fork can be quite divergent from it, much like the rocm llvm can be quite divergent from the upstream llvm.
So when you take the happy path on Linux and use whatever kernel you happen to have installed, that's a codebase that went through whatever testing the kernel project does on the driver and reflects the fraction of a kernel dev branch that was upstream at that point in time. Sometimes it's very stable, sometimes it's really not. I stubbornly refuse to use the binary release of ROCm and use whatever driver is in Debian testing and occasionally have a bad time with stability as a result. But that's because I'm deliberately running a bleeding edge dev build because bugs I stumble across have a chance of me fixing them before users run into it.
I don't think people using apt-get install rocm necessarily know whether they're using a kernel that the userspace is expected to work with or a dev version of excitement since they look the same. The documentation says to use the approved linux release - some Ubuntu flavour with a specific version number - but doesn't draw much attention to the expected experience if you ignore that command.
This is strongly related to the "approved cards list" that HN also hates. It literally means the release testing passed on the cards in that list, and the release testing was not run on the other ones. So you're back into the YMMV region, along with people like me stubbornly running non-approved gaming hardware on non-approved kernels with a bunch of code I built from source using a different compiler to the one used for the production binaries.
None of this is remotely apparent to me from our documentation but it does follow pretty directly from the games dev / HPC design space.
Pascal isn’t incredibly cheap by comparison because it’s some secret hack. It’s cheap by comparison because most of the market (AI/ML) doesn’t want it. Speaking of which…
At the risk of “No True Scotsman” what qualifies as HPC gets interesting but just today I was at a Top500 site that was talking about their Volta system not being worth the power, which is relevant to parent comment but still problematic for reasons.
I mentioned llama.cpp because the /r/locallama crowd, etc has actually driven up the cost of used Pascal hardware because they treat it as a path to get VRAM on the cheap with their very very narrow use cases.
If we’re talking about getting a little FP64 for CFD that’s one thing. ML/AI is another. HPC is yet another.
I think about other areas in tech where you can use whatever language, but it isn’t practical to do so. I can write a backend API server in Swift… or perhaps more relevant- I can use AMD’s ROCm to do… anything.
However the Gaudi chips are built on top of SynapseAI, another API from before the Habana acquisition. I don’t know if there’s a plan to support oneAPI on Gaudi, but it doesn’t look like it at the moment.
If you're willing to do the testing and post the pin out, I'll send it your way.
Would like to have it sent back when you're finished though, and I'd pay shipping both ways.
I do have the ground pins mapped out and could share that which should save some time.
Although not for 200 Gbps, at that rate you either use big twinax DACs, or go to fibre.
You’re joking, right? They will price it to match current H100 pricing. Multiply your estimate by 10x.
This figure is old and I dont think $15 cuts it anymore. My guess would be $20 if not exceeds it.
Still, your point stands. It's crazy how that 2016 GPU has two thirds the FP32 power of this new 2024 unobtanium card and infinitely more FP64.
Is there a similar "magic value card" for low memory (2GB?) 8-bit LLMs?
Since memory is the expensive bit, surely there are low cost low memory models?
The P40 was aimed at the image and video cloud processing market I think, and thus the GDDR ram instead of HBM, so it got more VRAM but at much less bandwidth.
I didn’t have an X5690 because the TDP was too high for my server’s heatsinks, but I had 90W variants of the same generation. To me, two at idle produced noticeable heat, though not as much as four idling in a PowerEdge R910 did. The R910 idled at around 300W.
There’s always Folding@Home if you don’t mind the electric bill. Plex is another option. I know a guy running a massive Plex server that was on Westmere/Nehalem Xeons until I gave him my R720 with Haswell Xeons.
It looks pathetic indeed. Makes many people question: if THAT'S democracy, then maybe it's not worth fighting for.
> All the sane and rational people are rooting for you here in the U.S.
The same could be said about russian people (sane and rational ones). But what do both people have in common? The answer is: currently both nations are helpless to change what their government does.
> are rooting for you here in the U.S.
I know. We all truly know and greatly appreciate that. There would be no Ukraine if not American weapons and help.
> There’s always Folding@Home
Makes little sense power-wise.
Intel’s TDP problems and AVX clock issues leave a bitter taste in the mouth.
This isn't the same as getting more out of the same over and over again.
Don't know if this counts, but feels directionally similar.
Please make SYCL a priority, cross platform code would make AMD GPUs a viable alternative in the future.
Opencl and hip compilers are in llvm trunk, just bring a runtime from GitHub. Openmp likewise though with much more of the runtime in trunk, just bring libhsa.so from GitHub or debian repos. All of it open source.
There's also a bunch of machine learning stuff. Pytorch and Triton, maybe others. And non-C++ languages, notably Fortran, but Julia and Mojo have mostly third party implementations as well.
I don't know what the UXL foundation is. I do know what sycl is, but aside from using code from intel I don't see what it brings over any of the other single source languages.
At some point sycl will probably be implemented on the llvm offload infra Johannes is currently deriving from the openmp runtime, maybe by intel or maybe by one of my colleagues, at which point I expect people to continue using cuda and complaining about amdgpu. It seems very clear to me that extra GPU languages aren't the solution to people buying everything from Nvidia.
Standalone cards are more like dev kits.
(I’ve been tracking Tenstorrent for 3+ years and currently have Grayskull in ML test rig together with 3090)
My latest ASUS was good enough, but I didn't (and probably won't) build any newer systems, so ABITs will have the crown.
I seem to recall that only Quake 2 or 3 was capable of actually using that second processor during a game, but that wasn't the point ;)
It was fun to play with but you'd also expect the higher-end desktop to e.g. handle x264 videos which was not the case (search for q6600 on videolan forum). And depressingly many cheaper CPUs of the time did it easily.
I loved it, to be honest.
Do you have any more evidence as to why these categorically don't work?
They don't. Loud voices parroting George, with nothing to back it up.
Here are another couple good links:
https://www.evp.cloud/post/diving-deeper-insights-from-our-l...
https://www.databricks.com/blog/training-llms-scale-amd-mi25...
Not even from cloud providers which should be telling enough.
Rocm is sensitive to matching kernel version to driver version to userspace version. Staying very much on the kernel version from a official release and using the corresponding driver is drastically more robust than optimistically mixing different components. In particular, rocm is released and tested as one large blob, and running that large blob on a slightly different kernel version can go very badly. Mixing things from GitHub with things from your package manager is also optimistic.
Imagine it as huge ball of code where cross version compatibility of pieces is totally untested.
Even Nvidia GPU's are tricky to sandbox, and it sounds like the AMD cards are really easy for the tenant to break (or at least force a restart of the underlying host).
AWS does have a Gaudi instance which is interesting, but overall I don't see why Azure, AWS & Google would deploy AMD or Intel GPU's at scale vs their own chips.
They need some competitor to Nvidia to help negotiate, but if its going to be a painful software support story suited to only a few enterprise customers, why not do it with your own chip?
Modern CPUs have moved towards becoming simpler and more flexible in their execution with specialized hardware (GPUs, etc) for the more parallel and repetitive tasks that Itanium excelled at.
Had it not happened, PC makers wouldn't have had any other alternative other than buy PCs with Windows / Itanium, no matter what.
At Itanium's launch, an x86 Windows Server could use Physical Address Extension to support 128GBs of RAM. In an alt timeline where x86-64 never happened, we'd have likely seen PAE perk down to consumer level operating systems to support greater than 4GB of RAM. It was supported on all popular consumer x86 CPUs from Intel and AMD at the time.
The primary reasons we have the technologies we have today was wide availability and wide support. Itanium never achieved either. In a timeline without x86-64 there might have been room for IBM Power to compete with Xeon/Opteron/Itanium. The console wars would have still developed the underlying technologies used by Nvidia for it's ML products, and Intel would likely be devoting resource into making Itanium an ML powerhouse.
We'd be stuck with x86, ARM or Power as a desktop option.
What's your take on projects like https://github.com/corundum/corundum I'm trying to get better at FPGA design, perhaps learn PCIe and some such but Vivado is intimidating (as opposed to Yosys/nextpnr which you seem to hate) should I just get involved with a project like this to acclimatise somewhat?
If you can point me to a repro I'll add it to my todo list. You can probably tag me in the github issue if that's where you reported it.
To an SRE, this is a nightmare to read. Cuda is bad in this regard (can often prevent major kernel version updates), but this is worse.
George has a nice explanation of doorbells:
You'd hope at $15,000+ per unit, you wouldn't have to reset it at all...
i never said i hated yosys/nextpnr? i said somewhere that yosys makes the uber strange decision to use C++ as effectively a scripting language ie gluing and scheduling "passes" together - like they seemed to make the firm decision to diverge from tcl but diverged into absurd territory. i wish yosys were great because it's open source and then i could solve my own problems as they occurred. but it's not great and i doubt it ever will be because building logic synthesis, techmapping, timing analysis, place and route, etc. is just too many extremely hard problems for OSS.
all the vendor tools suck. it's just a fact that both big fpga manufacturers have completely shit software devs working on those tools. the only tools that i've heard are decent are the very expensive suites from cadence/siemenns/synopsis but i have yet to be in a place that has licenses (neither school nor day job - at least not in my team). and mind you, you will still need to feed the RTL or netlist or whatever that those tools generate into vivado (so you're still fucked).
so i don't have advice for you on RTL - i moved one level up (ISA, compilers, etc.) primarily because i could not effectively learn by myself i.e., without going to "apprentice" under someone that just has enough experience to navigate around the potholes (because fundamentally if that's what it takes to learn then you're basically working on ineluctably entrenched tech).
AMD's driver is in your kernel, all the userspace is on GitHub. The ISA is documented. It's entirely possible to treat the ASICs as mass market subsidized floating point machines and run your own code on them.
Modulo firmware. I'm vaguely on the path to working out what's going on there. Changing that without talking to the hardware guys in real time might be rather difficult even with the code available though.
I haven't found the firmware source code yet - digging through confluence and perforce tries my patience and I'm supposed to be working on llvm - but I hear it's written in assembly, where one of the hurdles to open sourcing it is the assembler is proprietary. I suspect there's some common information shared with the hardware description language (tcl and verilog or whatever they're using). To the extent that turns out to be true, it'll be immune to C style undefined behaviour, but I wouldn't bet on it being free from buffer overflows.
That said, if you're on the inside, like I am, and you talk to people at AMD (just got off two separate back to back calls with them), rest assured, they are dedicated to making this stuff work.
Part of that is to build a developer flywheel by making their top end hardware available to end users. That's where my company Hot Aisle comes into play. Something that wasn't available before outside of the HPC markets, is now going to be made available.
This is exactly the void I'm trying to fill.
At which price point to be honest, it still shouldn't be needed.
AMD are lucky everyone expects this nowadays, or people might consider sueing.