Intel Gaudi 3 AI Accelerator

435 points by goldemerald 2 years ago | 250 comments

mk_stjames 2 years ago |

One nice thing about this (and the new offerings from AMD) is that they will be using the "open accelerator module (OAM)" interface, which standardizes the connector that they use to put them on baseboards, similar to the SXM connections of Nvidia that use MegArray connectors to their baseboards.

With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100's and V100's have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now obsolete chips (this has been done by one person).

There are thousands, maybe tens of thousands of P100's you could pick up for literally <$50 apiece these days which technically give you more Tflops/$ than anything on the market, but they are useless because their interface was never made open and has not been reverse engineered openly, and the OEM baseboards (Dell, Supermicro mainly) are still hideously expensive outside China.

I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.

kkielhofner 2 years ago | |

Pascal series are cheap because they are CUDA compute capability 6.0 and lack Tensor Cores. Volta (7.0) was the first to have Tensor Cores and in many cases is the bare minimum for modern/current stacks.

See flash attention, triton, etc as core enabling libraries. Not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of them...

Unfortunately there is famously "GPU poor vs GPU rich". Pascal puts you at "GPU destitute" (regardless of assembled VRAM) and outside of implementations like llama.cpp that go to incredible and impressive lengths to support these old archs you will very quickly run into show-stopping issues that make you wish you just handed over the money for >= 7.0.
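As a rough illustration of the cutoff described above, here is a minimal sketch that treats "has Tensor Cores" as "compute capability >= 7.0". The function name and the threshold constant are hypothetical and only encode the rule of thumb from this comment. With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple of the same shape.

```python
from typing import Tuple

# Assumption: Tensor Core support is approximated as compute capability >= 7.0,
# following the comment above (Volta, 7.0, was the first with Tensor Cores).
MIN_TENSOR_CORE_CAPABILITY = (7, 0)


def has_tensor_cores(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the 7.0 cutoff."""
    return capability >= MIN_TENSOR_CORE_CAPABILITY


P100 = (6, 0)  # Pascal, per the comment above
V100 = (7, 0)  # Volta, per the comment above

print("P100 has Tensor Cores:", has_tensor_cores(P100))
print("V100 has Tensor Cores:", has_tensor_cores(V100))
```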

I support any use of old hardware but this kind of reminds me of my "ancient" X5690 that has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.

mk_stjames 2 years ago | | |

This is all very true for Machine-Learning research tasks, where yes, if you want that latest PyTorch library function to work you need to be on the latest ML code.

But my work/fun is in CFD. One of the main codes I use for work was written to be supported primarily at the time of Pascal. Other HPC stuff too that can be run via OpenCL, and is still plenty compatible. Things compiled back then will still run today; it's not a moving target like ML has been.

gymbeaux 2 years ago | | |

Hey that’s not fair, the X5690 is VERY efficient… at heating a home in the winter time.

JonChesterfield 2 years ago | |

I really like this side to AMD. There's a strategic call somewhere high up to bias towards collaboration with other companies. Sharing the fabric specifications with broadcom was an amazing thing to see. It's not out of the question that we'll see single chips with chiplets made by different companies attached together.

01HNNWZ0MV43FF 2 years ago | | |

Maybe they feel threatened by ARM on mobile and Intel on desktop / server. Companies that think they're first try to monopolize. Companies that think they're second try to cooperate.

rhelz 2 years ago | | |

Well, let's not forget, AMD is AMD because they reverse-engineered Intel chips....

formerly_proven 2 years ago | |

The price is low because they’re useless (except for replacing dead cards in a DGX), if you had a 40$ PCIe AIC-to-SXM adapter, the price would go up a lot.

> I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.

Very cool hobby. It’s also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware being scrapped. And how the remainder is typically pulled apart to the board / module level for spares. Makes it very unlikely to stumble over more or less complete-ish systems.

KeplerBoy 2 years ago | | |

I'm not sure the prices would go up that much. What would anyone buy that card for?

Yes, it has a decent memory bandwidth (~750 GB/s) and it runs CUDA. But it only has 16 GB and doesn't support tensor cores or low precision floats. It's in a weird place.

gymbeaux 2 years ago | |

As “humble” as NVIDIA’s CEO appears to be, NVIDIA the company (he’s been running this whole time), made decision after decision with the simple intention of killing off its competition (ATI/AMD). Gameworks is my favorite example- essentially if you wanted a video game to look as good as possible, you needed an NVIDIA GPU. Those same games played on AMD GPUs just didn’t look as good.

Now that video gaming is secondary (tertiary?) to Nvidia’s revenue stream, they could give a shit which brand gamers prefer. It’s small time now. All that matters is who companies are buying their GPUs from for AI stuff. Break down that CUDA wall and it’s open-season. I wonder how they plan to stave that off. It’s only a matter of time before people get tired of writing C++ code to interface with CUDA.

mike_hearn 2 years ago | | |

You don't need to use C++ to interface with CUDA or even write it.

A while ago NVIDIA and the GraalVM team demoed grCUDA which makes it easy to share memory with CUDA kernels and invoke them from any managed language that runs on GraalVM (which includes JIT compiled Python). Because it's integrated with the compiler the invocation overhead is low:

https://developer.nvidia.com/blog/grcuda-a-polyglot-language...

And TornadoVM lets you write kernels in JVM langs that are compiled through to CUDA:

https://www.tornadovm.org

There are similar technologies for other languages/runtimes too. So I don't think that will cause NVIDIA to lose ground.

buildbot 2 years ago | |

The SXM2 interface is actually publicly documented! There is an open compute spec for a 8-way baseboard. You can find the pinouts there.

mk_stjames 2 years ago | | |

I had read their documents such as the spec for the Big Basin JBOG, where everything is documented except the actual pinouts on the base board. Everything leading up to it and from it is there but the actual MegArray pinout connection to a single P100/V100 I never found.

But maybe there was more I missed. I'll take another look.

mk_stjames 2 years ago | | |

Upon further review... I think any actual base board schematics / pinouts touching the Nvidia hardware directly is indeed kept behind some sort of NDA or OEM license agreement and is specifically kept out of any of those documents for the Open Compute project JBOG rigs.

I think this is literally the impetus for their OAM spec which makes the pinout open and shareable. Up until that, they had to keep the actual designs of the baseboards out of the public due to that part being still controlled Nvidia IP.

wmf 2 years ago | |

Why don't they sell used P100 DGX/HGX servers as a unit? Are those bare P100s only so cheap precisely because they're useless?

mk_stjames 2 years ago | | |

I have a theory some big cloud provider moved a ton of racks from SXM2 P100's to SXM2 V100's (those were a thing) and thus orphaned an absolute ton of P100's without their baseboards.

Or, these salvage operations just stripped racks, kept the small stuff and e-wasted the racks because they thought it was the more efficient use of their storage space and would be easier to sell, without thinking it through correctly.

pavelstoev 2 years ago | |

Best Tflops/$ is actually 4090, then 3090. Also L4

lostmsu 2 years ago | |

P100s would not give you more Tflops/$ if you take electricity into account.

neilmovva 2 years ago |

A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020. But Intel is using 8 stacks here, so Gaudi 3 achieves comparable total bandwidth (3.7TB/s) to H100 (3.4TB/s) which uses 5 stacks of HBM3. Hopefully the older HBM has better supply - HBM3 is hard to get right now!
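To make the stack comparison concrete, here is a small back-of-the-envelope sketch that divides the total bandwidth figures quoted above by the stack counts quoted above. The helper name is hypothetical, and the result is only an average per stack, not a vendor specification.

```python
def per_stack_gb_s(total_tb_s: float, stacks: int) -> float:
    """Average bandwidth per HBM stack in GB/s (decimal units)."""
    return total_tb_s * 1000 / stacks


# Figures taken from the comment above.
gaudi3 = per_stack_gb_s(3.7, 8)  # 8 stacks of HBM2e
h100 = per_stack_gb_s(3.4, 5)    # 5 stacks of HBM3

print(f"Gaudi 3: {gaudi3:.1f} GB/s per stack")
print(f"H100:    {h100:.1f} GB/s per stack")
```

In other words, the older memory delivers less per stack and the total is made up by using more stacks.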

The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.

bayindirh 2 years ago | |

> A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020.

This is one of the secret recipes of Intel. They can use older tech and push it a little further to catch/surpass current gen tech until current gen becomes easier/cheaper to produce/acquire/integrate.

They have done it with their first quad core processors by merging two dual core processors (Q6xxx series), or by creating absurdly clocked single core processors aimed at very niche market segments.

We have not seen it until now, because they were sleeping at the wheel, and knocked unconscious by AMD.

JonChesterfield 2 years ago | | |

> This is one of the secret recipes of Intel

Any other examples of this? I remember the secret sauce being a process advantage over the competition, exactly the opposite of making old tech outperform the state of the art.

mvkel 2 years ago | | |

Interesting.

Would you say this means Intel is "back," or just not completely dead?

alexey-salmin 2 years ago | | |

Oh dear, Q6600 was so bad, I regret ever owning it

tmikaeld 2 years ago | |

I was just about to comment on this, apparently all production capacity for hbm is tapped out until early 2026

kylixz 2 years ago |

This is a bit snarky — but will Intel actually keep this product line alive for more than a few years? Having been bitten by building products around some of their non-x86 offerings where they killed good IP off and then failed to support it… I’m skeptical.

I truly do hope it is successful so we can have some alternative accelerators.

riskable 2 years ago |

> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator

WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: How do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit ethernet to each other?

Also: If you've got that many Cat 8 cables coming out the back of the thing how do you even access it? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in the rack so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense but not help in the density department.

I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
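For a sense of scale, here is a minimal sketch of the arithmetic behind the scenario imagined above: the aggregate bandwidth of 24 ports at 200 Gb/s, and the cable count if 24 accelerators really were wired as a direct full mesh. The full mesh is only the hypothetical from this comment, not a topology documented in the announcement, and the function names are made up for illustration.

```python
PORTS_PER_CARD = 24   # from the quoted announcement
PORT_GBIT_S = 200     # from the quoted announcement


def aggregate_gbyte_s(ports: int, gbit_per_port: int) -> float:
    """Total one-direction bandwidth across all ports, in GB/s (8 bits per byte)."""
    return ports * gbit_per_port / 8


def full_mesh_links(nodes: int) -> int:
    """Number of point-to-point cables needed to connect every pair of nodes."""
    return nodes * (nodes - 1) // 2


def ports_needed_per_node(nodes: int) -> int:
    """Each node in a full mesh needs one port per peer."""
    return nodes - 1


print("Aggregate per card:", aggregate_gbyte_s(PORTS_PER_CARD, PORT_GBIT_S), "GB/s")
print("Cables for a 24-card full mesh:", full_mesh_links(24))
print("Ports used per card:", ports_needed_per_node(24), "of", PORTS_PER_CARD)
```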

sairahul82 2 years ago |

Can we expect the price of 'Gaudi 3 PCIe' to be reasonable enough to put in a workstation? That would be a game changer for local LLMs

wongarsu 2 years ago | |

Probably not. A 40GB Nvidia A100 is arguably reasonable for a workstation at $6000. Depending on your definition an 80GB A100 for $16000 is still reasonable. I don't see this being cheaper than an 80GB A100. Probably a good bit more expensive, seeing as it has more RAM, compares itself favorably to the H100, and has enough compelling features that it probably doesn't have to (strongly) compete on price.

0cf8612b2e1e 2 years ago | | |

Surely NVidia’s pricing is more what the market will bear vs an intrinsic cost to build. Intel being the underdog should be willing to offer a discount just to get their foot in the door.

narrator 2 years ago | | |

Isn't it much better to get a Mac Studio with an M2 Max and 192GB of RAM and 31 teraflops for $6599 and run llama.cpp?

chessgecko 2 years ago | | |

I think you're right on the price, but just to give some false hope: I think newish HBM (and this is HBM2e, which is a little older) is around $15/GB, so for 128 GB that's $1920. There are some other COGS, but in theory they could sell this for like $3-4k and make some gross profit while getting some hobbyist mindshare/research code written for it. I doubt they will though, it might eat too much into profits from the non-PCIe variants.
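The estimate above can be checked with a few lines. This is only a sketch of the commenter's own numbers: the $15/GB figure is their guess, all other costs are ignored, and the names are hypothetical.

```python
HBM_USD_PER_GB = 15   # the commenter's rough guess, not a quoted price
CAPACITY_GB = 128     # Gaudi 3 memory capacity from the announcement


def memory_cost_usd(capacity_gb: int, usd_per_gb: int) -> int:
    """Memory-only bill of materials; ignores every other cost of goods."""
    return capacity_gb * usd_per_gb


memory_cost = memory_cost_usd(CAPACITY_GB, HBM_USD_PER_GB)
print("HBM cost:", memory_cost, "USD")

# What would be left over for everything else at the two hypothetical prices.
for price in (3000, 4000):
    print(f"At ${price}: ${price - memory_cost} remains after memory")
```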

Workaccount2 2 years ago | | |

Interestingly they are using HBM2e memory which is a few years old at this point. The price might end up being surprisingly good because of this.

CuriouslyC 2 years ago | |

Just based on the RAM alone, let's just say if you can't just buy a Vision Pro without a second thought about the price tag, don't get your hopes up.

ipsum2 2 years ago | |

It won't be under $10k.

rileyphone 2 years ago |

128GB in one chip seems important with the rise of sparse architectures like MoE. Hopefully these are competitive with Nvidia's offerings, though in the end they will be competing for the same fab space as Nvidia if I'm not mistaken.

latchkey 2 years ago | |

AMD MI300x is 192GB.

tucnak 2 years ago | | |

Which would be impressive had it _actually_ worked for ML workloads.

kaycebasques 2 years ago |

Wow, I very much appreciate the use of the 5 Ws and H [1] in this announcement. Thank you Intel for not subjecting my eyes to corp BS

[1] https://en.wikipedia.org/wiki/Five_Ws

belval 2 years ago | |

I wonder if with the advent of LLMs being able to spit out perfect corpo-speak everyone will recenter to succinct and short "here's the gist" as the long version will become associated with cheap automated output.

latchkey 2 years ago |

> the only MLPerf-benchmarked alternative for LLMs on the market

I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.

yieldcrv 2 years ago |

Has anyone here bought an AI accelerator to run their AI SaaS service from their home to customers instead of trying to make a profit on top of OpenAI or Replicate

Seems like an okay $8,000 - $30,000 investment, and bare metal server maintenance isn’t that complicated these days.

shiftpgdn 2 years ago | |

Dingboard runs off of the owner's pile of used gamer cards. The owner frequently posts about it on twitter.

1024core 2 years ago |

> Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB) of HBMe2 memory capacity, 3.7 terabytes (TB) of memory bandwidth ...

I didn't know "terabytes (TB)" was a unit of memory bandwidth...

throwup238 2 years ago | |

It’s equivalent to about thirteen football fields per arn if that helps.

gnabgib 2 years ago | |

Bit of an embarrassing typo, they do later qualify it as 3.7TB/s

SteveNuts 2 years ago | | |

Most of the time bandwidth is expressed in giga/gibi/tera/tebi bits per second so this is also confusing to me
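For anyone else tripping over the units, here is a small conversion sketch for the 3.7 TB/s figure quoted earlier in this thread. It assumes the "TB" in the announcement is decimal terabytes (10^12 bytes); the function names are made up for illustration.

```python
def tbyte_to_tbit(tbyte_per_s: float) -> float:
    """Decimal terabytes/s to decimal terabits/s (8 bits per byte)."""
    return tbyte_per_s * 8


def tbyte_to_tebibyte(tbyte_per_s: float) -> float:
    """Decimal terabytes/s (10**12 bytes) to binary tebibytes/s (2**40 bytes)."""
    return tbyte_per_s * 10**12 / 2**40


bandwidth_tb_s = 3.7  # assumption: decimal TB/s

print(f"{bandwidth_tb_s} TB/s = {tbyte_to_tbit(bandwidth_tb_s):.1f} Tb/s")
print(f"{bandwidth_tb_s} TB/s = {tbyte_to_tebibyte(bandwidth_tb_s):.3f} TiB/s")
```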

nahnahno 2 years ago | |

About as relevant a measure of speed as parsecs

throwaway4good 2 years ago |

Worth noting that it is fabbed by TSMC.

InvestorType 2 years ago |

This appears to be manufactured by TSMC (or Samsung). The press release says it will use a 5nm process, which is not on Intel's roadmap.

"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"

ac29 2 years ago | |

Habana was an acquisition and their use of TSMC predates the acquisition.

modeless 2 years ago | | |

Yeah, but if Intel can't even get internal customers to adopt their foundry services it seems to bode poorly for the future of the company.

geertj 2 years ago |

I wonder if someone knowledgeable could comment on OneAPI vs Cuda. I feel like if Intel is going to be a serious competitor to Nvidia, both software and hardware are going to be equally important.

einpoklum 2 years ago |

If your metric is memory bandwidth or memory size, then this announcement gives you some concrete information. But - suppose my metric for performance is matrix-multiply-add (or just matrix-multiply) bandwidth. What MMA primitives does Gaudi offer (i.e. type combinations and matrix dimension combinations), and how many of such ops per second, in practice? The linked page says "64,000 in parallel", but that does not actually tell me much.

alecco 2 years ago |

Gaudi 3 has PCIe 4.0 (vs. H100 PCIe 5.0, so 2x the bandwidth). Probably not a deal-breaker but it's strange for Intel (of all vendors) to lag behind in PCIe.
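The 2x figure follows from the per-lane signalling rate doubling between generations. Here is a minimal sketch of the arithmetic, assuming an x16 link on both cards (the lane count is not stated in the announcement) and the 128b/130b line encoding that both generations use.

```python
# Per-lane raw rates in GT/s for the two generations being compared.
GT_PER_S = {"PCIe 4.0": 16, "PCIe 5.0": 32}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def link_gbyte_s(generation: str, lanes: int = 16) -> float:
    """Approximate one-direction payload bandwidth of a link in GB/s."""
    return GT_PER_S[generation] * lanes * ENCODING_EFFICIENCY / 8


gen4 = link_gbyte_s("PCIe 4.0")
gen5 = link_gbyte_s("PCIe 5.0")

print(f"PCIe 4.0 x16: {gen4:.1f} GB/s")
print(f"PCIe 5.0 x16: {gen5:.1f} GB/s")
print(f"Ratio: {gen5 / gen4:.1f}x")
```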

wmf 2 years ago | |

N5, PCIe 4.0, and HBM2e. This chip was probably delayed two years.

alecco 2 years ago | | |

Good point, it's built on TSMC while Intel is pushing to become the #2 foundry. Probably it's because Gaudi was made by an Israeli company Intel acquired in 2019 (not an internal project). Who knows.

https://www.semianalysis.com/p/is-intel-back-foundry-and-pro...

KeplerBoy 2 years ago | |

The whitepaper says it's PCIe 5 on Gaudi 3.

ancharm 2 years ago |

Is the scheduling / bare metal software open source through OneAPI? Can a link be posted showing it if so?

cavisne 2 years ago |

Is there an equivalent to this reference for Intel Gaudi?

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

AnonMO 2 years ago |

it's crazy that Intel can't manufacture its own chips atm, but it looks like that might change in the coming years as new fabs come online.

colechristensen 2 years ago |

Anyone have experience and suggestions for an AI accelerator?

Think prototype consumer product with total cost preferably < $500, definitely less than $1000.

jsheard 2 years ago | |

The default answer is to get the biggest Nvidia gaming card you can afford, prioritizing VRAM size over speed. Ideally one of the 24GB ones.

Hugsun 2 years ago | |

You can get very cheap tesla P40s with 24gb of ram. They are much much slower than the newer cards but offer decent value for running a local chatbot.

I can't speak to the ease of configuration but know that some people have used these successfully.

JonChesterfield 2 years ago | |

I liked my 5700XT. That seems to be $200 now. Ran arbitrary code on it just fine. Lots of machine learning seems to be obsessed with amount of memory though and increasing that is likely to increase the price. Also HN doesn't like ROCm much, so there's that.

hedgehog 2 years ago | |

What else is on the BOM? Volume? At that price you likely want to use whatever resources are on the SoC that runs the thing and work around that. Feel free to e-mail me.

dist-epoch 2 years ago | |

All new CPUs will have so called NPUs inside them. For helping running models locally.

mirekrusin 2 years ago | |

Rent or 3090, maybe used 4090 if you're lucky.

jononor 2 years ago | |

What is the workload?

wmf 2 years ago | |

AMD Hawk Point?

MrYellowP 2 years ago |

https://www.dwds.de/wb/Gaudi

That's amusing. :D

sandGorgon 2 years ago |

>Intel Gaudi software integrates the PyTorch framework and provides optimized Hugging Face community-based models – the most-common AI framework for GenAI developers today. This allows GenAI developers to operate at a high abstraction level for ease of use and productivity and ease of model porting across hardware types.

what is the programming interface here ? this is not CUDA right ...so how is this being done ?

wmf 2 years ago | |

PyTorch has a bunch of backends including CUDA, ROCm, OneAPI, etc.
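In practice this usually surfaces as a device string chosen at startup. The sketch below is an assumption-heavy illustration rather than Intel's documented setup: it assumes the Gaudi bridge ships as a Python package named `habana_frameworks` and exposes the device as `"hpu"`, and it only checks whether that package is installed, not whether hardware is present. The function name is hypothetical.

```python
import importlib.util


def pick_device_name() -> str:
    """Return a PyTorch device string based on which packages are installed."""
    # No PyTorch at all: nothing to select.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    # Assumption: the Gaudi PyTorch bridge is the `habana_frameworks` package
    # and exposes the accelerator as the "hpu" device.
    if importlib.util.find_spec("habana_frameworks") is not None:
        return "hpu"
    import torch  # imported lazily so the sketch also runs without PyTorch

    # torch.cuda.is_available() reports True on CUDA builds with a usable GPU.
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device_name())
```

Model code would then call something like `model.to(pick_device_name())`, which is what lets the same script move between backends.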

sandGorgon 2 years ago | | |

i understand. but which backend is intel committing to ? not CUDA for sure. or have they created a new backend

chessgecko 2 years ago |

I feel a little misled by the speedup numbers. They are comparing lower batch size H100/H200 numbers to higher batch size Gaudi 3 numbers for throughput (which is heavily improved by increasing batch size). I feel like there are some inference scenarios where this is better, but it's really hard to tell from the numbers in the paper.

andersa 2 years ago |

Price?

amelius 2 years ago |

Missing in these pictures are the thermal management solutions.

InitEnabler 2 years ago | |

If you look at one of the pictures you can get a peek at what they look like (I think...) in the bottom right.

https://www.intel.com/content/dam/www/central-libraries/us/e...

wmf 2 years ago | |

It's going to look very similar to an Nvidia SXM or AMD MI300 heatsink since these all have similar form factors.

KeplerBoy 2 years ago |

Vector floating point performance comes in at 14 Tflop/s for FP32 and 28 Tflop/s for FP16.

Not the best of times for stuff that doesn't fit matrix processing units.

mpreda 2 years ago |

How much does one such card cost?

metadat 2 years ago |

> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator

How much does a single 200Gbit active (or inactive) fiber cable cost? Probably thousands of dollars, making even the cabling for each card Very Expensive. Never mind the network switches themselves.

Simultaneously impressive and disappointing.

lillecarl 2 years ago | |

https://www.fs.com/de-en/products/115636.html 2 meters seems to be about 100$, which isn't unreasonable.

If you're going fiber instead of twinax it's another order of magnitude and a bit for transceivers, but cables are pretty cheap still.

You seem to be loading negative energy into this release from the get-go

metadat 2 years ago | | |

You're going to need a lot more than 2 meters... It's probably AOC (Active-Optical Fiber Cable), they're pricey even for 40Gbit, at DC lengths.

throwaway2037 2 years ago | |

What do you mean by active vs inactive fiber cable? I tried to Google about this distinction, but I couldn't find anything helpful.

metadat 2 years ago | | |

My off-the-cuff take: AOC's are a specific kind of fiber optic cable, typically used in data center applications for 100Gbit+ connections. The alternate types of fiber are typically referred to as passive fiber cables, e.g. simplex or duplex, single-mode (single fiber strands, usually in a yellow jacket) or multi-mode (multiple fiber strands, usually in an orange jacket). Each type of passive fiber cable has specific applications and requires matching transceivers, whereas AOCs are self-contained with the transceivers pre-terminated on.

If you search for "AOC Fiber", lots of resources will pop up. FS.com is one helpful resource.

https://community.fs.com/article/active-optical-cable-aoc-ri...

> Active optical cable (AOC) can be defined as an optical fiber jumper cable terminated with optical transceivers on both ends. It uses electrical-to-optical conversion on the cable ends to improve speed and distance performance of the cable without sacrificing compatibility with standard electrical interfaces.

YetAnotherNick 2 years ago |

So now hardware companies have stopped reporting FLOP/s numbers and report in arbitrary units of parallel operations/s.

AnonMO 2 years ago | |

1835 tflops fp8. you have to look for it, but they posted it. The link in the op is just an announcement. the white paper has more info. https://www.intel.com/content/www/us/en/content-details/8174...
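One way to read that number is against the 3.7 TB/s memory bandwidth quoted elsewhere in this thread. The sketch below computes the roofline-style ridge point, i.e. how many FP8 operations a kernel must perform per byte fetched from HBM before compute rather than memory becomes the limit. It uses peak figures only and the helper name is hypothetical, so treat it as a rough guide.

```python
def ridge_point_flop_per_byte(peak_tflop_s: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity at which peak compute and peak bandwidth balance."""
    return peak_tflop_s / bandwidth_tb_s


FP8_MATRIX_TFLOPS = 1835   # from the white paper figure quoted above
VECTOR_FP32_TFLOPS = 14    # vector FP32 figure quoted elsewhere in this thread
HBM_TB_S = 3.7             # memory bandwidth from the announcement

matrix_ridge = ridge_point_flop_per_byte(FP8_MATRIX_TFLOPS, HBM_TB_S)
vector_ridge = ridge_point_flop_per_byte(VECTOR_FP32_TFLOPS, HBM_TB_S)

print(f"FP8 matrix ridge point:  {matrix_ridge:.0f} FLOP/byte")
print(f"FP32 vector ridge point: {vector_ridge:.1f} FLOP/byte")
```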

m3kw9 2 years ago |

Can you run Cuda on it?

boroboro4 2 years ago | |

No one runs Cuda, everyone runs PyTorch. Which you can run on it.

m3kw9 2 years ago | | |

So does it support cuda or not or are you gonna argue little things all day?

brcmthrowaway 2 years ago |

Does this support apple silicon?

whalesalad 2 years ago |

https://www.merriam-webster.com/dictionary/gaudy

riazrizvi 2 years ago | |

That’s an i. He’s one of the greatest architects of all time. https://www.archdaily.com/877599/10-must-see-gaudi-buildings...

jagger27 2 years ago | |

https://en.wikipedia.org/wiki/Antoni_Gaud%C3%AD

TheAceOfHearts 2 years ago | |

Honestly, I thought the same thing upon reading the name. I'm aware of the reference to Antoni Gaudí, but having the name sound so close to gaudy seems a bit unfortunate. Surely they must've had better options? Then again I don't know how these sorts of names get decided anymore.

prewett 2 years ago | | |

'Gaudi' is properly pronounced Ga-oo-DEE in his native Catalan, whereas (in my dialect) 'gaudy' is pronounced GAW-dee. My guess is Intel wasn't even thinking about 'gaudy' because they were thinking about "famous architects" or whatever the naming pool was. Although, I had heard that the 'gaudy' came from the architect's name because of what people thought of his work. (I'm not sure this is correct, it was just my first introduction to the word.)

whalesalad 2 years ago | | |

to be fair intel is not known for naming things well.

> RDMA over Converged Ethernet (RoCE) or InfiniBand over Ethernet (IBoE)[1] is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet.