Building Meta's GenAI infrastructure (engineering.fb.com)

664 points by mootpt 2 years ago | 303 comments

float8 got a mention! x2 more FLOPs! Also xformers has 2:4 sparsity support now so another x2? Is Llama3 gonna use like float8 + 2:4 sparsity for the MLP, so 4x H100 float16 FLOPs? Pytorch has fp8 experimental support, whilst attention is still complex to do in float8 due to precision issues, so maybe attention is in float16, and RoPE / layernorms in float16 / float32, whilst everything else is float8?
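To make the 2:4 sparsity idea concrete, here is a minimal sketch of the pruning pattern: in every group of four weights, only the two largest-magnitude values are kept. The function name `prune_2_of_4` and the sample weights are made up for illustration. This is not how xformers implements it, and the 4x figure is only the idealized product of the two 2x factors mentioned above.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weights, chosen only to show the pattern.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)

# Idealized upper bound: 2x from float8 times 2x from 2:4 sparsity.
ideal_speedup_vs_fp16 = 2 * 2
```

Real hardware also has to store a small index of which two positions survived in each group, so the practical gain is likely below the idealized factor.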

GamerAlias 2 years ago | |

I was thinking why is this one guy on HN so deeply interested and discussing technical details from a minor remark. Then I clocked the name. Great work on Gemma bugs

danielhanchen 2 years ago | | |

Oh thanks :) I always like small details :)

andy99 2 years ago | |

Is there float8 support in any common CPU intrinsics? It sounds interesting but curious what will be the impact if any on CPU inference.

teaearlgraycold 2 years ago | | |

I’m curious if there’s a meaningful quality difference between float8 and some uint8 alternative (fixed precision or a look up table).

ashvardanian 2 years ago | | |

Nope. Moreover, simulating it even with AVX-512 is quite an experience. Been postponing it for 2 years now... But first of all, you need to choose the version of float8 you want to implement, as the standards differ between GPU vendors.
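As an illustration of why the choice of float8 version matters, the sketch below decodes the same byte under the two layouts from the OCP FP8 specification, E4M3 and E5M2. It is an assumption that these are the variants meant above; other variants (for example ones with a different exponent bias) exist and are not covered. The helper names are made up for this example.

```python
def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit pattern (0..255). Layout: sign | exponent | mantissa."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones = (1 << exp_bits) - 1
    if ieee_specials and exp == all_ones:
        # E5M2 follows IEEE conventions: inf when the mantissa is 0, else NaN.
        return sign * float("inf") if man == 0 else float("nan")
    if not ieee_specials and exp == all_ones and man == (1 << man_bits) - 1:
        # E4M3 has no infinity; only the all-ones pattern is NaN.
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


# The same bit pattern means different numbers in the two formats.
print(e4m3(0x38), e5m2(0x38))
# Largest finite values: E4M3 trades range for an extra mantissa bit.
print(e4m3(0x7E), e5m2(0x7B))
```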

ipsum2 2 years ago | |

You're still bounded by memory bandwidth, so adding multiples to FLOPs is not going to give you a good representation of overall speedup.

jabl 2 years ago | | |

Well, those smaller floats require less BW to transfer back and forth as well. Perhaps not a reduction linear in the size of the float, as maybe smaller floats require more iterations and/or more nodes in the model graph to get an equivalent result.

But rest assured there's an improvement, it's not like people would be doing it if there wasn't any benefit!
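Both points in this exchange can be sketched with a simple roofline model, where attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The accelerator numbers below are hypothetical and chosen for round arithmetic, and the sketch assumes the ideal case where halving the float size exactly halves the bytes moved.

```python
def attainable_flops(peak_flops, mem_bandwidth, flops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bandwidth * flops_per_byte)


# Hypothetical accelerator and kernel, not measured values.
PEAK_FP16 = 1000e12      # FLOP/s
BANDWIDTH = 3e12         # bytes/s
INTENSITY_FP16 = 100.0   # FLOPs per byte for a memory-bound kernel

fp16 = attainable_flops(PEAK_FP16, BANDWIDTH, INTENSITY_FP16)

# Doubling peak FLOPs alone does nothing while the kernel is memory-bound.
more_flops_only = attainable_flops(2 * PEAK_FP16, BANDWIDTH, INTENSITY_FP16)

# With fp8 each value is half the bytes, so FLOPs per byte doubles as well.
fp8 = attainable_flops(2 * PEAK_FP16, BANDWIDTH, 2 * INTENSITY_FP16)

print(more_flops_only / fp16, fp8 / fp16)
```

In this idealized setup the speedup comes from the smaller transfers rather than from the extra FLOPs, which is consistent with both comments.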

danielhanchen 2 years ago | | |

I'm not sure exactly on how NVIDIA calculates FLOPs, but I do know for Intel's FLOPs, it's calculated from how many FMA units, how many loads can be done in tandem, and what the throughput is. And ye fp8 requires 2x less space. Sparse 2:4 might be less pronounced, since the matrix first needs to be constructed on the fly, and there is like a small matrix of indicator values.
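A common way to estimate a theoretical peak along these lines is to multiply cores, FMA units per core, SIMD lanes, and clock, counting each fused multiply-add as two operations. The sketch below uses a hypothetical CPU, not a real part, and ignores load throughput, which the comment above notes also matters.

```python
def peak_flops(cores, fma_units_per_core, simd_lanes, clock_hz):
    """Theoretical peak FLOP/s; each fused multiply-add counts as 2 FLOPs."""
    return cores * fma_units_per_core * simd_lanes * 2 * clock_hz


# Hypothetical CPU: 32 cores, 2 FMA units per core,
# 16 fp32 lanes per unit (512-bit vectors), 2 GHz.
fp32_peak = peak_flops(32, 2, 16, 2.0e9)
print(f"{fp32_peak / 1e12:.3f} TFLOP/s")
```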

j45 2 years ago | |

Is it safe to assume this is the same float16 that exists in Apple m2 chips but not m1?

j45 2 years ago | | |

Clarification: bfloat16

“bfloat16 data type and arithmetic instructions (AI and others)”

https://eclecticlight.co/2024/01/15/why-the-m2-is-more-advan...

boywitharupee 2 years ago | |

care to explain why attention has precision issues with fp8?

danielhanchen 2 years ago | | |

Oh so float8's L2 norm error versus float32 is, I think, around 1e-4, whilst float16's is 1e-6. Sadly attention is quite sensitive. There are some hybrid methods which, just before the attention kernel (which is done in fp8), upcast the Q and K from the RoPE kernel to float16 and leave V in float8. Everything is done in fp8 on the fly, and the output is fp8. This makes errors go to 1e-6.
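A rough way to see the precision gap is to round the same values to fewer mantissa bits and compare the relative L2 error. The sketch below is a simplification: it models only mantissa rounding (10 bits for a float16-like format, 3 bits for a float8 E4M3-like format) and ignores exponent range, overflow, and subnormals. The absolute errors it prints depend on how the error is measured and should not be expected to match the figures quoted above. Only the ordering is the point.

```python
import math
import random


def round_to_mantissa_bits(x, man_bits):
    """Round x to `man_bits` explicit mantissa bits (exponent range ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (man_bits + 1)   # implicit leading bit plus explicit bits
    return math.ldexp(round(m * scale) / scale, e)


def relative_l2_error(values, man_bits):
    rounded = [round_to_mantissa_bits(v, man_bits) for v in values]
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(values, rounded)))
    return err / math.sqrt(sum(v * v for v in values))


random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(10_000)]
err_fp16 = relative_l2_error(values, man_bits=10)  # float16-like
err_fp8 = relative_l2_error(values, man_bits=3)    # float8 E4M3-like
print(f"fp16-like: {err_fp16:.2e}  fp8-like: {err_fp8:.2e}")
```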

dougdonohoe 2 years ago |

Having lived through the dot-com era, I find the AI-era slightly dispiriting because of the sheer capital cost of training models. At the start of the dot-com era, anyone could spin up an e-commerce site with relatively little infrastructure costs. Now, it seems, only the hyper-scale companies can build these AI models. Meta, Google, Microsoft, Open-AI, etc.

islewis 2 years ago |

I know we won't get this from FB, but I'd be really interested to see how the relationship of compute power to engineering hours scales.

They mention custom building as much as they can. If FB magically has the option to 10x the compute power, would they need to re-engineer the whole stack? What about 100x? Is each of these re-writes just a re-write, or is it a whole order of magnitude more complex?

My technical understanding of what's under the hood of these clusters is pretty surface level- super curious if anyone with relevant experience has thoughts?

bilekas 2 years ago | |

I'm not 100% sure, but I would make an educated guess that the cluster in the first image, for example, is a sample of scalable clusters. Throwing more hardware at it could bring improvements, but sooner or later the cost of improvements will call for an optimization or rewrite, as you call it, so usually a bit of both. It seems a bit of a balancing act really!

jvalencia 2 years ago | |

The cost of training quickly outpaces the cost of development as context length increases. So hardware is cheap until it isn't anymore, by orders of magnitude.

samstave 2 years ago | | |

But there is still significant cost in the physical buildouts of new pods/DCs, whatever, and the human engineering hours to physically build, even though it's a mix of resources across the vendors and FB? It still would be interesting to know the man hours that go into the physical build of the HW.

tintor 2 years ago | |

"just a re-write"

mirekrusin 2 years ago | | |

...the idea is that at some point it "just re-writes" itself.

jvanderbot 2 years ago |

So, I'd love to work on optimizing pipelines like this. How does one "get into" it? It seems a ML scientist with some C/C++ and infra knowledge just dips down into the system when required? Or is it CUDA/SIMD experts who move "up" into ML?

fuddle 2 years ago |

How much are they paying for H100s? If they are paying $10k: 350,000 NVIDIA H100s x $10k = $3.5b
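The same arithmetic can be run for a range of unit prices. In the sketch below the $10k figure is the guess above and the $30k figure is the street price quoted elsewhere in this thread. Neither is a confirmed price, and the function name is made up for illustration.

```python
GPU_COUNT = 350_000


def gpu_spend(unit_price_usd, count=GPU_COUNT):
    """Total GPU spend in USD for a given per-unit price (GPUs only)."""
    return unit_price_usd * count


for unit_price in (10_000, 30_000):
    total = gpu_spend(unit_price)
    print(f"${unit_price:,} per GPU -> ${total / 1e9:.1f}b")
```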

gingergoat 2 years ago |

The article doesn't mention MTIA, meta's custom ASIC for training & inference acceleration. https://ai.meta.com/blog/meta-training-inference-accelerator...

I wonder if they will use it in RSC.

benreesman 2 years ago |

I think it’s always useful to pay attention to the history on stuff like this and it’s a rare pleasure to be able to give some pointers in the literature along with some color to those interested from first-hand experience.

I’d point the interested at the DLRM paper [1]: that was just after I left and I’m sad I missed it. FB got into disagg racks and SDN and stuff fairly early, and we already had half-U dual-socket SKUs with the SSD and (increasingly) even DRAM elsewhere in the rack in 2018, but we were doing huge NNs for recommenders and rankers even for then. I don’t know if this is considered proprietary so I’ll play it safe and just say that a click-prediction model on IG Stories in 2018 was on the order of a modest but real LLM today (at FP32!).

The crazy part is they were HOGWILD trained on Intel AVX-2, which is just wild to think about. When I was screwing around with CUDA kernels we were time sharing NVIDIA dev boxes, typically 2-4 people doing CUDA were splitting up a single card as late as maybe 2016. I was managing what was called “IGML Infra” when I left and was on a first-name basis with the next-gen hardware people and any NVIDIA deal was still so closely guarded I didn’t hear more than rumors about GPUs for training let alone inference.

350k Hopper this year, Jesus. Say what you want about Meta but don’t say they can’t pour concrete and design SKUs on a dime: best damned infrastructure folks in the game pound-for-pound to this day.

The talk by Thomas “tnb” Bredillet in particular I’d recommend: one of the finest hackers, mathematicians, and humans I’ve ever had the pleasure to know.

[1] https://arxiv.org/pdf/1906.00091.pdf

[2] https://arxiv.org/pdf/2108.09373.pdf

[3] https://engineering.fb.com/2022/10/18/open-source/ocp-summit...

[4] https://youtu.be/lQlIwWVlPGo?si=rRbRUAXX7aM0UcVO

DEDLINE 2 years ago |

I wonder if Meta would ever try to compete with AWS / MSFT / GOOG for AI workloads

mjburgess 2 years ago |

It'd be great if they could invest in an alternative to nvidia -- then, in one fell swoop, destroy the moats of everyone in the industry.

math_dandy 2 years ago | |

A company moving away from Nvidia/CUDA while the field is developing so rapidly would result in that company falling behind. When (if) the rate of progress in the AI space slows, then perhaps the big players will have the breathing room to consider rethinking foundational components of their infrastructure. But even at that point, their massive investment in Nvidia will likely render this impractical. Nvidia decisively won the AI hardware lottery, and that's why it's worth trillions.

whiplash451 2 years ago | | |

People said the same thing when tensorflow was all the rage and pytorch was a side project.

Granted, HW is much harder than SW, but I would not discount Meta's ability to displace NVIDIA entirely.

mjburgess 2 years ago | | |

I'm more concerned to avoid nvidia (et al.) market domination, than chasing the top-edge of the genAI benefits sigmoid. This will prevent much broad-based innovation.

paxys 2 years ago | |

Except that "one fell swoop" would realistically be 20+ years of research and development from the top minds in the semiconductor industry.

logicchains 2 years ago | | |

It's not the hardware keeping NVidia ahead, it's the software. Hardware-wise AMD is competitive with NVidia, but their lack of a competitive CUDA alternative is hurting adoption.

brucethemoose2 2 years ago | |

Facebook very specifically bought and customized Intel SKUs tailored for AI workloads for some time.

John23832 2 years ago | |

https://engineering.fb.com/2023/10/18/ml-applications/meta-a...

aeyes 2 years ago | |

Isn't Google trying to do this with their TPUs?

elwell 2 years ago |

> Meta’s long-term vision is to build artificial general intelligence (AGI)

valzam 2 years ago | |

Don't worry, this goal will change with the next hype cycle

latchkey 2 years ago | | |

I pity the fools that think AI is just another internet hype cycle.

hendersoon 2 years ago |

350k H100 cards, around ten billion dollars just for the GPUs. Less if Nvidia gives a volume discount, which I imagine they do not.

renegade-otter 2 years ago | |

It will be ironic if Meta sinks all this money into the new trend and finds out later that it has been a huge boondoggle, just as publishers followed Facebook's "guidance" on video being the future, subsequently gutting the talent pool and investing into video production and staff - only to find out it was all a total waste.

motoxpro 2 years ago | | |

It already paid off, when the world moved from deterministic to probabilistic ad modeling. That's why their numbers are so good right now compared to every other advertiser.

tayo42 2 years ago | | |

What does video not being the future mean? In social media, TikTok and Reels are everywhere.

foobarian 2 years ago | | |

There is still hope then for cheap gaming GPUs some day soon! I have pretty much the last 10 years of flagship releases to catch up on...

echelon 2 years ago | | |

As a practitioner in the field, I can assure you this is not a boondoggle.

Those GPUs are going to subsume the entire music, film, and gaming industries. And that's just to start.

alexsereno 2 years ago |

Honestly Meta is consistently one of the better companies at releasing tech stack info or just open sourcing, these kinds of articles are super fun

rshm 2 years ago | |

I think some elements of this stack might flow into the open compute.

adamnemecek 2 years ago | |

Do you find this informative?

alexsereno 2 years ago | | |

Yes of course - it depends on what lens though. If you mean "I'm learning to build better from this" then no, but it's very informative on Meta's own goals and mindset as well as real numbers that allow comparison to investment in other areas, etc. Also the point was mostly that Meta does publish a lot in the open - including actual open source tech stacks etc. They're reasonably good actors in this specific domain.

wseqyrku 2 years ago |

> Commitment to open AI innovation

I see what you did there, Meta.

owenpalmer 2 years ago | |

Haha, I noticed that too xD

zone411 2 years ago |

Meta is still playing catch-up. Might be hard to believe, but according to Reuters they were trying to run AI workloads mostly on CPUs until 2022, and they had to pull the plug on the first iteration of their AI chip.

https://www.reuters.com/technology/inside-metas-scramble-cat...

axpy906 2 years ago | |

Definitely has some pr buzz and flex in the article. Now I see why.

latchkey 2 years ago |

> we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

Interesting dig on IB. RoCE is the right solution since it is an open standard and, more importantly, available without a 52+ week lead time.

loeg 2 years ago | |

Yeah, and RoCE isn't single vendor. I'm not sure IB scales to the relevant cluster sizes, either.

anonymousDan 2 years ago | | |

Is NVLink just not scalable enough here?

seydor 2 years ago |

This is great news for Nvidia and their stock, but are they sure the LLMs and image models will scale indefinitely? Nature and biology have a preference for sigmoids. What if we find out that AGI requires different kinds of CPU capabilities?

jiggawatts 2 years ago | |

If anything, NVIDIA H100 GPUs are too general purpose! The optimal compute for AI training would be more specialised, but then would be efficient at only one NN architecture. Until we know what the best architecture is, the general purpose clusters remain a good strategy.

spencerchubb 2 years ago |

All this compute and my Instagram Reels feed still isn't as good as my TikTok feed

zeroonetwothree 2 years ago | |

What does that have to do with Gen AI

lmm 2 years ago | | |

If Gen AI doesn't have anything to do with "Meta"'s actual business then WTF are they setting all this money on fire for?

spencerchubb 2 years ago | | |

GenAI infra is the same as regular AI infra. They used GenAI in the title because it's a buzzword.

mrkramer 2 years ago |

"Share this: Hacker News" Noice

BonoboIO 2 years ago | |

At first I thought "what are you talking about", then I checked my uBlock filters. They were blocking the whole "Share this" content section.

Sharing on Hacker News ... they know their audience.

mrkramer 2 years ago | | |

I also use uBlock but my filters are the default ones and I saw it without any problem but tbh this is the first time that I saw some post on the Web have HN as a share option or the first time that I was surprised seeing it. Maybe it has something to do with Google ranking "trusted human information and knowledge" higher than "non-human" information and knowledge[0] or simply some Meta software engineer loves and uses HN so s/he decided to include HN as well, idk.

[0] https://news.ycombinator.com/item?id=39423949

pinko 2 years ago |

The link mentions "our internal job scheduler" and how they had to optimize it for this work -- does anyone know what this job scheduler is called, or how it works?

KaiserPro 2 years ago | |

it might be twine: https://www.usenix.org/system/files/osdi20-tang.pdf

but I suspect it's not that, because Twine is optimised for services rather than batch processing, and doesn't really have the concept of priorities.

radicality 2 years ago | | |

I would think it’s probably that. Also, has this been renamed to Twine from Tupperware?

zerop 2 years ago |

> At Meta, we handle hundreds of trillions of AI model executions per day

Such a large number. Does that make sense?

GeneralMayhem 2 years ago | |

Sure. 100T/day * 1day/86400sec ~= 1B/sec. They're probably considering at least a few hundred candidates per impression, and every impression is going to go through _at least_ two models (relevance and pCTR/revenue), so you could get there just with online serving at 5Mqps, which is plausible. But they're also going to be doing a lot of stuff in batch - spam predictions, ad budget forecasts, etc - so that every candidate actually runs through four or five different models, and every actual impression could do more than that.
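The back-of-envelope estimate above can be reproduced in a few lines. In this sketch, 100 trillion per day stands in for "hundreds of trillions", and 200 model executions per impression is an assumption chosen because it is roughly what the 5M qps figure implies. With a few hundred candidates and two models each, the implied qps would be lower.

```python
SECONDS_PER_DAY = 86_400

executions_per_day = 100e12
executions_per_sec = executions_per_day / SECONDS_PER_DAY  # about 1.16e9


def implied_qps(executions_per_sec, executions_per_impression):
    """Impressions per second needed to account for the execution rate."""
    return executions_per_sec / executions_per_impression


# Assumed: about 200 model executions per impression
# (for example 100 candidates x 2 models).
qps = implied_qps(executions_per_sec, 200)
print(f"{executions_per_sec:.2e} executions/s, about {qps / 1e6:.1f}M qps")
```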

sangnoir 2 years ago | |

How many ads does Meta serve a day, and how many AI model executions are done for each one? Repeat the same for stories, post and comment recommendations on Facebook and Instagram, and you have very big numbers. To that, Add VR, internal modeling and other backoffice/ offline analyses over billions of users and you'll easily get into the trillions.

dakiol 2 years ago | |

What's an "AI model execution"? When I ask ChatGPT something and it answers me, does that count as 1 "AI model execution" for OpenAI?

pants2 2 years ago | |

Perhaps there's some combinatorics where every time an ad or post is displayed to the user, it runs through some hundreds/thousands of candidates and computes their relevance.

ilaksh 2 years ago |

"Everything You Wanted to Know About GenAI at Meta, Except the One Thing You Honestly Care About" (Llama 3).

dekhn 2 years ago |

it's really interesting just how similar these systems are to the designs adopted for HPC over the past few decades. I'm salty because it took a while for the ML community to converge on this (20+K GPUs connected by a real fabric with low latency and high bandwidth).

sashank_1509 2 years ago |

Meta's backing itself into a corner with its admirable commitment to open source. Unfortunately, at some point when they decide to monetize their billions spent and try to release a closed source model, the level of vitriol they will deal with will be an order of magnitude above what even OpenAI is experiencing. I don’t think they realize that!

bigcat12345678 2 years ago | |

Meta's commitment to Open Source is well calculated.

OCP is a way to rally lower-tier vendors to form a semi-alliance to keep up with super-gorillas like AWS & Google.

LLaMA has already gained much more than its cost (look at the stock price, the open source ecosystem built around LLaMA, and Google's open source Gemma models, which are proof of Meta's success).

IMHO, Meta's Open Source strategy has already covered at least 5 years of prospects. That's enough to finesse a 180 degree turnaround if necessary (i.e., from open source to closed source).

Horffupolde 2 years ago | |

The general public doesn’t care. Only developers.

marmaduke 2 years ago |

Just for comparison, the Swiss CSCS's new Alps system will get 5k GH200 nodes (each with an H100).

dazhbog 2 years ago |

Searched H100 and an Amazon link popped up. Good reviews.

https://www.amazon.com/Tesla-NVIDIA-Learning-Compute-Graphic...

mejutoco 2 years ago | |

Those reviews are hilarious

delanyoyoko 2 years ago |

You've got to read "open" roughly 3x in a paragraph.

papichulo2023 2 years ago | |

If they release models, I honestly don't care; they can brag about that as much as they want.

lvl102 2 years ago |

This reads more like a flex for the investment community.

codingjaguar 2 years ago |

"By the end of 2024, we’re aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s." This AI game is turning into a GPU war. Heard that Meta is pushing a lot of CPU workloads to GPU to co-locate with model inference for infra simplicity.

delegate 2 years ago |

Subtitled 'Here's what you'll never be able to do'.

froonly 2 years ago |

lmfao at the Meta folks not giving any credit whatsoever to the company that actually came up with and implemented the infrastructure work.

jfkfif 2 years ago | |

What’s the company?

sangnoir 2 years ago | | |

Facebook.

pwb25 2 years ago |

so tired of this, not everyone needs to work on AI stuff. work on the facebook page instead, which is a disaster

sidcool 2 years ago |

Those are some seriously great engineering numbers. Meta, with all the negative pressure it receives (rightfully so), is an engineering powerhouse.

But I do wonder how they foresee monetising this.

pedrovhb 2 years ago |

Meta seems to actually be taking all the right steps in how they're contributing to open source AI research. Is this a "commoditize your complement" kind of situation?

CuriouslyC 2 years ago |

Yann wants to be open and Mark seems happy to salt the earth.

torginus 2 years ago | |

I genuinely think one of the most plausible short-term dangers of AI is the creation of lifelike bots which will be absolutely indistinguishable from real humans in short-form online interaction.

Since people don't want to talk to algorithms, this would result in them shunning all social media, which is a huge danger to companies in the space.

bananabrick 2 years ago | |

What do you mean?

CuriouslyC 2 years ago | | |

In pretty much every interview, Yann has talked about how important it is that AI infrastructure is open and distributed for the good of humanity, and how he wouldn't work for a company that wasn't open. Since Mark doesn't have an AI product to cannibalize, it's in his interest to devalue the AI products of others ("salting the earth").

choppaface 2 years ago |

Total cluster they say will reach 350k H100, which at $30k street price is about $10b.

In contrast, Microsoft is spending over $10b per quarter capex on cloud.

That makes Zuck look conservative after his big loss on metaverse.

https://www.datacenterdynamics.com/en/news/q3-2023-cloud-res...

yuliyp 2 years ago | |

That's a weird comparison. The GPU is only a part of the capex: there's the rest of the servers and racks, the networking, as well as the buildings/cooling systems to support that.

KaiserPro 2 years ago | |

the biggest cost at meta is infra.

> In contrast, Microsoft is spending over $10b per quarter capex on cloud.

to service other people's workloads. It's a different business.

baby 2 years ago | |

What loss lol. Stop the fud

Legend2440 2 years ago | | |

Has literally anyone spent money on the metaverse? Maybe it'll still take off in the future, but it's a $40b loss so far.