Meta MTIA v2 – Meta Training and Inference Accelerator (ai.meta.com)

189 points by _yo2u 2 years ago | 60 comments

jsheard 2 years ago |

I like the interactive 3D widget showing off the chip. Yep, that sure is a metal rectangle.

whilenot-dev 2 years ago | |

It really annoys me that the loading animation of these before/after images doesn't finish on Firefox, and that it won't let me drag the knob with the separator. So no "Under the hood" for me.

a_wild_dandan 2 years ago | | |

Dragging the top-left corner works, for some reason. Really bizarre UI issue.

TulliusCicero 2 years ago | |

Exactly what I was thinking. Like showing off a model of a blank DVD.

huevosabio 2 years ago | |

jajajaj I thought the same! Maybe someone with hardware experience can make sense of this?

modeless 2 years ago |

Intel Gaudi 3 has more interconnect bandwidth than this has memory bandwidth. By a lot. I guess they can't be fairly compared without knowing the TCO for each. I know in the past Google's TPU per-chip specs lagged Nvidia but the much lower TCO made them a slam dunk for Google's inference workloads. But this seems pretty far behind the state of the art. No FP8 either.

leetharris 2 years ago | |

They are different architectures optimized for different things.

From the Meta post: "This chip’s architecture is fundamentally focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models."

Optimizing for ranking/recommendation models is very different from general purpose training/inference.

janalsncm 2 years ago | | |

Translation: you don’t need to serve 96 layer transformers for ranking and recommendation. You’re probably using a neural net with around 10-20 million parameters. But it needs to be fast and highly parallelizable, and perhaps perform well in lower precisions like f16. And it would be great to have a very large vector LUT on the same chip.
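To make the shape of such a workload concrete, here is a minimal sketch of a ranking model: a large embedding table (the "vector LUT") feeding a tiny dense network. All sizes and names (`NUM_ITEMS`, `EMBED_DIM`, `HIDDEN`, `score`, `rank`) are hypothetical and chosen only to keep the example small; nothing here describes Meta's actual models.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
NUM_ITEMS = 1000   # rows in the embedding table (the "vector LUT")
EMBED_DIM = 8      # width of each embedding vector
HIDDEN = 16        # width of the single hidden layer

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Small random weights; a real model would load trained values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding_table = rand_matrix(NUM_ITEMS, EMBED_DIM)
w1 = rand_matrix(2 * EMBED_DIM, HIDDEN)
w2 = rand_matrix(HIDDEN, 1)


def matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]


def pooled_embedding(ids):
    """Gather rows from the table and average them (the memory-bound step)."""
    rows = [embedding_table[i] for i in ids]
    return [sum(col) / len(rows) for col in zip(*rows)]


def score(user_history_ids, candidate_id):
    """Score one candidate: lookup, concatenate, one ReLU layer, one output."""
    user_vec = pooled_embedding(user_history_ids)
    item_vec = embedding_table[candidate_id]
    hidden = [max(0.0, h) for h in matvec(user_vec + item_vec, w1)]
    return matvec(hidden, w2)[0]


def rank(user_history_ids, candidate_ids):
    """Order candidates from highest to lowest score."""
    return sorted(candidate_ids,
                  key=lambda c: score(user_history_ids, c),
                  reverse=True)
```

Even in this toy version the table holds far more values than the dense layers, which is why lookup capacity and bandwidth matter as much as raw compute for this kind of model.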

modeless 2 years ago | | |

Yeah, it may fit their current workload perfectly, but it doesn't seem very future proof with the limited bandwidth. Given how fast ML is evolving these days I question if it makes sense to design and deploy a chip like this. I guess they do have a very large workload that will benefit immediately.

chabons 2 years ago | |

> Intel Gaudi 3 has more interconnect bandwidth than this has memory bandwidth.

LPDDR5 vs HBM2e. I'm guessing there's a 2-5x price difference between those, but even so it's an interesting choice; I don't know of any other accelerators that spec DDR. But yeah, without exact TCO numbers it's hard to compare exactly.

namibj 2 years ago | | |

Bandwidth is far more power hungry for DDR, but capacity is far cheaper.

If the bandwidth capability of DDR suffices, HBM isn't worth it.

At least that holds for LPDDR. GDDR may well not be worth it under data center TCO considerations because of its high interface power usage. Feel free to correct me if I'm mistaken; the numbers in question aren't easy to search for, so I didn't confirm the LPDDR vs. GDDR part.

chessgecko 2 years ago | |

Also it's at 90 watts vs 900 watts for Gaudi 3, so the flops and memory bandwidth per watt are much more comparable.
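A rough sketch of that per-watt comparison follows. The 90 W and 900 W figures come from this thread; the memory bandwidth and BF16 throughput values are assumptions recalled from the vendors' public announcements and should be checked against the official spec sheets before being relied on.

```python
# Assumed figures, not verified here: memory bandwidth in GB/s and dense
# BF16 throughput in TFLOPS as recalled from vendor announcements.
# TDP values (90 W and 900 W) are the ones quoted in the thread.
SPECS = {
    "MTIA v2": {"tdp_w": 90, "mem_bw_gbs": 204.8, "bf16_tflops": 177},
    "Gaudi 3": {"tdp_w": 900, "mem_bw_gbs": 3700, "bf16_tflops": 1835},
}


def per_watt(spec):
    """Normalize bandwidth and compute by TDP."""
    return {
        "mem_bw_gbs_per_w": spec["mem_bw_gbs"] / spec["tdp_w"],
        "bf16_tflops_per_w": spec["bf16_tflops"] / spec["tdp_w"],
    }


for name, spec in SPECS.items():
    normalized = per_watt(spec)
    print(f"{name}: {normalized['mem_bw_gbs_per_w']:.2f} GB/s per W, "
          f"{normalized['bf16_tflops_per_w']:.2f} TFLOPS per W")
```

Under these assumed numbers, a raw memory bandwidth gap of roughly 18x shrinks to under 2x per watt, and compute per watt comes out nearly equal. TDP is a ceiling and not a measured draw, so this is only a first-order comparison.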

modeless 2 years ago | | |

With high end chips like that it's often possible to get dramatically better efficiency by running it at less than peak power consumption, like 90% performance at 50% power or something like that. It's hard to compare the numbers in a fair way.

moffkalast 2 years ago | | |

It would be interesting if this could be made into a reasonably priced (lmao) card for home inference if they intend to mass produce it.

Can't imagine any reason other than cost for going with LPDDR5; LPDDR5X has more bandwidth and GDDR6 has even more.

cma 2 years ago | |

Gaudi 3 has only 48 MB of SRAM per die (96 MB across both) vs 256 MB here, which may increase Gaudi's memory bandwidth needs. Way different power consumption too.

mlsu 2 years ago |

Certainly an interesting looking chip. It looks like it's for recommendation workloads. Are those workloads very specific, or is there a possibility to run more general inference (image, language, etc) on this accelerator?

Also, they mention a compiler in PyTorch. Is that open source? I really liked the Google Coral chips; they are perfect little chips for running image recognition and bounding box tasks. But since the compiler is closed source, it's impossible to extend them for anything beyond what Google had in mind when they came out in 2018, and they are completely tied to TensorFlow, with a very risky software support story going forward (it's a Google product after all).

Is it the same story for this chip?

chessgecko 2 years ago |

I thought MTIA v2 would use the MX formats (https://arxiv.org/pdf/2302.08007.pdf); guess they were too far along in the process to get them in this time.

Still, this looks like it would make for an amazing prosumer home AI setup. You could probably fit 12 accelerators on a wall outlet with change for a CPU, and that would have enough memory to serve a 2T model at 4-bit, plus reasonable dense performance for small training runs and image stuff. Potentially not too costly to make either, without having to pay for CoWoS or HBM.

I'd definitely buy one if they ever decided to sell it and could keep the price under like $800/accelerator.
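The back-of-the-envelope math behind the 12-accelerator idea can be sketched as below. The 90 W TDP is from this thread; the 128 GB of memory per accelerator and the 120 V / 15 A household circuit are assumptions, and the model size counts weights only (no KV cache or activations).

```python
ACCELERATORS = 12
TDP_W = 90                # per accelerator, as quoted in the thread
MEM_PER_ACCEL_GB = 128    # assumption: LPDDR5 capacity per accelerator
OUTLET_W = 120 * 15       # assumption: US 120 V / 15 A circuit, peak rating


def model_size_gb(params, bits_per_param):
    """Weight storage only, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


accel_power_w = ACCELERATORS * TDP_W
headroom_w = OUTLET_W - accel_power_w
total_mem_gb = ACCELERATORS * MEM_PER_ACCEL_GB
weights_gb = model_size_gb(2e12, 4)

print(f"accelerators draw {accel_power_w} W, leaving {headroom_w} W")
print(f"{total_mem_gb} GB total memory vs {weights_gb:.0f} GB of 4-bit weights")
```

Under these assumptions the accelerators draw 1080 W of an 1800 W circuit, and the 4-bit weights of a 2T-parameter model (1000 GB) fit in 1536 GB with room to spare. Continuous loads are usually derated below the peak circuit rating, so the real headroom would be tighter.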

buildbot 2 years ago | |

I suppose it might. There are not a lot of details about what they mean by INT8 support (what kind of sparsity, for example?), so it could be MXINT8 or something else.

Glad someone was thinking the same thing I was though!

chessgecko 2 years ago | | |

It's gotta be that 2/4 sparsity that everyone has but that I haven't seen used anywhere, right? If they put it in they must be using it, but I'm not sure for what. And without details I think it's a good bet that INT8 is the standard INT8.

Wishful thinking, but maybe they'll announce they're selling it alongside the giant Llama 3, because there's no good, cheap way to run inference on something like that at home at the moment, and this could change that.
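For readers unfamiliar with the term, "2/4 sparsity" most likely refers to 2:4 structured sparsity, where at most two of every four consecutive weights are nonzero. The sketch below shows magnitude-based pruning to that pattern; whether MTIA v2 uses this exact scheme is not confirmed anywhere in the thread, and `prune_2_of_4` is a hypothetical helper name.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


print(prune_2_of_4([0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]))
```

Because the zero positions follow a fixed pattern, hardware can skip half the multiplications and store only the kept values plus small index metadata, which is where the doubled "sparse" throughput figures on spec sheets tend to come from.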

teaearlgraycold 2 years ago |

Still seems pretty primitive. Very cool though.

I can only imagine the lack of fear Jensen experiences when reading this.

airstrike 2 years ago | |

It would be foolish to underestimate the long-term capabilities of a sufficiently funded and driven competitor.

moffkalast 2 years ago | |

*adjusts black leather jacket* "Look at what they need to mimic a fraction of our power."

prng2021 2 years ago |

3x performance but >3x TDP. Am I missing something or is that unimpressive?
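The arithmetic behind that question, as a small sketch. It uses the 3x performance figure from the comment above and the 25 W (v1) and 90 W (v2) TDP figures quoted elsewhere in this thread; TDP is a ceiling and not a measured draw, so the result is only indicative.

```python
def perf_per_watt_ratio(perf_ratio, old_tdp_w, new_tdp_w):
    """Generation-over-generation change in performance per watt of TDP."""
    return perf_ratio / (new_tdp_w / old_tdp_w)


# 3x performance, 25 W -> 90 W (3.6x TDP), all figures from the thread.
ratio = perf_per_watt_ratio(3.0, 25, 90)
print(f"perf per watt vs v1: {ratio:.2f}x")
```

On those numbers, performance per watt of TDP drops to about 0.83x of the first generation. The picture would change if the real speedup on the target workloads is larger than 3x, or if actual power draw sits well below TDP.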

jrgd 2 years ago |

I find it weird that, while not everyone agrees that Meta, Facebook, and social networks in general are doing some good for society and our democracies, they still manage to spend incredible amounts of money, energy, and time developing solutions to problems we aren't exactly sure are worth solving…

pptr 2 years ago | |

What is worth solving in your opinion? Should they not make their service more efficient?

I assume this helps reduce their server and electricity costs. At a certain scale these things pay off.

ixaxaar 2 years ago | |

If all this turns out to be useless, burning their cash for nothing seems like a great way to accelerate tech while going down. I guess that would actually be a positive thing.

duchenne 2 years ago |

Is it possible to buy it?

ein0p 2 years ago |

Come on, Zuck, undermine Google Cloud and take NVIDIA down a few pegs by offering this for purchase in good quantities.

sroussey 2 years ago |

Pretty large increase in performance over v1, particularly in sparse workloads.

Low power 25W

Could use higher bandwidth memory if their workloads were more than recommendation engines.

tasty_freeze 2 years ago | |

First gen was 25W. The new one is 90W.

sroussey 2 years ago | | |

Ah, thanks for the correction.

Still relatively low compared to GPUs.

throwaway48476 2 years ago |

It's interesting that they are not separating training and inference.

noiseinvacuum 2 years ago | |

This is specifically designed for inference for recommendation models. It’s not for LLM training or inference.

xnx 2 years ago |

My mind still boggles that a BBS+ads company would think it needs to design its own chips.

libria 2 years ago | |

Or that an online bookseller would try to rent out compute.

falcor84 2 years ago | |

"Depending on how you want to think about it, it was funny or inevitable or symbolic that the robotic takeover did not start at MIT, NASA, Microsoft or Ford. It started at a Burger-G restaurant ..."

https://marshallbrain.com/manna1

searchableguy 2 years ago | | |

https://www.ycombinator.com/companies/ofone/jobs/u2E2fCX-fou...

I saw this YC startup ad right after I finished reading this.

pksebben 2 years ago | | |

Dangit, I've got things I should be doing. Posting interesting stories during business hours... *continues grumbling incoherently*

rsynnott 2 years ago | |

Well, the first commercial computer was created by a company whose primary business was running cafes... https://en.wikipedia.org/wiki/LEO_(computer)

hackerlight 2 years ago | |

You're thinking like a startup founder where you should only focus on innovating your main product. FB is a mature company where some vertical integration can make sense.

okdood64 2 years ago | |

They literally print money; smart move for them to make this investment imo.

bevekspldnw 2 years ago |

Pretty fascinating they mention applications for ad serving but not Metaverse.

I feel like Zuck figured out that he’s just running an ads network, that the world is a long way away from some VR fever dream, and that he should focus on milking each DAU for as many clicks as possible.

ec109685 2 years ago | |

It’s not a GPU, and these chips aren’t able to generate images fast enough at inference time to be usable in a VR context.

photonbeam 2 years ago | |

He's always known what pays the bills.

bevekspldnw 2 years ago | | |

I dunno, he burned a lot of cash on the metaverse and wasn’t focused on FB. All the top talent was moved over to Metaverse, and FB was treated as a career killer. My impression is that ads work is once again a good career play. People chase promo.