From the Meta post: "This chip’s architecture is fundamentally focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models."
Optimizing for ranking/recommendation models is very different from general purpose training/inference.
LPDDR5 vs HBM2e. I'm guessing there's a 2-5x price difference between those, but even so it's an interesting choice; I don't know of any other accelerators that spec DDR. But yeah, without exact TCO numbers it's hard to compare.
If the bandwidth capability of DDR suffices, HBM isn't worth it.
At least with LPDDR; GDDR may well not be worth it under data center TCO considerations due to its high interface power usage. Feel free to correct me if I'm mistaken; the numbers in question aren't easy to search for, so I didn't confirm this (LPDDR vs. GDDR) part.
Can't imagine any reason other than cost for why they went with LPDDR5; LPDDR5X has more bandwidth and GDDR6 has even more.
And they mention a compiler in PyTorch; is that open sourced? I really liked the Google Coral chips -- they are perfect little chips for running image recognition and bounding box tasks. But since the compiler is closed source, it's impossible to extend them for anything beyond what Google had in mind for them when they came out in 2018, and they are completely tied to TensorFlow, with a very risky software support story going forward (it's a Google product after all).
Is it the same story for this chip?
Still, this looks like it would make for an amazing prosumer home AI setup. Could probably fit 12 accelerators on a wall outlet with change for a CPU, would have enough memory to serve a 2T model at 4-bit, and reasonable dense performance for small training runs and image stuff. Potentially not costing too much to make either, without having to pay for CoWoS or HBM.
I'd definitely buy one if they ever decided to sell it and could keep the price under like $800/accelerator.
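The memory claim above (a 2T-parameter model at 4-bit across 12 accelerators) can be sanity-checked with a rough sketch. It assumes 128 GB per accelerator, the figure mentioned elsewhere in this thread, and counts weights only; KV cache, activations, and runtime overhead are ignored, and the function names are made up for illustration.

```python
GB = 10**9  # decimal gigabytes, as memory capacities are usually quoted


def weight_bytes(params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone at the given precision."""
    return params * bits_per_param // 8


def total_capacity_bytes(accelerators: int, gb_per_accelerator: int) -> int:
    """Aggregate memory across all accelerators."""
    return accelerators * gb_per_accelerator * GB


# Assumptions: 12 accelerators, 128 GB each (figure taken from the thread).
needed = weight_bytes(2 * 10**12, 4)       # 2T parameters at 4 bits each
available = total_capacity_bytes(12, 128)

print(f"weights:  {needed / GB:.0f} GB")
print(f"capacity: {available / GB:.0f} GB")
print(f"fits (weights only): {needed <= available}")
```

Under those assumptions the weights take about 1000 GB against 1536 GB of total capacity, so the claim holds on paper with some room left for everything the sketch ignores.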
Glad someone was thinking the same thing I was though!
Wishful thinking, but maybe they'll announce selling it alongside the giant Llama 3, because there's no good, cheap way to run inference on something like that at home at the moment, and this could change that.
I can only imagine the lack of fear Jensen experiences when reading this.
I assume this helps reduce their server and electricity costs. At a certain scale these things pay off.
Low power 25W
Could use higher bandwidth memory if their workloads were more than recommendation engines.
Still relatively low compared to GPUs.
I saw this YC startup ad right after I finished reading this.
I feel like Zuck figured out he’s just running an ads network, the world is a long way away from some VR fever dream, and to focus on milking each DAU for as many clicks as possible.
Meta seems to have reported these numbers for this v2 chip:
708 TFLOPS/s (INT8) (sparsity)
354 TFLOPS/s (INT8)
And I see Nvidia reporting these numbers for its latest Blackwell chips https://www.anandtech.com/show/21310/nvidia-blackwell-archit... 4500 T(FL)OPS INT8/FP8 Tensor
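Taking the quoted figures at face value, the raw ratio is quick to compute. This is only a sketch: the numbers above don't say whether the 4500 figure is dense or sparse, so both comparisons are shown, and nothing here accounts for power, price, or memory bandwidth.

```python
# Peak INT8 throughput figures as quoted in this thread (tera-ops per second).
MTIA_V2_DENSE = 354
MTIA_V2_SPARSE = 708
BLACKWELL_QUOTED = 4500  # assumption: unclear here whether dense or sparse


def speed_ratio(a_tops: float, b_tops: float) -> float:
    """How many times faster chip A is than chip B on peak throughput alone."""
    return a_tops / b_tops


vs_dense = speed_ratio(BLACKWELL_QUOTED, MTIA_V2_DENSE)
vs_sparse = speed_ratio(BLACKWELL_QUOTED, MTIA_V2_SPARSE)

print(f"vs. MTIA dense figure:  {vs_dense:.1f}x")
print(f"vs. MTIA sparse figure: {vs_sparse:.1f}x")
```

That works out to roughly 12.7x against the dense figure and 6.4x against the sparse one, on peak numbers only.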
Am I understanding correctly that Nvidia's upcoming Blackwell chips are 5-10x faster than this one Meta just announced?
The development of this chip shows that it doesn't (and shouldn't!) matter to the ML teams at Meta how 'fast ML is evolving.'
Indeed what it demonstrates is that a huge, global, trillion-dollar business has operationalized an existing ML technology to the extent that they can invest into, and deploy, customized hardware for solving a business problem.
How ML "evolves" is irrelevant. They have a system which solves their problem, and they're investing in it.
You've gotta learn to walk before you can run
And building out specialized hardware does lock you in to a certain extent. Want to use more than 128GB of memory? Too bad, your $10B chip doesn’t support that.
Which is probably why Meta is also buying the biggest Nvidia datacenter cards by the shipload. There is no need to run inference for a small model - say for a text-ad recommendation system - on an H100 with attendant electricity and cooling costs.
You don’t always need a Ferrari to go to the store
It’s custom silicon designed for a specific, known workload. It’s not designed to be a general purpose part or to be future proofed for unknown future applications.
When a new application comes along with new requirements, the teams will use their experience to create a new chip targeting that new application.
That’s the great part about custom silicon: You’re not hitting general specs for general applications that you may not even know about yet. You’re building one very specific thing to do a very specific job and do it very well.
At Facebook's scale the spherical cow raw performance stats don't matter nearly as much as real world workloads per ops dollar. They can also repurpose their GPUs to other workloads and let their custom chips handle the boring baseline stuff.
E.g. it's common to have a full-width accumulator and, e.g., s16 gradients with u8 activations and s8 weights, with the FMA (MAC) chain of the tensor multiply operation post-scaled by a learned u32 factor plus a follow-up "learned" shift, which effectively acts as a fixed-point factor with a learned position of its point, to rescale the outcome to the u8 activation range.
By having the gradients be sufficiently wider, it's practical to use a straight-through estimator for backpropagation. I read a paper (kinda two, actually) a few months ago that dealt with this (IIRC one of them was more about the hardware/ASIC aspects of fixed-point tensor cores, the other more about model training experiments with existing low-precision integer-MAC chips, particularly with inference in mind). If requested, I can probably find it by digging through my system(s); I would have already linked it/them if the cursory search hadn't failed.
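For anyone who wants to see the forward-pass rescaling concretely, here is a minimal sketch of how I read the description above: u8 activations times s8 weights summed in a wide accumulator, then multiplied by an integer factor and right-shifted back into the u8 range. The function names, the round-to-nearest step, and the example multiplier and shift are assumptions for illustration, not taken from any particular chip or paper.

```python
def int_dot(activations, weights):
    """Dot product of u8 activations and s8 weights in a wide accumulator.

    Python ints are arbitrary precision, so the accumulator cannot overflow here;
    real hardware would use a fixed width such as 32 bits.
    """
    if any(not 0 <= a <= 255 for a in activations):
        raise ValueError("activations must fit in u8")
    if any(not -128 <= w <= 127 for w in weights):
        raise ValueError("weights must fit in s8")
    return sum(a * w for a, w in zip(activations, weights))


def requantize(acc, multiplier, shift):
    """Rescale a wide accumulator to u8: multiply, rounded right shift, clamp.

    multiplier / 2**shift plays the role of the fixed-point scale factor;
    in a trained model both values would be learned or calibrated.
    """
    rounding = (1 << (shift - 1)) if shift > 0 else 0  # round to nearest
    scaled = (acc * multiplier + rounding) >> shift
    return max(0, min(255, scaled))  # saturate to the u8 activation range


acc = int_dot([100, 200, 50], [127, 64, -32])  # 12700 + 12800 - 1600
out = requantize(acc, multiplier=3, shift=9)   # scale factor of 3/512
print(acc, out)
```

Negative accumulators saturate to 0 and large ones to 255, which is what lets the next layer consume plain u8 values again.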
Sorry if this makes anyone feel bad. It certainly made me uncomfortable typing it out, though.