Meta, looking to catch up with rivals in the generative AI space, is spending billions on its own AI efforts. A portion of those billions goes toward recruiting AI researchers. But an even bigger chunk is spent on hardware development, specifically chips to run and train Meta’s AI models.
Meta revealed the newest fruit of its chip development efforts today, conspicuously a day after Intel announced its latest AI accelerator hardware. Called the “next-generation” Meta Training and Inference Accelerator (MTIA), the successor to last year’s MTIA v1, the chip runs models including those for ranking and recommending display ads on Meta properties (e.g. Facebook).
Compared to MTIA v1, which was built on a 7nm process, the next-generation MTIA is 5nm. (In chip manufacturing, “process” refers to the size of the smallest component that can be built on the chip.) The next-generation MTIA is a physically larger design, packed with more processing cores than its predecessor. And while it consumes more power — 90 W vs. 25 W — it also has more internal memory (128 MB vs. 64 MB) and runs at a higher average clock speed (1.35 GHz, up from 800 MHz).
Meta says the next-generation MTIA is currently live in 16 of its data center regions and delivers up to 3x better overall performance compared to MTIA v1. If that “3x” claim sounds a little vague, you’re not wrong — we thought so, too. But Meta would only volunteer that the figure came from testing the performance of “four base models” on both chips.
“Because we control the entire stack, we can achieve greater efficiency compared to commercially available GPUs,” Meta writes in a blog post shared with TechCrunch.
Meta’s hardware reveal — which comes just 24 hours after a press briefing on the company’s various ongoing generative AI initiatives — is unusual for a number of reasons.
First, Meta reveals in the post that it’s not currently using the next-gen MTIA for generative AI training workloads, though the company claims it has “several projects underway” exploring this. Second, Meta admits that the next-generation MTIA won’t replace GPUs for running or training models — but will complement them.
Reading between the lines, Meta is moving slowly — perhaps more slowly than it would like.
Meta’s AI teams are almost certainly under pressure to cut costs. The company is set to spend an estimated $18 billion by the end of 2024 on GPUs for training and running generative AI models, and — with training costs for cutting-edge generative models running in the tens of millions of dollars — in-house hardware presents an attractive alternative.
And while Meta’s hardware effort drags, rivals are pulling ahead, much to the consternation of Meta’s leadership, I’d suspect.
Google this week made its fifth-generation custom chip for training AI models, the TPU v5p, generally available to Google Cloud customers, and unveiled its first dedicated chip for running models, the Axion. Amazon has several custom AI chip families under its belt. And Microsoft last year entered the fray with the Azure Maia AI Accelerator and the Azure Cobalt 100 CPU.
In the blog post, Meta says it took less than nine months to “go from first silicon to production models” of the next-gen MTIA, which, to be fair, is shorter than the typical window between Google TPUs. But Meta has a lot of work to do if it hopes to achieve a measure of independence from third-party GPUs — and match its stiff competition.