Tensordyne Napier chip delivers 13x AI speed boost

How Tensordyne’s Napier chip works

A US-based AI company called Tensordyne has completed the tape-out of its new Napier chip, which it claims significantly outperforms Nvidia’s Blackwell and upcoming Rubin processors in token throughput and efficiency. The chip is built on TSMC’s 3nm process and is now in production, and the company is already forecasting more than $200 million in demand for systems built around it.

The processor is the core of the Tensordyne Napier TDN system, developed with Broadcom and HPE Juniper Networks. The platform replaces traditional large-scale multiplication with logarithmic mathematics: in the logarithmic domain, multiplying two numbers reduces to adding their logarithms, and adders are cheaper to build in hardware than multipliers. That shift, the company says, drives better performance per watt across frontier AI models.

Tensordyne also designed its own memory architecture. Each processor pairs fast SRAM with HBM to minimize idle compute cycles, a common bottleneck when running the largest models. A proprietary scale-up fabric called TDN Link provides sub-microsecond communication between processors, again with the aim of keeping compute units busy rather than waiting on data.
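To make the multiplication-to-addition idea concrete, here is a minimal Python sketch of a generic logarithmic number system. It is an illustration only: Tensordyne has not published the details of its number format, and the function names (to_lns, lns_multiply, from_lns) are hypothetical. Each value is stored as a sign bit plus the base-2 logarithm of its magnitude, so a multiply becomes one XOR and one addition.

```python
import math


def to_lns(x: float) -> tuple[int, float]:
    """Encode a nonzero real as (sign_bit, log2 of its magnitude)."""
    if x == 0:
        # Zero has no finite logarithm; real designs handle it as a special case.
        raise ValueError("zero cannot be encoded in this simple sketch")
    sign_bit = 1 if x < 0 else 0
    return sign_bit, math.log2(abs(x))


def lns_multiply(a: tuple[int, float], b: tuple[int, float]) -> tuple[int, float]:
    """Multiply two encoded values: XOR the sign bits, add the logarithms."""
    return a[0] ^ b[0], a[1] + b[1]


def from_lns(v: tuple[int, float]) -> float:
    """Decode (sign_bit, exponent) back to an ordinary float."""
    sign_bit, exponent = v
    magnitude = 2.0 ** exponent
    return -magnitude if sign_bit else magnitude


# 6 * -7 computed with no multiplication between the two operands.
product = from_lns(lns_multiply(to_lns(6.0), to_lns(-7.0)))
print(product)  # approximately -42.0, subject to floating-point rounding
```

The trade-off in any scheme like this is that addition of two encoded values is no longer trivial and usually needs an approximation or lookup step. The article does not say how Napier handles that part.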

Napier vs. Blackwell and Rubin: the numbers

The processor contains 138 billion transistors, 144 GB of HBM3E memory, and 256 MB of SRAM, and it delivers 2.1 petaflops of peak AI compute in dense FP8 format.

Its thermal design power is 300 watts.

Those specs alone don’t tell the story, but Tensordyne’s system-level comparisons do. The company packages 72 of the processors into a TDN72 Inference Pod, similar to Nvidia’s NVL72 racks. A full rack combines four such pods, 288 chips in total, for 608 petaflops of FP8 compute, 42 TB of HBM3E memory, and a rated power of 120 kW. That rack is air-cooled, which Tensordyne emphasizes as a cost advantage over liquid-cooled alternatives.
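As a quick sanity check, the rack-level figures can be rebuilt from the per-chip numbers quoted above. The short Python sketch below multiplies them out; the variable names are illustrative, and the only inputs are the figures stated in this article.

```python
# Per-chip and packaging figures as quoted in the article.
CHIPS_PER_POD = 72
PODS_PER_RACK = 4
PFLOPS_PER_CHIP = 2.1      # peak dense FP8
HBM_GB_PER_CHIP = 144
TDP_W_PER_CHIP = 300

chips = CHIPS_PER_POD * PODS_PER_RACK              # 288 chips per rack
rack_pflops = chips * PFLOPS_PER_CHIP              # about 604.8 petaflops
rack_hbm_tb = chips * HBM_GB_PER_CHIP / 1000       # about 41.5 TB (decimal)
chip_power_kw = chips * TDP_W_PER_CHIP / 1000      # about 86.4 kW for the chips alone

print(chips, round(rack_pflops, 1), round(rack_hbm_tb, 1), round(chip_power_kw, 1))
```

The results line up with the quoted rack figures to within rounding: roughly 605 petaflops against the stated 608, and about 41.5 TB against the stated 42 TB. The chips alone account for about 86 kW of the 120 kW rated power. The article does not break down the remainder, which presumably covers networking, power conversion, fans, and other rack components.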

Compared to Nvidia’s Blackwell, Tensordyne claims the Napier rack delivers 17 times more tokens per watt and 13 times more tokens per second. It also claims up to $33 million more annual revenue per rack, presumably from faster inference throughput that allows more queries to be processed in the same time. Against Nvidia’s future Rubin platform, Tensordyne says a single Napier rack can handle multi-trillion parameter models at 1,000 tokens per second per user. To match that performance, the company says, Nvidia would need nine Rubin racks combined with Groq LPX racks.
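The two Blackwell ratios also imply something about relative power draw. Since tokens per watt is tokens per second divided by watts, dividing the throughput ratio by the efficiency ratio gives the implied power ratio. The sketch below assumes both claims refer to the same workload and the same rack-to-rack comparison, which the article does not confirm.

```python
# Claimed Napier-vs-Blackwell ratios from the article.
tokens_per_second_ratio = 13.0
tokens_per_watt_ratio = 17.0

# tokens/W = (tokens/s) / W, so W_napier / W_blackwell = 13 / 17.
# Assumption: both ratios were measured on the same workload and rack basis.
implied_power_ratio = tokens_per_second_ratio / tokens_per_watt_ratio

print(f"{implied_power_ratio:.2f}")  # 0.76
```

Under that assumption, the Napier rack would be drawing roughly three quarters of the power of the Blackwell system it was compared against.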

Why power and cooling costs matter

Current AI infrastructure is heavily constrained by power consumption.

Tensordyne points out that power and cooling make up about half the cost of major AI deployments. Solutions such as 800V DC power distribution exist but come with large deployment costs. By reducing total power draw through more efficient processing, the Napier platform aims to lower those infrastructure expenses directly — without requiring exotic power systems.

There is, of course, the question of whether these claims hold up in real-world third-party testing. Tensordyne has not yet published independent benchmarks, and the tape-out milestone means the chip is only now entering production.

Beta deployments will be the first real test.

A bet on inference, not training

The company is focusing on AI inference rather than training.

That’s a narrower target than Nvidia’s full stack, but it’s also where much of the industry’s cost pressure sits as models move from research into production. Inference workloads are typically more sensitive to latency and power efficiency than training runs, which can tolerate more overhead.

If Tensordyne’s system delivers on its stated metrics, and if Broadcom and HPE Juniper can help scale the manufacturing and integration, it could force Nvidia to respond more aggressively on efficiency, not just raw compute. The next few months will show whether the market bears out the company’s forecast of more than $200 million in demand.