2026-06-13

DiffusionGemma, on the metal

mlx
diffusion
local-inference
apple
verification
contribution

Google shipped DiffusionGemma — a 26B-A4B Gemma that doesn't write left to right. It denoises a 256-token canvas all at once, refining noise into text over a few dozen reverse steps. mlx-vlm (Prince Canuma's work) already ran the multimodal version on Apple Silicon. But the text-serving home — Apple's own mlx-lm — had an open issue, #1391, and no code. An open lane.

So we filled it. In one sitting, recon to running.

The build

The architecture is a fork of the Gemma 4 stack with diffusion deltas: an encoder that prefills the prompt into a KV cache, a bidirectional decoder that sees the prompt by concatenating that cache with the canvas, a self-conditioning MLP that feeds the previous step's logits back in, and an MoE feed-forward that sums a dense path with a routed one. We mirrored mlx-vlm's working implementation block for block, so the existing mlx-community weight conversions load with zero key mismatches: no reconversion, no friction.
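To make the "dense path plus routed path" idea concrete, here is a minimal pure-Python sketch for a single token vector. It is not the port's code: the function names are hypothetical, the experts are stand-in callables, and renormalising the router weights with a softmax over the chosen experts is an assumption that the real router may not share.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def moe_feed_forward(x, dense_fn, experts, router_logits, k):
    """Sum a dense path with a top-k routed path for one token vector x.

    Assumption: routing weights are a softmax over the selected logits only.
    """
    chosen = top_k_indices(router_logits, k)
    weights = softmax([router_logits[i] for i in chosen])

    # Routed path: weighted sum of the selected experts' outputs.
    routed = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        expert_out = experts[i](x)
        routed = [r + w * o for r, o in zip(routed, expert_out)]

    # Dense path runs for every token, then the two paths are added.
    dense = dense_fn(x)
    return [d + r for d, r in zip(dense, routed)]
```

The point of the sketch is the last line: every token pays for the dense path, and only the k selected experts contribute to the routed one.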

The single scariest unknown going in — does the encoder-decoder cache even fit mlx-lm's assumptions? — answered itself: it runs on the stock KVCache / RotatingKVCache. The architecture was sound.

The part that matters: we didn't trust it

A model that runs is not a model that's right. Shape checks pass on garbage. So before believing anything, two gates.

First, numbers. We loaded the same weights into our port and into transformers and compared the riskiest custom math: the mixture-of-experts router and its experts. Router top-k indices: exact. Experts: max absolute difference around 1e-9, which is floating-point rounding noise rather than a logic error. The hard part was faithful.
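The shape of that numeric gate, sketched in plain Python. The real comparison ran on framework tensors; here the outputs are assumed to be already flattened into lists, the helper names are hypothetical, and the tolerance is an illustrative default rather than the threshold we used.

```python
def topk_sets_match(ours, ref):
    """True if every token selects the same set of experts in both implementations.

    Assumption: order within a token's top-k is ignored.
    """
    return len(ours) == len(ref) and all(
        sorted(a) == sorted(b) for a, b in zip(ours, ref)
    )


def max_abs_diff(ours, ref):
    """Largest element-wise absolute difference between two flat lists of floats."""
    if len(ours) != len(ref):
        raise ValueError("outputs differ in length")
    return max(abs(a - b) for a, b in zip(ours, ref))


def parity_gate(ours_idx, ref_idx, ours_out, ref_out, atol=1e-6):
    """Combine both checks: exact routing and expert outputs within atol."""
    router_exact = topk_sets_match(ours_idx, ref_idx)
    diff = max_abs_diff(ours_out, ref_out)
    return {
        "router_exact": router_exact,
        "max_abs_diff": diff,
        "passed": router_exact and diff <= atol,
    }
```

Routing is compared exactly because a single swapped expert index changes the output wholesale, while the expert outputs only need to agree to within rounding.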

Second, adversaries. We pointed twelve agents at the port, each reviewing one slice against the reference, with every finding then cross-examined by a skeptic told to refute it. The review earned its keep: it caught two bugs that no shape check could. The sampler was committing the renoised canvas instead of the argmax, meaning any position that hadn't converged would have come out as a uniformly random token. Garbage, invisible to a smoke test. And self-conditioning was carrying raw logits where the reference carries temperature-scaled ones. Both fixed. Without those fixes, the sentences below would have been noise.
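A toy version of the two fixes, in plain Python. This is a sketch under assumptions, not the port's sampler: the `converged` mask, the uniform renoising, and the function names are hypothetical, and dividing logits by the temperature is the usual meaning of temperature scaling but is assumed here rather than taken from the reference.

```python
import random


def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def reverse_step(logits, converged, vocab_size, rng):
    """One reverse step over the canvas.

    Returns (committed, next_input):
      committed  - the argmax token at every position; this is what gets emitted.
      next_input - same tokens, but unconverged positions are replaced with
                   uniformly random tokens to feed the next denoising step.
    The bug was emitting next_input where committed belongs.
    """
    committed = [argmax(row) for row in logits]
    next_input = [
        tok if done else rng.randrange(vocab_size)
        for tok, done in zip(committed, converged)
    ]
    return committed, next_input


def scale_for_self_conditioning(logits, temperature):
    """Temperature-scale logits before feeding them back (assumed: divide by T)."""
    return [[v / temperature for v in row] for row in logits]
```

Both canvases have the same shape and dtype, which is why a shape check or smoke test cannot tell them apart.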

That review is the same adversarial-confirmation shape we built into the local squad. Verify with a crew; don't trust one pass.

On the metal

Then we pulled the real weights and ran them — through the standard mlx_lm.generate:

"Why is the sky blue?" → "The sky is blue because Earth's atmosphere scatters shorter, blue wavelengths of sunlight more effectively than other colors through a process called Rayleigh scattering."

Correct on prose, correct on code (a clean is_prime with the 6k±1 trick). The 4-bit 26B denoises a full canvas at ~250 tok/s on a laptop; the full 51.6 GB bf16 runs at ~121 tok/s. Block diffusion turns generation from a memory-bandwidth crawl into a parallel sprint, and Apple Silicon is exactly the machine that benefits. As far as we know, nobody else ships those numbers, because nobody else is measuring on the Mac.

It's a draft PR now — single-canvas, with the long-form outer loop and tests still to land, and a few shape questions open for the maintainers. But the thing the issue asked for exists, and it runs the real model, and we can prove every claim. That's the bar.