May 22, 2026 · Research · 12 min read

Pushing Four Raspberry Pis to the Limit with distributed-llama

How we ran a 30B-parameter Mixture-of-Experts model (Qwen3-30B-A3B) on four CPU-only Raspberry Pi 5 boards at 15.143 tok/s decode — bit-exact and +16.1% over the public record — and hit the LPDDR4X memory wall.

Daniel Correa

A 30-billion-parameter Mixture-of-Experts model (Qwen3-30B-A3B) runs on four Raspberry Pi 5 boards — CPU-only hardware, no GPU, no NPU — at 15.143 tok/s decode: bit-exact and +16.1% above the highest publicly documented result for this model and hardware class (13.04 tok/s, b4rtaz #255). This is a condensed technical write-up of our report; the full paper — every table, figure and raw log — is one click away.

Results at a glance

| Metric | Value |
| --- | --- |
| Decode throughput (vs the public ceiling) | 15.143 tok/s |
| Improvement over b4rtaz #255 (13.04 tok/s, decode) | +16.1% |
| Sustained serving throughput (prefill included) | 14.449 tok/s |
| Time-to-first-token (TTFT) | 557 ms |
| Per-node DRAM bandwidth (sustained / vendor ceiling) | 11.4 / 17 GB/s |
| Hardware | 4× Pi 5 16 GB |
| Constraints honoured | bit-exact, no overclock, no model change |

Same model and hardware class (Qwen3-30B-A3B Q40, 4× Raspberry Pi 5), decode metric — a bit-exact +16.1% improvement (n=20).

Why this is hard

Edge inference of LLMs on consumer single-board computers is increasingly studied as a privacy-preserving, low-cost alternative to the cloud. The Raspberry Pi 5 is the most widely deployed ARM SBC with enough RAM (16 GB) to host quantised models, and its Gigabit Ethernet lets small clusters be formed. But the Pi 5 has no usable accelerator — the VideoCore VII GPU lacks general compute shaders — so inference runs on the CPU, where decode is memory-bandwidth-bound.

We adopt distributed-llama v0.16.5 with twelve source-level changes developed in this work (eight framework patches plus four bit-exact kernel and op-fusion optimisations), plus persistent runtime kernel tuning.

Five hard constraints

Every optimisation obeys five design rules. They exclude most of the lossy speedups in the literature, which is exactly why the changes that survive are production-deployable with no risk of quality regression:

  • Bit-exact output — SHA-256 of the first 100 generated token-ids matches a fixed reference (seed=42, temperature=0).
  • No kernel rebuild — stock Pi OS kernel 6.12.75.
  • No model change — Qwen3-30B-A3B Q40 is fixed.
  • No CPU overclock — silicon stays at the nominal 2.4 GHz.
  • No quality reduction — top-k unchanged at 8, no down-quantisation, no expert pruning.
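
The bit-exact rule lends itself to a mechanical check. Below is a minimal sketch, assuming each token ID is packed as a little-endian unsigned 32-bit integer before hashing; the write-up does not state the byte layout behind its reference hash, and the names `token_digest` and `is_bit_exact` are hypothetical.

```python
import hashlib
import struct


def token_digest(token_ids, limit=100):
    """SHA-256 hex digest of the first `limit` token IDs.

    Assumption: each ID is packed as a little-endian uint32. The report
    does not say which serialisation its reference hash uses.
    """
    head = list(token_ids)[:limit]
    payload = struct.pack(f"<{len(head)}I", *head)
    return hashlib.sha256(payload).hexdigest()


def is_bit_exact(candidate_ids, reference_digest, limit=100):
    """True when the candidate run reproduces the reference digest."""
    return token_digest(candidate_ids, limit) == reference_digest


if __name__ == "__main__":
    reference = [11, 42, 7, 1999] * 30        # made-up token IDs
    digest = token_digest(reference)
    print(is_bit_exact(reference, digest))     # identical sequence
    changed = [12] + reference[1:]
    print(is_bit_exact(changed, digest))       # first ID differs
```

Only the first 100 IDs enter the digest, so two runs that diverge later would still compare equal under this check.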

System design

Per node: Broadcom BCM2712 (4× Cortex-A76 @ 2.4 GHz, ARMv8.2-A); 16 GB LPDDR4X @ 4267 MT/s (≈17 GB/s theoretical); NVMe via PCIe Gen 2; integrated Gigabit Ethernet (measured intra-cluster latency 0.226 ms); Debian 13, kernel 6.12.75 aarch64; no usable accelerator. The cluster is one root coordinator (also serving the HTTP API) plus three workers, full-duplex Gigabit Ethernet, with tensor-parallel synchronisation per transformer layer.

Software: distributed-llama v0.16.5 + our patches; model Qwen3-30B-A3B Q40 (30B total, 128 experts, top-k = 8, ~3B active/token); jemalloc 2 via LD_PRELOAD; a small Python proxy that sanitises OpenAI-strict client requests.

Methodology

We follow MLPerf Inference v5.1 conventions: two warm-up runs (discarded), n = 20 measurement runs, fixed prompt, temperature = 0, single client. Statistics are reported as mean, median, standard deviation, 95% CI and p50/p90/p99. Unless stated otherwise, every improvement claim is validated bit-exact by hashing the generated token-id sequence with SHA-256 and comparing it against a fixed reference.
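
The summary statistics listed above can be computed with the standard library alone. This is a minimal sketch rather than the report's own tooling: the function name `summarise` is hypothetical, the confidence interval assumes a Student-t interval on the mean, and the sample values in the demo are synthetic.

```python
import math
import statistics


def summarise(samples, t_crit=2.093):
    """Summary statistics for one benchmark configuration.

    t_crit is the two-sided 95% Student-t critical value; 2.093 is the
    value for 19 degrees of freedom (n = 20). Pass another value for a
    different sample size.
    """
    xs = sorted(samples)
    n = len(xs)
    mean = statistics.fmean(xs)
    stdev = statistics.stdev(xs)
    half_width = t_crit * stdev / math.sqrt(n)
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(xs, n=100, method="inclusive")
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(xs),
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
        "cv_percent": 100.0 * stdev / mean,
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    # Synthetic tok/s values, for illustration only.
    runs = [12.70 + 0.005 * (i % 7) for i in range(20)]
    for key, value in summarise(runs).items():
        print(key, value)
```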

The optimisation trajectory

Optimisation trajectory and final clean-room result. Bars 1–8 report sustained throughput; the final green bar is the Stage-17 clean-room #255 replication on the decode metric. Dashed line: the public ceiling.

The change with the largest effect: dense → MoE

The change with the largest single effect was migrating from Llama 3.1 8B (dense, ~5 GB Q40 weights, all parameters active per token) to Qwen3-30B-A3B (MoE, 128 experts, top-k = 8, ~3 GB active per token). Bandwidth-bound throughput improved 59% (7.18 → 11.4 tok/s) with no other change. MoE is sparse activation: for top-8-of-128 the activation ratio is 8/128 = 6.25% of expert weights plus shared parameters — and the effective bandwidth multiplier matches the observed speedup almost exactly.
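
The arithmetic behind this comparison is short enough to write out. The sketch below uses only the rounded figures quoted above (about 5 GB read per token for the dense model, about 3 GB for the MoE model), so the ratio it prints is indicative rather than exact; the function names are hypothetical.

```python
def activation_ratio(top_k, n_experts):
    """Fraction of expert weights touched per token."""
    return top_k / n_experts


def speedup(before, after):
    """Multiplicative speedup between two throughput figures."""
    return after / before


if __name__ == "__main__":
    print(f"expert activation ratio: {activation_ratio(8, 128):.4f}")
    # Approximate bytes read per token, taken from the text above.
    dense_gb, moe_gb = 5.0, 3.0
    print(f"bandwidth multiplier:    {dense_gb / moe_gb:.2f}x")
    print(f"observed speedup:        {speedup(7.18, 11.4):.2f}x")
```

With these rounded inputs the weight-traffic ratio comes out near 1.67 and the measured throughput ratio near 1.59.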

Eight framework patches

The patches fix critical bugs and unlock optimisation flags. Highlights:

| # | Patch | Why it mattered |
| --- | --- | --- |
| 1 | NnByte → NnUint for nBatches | uint8_t overflow: nbatches=256 silently became 0 (256 mod 256), tripping an embedding-layer assertion. |
| 2 | Force finish_reason = stop/length | Empty finish_reason sent strict OpenAI clients into infinite retry loops. |
| 3 | try/catch around json::parse | Malformed bodies threw uncaught exceptions, abort-trapping the daemon (SIGABRT). |
| 6 | posix_memalign(64, n) for pipes | ARM64 default new[] is 16 B aligned; cache-line (64 B) alignment is required for vectorised NEON. |
| 7 | TCP SO_RCVBUF/SO_SNDBUF = 8 MB | Default 208 KiB buffers caused write-blocking under burst sync at end-of-layer. |
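
The overflow behind patch 1 is easy to reproduce outside the framework. The sketch below uses Python's `ctypes` fixed-width integers to mimic the storage types; treating `NnUint` as a 32-bit unsigned integer is an assumption, since the table only identifies the narrow type as `uint8_t`.

```python
import ctypes


def as_uint8(value):
    """Value after storage in an unsigned 8-bit integer (wraps mod 256)."""
    return ctypes.c_uint8(value).value


def as_uint32(value):
    """Value after storage in an unsigned 32-bit integer (assumed NnUint width)."""
    return ctypes.c_uint32(value).value


if __name__ == "__main__":
    for n_batches in (255, 256, 257):
        print(n_batches, as_uint8(n_batches), as_uint32(n_batches))
```

A batch count of 256 stored in eight bits reads back as 0, which is the value that tripped the assertion described in the table.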

Two counter-intuitive bit-exact changes

Both run against received wisdom on the same code base:

  1. Removing the software prefetch (Stage 12, +1.05%). The NEON+dotprod matmul inner loop carried two __builtin_prefetch calls. Across five swept variants, removing both was the only one that helped: the Cortex-A76's hardware prefetcher (stride detection on sequential weight access) outperforms the manual hints, which compete for issue slots in the 4-wide decoder.

  2. A true single-pass SILU·MUL fusion (Stage 13, +1.5%). The earlier fusion was a two-call wrapper that kept the intermediate resident in memory (3 loads + 2 stores per element). We rewrote it as one NEON loop holding values in registers — 2 loads + 1 store per element — bit-exactness preserved (identical arithmetic; only the intermediate writeback is removed).

Before (two-pass) vs after (single-pass): the SILU·MUL fusion drops the intermediate store/load between the two loops — 2 loads + 1 store per iteration instead of 3 + 2, bit-exact.
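
The difference between the two variants can be modelled in a few lines. This is a plain-Python illustration of the memory traffic, not the NEON kernel: it uses double-precision floats, and the argument names `gate` and `up` are assumptions. It shows why removing the intermediate buffer leaves the result unchanged bit for bit when the arithmetic is identical.

```python
import math
import struct


def silu(x):
    """SiLU activation: x * sigmoid(x). Intended for moderate |x| only."""
    return x / (1.0 + math.exp(-x))


def silu_mul_two_pass(gate, up):
    """Two loops with an intermediate buffer that is written and re-read."""
    tmp = [0.0] * len(gate)
    for i in range(len(gate)):      # load gate[i], store tmp[i]
        tmp[i] = silu(gate[i])
    out = [0.0] * len(gate)
    for i in range(len(gate)):      # load tmp[i], load up[i], store out[i]
        out[i] = tmp[i] * up[i]
    return out


def silu_mul_fused(gate, up):
    """One loop; the activation stays in a local value, never in a buffer."""
    out = [0.0] * len(gate)
    for i in range(len(gate)):      # load gate[i], load up[i], store out[i]
        out[i] = silu(gate[i]) * up[i]
    return out


def same_bits(a, b):
    """Compare two float lists by their IEEE-754 bit patterns."""
    return struct.pack(f"<{len(a)}d", *a) == struct.pack(f"<{len(b)}d", *b)


if __name__ == "__main__":
    gate = [-3.0, -0.5, 0.0, 0.25, 2.0, 7.5]
    up = [1.5, -2.0, 3.0, 0.1, -0.75, 4.0]
    print(same_bits(silu_mul_two_pass(gate, up), silu_mul_fused(gate, up)))
```

The comments on each loop count the per-element traffic: 3 loads and 2 stores for the two-pass version against 2 loads and 1 store for the fused one.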

Network chunk-size and runtime NIC tuning (Stages 14–15)

The writeMany/readMany all-reduce loop capped each send()/recv() syscall at 4 KB. With the larger TCP buffers from Stage 9 that is 4× more syscalls than needed; widening to 16 KB cut the count fourfold (+1.25%). A runtime NIC bundle (enlarged RX ring, Receive Flow Steering off CPU 0, deferred NAPI), persisted as a systemd unit, added +1.99% combined. Both are bit-exact — only NIC scheduling changes, no model arithmetic.
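
The effect of the chunk cap on syscall count can be sketched with a small send loop. The function below is a hypothetical stand-in for the framework's `writeMany`, not its actual code, and the burst size of 510,000 bytes is an assumption based on the all-reduce burst size mentioned elsewhere in this write-up.

```python
def write_many(send, data, chunk_size):
    """Send `data` in slices of at most `chunk_size` bytes.

    `send` is any callable that takes a bytes-like object and returns how
    many bytes were accepted (the contract of socket.send). Returns the
    number of calls that were needed.
    """
    view = memoryview(data)
    offset = 0
    calls = 0
    while offset < len(view):
        sent = send(view[offset:offset + chunk_size])
        if sent <= 0:
            raise ConnectionError("peer stopped accepting data")
        offset += sent          # handles partial writes
        calls += 1
    return calls


if __name__ == "__main__":
    burst = bytes(510_000)      # assumed size of one all-reduce burst
    accept_all = len            # stand-in for socket.send that takes everything
    for chunk in (4096, 16384):
        print(chunk, write_many(accept_all, burst, chunk))
```

For a burst of that size, the 4 KB cap needs 125 calls and the 16 KB cap needs 32, roughly the fourfold reduction described above.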

Tuning the host Linux mattered as much as the code

It is tempting to credit a result like this to the inference engine. On identical 16 GB silicon, most of the deployable speedup came from re-tuning the host operating system as aggressively as the application — and every lever below holds the bit-exact constraint, so none costs output quality:

  • SDRAM_BANKLOW=1 in the bootloader EEPROM — a firmware memory-interleave change that lifted effective read bandwidth from 8.3 to 12.5 GB/s per node. +5.9%, bit-exact, no overclock.
  • --nthreads 3 instead of 4, freeing one core, with the Ethernet IRQ and Receive Packet Steering pinned to that freed core (CPU 3), plus jemalloc (tuned arena/tcache). Together +10.3%.
  • performance CPU governor at 2.4 GHz, and mlock-ed weights (~4.45 GB locked/worker) so the model never pages out.
  • Network / VM stack (Stage 9, TIER 0): TCP BBR, enlarged SO_RCVBUF/SO_SNDBUF, GRO disabled (it adds 50–200 µs of pure overhead to the 510 kB all-reduce bursts on 1 GbE), busy_poll, vm.swappiness=1. +2.12%.
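
Of the levers above, the socket buffer sizes are the easiest to show in isolation. The sketch below requests 8 MB buffers, the size named in the framework patch list, and reads back what the kernel granted. It is an illustration in Python, not the project's C++ code, and `enlarge_buffers` is a hypothetical name.

```python
import socket

EIGHT_MIB = 8 * 1024 * 1024


def enlarge_buffers(sock, size=EIGHT_MIB):
    """Request larger kernel socket buffers and report what was granted.

    The kernel may clamp the request (on Linux to net.core.rmem_max and
    net.core.wmem_max), and Linux reports double the requested value, so
    the effective sizes are read back rather than assumed.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
    )


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        rcv, snd = enlarge_buffers(s)
        print(f"granted: rcv={rcv} bytes, snd={snd} bytes")
```

If the granted sizes come back far below the request, the system-wide limits are the likely cause and would need raising before the per-socket setting can take effect.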

The whole state is frozen into systemd units, so a cold cluster boots straight into the optimised configuration — no manual steps, fully reproducible. The point is general: on a memory-bandwidth-bound workload the operating system is not a neutral substrate. Firmware memory interleaving, core/IRQ placement, the allocator, page residency and the network stack each move the needle by single-digit percentages that compound — and on the same hardware they add up to more than the source-level code changes do.

Results: throughput and stability

Run-to-run variance is low — a coefficient of variation of 0.52%. The distribution below is the n = 20 sample at an intermediate Stage-6 snapshot (mean 12.708 tok/s); the final configuration reaches 14.449 tok/s (best run 14.557, 95% CI ±0.038), with the later stages raising throughput without changing this latency behaviour.

Per-run throughput across 20 measurement runs (Stage 6 snapshot, warm-ups excluded). Coefficient of variation 0.52%.

Time-to-first-token is 557 ms mean (p50 545 ms, max 638 ms). Throughput scales with response length up to an asymptote — shorter responses are TTFT-dominated:

Sustained throughput vs response length (log x). Converges near 12.86 tok/s for responses ≥ 400 tokens.
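
The shape of this curve follows from a simple model, sketched below under the assumption that total time is the time-to-first-token plus the response length divided by the decode rate. The model and the name `sustained_throughput` are illustrative and not taken from the paper; the demo uses the final-configuration figures and is not meant to reproduce the plotted curve.

```python
def sustained_throughput(n_tokens, ttft_s, decode_tok_s):
    """End-to-end tokens per second for a response of n_tokens.

    Simplified model: total time = TTFT + n_tokens / decode rate.
    """
    return n_tokens / (ttft_s + n_tokens / decode_tok_s)


if __name__ == "__main__":
    ttft_s, decode = 0.557, 15.143      # figures quoted in this write-up
    for n in (10, 50, 100, 400, 2000):
        print(n, round(sustained_throughput(n, ttft_s, decode), 3))
```

Under this model the fixed TTFT cost dominates short responses, and throughput approaches the decode rate from below as the response grows.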

Prefill is the practical limit. Prefill rate stays above 15 tok/s up to 2K-token prompts, but for 20K-token prompts (large agent system prompts) the projected prefill time exceeds 20 minutes — the upper bound of usability for full agent workloads here.

Under sustained load all nodes sit at 54–56 °C with zero throttling (threshold 85 °C). The root carries extra orchestration/serving buffers; workers keep >9 GB of headroom for KV-cache growth.

Memory utilisation per node during sustained inference.

Where the time goes: the memory wall

ARM PMU profiling (perf stat, 60 s of sustained inference) pins the bottleneck: 49% backend-stalled cycles, 11.4 of ~17 GB/s DRAM per node, IPC 1.74. The sub-0.1% TLB and branch-mispredict rates confirm the 16 KB-page setup is already optimal; objdump shows 322 udot/sdot NEON dot-products in the inner loop — already at the ARMv8.2-A limit. The cores sit idle waiting on DRAM, which is why every compute-side experiment returns zero gain.
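
A back-of-the-envelope ceiling shows why the measured bandwidth matters so much. The sketch below assumes that the roughly 3 GB of weights read per token is split evenly across the four nodes under tensor parallelism and that decode is limited purely by DRAM reads; both are simplifications introduced here, not claims from the paper.

```python
def decode_ceiling(bandwidth_gb_s, active_gb_per_token, n_nodes):
    """Upper bound on tok/s if each node only streams its share of weights."""
    per_node_gb = active_gb_per_token / n_nodes
    return bandwidth_gb_s / per_node_gb


if __name__ == "__main__":
    for label, bw in (("sustained", 11.4), ("vendor ceiling", 17.0)):
        print(label, round(decode_ceiling(bw, 3.0, 4), 1), "tok/s")
```

With the sustained 11.4 GB/s per node this estimate lands at about 15.2 tok/s. The inputs are rounded, so the figure is a consistency check at best.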

Per-token wall-clock breakdown. The Q40 matmul dominates; the synchronisation barrier is the only software-addressable slack.

The ceiling is physical. On a log scale, the Pi 5's LPDDR4X is ~16× below an Apple M4 Pro and ~200× below an H100:

Cross-platform memory bandwidth (log scale). This is the physical ceiling the cluster reaches.

The telemetry finding

During the final clean-room sweep we found two background monitoring agents running on all four nodes (the "idle background workload" of our earlier methodology). On a memory-bandwidth-bound workload these are not free: they periodically walk system memory statistics, consuming DRAM bandwidth the decode phase needs.

| Background | n | Decode (tok/s) | Prefill (tok/s) |
| --- | --- | --- | --- |
| Monitoring ON | 10 | 14.397 ± 0.153 | 18.74 |
| Monitoring OFF (clean-room) | 20 | 15.143 ± 0.097 | 18.81 |

Stopping the agents raised decode by +5.18% (the CIs do not overlap) while leaving prefill essentially unchanged (18.74 vs 18.81 tok/s). The dissociation is informative: decode is bandwidth-bound, and the compute-bound prefill acts as the negative control. It is also a practical lesson: co-located observability silently taxes memory-bound inference, and a monitored production node under-performs a clean-room benchmark by several percent.
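
The two claims in this comparison, the percentage gain and the non-overlapping intervals, can be checked directly from the table. The sketch assumes the ± values are 95% confidence half-widths, which the table does not state explicitly; the helper names are hypothetical.

```python
def interval(mean, half_width):
    """Closed interval centred on `mean`."""
    return (mean - half_width, mean + half_width)


def overlaps(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def percent_change(before, after):
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    monitoring_on = interval(14.397, 0.153)
    monitoring_off = interval(15.143, 0.097)
    print("decode intervals overlap:", overlaps(monitoring_on, monitoring_off))
    print("decode change  (%):", round(percent_change(14.397, 15.143), 2))
    print("prefill change (%):", round(percent_change(18.74, 18.81), 2))
```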

Stage 16: a WFE/SEV barrier

The inter-step barrier busy-spun on an atomic with an ARM yield hint, so three waiting threads continuously re-read a cache line the advancing thread writes — coherency traffic that contends with the thread driving network I/O. We replaced the spin with the ARMv8 event mechanism: waiters issue wfe (low-power wait-for-event) and the advancing thread broadcasts sev. Signalling-only, so bit-exact by construction; it also lowers power and heat on the waiting cores.

Stage 16 WFE/SEV barrier vs yield-spin, same-session cold paired A/B (n=40/arm). +0.48%, bit-exact, Welch t=2.45, p=0.014. Error bars: 95% CI.
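
For readers who want to run the same kind of paired comparison, a Welch test needs only the standard library. The sketch below is not the paper's analysis script: it computes the t statistic and degrees of freedom, and converts t to a two-sided p-value with the normal approximation, which is an assumption that holds reasonably well for samples of this size.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def two_sided_p_normal(t):
    """Two-sided p-value from the normal approximation to Student's t.

    An exact t-distribution p-value is slightly larger at finite
    degrees of freedom.
    """
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    print(round(two_sided_p_normal(2.45), 3))
```

For t = 2.45 the approximation gives a p-value of about 0.014, in line with the figure in the caption; whether the paper used this approximation or an exact t-distribution is not stated.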

What did not work

In keeping with reproducibility norms, we document every intent-to-treat attempt. A selection of the 26 catalogued dead-ends:

| Configuration | Root cause of failure |
| --- | --- |
| Llama 3.3 70B Q40 on 4× Pi 5 | 38 GB weights force aggressive swap; 0.15 tok/s under thrashing. |
| EXO framework | Depends on Apple MLX (Metal + Neural Engine + UMA); does not build on ARM Linux. |
| prima.cpp | ZMQ topology discovery hangs >10 min on Pi 5; never bootstraps. |
| llama.cpp + RPC | 25× regression vs single-node (network-overhead-dominated pipeline parallelism). |
| ARM I8MM / SMMLA repack | The Cortex-A76 does not implement I8MM (grep -c i8mm /proc/cpuinfo = 0); it is an A78+ feature. |
| Transparent hugepages, jumbo frames, NUMA interleave, PGO | Tested and reverted; neutral or harmful on this DRAM-bound workload. |
| Software prefetch (tiered PLDL2KEEP+L1) | −0.47%; the HW prefetcher already wins (see Stage 12). |

The recurring reason for inapplicability is structural: the highest-leverage modern techniques need a newer kernel, a hardware feature the A76 lacks, or a reboot/rebuild we ruled out — which is precisely what makes the surviving bit-exact wins deployable on stock hardware.

Caveats and limitations

  • The +16.1% headline mixes effects — our optimised 16 GB cluster vs b4rtaz's vanilla 8 GB cluster (and --nthreads 3 vs 4). On identical 16 GB silicon, vanilla → our build → fully tuned is 13.15 → 13.88 → 15.143 (+15.2%), of which only +5.6% is attributable to source code; the rest is the model choice, the firmware flag, runtime configuration and a clean environment. The paper's Table 22 breaks this down lever by lever.
  • "Bit-exact" means SHA-256 vs our own canonical build (flags include -ffast-math), not strict IEEE-754. Energy (J/token) is the one standard edge metric not yet measured. All numbers are for one model, one site, one batch of silicon.
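
The percentages in the first caveat can be reproduced from the throughput figures it quotes. A minimal sketch, with a hypothetical helper name:

```python
def gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after / before - 1.0)


if __name__ == "__main__":
    vanilla, patched, tuned = 13.15, 13.88, 15.143   # tok/s, same 16 GB silicon
    public_record = 13.04                            # tok/s, b4rtaz #255
    print("source code only   :", round(gain_percent(vanilla, patched), 1))
    print("fully tuned        :", round(gain_percent(vanilla, tuned), 1))
    print("vs public record   :", round(gain_percent(public_record, tuned), 1))
```

The first two lines use the same-silicon baseline, while the third uses the public 8 GB result, which is why the headline figure is the larger one.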

Conclusion and future work

The headline number matters less than the method behind it. Most of the speedup came not from the model or from novel kernels, but from treating the operating system as a tunable part of the inference stack — firmware memory interleaving, core and IRQ placement, the allocator, page residency, and a network stack tuned for the low-latency, bursty all-reduce traffic between nodes. None of it touched the hardware, the clock, or the model's output quality. That is the transferable lesson: this class of OS- and network-level optimisation applies to any distributed CPU inference system, not only distributed-llama.

We have reached the memory wall the Pi 5 presents — the bytes moved per token are fixed by the model and its quantisation, and no further bit-exact reduction of that traffic is available. The remaining software lever, Async Tensor Parallelism, overlaps communication with computation rather than moving fewer bytes.

What's next. We plan to release a Linux image tuned for distributed inference — an OS pre-configured for low inter-node latency on dllama-style clusters — so these gains are available out of the box, without changing hardware, overclocking, or lowering model quality. We also plan to run the same OS- and network-level pass on other open-source models, to test how far the approach generalises beyond Qwen3-MoE.

Resources

  • Paper, code & raw data: github.com/hellomatik-org/distributed-llama (branch pi5-cluster, paper/).
  • The public reference we compare against: b4rtaz/distributed-llama discussion #255.
