Edge Inference Playground

Run and benchmark small language + vision models on cheap edge hardware before committing to a deployment — find out what is actually usable offline.

Exploring App (embedded static folder)

Next to-dos

  • Decide the "usable" latency budget — proposed < 1.5s first token
  • Add the Jetson Orin Nano to the matrix and re-run the suite
  • Add a vision model (Moondream2 vs quantized LLaVA) — pick one
  • Log wall-power with a USB meter, not just thermals
  • Build the side-by-side run comparison dashboard

Recent activity

  • Created project · 4 hours ago
  • To-do added — Stand up the benchmark CLI + CSV logging · 4 hours ago
  • To-do added — Get llama.cpp building on the Pi 5 · 4 hours ago
  • To-do added — Decide the "usable" latency budget — proposed < 1.5s first token · 4 hours ago
  • To-do added — Add the Jetson Orin Nano to the matrix and re-run the suite · 4 hours ago
  • To-do added — Add a vision model (Moondream2 vs quantized LLaVA) — pick one · 4 hours ago
  • To-do added — Log wall-power with a USB meter, not just thermals · 4 hours ago
  • To-do added — Build the side-by-side run comparison dashboard · 4 hours ago

Design doc

Edge Inference Playground — design doc

What it is: A test bench for running small AI models on inexpensive edge hardware (Raspberry Pi 5, Jetson Orin Nano, an old mini-PC) and measuring whether they're actually usable offline (first-token latency, tokens/sec, memory, thermals, and power draw) before betting a real product on them.

The problem it solves

"Can a 3B model run on a Pi?" has a different answer every month. Instead of guessing from blog posts, this gives one repeatable harness: same prompts, same metrics, across devices and quantizations.

Approach

  • One CLI that loads a model (llama.cpp / Ollama), runs a fixed prompt suite, and logs metrics to CSV.
  • A small web dashboard to compare runs side by side.
  • Start text-only; add a vision model later for "describe what the camera sees" experiments.
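The CLI loop described above could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: `PROMPT_SUITE`, `fake_backend`, `run_suite`, and the CSV column names are all hypothetical, and the backend is a stand-in that echoes the prompt word by word so the example runs without llama.cpp or Ollama installed. A real backend would be any callable that streams tokens for a prompt.

```python
import csv
import io
import time
from typing import Callable, Dict, Iterable, List, TextIO

# Hypothetical prompt suite; the real prompts are not specified in this doc.
PROMPT_SUITE: Dict[str, str] = {
    "short_qa": "What is the capital of France?",
    "summarize": "Summarize in one sentence: edge devices run models locally.",
}

# Hypothetical CSV schema: one row per (run, prompt).
FIELDS = ["device", "model", "quant", "prompt_id",
          "first_token_s", "tokens_per_s", "n_tokens"]


def fake_backend(prompt: str) -> Iterable[str]:
    """Stand-in for a streaming model call: echoes the prompt word by word."""
    yield from prompt.split()


def run_suite(generate: Callable[[str], Iterable[str]], device: str, model: str,
              quant: str, out: TextIO,
              clock: Callable[[], float] = time.perf_counter) -> List[dict]:
    """Run every prompt once, write one CSV row per prompt, return the rows."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    rows = []
    for prompt_id, prompt in PROMPT_SUITE.items():
        start = clock()
        first_at = last_at = None
        n_tokens = 0
        for _token in generate(prompt):
            last_at = clock()
            if first_at is None:
                first_at = last_at  # timestamp of the first streamed token
            n_tokens += 1
        # Assumed definition: decode rate excludes the wait for the first token.
        decode_s = (last_at - first_at) if n_tokens > 1 else 0.0
        row = {
            "device": device, "model": model, "quant": quant,
            "prompt_id": prompt_id,
            "first_token_s": (first_at - start) if n_tokens else None,
            "tokens_per_s": (n_tokens - 1) / decode_s if decode_s > 0 else 0.0,
            "n_tokens": n_tokens,
        }
        writer.writerow(row)
        rows.append(row)
    return rows


if __name__ == "__main__":
    buf = io.StringIO()  # a real run would open a .csv file instead
    run_suite(fake_backend, "pi5", "demo-model", "Q4_K_M", buf)
    print(buf.getvalue())
```

Passing the clock in as a parameter keeps the timing logic testable without real hardware. Swapping `fake_backend` for a real streaming call should be the only change needed per runtime.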

Hardware matrix

Device            RAM    Notes
Raspberry Pi 5    8 GB   The baseline. Cheap, fanless-ish.
Jetson Orin Nano  8 GB   GPU acceleration: the interesting one.
Mini-PC (N100)    16 GB  x86 control group.
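One way to keep the matrix machine-readable is to encode it as data and expand it into a run plan, one entry per device and quantization. The sketch below assumes that approach. The `Device` class, `run_id` format, and `plan_runs` helper are hypothetical names. The device values come from the table above, and the quantization labels are copied as written under Stack.

```python
import re
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class Device:
    name: str
    ram_gb: int
    notes: str


# Values copied from the hardware matrix.
DEVICES = [
    Device("Raspberry Pi 5", 8, "baseline"),
    Device("Jetson Orin Nano", 8, "GPU acceleration"),
    Device("Mini-PC (N100)", 16, "x86 control group"),
]

# Quantization labels exactly as listed under Stack.
QUANTS = ["Q4_K_M", "Q5", "Q8"]


def run_id(device: Device, quant: str) -> str:
    """Build a filesystem-friendly identifier (hypothetical naming scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", device.name.lower()).strip("-")
    return f"{slug}__{quant}"


def plan_runs() -> List[Tuple[str, Device, str]]:
    """Expand the matrix into one planned run per (device, quantization)."""
    return [(run_id(d, q), d, q) for d, q in product(DEVICES, QUANTS)]


if __name__ == "__main__":
    for rid, device, quant in plan_runs():
        print(f"{rid:32s} {device.ram_gb:>2d} GB  {quant}")
```

Adding a device or quantization then becomes a one-line change, and every CSV row can carry the same identifier.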

Metrics tracked

First-token latency, sustained tokens/sec, peak RAM, package temperature after 5 min under load, and watts at the wall.
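Two of these metrics can be read from the OS on Linux without extra tooling. The sketch below assumes Linux sysfs and the Python standard library, and the function names `read_temp_c` and `peak_rss_mb` are hypothetical. Which thermal zone corresponds to the package differs per board, so the zone path is a parameter and `thermal_zone0` in the demo is only a guess. Wall power is not covered here, since it comes from an external meter.

```python
import resource
import sys
import tempfile
from pathlib import Path


def read_temp_c(zone_path: str) -> float:
    """Parse a sysfs thermal zone file, which reports millidegrees Celsius."""
    return int(Path(zone_path).read_text().strip()) / 1000.0


def peak_rss_mb(children: bool = False) -> float:
    """Peak resident set size in MB for this process, or for its waited-for
    child processes when the model runs as a subprocess."""
    who = resource.RUSAGE_CHILDREN if children else resource.RUSAGE_SELF
    maxrss = resource.getrusage(who).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return maxrss / divisor


if __name__ == "__main__":
    zone = Path("/sys/class/thermal/thermal_zone0/temp")  # assumed zone
    if zone.exists():
        print(f"temp: {read_temp_c(str(zone)):.1f} C")
    else:
        # No sysfs available: demonstrate the parser on a sample file.
        sample = Path(tempfile.mkdtemp()) / "temp"
        sample.write_text("51234\n")
        print(f"sample temp: {read_temp_c(str(sample)):.1f} C")
    print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

Peak RSS from `getrusage` only reflects memory the process itself touched. GPU memory on the Jetson would need a separate measurement.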

Stack

Python CLI · llama.cpp + Ollama · a tiny FastAPI dashboard · GGUF quantizations (Q4_K_M, Q5, Q8).

Open questions

  1. Target latency budget — what counts as "usable" for an interactive assistant? Proposed: < 1.5s first token.
  2. Is the Jetson worth the price/complexity over a Pi 5 for our use case?
  3. Which vision model to add first — Moondream2 vs. a quantized LLaVA?
  4. Do we need a cooling solution, or is throttling acceptable for bursty use?
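Question 1 also depends on which statistic the budget applies to: a run can pass on its median first-token latency and still fail on its tail. The sketch below shows one possible check using a nearest-rank percentile. The `percentile` and `is_usable` helpers are hypothetical, and the latencies in the demo are made-up illustrative numbers, not measurements.

```python
from typing import Sequence

BUDGET_S = 1.5  # proposed first-token budget from question 1


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("need at least one value")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, giving a 1-based rank.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


def is_usable(first_token_s: Sequence[float], pct: int = 95,
              budget_s: float = BUDGET_S) -> bool:
    """True if the chosen percentile of first-token latency beats the budget."""
    return percentile(first_token_s, pct) < budget_s


if __name__ == "__main__":
    # Illustrative numbers only, not real benchmark results.
    latencies = [0.8, 0.9, 1.0, 1.2, 2.0]
    print("median ok:", is_usable(latencies, pct=50))
    print("p95 ok:   ", is_usable(latencies, pct=95))
```

With these sample numbers the median passes and the 95th percentile fails, so the budget decision should name the percentile as well as the threshold.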

Decision log

  • 2026-05-30 — Standardize on GGUF + llama.cpp as the common runtime across devices.