Run and benchmark small language and vision models on cheap edge hardware before committing to a deployment, to find out what is actually usable offline.
Next to-dos
- Decide the "usable" latency budget — proposed < 1.5s first token
- Add the Jetson Orin Nano to the matrix and re-run the suite
- Add a vision model (Moondream2 vs quantized LLaVA) — pick one
- Log wall-power with a USB meter, not just thermals
- Build the side-by-side run comparison dashboard
Recent activity
- Created project · 4 hours ago
- To-do added — Stand up the benchmark CLI + CSV logging · 4 hours ago
- To-do added — Get llama.cpp building on the Pi 5 · 4 hours ago
- To-do added — Decide the "usable" latency budget — proposed < 1.5s first token · 4 hours ago
- To-do added — Add the Jetson Orin Nano to the matrix and re-run the suite · 4 hours ago
- To-do added — Add a vision model (Moondream2 vs quantized LLaVA) — pick one · 4 hours ago
- To-do added — Log wall-power with a USB meter, not just thermals · 4 hours ago
- To-do added — Build the side-by-side run comparison dashboard · 4 hours ago
Design doc
Edge Inference Playground — design doc
What it is: A test bench for running small AI models on inexpensive edge hardware (Raspberry Pi 5, Jetson Orin Nano, an old mini-PC) and measuring whether they're actually usable offline — latency, tokens/sec, memory, and thermals — before betting a real product on them.
The problem it solves
"Can a 3B model run on a Pi?" has a different answer every month. Instead of guessing from blog posts, this gives one repeatable harness: same prompts, same metrics, across devices and quantizations.
Approach
- One CLI that loads a model (llama.cpp / Ollama), runs a fixed prompt suite, and logs metrics to CSV.
- A small web dashboard to compare runs side by side.
- Start text-only; add a vision model later for "describe what the camera sees" experiments.
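A minimal sketch of the harness loop described above, assuming the backend (llama.cpp or Ollama) is wrapped in a hypothetical `generate(prompt)` callable that yields tokens as they stream. The CSV column names and the `run_suite` / `measure` helpers are illustrative choices, not a settled schema.

```python
import csv
import time
from typing import Callable, Dict, Iterable, List

# Hypothetical CSV schema; the design doc does not fix column names yet.
FIELDS = ["device", "model", "quant", "prompt_id",
          "first_token_s", "tokens_per_s", "n_tokens"]


def measure(generate: Callable[[str], Iterable[str]], prompt: str,
            clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one streamed generation.

    first_token_s: delay from issuing the prompt to the first token.
    tokens_per_s:  rate over the tokens after the first (decode phase only).
    """
    start = clock()
    first = None
    last = start
    n = 0
    for _ in generate(prompt):
        last = clock()
        if first is None:
            first = last
        n += 1
    if first is None:  # backend produced no tokens
        return {"first_token_s": None, "tokens_per_s": None, "n_tokens": 0}
    decode_time = last - first
    rate = (n - 1) / decode_time if n > 1 and decode_time > 0 else None
    return {"first_token_s": first - start, "tokens_per_s": rate, "n_tokens": n}


def run_suite(generate: Callable[[str], Iterable[str]],
              prompts: Dict[str, str], meta: Dict[str, str],
              out_path: str) -> List[dict]:
    """Run every prompt once and write one CSV row per prompt.

    `meta` must supply the device, model, and quant columns.
    """
    rows = []
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for prompt_id, prompt in prompts.items():
            row = {**meta, "prompt_id": prompt_id, **measure(generate, prompt)}
            writer.writerow(row)
            rows.append(row)
    return rows
```

Keeping the backend behind a single callable means the same prompt suite and the same timing code run unchanged on every device; only the `generate` wrapper differs. The injectable `clock` is there so the timing logic can be checked without real hardware.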
Hardware matrix
| Device | RAM | Notes |
|---|---|---|
| Raspberry Pi 5 | 8 GB | The baseline. Cheap, fanless-ish. |
| Jetson Orin Nano | 8 GB | GPU accel — the interesting one. |
| Mini-PC (N100) | 16 GB | x86 control group. |
Metrics tracked
First-token latency (seconds), sustained tokens/sec, peak RAM, package temperature after 5 min, and power draw at the wall (watts).
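Two of these metrics can be read from the operating system directly. The sketch below assumes Linux, where sysfs thermal zones report millidegrees Celsius and `ru_maxrss` is in KiB. Which thermal zone corresponds to the package sensor varies by board, so `thermal_zone0` is an assumption to verify per device. The peak-RAM helper only sees the current process and would under-report if the model runs in a separate server such as Ollama.

```python
import resource
import sys
from pathlib import Path
from typing import Optional

# Assumption: zone 0 is the SoC/package sensor on the target board.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading such as '48312\\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_temp_c(path: Path = THERMAL_ZONE) -> Optional[float]:
    """Return the current temperature, or None if the sensor is unavailable."""
    try:
        return parse_millidegrees(path.read_text())
    except (OSError, ValueError):
        return None


def peak_rss_mib() -> float:
    """Peak resident memory of this process in MiB.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Sampling `read_temp_c()` once before the run and again at the 5 minute mark gives the temperature delta alongside the absolute value, which makes runs in different ambient conditions easier to compare.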
Stack
Python CLI · llama.cpp + Ollama · a tiny FastAPI dashboard · GGUF quantizations (Q4_K_M, Q5, Q8).
Open questions
- Target latency budget — what counts as "usable" for an interactive assistant? Proposed: < 1.5s first token.
- Is the Jetson worth the price/complexity over a Pi 5 for our use case?
- Which vision model to add first — Moondream2 vs. a quantized LLaVA?
- Do we need a cooling solution, or is throttling acceptable for bursty use?
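Once the budget is decided, checking runs against it can be mechanical. The sketch below groups logged rows by device and quantization and flags each group against the proposed 1.5 s first-token budget. Using the median as the summary statistic is an assumption made here for illustration, and the `device`, `quant`, and `first_token_s` keys are hypothetical column names.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

BUDGET_S = 1.5  # proposed first-token budget; still an open question


def summarize(rows: Iterable[dict],
              budget_s: float = BUDGET_S) -> Dict[Tuple[str, str], dict]:
    """Median first-token latency per (device, quant), flagged against the budget."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get("first_token_s")
        if value in (None, ""):  # skip runs that produced no tokens
            continue
        groups[(row["device"], row["quant"])].append(float(value))
    summary = {}
    for key, values in groups.items():
        mid = median(values)
        summary[key] = {"median_first_token_s": mid,
                        "within_budget": mid < budget_s}
    return summary
```

The function accepts strings as well as floats for `first_token_s`, so rows read back with `csv.DictReader` can be passed in directly. The same per-group summary could feed a side-by-side comparison view.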
Decision log
- 2026-05-30 — Standardize on GGUF + llama.cpp as the common runtime across devices.