Next to-dos

Decide the "usable" latency budget — proposed < 1.5s first token
Add the Jetson Orin Nano to the matrix and re-run the suite
Add a vision model (Moondream2 vs quantized LLaVA) — pick one
Log wall-power with a USB meter, not just thermals
Build the side-by-side run comparison dashboard

Recent activity

Created project · 4 hours ago
To-do added — Stand up the benchmark CLI + CSV logging · 4 hours ago
To-do added — Get llama.cpp building on the Pi 5 · 4 hours ago
To-do added — Decide the "usable" latency budget — proposed < 1.5s first token · 4 hours ago
To-do added — Add the Jetson Orin Nano to the matrix and re-run the suite · 4 hours ago
To-do added — Add a vision model (Moondream2 vs quantized LLaVA) — pick one · 4 hours ago
To-do added — Log wall-power with a USB meter, not just thermals · 4 hours ago
To-do added — Build the side-by-side run comparison dashboard · 4 hours ago

Design doc

Edge Inference Playground — design doc

What it is: A test bench for running small AI models on inexpensive edge hardware (Raspberry Pi 5, Jetson Orin Nano, an old mini-PC) and measuring whether they're actually usable offline — latency, tokens/sec, memory, and thermals — before betting a real product on them.

The problem it solves

"Can a 3B model run on a Pi?" has a different answer every month. Instead of guessing from blog posts, this project provides one repeatable harness that runs the same prompts and records the same metrics across devices and quantizations.

Approach

One CLI that loads a model (llama.cpp / Ollama), runs a fixed prompt suite, and logs metrics to CSV.
A small web dashboard to compare runs side by side.
Start text-only; add a vision model later for "describe what the camera sees" experiments.
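A minimal sketch of the CLI's core loop, to make the "fixed prompt suite, metrics to CSV" idea concrete. Everything named here is an assumption for illustration: the prompt list, the CSV column names, and the `generate` callable, which stands in for whatever streaming call the llama.cpp or Ollama backend ends up exposing. The `fake_generate` stub exists only so the sketch runs without a model installed.

```python
import csv
import time
from pathlib import Path
from typing import Callable, Iterable, Iterator, List

# Hypothetical prompt suite; the real suite is not specified in this doc.
PROMPTS = [
    "Summarize the water cycle in two sentences.",
    "List three uses for a Raspberry Pi.",
]

# Hypothetical column names for the run log.
FIELDS = ["device", "model", "quant", "prompt_id", "first_token_s", "tokens", "total_s"]


def run_suite(
    generate: Callable[[str], Iterable[str]],
    device: str,
    model: str,
    quant: str,
    out_path: Path,
    prompts: List[str] = PROMPTS,
) -> int:
    """Stream each prompt through `generate` and append one CSV row per prompt."""
    write_header = not out_path.exists()
    with out_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for prompt_id, prompt in enumerate(prompts):
            start = time.perf_counter()
            first_token_s = None  # stays None if the backend yields nothing
            tokens = 0
            for _ in generate(prompt):
                if first_token_s is None:
                    first_token_s = time.perf_counter() - start
                tokens += 1
            total_s = time.perf_counter() - start
            writer.writerow({
                "device": device,
                "model": model,
                "quant": quant,
                "prompt_id": prompt_id,
                "first_token_s": first_token_s,
                "tokens": tokens,
                "total_s": total_s,
            })
    return len(prompts)


def fake_generate(prompt: str) -> Iterator[str]:
    """Stand-in backend: yields the prompt's words as if they were tokens."""
    for word in prompt.split():
        yield word
```

Passing the backend in as a callable keeps the timing and logging code identical across devices, so only the `generate` implementation would need to change per runtime.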

Hardware matrix

Device	RAM	Notes
Raspberry Pi 5	8 GB	The baseline. Cheap, fanless-ish.
Jetson Orin Nano	8 GB	GPU acceleration; the interesting one.
Mini-PC (N100)	16 GB	x86 control group.

Metrics tracked

First-token latency, sustained tokens/sec, peak RAM, package temperature after 5 min, and power draw at the wall (watts).
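The two timing metrics can be derived from per-token timestamps. The sketch below shows one way to do it, under an assumption this doc does not state: "sustained" tokens/sec is taken to mean the decode rate after the first token, so prompt processing time does not drag the number down. The names `TimingMetrics` and `timing_metrics` are hypothetical. Peak RAM, temperature, and wall power need device-specific sources and are not covered here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TimingMetrics:
    first_token_s: Optional[float]  # seconds from request start to first token
    tokens_per_s: Optional[float]   # decode rate, excluding the first token


def timing_metrics(start: float, token_times: Sequence[float]) -> TimingMetrics:
    """Compute timing metrics from a start time and token arrival times.

    All times are assumed to come from the same monotonic clock,
    for example time.perf_counter().
    """
    if not token_times:
        return TimingMetrics(None, None)
    first_token_s = token_times[0] - start
    decode_s = token_times[-1] - token_times[0]
    n_decoded = len(token_times) - 1  # tokens produced after the first one
    if n_decoded == 0 or decode_s <= 0:
        # A single token (or zero elapsed time) gives no meaningful rate.
        return TimingMetrics(first_token_s, None)
    return TimingMetrics(first_token_s, n_decoded / decode_s)
```

If the project prefers an end-to-end rate instead, dividing the total token count by total wall time is the obvious alternative; the two definitions should not be mixed within one CSV.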

Stack

Python CLI · llama.cpp + Ollama · a tiny FastAPI dashboard · GGUF quantizations (Q4_K_M, Q5, Q8).

Open questions

Target latency budget — what counts as "usable" for an interactive assistant? Proposed: < 1.5s first token.
Is the Jetson worth the price/complexity over a Pi 5 for our use case?
Which vision model to add first — Moondream2 vs. a quantized LLaVA?
Do we need a cooling solution, or is throttling acceptable for bursty use?
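Once runs are logged, the proposed latency budget can be checked mechanically. The sketch below groups logged rows by device and quantization and compares the median first-token latency to the 1.5 s proposal. The column names (`device`, `quant`, `first_token_s`) and the choice of the median as the summary statistic are assumptions, not decisions recorded in this doc.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Mapping, Tuple

BUDGET_S = 1.5  # proposed first-token budget; still an open question


def budget_report(
    rows: Iterable[Mapping[str, object]],
    budget_s: float = BUDGET_S,
) -> Dict[Tuple[str, str], Tuple[float, bool]]:
    """Map (device, quant) to (median first-token latency, within budget?)."""
    samples = defaultdict(list)
    for row in rows:
        value = row.get("first_token_s")
        if value in (None, ""):
            continue  # skip runs that produced no tokens
        samples[(str(row["device"]), str(row["quant"]))].append(float(value))
    report = {}
    for key, values in samples.items():
        med = median(values)
        report[key] = (med, med < budget_s)
    return report
```

Rows read back with `csv.DictReader` can be passed in directly, since the function converts the latency strings to floats itself. A stricter check, such as a high percentile rather than the median, may suit an interactive assistant better; that choice belongs with the budget decision.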

Decision log

2026-05-30 — Standardize on GGUF + llama.cpp as the common runtime across devices.