Self-hosted inference on a single consumer GPU: 14 models on an RTX 3060
14 open-weight instruct models, 1,204 requests, run entirely on one consumer RTX 3060 through Marigold's local Docker Compose stack. No cloud account, no managed infrastructure -- this is the self-hosted path, measured directly.
Setup
One RTX 3060, run through the local Docker Compose stack described in the
project README -- Postgres for the queue, a single worker process, no cloud
component anywhere in the path. Each of the 14 models ran 86 sequential
requests (concurrency 1), split across five prompt groups: a fixed-prefix
set (33 requests), a structured-output set (20), and three varying-context
sets at short, medium, and long context length (11 each). Every request
captured server-side inference time, instantaneous and peak VRAM, and
instantaneous and peak GPU power draw directly from the worker process.
max_tokens was capped at 256.
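The request plan above can be checked with a few lines of arithmetic. This is a minimal sketch, not the harness's own configuration: the group key names are hypothetical labels, and only the counts come from the text.

```python
# Requests per prompt group for one model, as listed in the Setup text.
# The key names are hypothetical; the harness may label groups differently.
PROMPT_GROUPS = {
    "fixed_prefix": 33,
    "structured_output": 20,
    "varying_context_short": 11,
    "varying_context_medium": 11,
    "varying_context_long": 11,
}
MODEL_COUNT = 14

requests_per_model = sum(PROMPT_GROUPS.values())
total_requests = requests_per_model * MODEL_COUNT

print(f"{requests_per_model} requests per model, {total_requests} in total")
```

The two totals match the 86 requests per model and 1,204 requests overall quoted in this post.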
Throughput and parameters
Output tok/s vs parameter count. Parameter counts parsed from model names where unambiguous; phi-3.5-mini-instruct is omitted from this chart pending confirmation of its parameter count.
Per-request telemetry
Three views of the same 1,204 requests, in run order, each line a model. Click any model in the legend to hide or show it across all three charts -- useful for comparing two or three models directly without the rest of the field in the way.
The charts contain:
- Mean power draw (watts) per request, in run order.
- Peak VRAM (GB) per request -- the figure that determines whether a model fits on a given card.
- Server-side inference time (seconds) per request.
Observations
Wall-clock time tracks output length more than model size. qwen3-8b and deepseek-r1-distill-qwen-1.5b produce the longest average completions (251 and 255 tokens against a 256 cap) and are, respectively, the slowest and fourth-slowest models in the set by mean inference time -- even though deepseek-r1-distill-qwen-1.5b is one of the two smallest models in VRAM terms (1.7 GB). That 1.5B model's throughput (27.05 tok/s) is in fact the second-highest of all 14 models. It is not a slow model; it is a model that writes a long chain of thought before answering. The 256-token cap does most of the work in both directions: it truncates the longest reasoning traces, and it gives smaller models a ceiling they rarely reach.
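The relationship behind this observation is simple: generation time is roughly completion length divided by throughput. The sketch below is an approximation under stated assumptions. It ignores prompt processing time and treats the reported means as if they applied to a single request. The 64-token reply is a hypothetical comparison point, not a measured one.

```python
def estimated_seconds(completion_tokens: float, tokens_per_second: float) -> float:
    """Rough generation time, ignoring prompt processing and queueing."""
    return completion_tokens / tokens_per_second


# deepseek-r1-distill-qwen-1.5b: mean completion of 255 tokens at 27.05 tok/s.
reasoning_seconds = estimated_seconds(255, 27.05)

# Hypothetical: the same throughput with a short 64-token answer.
short_seconds = estimated_seconds(64, 27.05)

print(f"long reasoning trace: ~{reasoning_seconds:.1f}s")
print(f"short answer at the same tok/s: ~{short_seconds:.1f}s")
```

A model with high throughput can still sit near the bottom of a latency ranking if it writes four times as many tokens per answer.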
One model never stopped naturally. cognitivecomputations/dolphin-2.9-llama3-8b hit exactly 256 completion tokens on all 86 of its requests -- every single one. Every other model in the set varies its completion length below the cap on at least some requests. This is worth checking against the model's stop-token configuration before treating its latency and throughput numbers as representative; as measured, they reflect the cap, not a natural response length.
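One way to surface this kind of behaviour automatically is to compute, per model, the share of requests whose completion length reached the cap. The following is a minimal sketch, assuming the completion token counts are available as a list of integers. The `mixed` sample is made up for illustration.

```python
MAX_TOKENS = 256  # the cap used in this benchmark


def cap_hit_rate(completion_tokens: list[int], max_tokens: int = MAX_TOKENS) -> float:
    """Fraction of requests whose completion reached the token cap."""
    if not completion_tokens:
        raise ValueError("no requests to evaluate")
    hits = sum(1 for n in completion_tokens if n >= max_tokens)
    return hits / len(completion_tokens)


# A model that hit the cap on all 86 requests, as reported above.
always_capped = [256] * 86

# Hypothetical counts for a model that usually stops on its own.
mixed = [256, 120, 87, 256]

print(cap_hit_rate(always_capped))
print(cap_hit_rate(mixed))
```

A rate at or near 1.0 is a signal to inspect the stop-token configuration before trusting the latency figures.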
The fastest model by a clear margin is the smallest one with a clean stop behaviour. meta-llama/llama-3.2-1b-instruct reaches 48.78 tok/s and a 3.18s mean inference time, both comfortably ahead of the rest of the field, with no sign of hitting the token cap.
| Model | Note |
|---|---|
| microsoft/phi-3.5-mini-instruct | Parameter count not parsed from model name; omitted from the throughput-vs-params chart pending confirmation. |
| cognitivecomputations/dolphin-2.9-llama3-8b | Hit the 256-token cap on 100% of requests. Latency and throughput figures reflect the cap, not necessarily natural response length. |
| GBP / 1M tok column (all models) | Computed from single, non-sustained requests with idle time between them, against a single consumer GPU's transient per-request power draw. A sustained, back-to-back generation workload would likely draw more power than these per-request averages show, so the real gap to a hosted API's per-token price is probably smaller than this column suggests. |
Run totals
Across all 1,204 requests: 280,224 tokens (95,517 prompt, 184,707 completion). The GPU was actively generating for 2h 17m (the sum of per-request inference time); the benchmark took 6h 1m end to end at the client. The roughly 3h 45m gap between those two figures is queue and poll overhead from the API's async design -- submit, then poll for a result -- not idle GPU time or wasted compute. A further 53m 36s went to model load and switching across the 14 models, tracked separately from inference time.
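The time and token accounting in this paragraph can be reproduced from the quoted figures. This sketch uses only numbers stated in the text. Because the durations are rounded to the minute, the derived values are approximate.

```python
from datetime import timedelta

# Figures quoted in the run totals.
inference_time = timedelta(hours=2, minutes=17)  # sum of per-request inference time
end_to_end = timedelta(hours=6, minutes=1)       # wall-clock time at the client
prompt_tokens = 95_517
completion_tokens = 184_707

total_tokens = prompt_tokens + completion_tokens
overhead = end_to_end - inference_time
generating_share = inference_time / end_to_end

# Run-wide completion throughput while the GPU was generating.
aggregate_tok_per_s = completion_tokens / inference_time.total_seconds()

print(f"total tokens: {total_tokens:,}")
print(f"non-inference time: {overhead}")
print(f"share of wall-clock spent generating: {generating_share:.0%}")
print(f"aggregate completion throughput: {aggregate_tok_per_s:.1f} tok/s")
```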
Total electricity used while generating: 226.4 Wh. At the current Ofgem direct-debit cap (24.67p/kWh, April-June 2026), that is 5.6 pence for the entire 1,204-request run. Per-model cost per million tokens, using the same method, is in the table above; every model in this set comes out under the price of an equivalent hosted Flash-class API call, though that comparison carries a caveat -- see the cost note in the table.
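The electricity figure follows directly from energy times unit price. The sketch below reproduces the run-level cost and shows one plausible form of the per-million-token calculation. It is an assumption that the per-model column divides by total tokens rather than completion tokens only; the text does not say which denominator is used.

```python
def energy_cost_pence(energy_wh: float, pence_per_kwh: float) -> float:
    """Cost in pence of a given energy use in watt-hours."""
    return energy_wh / 1000 * pence_per_kwh


def pence_per_million_tokens(energy_wh: float, tokens: int, pence_per_kwh: float) -> float:
    """Electricity cost scaled to one million tokens (denominator is an assumption)."""
    return energy_cost_pence(energy_wh, pence_per_kwh) / tokens * 1_000_000


UNIT_PRICE = 24.67  # pence per kWh, the tariff quoted in the text
RUN_ENERGY_WH = 226.4
RUN_TOKENS = 280_224  # prompt + completion tokens across the whole run

run_cost = energy_cost_pence(RUN_ENERGY_WH, UNIT_PRICE)
run_rate = pence_per_million_tokens(RUN_ENERGY_WH, RUN_TOKENS, UNIT_PRICE)

print(f"whole run: {run_cost:.1f}p")
print(f"run-wide average: {run_rate:.1f}p per 1M tokens")
```

The run-wide average is an aggregate across all 14 models, not a substitute for the per-model figures.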
Conclusions
A single consumer GPU, with no cloud account and no managed infrastructure, ran 14 instruct models through 1,204 requests for about 5.6 pence in electricity. That is the headline number for anyone weighing self-hosting against an API bill: the marginal cost of generation on hardware you already own is close to zero for this size class of model.
The more useful finding for choosing a model is not size but behaviour. Wall-clock time and token cost tracked completion length far more closely than parameter count. A 1.5B reasoning model and an 8B model can land in the same latency band if both are writing long outputs, and the cheapest model per token here was not the smallest one on disk but the one that stopped generating soonest. The 256-token cap also did real work in this run, and at least one model (`dolphin-2.9-llama3-8b`) never demonstrated that it would stop on its own without it. That is worth confirming before relying on its numbers, or on that model, for anything latency-sensitive.
Implementation notes
The raw benchmark harness output is a CSV of one row per request, with per-request power and VRAM sampled directly on the worker. A short Python script converts it to the JSON embedded in this page: one array of per-request records for the telemetry charts, one array of per-model summary statistics for the throughput chart and table. Both data and the conversion script are published alongside the code on GitHub.
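A conversion of this shape might look like the sketch below. It is not the published script. The column names (`model`, `completion_tokens`, `inference_s`, `peak_vram_gb`, `mean_power_w`), the output keys, and the sample rows are all hypothetical, chosen only to illustrate the two output arrays described above.

```python
import csv
import io
import json
from collections import defaultdict
from statistics import mean

# Made-up sample rows with hypothetical column names.
SAMPLE_CSV = """model,completion_tokens,inference_s,peak_vram_gb,mean_power_w
model-a,100,4.0,1.7,90.0
model-a,200,8.0,1.7,95.0
model-b,256,16.0,5.9,110.0
"""


def convert(csv_text: str) -> dict:
    """Turn per-request CSV rows into per-request records plus per-model summaries."""
    rows = csv.DictReader(io.StringIO(csv_text))
    requests = [
        {
            "index": i,  # position in run order
            "model": row["model"],
            "completion_tokens": int(row["completion_tokens"]),
            "inference_s": float(row["inference_s"]),
            "peak_vram_gb": float(row["peak_vram_gb"]),
            "mean_power_w": float(row["mean_power_w"]),
        }
        for i, row in enumerate(rows)
    ]

    by_model = defaultdict(list)
    for record in requests:
        by_model[record["model"]].append(record)

    summaries = [
        {
            "model": model,
            "requests": len(records),
            "mean_inference_s": mean(r["inference_s"] for r in records),
            # Throughput as total tokens over total time for the model.
            "tok_per_s": sum(r["completion_tokens"] for r in records)
            / sum(r["inference_s"] for r in records),
            "peak_vram_gb": max(r["peak_vram_gb"] for r in records),
        }
        for model, records in by_model.items()
    ]
    return {"requests": requests, "summaries": summaries}


payload = convert(SAMPLE_CSV)
print(json.dumps(payload, indent=2))
```

Computing throughput as total tokens over total time is one choice. Averaging per-request tok/s gives a different number when completion lengths vary, so the published script may differ.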
The per-request charts above use a small reusable technique: each model's contiguous run of requests becomes its own Chart.js dataset, so the built-in legend-toggle behaviour works per model with no extra plumbing, and a background colour band keyed to the same dataset marks each model's range on the x-axis. The same function will drive the telemetry charts for every future benchmark round -- embeddings, image generation, ASR, and TTS all produce the same shape of per-request data.
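The dataset-per-model idea can be sketched on the data-preparation side in Python, producing JSON-ready dictionaries for the page to hand to Chart.js. This is an illustration, not the site's actual function. `label` and `data` are standard Chart.js dataset fields and `null` points render as gaps in a line chart. The `band` key is a hypothetical field that a custom background-band plugin could read.

```python
from itertools import groupby


def build_datasets(records: list[dict], value_key: str) -> list[dict]:
    """One dataset per contiguous run of requests for the same model."""
    total = len(records)
    datasets = []
    start = 0
    for model, run in groupby(records, key=lambda r: r["model"]):
        values = [r[value_key] for r in run]
        # Pad with None so every dataset shares the same x-axis (run order).
        data = [None] * total
        data[start:start + len(values)] = values
        datasets.append({
            "label": model,
            "data": data,
            # Hypothetical field: x-range for a background colour band.
            "band": {"start": start, "end": start + len(values) - 1},
        })
        start += len(values)
    return datasets


# Made-up records in run order.
records = [
    {"model": "model-a", "inference_s": 1.0},
    {"model": "model-a", "inference_s": 2.0},
    {"model": "model-b", "inference_s": 5.0},
]
datasets = build_datasets(records, "inference_s")
for dataset in datasets:
    print(dataset)
```

Because `groupby` only merges adjacent rows, a model that appeared in two separate runs would yield two datasets. That matches the "contiguous run" framing but would need handling if models were interleaved.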