Vision-language models on a consumer GPU: four models, 4,004 images, one RTX 3060
Four open-weight vision-language models, 1,001 distinct images from COCO val2017, 4,004 requests total, run on a single RTX 3060 through Marigold's local Docker Compose stack.
Setup
One RTX 3060 12 GB, running Marigold's local Docker Compose stack -- Postgres for the queue, a single worker process, no cloud component in the path.
Each of the four models processed 1,001 COCO val2017 images in sequence, with the prompt "Describe this image." and a 256-token output cap.
Images were downloaded once before the run and held in memory as base64 data URIs, so no per-request network fetch contributed to timing.
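The preloading step can be sketched as follows. This is a minimal illustration, not the harness's own code: the function names `to_data_uri` and `preload` are hypothetical, and the JPEG MIME type is an assumption based on COCO images being distributed as JPEG files.

```python
import base64
from typing import Dict, Iterable


def to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def preload(paths: Iterable[str]) -> Dict[str, str]:
    """Read every image once and keep its data URI in memory.

    Requests built from this dict need no network or disk access,
    so fetch latency stays out of the timed section.
    """
    cache: Dict[str, str] = {}
    for path in paths:
        with open(path, "rb") as fh:
            cache[path] = to_data_uri(fh.read())
    return cache
```

Base64 inflates each payload by roughly a third, which is the price of keeping the request self-contained.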
For every request, the worker process records server-side inference time, peak VRAM, and GPU power draw directly via nvml, using the same telemetry pipeline as the instruct model benchmark.
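One way to collect a per-request power figure is to poll the GPU on a background thread while inference runs. The sketch below assumes the `pynvml` bindings (the post only says "nvml") and uses hypothetical helper names; the actual telemetry pipeline may sample differently.

```python
import threading
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def nvml_power_reader(gpu_index: int = 0) -> Callable[[], float]:
    """Return a function that reads instantaneous board power in watts."""
    import pynvml  # assumption: NVML is reached through the pynvml bindings

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # nvmlDeviceGetPowerUsage reports milliwatts.
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0


def mean_power_during(
    work: Callable[[], T],
    read_power_w: Callable[[], float],
    interval_s: float = 0.05,
) -> Tuple[T, float]:
    """Run work() while sampling power; return (result, mean watts)."""
    samples: List[float] = [read_power_w()]  # guarantees at least one sample
    stop = threading.Event()

    def poll() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            samples.append(read_power_w())

    poller = threading.Thread(target=poll, daemon=True)
    poller.start()
    try:
        result = work()
    finally:
        stop.set()
        poller.join()
    return result, sum(samples) / len(samples)
```

Passing the reader in as a callable keeps the sampling loop testable on a machine without a GPU.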
One model was excluded: HuggingFaceTB/SmolVLM2-2.2B-Instruct requires AutoModelForImageTextToText rather than the AutoModelForVision2Seq class the current handler uses. This is a handler limitation, not a model deficiency, and is noted for a future handler revision.
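A handler revision could pick the auto class per model. The sketch below is hypothetical: the override table and function names are illustrative and are not taken from the Marigold handler. It assumes a `transformers` release that ships both `AutoModelForVision2Seq` and `AutoModelForImageTextToText`.

```python
import importlib

# Hypothetical override table: models that need a non-default auto class.
AUTO_CLASS_OVERRIDES = {
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct": "AutoModelForImageTextToText",
}
DEFAULT_AUTO_CLASS = "AutoModelForVision2Seq"


def auto_class_name(model_id: str) -> str:
    """Return the name of the transformers auto class to use for model_id."""
    return AUTO_CLASS_OVERRIDES.get(model_id, DEFAULT_AUTO_CLASS)


def load_model(model_id: str, **kwargs):
    """Load a model through the auto class selected for it.

    transformers is imported lazily so the lookup above can be used
    (and tested) without the library installed.
    """
    transformers = importlib.import_module("transformers")
    auto_class = getattr(transformers, auto_class_name(model_id))
    return auto_class.from_pretrained(model_id, **kwargs)
```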
Models
The four models span four families and two vision-encoding strategies. The 3B and 8B parameter counts include the vision encoder and projection layers, not the language model backbone alone.
| Model | Family | Params | Quant | Vision encoding |
|---|---|---|---|---|
| google/paligemma2-3b-mix-448 | PaliGemma | 3B | bf16 | Fixed 1,030 tokens per image |
| qwen/qwen2.5-vl-7b-instruct | Qwen2.5-VL | 7B | NF4 | Dynamic: 73--554 tokens per image |
| llava-hf/llava-v1.6-mistral-7b-hf | LLaVA-NeXT | 7B | NF4 | Dynamic anyres tiling: 1,277--2,943 tokens |
| huggingfacem4/idefics2-8b | Idefics2 | 8B | NF4 | Fixed 338 tokens per image |
PaliGemma2 and Idefics2 use fixed-grid encoders that resize every image to a predetermined resolution before encoding, producing an identical token count regardless of aspect ratio or content. Qwen2.5-VL and LLaVA-NeXT use dynamic encoders that tile or crop based on the input image's dimensions. LLaVA-NeXT's anyres strategy is the most aggressive: a single image that produces 338 tokens in Idefics2 produces up to 2,943 tokens in LLaVA-NeXT, depending on its native resolution.
Throughput and parameters
Median output tok/s vs parameter count. PaliGemma2 runs bf16; the three 7--8B models run NF4 4-bit quantisation.
Vision encoding is the throughput variable
Parameter count does not predict throughput in this set. PaliGemma2 at 3B leads at 28.8 tok/s median; Idefics2 at 8B trails at 4.6 tok/s. The figure that tracks throughput is input token count.
PaliGemma2 passes exactly 1,030 tokens to the language model for every image, across all 1,001 images and every aspect ratio in the set. Idefics2 passes exactly 338. Qwen2.5-VL produced a mean of 370 input tokens, ranging from 73 to 554 across the dataset. LLaVA-NeXT produced a mean of 2,229 tokens, ranging from 1,277 to 2,943.
LLaVA-NeXT's mean inference time is 14.75 seconds per image against PaliGemma2's 2.01 seconds. Parameter count (7B vs 3B) accounts for a fraction of that gap; the remainder is the attention computation over a token sequence that is, on average, about six times as long as Qwen2.5-VL's and more than twice as long as PaliGemma2's.
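The sequence-length ratios follow directly from the input token counts reported above. A small check, using only those figures (the dictionary and function names are illustrative):

```python
# Mean input tokens per image, as reported in this post.
# The two fixed-grid encoders produce the same count for every image.
mean_input_tokens = {
    "PaliGemma2": 1030,
    "Idefics2": 338,
    "Qwen2.5-VL": 370,
    "LLaVA-NeXT": 2229,
}


def length_ratio(model: str, baseline: str) -> float:
    """How many times longer model's mean input sequence is than baseline's."""
    return mean_input_tokens[model] / mean_input_tokens[baseline]


for baseline in ("Qwen2.5-VL", "PaliGemma2", "Idefics2"):
    ratio = length_ratio("LLaVA-NeXT", baseline)
    print(f"LLaVA-NeXT vs {baseline}: {ratio:.1f}x")
```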
Dynamic encoders preserve more image detail than fixed-grid encoders, which matters for tasks where image content is dense: OCR, chart reading, document processing. The COCO val2017 images used in this benchmark are general scene photography and do not stress that difference. A document or chart benchmark would produce a different throughput ordering.
Per-request telemetry
Three views of the same 4,004 requests, in run order.
The charts show:
- Mean GPU power draw per request in watts, as reported by nvml.
- Peak VRAM per request in GB, including activations and static model layers.
- Server-side inference time per request in seconds.
PaliGemma2's power draw (mean 78.5 W) sits well below the three larger models. This reflects both the smaller parameter count and the shorter per-request inference time: less sustained compute load means less energy drawn per image. LLaVA-NeXT runs at 138.3 W mean for 14.75 seconds per image; PaliGemma2 runs at 78.5 W for 2.01 seconds. The energy per image compounds across the dataset.
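Energy per image is power multiplied by time. The sketch below applies that to the two means quoted above. Multiplying mean power by mean duration only approximates the true per-image mean, so treat the results as estimates rather than the measured totals.

```python
def energy_joules(mean_power_w: float, seconds: float) -> float:
    """Energy in joules for a constant power draw over a duration."""
    return mean_power_w * seconds


def joules_to_wh(joules: float) -> float:
    """Convert joules to watt-hours (1 Wh = 3,600 J)."""
    return joules / 3600.0


llava = energy_joules(138.3, 14.75)      # about 2,040 J per image
paligemma = energy_joules(78.5, 2.01)    # about 158 J per image

print(f"LLaVA-NeXT: {llava:.0f} J ({joules_to_wh(llava):.3f} Wh) per image")
print(f"PaliGemma2: {paligemma:.0f} J ({joules_to_wh(paligemma):.3f} Wh) per image")
print(f"Ratio: {llava / paligemma:.1f}x")
```

On these estimates LLaVA-NeXT draws roughly 13 times as much energy per image, and most of that factor comes from duration rather than wattage.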
Idefics2's load time (mean 62.86 seconds per request) is an order of magnitude above the other three.
This is the consequence of running with idle_timeout=0 in the worker configuration: the model reloads from scratch on every request rather than staying resident between calls.
The reload cost appears in every row of the telemetry rather than as a one-time cold-start cost.
The inference time itself (mean 3.44 seconds) is within the range of the other models; the load time is not.
LLaVA-NeXT's VRAM peak (mean 5.62 GB) is lower than Qwen2.5-VL (6.74 GB) and Idefics2 (6.94 GB) despite substantially longer input sequences. All three are NF4 4-bit quantised; the VRAM difference between them reflects activation memory during encoding rather than weight size.
The same image, four models
COCO image 286458 is a diptych: two photographs of open suitcases placed side by side, with a "baggage" watermark at the lower right. All four models received the same prompt against the same image and produced outputs ranging from 14 to 182 tokens, all stopping within the 256-token cap.
A silver suitcase lies on the floor, its contents spilling out. The suitcase contains a red bag with items neatly packed inside, including a brown bag with a white tag, a small white bottle, a brown bag on the right with a red zip, and a brown pillow with a leopard print cover. A black backpack sits on the floor next to the suitcase, while a small black bag with a long strap and a brown bag with a white tag are also visible. The floor is made of wood and tiled, and the light shines on the suitcase, reflecting off the shiny tile. The word "baggage" is written in yellow on the photo, and the overall mood is one of organisation and clutter.
An open suitcase with the word baggage next to it.
The image shows two photographs side by side, both depicting a suitcase filled with various items. The left photograph shows the suitcase open on a tiled floor, with its contents neatly organised. Inside the suitcase, there are several compartments and bags containing personal items such as toiletries, medications, and other essentials. A small bag is placed on top of the suitcase, possibly for additional storage or as a travel pillow.
The right photograph provides a closer view of the suitcase's contents, focusing on the items inside. It includes bottles, tubes, and other personal care products, suggesting that the suitcase is being packed for travel. The word "baggage" is written in yellow text at the bottom right corner of the right photograph.
The image is a composite of two photographs placed side by side, each showing a suitcase with its contents spilled out onto a tiled floor. The left side shows a black suitcase with various items, including a bottle, a toothbrush, and a pair of shoes, among other things. The right side shows a red suitcase with similar items, including a bottle, a toothbrush, and a pair of shoes, as well as other personal care items.
At the bottom of the image, there is a watermark with the text "baggage" in a stylised font, suggesting that the image may be related to luggage or travel. The focus is on the contents of the suitcases, which are typical of travel items.
Idefics2 produces a single accurate sentence. PaliGemma2 identifies the grid structure and describes it as a collage, producing a flowing paragraph. Qwen2.5-VL enumerates each panel by position and produces a structured list with spatial references. LLaVA-NeXT also enumerates by position but reaches the 256-token cap mid-sentence on the final panel, the only model to do so on this image. All four descriptions are accurate; the differences are in structure, verbosity, and spatial reasoning, not in factual correctness.
Caption length as a model characteristic
Across 1,001 images with the same prompt and the same 256-token cap, the four models produced mean output lengths of 16.3, 61.1, 108.6, and 195.7 tokens, with Idefics2 the shortest and LLaVA-NeXT the longest. These differences are stable across the dataset: Idefics2 produces roughly 16 tokens on a simple scene and roughly 16 tokens on a six-panel composite; LLaVA-NeXT produces roughly 195 tokens on both. Caption length is a property of how each model was trained, not a response to image complexity.
Mean output tokens per image across 1,001 COCO val2017 images, 256-token cap. LLaVA-NeXT hit the cap on 112 of 1,001 images; the other three hit it on at most 1.
This matters for cost at scale. LLaVA-NeXT ran at 138.3 W mean for 14.75 seconds per image and cost 13.97p in electricity across 1,001 images. PaliGemma2 ran at 78.5 W for 2.01 seconds per image and cost 1.20p across the same set. Both received the same prompt against the same images. The energy difference is not primarily explained by parameter count (7B vs 3B) or by quantisation (both NF4 and bf16 are in the set at different output lengths). It is explained by the number of tokens generated: more tokens mean more decode steps, more time under sustained GPU load, and more energy drawn per image. Choosing a model for a batch captioning or indexing workload therefore requires understanding its natural output length on the target content, not only its benchmark throughput figure.
Run totals
Across all 4,004 requests: zero failures. Total inference energy: 976.9 Wh. At the Ofgem direct-debit cap (24.67p/kWh, April--June 2026), that is 24.1 pence for 4,004 image captioning requests across four vision-language models, on hardware with no cloud account and no managed infrastructure.
LLaVA-NeXT accounts for 13.97p of the 24.1p total. Idefics2 accounts for 3.27p. Qwen2.5-VL accounts for 5.66p. PaliGemma2 accounts for 1.20p.
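The cost figures can be reproduced from the reported energy total and tariff. A short check, using only numbers from this post (the function name is illustrative):

```python
TARIFF_PENCE_PER_KWH = 24.67  # tariff quoted above


def cost_pence(energy_wh: float, tariff: float = TARIFF_PENCE_PER_KWH) -> float:
    """Electricity cost in pence for an energy figure given in watt-hours."""
    return energy_wh / 1000.0 * tariff


# Per-model costs in pence, as reported above.
per_model_pence = {
    "LLaVA-NeXT": 13.97,
    "Qwen2.5-VL": 5.66,
    "Idefics2": 3.27,
    "PaliGemma2": 1.20,
}

total_from_energy = cost_pence(976.9)
total_from_models = sum(per_model_pence.values())
print(f"From 976.9 Wh: {total_from_energy:.2f}p")
print(f"Sum of per-model costs: {total_from_models:.2f}p")
```

Both routes arrive at 24.10p, so the per-model breakdown and the run total agree.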
Observations
Input token count, not parameter count, determines throughput.
The throughput ordering -- PaliGemma2 at 28.8 tok/s, Qwen2.5-VL at 16.5 tok/s, LLaVA-NeXT at 13.4 tok/s, Idefics2 at 4.6 tok/s -- does not match the parameter ordering. It matches the mean input token count in reverse, with one exception: Idefics2's 338 fixed input tokens producing lower throughput than Qwen2.5-VL's 370 mean, which **does** reflect Idefics2's larger parameter count.
LLaVA-NeXT's stop token issue resolved at 256 tokens.
In an earlier 64-token run, LLaVA-NeXT hit the cap on all 1,001 images with a stop finish reason that was inconsistent with truncation.
At 256 tokens, it hit the cap on 112 of 1,001 images and stopped naturally on the remaining 889.
The model's natural completion length for scene descriptions sits in the 150--220 token range; the 64-token cap was below that range for most images.
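Because the reported finish reason was unreliable in the 64-token run, a safer check compares the generated token count with the cap. The sketch below assumes each per-request record carries a `completion_tokens` field; that field name and the synthetic records are hypothetical and only mirror the counts reported above.

```python
from typing import Iterable, Mapping, Tuple


def hit_cap(record: Mapping, max_tokens: int) -> bool:
    """Treat a response as truncated when it used the whole token budget.

    The reported finish reason is deliberately ignored here.
    """
    return record["completion_tokens"] >= max_tokens


def cap_hit_summary(records: Iterable[Mapping], max_tokens: int) -> Tuple[int, int]:
    """Return (capped, stopped_naturally) counts for a set of records."""
    records = list(records)
    capped = sum(hit_cap(r, max_tokens) for r in records)
    return capped, len(records) - capped


# Synthetic records shaped like the 256-token LLaVA-NeXT pass.
records = (
    [{"completion_tokens": 256, "finish_reason": "stop"}] * 112
    + [{"completion_tokens": 180, "finish_reason": "stop"}] * 889
)
capped, natural = cap_hit_summary(records, max_tokens=256)
print(f"{capped} capped, {natural} natural ({capped / len(records):.1%} capped)")
```

A completion that ends naturally on exactly the cap would be counted as truncated, so this check errs towards over-reporting cap hits.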
Idefics2's load time dominates its elapsed time.
The 62.86-second mean load time per request is a configuration effect, not a model property.
With idle_timeout raised to keep the model resident between requests, the per-request load cost would collapse to a single cold-start event across the full 1,001-image pass.
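The scale of the reload overhead can be estimated from the two Idefics2 means. The sketch assumes a single cold start would cost about the same as the mean per-request load time, which the post does not measure directly.

```python
def load_share(load_s: float, inference_s: float) -> float:
    """Fraction of per-request elapsed time spent loading the model."""
    return load_s / (load_s + inference_s)


def reload_seconds_saved(load_s: float, n_requests: int) -> float:
    """Seconds saved if the model loads once instead of on every request.

    Assumes the one remaining cold start costs the same as the mean load.
    """
    return load_s * (n_requests - 1)


share = load_share(62.86, 3.44)
saved = reload_seconds_saved(62.86, 1001)
print(f"Load share of elapsed time: {share:.1%}")
print(f"Estimated time saved over 1,001 requests: {saved / 3600:.1f} hours")
```

Under that assumption, roughly 95% of Idefics2's per-request elapsed time is reload, and keeping the model resident would save about 17.5 hours over the full pass.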
Implementation notes
The benchmark harness is published under img2txt-providers in the marigold-benchmarks repository.
Images are sourced from COCO val2017, downloaded via the coco_val2017 generator in image_sources.py with MD5-keyed local caching.
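An MD5-keyed cache of this kind can be sketched as below. This is an illustration, not the code in image_sources.py: it assumes the key is the MD5 of the image URL (the post does not say whether the URL or the content is hashed), and the function names are hypothetical.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def http_get(url: str) -> bytes:
    """Fetch a URL and return the response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def cached_fetch(
    url: str,
    cache_dir: str,
    fetch: Callable[[str], bytes] = http_get,
) -> bytes:
    """Return the bytes for url, downloading only on a cache miss."""
    # MD5 serves here as a filename key, not as a security measure.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.jpg"
    if path.exists():
        return path.read_bytes()
    data = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

Repeated benchmark runs then read from disk instead of downloading the images again.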
The raw results CSV, conversion script, and per-request JSON data file are included alongside the post.