Open-weight model inference on dedicated UK hardware: benchmark results

Throughput and latency measurements across 23 instruct models, 8 embedding models, 3 text-eval models, and 7 TTS models. 738 completed jobs, zero errors, on private AWS infrastructure in eu-west-2.

738 completed jobs

23 instruct models

41 total models

0 errors

Setup

Marigold is a private AI inference platform running on dedicated AWS hardware in London. All inference runs inside a private VPC; data does not leave the account. This is the first published benchmark of the full platform.

Three hardware tiers handle different workload classes:

Tier	Instance	GPU	VRAM	Workloads
gpu-sm	g4dn.2xlarge	NVIDIA T4	16GB	Instruct up to ~14B
gpu-lrg	g5.4xlarge	NVIDIA A10G	24GB	Instruct 27B
cpu	r5.xlarge	--	--	Embeddings, TTS, text-eval

Methodology: 8 rounds of 5 sequential requests per model at max_tokens=200. Throughput is mean output tokens per second. Latency is wall-clock time from job submission to result available on a warm worker, including queue wait time.
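The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for this run: the function name `summarise`, the tuple layout, and the choice to average per-request rates (rather than dividing total tokens by total time) are assumptions, and the sample data is synthetic.

```python
from statistics import mean, median


def summarise(rounds):
    """Aggregate one model's results.

    rounds: list of rounds, each a list of
    (output_tokens, generation_seconds, wall_clock_seconds) tuples.
    The tuple layout is hypothetical, chosen for this sketch.
    """
    requests = [req for rnd in rounds for req in rnd]
    # Assumption: throughput is the mean of per-request tok/s values.
    tok_per_s = [tokens / gen_s for tokens, gen_s, _ in requests]
    return {
        "requests": len(requests),
        "mean_tok_per_s": mean(tok_per_s),
        # p50 latency: median wall-clock time from submission to result.
        "p50_latency_s": median(wall_s for _, _, wall_s in requests),
    }


# Synthetic data: 8 rounds of 5 requests, 200 tokens each.
rounds = [[(200, 20.0, 21.0)] * 5 for _ in range(8)]
print(summarise(rounds))
```

Keeping generation time and wall-clock time as separate fields lets throughput exclude queue wait while latency includes it, which is one plausible reading of the definitions above.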

Instruct models: throughput

All models on the T4 gpu-sm tier unless noted.

Output tok/s vs parameter count, coloured by model family. Square = A10G gpu-lrg tier. Hollow circle = high variance across rounds.

Throughput decreases with parameter count within a hardware tier. The 7B-9B band sits between 9 and 13 tok/s on the T4. The 12B-14B band drops to 6-7 tok/s. The 27B model on the A10G returns 7.3 tok/s, faster than the 14B models on the T4, because it runs on appropriately sized hardware.

Instruct models: latency

Wall-clock p50 from job submission to result available, by parameter band, on a warm worker at max_tokens=200. Values shown are band midpoints; see the table for the full ranges.

p50 latency by size band. Cold start (EC2 launch + EFS model load) adds 90-180 seconds on first request after idle; warm-worker numbers shown here.

The API is async: a submission returns a message_id and the result is retrieved from DynamoDB once processing completes. Queue wait time is included in the figures above. For workloads that submit many jobs at once, the async model is an advantage rather than a constraint: the client can enqueue every job up front and collect results as they complete, instead of holding a connection open per request.
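The submit-then-retrieve pattern might look something like the sketch below from the client side. The real endpoints and result schema are not documented in this post, so `wait_for_results`, `InMemoryBackend`, and their parameters are hypothetical stand-ins; only the idea of a `message_id` returned at submission comes from the text above.

```python
import time
from typing import Any, Callable, Optional


def wait_for_results(
    message_ids: list[str],
    fetch_result: Callable[[str], Optional[Any]],
    poll_interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until every message_id has a result, or raise on timeout.

    fetch_result is a hypothetical callable that returns None while a
    job is still queued or running, and the result once it is stored.
    """
    deadline = time.monotonic() + timeout_s
    pending = set(message_ids)
    results: dict[str, Any] = {}
    while pending:
        for message_id in list(pending):
            result = fetch_result(message_id)
            if result is not None:
                results[message_id] = result
                pending.discard(message_id)
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{len(pending)} job(s) still pending")
        sleep(poll_interval_s)
    return results


class InMemoryBackend:
    """Stand-in for the real service, for illustration only."""

    def __init__(self, polls_until_ready: int = 2):
        self._jobs: dict[str, dict] = {}
        self._polls_until_ready = polls_until_ready

    def submit(self, prompt: str) -> str:
        message_id = f"msg-{len(self._jobs)}"
        self._jobs[message_id] = {"prompt": prompt, "polls": 0}
        return message_id

    def fetch_result(self, message_id: str) -> Optional[str]:
        job = self._jobs[message_id]
        job["polls"] += 1
        if job["polls"] < self._polls_until_ready:
            return None  # still "processing"
        return f"completion for: {job['prompt']}"
```

A batch client would submit every prompt first, keep the returned IDs, and then call `wait_for_results` once, so queue wait overlaps across jobs rather than accumulating per request.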

Embedding models

All on the CPU tier (r5.xlarge). Latency is p50 at the API.

p50 latency in milliseconds, CPU tier.

Text-to-speech models

Seven languages, all CPU tier, all Facebook MMS family.

p50 latency by language. Variation reflects phoneme complexity and average word length across languages.

Text-eval models

Hardware ceiling

The practical single-GPU ceiling on the A10G (24GB) is approximately 27B parameters at 4-bit quantisation. Models in the 30B-32B range exceed available VRAM after CUDA context overhead. Support for 30B+ models on a multi-GPU tier is in development; results will follow as a separate post.
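For a rough sense of scale, the weights-only footprint can be estimated from parameter count and bits per weight. This is a lower bound under simplifying assumptions: it ignores quantisation metadata, KV cache, activations, and the CUDA context, all of which consume the remaining headroom. The function name is illustrative.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only memory in GB (10^9 bytes).

    Excludes KV cache, activations, quantisation metadata, and CUDA
    context, so real usage is higher than this figure.
    """
    return params_billion * bits_per_weight / 8


for size in (27, 32):
    print(f"{size}B at 4-bit: {weight_footprint_gb(size, 4):.1f} GB of weights")
print(f"27B at 16-bit: {weight_footprint_gb(27, 16):.1f} GB of weights")
```

At 4 bits, 27B parameters need about 13.5 GB for weights and 32B about 16 GB, while 27B at 16-bit would need about 54 GB, which is why quantisation is required to fit the 24GB card at all. The estimate does not by itself predict where the ceiling falls; the observation that 30B-32B models do not fit comes from the runs described above.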

Methodology notes

This run used max_tokens=200 with sequential requests (5 per round, 8 rounds). Throughput at short generation lengths understates sustained decode rate; a follow-up run at max_tokens=512 with concurrent burst testing is planned.
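One way to see why short generations understate the sustained rate is a simple amortisation model: each request pays a fixed cost (prompt processing, scheduling) before decoding begins, and that cost is spread over fewer tokens at max_tokens=200 than at 512. The model and the numbers below are illustrative assumptions, not measurements from this benchmark.

```python
def effective_tok_per_s(
    output_tokens: int,
    fixed_overhead_s: float,
    decode_tok_per_s: float,
) -> float:
    """Observed throughput when a fixed per-request overhead is
    amortised over the generated tokens (simplified model)."""
    total_s = fixed_overhead_s + output_tokens / decode_tok_per_s
    return output_tokens / total_s


# Hypothetical values: 2 s of fixed overhead, 10 tok/s sustained decode.
for n in (200, 512):
    print(n, round(effective_tok_per_s(n, 2.0, 10.0), 2))
```

Under these assumed values the observed rate moves closer to the sustained 10 tok/s as the generation length grows, which is the effect the planned max_tokens=512 run is intended to capture.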

Implementation notes

All inference runs on HuggingFace transformers with PyTorch and CUDA for GPU-accelerated models. Embedding, TTS, and text-eval models run on the same stack on CPU. The unified runtime means a single worker image and weight cache cover all model types.

vLLM was not used. vLLM offers better throughput for text generation workloads -- particularly continuous batching under concurrent load -- but is limited to that use case. Embedding models, TTS, image generation, and text classification are outside its scope. A platform serving all of these from a single API cannot adopt vLLM as a universal runtime. The throughput numbers in this benchmark reflect transformers sequential inference; vLLM integration for the instruct tier specifically is on the roadmap and would materially improve the tok/s figures for that model class.