Setup
Marigold is a private AI inference platform running on dedicated AWS hardware in London. All inference runs inside a private VPC; data does not leave the account. This is the first published benchmark of the full platform.
Three hardware tiers handle different workload classes:
| Tier | Instance | GPU | VRAM | Workloads |
|---|---|---|---|---|
| gpu-sm | g4dn.2xlarge | NVIDIA T4 | 16GB | Instruct up to ~14B |
| gpu-lrg | g5.4xlarge | NVIDIA A10G | 24GB | Instruct 27B |
| cpu | r5.xlarge | -- | -- | Embeddings, TTS, text-eval |
Methodology: 8 rounds of 5 sequential requests per model (40 requests in total) at max_tokens=200. Throughput is the mean output tokens per second. Latency is wall-clock time from job submission to result available on a warm worker, including queue wait time.
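The aggregation described above can be sketched as follows. This is a minimal illustration, not the platform's benchmark harness: the `RequestRecord` layout and both function names are hypothetical, and it assumes throughput is computed per request (output tokens divided by that request's wall-clock time) and then averaged, which the methodology does not spell out.

```python
from dataclasses import dataclass
from statistics import mean, median

ROUNDS = 8               # from the methodology above
REQUESTS_PER_ROUND = 5   # sequential requests in each round


@dataclass
class RequestRecord:
    """One benchmark request (hypothetical record layout)."""
    output_tokens: int    # tokens generated, capped by max_tokens=200
    wall_seconds: float   # submission to result available, queue wait included


def mean_throughput(records):
    """Mean of per-request output tokens per second (assumed definition)."""
    return mean(r.output_tokens / r.wall_seconds for r in records)


def p50_latency(records):
    """Median wall-clock latency in seconds."""
    return median(r.wall_seconds for r in records)


# Illustrative numbers only, not measured results.
sample = [
    RequestRecord(output_tokens=200, wall_seconds=20.0),
    RequestRecord(output_tokens=200, wall_seconds=25.0),
    RequestRecord(output_tokens=100, wall_seconds=10.0),
]
print(f"throughput: {mean_throughput(sample):.2f} tok/s")
print(f"p50 latency: {p50_latency(sample):.1f} s")
```

A full run would collect `ROUNDS * REQUESTS_PER_ROUND` records per model before aggregating.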
Instruct models: throughput
All models on the T4 gpu-sm tier unless noted.
Output tok/s vs parameter count, coloured by model family. Square = A10G gpu-lrg tier. Hollow circle = high variance across rounds.
Throughput decreases with parameter count within a hardware tier. The 7B-9B band sits between 9 and 13 tok/s on the T4, and the 12B-14B band drops to 6-7 tok/s. The 27B model on the A10G returns 7.3 tok/s, faster than the 14B models on the T4, because it runs on appropriately sized hardware.
Instruct models: latency
Wall-clock p50 from job submission to result available, by parameter band,
on a warm worker at max_tokens=200. Values shown are band midpoints;
see the table for the full ranges.
p50 latency by size band. Cold start (EC2 launch + EFS model load) adds 90-180 seconds to the first request after an idle period; the numbers shown here are for warm workers.
The API is async: a submission returns a message_id, and the result is retrieved from DynamoDB once processing completes. Queue wait time is included in the figures above. For workloads that submit many jobs at once, the async model is an advantage rather than a constraint: the client can submit the whole batch up front and collect results as they complete, without holding a connection open per job.
Embedding models
All on the CPU tier (r5.xlarge). Latency is p50 at the API.
p50 latency in milliseconds, CPU tier.
Text-to-speech models
Seven languages, all CPU tier, all Facebook MMS family.
p50 latency by language. Variation reflects phoneme complexity and average word length across languages.
Text-eval models
Hardware ceiling
The practical single-GPU ceiling on the A10G (24GB) is approximately 27B parameters at 4-bit quantisation. Models in the 30B-32B range exceed available VRAM after CUDA context overhead. Support for 30B+ models on a multi-GPU tier is in development; results will follow as a separate post.
Methodology notes
This run used max_tokens=200 with sequential requests (5 per round, 8 rounds). Throughput at short generation lengths understates the sustained decode rate, because fixed per-request overhead is spread over fewer output tokens; a follow-up run at max_tokens=512 with concurrent burst testing is planned.
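One way to see why a short generation understates the sustained decode rate is a simple model in which each request pays a fixed overhead (for example prompt prefill and queue wait) before decoding at a constant rate. The function name and the numbers below are illustrative assumptions, not measurements from this benchmark.

```python
def measured_throughput(output_tokens, decode_rate, overhead_seconds):
    """Tokens per second over the whole request under a fixed-overhead model."""
    total_seconds = overhead_seconds + output_tokens / decode_rate
    return output_tokens / total_seconds


# Assumed values for illustration: 10 tok/s sustained decode, 5 s fixed overhead.
short_run = measured_throughput(200, decode_rate=10.0, overhead_seconds=5.0)
long_run = measured_throughput(512, decode_rate=10.0, overhead_seconds=5.0)

print(f"max_tokens=200: {short_run:.2f} tok/s")
print(f"max_tokens=512: {long_run:.2f} tok/s")
```

Under these assumed values the 200-token run reports 8.0 tok/s against a sustained rate of 10 tok/s, and the 512-token run moves closer to the sustained rate without reaching it.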
Implementation notes
All inference runs on Hugging Face transformers with PyTorch, using CUDA for the GPU-accelerated models. Embedding, TTS, and text-eval models run on the same stack on CPU. The unified runtime means a single worker image and weight cache cover all model types.
vLLM was not used. vLLM offers better throughput for text generation workloads, particularly through continuous batching under concurrent load, but it is built around that use case. Its support for embedding and classification models is narrower than that of transformers, and TTS and image generation are outside its scope. A platform serving all of these from a single API cannot adopt vLLM as a universal runtime. The throughput numbers in this benchmark reflect sequential inference with transformers; vLLM integration for the instruct tier specifically is on the roadmap and would materially improve the tok/s figures for that model class.