About these results:
Rankings reflect a single benchmark run with default parameters. Model performance varies with prompts, settings, and API versions.
These are not absolute rankings — run `evalytic bench` on your own use case for representative results.
Showcase 01 · Text2Img · 5 Models · 2 Prompts
Do I really need the flagship model?
flux-schnell delivers 88% of the winner's quality at 92% less cost. Closing the remaining 0.6-point gap costs 13× more per image.
Model Rankings
recraft-v3 wins at 4.74 — but flux-schnell scores 4.16 at $0.003 per image. That's 1,387 points per dollar vs 119. For most production workloads, the cheapest model is good enough.
Cost Efficiency (Score per Dollar)
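The score-per-dollar comparison above can be reproduced with a short sketch. The flux-schnell price ($0.003/image) comes from the text; the recraft-v3 price (~$0.0398/image) is inferred from its quoted 119 points per dollar and is an assumption, not a published rate.

```python
# Sketch: reproduce the score-per-dollar numbers quoted above.
# recraft-v3's cost is back-calculated from 4.74 / 119 pts-per-$ (an assumption).
models = {
    "recraft-v3":   {"score": 4.74, "cost_per_image": 0.0398},
    "flux-schnell": {"score": 4.16, "cost_per_image": 0.003},
}

def points_per_dollar(score: float, cost: float) -> float:
    """Quality score divided by per-image cost."""
    return score / cost

for name, m in models.items():
    print(f"{name}: {points_per_dollar(m['score'], m['cost_per_image']):.0f} points/$")

# Headline ratios: quality retained vs. cost multiple to close the gap.
quality_ratio = models["flux-schnell"]["score"] / models["recraft-v3"]["score"]
cost_ratio = models["recraft-v3"]["cost_per_image"] / models["flux-schnell"]["cost_per_image"]
print(f"flux-schnell keeps {quality_ratio:.0%} of the quality at 1/{cost_ratio:.0f} of the cost")
```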
Dimension Breakdown
Gallery
When prompts are straightforward, quality differences vanish — all 5 models hit 5.0. Differentiation happens on harder prompts like the neon sign above.
Showcase 02 · Img2Img · 4 Models · 1 Prompt · Face Metric
Why do users say "that's not me"?
One model completely destroys faces (similarity 0.03). ArcFace cosine similarity confirms: 3 models preserve identity, 1 fails catastrophically.
Model Rankings
The VLM judge's identity_preservation scores and ArcFace cosine similarity correlate near-perfectly. Two independent methods — a vision-language model and a deterministic face embedding — confirm the same ranking.
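The deterministic half of that check is just cosine similarity between face embeddings. A minimal sketch, assuming the 512-d embeddings have already been extracted by an ArcFace-style model (the toy vectors below are random stand-ins, illustrative only):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 512-d embeddings standing in for real ArcFace outputs.
rng = np.random.default_rng(0)
source = rng.normal(size=512)
preserved = source + 0.1 * rng.normal(size=512)  # small edit, identity kept
destroyed = rng.normal(size=512)                 # unrelated face

print(cosine_similarity(source, preserved))  # close to 1.0
print(cosine_similarity(source, destroyed))  # typically near 0
```

A score near 0.03, as in the failing model above, means the edited face is essentially uncorrelated with the source identity.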
Gallery — "Place this person in a professional office with bookshelves"
Showcase 03 · Text2Img · 1 Model · 5 Prompt Pairs
Which prompt actually performs better?
"More detail = better results" is a myth. 1 out of 5 stuffed prompts actually scored worse than its minimal version. Test your prompts, don't guess.
Simple vs Stuffed Prompt Comparison
| Subject | Simple score | Stuffed score | Delta | Verdict |
|---|---|---|---|---|
| Cat on windowsill | 4.33 | 4.67 | +0.34 | Helped |
| Coffee shop | 4.33 | 4.33 | +0.00 | No change |
| Neon sign | 2.33 | 5.00 | +2.67 | Helped! |
| Mountain lake | 5.00 | 4.33 | -0.67 | Hurt |
| Robot reading | 3.00 | 4.67 | +1.67 | Helped! |
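The Delta and Verdict columns follow mechanically from the per-prompt scores. A sketch of that reduction, using the scores from the table:

```python
# Recompute Delta and Verdict from the (simple, stuffed) score pairs above.
pairs = {
    "Cat on windowsill": (4.33, 4.67),
    "Coffee shop":       (4.33, 4.33),
    "Neon sign":         (2.33, 5.00),
    "Mountain lake":     (5.00, 4.33),
    "Robot reading":     (3.00, 4.67),
}

def verdict(delta: float, eps: float = 1e-9) -> str:
    """Classify a score delta, with a tolerance for float noise."""
    if delta > eps:
        return "Helped"
    if delta < -eps:
        return "Hurt"
    return "No change"

for subject, (simple, stuffed) in pairs.items():
    delta = stuffed - simple
    print(f"{subject}: {delta:+.2f} -> {verdict(delta)}")
```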
Gallery — Simple vs Stuffed
Showing the two most dramatic pairs — the biggest improvement and the only prompt where extra detail hurt.
Showcase 04 · Img2Img · 3 Models · 1 Prompt
Is my product photo still my product?
AI edits warp shapes, lose logos, change colors. One model was blocked by content policy for a sneaker photo. Silent failures are everywhere — measure, don't assume.
Model Rankings
Dimension Breakdown
Gallery — "Place this product on a marble kitchen countertop"
Run your own benchmark.
One command. Real scores. Your models, your prompts, your data.