evalytic

About these results: Rankings reflect a single benchmark run with default parameters. Model performance varies with prompts, settings, and API versions. These are not absolute rankings — run evaly bench on your own use case for representative results.

Showcase 01 · Text2Img · 5 Models · 2 Prompts

Do I really need the flagship model?

flux-schnell delivers 88% of the winner's quality at 92% less cost. The 0.6 point gap costs 13× more per image to close.

5 models · 10 images · $0.41 total cost · 1m 44s duration · 100% success

Model Rankings

Rank        Model         $/img    Score  VQ   PA   TR   Score/$
Winner      recraft-v3    $0.04    4.74   5.0  5.0  5.0    118.5
#2          flux-dev      $0.025   4.68   5.0  5.0  5.0    187.2
#3          ideogram-v3   $0.08    4.63   4.5  5.0  5.0     57.9
#4          flux-pro      $0.05    4.26   3.5  5.0  5.0     85.2
Best Value  flux-schnell  $0.003   4.16   5.0  4.5  3.5  1,387

(VQ = Visual Quality, PA = Prompt Adherence, TR = Text Rendering)
0.6 point gap. 13× price gap.

recraft-v3 wins at 4.74 — but flux-schnell scores 4.16 at $0.003 per image. That's 1,387 points per dollar vs 119. For most production workloads, the cheapest model is good enough.

Cost efficiency ratio: 11.7× (flux-schnell vs recraft-v3)

Cost Efficiency (Score per Dollar)

flux-schnell  1,387
flux-dev        187
recraft-v3      119
flux-pro         85
ideogram-v3      58

Dimension Breakdown

Visual Quality ★ differentiator (high variance)
recraft-v3 5.0 · flux-dev 5.0 · flux-schnell 5.0 · ideogram-v3 4.5 · flux-pro 3.5

Prompt Adherence ≈ near ceiling
flux-dev 5.0 · flux-pro 5.0 · recraft-v3 5.0 · ideogram-v3 5.0 · flux-schnell 4.5

Text Rendering ★ differentiator (high variance)
flux-dev 5.0 · flux-pro 5.0 · recraft-v3 5.0 · ideogram-v3 5.0 · flux-schnell 3.5
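The ★/≈ labels reflect score variance across models: dimensions where models disagree are differentiators, dimensions where everyone scores near 5.0 are at ceiling. A hedged sketch (the 0.1 variance cutoff is an illustrative assumption, not evalytic's documented rule):

```python
from statistics import pvariance

# Per-dimension scores from the breakdown above, in the order
# recraft-v3, flux-dev, flux-schnell, ideogram-v3, flux-pro.
dims = {
    "visual_quality":   [5.0, 5.0, 5.0, 4.5, 3.5],
    "prompt_adherence": [5.0, 5.0, 4.5, 5.0, 5.0],
    "text_rendering":   [5.0, 5.0, 3.5, 5.0, 5.0],
}

# Hypothetical cutoff: population variance above 0.1 marks a differentiator.
labels = {
    name: "differentiator" if pvariance(scores) > 0.1 else "near ceiling"
    for name, scores in dims.items()
}
for name, label in labels.items():
    print(f"{name:17s} {label}")
```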

Gallery

"A neon sign reading 'OPEN 24/7' in a foggy downtown street at 2am"
flux-schnell   3.67 / 5.0
flux-dev       5.0 / 5.0
flux-pro       4.0 / 5.0
recraft-v3     5.0 / 5.0
ideogram-v3    4.67 / 5.0
"White sneaker on a marble countertop, soft shadows, product photography"

When prompts are straightforward, quality differences vanish — all 5 models hit 5.0. Differentiation happens on harder prompts like the neon sign above.

flux-schnell   5.0 / 5.0
flux-dev       5.0 / 5.0
flux-pro       5.0 / 5.0
recraft-v3     5.0 / 5.0
ideogram-v3    5.0 / 5.0

Showcase 02 · Img2Img · 4 Models · 1 Prompt · Face Metric

Why do users say "that's not me"?

One model completely destroys faces (similarity 0.03). ArcFace cosine similarity confirms: 3 models preserve identity, 1 fails catastrophically.

Warning: flux-dev-i2i: face destroyed (similarity 0.034)

4 models · 4 images · $0.14 total cost · 50s duration · ArcFace face metric

Model Rankings

Rank                  Model          $/img   Score  Face sim.  Identity Pres.
Tied #1 · Best Value  seedream-edit  $0.03   5.00   0.951      5.0
Tied #1               reve-edit      $0.04   5.00   0.954      5.0
#3                    flux-kontext   $0.05   5.00   0.854      5.0
Face Destroyed        flux-dev-i2i   $0.025  1.00   0.034      1.0
VLM + ArcFace agreement: r = 0.99

The VLM judge's identity_preservation scores and ArcFace cosine similarity correlate near-perfectly. Two independent methods — a vision-language model and a deterministic face embedding — confirm the same ranking.
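Both checks are straightforward to reproduce: ArcFace similarity is the cosine between two face embeddings, and the agreement figure is a Pearson correlation over the four models' score pairs. A minimal numpy sketch (embedding extraction itself, e.g. via an ArcFace model, is omitted; the score pairs are copied from the rankings above):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embeddings (e.g. 512-d ArcFace vectors)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# VLM identity_preservation scores and ArcFace similarities for
# seedream-edit, reve-edit, flux-kontext, flux-dev-i2i (from above).
vlm  = np.array([5.0, 5.0, 5.0, 1.0])
face = np.array([0.951, 0.954, 0.854, 0.034])

# Pearson correlation between the two independent judges (~0.99 here).
r = np.corrcoef(vlm, face)[0, 1]
print(f"r = {r:.2f}")
```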

Gallery — "Place this person in a professional office with bookshelves"

Input          Original
seedream-edit  5.0 · face 0.951
reve-edit      5.0 · face 0.954
flux-kontext   5.0 · face 0.854
flux-dev-i2i   1.0 · face 0.034

Showcase 03 · Text2Img · 1 Model · 5 Prompt Pairs

Which prompt actually performs better?

"More detail = better results" is a myth. 1 out of 5 stuffed prompts actually scored worse than its minimal version. Test your prompts, don't guess.

10 images · $0.04 total cost · 72s duration · model: flux-schnell

Simple vs Stuffed Prompt Comparison

Subject            Simple  Stuffed  Delta   Verdict
Cat on windowsill   4.33    4.67    +0.34   Helped
Coffee shop         4.33    4.33    +0.00   No change
Neon sign           2.33    5.00    +2.67   Helped!
Mountain lake       5.00    4.33    -0.67   Hurt
Robot reading       3.00    4.67    +1.67   Helped!
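The Delta and Verdict columns follow mechanically from the two scores; a minimal sketch (the sign-of-delta verdict rule is an assumption for illustration):

```python
# (simple score, stuffed score) per subject, copied from the table above.
pairs = {
    "Cat on windowsill": (4.33, 4.67),
    "Coffee shop":       (4.33, 4.33),
    "Neon sign":         (2.33, 5.00),
    "Mountain lake":     (5.00, 4.33),
    "Robot reading":     (3.00, 4.67),
}

deltas = {s: round(stuffed - simple, 2) for s, (simple, stuffed) in pairs.items()}

for subject, delta in deltas.items():
    # Hypothetical verdict rule: the sign of the delta decides.
    verdict = "Helped" if delta > 0 else "Hurt" if delta < 0 else "No change"
    print(f"{subject:17s} {delta:+.2f}  {verdict}")
```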

Gallery — Simple vs Stuffed

Showing the two most dramatic pairs — the biggest improvement and the only prompt where extra detail hurt.

Neon Sign — biggest improvement (+2.67)
Simple:  "A neon sign reading 'OPEN 24/7' in a foggy downtown street at 2am" · 2.33 / 5.0
Stuffed: "A buzzing neon sign spelling 'OPEN 24/7' in electric blue and hot pink..." · 5.00 / 5.0 (+2.67)

Mountain Lake — extra detail hurt (-0.67)
Simple:  "A mountain lake at sunrise" · 5.00 / 5.0
Stuffed: "A pristine alpine lake at sunrise with mirror-perfect reflections..." · 4.33 / 5.0 (-0.67)

Showcase 04 · Img2Img · 3 Models · 1 Prompt

Is my product photo still my product?

AI edits warp shapes, lose logos, change colors. One model was blocked by content policy for a sneaker photo. Silent failures are everywhere — measure, don't assume.

Error: seedream-edit: blocked by content policy (0% success)

3 models · 3 images · $0.09 total cost · 75s duration

Model Rankings

Rank             Model          $/img   Score  Success  VQ   Input Fid.  Transform.  Artifact Det.
Winner           flux-kontext   $0.05   5.00   100%     5.0  5.0         5.0         5.0
#2 · Best Value  reve-edit      $0.04   4.00   100%     5.0  2.0         4.0         5.0
Blocked          seedream-edit  $0.03   0.00   0%       content policy violation (sneaker product photo blocked)

Dimension Breakdown

Input Fidelity ★ key differentiator
flux-kontext 5.0 · reve-edit 2.0 · seedream-edit fail

Visual Quality ≈ ceiling for successful models
flux-kontext 5.0 · reve-edit 5.0 · seedream-edit fail

Gallery — "Place this product on a marble kitchen countertop"

Input            Product photo
flux-kontext 🏆  5.0 / 5.0
reve-edit        4.0 / 5.0
seedream-edit    Failed · content policy violation

Run your own benchmark.

One command. Real scores. Your models, your prompts, your data.

# Install & setup
pip install evalytic
evaly init
# Run your first benchmark
evaly bench -y
# Or compare specific models
evaly bench -m flux-schnell -m flux-pro \
-p "A cat on a windowsill" -o report.html