Evalytic

Pytest for AI outputs.

Evaluate images, text, RAG, and agents with LLM judges, local metrics, and CI gates.

Know if your AI is good before your users tell you.

Prompt: "A neon sign reading 'MIDNIGHT CAFE' above a door in a rainy Tokyo alley at night"

Model         Cost    Visual  Prompt  Text  Time  Notes
flux-dev      $0.025  3.0     4.0     3.0   2.6s
flux-schnell  $0.003  3.0     5.0     5.0   1.2s  Winner, 8x cheaper
sdxl          $0.010  3.0     3.0     1.0   2.6s

Visual evaluation is the most mature public workflow today. Text, RAG, and agent support are now available in the same SDK.

$ pip install evalytic

What is Evalytic?

One SDK for evaluating images, text, RAG, and agents.

Multi-Modal Eval

Score image generations, text outputs, grounded RAG answers, and agent runs in one workflow.

LLM-as-Judge

Use Gemini, GPT, Claude, and consensus judging when one score is not enough.
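
One way to picture consensus judging is as a simple aggregation over several judges' scores. The sketch below is illustrative only: `consensus_score`, the `max_spread` parameter, and the judge names used as dictionary keys are hypothetical and are not taken from the Evalytic API. It assumes each judge returns a score on the same scale, takes the median, and flags the item for review when the judges disagree by more than a chosen spread.

```python
from statistics import median


def consensus_score(judge_scores: dict[str, float], max_spread: float = 1.0) -> dict:
    """Combine per-judge scores into one score and flag strong disagreement.

    Hypothetical helper: the median is one possible consensus rule,
    not necessarily the one Evalytic uses.
    """
    if not judge_scores:
        raise ValueError("need at least one judge score")
    values = list(judge_scores.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),              # robust to a single outlier judge
        "spread": spread,                     # gap between highest and lowest judge
        "needs_review": spread > max_spread,  # disagreement beyond tolerance
    }


# Illustrative scores, not real judge output.
result = consensus_score({"gemini": 4.0, "gpt": 5.0, "claude": 4.0})
print(result)
```

The median is used here because a single outlier judge cannot drag it far, which a mean would allow.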

Local Metrics & Embeddings

Mix judge scores with deterministic metrics and embeddings that run on your machine.
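
To make "deterministic metrics and embeddings" concrete, here is a minimal sketch of two common local measures: token-level F1 against a reference string, and cosine similarity between two embedding vectors. Both functions are generic illustrations written for this page, not Evalytic functions, and the vectors are assumed to come from whatever embedding model you already run locally.

```python
import math
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference (whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


print(token_f1("Returns are now 45 days.", "Refund window is now 45 days."))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Both measures return the same value on every run, which is what makes them useful alongside judge scores that can vary between calls.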

Compare Reports

Diff baseline versus candidate runs to catch regressions before rollout.
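
A baseline-versus-candidate diff can be sketched as a per-metric delta with a tolerance. The function name `compare_runs`, the flat `{metric: score}` report shape, and the `tolerance` value below are assumptions made for illustration; the actual Evalytic report format may differ. The sketch also assumes higher scores are better. The baseline values are the RAG scores from the example on this page, and the candidate values are invented.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, dict]:
    """Return per-metric deltas for metrics present in both runs.

    A metric counts as regressed when the candidate falls below the
    baseline by more than `tolerance` (hypothetical rule).
    """
    diff = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[metric] - baseline[metric]
        diff[metric] = {
            "baseline": baseline[metric],
            "candidate": candidate[metric],
            "delta": round(delta, 4),
            "regressed": delta < -tolerance,
        }
    return diff


baseline = {"faithfulness": 1.00, "answer_relevancy": 0.92}
candidate = {"faithfulness": 0.90, "answer_relevancy": 0.93}  # made-up candidate run

report = compare_runs(baseline, candidate)
for metric, row in report.items():
    status = "REGRESSED" if row["regressed"] else "ok"
    print(f"{metric:18} {row['baseline']:.2f} -> {row['candidate']:.2f}  {status}")
```

The tolerance keeps small run-to-run noise from being reported as a regression.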

CI Quality Gates

Fail builds with bench thresholds or metric thresholds for text, RAG, and agent reports.
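
The idea behind a metric-threshold gate can be sketched in a few lines: compare each score with its minimum and turn any failure into a non-zero exit code. `check_gate`, `gate_exit_code`, the score dictionary, and the threshold values are all hypothetical; this is not Evalytic's gate implementation or its configuration format.

```python
def check_gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one human-readable failure per metric below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            # A metric that is required but absent should fail the gate, not pass silently.
            failures.append(f"{metric}: missing from report")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


def gate_exit_code(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Print failures and return 1 if any threshold is violated, else 0."""
    failures = check_gate(scores, thresholds)
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0


# Illustrative values; a CI script would pass this code to sys.exit().
exit_code = gate_exit_code(
    scores={"faithfulness": 1.00, "answer_relevancy": 0.92},
    thresholds={"faithfulness": 0.90, "answer_relevancy": 0.95},
)
print("exit code:", exit_code)
```

Treating a missing metric as a failure is a deliberate choice here, so that a broken evaluation step cannot let a build through.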

Reports for Humans and Automation

Terminal, JSON, HTML, and browser review make results useful in both local work and CI.

Use Cases

Model selection, regression detection, CI gates, and evaluation loops across AI outputs.

Model Selection

Choose the best model for your workflow with side-by-side visual, text, or agent evaluations.
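
As a rough illustration of side-by-side selection, the sketch below ranks the three models from the demo at the top of this page by the unweighted mean of their visual, prompt, and text scores, breaking ties on cost. The unweighted mean and the `rank_models` helper are assumptions made for this example; how Evalytic actually computes an overall score is not stated here.

```python
from statistics import mean

# Scores and per-image costs copied from the demo prompt on this page.
results = {
    "flux-dev": {"cost": 0.025, "scores": {"visual": 3.0, "prompt": 4.0, "text": 3.0}},
    "flux-schnell": {"cost": 0.003, "scores": {"visual": 3.0, "prompt": 5.0, "text": 5.0}},
    "sdxl": {"cost": 0.010, "scores": {"visual": 3.0, "prompt": 3.0, "text": 1.0}},
}


def rank_models(results: dict[str, dict]) -> list[tuple[str, float, float]]:
    """Rank models by mean dimension score (descending), then by cost (ascending)."""
    ranked = [
        (name, round(mean(list(row["scores"].values())), 2), row["cost"])
        for name, row in results.items()
    ]
    ranked.sort(key=lambda item: (-item[1], item[2]))
    return ranked


ranked = rank_models(results)
for name, overall, cost in ranked:
    print(f"{name:13} overall {overall:.2f}  cost ${cost:.3f}")
```

In practice you might weight the dimensions differently, for example giving text rendering more weight when the prompts ask for signage.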

Regression Detection

Catch drops from model upgrades, prompt changes, or agent workflow edits before users notice.

CI/CD Quality Gates

Turn reports into pass or fail decisions with thresholds that match the report type you are shipping.

Reference-Free Evaluation

Score outputs when there is no gold answer yet, using judges, faithfulness checks, and task-specific criteria.

Reference-Based Evaluation

Use expected answers, rubrics, and deterministic metrics when you do have a target to measure against.

Human Review When Needed

Start with fast automated scoring, then open reports for the edge cases that deserve a human pass.

How It Works

Your prompts, outputs, and traces go in. Scores, reports, comparisons, and gates come out.

1. Inputs

Bring prompts, generated outputs, retrieved context, or tool traces from the workflow you want to test.

2. Judges + Metrics

Run LLM judges, deterministic metrics, and embeddings together instead of picking just one style of evaluation.

3. Scores + Reports

Review terminal output locally or export JSON and HTML reports for teammates, dashboards, and CI.

4. Compare + Gate

Compare candidate runs against a baseline and fail builds when the scores move the wrong way.

Visual evaluation is still the best-proven public workflow today. Text, RAG, and agent support now ship in the same SDK, so teams can keep one evaluation stack as their product grows.

Quick Examples

One visual benchmark, one RAG evaluation.

Visual benchmark
$ evaly bench -m flux-schnell -m flux-dev -p prompts.json

2 models x 5 prompts = 10 images
Generating ████████████████████ 100%
Scoring ████████████████████ 100%

flux-schnell overall 4.3 cost $0.03
flux-dev overall 3.3 cost $0.25

Winner: flux-schnell

RAG eval
$ evaly rag eval --query "What changed in the refund policy?" \
    --response "Returns are now 45 days." \
    --context "Refund window updated from 30 to 45 days."

faithfulness 1.00
answer_relevancy 0.92

Result: grounded and relevant

Start scoring in five minutes.

Pick a path for visual benchmarking, RAG evaluation, or text and agent workflows.

$ pip install evalytic