Test Results Dashboard

Compare LLM performance across 10 test categories. Select models to analyze.

Test Category

Sort By

Model Score Card

Select Models to Compare

GPT-4o

Claude Sonnet 4

Claude Opus 4

Gemini 2.5 Pro

DeepSeek V3

Llama 4

Capability Radar

Interactive spider chart. Select one model to see its capability profile.

Cross-Model Comparison

Scores for the selected model compared with every other model on the same test.

Leaderboard

All models ranked by average score across all test categories.

Model	Avg Score	Best	Worst	Verdict
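
The ranking described above could be computed roughly as follows. This is a minimal sketch, not the dashboard's actual implementation: the function name `build_leaderboard`, the input shape (a mapping of model to per-category scores), the placeholder model names, and the example numbers are all assumptions made for illustration. The Verdict column is left out because the page does not say how it is derived.

```python
from statistics import mean


def build_leaderboard(scores):
    """Rank models by average score (hypothetical helper).

    scores: {model_name: {category_name: score}}
    Returns a list of row dicts sorted by average score, highest first.
    """
    rows = []
    for model, by_category in scores.items():
        rows.append({
            "model": model,
            # Average over every category the model was tested on.
            "avg": round(mean(by_category.values()), 1),
            # Category names with the highest and lowest score.
            "best": max(by_category, key=by_category.get),
            "worst": min(by_category, key=by_category.get),
        })
    rows.sort(key=lambda row: row["avg"], reverse=True)
    return rows


# Made-up placeholder data, not real benchmark results.
example_scores = {
    "model-a": {"reasoning": 80, "coding": 90, "safety": 70},
    "model-b": {"reasoning": 85, "coding": 95, "safety": 90},
}

leaderboard = build_leaderboard(example_scores)
for row in leaderboard:
    print(row["model"], row["avg"], row["best"], row["worst"], sep="\t")
```

Each returned row corresponds to the Model, Avg Score, Best, and Worst columns. Models with equal averages keep their input order, since Python's sort is stable.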

Recent Runs

Latest test execution results.