Test Results Dashboard

Compare LLM performance across 10 test categories. Select models to analyze.

Model Score Card
Select Models to Compare
GPT-4o
Claude Sonnet 4
Claude Opus 4
Gemini 2.5 Pro
DeepSeek V3
Llama 4
Capability Radar
Interactive spider chart — select one model to see its capability profile.
Cross-Model Comparison
Scores for the selected model compared with all other models on the same test.
Leaderboard
All models ranked by average score across test types.
Model | Avg Score | Best | Worst | Verdict
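As a rough illustration of how such a ranking could be computed, the sketch below averages each model's scores across test types and sorts in descending order. The function name `build_leaderboard`, the input shape (model name mapped to per-test scores), and the reading of Best and Worst as the highest- and lowest-scoring test type are all assumptions, not details taken from the dashboard. The Verdict column is left out because its rule is not stated.

```python
from statistics import mean


def build_leaderboard(scores):
    """Rank models by average score across test types (hypothetical helper).

    `scores` maps a model name to a dict of {test_type: score}.
    Returns a list of row dicts sorted by average score, highest first.
    """
    rows = []
    for model, by_test in scores.items():
        if not by_test:
            # Skip models with no recorded results; an average is undefined.
            continue
        rows.append({
            "model": model,
            "avg": mean(by_test.values()),
            # Assumption: Best/Worst name the test type with the
            # highest/lowest score for this model.
            "best": max(by_test, key=by_test.get),
            "worst": min(by_test, key=by_test.get),
        })
    rows.sort(key=lambda row: row["avg"], reverse=True)
    return rows


# Placeholder data for illustration only; not real results.
example = {
    "model-a": {"reasoning": 80, "coding": 60},
    "model-b": {"reasoning": 90, "coding": 70},
}
for row in build_leaderboard(example):
    print(row["model"], row["avg"], row["best"], row["worst"])
```

Sorting on the plain mean treats every test type equally; a dashboard that weights categories differently would need a weighted average instead.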
Recent Runs
Latest test execution results.