Test Results Dashboard
Compare LLM performance across 10 test categories. Select models to analyze.
Model Score Card
Select Models to Compare
GPT-4o
Claude Sonnet 4
Claude Opus 4
Gemini 2.5 Pro
DeepSeek V3
Llama 4
Capability Radar
Interactive spider chart. Select one model to see its capability profile.
Cross-Model Comparison
Scores for the selected model compared with all other models on the same test.
Leaderboard
All models ranked by average score across test types.
| Model | Avg Score | Best | Worst | Verdict |
|---|---|---|---|---|
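As a rough illustration of how a leaderboard like this could be derived, the sketch below ranks models by their mean score across test categories and records each model's strongest and weakest category. The function name `build_leaderboard`, the input shape, and the sample model names and scores are all hypothetical placeholders, not values from this dashboard. The Verdict column is left out because its rules are not stated here.

```python
from statistics import mean


def build_leaderboard(scores):
    """Rank models by average score (hypothetical helper).

    scores: {model_name: {category_name: numeric_score}}
    Returns a list of row dicts, highest average first.
    """
    rows = []
    for model, by_category in scores.items():
        rows.append({
            "model": model,
            "avg": mean(by_category.values()),
            # Category names with the highest and lowest score for this model.
            "best": max(by_category, key=by_category.get),
            "worst": min(by_category, key=by_category.get),
        })
    return sorted(rows, key=lambda row: row["avg"], reverse=True)


# Placeholder data for illustration only; not real test results.
sample_scores = {
    "model_a": {"reasoning": 80, "coding": 90, "math": 70},
    "model_b": {"reasoning": 85, "coding": 95, "math": 90},
}

leaderboard = build_leaderboard(sample_scores)
for row in leaderboard:
    print(f"| {row['model']} | {row['avg']:.1f} | {row['best']} | {row['worst']} |")
```

Ties in average score keep their input order here, since Python's sort is stable. A real dashboard might break ties differently.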
Recent Runs
Latest test execution results.