LLM Capability Evaluation

Observing and comparing LLM capabilities across multiple dimensions, from code generation to multi-turn reasoning and multimodal understanding.

11 Tests
5 Categories

Demo Directory

Browse all available interactive demos and evaluations.

🌌
Live

Solar Lab

Interactive 3D orrery visualization. Explore planetary orbits, scales, and cinematic camera angles.

Visualization 3D
🎺
Live

Trompetas 3D

Three.js 3D trumpet physics: standing waves, harmonics, and sound propagation rendered interactively.

Animation Physics
🧮
Live

LLM Math

Interactive visualization of the transformer self-attention mechanism with a real-time attention-weight heatmap.

Visualization AI
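
As a rough illustration of the kind of weights such a heatmap displays, here is a minimal sketch of scaled dot-product attention in plain Python. It is not the demo's implementation: the function names, the toy vectors, and the choice to reuse the same vectors as both queries and keys are all assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)), one row of weights per query."""
    d = len(keys[0])
    scale = math.sqrt(d)
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]

# Hypothetical 2-dimensional token vectors, used as both queries and keys.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(tokens, tokens)

# Each row is one token's attention distribution over all tokens.
for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))
```

Each printed row sums to 1, so a row can be read as one line of a heatmap: brighter cells mark the tokens a given token attends to most.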
Live

Context Window

Token window management and context limits visualization: see how models handle long contexts.

Engineering Architecture
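
One common way to manage a token window is to keep only the most recent messages that fit a fixed budget. The sketch below shows that idea under stated assumptions: `count_tokens` and `fit_to_window` are hypothetical helpers, and whitespace-separated words stand in for real tokenizer tokens. It does not describe how the demo itself works.

```python
def count_tokens(text):
    # Assumption: whitespace-separated words approximate tokenizer tokens.
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first one that overflows.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, used

history = [
    "You are a helpful assistant",
    "Summarize the report in two lines",
    "Here is the summary you asked for",
    "Now translate it to French",
]
window, used = fit_to_window(history, max_tokens=12)
print(window, used)
```

With a 12-token budget only the last two messages survive, which shows the trade-off in this approach: older turns, including the initial instruction, fall out of the window first.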
🧠
Live

Reasoning

Multi-step logic and math problem solving with step-by-step chain-of-thought breakdown.

Reasoning Logic
💻
Live

Code Generation

Functional code generation and bug correction across multiple programming languages.

Code Engineering
🔧
Live

Tool Use

Function calling evaluation: a model's ability to select, invoke, and parameterize external tools.

Tools Reasoning
💬
Live

Multi-Turn Dialogue

Context retention and consistency across extended conversations, tracking state and coherence.

Dialogue Communication
👁️
Live

Multimodal

Cross-modal reasoning: understanding images, text, and structured output in parallel.

Vision Communication
⚡
Live

Performance

Response latency, accuracy consistency, and throughput benchmarks under load.

Perf Benchmarks
📊
Live

Results Dashboard

Aggregate evaluation results across all test dimensions. Compare models side-by-side.

Results Analytics
🎨
Live

Creativity

Narrative creativity, code originality, and divergent thinking assessment. Evaluate originality, fluency, and elaboration.

Creative Analysis
?
Coming Soon

Translation

Multilingual translation and cross-lingual context understanding evaluation.

NLP Language

About This Project

Tests of Code (TOC) is an internal project to evaluate and compare LLM capabilities across key dimensions: code generation, reasoning, tool use, dialogue, and multimodal understanding. Each demo is self-contained, observable, and designed to clearly show what capability is being tested.

Explore demos →