LLM Capability Evaluation
Observing and comparing LLM capabilities across multiple dimensions, from code generation to multi-turn reasoning and multimodal understanding.
Demo Directory
Browse all available interactive demos and evaluations.
Solar Lab
Interactive 3D orrery visualization. Explore planetary orbits, scales, and cinematic camera angles.
Trompetas 3D
Three.js 3D trumpet physics: standing waves, harmonics, and sound propagation rendered interactively.
LLM Math
Interactive visualization of the transformer self-attention mechanism with real-time weight heatmap.
Context Window
Token window management and context limits visualization: see how models handle long contexts.
Reasoning
Multi-step logic and math problem solving with step-by-step chain-of-thought breakdown.
Code Generation
Functional code generation and bug correction across multiple programming languages.
Tool Use
Function calling evaluation: model's ability to select, invoke, and parameterize external tools.
Multi-Turn Dialogue
Context retention and consistency across extended conversations: tracking state and coherence.
Multimodal
Cross-modal reasoning: understanding images, text, and structured output in parallel.
Performance
Response latency, accuracy consistency, and throughput benchmarks under load.
Results Dashboard
Aggregate evaluation results across all test dimensions. Compare models side-by-side.
Creativity
Narrative creativity, code originality, and divergent thinking assessment, evaluated for originality, fluency, and elaboration.
Translation
Multilingual translation and cross-lingual context understanding evaluation.
About This Project
Tests of Code (TOC) is an internal project to evaluate and compare LLM capabilities across key dimensions: code generation, reasoning, tool use, dialogue, and multimodal understanding. Each demo is self-contained, observable, and designed to show clearly which capability is being tested.
Explore demos →