Multimodal Understanding Demo
Evaluate how well an LLM processes combined text and visual inputs.
Coming Soon
Evaluations for image captioning, visual question answering (visual QA), and cross-modal reasoning are coming soon.
Planned Features
Image Captioning
Accurate, detailed descriptions of visual content.
Visual QA
Answers to specific questions about objects in an image and the relationships between them.
Text-Image Consistency
Descriptions that match the image, with no hallucinated visual details.
Structured Output
JSON output from visual analysis (bounding boxes, labels).
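Since the demo is not yet available, the following is only a minimal sketch of how the Structured Output and Text-Image Consistency checks might be scored. The JSON shape (an "objects" list whose entries have a "label" and a pixel "box" of [x_min, y_min, x_max, y_max]) and the names Detection, parse_detections, and hallucinated_labels are assumptions made for illustration, not the demo's actual format or API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels; hypothetical convention


def parse_detections(raw: str, width: int, height: int) -> list:
    """Parse a model's JSON reply and reject malformed or out-of-bounds boxes."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict) or not isinstance(data.get("objects"), list):
        raise ValueError("expected a JSON object with an 'objects' list")

    detections = []
    for item in data["objects"]:
        if not isinstance(item, dict):
            raise ValueError("each entry in 'objects' must be a JSON object")
        label, box = item.get("label"), item.get("box")
        if not isinstance(label, str) or not label:
            raise ValueError("each object needs a non-empty string 'label'")
        is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
        if not (isinstance(box, list) and len(box) == 4 and all(map(is_number, box))):
            raise ValueError("'box' must be a list of four numbers")
        x_min, y_min, x_max, y_max = box
        # The box must have positive area and lie inside the image.
        if not (0 <= x_min < x_max <= width and 0 <= y_min < y_max <= height):
            raise ValueError(f"box {box} is empty or outside the {width}x{height} image")
        detections.append(Detection(label, (x_min, y_min, x_max, y_max)))
    return detections


def hallucinated_labels(detections, reference_labels: set) -> set:
    """Return labels the model reported that are absent from the reference annotation."""
    return {d.label for d in detections} - reference_labels
```

For a 640x480 image annotated only with "cat", a reply that lists both a "cat" box and a "dog" box would parse cleanly, and hallucinated_labels would flag "dog". Exact label matching is a deliberately crude consistency check; a real evaluation would likely also need synonym handling and box overlap against ground truth.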