Multimodal Understanding Demo

Evaluate how well an LLM processes combined text and visual inputs.


Coming Soon

Evaluation of image captioning, visual QA, and cross-modal reasoning is planned but not yet available.


Planned Features

Image Captioning

Accurate, detailed descriptions of visual content.

Visual QA

Answers to specific questions about objects in an image and the relationships between them.

Text-Image Consistency

Text that stays consistent with the image, with no hallucinated visual details.
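One way such a consistency check might be approximated is sketched below. It is not the demo's actual method: the function name `hallucinated_objects`, the idea of a fixed object vocabulary, and the ground-truth label set are all assumptions made for illustration. The sketch flags vocabulary words that a caption mentions but that are not among the objects known to be in the image.

```python
import re


def hallucinated_objects(caption: str, present: set[str], vocabulary: set[str]) -> set[str]:
    """Return vocabulary objects the caption mentions that are not in the image.

    Hypothetical helper: `present` is an assumed ground-truth label set for the
    image, and `vocabulary` is an assumed list of object names to look for.
    """
    # Lowercase the caption and split it into alphabetic words.
    words = set(re.findall(r"[a-z]+", caption.lower()))
    # Keep only the words that name a known object.
    mentioned = words & vocabulary
    # Anything mentioned but not actually present counts as hallucinated.
    return mentioned - present


flagged = hallucinated_objects(
    "A dog chases a ball near a car.",
    present={"dog", "ball"},
    vocabulary={"dog", "ball", "car", "cat"},
)
print(flagged)
```

This only matches exact single-word names, so plurals, synonyms, and multi-word objects would need extra handling in a real evaluation.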

Structured Output

JSON output from visual analysis (bounding boxes, labels).
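As a rough illustration of what evaluating structured output could involve, the sketch below parses a model's JSON response and validates each detection. The schema is an assumption, not something this page specifies: a JSON array of objects with a `label` string and a `box` given as `[x_min, y_min, x_max, y_max]` in pixels. The function name `parse_detections` is likewise hypothetical.

```python
import json


def parse_detections(raw: str, image_width: int, image_height: int) -> list[dict]:
    """Parse and validate detections from a model's JSON output.

    Assumed (hypothetical) schema: a JSON array of objects such as
    {"label": "cat", "box": [x_min, y_min, x_max, y_max]} in pixel units.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of detections")

    detections = []
    for item in data:
        label = item["label"]
        x_min, y_min, x_max, y_max = item["box"]
        if not isinstance(label, str) or not label:
            raise ValueError("label must be a non-empty string")
        # The box must have positive area and lie inside the image.
        inside = 0 <= x_min < x_max <= image_width and 0 <= y_min < y_max <= image_height
        if not inside:
            raise ValueError(f"box out of bounds for {label!r}")
        detections.append({"label": label, "box": (x_min, y_min, x_max, y_max)})
    return detections


raw_output = '[{"label": "cat", "box": [10, 20, 110, 220]}]'
print(parse_detections(raw_output, image_width=640, image_height=480))
```

Rejecting malformed or out-of-bounds boxes up front keeps later scoring steps from silently comparing against invalid geometry.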