December 2, 2025

Multimodal Structured Outputs: Evaluating VLM Image Understanding at Scale

Evaluating how vision language models behave when you remove or include images in multiple-choice tasks.

by Everett Kleven

TLDR

After building a naive evaluation pipeline for Qwen3-VL, I quickly discovered that most multiple-choice questions from the ScienceQA and AI2D subsets were answerable without the image. By running an ablation study (with vs. without images) and classifying results into quadrants, I was able to analyze the degree to which naive accuracy metrics disguise genuine image understanding. I then took the cases where the model failed and added a final LLM-as-a-Judge stage to reason about each failure and attribute a diagnosis.

How do you know if your vision-language model is actually seeing the image or just guessing from the text?

This isn't a philosophical question. When you're using vision-language models to solve a real problem, it matters whether the model's image understanding is adding value or introducing noise. First, time-to-first-token with image inputs is usually at least 3x longer than with pure text. Second, if you can figure out the answer to a problem without an image, why bother sending one? Finally, if your job fundamentally relies on information stored in each image, then removing the image should leave only a slim chance that the answer is predictable.

In this post, we'll build a short evaluation pipeline in Daft that addresses this question by leveraging structured outputs and the new prompt function to conduct an ablation study. By the end, you'll have a methodology you can apply to your own models.

What we'll cover:

  • A few core primitives: Structured Outputs and LLM-as-a-Judge

  • An ablation study that isolates image understanding from text reasoning

  • A quadrant framework for classifying model behavior

  • Code that runs in 5 minutes on 50 rows—and scales to millions

What is Structured Outputs?

Structured Outputs refers to a family of features that constrain language model responses to a specific format. While LLMs continue to demonstrate impressive capabilities, their unpredictable outputs make them difficult to integrate with traditional software systems. Most production AI use cases leverage structured outputs—whether for tool calls, function arguments, or schema-compliant JSON.

The key enabling technology is guided decoding (also called constrained decoding). Rather than hoping the model outputs valid JSON, guided decoding manipulates token probabilities during generation to guarantee valid output. The model literally cannot produce invalid tokens.

This works through several mechanisms:

  • Logit biasing: Penalizing or promoting specific tokens

  • Finite State Machines (FSM): Filtering tokens based on grammar rules

  • Schema enforcement: Only allowing tokens that keep the output JSON-schema compliant
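
To make the masking idea concrete, here's a minimal sketch over a toy vocabulary: disallowed token logits are set to negative infinity before sampling, so an invalid token can never be chosen. The vocabulary and the allowed-set here are hypothetical; real implementations compile a JSON schema or grammar into an FSM that produces the allowed set at each decoding step.

```python
import math

# Toy vocabulary and raw logits (hypothetical; real decoders
# operate over the full tokenizer vocabulary)
vocab = ['{', '}', '"choice"', ':', '"A"', '"B"', 'banana']
logits = [1.0, 0.5, 2.0, 0.3, 1.5, 1.2, 3.0]

def mask_invalid(logits, allowed):
    # Set logits of disallowed tokens to -inf so softmax
    # assigns them zero probability
    return [l if tok in allowed else -math.inf
            for l, tok in zip(logits, vocab)]

# Suppose the grammar says only '"A"' or '"B"' may come next
masked = mask_invalid(logits, allowed={'"A"', '"B"'})
best = vocab[masked.index(max(masked))]
print(best)  # the highest-logit *valid* token wins, even though
             # 'banana' had the top raw logit
```

This is why guided decoding guarantees validity: the invalid continuation is not merely discouraged, it is unreachable.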

Structured outputs generally consist of five constraint techniques:

  1. Basic Python types: int, float, bool

  2. Multiple choice: Using Literal or Enum

  3. JSON schemas: Using Pydantic models or dataclasses

  4. Regex patterns

  5. Context-free grammars
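
Technique 2, for instance, can be expressed with a `Literal` type whose arguments enumerate the valid answers. A structured-output backend enforces this set during decoding; the sketch below (names illustrative) just validates after the fact to show the contract:

```python
from typing import Literal, get_args

# The schema: only these four strings are valid answers
Choice = Literal["A", "B", "C", "D"]
ALLOWED = set(get_args(Choice))

def validate_choice(raw: str) -> str:
    # Reject anything outside the literal set
    if raw not in ALLOWED:
        raise ValueError(f"invalid choice: {raw!r}")
    return raw

print(validate_choice("B"))  # passes
```

The same constraint could be written as an `Enum`; either way, the model's output space collapses to four tokens.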

For our evaluation, we'll use Pydantic models to ensure the VLM always returns a valid multiple-choice answer.

What is LLM-as-a-Judge?

LLM-as-a-Judge is a framework where a language model evaluates the outputs of other AI systems. Rather than relying on human evaluation (expensive, slow, inconsistent) or surface-level metrics like BLEU/ROUGE, LLM judges can assess semantic quality at scale.

The approach was formalized in the MT-Bench paper, which demonstrated that strong LLMs can achieve ~80% agreement with human preferences—comparable to inter-annotator agreement between humans themselves.

Three common evaluation methods:

  • Pairwise Comparison: The judge picks which of two responses is better

  • Single Answer Grading: The judge assigns a score based on criteria

  • Reference-Guided Grading: The judge compares a response against a known correct answer

In this pipeline, we'll use reference-guided grading with diagnostic feedback. Our judge won't just say pass/fail; it will analyze why the model failed, attributing errors to either question ambiguity or image understanding issues.

Ablating Images in Image Understanding

We'll evaluate Qwen3-VL-8B, a capable open vision-language model, on The Cauldron—a massive collection of 50 vision-language datasets from HuggingFace.

For this demo, we'll use the AI2D subset: science diagrams with multiple-choice questions. Think food webs, cell diagrams, and physics illustrations paired with questions like:

"From the above food web diagram, what would cause the kingfisher population to increase?"

A. decrease in fish

B. decrease in water boatman

C. increase in fish

D. increase in algae

These questions require genuine image understanding. You can't answer them from text alone. Or can you?

Ablation Methodology

A simple accuracy score tells us how often the model is correct, but not why. To isolate image understanding from text reasoning, we'll run an ablation study:

  1. Run inference with the image attached

  2. Run the same prompts without the image

  3. Compare the results

This produces four possible outcomes for each example:

| Outcome | Pass With Image | Pass Without Image | What It Tells Us |
|---|---|---|---|
| Both Correct | Yes | Yes | Question might be solvable from text alone |
| Image Helped | Yes | No | True image understanding at work |
| Image Hurt | No | Yes | The image confused the model |
| Both Incorrect | No | No | Hard question or model limitation |

The "Image Hurt" quadrant is particularly interesting. These are cases where the model got the answer right without the image but wrong with it. Understanding why this happens is crucial for improving VLM performance.
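
The classification itself is just a mapping from two pass/fail outcomes to four labels. As a tiny pure function (mirroring the conditional chain used in the pipeline later):

```python
def quadrant(correct_with_image: bool, correct_without_image: bool) -> str:
    # Map the two pass/fail outcomes to the four quadrants
    if correct_with_image and correct_without_image:
        return "Both Correct"
    if correct_with_image:
        return "Image Helped"
    if correct_without_image:
        return "Image Hurt"
    return "Both Incorrect"

print(quadrant(True, False))   # Image Helped
print(quadrant(False, True))   # Image Hurt
```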

Running the Evaluation on 50 Rows

We can quickly get started using HuggingFace Inference Providers to run our evaluation on 50 rows. First we'll define some configuration up front.

import os

import daft

# Configuration
MODEL_ID = "qwen/qwen3-vl-8b-instruct"
LIMIT = 50

# Configure the OpenAI provider to point to HuggingFace Inference for hosted Qwen3-VL
daft.set_provider(
    "openai",
    api_key=os.getenv("HF_TOKEN"),
    base_url="https://router.huggingface.co/v1",
)

Loading and Preprocessing The Cauldron

HuggingFaceM4/the_cauldron is a massive collection of 50 vision-language datasets spanning millions of rows across:

  1. General visual question answering

  2. OCR document understanding & text transcription

  3. Chart/figure understanding

  4. Table understanding

  5. Reasoning, logic, maths

  6. Textbook/academic questions

  7. Differences between 2 images

  8. Screenshot to code

This dataset is a great resource for evaluating the image understanding capabilities of a vision-language model, as it gives us a wide range of tasks and image compositions to test on. Its size alone makes it particularly useful for training and validation.

For now, we will begin with the general visual Q&A subset AI2D, leveraging Daft's built-in dataframe operations, like regular expressions, to parse each assistant message and extract the answer.

from daft import col
from daft.functions import prompt, when, format, unnest

from pydantic import BaseModel, Field

# Load the AI2D subset
df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d")

# Decode images and extract Q&A fields
df_prep = (
    df_raw
    .explode(col("images"))
    .with_column("image", col("images")["bytes"].decode_image())
    .explode(col("texts"))
    .select(unnest(col("texts")), "image")
    .with_column("answer", col("assistant").regexp_replace("Answer: ", ""))
).collect()

Daft handles image decoding natively; no PIL boilerplate required. Each row now contains a decoded image and the corresponding question/answer pair.

Structured Output Inference

Here's where structured outputs shine. We define a Pydantic model for the response format:

class ChoiceResponse(BaseModel):
    """Structured output for multiple choice answers."""
    choice: str = Field(..., description="The letter of the correct choice (A, B, C, D)")
Then run inference using Daft's prompt function:

SYSTEM_PROMPT = "Observe the attached image and respond to the multiple choice question with just the letter corresponding to the correct answer."

df_results = df_prep.with_column(
    "result",
    prompt(
        messages=[col("image"), col("user")],  # Image + question
        system_message=SYSTEM_PROMPT,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,  # Enforces structured output
    ),
).limit(LIMIT).collect()

The return_format=ChoiceResponse parameter ensures every response is valid JSON matching our schema.

Running the Ablation

Now we run the same prompts without images:

SYSTEM_PROMPT_NO_IMAGE = "Respond to the multiple choice question with just the letter corresponding to the correct answer."

df_ablation = df_results.with_column(
    "result_no_image",
    prompt(
        messages=col("user"),  # Text only, no image
        system_message=SYSTEM_PROMPT_NO_IMAGE,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
    ),
).collect()

Classifying into Quadrants

# Evaluate correctness for both conditions
df_eval = df_ablation.with_column(
    "is_correct",
    col("result")["choice"].strip() == col("answer").strip(),
).with_column(
    "is_correct_no_image",
    col("result_no_image")["choice"].strip() == col("answer").strip(),
)

# Classify into quadrants
df_classified = df_eval.with_column(
    "quadrant",
    when(col("is_correct") & col("is_correct_no_image"), "Both Correct")
    .when(col("is_correct") & ~col("is_correct_no_image"), "Image Helped")
    .when(~col("is_correct") & col("is_correct_no_image"), "Image Hurt")
    .otherwise("Both Incorrect"),
)

# View the distribution
df_classified.groupby("quadrant").count().show()

On our 50-row sample, you'll see a distribution across all four quadrants. The exact numbers will vary, but the pattern is consistent: images don't always help.
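
Once the quadrant counts are in plain Python (e.g. via `to_pydict()` on the grouped dataframe), the headline metrics fall out directly. The distribution below is hypothetical, purely to show the arithmetic:

```python
# Hypothetical 50-row quadrant distribution
counts = {"Both Correct": 21, "Image Helped": 14, "Image Hurt": 6, "Both Incorrect": 9}
n = sum(counts.values())

# Accuracy in each condition is the sum of the quadrants where that condition passed
accuracy_with_image = (counts["Both Correct"] + counts["Image Helped"]) / n
accuracy_without_image = (counts["Both Correct"] + counts["Image Hurt"]) / n

# "Both Correct" bounds how much of the benchmark may be text-solvable
text_solvable_upper_bound = counts["Both Correct"] / n

print(f"with image:    {accuracy_with_image:.0%}")
print(f"without image: {accuracy_without_image:.0%}")
```

Note that naive accuracy (the "with image" number) silently includes every question the model could have answered blind.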

LLM-as-a-Judge: Understanding Failures

Now for the diagnostic layer. We'll have the VLM analyze its own failures:

class JudgeResponse(BaseModel):
    """Diagnostic feedback from the judge."""
    reasoning: str = Field(..., description="Why did the model choose this answer?")
    hypothesis: str = Field(..., description="What caused the error?")
    attribution: str = Field(..., description="Was this a 'question' or 'image' understanding issue?")

JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing VLM benchmark results.
Analyze why the model chose its answer and what caused the error.
Focus on image understanding—your feedback should help improve visual reasoning.
"""

judge_template = format(
    """Model predicted {} but correct answer is {}.
Without the image, model predicted {}.
Question: {}
Analyze the failure.""",
    col("result")["choice"],
    col("answer"),
    col("result_no_image")["choice"],
    col("user"),
)

# Run on a sample of failures
df_failures = df_classified.where(
    (col("quadrant") == "Image Hurt") | (col("quadrant") == "Both Incorrect")
)

df_judge = df_failures.limit(2).with_column(
    "judge",
    prompt(
        messages=[col("image"), judge_template],
        system_message=JUDGE_SYSTEM_PROMPT,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=JudgeResponse,
    ),
).collect()

The judge output gives you actionable feedback:

# View all judge feedback
df_judge.select(
    "quadrant",
    "image",
    col("judge")["reasoning"].alias("reasoning"),
    col("judge")["hypothesis"].alias("hypothesis"),
    col("judge")["attribution"].alias("attribution"),
).show(10)

This is the kind of signal you need to improve prompts, fine-tune models, or identify dataset issues. Everything above runs locally in about 5 minutes on 50 rows. But The Cauldron contains millions of rows across 50 subsets.

At 50 rows, you can begin to observe a pattern. At ~8,000 rows (the full AI2D subset), you get statistically significant results. At millions of rows across all subsets, you have a comprehensive VLM benchmark.
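
To see why sample size matters, consider the width of a confidence interval around an observed accuracy: under the normal approximation for a binomial proportion, it shrinks roughly with 1/sqrt(n). A back-of-the-envelope sketch, using an illustrative 70% accuracy:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    # 95% normal-approximation half-width for a binomial proportion
    return z * math.sqrt(p * (1 - p) / n)

p = 0.70  # illustrative observed accuracy
for n in (50, 8000):
    print(f"n={n}: ±{margin_of_error(p, n):.1%}")
```

At 50 rows the interval is roughly ±13 points, wide enough to swallow most quadrant differences; at 8,000 rows it tightens to about ±1 point.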

The challenge? API rate limits and cost.

Running the full AI2D subset through interactive API providers means:

  • ~8,000 inference calls (with image)

  • ~8,000 inference calls (without image)

  • Thousands more for the Judge pass

  • 429 Errors that throttle you to a crawl

  • One massive bill at the end

Ideally, we'd want to run as many inference stages as needed with predictable costs and high overall throughput.

So I ran it on Daft Cloud

Another natural next step would be to parallelize this pipeline across multiple datasets, leveraging multiple GPUs. In that scenario, a managed provider like OpenRouter isn't feasible. With Daft Cloud, you can run Qwen3-VL on as much data as you want with no rate limits or GPU configuration headaches.


If you're interested in running serverless AI pipelines at scale, sign up for Daft Cloud early access and book a demo!

What to try next: Multi-Dataset Evaluation

Extend evaluation across all 50 subsets of The Cauldron to build a comprehensive benchmark:

# Same code, different provider
daft.set_provider("daft")

# Run all subsets
subsets = ["ai2d", "chartqa", "docvqa", "infographicvqa", ...]
for subset in subsets:
    df = run_full_pipeline(subset, MODEL_ID)
    df.write_parquet(f"results/{subset}.parquet")
Daft Engineering Blog