December 2, 2025

Multimodal Structured Outputs: Evaluating VLM Image Understanding at Scale

Leveraging ablation for contrastive image understanding evaluation in Daft

by Everett Kleven

TLDR

We ran a large-scale VLM image understanding evaluation on Daft Cloud and discovered statistically significant bias in the multiple-choice questions within the AI2D, Science-QA, and TQA datasets. Across 20,093 questions spanning 3 datasets, our analysis revealed that Qwen-3-VL-4b reached 90.36% accuracy with the image included versus 71.00% with the image ablated. By classifying each test result into one of four possible scenarios, we were able to quantify the impact of image ablation and isolate which test cases genuinely required the image to reach the correct answer.

Note: For evaluation results, scroll to the bottom of the post.

How do you know if your vision-language model is actually seeing the image or just guessing from the text?

This isn't a philosophical question. When you're using vision-language models to solve a real problem, it matters whether your model's image understanding is adding value or introducing noise. Image inputs come at a cost. Time-to-first-token can be 3x longer than with raw text alone, so if your LLM pipeline can operate without an image, it's usually best to leave it out. On the flip side, vision-language models (VLMs) have improved drastically over the past few years, empowering teams across industries to leverage the flexibility of generalized intelligence on images and videos.

Image understanding is a machine learning task concerned with enabling computational systems to derive semantically meaningful interpretations from visual data. A system performing image understanding aims not only to extract low-level visual features but also to construct a coherent representation that approximates human-level interpretation, supporting downstream tasks like decision-making, description generation, or interaction in multimodal environments. Since these capabilities are more computationally expensive, we care about how well our image understanding performs across a variety of scenarios.

This is where evals come in. A good image understanding evaluation should tell you not only how well you perform compared to your peers, but also how effectively an individual model performs across a variety of contexts and formats.

In this post, we'll build a short evaluation pipeline in Daft that measures textual bias in image understanding by focusing on academic diagrams. We'll explore how well a model can answer a multiple-choice question by testing how accuracy differs when the model is given the reference image versus when it is removed. The goal is to develop intuition about how accuracy changes when an image is removed from the context, so that we can simultaneously evaluate the performance of our VLM and surface any bias in our dataset. Finally, we'll run a popular diagnostic strategy to investigate WHY a particular image + question pair failed by leveraging an LLM-as-a-Judge to review the results. By the end of this post you'll have a methodology you can apply to your own models.

What we'll cover:

• A few core primitives: Structured Outputs and LLM-as-a-Judge
• An ablation study that isolates image understanding from text reasoning
• A quadrant framework for classifying model behavior
• Code that runs in 5 minutes on 50 rows and scales to thousands

What is Structured Outputs?

Structured Outputs refers to a family of features that constrain language model responses to a specific format. While LLMs continue to demonstrate impressive capabilities, their unpredictable outputs make them difficult to integrate with traditional software systems. Most production AI use cases leverage structured outputs, whether for tool calls, function arguments, or schema-compliant JSON.

The key enabling technology is guided decoding (also called constrained decoding). Rather than hoping the model outputs valid JSON, guided decoding manipulates token probabilities during generation to guarantee valid output. The model literally cannot produce invalid tokens.

This works through several mechanisms:

• Logit biasing: Penalizing or promoting specific tokens
• Finite State Machines (FSM): Filtering tokens based on grammar rules
• Schema enforcement: Only allowing tokens that keep the output JSON-schema compliant
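To make the token-filtering idea concrete, here is a toy sketch in plain Python. The six-token vocabulary, the logit values, and the `constrained_pick` helper are all invented for illustration; real inference engines apply the same masking idea over the full tokenizer vocabulary at every decoding step.

```python
import math

def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
    """One greedy decoding step where tokens outside `allowed` are masked out."""
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token present in the vocabulary")
    return best

# Hypothetical next-token scores: the unconstrained favourite is "The".
step_logits = {"The": 4.1, "C": 3.2, "A": 1.0, "maybe": 2.7, "B": 0.4, "D": -0.3}

unconstrained = max(step_logits, key=step_logits.get)
constrained = constrained_pick(step_logits, allowed={"A", "B", "C", "D"})
print(unconstrained, constrained)
```

Without the mask the model would start a free-form sentence; with it, the highest-scoring answer letter is the only thing that can be emitted.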

On the development side, structured outputs generally consist of five constraint techniques to define how an output should be constrained:

1. Basic Python types: int, float, bool
2. Multiple choice: Using Literal or Enum
3. JSON schemas: Using Pydantic models
4. Regex patterns
5. Context-free grammars

For our evaluation, we'll use Pydantic models to ensure the VLM always returns a valid multiple-choice answer.
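As a dependency-free illustration of the multiple-choice technique, the sketch below declares the allowed letters with `typing.Literal` and validates a raw JSON response against them. The `parse_choice` helper and the sample payloads are hypothetical; the pipeline in this post relies on a Pydantic model rather than this hand-rolled check.

```python
import json
from typing import Literal, get_args

# The only values a multiple-choice response is allowed to take.
Choice = Literal["A", "B", "C", "D"]

def parse_choice(raw_json: str) -> str:
    """Parse a model response and reject anything outside the allowed letters."""
    payload = json.loads(raw_json)
    choice = str(payload["choice"]).strip()
    if choice not in get_args(Choice):
        raise ValueError(f"unexpected choice: {choice!r}")
    return choice

# A padded but valid answer is normalized.
valid = parse_choice('{"choice": " C "}')

# An out-of-range answer is rejected instead of silently scored as wrong.
try:
    parse_choice('{"choice": "E"}')
    rejected = False
except ValueError:
    rejected = True

print(valid, rejected)
```

Rejecting malformed answers up front keeps them from being counted as ordinary wrong answers later in the accuracy calculation.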

What is LLM-as-a-Judge?

LLM-as-a-Judge is a framework where a language model evaluates the outputs of other AI systems. Rather than relying on human evaluation (expensive, slow, inconsistent) or surface-level metrics like BLEU/ROUGE, LLM judges can assess semantic quality at scale.

The approach was formalized in the MT-Bench paper, which demonstrated that strong LLMs can achieve ~80% agreement with human preferences—comparable to inter-annotator agreement between humans themselves.

Three common evaluation methods:

• Pairwise Comparison: The judge picks which of two responses is better
• Single Answer Grading: The judge assigns a score based on criteria
• Reference-Guided Grading: The judge compares a response against a known correct answer

In this pipeline, we'll use reference-guided grading with diagnostic feedback. Our judge won't just say pass/fail; it will analyze why the model failed, attributing errors to either question ambiguity or image understanding issues.

Ablating Images in Image Understanding

We'll evaluate Qwen3-VL-8B, a capable open vision-language model, on 3 subsets of The Cauldron, a massive collection of 50 vision-language datasets from HuggingFace.

For this demo, we'll use the AI2D subset: science diagrams with multiple-choice questions. Think food webs, cell diagrams, and physics illustrations paired with questions like:

"From the above food web diagram, what would cause the kingfisher population to increase?"
A. decrease in fish
B. decrease in water boatman
C. increase in fish
D. increase in algae

These questions are designed to require genuine image understanding. Ideally, you wouldn't be able to answer them from text alone (our ablation study revealed otherwise).

Ablation Methodology

A simple accuracy score tells us how often the model is correct, but not why. To isolate image understanding from text reasoning, we'll run an ablation study:

1. Run inference with the image attached
2. Run the same prompts without the image
3. Compare the results

This produces four possible outcomes for each example:

Outcome	Pass With Image	Pass Without Image	What It Tells Us
Both Correct	✓	✓	Question might be solvable from text alone
Image Helped	✓	✗	True image understanding at work
Image Hurt	✗	✓	The image confused the model
Both Incorrect	✗	✗	Hard question or model limitation

The "Image Hurt" quadrant is particularly interesting. These are cases where the model got the answer right without the image but wrong with it. Understanding why this happens is crucial for improving VLM performance.
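The four-outcome table reduces to two booleans per question. A minimal pure-Python sketch of that mapping follows; the `classify` helper and the five paired results are made up for illustration and are independent of the Daft pipeline built later in the post.

```python
from collections import Counter

def classify(pass_with_image: bool, pass_without_image: bool) -> str:
    """Map one paired result onto the four-outcome table."""
    if pass_with_image and pass_without_image:
        return "Both Correct"
    if pass_with_image:
        return "Image Helped"
    if pass_without_image:
        return "Image Hurt"
    return "Both Incorrect"

# Made-up paired results: (correct with image, correct without image).
paired_results = [
    (True, True),
    (True, False),
    (True, True),
    (False, True),
    (False, False),
]

tally = Counter(classify(w, wo) for w, wo in paired_results)
print(tally)
```

Because every question lands in exactly one quadrant, the four counts always sum to the number of questions evaluated.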

Running the Evaluation on 50 Rows

For a low-commitment exploration of the pipeline, we'll use HuggingFace Inference Providers to run our evaluation on 50 rows. First, we'll define some configuration up front.


# pip install "daft[openai]"
import os

import daft

# Configuration
MODEL_ID = "qwen/qwen3-vl-8b-instruct"
LIMIT = 50

# Configuring the OpenAI Provider to point to HuggingFace Inference for hosted Qwen3-VL
daft.set_provider(
    "openai",
    api_key=os.getenv("HF_TOKEN"),
    base_url="https://router.huggingface.co/v1"
)

Here we define our MODEL_ID variable for re-use throughout the script and set the provider to "openai", overriding the base_url and setting the api_key to our HF_TOKEN environment variable.

Loading and Preprocessing The Cauldron

HuggingFaceM4/the_cauldron is a massive collection of 50 vision-language datasets spanning millions of rows across:

1. General visual question answering
2. OCR document understanding & text transcription
3. Chart/figure understanding
4. Table understanding
5. Reasoning, logic, maths
6. Textbook/academic questions 👈 Our Datasets
7. Differences between 2 images
8. Screenshot to code

This superset is a great resource for evaluating the image understanding capabilities of a vision language model, as it gives us a wide range of tasks and image compositions to test on. Its size alone makes it particularly useful for training and validation.

We are focused on the textbook/academic questions category, so we'll experiment with AI2D, the smallest of the subsets. We'll begin our preprocessing by leveraging Daft's built-in dataframe operations to decode the images, explode the multiple-choice question texts column, and finally parse each assistant message to extract the answer letter with regular expressions. This will give us a deterministic way to calculate accuracy when we compare against our VLM's structured output.


import os
from daft import col
from daft.functions import unnest
from pydantic import BaseModel, Field

# Load the AI2D subset
df_raw = daft.read_huggingface("HuggingFaceM4/the_cauldron/ai2d")

# Decode images and extract Q&A fields
df_prep = (
    df_raw
    .explode(col("images"))
    .with_column("image", col("images")["bytes"].decode_image())
    .explode(col("texts"))
    .select(unnest(col("texts")), "image")
    # Strip the "Answer: " prefix so only the answer letter remains
    .with_column("answer", col("assistant").regexp_replace("Answer: ", "").lstrip().rstrip())
).collect()

Daft handles image decoding natively, no PIL boilerplate required. Each row now contains a decoded image and the corresponding question/answer pair.
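The string cleanup in the last step of that block can be sketched without Daft. The version below assumes assistant messages look like `Answer: C`, which is what the regex in the preprocessing code implies; the `extract_answer` and `is_correct` helpers are hypothetical names used only for this illustration.

```python
import re

def extract_answer(assistant_message: str) -> str:
    """Drop the 'Answer: ' prefix and any surrounding whitespace."""
    return re.sub(r"Answer: ", "", assistant_message).strip()

def is_correct(predicted: str, assistant_message: str) -> bool:
    """Compare a (possibly padded) predicted letter to the reference answer."""
    return predicted.strip() == extract_answer(assistant_message)

clean = extract_answer("Answer: C")
padded = extract_answer("  Answer: B\n")
match = is_correct(" C ", "Answer: C")
mismatch = is_correct("A", "Answer: C")
print(clean, padded, match, mismatch)
```

Reducing both sides to a bare letter is what makes the later accuracy check a plain string comparison.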

Structured Output Inference

Here's where structured outputs shine. Since we are only testing the model's ability to answer the multiple-choice questions, we define a Pydantic model for the response format with a description to remind the model to respond with only the letter of its guess.


class ChoiceResponse(BaseModel):
    """Structured output for multiple choice answers."""
    choice: str = Field(..., description="The letter of the correct choice (A, B, C, D)")

Then run inference using Daft's prompt function:


from daft.functions import prompt

df_results = df_prep.with_column(
    "result",
    prompt(
        messages=[col("image"), col("user")],  # Image + question
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,  # Enforces structured output
    )
).limit(LIMIT).collect()

The return_format=ChoiceResponse parameter ensures every response is valid JSON matching our schema.

Running the Ablation

Now we run the same prompt without images:


df_ablation = df_results.with_column(
    "result_no_image",
    prompt(
        messages=col("user"),  # Text only, no image
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=ChoiceResponse,
    )
).limit(LIMIT).collect()

Evaluating Correctness

Since we are dealing with structured outputs, we can rely on the outputs being one of the letters A-D, which in some cases may be padded with an extra space. We can remove this padding with a simple left/right strip expression, leaving the final string equality check trivial.


# Evaluate correctness for both conditions
df_eval = df_ablation.with_column(
    "is_correct",
    col("result")["choice"].lstrip().rstrip() == col("answer")
).with_column(
    "is_correct_no_image",
    col("result_no_image")["choice"].lstrip().rstrip() == col("answer")
)

Classifying into Quadrants

Now that we have our is_correct and is_correct_no_image columns, we can define a when expression to classify our results into our four possible scenarios:


from daft.functions import when

# Classify into quadrants
df_classified = df_eval.with_column(
    "quadrant",
    when((col("is_correct")) & (col("is_correct_no_image")), "Both Correct")
    .when((col("is_correct")) & (~col("is_correct_no_image")), "Image Helped")
    .when((~col("is_correct")) & (col("is_correct_no_image")), "Image Hurt")
    .otherwise("Both Incorrect")
)

On our 50-row sample, you'll see a distribution across all four quadrants. The exact numbers will vary, but the pattern is consistent: a majority of multiple choice questions are guessable without the reference image.

LLM-as-a-Judge: Understanding Failures

Now for the diagnostic layer. We'll have the VLM analyze its own failures:


from daft.functions import format

class JudgeResponse(BaseModel):
    """Diagnostic feedback from the judge."""
    reasoning: str = Field(..., description="Why did the model choose this answer?")
    hypothesis: str = Field(..., description="What caused the error?")
    attribution: str = Field(..., description="Was this a 'question' or 'image' understanding issue?")

JUDGE_SYSTEM_PROMPT = """
You are an impartial judge reviewing VLM benchmark results.
Analyze why the model chose its answer and what caused the error.
Focus on image understanding, your feedback should help improve visual reasoning.
"""

judge_template = format(
    """Model predicted {} but correct answer is {}.
Without the image, model predicted {}.
Question: {}
Analyze the failure.""",
    col("result")["choice"],
    col("answer"),
    col("result_no_image")["choice"],
    col("user")
)

# Filter our dataset to focus only on failures
df_failures = df_classified.where(
    (col("quadrant") == "Image Hurt") | (col("quadrant") == "Both Incorrect")
)

# Run our LLM-as-a-Judge call on the failures
df_judge = df_failures.limit(2).with_column(
    "judge",
    prompt(
        messages=[col("image"), judge_template],
        system_message=JUDGE_SYSTEM_PROMPT,
        model=MODEL_ID,
        use_chat_completions=True,
        return_format=JudgeResponse,
    )
).collect()

Selecting only the columns needed for presentation, we can see that the judge output gives you actionable feedback:


from daft.functions import unnest

# View all judge feedback
df_judge.select(
    "quadrant",
    "image",
    unnest(col("judge")),
).show(10)

This is the kind of signal you need to improve prompts, fine-tune models, or identify dataset issues. Everything above runs locally in under 3 minutes on 50 rows. But The Cauldron contains millions of rows across 50 subsets.

The challenge? API rate limits and cost.

Running the full AI2D subset through interactive API providers means:

• ~20,000 inference calls (with image)
• ~20,000 inference calls (without image)
• Thousands more for the Judge pass
• 429 errors that throttle you to a crawl
• One massive bill at the end

Ideally we'd want to run as many inference stages as needed with predictable costs and high overall throughput. Since we're focusing on the multiple-choice datasets within The Cauldron, our sample covers ~20,000 rows. At this scale we can be confident we'll get statistically significant results, but leveraging an interactive provider isn't feasible.
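One way to check significance for paired pass/fail results is McNemar's test, which looks only at the questions where the two conditions disagree. The sketch below assumes that test (without continuity correction); it is an illustrative choice and not necessarily what the production pipeline uses. The `mcnemar_chi2` helper is a hypothetical name, and the second call plugs in the "Image Helped" and "Image Hurt" counts reported in the results section of this post.

```python
import math

def mcnemar_chi2(only_with_image: int, only_without_image: int) -> tuple[float, float]:
    """McNemar's test (no continuity correction) on discordant pair counts.

    Returns (statistic, p_value). With one degree of freedom, the chi-square
    survival function reduces to erfc(sqrt(x / 2)).
    """
    discordant = only_with_image + only_without_image
    if discordant == 0:
        return 0.0, 1.0
    statistic = (only_with_image - only_without_image) ** 2 / discordant
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Small made-up example: 30 questions fixed by the image, 12 broken by it.
stat_small, p_small = mcnemar_chi2(30, 12)

# Discordant counts reported in this post: 4,360 "Image Helped", 469 "Image Hurt".
stat_full, p_full = mcnemar_chi2(4_360, 469)

print(f"small sample: chi2={stat_small:.2f}, p={p_small:.4f}")
print(f"full run:     chi2={stat_full:.1f}, p={p_full:.3g}")
```

The larger the imbalance between the two discordant counts, the larger the statistic; at the scale of the full run the p-value is too small to represent as a float and prints as zero.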

So I ran it on Daft Cloud

With Daft Cloud, I can run Qwen-3-VL-4b on as much data as I need without dealing with rate limit headaches or painful GPU configuration.

Metrics Dashboard for the TQA Daft Cloud Run

Pipeline performance metrics across all three datasets

The throughput graphs tell the real story. The Tokens In/Second metric held steady between 3,000 and 4,500 for the duration of the run, with no degradation as the pipeline worked through the queue. Zero 429 errors. No exponential backoff. No idle time waiting for quota to refresh. The pipeline simply processed requests as fast as the model could handle them.

To put this in perspective, running the same evaluation through a typical interactive API provider would take considerably longer. Most free-tier inference endpoints cap you at roughly 10 requests per minute, and even paid tiers impose token-per-minute limits that would stretch this workload across hours or days. Here, the entire evaluation—20,093 questions with two inference passes each—finished in the background without a hitch.
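A quick back-of-the-envelope calculation shows what that cap implies. It assumes requests are issued back to back at exactly 10 per minute with no retries, and it ignores the judge pass, so treat the result as a rough lower bound rather than a measurement.

```python
QUESTIONS = 20_093          # questions in the evaluation
PASSES = 2                  # one pass with the image, one without
REQUESTS_PER_MINUTE = 10    # rough free-tier cap quoted above

total_requests = QUESTIONS * PASSES
minutes = total_requests / REQUESTS_PER_MINUTE
hours = minutes / 60
days = hours / 24

print(f"{total_requests} requests -> {hours:.1f} hours ({days:.1f} days)")
# prints: 40186 requests -> 67.0 hours (2.8 days)
```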

You can find the full production pipeline on our daft-examples repository. If you're looking to run serverless AI data pipelines make sure to sign up for Daft Cloud early access and book a demo!

Image Understanding Ablation Evaluation Results

There are few engines that are faster than Daft at analytics queries. We can easily generate reports by reading the results tables that were written to S3 and applying a few transformations to calculate our aggregate accuracies and classifications.

Most of the time when you run an image evaluation benchmark, you are testing the accuracy of a particular model compared to its peers. Our study reveals that using multiple choice in image evaluations is not so straightforward. Multiple-choice questions might be a great way to get deterministic, verifiable answers, but their design and composition have an outsized impact on the value of the benchmark.

In this case, these textbook/academic diagrams contained excellent semantic reasoning representations. However, many of the questions posed didn't account for the background knowledge that many LLMs already come with, rendering roughly 70% of all questions useless for evaluating image understanding.

What makes this exercise valuable is how cleanly the quadrant classification removes noise from our evaluation. By separating the 4,360 "Image Helped" questions from the 13,798 "Both Correct" questions, we can focus our analysis on the ~22% of questions that genuinely test image understanding.
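The headline numbers can be re-derived from the quadrant counts reported in this post. The sketch below assumes the four quadrants partition all 20,093 questions, so the "Both Incorrect" count is inferred by subtraction rather than taken from the results tables.

```python
TOTAL = 20_093
BOTH_CORRECT = 13_798
IMAGE_HELPED = 4_360
IMAGE_HURT = 469  # reported under "Image Hurt" in the takeaways

# Inferred, assuming every question falls into exactly one quadrant.
both_incorrect = TOTAL - BOTH_CORRECT - IMAGE_HELPED - IMAGE_HURT

acc_with_image = (BOTH_CORRECT + IMAGE_HELPED) / TOTAL
acc_without_image = (BOTH_CORRECT + IMAGE_HURT) / TOTAL
solvable_blind_share = BOTH_CORRECT / TOTAL
image_helped_share = IMAGE_HELPED / TOTAL

print(f"both incorrect (inferred): {both_incorrect}")
print(f"accuracy with image:       {acc_with_image:.2%}")
print(f"accuracy without image:    {acc_without_image:.2%}")
print(f"solvable without image:    {solvable_blind_share:.1%}")
print(f"image helped:              {image_helped_share:.1%}")
```

The derived accuracies land within rounding of the 90.36% and 71.00% quoted at the top of the post, and the last two lines correspond to the "nearly 70%" and "~22%" figures used in the discussion.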

Key Takeaways

1. Ablation Studies Reveal Dataset Bias, Not Just Model Performance

Our most surprising finding wasn't about Qwen-3-VL-4b; it was about the datasets themselves. Of the 20,093 multiple-choice questions designed to test image understanding, 13,798 were answered correctly both with and without the image.

When nearly 70% of an "image understanding" benchmark can be solved blind, we're not measuring vision, we're measuring textual memorization.

2. Multiple-Choice Questions Carry Hidden Context

The design of multiple-choice questions inadvertently encodes information. Answer options often contain semantic cues that a well-trained language model can exploit. Consider a question about a food web diagram asking what would cause a kingfisher population to increase—even without the diagram, an LLM knows that predators increase when prey increases.

This isn't a flaw in the model; it's a consequence of training on internet-scale text that includes countless biology textbooks, Wikipedia articles, and educational materials. The model has seen enough food webs to know the patterns.

3. The Real Benchmark is the "Image Helped" Subset

For rigorous VLM evaluation, the most informative metric isn't overall accuracy—it's performance on questions where the image was *necessary*. Our quadrant framework isolates these cases:

Image Necessity Rate by Dataset:

• TQA: 29.5% of correct answers required the image
• ScienceQA: 24.9% of correct answers required the image
• AI2D: 19.1% of correct answers required the image

These rates tell us which datasets provide the strongest signal for evaluating genuine image understanding. TQA, despite having lower overall accuracy, may actually be a more discriminative benchmark.

4. "Image Hurt" Cases Deserve Investigation

The 469 cases where the image decreased accuracy warrant attention. Our LLM-as-a-Judge analysis revealed several patterns:

• Visual ambiguity and fidelity: Diagrams with small or unclear labels or overlapping elements
• Conflicting information: Text context suggesting one answer, image suggesting another
• Attention interference: The model focusing on irrelevant visual details

Understanding these failure modes is essential for improving both models and evaluation datasets.

Conclusion

We set out to evaluate image understanding and ended up evaluating our evaluation. The ablation methodology revealed that multiple-choice benchmarks contain substantial textual bias, a finding that has implications for how we interpret published VLM results across the field.

This doesn't diminish the value of AI2D, ScienceQA, or TQA as resources. These datasets remain useful for training and for measuring certain capabilities. But our analysis suggests that headline accuracy numbers on these benchmarks may overstate genuine image understanding by a significant margin.

The quadrant framework [Both Correct, Image Helped, Image Hurt, Both Incorrect] provides a more nuanced view. It separates questions where the model demonstrates true visual reasoning from questions where it's simply applying learned patterns to text. For production systems, this distinction matters.

---

There are quite a few ways this evaluation could be improved or extended. If you have suggestions or are interested in contributing, we are open source! The code for this evaluation is available in our daft-examples repository.