Why Is It So Hard to Tell What a Vision-Language Model Actually Sees?
High benchmark accuracy tells you almost nothing about whether a VLM is actually using its image input. Measuring modality attribution turns out to be harder than it looks, and every existing approach shares the same structural blind spot.
A vision-language model scores 82% on VQA. That score tells you almost nothing about whether the model is actually using the image.
It is consistent with a model doing genuine visual reasoning, and equally consistent with a model that almost entirely ignores the image, answers from language priors alone, and gets lucky on the cases where those priors align with the correct answer. You cannot tell the difference by looking at accuracy. The number is real; what it measures is ambiguous.
This is the core problem that motivates a whole branch of VLM interpretability research, and it turns out to be much harder than it first appears. The difficulty is not that researchers have failed to try; it is that every intuitive approach runs into a structural problem, and the next intuitive approach, designed to fix that, reveals a different structural problem underneath.
This post is about what those problems are and why they're hard. It sets up the question that the rest of this series is built around.
What we actually want to know
Before diagnosing the measurement failure, it helps to be precise about what a good measurement would actually capture.
The naive version of the question is: did the model look at the image? But that's not precise enough to be useful. A model can attend to image tokens, include them in its computation, and still produce a prediction that is entirely determined by the text. Looking and using are different things.
A better version of the question: how much of the model's internal prediction was shaped by the image versus the text? Still not quite right, because this frames the problem as though vision and text are independent contributors whose shares add to 100%, which assumes something about the nature of multimodal fusion that may not be true.
The actually correct version of the question is more complex: how much does each modality contribute uniquely, meaning information that the other modality could not have provided, and how much of the model's behavior depends on information that only emerges when both modalities are processed together?
That last clause matters. In any system where two inputs interact, some of what the system does cannot be attributed to either input alone; it emerges from their combination. Any measurement framework that ignores this produces attribution scores that are, at best, incomplete and, at worst, actively misleading. Every existing family of measurement methods fails to handle this fully.
The first family: perturbation and occlusion
The oldest and most intuitive approach is to corrupt or remove one modality and observe what happens to the output.
The logic is straightforward: if blanking out the image causes a large drop in accuracy, the image must be important. If the model keeps answering correctly without it, maybe it wasn't doing much. You can quantify this by comparing the model's output distribution with and without each modality and treating the divergence as a contribution score.
This is the approach behind input occlusion and the various perturbation-based schemes that systematically replace one modality with noise, zeros, or dummy content and quantify the resulting performance gap under each unimodal ablation.
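The basic recipe fits in a few lines. Everything below is a toy stand-in (the `toy_model`, the evidence vectors, and the blank baseline are all hypothetical, not any real VLM), but the shape of the computation matches what occlusion-style methods do: compare output distributions with and without the modality, and treat the divergence as a contribution score.

```python
import math

def softmax(logits):
    exps = [math.exp(l - max(logits)) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; q entries are positive after softmax."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def perturbation_score(model, image, text, blank_image):
    """Occlusion-style contribution score for the image modality:
    divergence between the output distribution with the real image
    and with the image replaced by a blank stand-in."""
    p_full = softmax(model(image, text))
    p_ablated = softmax(model(blank_image, text))
    return kl_divergence(p_full, p_ablated)

# Toy "model": logits are just the sum of per-modality evidence.
def toy_model(image, text):
    return [i + t for i, t in zip(image, text)]

image = [2.0, 0.0, 0.0]   # image evidence favors answer 0
text  = [0.0, 1.0, 0.0]   # text evidence favors answer 1
blank = [0.0, 0.0, 0.0]

score = perturbation_score(toy_model, image, text, blank)  # > 0
```

A nonzero score says only that the output is behaviorally sensitive to the image, which is exactly the limitation the next paragraphs take apart.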
The underlying problem is that these methods conflate what the modality contains with what the model uses.
Suppose the image carries a lot of relevant information, but the model is trained to ignore it because the language signal is almost always sufficient to get the right answer. Perturbing the image won't change the output much. Not because the image is uninformative, but because the model has learned to route around it. The method will correctly report that the image has low influence on this model, but it will be tempting to misread that as the image being low-information. They're not the same thing.
The reverse failure is also possible. Suppose the image and the text are highly redundant: they both encode the same information about what's in the scene. When you perturb the image, the text carries enough redundant signal to maintain the correct prediction. The perturbation method attributes low importance to each modality independently, which makes them look like they don't matter, even though together they're doing everything.
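A minimal toy makes this failure concrete. The `toy_vqa_model` below is invented for illustration: it answers correctly whenever either modality carries the (fully redundant) scene signal. Ablating either modality alone registers zero importance, even though ablating both is catastrophic:

```python
def toy_vqa_model(image_signal, text_signal):
    """Toy model: answers correctly if either modality carries
    the (redundant) scene information."""
    return "correct" if (image_signal or text_signal) else "guess"

def ablation_importance(ablate_image, ablate_text):
    """Accuracy drop from ablating the given modalities, on a toy
    input where both modalities redundantly encode the answer."""
    full = toy_vqa_model(True, True) == "correct"
    ablated = toy_vqa_model(not ablate_image, not ablate_text) == "correct"
    return int(full) - int(ablated)

image_importance = ablation_importance(True, False)   # 0: text covers for it
text_importance  = ablation_importance(False, True)   # 0: image covers for it
joint_importance = ablation_importance(True, True)    # 1: everything collapses
```

Summing the two single-modality scores says "nothing matters"; the joint ablation says "everything does." The method has no vocabulary for redundancy.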
There's a third problem specific to output-based methods: they only tell you about behavioral sensitivity. They say nothing about what's happening inside the model's computation. A model can be internally computing rich visual features and then discarding them at the last layer. Perturbation would correctly report low sensitivity, but the internal story would look completely different. Diagnosing the model's behavior requires seeing inside it, not just watching it perform.
The second family: gradient and attention methods
The next generation of approaches goes deeper, targeting the model's internal computations rather than its outputs.
Gradient-based methods (Integrated Gradients, GradCAM, and related attribution heatmaps) ask: which input features, if changed slightly, would most change the model's internal activations or output logits? The gradient is a measure of local sensitivity, and integrating it along a path from a baseline to the actual input gives a principled attribution score for each input token or patch.
Attention-based methods, such as attention rollout, take a different route: they use the attention weights themselves as a proxy for importance. If the model's attention heads are strongly attending to visual tokens when producing its answer, that's interpreted as evidence that the visual modality is influential.
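As a concrete sketch of the gradient family, here is a minimal Riemann-sum implementation of Integrated Gradients, applied to a hypothetical linear scoring function (the function and its analytic gradient are toy stand-ins, not a real VLM):

```python
def integrated_gradients(f, grad_f, x, baseline, steps=100):
    """Riemann-sum approximation of Integrated Gradients:
    IG_i = (x_i - b_i) * average of df/dx_i along the straight
    path from the baseline b to the input x."""
    n = len(x)
    grad_sums = [0.0] * n
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            grad_sums[i] += g[i]
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sums)]

# Toy scoring function: f(x) = 3*x0 + 0.5*x1
# (x0 ~ an "image feature", x1 ~ a "text feature").
f = lambda x: 3.0 * x[0] + 0.5 * x[1]
grad_f = lambda x: [3.0, 0.5]

attrs = integrated_gradients(f, grad_f, x=[1.0, 1.0], baseline=[0.0, 0.0])
# For a linear f the attributions recover the terms exactly, and they
# sum to f(x) - f(baseline) (the completeness property).
```

For a purely additive function like this one, the per-feature scores are exactly right, which is why the method feels trustworthy. The failures appear only when features interact.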
These methods are richer than perturbation approaches because they're looking inside the model. They can tell you which specific image regions or text tokens are driving the prediction, not just whether the modality as a whole matters.
But they share a critical limitation: they assign credit to each modality independently, as marginal importance scores. The gradient of the output with respect to the image tokens is computed holding the text fixed. The gradient with respect to the text tokens is computed holding the image fixed. These are two separate measurements, and they're treated as two separate contributions.
This is the right way to measure the influence of a single feature in isolation, but it produces a distorted picture in systems where features interact.
When two modalities encode overlapping information, gradient methods will overestimate their total influence: each one looks important because each one's removal would matter, even though they're both saying the same thing. And when information only becomes meaningful through the interaction of both modalities, so that neither the text alone nor the image alone is sufficient to produce the prediction but together they are, gradient methods will undercount both, because neither one individually moves the output much.
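The synergy failure can be made concrete with the classic interaction function, a smooth XOR. The example is a toy, but it shows the blindness exactly: at an average input, both partial derivatives are zero even though the function is driven entirely by the interaction term.

```python
def soft_xor(x):
    """Smooth XOR: f = x0 + x1 - 2*x0*x1. Equals binary XOR on the
    corners of the unit square; depends entirely on the interaction."""
    return x[0] + x[1] - 2.0 * x[0] * x[1]

def grad_soft_xor(x):
    """Analytic partial derivatives of soft_xor."""
    return [1.0 - 2.0 * x[1], 1.0 - 2.0 * x[0]]

# At the "average input" (0.5, 0.5), both partial derivatives vanish:
g = grad_soft_xor([0.5, 0.5])   # [0.0, 0.0]

# ...yet the function is anything but constant on the corners:
outputs = [soft_xor([a, b]) for a in (0.0, 1.0) for b in (0.0, 1.0)]
# corners (0,0), (0,1), (1,0), (1,1) give 0, 1, 1, 0
```

A local-sensitivity probe at that point reports that neither input matters. The computation driving the output lives entirely in the cross term, which no per-input marginal score can see.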
The attention-based variants have an additional problem: attention weights are not attribution scores. A head attending heavily to an image token means the model is routing information through that token, but the information that flows through might be coming from the image, from the text that was used to query the image, from residual connections, from anywhere. High attention to vision tokens is consistent with the model using visual information and consistent with the model using visual tokens as a routing mechanism while the actual computation is linguistic. You cannot tell from the weights alone.
The third family: Shapley-based methods
Shapley values, borrowed from cooperative game theory, were designed precisely for the problem of attributing credit in systems where inputs interact. The original framework handles the case where a coalition of players cooperates to produce an outcome, and you want to fairly distribute the outcome's value across individual players based on their marginal contributions to every possible subset coalition.
Applied to modality attribution, this means: rather than measuring each modality's marginal contribution while holding the other fixed, you measure each modality's average marginal contribution across all possible coalitions, meaning all possible subsets of modalities you could include or exclude. The Shapley value for each modality is the weighted average of how much it adds across every possible configuration.
This is a genuine improvement. It captures some interaction effects that gradient methods miss, because the marginal contribution of a modality in a coalition where both are present can differ from its contribution when it's alone. MM-SHAP, TokenSHAP, PixelSHAP, and MultiSHAP all operate in this family, with varying granularity, from modality-level attribution down to individual visual patch and text token contributions.
The problem is that Shapley values, even when computed correctly, still produce a single scalar per modality. They tell you how much each modality contributes to the outcome, on average, but not how the contribution is structured: whether the modalities are sharing information, whether one is suppressing the other, or whether something genuinely new emerges only from their combination.
More precisely: Shapley values decompose the total value among players. They do not decompose the information structure of the prediction. You can have two modalities with equal Shapley values because they share all the same information (redundancy), or because each uniquely contributes half the prediction (genuine individual contribution), or because neither contributes anything independently but together they produce the full prediction (pure synergy). The Shapley scores look identical in all three cases. The underlying situations are completely different.
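This degeneracy is easy to verify directly. For two "players" the Shapley formula reduces to an average of two marginal contributions, and the three situations above (redundancy, unique halves, pure synergy) correspond to three different value functions that produce identical scores. The coalition values below are illustrative, not from any real model:

```python
def shapley_two_player(v):
    """Exact Shapley values for a two-player game (image, text),
    where v maps each coalition (as a string key) to its value."""
    phi_img = 0.5 * ((v["img"] - v[""]) + (v["img,txt"] - v["txt"]))
    phi_txt = 0.5 * ((v["txt"] - v[""]) + (v["img,txt"] - v["img"]))
    return phi_img, phi_txt

# Three structurally different games:
redundant = {"": 0.0, "img": 1.0, "txt": 1.0, "img,txt": 1.0}  # either suffices
unique    = {"": 0.0, "img": 0.5, "txt": 0.5, "img,txt": 1.0}  # each adds half
synergy   = {"": 0.0, "img": 0.0, "txt": 0.0, "img,txt": 1.0}  # only together

scores = [shapley_two_player(v) for v in (redundant, unique, synergy)]
# All three games yield (0.5, 0.5): identical Shapley values for
# completely different information structures.
```

Two scalars cannot encode three qualitatively different situations. That is not an estimation problem; it is a resolution limit of the framework.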
The structural blind spot: redundancy and synergy
At this point a pattern is visible. All three families of methods share the same fundamental assumption: that modality contributions are additive. The image contributes some amount, the text contributes some amount, and the total is their sum. The methods disagree about how to estimate each term, but they all model the total as a sum. Multimodal fusion, in practice, does not work this way.
Consider what actually happens in a system where two inputs interact. Part of what each modality carries is unique to it: information about the target that the other modality does not contain. Part of what they carry is redundant, meaning the same information is encoded in both, and either one alone would be sufficient to recover it. And there is a third category: information that neither modality carries independently, which only becomes available when both are processed together. This last component is synergy.
These are three structurally different situations, and any attribution framework that conflates them produces scores that are fundamentally uninterpretable.
In a redundant setting, perturbation and gradient methods overestimate combined importance. Removing either modality appears to matter, even though the other could cover for it, and the credited contributions do not add up to a coherent picture. In a synergistic setting, those same methods underestimate what is happening. Neither modality, examined in isolation, seems to do much, so the actual computation driving the prediction becomes invisible to methods that only measure marginal effects.
Any method that attributes credit modality-by-modality, independently, cannot see redundancy or synergy by construction. The framing itself rules it out. Measuring these things requires a framework that treats the joint distribution of both modalities as the fundamental object, rather than analyzing each distribution separately.
The representation problem: outputs versus internals
There's a second, independent problem that compounds everything above.
Most attribution methods (perturbation, gradient, and Shapley alike) are evaluated against the model's output: the final predicted token, the logit distribution, or a downstream accuracy metric. The question they answer is: how does the model's answer change?
But the model's answer is a function of two things: the model's internal computation, and the structure of the dataset. A model that answers correctly because text in the benchmark reliably predicts the correct answer looks identical at the output level to a model that answers correctly because it genuinely reasons over both modalities. Accuracy conflates the two.
The right target for attribution is not the output label but the model's internal fusion representation, the state of the computation at the point where modalities are combined. This is what directly reflects how the model is processing its inputs, independent of whether the dataset happens to reward that computation.
There's a clean information-theoretic justification for this: the Data Processing Inequality. Any information transformation that happens after fusion can only reduce or preserve the information content. If you measure modality contributions at the output, you're measuring something that has already been processed and you can't reconstruct from that what happened inside. Measuring at the internal fusion representation captures what the fusion mechanism actually did, before downstream layers have had a chance to compress or discard it.
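In symbols: if the inputs, the fusion state, and the downstream processing form a Markov chain, any transformation f applied after fusion can only lose information about the inputs, never create it:

```latex
% Markov chain: (X_{img}, X_{txt}) -> Z -> f(Z)
% Data Processing Inequality:
I\big((X_{\text{img}}, X_{\text{txt}});\, f(Z)\big) \;\le\; I\big((X_{\text{img}}, X_{\text{txt}});\, Z\big)
```

Measuring at Z therefore upper-bounds, and more faithfully reflects, what any output-level measurement can see.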
This also means that attribution measured at the internal representation is a property of the model's architecture, not the dataset. The same architecture will have the same internal information structure regardless of whether the dataset is easy or hard, biased or balanced. That's the diagnostic you want.
What a principled metric actually needs
A measurement framework that answers the modality attribution question properly needs to do four things simultaneously, and no existing method does all four:
Decompose uniqueness from redundancy. You need to be able to say: this fraction of the model's prediction is driven by information that only the text carries. This fraction is driven by information that only the image carries. And this fraction comes from information both carry redundantly. Without this decomposition, any "text contribution" score is contaminated by shared information that should be attributed to neither modality alone.
Measure synergy. You need to be able to say: this fraction of the model's prediction is driven by information that neither modality carries independently, information that only emerges when both are processed together. This is what distinguishes a model doing genuine multimodal reasoning from a model that processes two unimodal streams in parallel and concatenates them.
Operate on internal representations, not outputs. Attribution should target the fusion representation, not the label distribution. This decouples model bias from dataset bias and measures what the architecture actually does rather than what the dataset rewards.
Require no auxiliary training. Several recent approaches propose trainable estimators for these quantities, optimization objectives that estimate information-theoretic quantities using learned networks. But a trainable estimator introduces its own inductive biases. The estimator's architecture encodes assumptions about what kind of information matters, and those assumptions contaminate the attribution scores. A principled metric needs to operate inference-only, on fixed representations, without any additional learning.
Why it matters: the 81% problem
The practical implications of this measurement gap show up clearly in the numbers.
One of the results from this paper: SmolVLM-256M, a compact vision-language model, has a text contribution score of 81.39% on GQA. More than four out of every five units of predictive information in its internal representation come from the text stream. The image is contributing less than 19%.
This is a model marketed and trained as a vision-language model, evaluated on a visual reasoning benchmark, producing correct answers, and it is almost entirely not using vision.
No perturbation or gradient method would have surfaced this clearly. The model answers correctly enough on the benchmark that output-based methods would report normal accuracy. Its attention patterns would show it attending to image tokens. Its gradient scores would show some image influence. The architecture looks like a VLM, the outputs look like VLM outputs, and the benchmark score looks like VLM performance.
Only a method that decomposes the internal information structure, separating what the image uniquely contributes from what the text uniquely contributes, can see that almost everything is coming from one side.
What comes next
The framework that makes all of this measurable is called Partial Information Decomposition, or PID. It's a branch of information theory that was developed precisely to handle the situation described above: multiple sources of information, a target variable, and the need to decompose what each source contributes uniquely, what they share redundantly, and what emerges synergistically from their combination.
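Concretely, PID splits the joint mutual information between the two modalities and a target Y into four non-negative pieces, constrained so that each single-modality mutual information is the sum of its unique part and the shared (redundant) part:

```latex
I(X_{\text{img}}, X_{\text{txt}};\, Y) = U_{\text{img}} + U_{\text{txt}} + R + S
% with the per-source consistency equations
I(X_{\text{img}};\, Y) = U_{\text{img}} + R, \qquad
I(X_{\text{txt}};\, Y) = U_{\text{txt}} + R
```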
The next post introduces PID from first principles: what it decomposes, why each component means what it means, and why it provides the kind of principled attribution that all the methods described above fall short of. The math is real but the intuitions are strong, and by the end of it the framework should feel less like a technical construction and more like an obvious tool for the job.
The vocabulary is unique, redundant, and synergistic. Once you have those three words, the problems described in this post have names, and named problems are problems you can measure.