
VLMs Are Mostly Reading, Not Looking -- What the Numbers Say

Across six models and six benchmarks, text contribution exceeds image contribution in every case but one. The numbers vary in structured ways that reveal what actually drives the imbalance -- and it is not scale, not training data, and not the benchmark.

Posts 1 through 4 built the case for why existing attribution methods fail, introduced PID as the replacement, traced the architectural origins of text dominance, and validated the metric under controlled synthetic conditions. This post applies the validated metric to six real vision-language models across six real benchmarks and reads what comes back.

The central finding is not subtle. Across the models and benchmarks tested, text contribution exceeds image contribution in every case but one -- sometimes by a large margin. The most extreme result, SmolVLM-256M on GQA at CT = 81.39%, means that more than four out of every five units of attributable predictive information in that model's internal fusion representation come from the text stream. The image is contributing less than one in five.

But the aggregate finding is not the whole story. The numbers vary in structured, interpretable ways across models, benchmarks, and task types -- and those variations reveal what actually drives the imbalance. This post reads the full table, not just the headline.

One framing note before the numbers: these are not accuracy scores. They are not benchmark rankings. They are measurements of internal information structure -- what each modality uniquely contributes to the model's predictive representation, independent of whether the model gets the right answer. A model can score high on a benchmark and still be heavily biased toward one modality. The PID scores reveal what the benchmark scores conceal.
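To make the CT/CI percentages concrete, here is a minimal sketch of how unique-information atoms could be normalized into the scores read below. The exact normalization in the paper may differ; this version assumes CT and CI split only the uniquely attributable information (text-unique vs. image-unique), so the two always sum to 100.

```python
# Hypothetical sketch: converting PID unique-information atoms into CT/CI.
# Assumption (not the paper's verified formula): CT and CI are the text
# and image shares of *uniquely attributable* information, so CT + CI = 100.

def contribution_scores(u_text: float, u_image: float) -> tuple[float, float]:
    """Return (CT, CI) as percentages of unique attributable information."""
    total_unique = u_text + u_image
    ct = 100.0 * u_text / total_unique
    ci = 100.0 * u_image / total_unique
    return round(ct, 2), round(ci, 2)

# Example: text carries four times the unique information of the image,
# matching the "four out of five units" reading of an ~80% CT score.
print(contribution_scores(4.0, 1.0))  # -> (80.0, 20.0)
```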


The models

The paper tests six VLMs spanning three dimensions of variation that matter for interpretation: fusion architecture, parameter count, and recency of design.

| Model | Fusion type | Parameters | Notes |
|---|---|---|---|
| BLIP | Txt->Img cross-attention | ~3B | Text queries image; oldest architecture in set |
| LLaVA-1.5 | Concatenation | 7B | LLM-backed; CLIP ViT-L vision encoder |
| PaliGemma-3B | Concatenation | 3B | SigLIP vision encoder; high-capacity visual backbone |
| SmolVLM-256M | Concatenation | 256M | Efficiency-focused; ~28x smaller than LLaVA |
| Qwen3-VL-2B | Concatenation | 2B | Modern alignment pipeline |
| Qwen3-VL-8B | Concatenation | 8B | Direct scale comparison to 2B |

The Qwen3-VL pair is the controlled scale comparison: same architecture, same training pipeline, same design philosophy, different parameter count. BLIP is the only model in the set using explicit cross-attention fusion. SmolVLM is the compression extreme. This coverage is what makes the cross-model results interpretable -- you can separate architectural effects from scale effects from training effects.


The benchmarks

The benchmark choice matters for PID because PID measures what the model uses, not what the task requires. A benchmark where most questions are answerable from language priors will show high CT even on a genuinely capable multimodal model. Running across multiple benchmarks with different visual grounding requirements lets you see whether information usage shifts when the task demands it.

VQAv2 requires joint visual-textual reasoning over natural images. The canonical multimodal benchmark. Known to have residual language biases but balanced across answer types.

GQA uses programmatically generated compositional questions over scene graphs. Designed specifically to require genuine visual grounding -- spatial relationships, attribute identification, object counting -- reducing language-only answerability compared to VQAv2.

CLEVR uses synthetic images with 3D objects and geometric relationships. Minimal linguistic ambiguity. Almost every question requires parsing the actual visual layout. No shortcut from language priors is available because the images are artificial and the relationships are precise. The strongest test of visual grounding in the set.

ScienceQA covers multi-subject science questions, many with diagrams. Highly heterogeneous in modality reliance -- some questions require reading a diagram, others are answerable from general knowledge where the image is decorative. The benchmark's mix of image-dependent and text-sufficient questions produces high variance in CT across model types.

FOIL-COCO (original and foiled) is an image-caption verification task. Given an image and a caption, determine whether the caption is accurate. The foiled condition replaces one word in each caption with a plausible but incorrect alternative -- a cat becomes a dog, a red object becomes a blue one. Detecting the substitution requires comparing the textual claim against the image. The original and foiled conditions run in parallel, making the CT difference between them a direct measure of whether the model shifts its information usage when the caption is subtly wrong.


Reading Table 1

| Model | VQAv2 CT/CI | GQA CT/CI | CLEVR CT/CI | ScienceQA CT/CI | FOIL-Orig CT/CI | FOIL-Foil CT/CI |
|---|---|---|---|---|---|---|
| PaliGemma-3B | 53.62/46.48 | 61.44/38.56 | 48.63/51.37 | 53.29/46.71 | 55.10/44.90 | 54.80/45.20 |
| Qwen3-VL-8B | 55.43/44.57 | 64.73/35.26 | 53.80/46.20 | 60.45/39.54 | 58.94/41.06 | 57.32/42.68 |
| LLaVA-1.5-7B | 59.16/40.84 | 71.73/28.26 | 52.40/47.60 | 64.16/35.84 | 58.50/41.50 | 58.70/41.30 |
| BLIP | 55.45/44.55 | 54.58/45.42 | 53.80/46.20 | 52.90/47.10 | 58.94/41.06 | 57.32/42.68 |
| Qwen3-VL-2B | 56.28/43.72 | 62.50/37.50 | 51.38/48.62 | 60.40/39.60 | 54.23/45.77 | 56.92/43.08 |
| SmolVLM-256M | 60.90/39.10 | 81.39/18.61 | 65.18/34.82 | 53.30/46.70 | 69.80/30.20 | 70.10/29.90 |

All values are percentages; CT and CI sum to 100 within rounding.

Text dominance is universal, but not uniform. Every model on every benchmark shows CT > 50%, with one exception: PaliGemma on CLEVR at 48.63/51.37, the only cell in the table where the image modality edges out text. CT ranges from near-parity to extreme -- and that range is structured, not random.

CLEVR consistently produces the most balanced scores. Across all six models, CLEVR produces the lowest CT values. LLaVA-1.5 drops from 71.73% text on GQA to 52.40% on CLEVR. SmolVLM drops from 81.39% on GQA to 65.18% on CLEVR -- still text-dominant, but measurably less so. The task's visual grounding requirement is pulling image contribution up even in architectures that structurally favor text. The model's information usage does respond to task demands, at least partially. The architecture sets a ceiling on how much visual information can be used; the task affects where within that ceiling the model operates.

ScienceQA inflates CT for some models but not others. LLaVA-1.5 shows CT = 64.16% on ScienceQA, higher than its VQAv2 score of 59.16%. Qwen3 models show similar inflation. PaliGemma (53.29%), BLIP (52.90%), and SmolVLM (53.30%) stay near their baselines. ScienceQA's heterogeneous modality mix -- some questions requiring the diagram, others not -- produces this split. Models with stronger language backbones extract more from the text-sufficient subset, inflating CT.

SmolVLM is the persistent outlier. Its GQA score of 81.39% is nearly 10 percentage points above the next highest model on the same benchmark (LLaVA-1.5 at 71.73%). Its FOIL-COCO scores are similarly elevated. Even on CLEVR, where every other model approaches parity, SmolVLM maintains 65.18% text dominance. The pattern holds across every benchmark: the most compressed model shows the most extreme text reliance.
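The benchmark-level structure described above can be read off directly from Table 1. A small sketch over the GQA and CLEVR columns (the CT values are copied from the table; only the aggregation is new):

```python
# CT values (%) from Table 1, keyed by benchmark. Numbers come from the
# table above; the min/max aggregation is the only new computation.
ct = {
    "GQA":   {"PaliGemma-3B": 61.44, "Qwen3-VL-8B": 64.73, "LLaVA-1.5-7B": 71.73,
              "BLIP": 54.58, "Qwen3-VL-2B": 62.50, "SmolVLM-256M": 81.39},
    "CLEVR": {"PaliGemma-3B": 48.63, "Qwen3-VL-8B": 53.80, "LLaVA-1.5-7B": 52.40,
              "BLIP": 53.80, "Qwen3-VL-2B": 51.38, "SmolVLM-256M": 65.18},
}

for bench, scores in ct.items():
    lo, hi = min(scores.values()), max(scores.values())
    print(f"{bench}: CT range {lo:.2f}-{hi:.2f}, spread {hi - lo:.2f}pp")
```

Running this shows GQA spanning 54.58-81.39 while CLEVR compresses to 48.63-65.18, with SmolVLM at the top of both ranges.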


Why scale doesn't fix it

The Qwen3-VL family gives you the cleanest possible comparison. Same architecture, same training pipeline, different parameter count. If scale were the driver of text dominance, you'd expect a measurable CT shift between 2B and 8B.

| Benchmark | Qwen3-2B CT | Qwen3-8B CT | Difference |
|---|---|---|---|
| VQAv2 | 56.28% | 55.43% | -0.85pp |
| GQA | 62.50% | 64.73% | +2.23pp |
| CLEVR | 51.38% | 53.80% | +2.42pp |
| ScienceQA | 60.40% | 60.45% | +0.05pp |
| FOIL-Orig | 54.23% | 58.94% | +4.71pp |
| FOIL-Foil | 56.92% | 57.32% | +0.40pp |

The differences are small and go in both directions. On GQA and CLEVR, the larger model actually shows higher CT. On VQAv2, it is marginally lower. There is no consistent trend toward more balanced contributions as scale increases. The pattern is noise, not signal.

The paper is direct about this: "increased capacity alone is insufficient to mitigate textual over-reliance. Instead, the evidence implicates specific fusion paradigms, particularly feature concatenation and asymmetric text-to-image attention, as the primary drivers of visual information attenuation."

Note

If you are deploying a VLM on a task where visual grounding matters and you observe text dominance, scaling up is not the fix. The Qwen3 comparison shows that four times the parameters produces negligible CT shift. The fix is architectural.


BLIP as the architecture reference point

BLIP consistently shows lower CT than the concatenation-based models at comparable size, across most benchmarks.

| Benchmark | BLIP CT | PaliGemma CT (same size) | LLaVA-1.5 CT (larger) |
|---|---|---|---|
| VQAv2 | 55.45% | 53.62% | 59.16% |
| GQA | 54.58% | 61.44% | 71.73% |
| CLEVR | 53.80% | 48.63% | 52.40% |
| ScienceQA | 52.90% | 53.29% | 64.16% |

On GQA, the gap between BLIP and LLaVA-1.5 is 17 percentage points. Both are being measured on the same benchmark under the same conditions. The difference is not explained by task or dataset. Post 3's controlled ablation predicted exactly this gap, holding encoders and training constant and varying only the fusion mechanism. The real-world results are consistent with that prediction.

The nuance worth noting: PaliGemma also shows relatively balanced scores despite using concatenation. The paper attributes this partly to its SigLIP vision encoder -- a higher-capacity visual backbone produces richer image representations that are harder to ignore even in a concatenation-based architecture. Fusion architecture is the primary driver; vision encoder quality modulates the baseline.


The FOIL-COCO finding

The FOIL-COCO original versus foiled comparison was designed as a stress test for visual-textual integration. The foiled condition introduces a single-word substitution error into each caption -- one that is plausible but factually wrong. Detecting it requires comparing the textual claim against the image.

If a model is genuinely integrating vision and language, its PID profile should shift between the original and foiled conditions. The foiled caption creates a conflict between text and image. A model attending to both should detect that conflict, and the detection should show up as increased image contribution.

| Model | FOIL-Orig CT | FOIL-Foil CT | Shift |
|---|---|---|---|
| PaliGemma-3B | 55.10% | 54.80% | -0.30pp |
| Qwen3-VL-8B | 58.94% | 57.32% | -1.62pp |
| LLaVA-1.5-7B | 58.50% | 58.70% | +0.20pp |
| SmolVLM-256M | 69.80% | 70.10% | +0.30pp |

Almost no shift. The internal information structure is nearly identical between the two conditions. Models are not meaningfully increasing their reliance on image evidence when the caption is subtly wrong. They process the foiled captions almost the same way they process the correct ones, at the level of what each modality contributes to the internal representation.

This is a precise characterization of a widely observed failure mode. VLMs often handle obvious image-caption contradictions but are insensitive to subtle ones. The PID scores locate where the failure occurs: the internal information structure does not change when the subtle conflict exists. The model is not using more image evidence because its fusion mechanism does not route more image evidence when the text is subtly wrong.
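The expectation being tested can be stated as a one-line check: a genuinely integrating model should show CT dropping (image reliance rising) in the foiled condition. A sketch over the shift table's values (copied from above; only the direction check is new):

```python
# CT (%) in the original and foiled FOIL-COCO conditions, from the table
# above. CT falling in the foiled condition would mean the model leans
# harder on image evidence when the caption is subtly wrong.
foil_ct = {
    "PaliGemma-3B": (55.10, 54.80),
    "Qwen3-VL-8B":  (58.94, 57.32),
    "LLaVA-1.5-7B": (58.50, 58.70),
    "SmolVLM-256M": (69.80, 70.10),
}

for model, (orig, foiled) in foil_ct.items():
    shift = foiled - orig
    direction = "toward image" if shift < 0 else "toward text"
    print(f"{model}: {shift:+.2f}pp ({direction})")
```

Every shift is under two percentage points, and two of the four go the wrong way entirely.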


When the imbalance develops: layerwise dynamics

Figure 3 in the paper tracks layer-to-layer gradients of PID components from mean-pooled hidden states across three architectures on GQA. The gradient at each layer measures how much each PID component is changing as information propagates through the network.

Unique text information (U1) shows substantial fluctuations across all three models, particularly pronounced in SmolVLM. Text's unique contribution to the fusion representation keeps evolving through later layers. Linguistic information is being processed, refined, and accumulated throughout the network.

Unique image information (U2) is consistently flatter. SmolVLM and Qwen3-VL-2B show near-flat gradients early -- visual information stops accumulating quickly and stabilizes at a lower level. PaliGemma is the exception: periodic spikes in U2 gradients suggest continued visual processing in some layer groups, consistent with its stronger vision encoder and relatively lower CT scores.

Redundancy (R) and synergy (S) show near-zero gradients throughout, across all three models. Neither the shared signal nor the cross-modal interaction signal grows meaningfully at any layer.

The picture this produces: text and image representations are processed largely in parallel through the network. The linguistic stream keeps evolving and accumulating unique signal; the visual stream levels off early. The merge happens at the end -- but because synergy never grew, what gets merged is two streams that were processed mostly independently.
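The layerwise-gradient analysis described above reduces to first differences of each PID component across layers. A minimal sketch (the paper's estimation details are assumptions here, and the trajectories below are made-up values chosen only to match the qualitative pattern):

```python
# Sketch of the layerwise-gradient analysis: given per-layer PID component
# estimates from mean-pooled hidden states, the "gradient" at layer l is
# the change from layer l-1 to layer l.

def layer_gradients(per_layer: list[float]) -> list[float]:
    """First differences of a PID component across layers."""
    return [b - a for a, b in zip(per_layer, per_layer[1:])]

# Illustrative (invented) trajectories matching the qualitative finding:
u_text  = [0.10, 0.18, 0.15, 0.25, 0.22, 0.31]  # keeps fluctuating late
u_image = [0.08, 0.11, 0.12, 0.12, 0.12, 0.12]  # stabilizes early
synergy = [0.01, 0.01, 0.01, 0.01, 0.01, 0.01]  # never grows

print(layer_gradients(u_text))   # large, sign-changing gradients
print(layer_gradients(synergy))  # zero throughout
```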


Why the imbalance is a synergy failure, not just a contribution gap

The CT/CI scores measure unique information -- how much each modality contributes independently. But the paper also tracks synergy (S) and redundancy (R), and their values complete the diagnosis.

Synergy is near-zero across all models and all benchmarks. The sigma values -- synergy as a fraction of total mutual information -- are small throughout. This is what elevates text dominance from a quantitative observation to a structural one.

Two explanations are consistent with high CT:

Explanation A. The model is doing multimodal reasoning, but text happens to carry more uniquely relevant information for these tasks. The image is contributing meaningfully, just less so. The imbalance reflects the tasks' statistical structure.

Explanation B. The model is processing two streams largely in parallel and defaulting to whichever carries more unique signal. Cross-modal combination is not occurring. The imbalance reflects a structural failure to integrate.

High CT alone is consistent with both. Near-zero synergy distinguishes them. If the model were genuinely integrating modalities -- creating information from their combination that neither carries alone -- synergy would be non-zero. The flat S lines in Figure 3 and the low sigma values across Table 1 mean the models are not doing this. They default to the stronger stream, which is almost always text, without building anything from the combination.

The paper's framing of this finding: "language dominance arises primarily from a deficit in synergistic information, where models underutilize information that is only accessible through joint visual-textual reasoning."
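The diagnostic separating the two explanations can be sketched as a single ratio. Here sigma is synergy as a fraction of total mutual information, using the standard PID identity I_total = U_text + U_image + R + S; the component values below are illustrative, not measurements from the paper.

```python
# sigma = S / I_total: synergy's share of total mutual information.
# Decomposition follows the PID identity I_total = U1 + U2 + R + S.

def sigma(u_text: float, u_image: float, r: float, s: float) -> float:
    total = u_text + u_image + r + s
    return s / total

# Explanation A profile (invented): text-dominant but genuinely
# integrating -- a sizeable fraction of information is synergistic.
print(sigma(u_text=0.40, u_image=0.15, r=0.15, s=0.30))

# Explanation B profile (invented): text-dominant, parallel streams,
# almost no information created by the combination.
print(sigma(u_text=0.55, u_image=0.25, r=0.18, s=0.02))
```

High CT appears in both profiles; only sigma tells them apart, and the observed models sit in the second regime.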


What the results establish

Text dominance is present across every model family, training pipeline, and parameter count tested. Its degree is determined primarily by fusion architecture. BLIP's cross-attention mechanism produces lower CT than concatenation-based models at comparable size. Scaling Qwen3 from 2B to 8B produces negligible CT shift. SmolVLM's extreme GQA score reflects both its compression and its concatenation-based architecture.

Task demand modulates the imbalance but cannot eliminate it. CLEVR pulls CT down for all models. Even so, most models still show text dominance on CLEVR, and SmolVLM stays at 65.18%.

The FOIL-COCO results show that even when the task specifically requires visual verification of textual claims, the internal information structure does not adapt. The layerwise dynamics show that visual information stabilizes early while linguistic information keeps accumulating, and that cross-modal synergy stays flat throughout. Deeper processing does not resolve the imbalance.

All of this is consistent with the architecture ablation in Post 3: the bottleneck is the fusion mechanism, and scale and data alone cannot fix what the architecture constrains.


What comes next

The results in Table 1 establish the headline. The perturbation experiments extend it by asking a targeted question: if you deliberately degrade one modality's input, does the model's CT shift in the right direction? When you inject Gaussian noise into the image, does CT go up? When you replace the text with random content, does CI go up?

These perturbation results serve two purposes: they validate the metric's sensitivity on real data, complementing the synthetic validation from Post 4, and they reveal something specific about how different architectures compensate when one modality is degraded. The compensation pattern is strong in some architectures and nearly absent in others -- and the pattern is not what the CT/CI scores alone would predict.
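The perturbation setup can be sketched in a few lines. The pipeline details here are assumptions, not the paper's code: degrade one modality's input, re-estimate CT/CI on the degraded pair, and check whether the shift goes in the expected direction.

```python
import random

# Sketch of the two perturbations described above (implementation details
# are assumptions). Token shuffling stands in for "replace the text with
# random content"; the paper's exact corruption may differ.

def add_gaussian_noise(pixels: list[float], sigma: float = 0.5) -> list[float]:
    """Inject Gaussian noise into a flattened image, clamped to [0, 1]."""
    return [min(1.0, max(0.0, p + random.gauss(0.0, sigma))) for p in pixels]

def scramble_text(tokens: list[str]) -> list[str]:
    """Degrade the text stream by randomly reordering its tokens."""
    scrambled = tokens[:]
    random.shuffle(scrambled)
    return scrambled

# Expected directions (not guaranteed magnitudes):
#   noisy image    -> image stream less informative -> CT should rise
#   scrambled text -> text stream less informative  -> CI should rise
```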

This post is part of a broader project on quantifying modality contributions in vision-language models using Partial Information Decomposition. The full framework decomposes internal representations into unique, redundant, and synergistic components to derive principled attribution scores.