Fusion Architecture Determines What a VLM Pays Attention To
When you hold encoders, training, and task constant and vary only the fusion mechanism, text contribution shifts by up to 17 percentage points. Scale produces nothing comparable. The wiring between modalities shapes information flow before training starts.
Posts 1 and 2 established the diagnostic framework: how PID decomposes modality contributions into unique, redundant, and synergistic components, and what those scores look like when applied to real models across six benchmarks. The consistent finding across all models was text dominance, with image contribution ranging from around 19% to 51% depending on the model and task. This post turns to the more specific question those results raise. Given that the Qwen3-VL models show nearly identical PID profiles at 2B and 8B parameters, and given that every concatenation-based model in the study leans text-dominant regardless of scale, where does the imbalance actually originate?
The paper's ablation answers this directly. By holding the text encoder (RoBERTa), vision encoder (ViT), training procedure, and evaluation task (VQAv2) constant, and varying only the fusion mechanism, the experiment isolates architecture's contribution to the PID profile. The results show that fusion architecture significantly impacts the distribution of unique, redundant, and synergistic information. The mechanism that wires the two modalities together determines which information flows through and which does not, before any training has occurred.
What "fusion" means in a VLM
A VLM takes two inputs and produces a prediction. The image goes through a vision encoder; the text goes through a language encoder. At some point, these two representations have to interact to produce a shared predictive signal. The fusion mechanism is where and how that interaction happens.
From an information-theoretic standpoint, the fusion mechanism is the bottleneck through which visual information either reaches the model's internal predictive representation or gets discarded. A high-quality vision encoder can produce a rich, detailed patch-level representation of an image. But if the fusion mechanism does not create pathways for that representation to influence the final prediction, the information it contains will not survive into the output. High text contribution in PID scores can mean a weak vision encoder, but it can equally mean a fusion mechanism that structurally constrains how much visual information can enter the shared representation.
This distinction matters for diagnosis. If text dominance is a vision encoder problem, the fix is better visual features. If it is a fusion problem, better visual features will not help -- they will be filtered out at the same point they always were.
The four fusion strategies
The ablation evaluates four mechanisms. Walking through each one mechanically before looking at the results makes the PID scores easier to interpret.
Img->Txt cross-attention. Image features attend to text. The visual representation uses text tokens as keys and values: it queries the linguistic context to decide which parts of the text are relevant given what the image contains. The text representation is not reshaped by this operation; only the image representation is. The image features that survive into the shared representation are filtered through what the text made relevant, but the image is the active agent: it selects from the text, not the other way around.
Txt->Img cross-attention. Text features attend to image. The linguistic representation uses image tokens as keys and values. Text drives the query; image provides the context. This is the architecture of the BLIP family. The text representation selects which visual features to incorporate, but the selection is driven by what the question asks, not by what the image distinctively contains. Visual features that the question does not anticipate cannot be retrieved, because the query that would retrieve them was never formed. The image can only contribute what the text thought to ask about.
Concatenation. Image tokens and text tokens are concatenated into a single sequence and processed by the transformer's standard self-attention. No explicit cross-modal attention mechanism is imposed. This is the architecture of LLaVA-1.5, Qwen-VL, SmolVLM, and most large modern VLMs.
In principle, self-attention allows any token to attend to any other, so visual tokens could influence text tokens and vice versa. In practice, two things work against visual information here. First, the language model backbone is pretrained on text -- its attention weights are calibrated to find meaningful relationships between text tokens. Visual tokens are structurally foreign to that learned space: the model can compute attention over them, but its weights were not optimized to extract meaningful signals from them. Second, token counts introduce a setting-dependent skew -- when the text sequence is long relative to the visual tokens, the numerical imbalance further reinforces the text-leaning signal already encoded in the pretrained weights.
Bidirectional fusion. Symmetric cross-attention in both directions simultaneously. Each modality attends to the other; each modality's representation is reshaped by what the other makes relevant. Neither modality controls the query unilaterally.
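The four mechanisms can be sketched as single-head attention in plain numpy. This is a minimal illustration of the wiring only -- real models add learned Q/K/V projections, multiple heads, residual connections, and layer norm, all omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, d):
    # Scaled dot-product attention: the query modality decides relevance,
    # the context modality supplies the content (keys and values).
    scores = query_feats @ context_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context_feats

rng = np.random.default_rng(0)
d = 16
img = rng.standard_normal((9, d))   # 9 visual patch tokens
txt = rng.standard_normal((5, d))   # 5 text tokens

# Img->Txt: image queries, text supplies keys/values.
# Only the image representation is reshaped.
img_to_txt = cross_attention(img, txt, d)      # shape (9, d)

# Txt->Img: text queries, image supplies keys/values (BLIP-style).
# Only the text representation is reshaped.
txt_to_img = cross_attention(txt, img, d)      # shape (5, d)

# Concatenation: one sequence, ordinary self-attention over all 14 tokens.
# No explicit cross-modal structure is imposed.
seq = np.concatenate([img, txt], axis=0)
concat_fused = cross_attention(seq, seq, d)    # shape (14, d)

# Bidirectional: both directions at once; each modality is reshaped
# by what the other makes relevant.
bi = np.concatenate([cross_attention(img, txt, d),
                     cross_attention(txt, img, d)], axis=0)  # shape (14, d)
```

Note that concatenation and bidirectional fusion touch the same tokens but impose different structure: self-attention lets every token compete for every other token's attention in one undifferentiated pool, while bidirectional fusion forces each modality's queries to land on the other modality.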
What the PID scores reveal
| Fusion Method | CI (image contribution, %) | CT (text contribution, %) |
|---|---|---|
| Img->Txt | 55.3 | 44.7 |
| Txt->Img | 38.5 | 61.5 |
| Concatenation | 45.6 | 54.4 |
| Bidirectional Fusion | 54.7 | 45.3 |
Txt->Img cross-attention produces the highest text contribution at 61.5%. Img->Txt and bidirectional fusion both produce image-leaning or near-parity results, at 55.3% and 54.7% image contribution respectively. Concatenation falls between the two groups but on the text-dominant side.
The concatenation result is worth examining carefully. Placing image tokens and text tokens in the same sequence does not produce balanced contributions even though, in principle, every token can attend to every other token. As the paper states directly: "balanced contributions are not solely a consequence of combining modalities, but depend on the presence of structured cross-modal interactions." The modalities being present in the same sequence is a necessary condition for interaction but not a sufficient one.
Why the query direction determines which modality dominates
In cross-attention, the modality that drives the query controls what information gets retrieved from the other. The query determines relevance. Whatever is in the key-value store can only contribute what the query asks for.
In Txt->Img attention, text drives the query. The linguistic representation determines which visual features are relevant. Visual information that is not anticipated by the question cannot be retrieved. The image can only contribute what the text has already decided to look for. This produces CT = 61.5% even under the controlled ablation conditions, where BLIP's training data advantages are stripped away. It is an architectural property of the mechanism, not an artifact of pretraining.
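The retrieval asymmetry can be made concrete with a toy numerical example. The key directions, values, and query below are all invented for illustration: a visual feature whose key direction no text-formed query aligns with receives negligible attention weight, no matter how strongly it is represented in the image.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Two visual features with distinguishable key directions (hypothetical).
k_color = np.array([10.0, 0.0])   # retrievable by "what color" queries
k_count = np.array([0.0, 10.0])   # retrievable by "how many" queries
keys = np.stack([k_color, k_count])
values = np.stack([np.array([1.0, 0.0]),    # color content
                   np.array([0.0, 1.0])])   # count content

# Text-driven query: the question asks about color, so the query
# is formed along the color direction only.
q_text = np.array([1.0, 0.0])

weights = softmax(keys @ q_text)
retrieved = weights @ values
# The count feature is present in the image representation but gets
# near-zero attention weight: no query was formed along its key
# direction, so its value barely survives into the output.
```

Scaling the model does not change this: a larger backbone forms better queries for what the text asks about, but a query that is never formed still retrieves nothing.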
In Img->Txt attention, image drives the query. The visual representation determines which linguistic features are relevant. The image selects from the text rather than the text selecting from the image. This inverts the information hierarchy, and it shows up directly in the PID scores: CT drops to 44.7%, below parity.
For concatenation, the mechanism is different but the outcome is similar. The language model backbone's pretrained attention weights were calibrated on text tokens; visual tokens are an input type the backbone never saw during pretraining. When visual tokens are prepended to the sequence, the model attends to them, but the learned attention patterns that decide which signals are meaningful were never optimized for visual features. Text dominance in concatenation-based models is a consequence of pretraining mismatch rather than any explicit architectural gate.
Connecting the ablation to the real-world results
BLIP uses Txt->Img cross-attention. Every other model in the paper's main analysis -- LLaVA-1.5, PaliGemma, SmolVLM, Qwen3-VL-2B, Qwen3-VL-8B -- uses concatenation-based fusion. The ablation predicts a measurable gap between BLIP and the concatenation-based models, and Table 1 shows it.
On GQA, BLIP's PID profile is 54.58% text / 45.42% image. LLaVA-1.5's is 71.73% text / 28.26% image. That is a 17-point difference in text contribution between two models whose fusion architectures occupy different positions in the ablation table. The standard explanation for a gap this size would point to model scale (BLIP at approximately 3B parameters versus LLaVA-1.5 at 7B) or to differences in pretraining data. The ablation separates those factors out. At identical encoder sizes, identical training procedures, and identical tasks, the fusion mechanism alone shifts the PID profile by a comparable magnitude.
The scale comparison within the Qwen3-VL family makes this precise. Qwen3-VL-2B shows CT = 62.50% on GQA. Qwen3-VL-8B shows CT = 64.73%. A four-fold increase in parameter count moves text contribution by 2.23 percentage points. Both models use concatenation-based fusion. The architecture did not change; neither did the PID profile in any meaningful way.
Note: The bottleneck is architectural, not parametric. Scaling within a concatenation-based fusion paradigm does not shift where the information ceiling sits.
What synergy adds to the picture
The CT/CI scores measure unique information: how much each modality contributes independently to the predictive representation. But the paper also tracks synergy and redundancy layer by layer, and Figure 3 in the paper shows a consistent pattern across all three models examined: redundancy and synergy stay flat and near zero throughout the model's layers.
This connects directly to the fusion mechanism findings. Bidirectional fusion -- which produces the most balanced unique contributions in the ablation -- is also the architecture most structurally capable of generating synergy. When each modality reshapes the other's representation before the prediction is formed, there is a pathway for information that exists only in the combination to emerge. When only one modality drives the query, or when modalities share a sequence without structured cross-modal gating, that pathway is absent. The fusion point becomes a merge operation: two streams combining at prediction time rather than generating something new through their interaction.
The near-zero synergy result means models are not producing genuinely cross-modal computations at any layer. Visual and linguistic features are processed in parallel and combined at the end. Whether or not the final prediction is accurate, the internal computation is not doing what "multimodal reasoning" is supposed to mean.
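To make "synergy" concrete, here is a toy PID computation on XOR, where all the predictive information is synergistic by construction: neither input alone tells you anything about the output, but together they determine it exactly. This sketch uses the simple minimum-mutual-information redundancy measure for clarity; estimating PID on continuous internal representations, as the paper does, is considerably more involved.

```python
from itertools import product
from math import log2

# Joint distribution for Y = X1 XOR X2 with uniform binary inputs.
p = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def marg(keep):
    # Marginalize the joint over the variables where keep[i] is False.
    out = {}
    for (x1, x2, y), pr in p.items():
        k = tuple(v for v, f in zip((x1, x2, y), keep) if f)
        out[k] = out.get(k, 0.0) + pr
    return out

def mi(keep_x):
    # Mutual information I(X_keep ; Y) in bits.
    px = marg(keep_x + (False,))
    py = marg((False, False, True))
    pxy = marg(keep_x + (True,))
    return sum(pr * log2(pr / (px[k[:-1]] * py[k[-1:]]))
               for k, pr in pxy.items() if pr > 0)

i1 = mi((True, False))   # I(X1;Y) = 0: X1 alone predicts nothing
i2 = mi((False, True))   # I(X2;Y) = 0: X2 alone predicts nothing
i12 = mi((True, True))   # I(X1,X2;Y) = 1 bit: together they determine Y

redundancy = min(i1, i2)                       # MMI redundancy measure
unique1, unique2 = i1 - redundancy, i2 - redundancy
synergy = i12 - unique1 - unique2 - redundancy  # all 1 bit is synergy
```

A VLM whose internal layers showed meaningful synergy would be computing something like this: predictive information that exists only in the image-text combination. The near-zero curves in Figure 3 say that is not happening.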
What follows from this for model design
Deploying a larger version of the same concatenation-based architecture will not change its PID profile. The Qwen3-VL pair demonstrates this. Training on more image-text data will not change it either, if the fusion mechanism routes visual information through a linguistic filter or relies on a text-pretrained backbone to attend to visual tokens without adapted attention weights.
The tasks where this matters most are tasks where the relevant visual information is not something the text input would naturally anticipate: spatial reasoning over complex scenes, fine-grained visual discrimination, detecting contradictions between an image and its caption. These are exactly the cases where a fusion mechanism that routes image information through a linguistic query will systematically underperform, because the query will not be formed for visual features that the text did not describe.
For tasks where the text is already sufficient -- factual recall, tasks where the image mostly corroborates what the question implies -- text dominance is harmless. The architecture's consequences depend on what the task actually requires from the image.
The design choices that measurably shift PID profiles, based on the ablation, are:
- allowing image features to drive cross-attention queries over the text stream rather than only the reverse
- building explicit structured cross-modal pathways rather than relying on backbone self-attention to handle the cross-modal signal
- ensuring the fusion mechanism creates opportunities for visual features to reshape linguistic representations before the prediction is formed
What comes next
The ablation study isolates what scale and data comparisons cannot: architecture's independent contribution to the PID profile. When encoders, training, and task are held fixed and only the fusion mechanism varies, text contribution shifts by up to 17 percentage points. That separation establishes where architectural intervention can actually make a difference.
What the framework has not addressed yet is whether the PID measurements themselves are trustworthy. Reporting that Txt->Img cross-attention produces CT = 61.5% is only meaningful if the metric recovers the right values. For something as abstract as "unique information in internal representations," there is no immediately obvious ground truth to validate against.
The next post covers the synthetic validation experiments the paper runs before applying PID to real VLMs. With controlled inputs where the true modality contributions are analytically known, the experiment checks whether the metric recovers what it claims to recover. That verification is what makes the real-world results in Table 1 and Table 3 worth trusting.