
Testing a Metric Before Trusting It

Before applying PID to real VLMs, the paper constructs inputs where the true modality contributions are analytically derivable and checks whether the metric recovers them. This post is about why that step is necessary and what it actually shows.

Posts 1 through 3 built up the problem, the diagnostic framework, and the architectural findings. At this point a careful reader should have a reasonable objection: how do you know the PID metric is actually measuring what it claims to measure?

This is not a small question. The pipeline involves approximations at every stage: clustering continuous embeddings into finite discrete bins, solving a constrained optimization problem iteratively, normalizing the result into a contribution score. Any of those steps could introduce systematic errors. A metric that produces plausible-looking numbers on real models is not validated. Plausibility is easy to achieve. You need a setting where the correct answer is known in advance, and you check whether the metric recovers it.

That is what the synthetic experiments do. Before applying PID to BLIP, LLaVA, PaliGemma, or any real benchmark, the paper constructs controlled inputs where the true modality contributions are mathematically derivable, and verifies that the metric returns the right values. This post covers why that validation step matters, how it is designed, and what it shows.


Why real data cannot validate the metric

On real VLMs and real benchmarks, you never know the true modality contribution. That is the entire problem the paper is trying to solve. If you already knew how much each modality contributed to each prediction, you would not need a metric. So real-world performance cannot validate the metric. The metric produces a number, the number looks reasonable, and you have no way to check whether it is correct, because correctness requires knowing the answer the metric is supposed to provide.

The only way out of this is a setting where the ground truth is not a matter of empirical measurement but of mathematical construction. If you build a synthetic system where modality X1 contributes exactly 10x more than modality X2 by design, you can derive what the PID scores should be analytically, run the metric, and check whether it recovers those values. If it does, consistently, across a range of different contribution structures, you have earned the right to trust it on real data where the ground truth is unknown.

This is standard practice in metric validation across quantitative fields. The specific challenge here is applying it to a metric that is measuring something as abstract as "unique information in internal representations."


The experimental setup

The paper uses two independent Gaussian vectors, X1 and X2, each drawn from N(0, I_d). They are independent by construction. This matters because it means any redundancy or synergy that appears in the PID output is a consequence of the fusion rule applied to them, not of any built-in correlation between the sources.

The target Y is generated from X1 and X2 under five predefined fusion rules:

| Rule | Formula | Expected CI | Expected CT |
| --- | --- | --- | --- |
| Add | Y = X1 + X2 | ~50% | ~50% |
| Multiply | Y = X1 × X2 | ~50% | ~50% |
| Weighted 10 | Y = X1 + 10·X2 | High | Low |
| Weighted 100 | Y = X1 + 100·X2 | Very high | Very low |
| Only one input | Y = X2 | ~100% | ~0% |

These rules cover the cases the metric needs to handle. The equal-weight additive rule tests balanced contribution. Multiply tests whether the metric handles nonlinear interaction correctly. The weighted rules test whether the metric is sensitive to degrees of dominance, not just its direction. The single-input case tests the degenerate boundary where one modality carries zero predictive information.
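The setup is simple enough to reproduce in a few lines. A minimal sketch of the input generation, assuming NumPy (the variable and key names are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 8

# Two independent Gaussian sources, each drawn from N(0, I_d).
X1 = rng.standard_normal((N, d))
X2 = rng.standard_normal((N, d))

# The five fusion rules from the table above.
fusion_rules = {
    "add":          X1 + X2,
    "multiply":     X1 * X2,          # elementwise product
    "weighted_10":  X1 + 10 * X2,
    "weighted_100": X1 + 100 * X2,
    "only_one":     X2.copy(),        # X1 carries no information about Y
}
```

Because X1 and X2 are sampled independently, any redundancy or synergy the metric reports on these targets must come from the fusion rule itself.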

For each rule, you can derive from first principles what the unique information U1 and U2 should be. Add and multiply are symmetric in X1 and X2, so U1 = U2 by construction. Weighted-10 produces U2 > U1, with the ratio reflecting the weighting. Weighted-100 pushes U2 toward the maximum and U1 toward zero. Only-one-input gives U1 = 0 exactly. These are not estimates. They follow from the construction of the fusion rule, which is what makes them valid as ground truth.
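The expected direction of the weighted cases can be sanity-checked with elementary variance arithmetic. For Y = X1 + w·X2 with independent unit-variance sources, X2 accounts for a w²/(1 + w²) share of Var(Y): 0.5 at w = 1, about 0.990 at w = 10, about 0.9999 at w = 100. The squared correlation below is a proxy for dominance, not the PID unique information itself — a rough sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal(200_000)
x2 = rng.standard_normal(200_000)

def x2_share(w):
    """Squared correlation between X2 and Y = X1 + w*X2."""
    y = x1 + w * x2
    r = np.corrcoef(x2, y)[0, 1]
    return r ** 2

# Analytically: w^2 / (1 + w^2)  ->  0.5, ~0.990, ~0.9999
shares = {w: x2_share(w) for w in (1, 10, 100)}
```

The gradual convergence toward total dominance as w grows is exactly the monotonic behavior the metric is later checked against.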


What the metric recovers

The results in Figure 2 of the paper show that the metric recovers the expected pattern across all five fusion rules.

Add and multiply both return approximately equal contributions, around 50/50, despite being structurally different operations. This is correct: both rules are symmetric in X1 and X2, so both should produce equal unique contributions regardless of the nonlinearity.

Weighted-10 shifts the contribution toward X2, but X1 still registers a non-negligible unique contribution. The metric does not snap to 100/0. This is a meaningful test. A metric that can only distinguish "X1 dominates" from "X2 dominates" without sensing intermediate degrees of dominance would be too coarse to be useful on real VLMs, where dominance is always partial.

Weighted-100 pushes X2's contribution toward 100% and X1's toward zero. The convergence is gradual and monotonic as the weight increases from 10 to 100, which is the correct behavior. The metric tracks the gradient between the extremes, not just the endpoints.

Only-one-input returns the degenerate case correctly. X1 contributes zero, and the metric returns zero (within numerical tolerance) for Dependence1 and assigns all attributable information to Dependence2. Getting the boundary cases right is necessary, but what earns genuine confidence is that the metric tracks the full monotonic path between them.

Note

"The results align with expectations: additive and multiplicative rules yield approximately equal contributions, while increasing weights shift dominance toward the corresponding modality." -- the paper, on Figure 2


The bitwise operator experiments

The Gaussian fusion rules test whether the metric gets the quantities right. The bitwise operator experiments test a different property: whether the metric correctly identifies the qualitative structure of information sharing, specifically the distinction between redundancy and synergy.

Binary inputs give you precise control. For binary variables, PID components can be computed directly from the truth table, so the expected decomposition is analytically exact rather than approximate.

AND. Output is 1 only when both inputs are 1. Each input is partially informative alone -- knowing X1 = 0 tells you Y = 0 -- but neither alone fully determines Y when X1 = 1. This creates a blend of unique and synergistic information. The metric should detect that the combination carries more than either modality alone.

XOR. Output is 1 when exactly one input is 1. This is the canonical case of pure synergy. Knowing X1 alone tells you nothing about Y, because for any value of X1, Y is equally likely to be 0 or 1 depending on X2. Knowing X2 alone is equally uninformative. Only knowing both determines the output. XOR should produce near-zero unique information for both modalities and a high synergy component S.

OR. Output is 1 when at least one input is 1. Each input is individually informative about Y. There is also redundancy -- both carry overlapping information about when Y = 1. The metric should detect individual contributions alongside the redundant overlap.

XOR is the stress test for any attribution metric. Every method that measures modality contributions as marginal importance scores -- perturbation, gradient, Shapley -- returns near-zero for both modalities on XOR inputs. Each modality looks unimportant when examined alone, even though their combination determines everything. PID's synergy component S is designed specifically to capture this. If the metric returns high S on XOR inputs alongside near-zero U1 and U2, it has correctly identified that neither modality alone carries the answer. A metric that cannot detect synergy will report that neither modality matters on XOR -- which is precisely wrong.
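The failure mode is easy to exhibit with plain mutual information, which is what marginal importance scores reduce to on a truth table. A self-contained sketch (not the paper's code) that computes I(X1; Y), I(X2; Y), and I(X1, X2; Y) for each operator over uniform binary inputs:

```python
from collections import Counter
from math import log2

def mutual_info(pairs):
    """I(A; B) in bits, for a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(c / n * log2((c / n) / (p_a[a] / n * p_b[b] / n))
               for (a, b), c in p_ab.items())

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]  # uniform over (X1, X2)
results = {}
for name, op in [("AND", lambda a, b: a & b),
                 ("XOR", lambda a, b: a ^ b),
                 ("OR",  lambda a, b: a | b)]:
    ys = [op(a, b) for a, b in inputs]
    results[name] = (
        mutual_info([(a, y) for (a, _), y in zip(inputs, ys)]),  # I(X1; Y)
        mutual_info([(b, y) for (_, b), y in zip(inputs, ys)]),  # I(X2; Y)
        mutual_info(list(zip(inputs, ys))),                      # I(X1, X2; Y)
    )
```

For XOR, both marginal terms come out exactly 0 while the joint term is 1 bit: each modality alone is useless, yet together they determine Y. That gap is precisely what PID's synergy component S is built to attribute.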


What the validation actually covers in the pipeline

The synthetic experiment validates the full end-to-end pipeline, not a simplified version of it. The Gaussian vectors are continuous, just like real encoder embeddings, so they go through the same K-means discretization step. The same KL-projection optimization runs on them. The same normalization produces the final CT/CI scores.

On the discretization step. K-means clustering maps continuous embeddings into k discrete bins, where k = N^(1/3) for a dataset of N samples. This is an approximation. A poorly chosen cluster count could merge distinct distributions or over-segment identical ones, flattening or amplifying the contribution signal. The fact that the metric recovers correct results on Gaussian inputs -- which are well-conditioned continuous distributions -- is evidence that the approximation is not catastrophically lossy under the paper's hyperparameters.
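The binning step itself is a one-liner. A sketch assuming scikit-learn's KMeans — the paper specifies K-means and the cube-root rule but not the implementation, so `discretize` and its signature are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def discretize(Z, seed=0):
    """Quantize continuous embeddings Z of shape (N, d) into k discrete bins."""
    N = Z.shape[0]
    k = max(2, round(N ** (1 / 3)))  # cube-root rule: k = N^(1/3)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    return labels, k

rng = np.random.default_rng(0)
labels, k = discretize(rng.standard_normal((1000, 16)))
# N = 1000 gives k = 10 bins per variable
```

Each of X1, X2, and Y gets its own clustering, and the resulting label sequences define the discrete joint distribution that the KL-projection solver consumes.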

On the KL-projection solver. The optimization in equation 6 is solved iteratively with CVXPY for 100 iterations. The solver could fail to converge, converge to a local optimum, or be sensitive to initialization. The synthetic results are clean enough to indicate the solver behaves correctly on tractable problem instances.

What the validation does not cover. Real VLM embeddings are products of billions of gradient updates on complex datasets. Their geometry may be anisotropic, low-rank, or clustered in ways the Gaussian case does not probe. The synthetic experiments confirm the mechanism works in principle; they do not guarantee it handles every real-world embedding geometry. The perturbation ablations in a later post extend this by testing sensitivity to design choices on real data -- cluster count, input degradation, fusion architecture effects.


Implementation decisions

The practical choices the paper makes are worth stating explicitly, since they affect reproducibility and extension.

  • Cluster count: cube root of sample size, applied independently to X1, X2, and Y
  • Solver: CVXPY, 100 iterations
  • Numerical tolerance: ε = 10^-6 in the contribution metric, to prevent division by near-zero when unique information is very small
  • Model temperature: set to 0 for all real-model experiments, removing output stochasticity that would degrade embedding distributions
  • Prompt format: standardized instruction to answer with only the letter of the correct option, minimizing prompt-induced variance across models
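The role of the numerical tolerance is worth making concrete. The paper's exact contribution formula is not reproduced here; this hypothetical `contributions` helper only illustrates how an additive ε = 10^-6 keeps the score defined when both unique-information terms are near zero:

```python
EPS = 1e-6  # numerical tolerance from the implementation choices above

def contributions(U1, U2):
    """Normalize non-negative unique-information values into scores.

    Illustrative only: the epsilon guard in the denominator is the point,
    not the particular normalization.
    """
    total = U1 + U2 + EPS
    return U1 / total, U2 / total

balanced = contributions(0.42, 0.40)   # well-conditioned case
degenerate = contributions(0.0, 0.0)   # would divide by zero without EPS
```

In the degenerate case the guard returns (0, 0) instead of raising, which matters for the only-one-input boundary where one unique term vanishes exactly.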

On the synthetic data, numerical instabilities do not appear -- Gaussian inputs are well-conditioned and the joint probability tensors stay away from degenerate configurations. This confirms the solver behaves correctly when the problem is well-posed, and shifts scrutiny toward real-data cases where conditioning may be worse.


What passing validation earns

The synthetic experiments pass. The metric recovers correct contributions under Gaussian fusion rules across all five cases and recovers the correct qualitative structure -- including the presence and absence of synergy -- under bitwise operators. What does that entitle you to conclude?

It establishes that the mechanism is sound. The KL-projection approach, the discretization step, and the normalization procedure work together correctly to produce attribution scores that track the true contribution structure when the true structure is known. This is a necessary condition for trusting the metric on real data. It is not a sufficient condition.

The appropriate epistemic position: the metric is validated in the sense that it works correctly under controlled conditions where the answer is known. The ablation studies extend this by testing consistency with expected behavior on real data under perturbation. Together, they build a case for reliability. They do not prove the metric correct in general, and the paper does not claim they do.

This is the right model for any empirical metric. You demonstrate correctness in specific controlled settings and consistency in related ones. That is the standard the paper meets, and it is the standard that earns the right to report the real-world findings with confidence.


What comes next

With the metric validated, the results in Table 1 and Table 3 -- the 81.39% text contribution from SmolVLM-256M, the 17-point gap between BLIP and LLaVA-1.5 on GQA, the near-zero synergy across all models and layers -- carry a different weight than they would without this foundation. They are outputs of a pipeline that has been shown to correctly recover known contribution structures under controlled conditions.

The next post applies the validated metric to the full set of six models across six benchmarks, and reads the results in detail. The layerwise dynamics, the task-by-task variation, and the scale comparison within the Qwen3-VL family all come from the same pipeline that just recovered XOR's synergy structure and weighted-100's near-total dominance from first principles.

This post is part of a broader project on quantifying modality contributions in vision-language models using Partial Information Decomposition. The full framework decomposes internal representations into unique, redundant, and synergistic components to derive principled attribution scores.