Visual Sensing Part 2: What If the Lighting Shifts?

Mar 23, 2025

Robustness, Perception, Robotics, Visual Sensing

This model is hosted on Google Cloud Storage. Copy the URI below and use gsutil to download:

gs://openpi-assets/checkpoints/pi05_libero

Run in your terminal:

gsutil -m cp -r gs://openpi-assets/checkpoints/pi05_libero .

In Part 1, we asked what happens when the camera lies about geometry. Here, we ask what happens when it shifts appearance.

Unlike optical perturbations, photometric and color perturbations preserve geometry. Edges stay where they are. Shapes remain intact. Depth does not bend. And yet, these perturbations routinely cause large and sometimes surprising drops in performance.

The reason is uncomfortable but simple: many models rely on appearance far more than they admit.

Why appearance changes are deceptively powerful

Photometric perturbations alter how light and color are represented, not where things are in space. To a human, these changes are often trivial. To a model, they can be existential.

These perturbations test whether a model understands structure independently of lighting, or whether lighting itself has become a proxy for meaning.

In this post, we’ll explore photometric perturbations in two groups: intensity transformations (brightness, exposure, gamma, and contrast) and color manipulations (saturation, white balance, and color temperature). Each reveals a different way models confuse appearance with structure.

Intensity transformations are not interchangeable

Brightness, exposure, gamma, and contrast might seem similar: they all change how bright or dark an image appears. But they are fundamentally different transformations, and models that handle one often struggle with another. Low contrast, in particular, stands out as a consistent failure point across the models we tested. Before we examine what these differences reveal, try experiencing them directly:

Interactive demo with controls for perturbation and corruption degree. The description shown for brightness:

Brightness shifts add or subtract a constant value from pixel intensities. This is the simplest photometric perturbation and often the most revealing. Sensitivity to brightness indicates reliance on absolute intensity rather than relative structure. Common failures include overconfidence in over-bright regions, missed detections in underexposed areas, and action thresholds tied to pixel magnitude.

Figure 1: Visualize the different intensity transformations.

Each transformation attacks intensity information differently and exposes different assumptions about how models process pixel values.
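To make the distinction concrete, here is a minimal NumPy sketch of the four intensity transformations. Only the brightness definition (adding a constant) comes from the text above. The exposure, gamma, and contrast formulas are common textbook conventions and are assumptions here, not necessarily the exact ones behind the demo. The function names are ours, and images are assumed to be floats in [0, 1].

```python
import numpy as np

def shift_brightness(img, delta):
    """Add a constant offset to every pixel, then clip to [0, 1]."""
    return np.clip(img + delta, 0.0, 1.0)

def scale_exposure(img, stops):
    """Multiply intensities by 2**stops (assumed convention: one stop doubles light)."""
    return np.clip(img * (2.0 ** stops), 0.0, 1.0)

def apply_gamma(img, g):
    """Apply a power-law curve; g > 1 darkens midtones, g < 1 lifts them."""
    return np.clip(img, 0.0, 1.0) ** g

def scale_contrast(img, factor):
    """Scale deviations from the image mean; factor < 1 flattens the image."""
    mean = img.mean()
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)

# A small grayscale ramp makes the differences easy to compare side by side.
ramp = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print("brightness +0.2:", shift_brightness(ramp, 0.2))  # additive shift, highlights clip
print("exposure -1 stop:", scale_exposure(ramp, -1.0))  # multiplicative, black stays black
print("gamma 2.0:", apply_gamma(ramp, 2.0))             # endpoints fixed, midtones move
print("contrast 0.5:", scale_contrast(ramp, 0.5))       # values pulled toward the mean
```

On the ramp, brightness moves every value by the same amount, exposure leaves black untouched, gamma leaves both endpoints untouched, and contrast compresses the whole range toward its mean. A model that tolerates one of these has not necessarily seen anything like the others.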

Color as signal

Color perturbations reveal whether a model has learned color constancy or simply memorized color statistics from its training distribution. When color shifts, models that use hue or chrominance as a shortcut for object identity, affordance, or state can silently break. Experience these color manipulations:

Interactive demo with controls for perturbation and corruption degree. The description shown for saturation:

Saturation controls the intensity of color relative to luminance. Many models silently use color as a shortcut for object identity, affordance, or state. When saturation shifts, those shortcuts break. Typical failure modes include confusing similarly shaped objects with different colors, overfitting to dataset-specific color palettes, and performance collapse in desaturated environments.

Figure 2: Visualize the different color perturbations.

What makes color perturbations particularly insidious is that they exploit learned associations rather than perceptual capabilities. A model may “see” perfectly well but still fail because it learned the wrong lesson about what matters.
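For readers who want to reproduce this kind of shift, the sketch below shows one simple way to model the three color manipulations in NumPy. These are assumptions for illustration, not the exact transforms used in the demo: saturation is a blend between each pixel and its gray value (Rec. 601 luma weights), white balance is a set of per-channel gains, and color temperature is a hypothetical one-parameter warm/cool control built on those gains.

```python
import numpy as np

# Rec. 601 luma weights, a common choice for converting RGB to gray.
LUMA = np.array([0.299, 0.587, 0.114])

def adjust_saturation(img, factor):
    """Blend each pixel with its gray value; 0 gives grayscale, 1 leaves the image unchanged."""
    gray = (img @ LUMA)[..., None]
    return np.clip(gray + factor * (img - gray), 0.0, 1.0)

def apply_channel_gains(img, gains):
    """Multiply R, G, B by per-channel gains (a simple white-balance model)."""
    return np.clip(img * np.asarray(gains), 0.0, 1.0)

def shift_color_temperature(img, warmth):
    """Hypothetical warm/cool control: warmth > 0 boosts red and cuts blue."""
    return apply_channel_gains(img, (1.0 + warmth, 1.0, 1.0 - warmth))

# A single reddish pixel, shaped (height, width, channels).
red_pixel = np.array([[[0.8, 0.2, 0.2]]])
desaturated = adjust_saturation(red_pixel, 0.0)
warmer = shift_color_temperature(red_pixel, 0.2)
print("desaturated:", desaturated)
print("warmer:", warmer)
```

At a saturation factor of 0 the red pixel becomes a neutral gray: its shape information survives, but the hue cue a model may have leaned on is gone.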

Gradual degradation

Unlike failures under optical perturbations, photometric failures are often gradual. Performance tends to degrade smoothly as severity increases rather than collapsing abruptly.

Below, we compare how Pi 0.5 and SmolVLA perform on LIBERO-Spatial under photometric stress tests. The analysis reveals a stark gap in photometric robustness and very different failure profiles.

Figure 3: Pi 0.5 photometric sensitivity across severity levels on LIBERO-Spatial.

Pi 0.5 is remarkably stable. Across all photometric perturbations, including brightness, exposure, gamma, saturation, white balance, and color temperature, performance stays within a few percentage points of baseline even at the highest severity levels. Contrast is the only perturbation that produces a measurable dip, dropping to around 94% of baseline at the most extreme setting. But even this is a graceful decline, not a collapse. Pi 0.5’s vision encoder appears to have learned representations that are genuinely invariant to lighting and color shifts, likely reflecting both architectural choices and the breadth of its training distribution.

SmolVLA tells a different story.

Figure 4: SmolVLA photometric sensitivity across severity levels on LIBERO-Spatial.

Starting from a lower baseline, SmolVLA degrades noticeably under nearly every photometric perturbation, not just a single outlier. Low contrast is the most dramatic failure: performance drops from baseline to near-zero at high severity, a complete collapse. But SmolVLA also shows substantial sensitivity to saturation (dropping to roughly half its baseline), brightness (steady decline to about 65% of baseline), and underexposure (falling to around 75% of baseline at the highest severity). Even perturbations that barely register for Pi 0.5, such as gamma shifts, white balance, and color temperature, produce moderate degradation in SmolVLA, with most losing 5–15% of baseline performance at the highest severity.
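The "percent of baseline" figures quoted above are most naturally read as the success rate at a given severity divided by the success rate on clean images. The sketch below shows that normalization under this assumption. `evaluate` is a hypothetical callable standing in for a full rollout loop, and `toy_policy` is a synthetic stand-in whose numbers are made up for illustration and are not measurements from either model.

```python
from typing import Callable, Dict, Sequence

def relative_to_baseline(
    evaluate: Callable[[float], float],
    severities: Sequence[float],
) -> Dict[float, float]:
    """Return the success rate at each severity as a fraction of the clean success rate."""
    baseline = evaluate(0.0)  # severity 0 is assumed to mean "no perturbation"
    if baseline <= 0.0:
        raise ValueError("baseline success rate must be positive")
    return {s: evaluate(s) / baseline for s in severities}

# Synthetic stand-in for a real evaluation: NOT measured data.
def toy_policy(severity: float) -> float:
    return max(0.0, 0.8 - 0.4 * severity)

curve = relative_to_baseline(toy_policy, [0.0, 0.5, 1.0])
for severity, fraction in curve.items():
    print(f"severity {severity:.1f}: {fraction:.0%} of baseline")
```

Normalizing this way makes models with different clean success rates comparable, which matters here because SmolVLA starts from a lower baseline than Pi 0.5.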

The contrast between the two models is instructive. Pi 0.5’s near-total immunity suggests that photometric robustness is achievable: it is not an inherent limitation of vision-based policies. SmolVLA’s broad sensitivity, on the other hand, suggests that its vision encoder has overfit to the appearance statistics of its training data. When any aspect of appearance shifts, whether brightness, color, or edge strength, the encoder’s features degrade, and the policy follows.

The one vulnerability they share is low contrast, though the severity of the failure differs dramatically. This shared sensitivity likely reflects a fundamental property of how convolutional and transformer-based vision encoders extract features: when edge contrast drops below a critical threshold, the spatial gradients that drive feature extraction weaken, and downstream representations lose the structure needed for precise action prediction.
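The gradient-weakening argument can be checked on a toy image. The sketch below assumes contrast is modeled as scaling deviations from the image mean and ignores clipping and quantization. Under that assumption, finite-difference gradients shrink in direct proportion to the contrast factor. It says nothing about where a given encoder's critical threshold lies; it only illustrates the raw signal loss.

```python
import numpy as np

def scale_contrast(img, factor):
    """Scale deviations from the image mean (no clipping, to keep the math linear)."""
    mean = img.mean()
    return mean + factor * (img - mean)

def mean_gradient_magnitude(img):
    """Average finite-difference gradient magnitude over the image."""
    gy, gx = np.gradient(img)
    return float(np.hypot(gx, gy).mean())

# Synthetic image with a single vertical step edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

full = mean_gradient_magnitude(img)
flat = mean_gradient_magnitude(scale_contrast(img, 0.1))
ratio = flat / full
print(f"gradient magnitude at 10% contrast: {ratio:.2f} of original")
```

Because the mean is a constant offset, it drops out of the gradient entirely, leaving only the contrast factor multiplying every edge response.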

What photometric sensitivity reveals and why it matters

Photometric perturbations test whether the vision component of a model separates structure from appearance, learns color-invariant representations, relies on relative rather than absolute intensity, generalizes across lighting regimes, and treats color as context rather than ground truth.

The results from Pi 0.5 and SmolVLA make clear that these are not theoretical concerns. SmolVLA’s broad sensitivity across brightness, saturation, contrast, and exposure shows what happens when a vision encoder has learned the appearance statistics of its training set rather than the structural invariants of the task. Pi 0.5’s near-total robustness shows that this failure mode is not inevitable.

The stakes are high because lighting changes constantly. Time of day, weather, indoor lighting, sensor calibration, and environment all affect appearance. If a model fails when the lighting changes, it has learned the lighting conditions under which the task was demonstrated and not the task itself.

In robotics and embodied AI, this distinction is the difference between robustness and fragility. A policy whose vision encoder depends on specific lighting has memorized the appearance of success rather than understanding the structure of the task itself.

It is worth noting that some models incorporate photometric data augmentation during training, such as random brightness, contrast, and color jitter. Pi 0.5’s robustness likely reflects, at least in part, the effectiveness of aggressive augmentation combined with a larger and more diverse training distribution. SmolVLA’s fragility suggests that its augmentation pipeline, if present, did not push appearance variation far enough. The consistent failure under low contrast in both models, though to very different degrees, further suggests that standard augmentation pipelines rarely suppress contrast as aggressively as real-world conditions can.
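As an illustration of what such augmentation can look like, here is a minimal NumPy sketch of random brightness, contrast, and saturation jitter. It is not the pipeline used by either model, which the text above does not specify. The function name and the default ranges are assumptions, with the contrast range deliberately reaching down to 0.2 to reflect the point about suppressing contrast aggressively.

```python
import numpy as np

LUMA = np.array([0.299, 0.587, 0.114])  # Rec. 601 luma weights

def photometric_jitter(img, rng, brightness=0.2, contrast=(0.2, 1.5), saturation=(0.5, 1.5)):
    """Randomly perturb brightness, contrast, and saturation of a float RGB image in [0, 1].

    The ranges are illustrative assumptions, not tuned values.
    """
    # Brightness: additive offset drawn from [-brightness, +brightness].
    out = img + rng.uniform(-brightness, brightness)
    # Contrast: scale deviations from the mean by a random factor.
    mean = out.mean()
    out = mean + rng.uniform(*contrast) * (out - mean)
    # Saturation: blend each pixel with its gray value by a random factor.
    gray = (out @ LUMA)[..., None]
    out = gray + rng.uniform(*saturation) * (out - gray)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))  # synthetic stand-in for a camera frame
augmented = photometric_jitter(image, rng)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```

In a training loop, a function like this would be applied independently to each frame or each episode. Which of the two is appropriate depends on whether lighting is expected to change within an episode.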

What comes next

In Part 3, we move away from the physical world entirely. We will examine digital and representation-level perturbations (compression, resolution changes, and color-space transformations) and show how preprocessing decisions quietly shape model behavior.