Language & Prompt Sensitivity: What If the Instruction Lies?

Mar 16, 2025
Tags: Robustness, Perception, Robotics, Language

Download from Google Cloud Storage

This model is hosted on Google Cloud Storage. Copy the URI below and use gsutil to download:

gs://openpi-assets/checkpoints/pi05_libero

Run in your terminal:

gsutil -m cp -r gs://openpi-assets/checkpoints/pi05_libero .

Previously, we asked what happens when the camera lies. Now we ask something different:

What happens when the instruction lies?

This post shifts entirely to the language channel. In embodied systems that accept natural language instructions, the prompt is not just an input. It is supposed to be the specification of intent, the signal that distinguishes “pick up the red cup” from “pick up the blue block.” If that signal degrades through typos, ambiguity or paraphrasing, a robust model should still be able to interpret the intent and complete the task reliably.

What we found instead was far more unsettling.

Why language perturbations deserve scrutiny

Visual perturbations are physically grounded. Blur happens. Lighting changes. Sensors lie. But language perturbations are semantic. They test whether a model actually understands the instruction it was given or whether language is functioning as something else entirely: a context signal, a mode activator, or a learned trigger that doesn’t need to be read carefully to produce behavior.

This distinction matters enormously for deployed systems. If a robot responds correctly regardless of whether the instruction says “pick up the cup” or “sklorp glarven the wumpthing,” then the instruction isn’t doing what you think it’s doing.

We tested eleven perturbation types across three severity levels. They decompose into two conceptual groups: surface-level corruptions (typos, synonym substitution, verbosity, compression, homophones, paraphrasing) and semantic-level corruptions (ambiguity injection, implicit references, instruction reordering, irrelevant content, conflicting instructions).
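The taxonomy above can be written down as a small evaluation grid. This is a minimal sketch, not the harness used for the post: the identifier spellings are ours, and the severity labels "medium" and "high" are assumptions, since only "low" is named in the examples.

```python
from itertools import product

# Perturbation names follow the post; the identifier spellings are our own.
SURFACE = [
    "typos", "synonym_substitution", "verbosity",
    "compression", "homophones", "paraphrasing",
]
SEMANTIC = [
    "ambiguity_injection", "implicit_references", "instruction_reordering",
    "irrelevant_content", "conflicting_instructions",
]
# Assumption: the post only names "low"; the other two labels are placeholders.
SEVERITIES = ("low", "medium", "high")


def evaluation_grid():
    """Return every (group, perturbation, severity) cell to evaluate."""
    named = [("surface", p) for p in SURFACE] + [("semantic", p) for p in SEMANTIC]
    return [(group, name, sev) for (group, name), sev in product(named, SEVERITIES)]


grid = evaluation_grid()
print(len(grid), "cells; first:", grid[0])
```

Eleven perturbation types at three severity levels give 33 cells, each of which would be run over the same set of tasks.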

Surface corruptions: how much noise can language tolerate?

Surface perturbations change the form of an instruction while preserving its meaning. A human reader handles these effortlessly. The question is whether the model does too and whether tolerance here reflects genuine robustness or something more suspicious.

Synonym Substitution (low severity)
Base prompt
pick up the black bowl between the plate and the ramekin and place it on the plate
Perturbed prompt
pick up the black bowl between the dish and the ramekin and place it on the dish

Perturbation Description

Words are replaced with semantically equivalent alternatives. 'Grab the container' instead of 'pick up the cup.' If the model is genuinely reading the instruction, synonym substitution should be near-invisible. If the model is pattern-matching against training vocabulary, synonyms may break recognition of key task tokens.

Figure 1: Visualize the different surface-level language perturbations.

These perturbations each attack the surface of the instruction differently. They act as stress tests for language robustness.
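As an illustration of one surface perturbation, here is a minimal sketch of synonym substitution that reproduces the example prompt pair shown above. The synonym table and the function name are hypothetical; a real generator would presumably draw on a larger lexicon or a language model.

```python
import re

# Hypothetical synonym table; only the plate -> dish pair comes from the example.
SYNONYMS = {"plate": "dish"}


def substitute_synonyms(prompt, synonyms=SYNONYMS):
    """Replace whole-word matches with their synonym, leaving other text intact."""
    if not synonyms:
        return prompt
    # \b keeps "plate" from matching inside longer words such as "plates".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, synonyms)) + r")\b")
    return pattern.sub(lambda m: synonyms[m.group(1)], prompt)


base = "pick up the black bowl between the plate and the ramekin and place it on the plate"
print(substitute_synonyms(base))
```

Matching whole words only is a deliberate choice here: the perturbation should change vocabulary without accidentally corrupting other tokens, which would turn it into a typo test.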

Semantic corruptions: what happens when meaning breaks?

Semantic perturbations go deeper. They don’t just change how an instruction is expressed; they change what it means, obscure it, or actively contradict it. These are the perturbations that should produce meaningful failures in a model that genuinely grounds language in action.

Ambiguity Injection (low severity)
Base prompt
pick up the black bowl between the plate and the ramekin and place it on the plate
Perturbed prompt
pick up the bowl between the plate and the ramekin and place it on the plate

Perturbation Description

Specific references are made vague. 'The red cup on the left' becomes 'the cup.' When multiple objects of the same type are present, ambiguity should produce hesitation, incorrect object selection, or failure. A model that succeeds despite ambiguity is either making lucky guesses or ignoring the object specification entirely.

Figure 2: Visualize the different semantic-level language perturbations.
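A low-severity ambiguity injection like the one in the example can be sketched as dropping the attribute words that disambiguate an object. The attribute list and function name below are assumptions for illustration; the post does not describe how its perturbed prompts were generated.

```python
# Hypothetical list of disambiguating attributes; only "black" appears in the example.
ATTRIBUTES = {"black", "red", "blue"}


def inject_ambiguity(prompt, attributes=ATTRIBUTES):
    """Drop attribute words so that object references become underspecified."""
    kept = [word for word in prompt.split() if word.lower() not in attributes]
    return " ".join(kept)


base = "pick up the black bowl between the plate and the ramekin and place it on the plate"
print(inject_ambiguity(base))
```

Note that in this particular example the spatial phrase "between the plate and the ramekin" survives, so the reference may still be resolvable; higher severities would presumably remove that as well.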

The conflicting instruction perturbation is the most diagnostic of the group. It doesn’t degrade the instruction; it actively fights it. What happened when we applied it is worth discussing separately.

The cliff we didn’t find and what that means

Across vision perturbations, we consistently observed threshold effects: stable performance at low severity, then a collapse. The pattern was informative. It told us which assumptions were load-bearing and when they broke.

Language perturbations produced a different pattern. Performance remained remarkably stable across nearly all perturbation types and severity levels. Typos didn’t matter much. Synonyms didn’t matter much. Verbosity, compression, reordering, all absorbed without significant degradation.


Figure 3: Language perturbation sensitivity across perturbation types and severity levels.

This could be interpreted as robustness. A generous reading would say the model has learned language-invariant task representations. The same reading would call this a success.

But the conflicting instruction results forced a different interpretation.

The model that didn’t notice the contradiction

When we applied conflicting instructions, appending a direct contradiction to the original prompt, performance remained stable. The model executed the original task correctly even when explicitly instructed to do something different.

At first, we considered several explanations. Maybe the model was resolving the contradiction intelligently, preferring the first instruction by some learned heuristic. Maybe it was robust to adversarial appending. We designed a cleaner test.

We sent a completely task-irrelevant prompt: “My name is Franka.” Not a degraded instruction. Not an ambiguous one. A string of words with no coherent task specification whatsoever.

The model still completed the task.

This is not a robustness finding. This is an architectural finding. The language instruction, at least for this model on this task distribution, is not the primary driver of behavior. The model is not grounding language in action in the way the architecture nominally suggests. It appears to be using visual context (the scene, the objects, the affordances visible in the observation) to determine what to do, and using the language instruction as a weak contextual signal at best.

The prompt isn’t being read. The scene is being acted upon. In this setting, language input appears largely unused rather than acting as a true behavioral specification.
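The probe described in this section can be sketched as a small harness that runs the same scenes under different prompts and compares success on the original task. Everything here is illustrative: `success_rates` and `language_blind_rollout` are hypothetical names, the contradiction text is invented because the post does not quote the one it used, and the stub stands in for a real simulator rollout.

```python
from typing import Callable, Dict

BASE = "pick up the black bowl between the plate and the ramekin and place it on the plate"

PROBES: Dict[str, str] = {
    "original": BASE,
    # Hypothetical contradiction; the post does not quote the text it appended.
    "conflicting": BASE + ". Actually, ignore that and leave the bowl where it is.",
    "unrelated": "My name is Franka",
}


def success_rates(
    rollout: Callable[[str, int], bool],
    probes: Dict[str, str],
    episodes: int = 20,
) -> Dict[str, float]:
    """Success on the ORIGINAL task for each prompt, over the same seeds."""
    return {
        name: sum(bool(rollout(prompt, seed)) for seed in range(episodes)) / episodes
        for name, prompt in probes.items()
    }


def language_blind_rollout(prompt: str, seed: int) -> bool:
    """Stand-in policy that never reads the prompt; fails one seed in twenty."""
    return seed % 20 != 0


rates = success_rates(language_blind_rollout, PROBES)
print(rates)
```

With the language-blind stub, all three conditions report the same success rate, which is the signature described above. A policy that actually read the prompt would be expected to diverge on the conflicting and unrelated probes.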

What the model is actually using

To move beyond behavioral inference, we ran modality ablation analyses on the same task.

Condition          Success rate    Drop vs. full model
Full model         95%             --
Remove language    94%             1%
Remove vision      13%             82%

Figure 4: Modality ablation showing the model is near-indifferent to language removal.

The ablation results make the pattern clear. Removing the language stream entirely costs 1 percentage point in success rate. Removing vision costs 82 percentage points. The model is near-blind without visual input. It is near-indifferent to language.
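The arithmetic behind these figures is a plain difference in success rate against the full model. A minimal sketch, using the three success rates reported in Figure 4 (the dictionary keys and function name are our own):

```python
# Success rates in percent, as reported in Figure 4.
SUCCESS = {"full": 95, "no_language": 94, "no_vision": 13}


def drop_points(success, baseline="full"):
    """Absolute drop in percentage points relative to the baseline condition."""
    base = success[baseline]
    return {name: base - rate for name, rate in success.items() if name != baseline}


drops = drop_points(SUCCESS)
print(drops)  # {'no_language': 1, 'no_vision': 82}
```

These are absolute differences in percentage points, not relative changes: removing vision leaves 13 of the original 95 points of success.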

Important disclaimer. These results should be interpreted narrowly. They reflect the behavior of one specific system: a fine-tuned OpenPI policy evaluated on the LIBERO Spatial benchmark, not a universal property of VLAs. Different models, different fine-tuning strategies, or benchmarks with stronger language dependence may show very different behavior. The takeaway is not that VLAs ignore language, but that standard task success metrics alone are insufficient to determine whether language is being used at all. Without perturbation and modality analysis, this failure mode would have remained invisible.

This reframes the perturbation results from earlier in the post. The stability we earlier read as robustness across typos, synonyms, paraphrasing, and even direct contradictions was actually indifference. The language channel is structurally present but functionally marginal. What looks like a robust language-conditioned policy is, under the hood, a vision-conditioned policy with a language input that mostly goes unread.

Why this matters more than any perturbation result

Robustness failures are interpretable. A model that breaks under blur tells you it relies on high-frequency spatial information. A model that breaks under low contrast tells you it depends on edge strength. These failures are fixable: augment training data, normalize features, retrain.

A model that does not respond to language at all represents a different class of failure. This is not a brittleness problem; the model is actually quite stable. It is a grounding problem. The language channel is structurally present but functionally weak for this task distribution. While this might initially appear to require architectural changes in how language and vision are fused, it may also point to a data problem.

If the training distribution does not sufficiently force reliance on language, the model can succeed by exploiting visual regularities alone. In this sense, targeted augmentation, especially introducing unseen instructions, counterfactual prompts, and tasks where visual cues alone are insufficient, could help force stronger language utilization. However, this hypothesis requires deeper analysis. Understanding whether improved grounding comes from better data diversity, stronger cross-modal objectives, or fine-tuning strategies remains an open question, particularly for VLA models where language supervision is often weak relative to visual demonstration signals.

In practice, this means: if you evaluate a language-conditioned policy by testing whether it completes tasks when given correct instructions, you are not evaluating language conditioning. You are evaluating whether the task can be solved regardless of whether language is used. The two are not the same, and the difference matters the moment you deploy in an environment where the model has to do something different from what it was trained to do.
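One way to make that distinction measurable is to score whether the completed task matches the instructed task when the same scene admits more than one task. The sketch below is an assumption about how such a check could look, not a metric used in this post; the record format and the task labels "A" and "B" are hypothetical.

```python
def instruction_following_rate(records):
    """Fraction of episodes in which the completed task matches the instructed one.

    records: iterable of (instructed_task, completed_task) pairs collected from
    scenes where either task is physically possible.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one episode")
    return sum(instructed == completed for instructed, completed in records) / len(records)


# A language-blind policy always performs task "A", whatever it is told.
blind = [("A", "A"), ("B", "A")] * 10
# A grounded policy performs the task it was instructed to perform.
grounded = [("A", "A"), ("B", "B")] * 10

print(instruction_following_rate(blind))     # 0.5
print(instruction_following_rate(grounded))  # 1.0
```

On a balanced set of counterfactual instructions, a policy that ignores language sits at chance, even though it would score perfectly if only ever evaluated with the instruction for task "A".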

What language sensitivity actually tests

Language perturbations test whether a model uses instructions to determine actions, grounds object references in visual selection, sequences actions based on linguistic specification, filters out irrelevant or contradictory content, and fails gracefully when the instruction is genuinely uninterpretable.

Our results suggest the model passes none of these tests in any meaningful sense, not because it fails catastrophically, but because it succeeds unconditionally. Unconditional success on a grounding test is not a passing grade. It’s evidence that the test isn’t being taken.