Self-Supervised Critical Phase Detection for VLA Refinement

Anonymous Author(s)
Anonymous Institution
Preprint · Under review · 2026

Figure: Critical phases (shaded) localized on two LIBERO-Long rollouts (π0.5 backbone); they cover only a small fraction of each rollout.

TL;DR. A Vision-Language-Action policy nails most of a manipulation task, then fails on the one or two moments that actually decide it. We call those critical phases and define them by decision sensitivity — how much the action there flips the outcome, not how likely failure looks. We learn to detect them from successful demonstrations alone and concentrate reinforcement-learning refinement only there.

Abstract

Vision-Language-Action (VLA) policies handle most of a manipulation rollout, then fail on the few decisions that actually matter — the instant a peg first tilts inside a hole, the moment a grasp commits. We call these moments critical phases and define them by decision sensitivity: how much a small change to the action at that timestep flips the eventual success or failure of the rollout.

Crucially, a critical phase is not the same as a likely failure. Successful rollouts pass through the very same decision points, so criticality is a property of the decision structure, not of failure probability. This separates our target from runtime failure detection, which measures how out-of-distribution the current step looks and therefore stays quiet on successes and reacts only after a failure has surfaced.

We study how to detect critical phases from successful demonstrations alone — no failure labels and no per-task success oracle — using the frozen policy's latent embedding and the robot state, and how to concentrate reinforcement-learning refinement on exactly those phases so that precision tasks such as peg-in-hole converge with far fewer environment steps. (Method and experiments in progress.)

What is a critical phase?

Precision manipulation hinges on a handful of decision-sensitive moments. Consider driving a bolt into a hole (peg-in-hole). Following the classical contact-state decomposition (Mason 1981; Debus et al.), the task splits into contact regimes, and in each one a different control variable decides the outcome:

  • Contact 1 — approach. Move too fast and inertia overcomes the magnetic hold; the bolt drops off the driver. Speed is decisive.
  • Contact 2 — surface touch. Press too hard and the bolt bounces away. Contact force is decisive.
  • Contact 3 — rim alignment. The bolt must move along the hole's normal direction. Vertical alignment is decisive.
  • Contact 4 — seating. How far to turn / how hard to press to seat the bolt. Torque is decisive.

In each regime a small action error decides the success or failure of everything that follows. These intervals have historically been carved out and labeled by hand. We instead define them operationally: a critical phase is a timestep where perturbing the action measurably flips the outcome.
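The operational definition can be written compactly. What follows is only a sketch under assumed notation, none of which is fixed by the method itself: $\tau$ is a rollout with binary outcome $S(\tau) \in \{0, 1\}$, $a_t$ is the action taken at timestep $t$, $\delta$ is a small perturbation drawn from an assumed distribution $\mathcal{D}_\epsilon$, $\tau_{t,\delta}$ is the rollout obtained by executing $a_t + \delta$ at step $t$ and following the policy afterwards, and $\eta$ is a hypothetical threshold.

```latex
d_t \;=\; \Pr_{\delta \sim \mathcal{D}_\epsilon}\!\left[\, S(\tau_{t,\delta}) \neq S(\tau) \,\right],
\qquad
t \text{ is critical} \iff d_t \ge \eta .
```

Under this reading, $d_t$ can be large at some timesteps of a rollout with $S(\tau) = 1$, which is why criticality is not the same quantity as failure probability.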

Criticality is not failure probability

                     | Failure detection                          | Critical phase (ours)
Question             | Is this rollout going to fail?             | Where is the outcome being decided?
Signal source        | OOD / uncertainty of the current step      | Decision sensitivity of the current action
On a success rollout | Stays flat: nothing looks wrong            | Still fires: successes pass through the decision point too
Timing               | After the failure has surfaced (a symptom) | Before the failure is irreversible (the cause)

Because criticality lives inside successful executions, it can be learned from success-only data, and it pinpoints spatially-overlapping failures — ones that never leave the success manifold and so are invisible to density-/OOD-based detectors.

Contributions

1. Definition & detection

Critical phase formalized as decision sensitivity, with a per-timestep score c_t learned from success rollouts only. Validated against counterfactual ground-truth decisiveness: the measured rate at which perturbing the action at t flips the outcome in simulation.
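The counterfactual decisiveness used as ground truth can be estimated by Monte Carlo sampling. The sketch below is illustrative and rests on assumptions that are not part of the method: a hypothetical `rollout` callable that replays an action sequence and returns success, Gaussian perturbations of scale `noise_scale`, open-loop replay (a real evaluation would presumably let the policy continue closed-loop after the perturbed step), and a one-dimensional toy task `toy_insertion` in which only step 3 decides the outcome.

```python
import random
from typing import Callable, Sequence


def flip_rate(
    rollout: Callable[[Sequence[float]], bool],
    actions: Sequence[float],
    t: int,
    noise_scale: float = 0.5,
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of sampled perturbations at step t that change the outcome."""
    rng = random.Random(seed)
    baseline = rollout(actions)  # outcome of the unperturbed rollout
    flips = 0
    for _ in range(n_samples):
        perturbed = list(actions)
        perturbed[t] += rng.gauss(0.0, noise_scale)  # perturb one step only
        if rollout(perturbed) != baseline:
            flips += 1
    return flips / n_samples


def toy_insertion(actions: Sequence[float]) -> bool:
    """Hypothetical task: succeeds iff the step-3 action stays within tolerance."""
    return abs(actions[3]) < 0.2


nominal = [0.0] * 6  # a successful nominal rollout in the toy task
scores = [flip_rate(toy_insertion, nominal, t) for t in range(len(nominal))]
most_decisive = max(range(len(scores)), key=lambda t: scores[t])
```

In this toy the estimated flip rate is zero everywhere except step 3, even though the nominal rollout succeeds, which mirrors the claim that criticality is visible on success rollouts.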

2. Criticality ≠ failure

Direct evidence that c_t is not a failure probability: it responds on success rollouts where failure detectors stay flat, and it catches the decision point of spatially-overlapping failures they miss. Head-to-head with FAIL-Detect, FIPER, SAFE on LIBERO.

3. Localized RL refinement

Concentrating reinforcement-learning refinement on the detected critical phases accelerates a frozen VLA's convergence on precision tasks, reaching higher sample efficiency than uniform RL fine-tuning (VLA-RL, SimpleVLA-RL).
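One possible reading of "concentrating refinement on the detected phases" is to threshold the per-timestep score into intervals and let only those timesteps contribute to the update. The sketch below assumes exactly that; the threshold, the REINFORCE-style loss, and the helper names `critical_intervals` and `masked_pg_loss` are hypothetical, and the refiner under development may work differently (including which trainable component receives the update).

```python
from typing import List, Sequence, Tuple


def critical_intervals(
    scores: Sequence[float], threshold: float
) -> List[Tuple[int, int]]:
    """Return [start, end) pairs for contiguous runs with score >= threshold."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t  # a critical run begins
        elif s < threshold and start is not None:
            intervals.append((start, t))  # the run ended at t - 1
            start = None
    if start is not None:  # run extends to the end of the rollout
        intervals.append((start, len(scores)))
    return intervals


def masked_pg_loss(
    log_probs: Sequence[float],
    advantages: Sequence[float],
    scores: Sequence[float],
    threshold: float,
) -> float:
    """Policy-gradient loss averaged over critical timesteps only."""
    terms = [
        -lp * adv
        for lp, adv, s in zip(log_probs, advantages, scores)
        if s >= threshold  # non-critical steps contribute nothing
    ]
    return sum(terms) / len(terms) if terms else 0.0


scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]
phases = critical_intervals(scores, threshold=0.5)
loss = masked_pg_loss(
    log_probs=[-1.0] * 6,
    advantages=[1.0, 1.0, 2.0, 4.0, 1.0, 3.0],
    scores=scores,
    threshold=0.5,
)
```

The function only shows the arithmetic of the mask; in an actual training loop the same mask would be applied to differentiable log-probabilities inside an autodiff framework.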

Acting inside the phase: an intervention probe (preliminary)

A first test of using a detected phase. We attach a lightweight negate-replay intervention to a runtime failure-probability detector (an LSTM over the π0.5 latent, SAFE-style) and let it act whenever the score spikes. Each clip below is the baseline rollout (left) versus the same rollout with intervention (right), with the failure score overlaid.

RECOVERED: baseline fails → intervention rescues the rollout.

BROKEN: baseline succeeds → the same trigger fires on a healthy step and breaks it.

The takeaway is the point of this work: the intervention can rescue genuine failures (the RECOVERED clip), but a failure-probability trigger also misfires on healthy steps (the BROKEN clip). Acting on where the outcome is actually being decided (criticality), rather than on how likely failure looks, is what should make such intervention reliably net-positive.

How it relates to prior work

Runtime failure detection — FAIL-Detect (Xu et al., RSS 2025), FIPER (Römer et al., NeurIPS 2025), SAFE (Gu et al., NeurIPS 2025) — raises an alarm from OOD or uncertainty signals. These are detection-only and measure failure probability; we instead identify the decision structure inside successful rollouts and connect it to policy refinement. We compare directly on LIBERO using failure-AUROC.

Critical-step / critical-moment — Liu (ICCV 2023), Tang (2026), Mao / JITI (2025), Kappler (RSS 2015) — variously require return diversity, failure labels at calibration, policy token logits, spatial features, or mixed-quality demonstrations. Our setting is success-only, representation-level, and temporal, and remains policy-agnostic (no token logits).

Method & experiments in progress

The detector and the critical-phase-localized RL refiner are under active development. The evaluation plan:

Testbed: LIBERO (quantitative headline) + a contact-rich insertion task (qualitative); sim counterfactual rollouts give ground-truth decisiveness on selected tasks.
Backbones: Frozen VLA (π0.5, π0, OpenVLA), read via latent embedding + robot state. Policy weights are not updated.
Headline metrics: (i) decision-sensitivity alignment with counterfactual ground truth; (ii) signal on success rollouts vs. flat failure detectors; (iii) RL sample-efficiency under critical-phase-only refinement.
Auxiliary: Held-out failure-AUROC vs. reported baselines; cross-task transfer.
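For reference, failure-AUROC is the probability that a randomly chosen failed rollout receives a higher score than a randomly chosen successful one. The sketch below computes it by direct pairwise comparison; it assumes each rollout has already been reduced to a single scalar score (how per-timestep scores are aggregated is not specified here), and the example numbers are made up for illustration.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve; label 1 marks a failed rollout, 0 a success.

    Counts, over all (failure, success) pairs, how often the failure scores
    higher; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failure and one success rollout")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative scores only: two failed rollouts, two successful ones.
example = auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of rollouts, which is fine for benchmark-sized evaluation sets; a rank-based implementation would be preferable for large ones.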

Numbers will be posted as experiments complete.

BibTeX

@article{anonymous2026cpd,
  title   = {Self-Supervised Critical Phase Detection for VLA Refinement},
  author  = {Anonymous},
  journal = {Preprint},
  year    = {2026},
  note    = {Under review},
}