Self-Supervised Critical Phase Detection for VLA Refinement

Anonymous

Anonymous Institution

Preprint · Under review · 2026

Paper (coming soon) · Code (upon publication)

Figure: Critical phases (shaded) localized on two LIBERO-Long rollouts (π_0.5 backbone). They cover only a small fraction of each rollout.

TL;DR. A Vision-Language-Action policy nails most of a manipulation task, then fails on the one or two moments that actually decide it. We call those critical phases and define them by decision sensitivity — how much the action there flips the outcome, not how likely failure looks. We learn to detect them from successful demonstrations alone and concentrate reinforcement-learning refinement only there.

Abstract

Vision-Language-Action (VLA) policies handle most of a manipulation rollout, then fail on the few decisions that actually matter: the instant a peg first tilts inside a hole, the moment a grasp commits. We call these moments critical phases and define them by decision sensitivity: how likely a small change to the action at that timestep is to flip the rollout's eventual success or failure.

Crucially, a critical phase is not the same as a likely failure. Successful rollouts pass through the very same decision points, so criticality is a property of the decision structure, not of failure probability. This separates our target from runtime failure detection, which measures how out-of-distribution the current step looks and therefore stays quiet on successes and reacts only after a failure has surfaced.

We study how to detect critical phases from successful demonstrations alone — no failure labels and no per-task success oracle — using the frozen policy's latent embedding and the robot state, and how to concentrate reinforcement-learning refinement on exactly those phases so that precision tasks such as peg-in-hole converge with far fewer environment steps. (Method and experiments in progress.)

What is a critical phase?

Precision manipulation hinges on a handful of decision-sensitive moments. Consider driving a bolt into a hole (peg-in-hole). Following the classical contact-state decomposition (Mason 1981; Debus et al.), the task splits into contact regimes, and in each one a different control variable decides the outcome:

Contact 1 — approach. Move too fast and inertia overcomes the magnetic hold; the bolt drops off the driver. Speed is decisive.
Contact 2 — surface touch. Press too hard and the bolt bounces away. Contact force is decisive.
Contact 3 — rim alignment. The bolt must move along the hole's normal direction. Vertical alignment is decisive.
Contact 4 — seating. How far to turn / how hard to press to seat the bolt. Torque is decisive.

In each regime a small action error decides the success or failure of everything that follows. These intervals have historically been carved out and labeled by hand. We instead define them operationally: a critical phase is a timestep where perturbing the action measurably flips the outcome.
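The operational definition above can be illustrated on a toy problem. The following is a minimal sketch, not the detector described on this page: the 1-D environment, the `rollout` and `flip_rate` helpers, the gate timestep, the tolerance, and the uniform perturbation are all assumptions made up for illustration. It estimates, for each timestep, how often perturbing only that action flips the outcome of an otherwise successful rollout.

```python
import random

def rollout(actions, gate=12, tol=0.2):
    """Hypothetical 1-D insertion: lateral error x halves each step,
    and the outcome is fixed the moment the gate step is executed."""
    x = 0.0
    for t, a in enumerate(actions):
        x = 0.5 * x + a
        if t == gate:
            return abs(x) < tol  # success iff the error is within tolerance
    return False  # gate never reached

def flip_rate(nominal, t, n_samples=2000, scale=1.0, seed=0):
    """Monte-Carlo estimate of how often perturbing action t flips the outcome."""
    rng = random.Random(seed)
    base = rollout(nominal)
    flips = 0
    for _ in range(n_samples):
        perturbed = list(nominal)
        perturbed[t] += rng.uniform(-scale, scale)
        flips += rollout(perturbed) != base
    return flips / n_samples

nominal = [0.0] * 20  # a successful rollout: the error stays at zero
profile = [flip_rate(nominal, t) for t in range(len(nominal))]
critical = [t for t, c in enumerate(profile) if c > 0.5]
print(critical)
```

In this toy setting the nominal rollout succeeds, yet the flip rate is high just before and at the gate and zero elsewhere: early errors decay away and post-gate actions no longer matter. That is the sense in which a success rollout still passes through a decision point.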

Criticality is not failure probability

Aspect	Failure detection	Critical phase (ours)
Question	Is this rollout going to fail?	Where is the outcome being decided?
Signal source	OOD / uncertainty of the current step	Decision sensitivity of the current action
On a success rollout	Stays flat — nothing looks wrong	Still fires — successes pass through the decision point too
Timing	After the failure has surfaced (a symptom)	Before the failure is irreversible (the cause)

Because criticality lives inside successful executions, it can be learned from success-only data, and it pinpoints spatially-overlapping failures — ones that never leave the success manifold and so are invisible to density-/OOD-based detectors.
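One way to write the contrast down, as a sketch in notation introduced here rather than taken from the paper: let τ be a rollout with states s_t and actions a_t, let Y(τ) ∈ {0, 1} be its outcome (1 for success), and let τ^(t,ε) be the rollout obtained by replacing a_t with a_t + ε and then continuing under the same policy. The Gaussian perturbation and its scale σ are assumptions.

```latex
% Failure probability: depends on how the current state looks.
p^{\mathrm{fail}}_t = \Pr\left( Y(\tau) = 0 \mid s_t \right)

% Decision sensitivity: chance that perturbing only a_t changes the outcome.
c_t = \Pr_{\epsilon \sim \mathcal{N}(0,\, \sigma^2 I)}
      \left( Y\!\left(\tau^{(t,\epsilon)}\right) \neq Y(\tau) \right)
```

Under this reading, a reliable policy can have a small failure probability at a state where c_t is still large, which is the case the right-hand column of the table describes.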

Contributions

1. Definition & detection

Critical phase formalized as decision sensitivity, with a per-timestep score c_t learned from success rollouts only. Validated against counterfactual ground-truth decisiveness — the measured rate at which perturbing the action at t flips the outcome in simulation.

2. Criticality ≠ failure

Direct evidence that c_t is not a failure probability: it responds on success rollouts where failure detectors stay flat, and it catches the decision point of spatially-overlapping failures they miss. Evaluated head-to-head against FAIL-Detect, FIPER, and SAFE on LIBERO.

3. Localized RL refinement

Concentrating reinforcement-learning refinement on the detected critical phases accelerates a frozen VLA's convergence on precision tasks, reaching higher sample efficiency than uniform RL fine-tuning (VLA-RL, SimpleVLA-RL).
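The page does not specify the RL algorithm or which parameters are trained, so the following is only one possible reading of "concentrating refinement" on critical phases: a REINFORCE-style surrogate loss in which timesteps outside the detected phase contribute nothing. The function names, the hard threshold, and the example numbers are all hypothetical.

```python
def critical_mask(scores, threshold):
    """Binary mask: 1.0 where the per-timestep score c_t reaches the threshold."""
    return [1.0 if c >= threshold else 0.0 for c in scores]

def localized_pg_loss(log_probs, advantages, mask):
    """Policy-gradient surrogate loss averaged over masked (critical) steps only."""
    total = sum(mask)
    if total == 0:
        return 0.0  # no critical step detected: nothing to refine
    weighted = sum(m * lp * adv for m, lp, adv in zip(mask, log_probs, advantages))
    return -weighted / total

# Illustrative values for a five-step rollout (not measured data).
scores = [0.02, 0.05, 0.71, 0.83, 0.10]      # c_t from a detector
log_probs = [-1.0, -0.5, -2.0, -1.5, -0.2]   # log pi(a_t | s_t)
advantages = [0.3, 0.3, 1.0, -1.0, 0.3]

mask = critical_mask(scores, threshold=0.5)
loss = localized_pg_loss(log_probs, advantages, mask)
print(mask, loss)
```

A soft weighting by c_t instead of a hard mask would be an equally plausible variant; the point of the sketch is only that the learning signal is restricted to the detected steps.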

Acting inside the phase: an intervention probe (preliminary)

A first test of using a detected phase. We attach a lightweight negate-replay intervention to a runtime failure-probability detector (an LSTM over the π_0.5 latent, SAFE-style), triggered when the score crosses a per-task functional conformal threshold δ_t calibrated on successful rollouts. Each clip below is the baseline rollout (left) versus the same rollout with intervention (right), with the failure score and δ_t overlaid.

LATE ALARM: the failure score stays inside the success band until t≈366/520 (∼70% of the episode), so the alarm fires only once the failure is already unfolding.

TOO LATE TO ACT: three negate-replay reversals fire, all past the fork, and the rollout still fails.

The takeaway is the point of this work: in this probe, even with a valid, per-task calibrated threshold, the failure-probability signal on the policy's latent fires only once failure is already in progress, too late for the intervention to help. Such an intervention needs to act where the outcome is actually being decided (the critical phase, upstream of failure), not where failure merely looks likely.
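For readers unfamiliar with time-varying conformal thresholds, the per-task threshold δ_t used in the probe can be approximated by the simplified sketch below. It is an assumption-laden stand-in, not the probe's actual calibration code: it uses a one-sided band of constant width around the mean success curve, and the function names, the data split, and the example scores are hypothetical.

```python
import math

def conformal_band(mean_rollouts, calib_rollouts, alpha=0.1):
    """Upper threshold delta_t = mu_t + h, built from successful rollouts only.

    mean_rollouts estimate the mean score curve mu_t; calib_rollouts (held out)
    set the margin h so that roughly a 1 - alpha fraction of new successful
    rollouts stays below the band at every timestep.
    """
    T = len(mean_rollouts[0])
    mu = [sum(r[t] for r in mean_rollouts) / len(mean_rollouts) for t in range(T)]
    # Nonconformity of a rollout: its largest excursion above the mean curve.
    excursions = sorted(max(r[t] - mu[t] for t in range(T)) for r in calib_rollouts)
    n = len(excursions)
    # Conformal rank; clipping to n gives up the guarantee when n is too small.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    h = excursions[k - 1]
    return [m + h for m in mu]

def first_alarm(scores, delta):
    """Index of the first timestep whose score exceeds the threshold, else None."""
    for t, (s, d) in enumerate(zip(scores, delta)):
        if s > d:
            return t
    return None

# Illustrative failure scores on successful rollouts (not measured data).
mean_rollouts = [[0.1, 0.2, 0.3], [0.3, 0.4, 0.5]]
offsets = [-0.04, -0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.04]
calib_rollouts = [[b + o for b in (0.2, 0.3, 0.4)] for o in offsets]

delta = conformal_band(mean_rollouts, calib_rollouts, alpha=0.2)
alarm_t = first_alarm([0.2, 0.3, 0.5], delta)
print(delta, alarm_t)
```

Because the band is calibrated to contain successful rollouts, a score can only cross it once the rollout has already left the success band, which is consistent with the late alarm seen in the clips.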

How it relates to prior work

Runtime failure detection — FAIL-Detect (Xu et al., RSS 2025), FIPER (Römer et al., NeurIPS 2025), SAFE (Gu et al., NeurIPS 2025) — raises an alarm from OOD or uncertainty signals. These are detection-only and measure failure probability; we instead identify the decision structure inside successful rollouts and connect it to policy refinement. We compare directly on LIBERO using failure-AUROC.

Critical-step / critical-moment — Liu (ICCV 2023), Tang (2026), Mao / JITI (2025), Kappler (RSS 2015) — variously require return diversity, failure labels at calibration, policy token logits, spatial features, or mixed-quality demonstrations. Our setting is success-only, representation-level, and temporal, and remains policy-agnostic (no token logits).

Method & experiments (in progress)

The detector and the critical-phase-localized RL refiner are under active development. The evaluation plan:

Testbed	LIBERO (quantitative headline) + a contact-rich insertion task (qualitative); sim counterfactual rollouts give ground-truth decisiveness on selected tasks.
Backbones	Frozen VLA (π_0.5, π_0, OpenVLA), read via latent embedding + robot state. Policy weights are not updated.
Headline metrics	(i) decision-sensitivity alignment with counterfactual ground truth; (ii) signal on success rollouts vs. flat failure detectors; (iii) RL sample-efficiency under critical-phase-only refinement.
Auxiliary	Held-out failure-AUROC vs. reported baselines; cross-task transfer.
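As a reference for the failure-AUROC entry in the plan above, here is a minimal rank-based computation. It assumes each rollout has been reduced to a single score (for example the maximum failure score over time, which is our assumption, not a stated protocol) and a binary label with 1 for failure; the numbers are illustrative only.

```python
def auroc(scores, labels):
    """Probability that a random failure outscores a random success (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failure and one success rollout")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative per-rollout scores and labels (1 = failure), not measured data.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
value = auroc(scores, labels)
print(round(value, 4))
```

The pairwise form is quadratic in the number of rollouts, which is fine at benchmark scale; a sort-based version would be preferable for large evaluation sets.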

Numbers will be posted as experiments complete.

BibTeX

@article{anonymous2026cpd,
  title   = {Self-Supervised Critical Phase Detection for VLA Refinement},
  author  = {Anonymous},
  journal = {Preprint},
  year    = {2026},
  note    = {Under review},
}