CPD
Critical-Phase Detection for Vision-Language-Action Policies — CoRL 2026 (under review)
Hanyang University
TL;DR
We build a zero-shot critical-phase detector for frozen vision-language-action policies. Given just 5–20 expert demonstrations and a frozen π₀ / π₀.₅ backbone, we automatically identify which rollout steps a robot is most likely to fail at — no human labels, no task-specific reward, no per-task tuning. On LIBERO-Long manipulation, we reach leave-one-out F1 of 0.86–0.89 for trajectory-level failure detection.
Why this matters
VLA models like π₀, OpenVLA, and RT-2 work most of the time on common manipulation tasks — but they fail unpredictably on precision phases: the last few millimeters of insertion, the alignment before a grasp, the handoff between sub-tasks. Recent work (RLT, 2026) showed that focusing RL fine-tuning on just these critical phases is dramatically more sample-efficient than fine-tuning the whole trajectory.
The remaining question is how to find them. Existing answers all have a per-task cost:
| Approach | What it needs | What it costs |
|---|---|---|
| Supervised labeling | A human annotating each trajectory | Time, doesn’t scale |
| Ground-truth success label | A boolean “task done” predicate per task | Engineer-coded for each task; doesn’t transfer |
| Reward-shaped RL | A dense reward function | Task-specific, often impossible for VLA |
CPD removes all of these. The detector reconfigures itself for a new task from only the demos, exploiting the fact that VLA backbones already encode “this state looks like task progress” — we just need to read it out.
Method
Temporal-distance latent representation (TLDR)
We learn a 64-D encoder φ on the proprioception of demos with a contrastive triplet loss: states that are close in time within the same demo are pulled together, states that are far apart are pushed apart. This collapses irrelevant variation (initial poses, recovery wiggles) and lines up trajectories on a 1-D progress coordinate.
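The temporal triplet objective can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the linear map `W` stands in for the learned encoder φ, and the window sizes `near`/`far` and the margin are assumptions chosen for the sketch — a real encoder would be a trained network optimized by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(W, s):
    """Toy linear stand-in for the learned encoder phi (proprio -> 64-D latent)."""
    return s @ W

def temporal_triplet_loss(W, demo, n=32, near=5, far=50, margin=1.0):
    """Pull states <= `near` steps apart together; push states >= `far` apart."""
    T = len(demo)
    a = rng.integers(0, T - far, size=n)          # anchor timesteps
    p = a + rng.integers(1, near + 1, size=n)     # temporal positives (close in time)
    q = a + far                                   # temporal negatives (far in time)
    za, zp, zq = encode(W, demo[a]), encode(W, demo[p]), encode(W, demo[q])
    d_pos = np.linalg.norm(za - zp, axis=1)
    d_neg = np.linalg.norm(za - zq, axis=1)
    # Standard triplet hinge: positive pairs closer than negatives by `margin`.
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

# Fake 200-step, 9-DoF proprioception trajectory (random walk).
demo = np.cumsum(rng.normal(size=(200, 9)), axis=0)
W = rng.normal(scale=0.1, size=(9, 64))
print(temporal_triplet_loss(W, demo))
```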
G2 self-supervised labeler
Given a rollout from a frozen VLA policy, we don’t know whether it succeeded — that’s what an oracle predicate would tell us. Instead, we check whether the rollout’s final encoded state lands inside the cluster of demo end-states in latent space:
\[\ell_{G2}(\tau) = \mathbb{1}\big[\| \varphi(s_T) - g \| < \varepsilon \big]\]

where g is the demo end-state centroid and ε is set automatically from demo statistics. No tunable threshold, no per-task constants. Theorem 1 in the paper proves this label converges to the ground-truth success predicate as the number of demos grows, under mild Lipschitz / ρ-cover assumptions on φ.
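A minimal sketch of the G2 labeler, assuming the encoder is already trained. The specific rule for ε used here — the maximum demo-to-centroid distance — is one parameter-free choice we assume for illustration; the paper derives its own ε from demo statistics.

```python
import numpy as np

def fit_g2(demo_end_latents):
    """Fit centroid g and radius eps from encoded demo end-states.

    eps = max demo distance to the centroid is an assumed, parameter-free
    choice for this sketch; the paper's exact rule may differ.
    """
    g = demo_end_latents.mean(axis=0)
    eps = np.linalg.norm(demo_end_latents - g, axis=1).max()
    return g, eps

def g2_label(z_final, g, eps):
    """Self-supervised success label: 1 if the rollout's final latent
    lands inside the demo end-state ball, else 0."""
    return float(np.linalg.norm(z_final - g) < eps)

rng = np.random.default_rng(1)
demo_ends = rng.normal(scale=0.2, size=(10, 64))  # toy encoded demo end-states
g, eps = fit_g2(demo_ends)
print(g2_label(g, g, eps), g2_label(g + 10.0, g, eps))
```

The same fitted `(g, eps)` pair labels every rollout of the task, so no per-rollout supervision is needed.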
Per-step critical score
Rollouts are split by G2 into a success buffer 𝐵+ and a failure buffer 𝐵−. We fit a separate kernel density estimate on each buffer and define

\[r_t = \log \tilde{f}_+(z_t) - \log \tilde{f}_-(z_t)\]

This is the Bayes-optimal log-likelihood ratio. A step is critical when r_t < 0 for at least 3 consecutive steps (debouncing single-step noise). Trajectory-level rules — longest critical run, total critical-step count, critical fraction — are derived from this per-step signal.
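The per-step score and debouncing rule can be sketched with a hand-rolled isotropic Gaussian KDE. The bandwidth `bw` here is an illustrative assumption; only the `min_run=3` debounce comes from the text above.

```python
import numpy as np

def log_kde(x, data, bw=0.5):
    """Log-density of an isotropic Gaussian KDE (bandwidth bw) at each row of x."""
    d = data.shape[1]
    diff = x[:, None, :] - data[None, :, :]              # (n_eval, n_data, d)
    log_k = -(diff ** 2).sum(-1) / (2 * bw ** 2) \
            - 0.5 * d * np.log(2 * np.pi * bw ** 2)
    return np.logaddexp.reduce(log_k, axis=1) - np.log(len(data))

def critical_steps(z_traj, B_pos, B_neg, min_run=3):
    """r_t = log f+(z_t) - log f-(z_t); flag steps inside runs of >= min_run
    consecutive steps with r_t < 0 (debouncing single-step noise)."""
    r = log_kde(z_traj, B_pos) - log_kde(z_traj, B_neg)
    below = r < 0
    flagged = np.zeros_like(below)
    run = 0
    for t, b in enumerate(below):
        run = run + 1 if b else 0
        if run >= min_run:
            flagged[t - min_run + 1 : t + 1] = True
    return r, flagged

rng = np.random.default_rng(2)
B_pos = rng.normal(0.0, 0.5, size=(50, 2))   # toy success-buffer latents
B_neg = rng.normal(5.0, 0.5, size=(50, 2))   # toy failure-buffer latents
# Trajectory that starts near the success mode, drifts into the failure mode.
z_traj = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])
r, flagged = critical_steps(z_traj, B_pos, B_neg)
print(flagged)
```

The trajectory-level rules then reduce to simple reductions over `flagged` (longest True run, `flagged.sum()`, `flagged.mean()`).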
Results on LIBERO-Long (task 00, π₀.₅ backbone, 200 rollouts)
Detection F1 scales with demos
Per-step critical phase: success vs failure cases
Headline numbers
| Backbone | Task | N rollouts | LOO-CV F1 | Reward separation (z-score) |
|---|---|---|---|---|
| π₀.₅ | LIBERO-Long task 00 | 140 | 0.889 | 4.09 |
| π₀.₅ | LIBERO-Long task 00 | 200 | 0.86 | 4.09 |
The separation is essentially clean — the F1 ceiling is set by failure-pool size, not by signal quality.
What’s next
- Scaling to harder tasks with more failure modes (LIBERO-10 with weaker backbones) — needed to escape the small-failure-pool ceiling
- Latent-space correction: once a critical step is detected, project toward the nearest success-buffer latent and decode an action correction — converts CPD from a detector to a controller
- Extension to language-conditioned task transfer: does the G2 ε threshold transfer across tasks within the same backbone?
BibTeX

```bibtex
@inproceedings{choi2026cpd,
  title     = {Critical-Phase Detection for Vision-Language-Action Policies},
  author    = {Choi, Chanyeok},
  booktitle = {Conference on Robot Learning (CoRL)},
  year      = {2026},
  note      = {Under review},
}
```