VE2VF: Vision-Enabled to Vision-Free Distillation for Robust Contact-Rich Manipulation

Abstract

When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist (end-effector velocity), and wrench (force and torque) sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.

95%: overall success rate (NIST Assembly Benchmark)

50 min: total robot interaction for training

8: unseen task variants (zero-shot generalization)

0: domain randomization or data augmentation needed

Contributions

Cross-modal distillation: distills vision-enabled teacher policies into vision-free students, producing policies that are invariant to visual conditions.
Efficient generalization: co-training on 3 tasks enables zero-shot transfer to 8 unseen task variants, with full success on the hardest task achievable via brief fine-tuning with distillation.
Real-world validation: 95% overall success on the NIST Assembly Benchmark after ~50 min of robot interaction, outperforming all baselines in robustness and adaptability.

Method

VE2VF Training Pipeline

I. A vision-enabled teacher is trained via human-in-the-loop RL using camera images alongside proprioceptive inputs.
II. The teacher is distilled into a vision-free student that uses only pose, twist, and wrench, with no cameras at deployment.
III. (Optional) A new task-specific teacher is trained and used to fine-tune the student in just 10 additional minutes.

Fig. 2: VE2VF training stages.
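As a rough illustration of stage II, the sketch below regresses a vision-free student onto the actions of a vision-enabled teacher. The source does not state the distillation objective, so the squared-error action-matching loss, the linear student, the random state sampling, and the names `teacher_action`, `student_action`, and `distill` are all assumptions made for illustration, not the authors' implementation.

```python
import random

def teacher_action(image_feat, proprio):
    # Hypothetical stand-in teacher: sees a visual feature AND proprioception.
    return [0.5 * p + 0.1 * image_feat for p in proprio]

def student_action(weights, proprio):
    # Vision-free student: one linear gain per proprioceptive dimension.
    return [w * p for w, p in zip(weights, proprio)]

def distill(num_steps=3000, lr=0.05, dim=3, seed=0):
    """Fit the student to teacher actions with SGD on squared error (assumed loss)."""
    rng = random.Random(seed)
    weights = [0.0] * dim
    for _ in range(num_steps):
        proprio = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        image_feat = rng.uniform(-1.0, 1.0)  # visible to the teacher only
        target = teacher_action(image_feat, proprio)
        pred = student_action(weights, proprio)
        for i in range(dim):
            # Gradient of (pred_i - target_i)^2 with respect to weights[i].
            weights[i] -= lr * 2.0 * (pred[i] - target[i]) * proprio[i]
    return weights

student_weights = distill()
```

In this toy setup the student recovers the proprioception-dependent part of the teacher's behavior (gains near 0.5) and cannot reproduce the part that depends on the visual feature, which it never observes.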

Input Modalities & Architecture

Teacher: two 128×128 RGB views (frozen ResNet-10) + pose, velocity, force, torque.

Student: pose, velocity, force, torque only — no cameras.

Both policies use FC MLP actors/critics (256:256), trained with SAC + RLPD (Reinforcement Learning with Prior Data), mixing policy rollouts and human demonstrations.
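RLPD's published recipe draws each training batch half from prior data and half from online experience ("symmetric sampling"). A minimal sketch of that mixing step follows. The buffer contents and the function name `sample_mixed_batch` are hypothetical, and whether this work keeps the exact 50/50 ratio is an assumption.

```python
import random

def sample_mixed_batch(online_buffer, demo_buffer, batch_size, rng):
    """Draw half the batch from demonstrations and the rest from policy rollouts."""
    n_demo = batch_size // 2
    n_online = batch_size - n_demo
    # Sample with replacement so small buffers early in training still work.
    batch = rng.choices(demo_buffer, k=n_demo) + rng.choices(online_buffer, k=n_online)
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
demos = [("demo", i) for i in range(20)]        # human demonstration transitions
rollouts = [("online", i) for i in range(500)]  # policy rollout transitions
batch = sample_mixed_batch(rollouts, demos, batch_size=256, rng=rng)
n_from_demos = sum(1 for source, _ in batch if source == "demo")
```

Fixing the share of demonstration data per batch keeps the scarce human data influential even as the rollout buffer grows.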

RL Hyperparameters & Architecture
Setting	Value
Visual observations	Two 128×128 RGB views
Action (translation)	3×3×3 cm along x,y,z
Action (rotation)	10×10×10° about x,y,z
Image encoder	Pretrained ResNet-10 (frozen)
Proprioception encoder	FC 64
Actor / Critic networks	FC MLP 256 : 256
RL algorithm	SAC + RLPD
Control frequency	10 Hz (episode: 10 s train / 15 s eval)
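
For readers reimplementing the setup, the table can be captured as a small configuration object. This is only a sketch: the class and field names are hypothetical, and only the values come from the table. The derived step counts follow from control frequency times episode duration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RLConfig:
    image_size: tuple = (128, 128)    # each RGB view
    num_views: int = 2
    proprio_encoder_units: int = 64   # FC 64
    hidden_units: tuple = (256, 256)  # actor / critic MLP
    control_hz: int = 10
    train_episode_s: int = 10
    eval_episode_s: int = 15

    def steps_per_episode(self, evaluation=False):
        # Number of control steps = frequency (Hz) x episode duration (s).
        seconds = self.eval_episode_s if evaluation else self.train_episode_s
        return self.control_hz * seconds

cfg = RLConfig()
train_steps = cfg.steps_per_episode()                 # 10 Hz x 10 s
eval_steps = cfg.steps_per_episode(evaluation=True)   # 10 Hz x 15 s
```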

Task Benchmark — NIST Assembly Board I

Trained on 3 tasks (M Peg, Ethernet, M Gear) chosen for their distinct contact dynamics. Evaluated zero-shot on all 8 connectors of the NIST Assembly Board I replica. Hardware: Franka FR3, two wrist-mounted RealSense D435i cameras, 3D Space Mouse for human teleoperation.

(a) All connectors

(b) Training tasks

(c) Test tasks (unseen in training)

Fig. 3: Benchmark tasks. 1. S gear, 2. M gear, 3. L gear, 4. M peg, 5. L peg, 6. Ethernet, 7. USB, 8. DSUB. Policies are trained on tasks in (b) and evaluated on all 8 tasks.

Experiments

Robustness and Generalization

Baselines: HIL-SERL (Human-In-the-Loop Sample Efficient RL) in a vision variant (VPTW) and a proprioception-only variant (PTW, no distillation); DMP (open-loop motion primitive from a kinesthetic demo); Residual RL (DMP + SAC correction). Each method is tested under three conditions: nominal (the training setup), disturbed (visual distractors + pose noise), and out-of-distribution (OOD, unseen connectors).

Method	Input	Training: M Peg	Training: Eth.	Training: M Gear	Disturbed: M Peg	Disturbed: Eth.	Disturbed: M Gear	OOD: M Peg	OOD: L Peg	OOD: Eth.	OOD: USB	OOD: DSUB	OOD: S Gear	OOD: M Gear	OOD: L Gear	Overall
HIL-SERL	VPTW	10/10	10/10	10/10	0/10	10/10	8/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	0/10	34.3%
HIL-SERL	PTW	7/10	6/10	9/10	5/10	4/10	8/10	6/10	8/10	0/10	5/10	2/10	1/10	4/10	1/10	47.1%
DMP	P	9/10	5/10	9/10	4/10	2/10	2/10	4/10	3/10	5/10	0/10	0/10	0/10	4/10	3/10	35.7%
Residual RL	PTW	9/10	10/10	10/10	8/10	9/10	10/10	9/10	9/10	9/10	6/10	6/10	7/10	10/10	8/10	85.7%
VE2VF (ours)	PTW	10/10	10/10	10/10	10/10	10/10	10/10	10/10	10/10	10/10	10/10	5/10	9/10	10/10	9/10	95.0%

Table II: Success rates (trials out of 10) across training, disturbed, and out-of-distribution insertion tasks. Input modalities: V=vision, P=pose, T=twist, W=wrench. Bold = best per column. See paper for full citations.
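
The Overall column can be reproduced by pooling all 14 task columns (140 trials per method). The short check below does that with the counts from Table II; the variable and function names are illustrative.

```python
# Successes out of 10 per column, copied from Table II in column order.
results = {
    "HIL-SERL (VPTW)": [10, 10, 10, 0, 10, 8, 0, 0, 0, 0, 0, 0, 0, 0],
    "HIL-SERL (PTW)":  [7, 6, 9, 5, 4, 8, 6, 8, 0, 5, 2, 1, 4, 1],
    "DMP":             [9, 5, 9, 4, 2, 2, 4, 3, 5, 0, 0, 0, 4, 3],
    "Residual RL":     [9, 10, 10, 8, 9, 10, 9, 9, 9, 6, 6, 7, 10, 8],
    "VE2VF":           [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5, 9, 10, 9],
}
TRIALS_PER_TASK = 10

def overall_rate(successes):
    """Pooled success rate in percent over all task columns."""
    return 100.0 * sum(successes) / (TRIALS_PER_TASK * len(successes))

overall = {name: round(overall_rate(s), 1) for name, s in results.items()}
```

For VE2VF this is 133 successes in 140 trials, which rounds to the reported 95.0%.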

VE2VF reaches 95.0% overall (133/140 trials), with perfect scores on the training and disturbed tasks and strong OOD generalization. DSUB is the weakest case (5/10 zero-shot) and is addressed by fine-tuning. HIL-SERL VPTW, which serves as our vision-enabled teacher, fails on every OOD task (0/80), confirming that vision-based policies overfit to training appearance.

Input Modality Ablation

Pose is critical: sTW (no pose) is the only distilled student that fails on the training tasks (4/10 on M Peg, 7/10 on Ethernet), despite its longer observation history.
All modalities together (sPTW) are needed for disturbance robustness: sP, which relies on pose alone, drops to 2/10 on M Peg under pose noise.
Distillation is essential: a PTW policy trained from scratch for 75 min never matches the distilled student; it plateaus and then overfits.

Method	Train time	Training: M Peg	Training: Eth.	Training: M Gear	Disturbed: M Peg	Disturbed: Eth.	Disturbed: M Gear
VPTW	40 min	10/10	10/10	10/10	0/10	10/10	8/10
V	40 min	10/10	10/10	10/10	0/10	6/10	3/10
PTW	50 min	7/10	6/10	9/10	5/10	4/10	8/10
sPTW (ours)	40+10 min	10/10	10/10	10/10	10/10	10/10	10/10
sP	40+10 min	10/10	10/10	10/10	2/10	7/10	8/10
sTW	40+10 min	4/10	7/10	10/10	4/10	7/10	10/10

Table III: Input modality ablation on training tasks under normal and disturbed conditions. Prefix s denotes a student policy distilled for 10 min from a VPTW teacher trained for 40 min. sPTW=pose+twist+wrench; sP=pose only; sTW=twist+wrench (8-step history).
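
The sTW student replaces pose with an 8-step history of twist and wrench. One way such an observation could be assembled is sketched below. The 6-D twist and 6-D wrench sizes, the zero-padding at episode start, and the `HistoryObs` class are assumptions, since the source does not describe the implementation.

```python
from collections import deque

class HistoryObs:
    """Keeps the last `steps` twist+wrench readings and flattens them, oldest first."""

    def __init__(self, steps=8, twist_dim=6, wrench_dim=6):
        self.frame_dim = twist_dim + wrench_dim
        # Pre-fill with zeros so the observation has a fixed size from step 0.
        self.frames = deque([[0.0] * self.frame_dim for _ in range(steps)], maxlen=steps)

    def push(self, twist, wrench):
        frame = list(twist) + list(wrench)
        if len(frame) != self.frame_dim:
            raise ValueError("unexpected twist/wrench size")
        self.frames.append(frame)  # maxlen drops the oldest frame automatically

    def observation(self):
        return [value for frame in self.frames for value in frame]

hist = HistoryObs()
for t in range(10):
    hist.push([float(t)] * 6, [0.0] * 6)
obs = hist.observation()  # 8 frames x 12 values, covering steps 2 through 9
```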

Fig. 6: Training progression of the vision-free policy PTW trained from scratch without distillation. Performance plateaus after 50 minutes; longer training leads to overfitting on the gear task.

Fine-tuning for Challenging Tasks

VE2VF achieves only 5/10 (50%) on DSUB zero-shot. Fine-tuning with distillation from a DSUB-specific teacher raises this to 10/10 in 10 minutes. Fine-tuning without distillation drops to 0/10, confirming that proprioception alone cannot guide RL exploration from scratch.

DSUB Insertion Success Rate
Method	Success
Residual RL zero-shot	6/10
VE2VF zero-shot	5/10
Residual RL fine-tuned	7/10
VE2VF fine-tuned w/o distillation	0/10
VE2VF fine-tuned w/ distillation	10/10

Table IV: DSUB insertion success rates before and after fine-tuning.

BibTeX

@article{ve2vf2025,
  title     = {{VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world
               Reinforcement Learning for Robust Contact-Rich Manipulation}},
  author    = {Kowalski, Victor and Li, Chengxi and Lee, Dongheui},
  journal   = {arXiv preprint},
  year      = {2026}
}
