VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

Victor Kowalski¹, Chengxi Li¹, Dongheui Lee¹,²
¹ Autonomous Systems Lab, Technische Universität Wien (TU Wien)
² Institute of Robotics and Mechatronics, German Aerospace Center (DLR)

Abstract

When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist (end-effector velocity), and wrench (force and torque) sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.

  • 95% overall success rate (NIST Assembly Benchmark)
  • 50 min total robot interaction for training
  • 8 unseen task variants (zero-shot generalization)
  • 0 domain randomization or data augmentation needed

Contributions

  • Cross-modal distillation — distills vision-enabled teacher policies into vision-free students, producing policies invariant to visual conditions.
  • Efficient generalization — co-training on 3 tasks enables zero-shot transfer to 8 unseen connectors, with full success achievable via brief fine-tuning with distillation.
  • Real-world validation — 95% overall success on the NIST Assembly Benchmark after ~50 min of robot interaction, outperforming all baselines in robustness and adaptability.

Method

VE2VF Training Pipeline


  I. A vision-enabled teacher is trained via human-in-the-loop RL using camera images alongside proprioceptive inputs.
  II. The teacher is distilled into a vision-free student that uses only pose, twist, and wrench, with no cameras at deployment.
  III. (Optional) A new task-specific teacher is trained and used to fine-tune the student in just 10 additional minutes.
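
Stage II can be illustrated with a toy example. This page does not state the distillation objective, so the sketch below assumes one common choice: regressing the student's action onto the teacher's action with a mean-squared-error loss. The tanh-linear student, the synthetic data, and all dimensions are hypothetical stand-ins, not the authors' networks or training setup.

```python
import numpy as np

def distill_student(proprio, teacher_actions, lr=0.1, steps=500):
    """Fit a tanh-linear student on proprioception only.

    Assumed objective (not from the source): mean squared error between
    student and teacher actions, minimized by plain gradient descent.
    """
    weights = np.zeros((proprio.shape[1], teacher_actions.shape[1]))
    for _ in range(steps):
        pred = np.tanh(proprio @ weights)
        err = pred - teacher_actions
        # Gradient of 0.5 * mean squared error through the tanh output.
        grad = proprio.T @ (err * (1.0 - pred ** 2)) / len(proprio)
        weights -= lr * grad
    return weights

def mse(proprio, weights, teacher_actions):
    """Mean squared gap between student and teacher actions."""
    return float(np.mean((np.tanh(proprio @ weights) - teacher_actions) ** 2))

rng = np.random.default_rng(0)
n, proprio_dim, vision_dim, action_dim = 512, 19, 8, 6  # illustrative sizes

proprio = rng.normal(size=(n, proprio_dim))
# Stand-in image features: correlated with the robot state, plus noise.
vision = (proprio @ rng.normal(size=(proprio_dim, vision_dim))) * 0.3
vision += 0.1 * rng.normal(size=(n, vision_dim))
# Stand-in teacher: a fixed policy that sees both modalities.
teacher_actions = np.tanh(
    0.3 * proprio @ rng.normal(size=(proprio_dim, action_dim))
    + 0.3 * vision @ rng.normal(size=(vision_dim, action_dim))
)

# The student only ever receives the proprioceptive part of each observation.
student_weights = distill_student(proprio, teacher_actions)
```

The student never sees `vision`; it can only match the part of the teacher's behavior that is recoverable from proprioception, which is the intent of a vision-free student.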


Fig. 2: VE2VF training stages.

Input Modalities & Architecture


Teacher: two 128×128 RGB views (frozen ResNet-10) + pose, velocity, force, torque.

Student: pose, velocity, force, torque only — no cameras.
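
The difference between the two input sets can be sketched as a small observation container. The field names and the 7-D pose (position plus quaternion) are assumptions made for illustration; the page specifies only the two 128×128 RGB views and the pose, twist, and wrench signals.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Observation:
    images: np.ndarray  # (2, 128, 128, 3): two RGB views, teacher only
    pose: np.ndarray    # assumed 7-D: position (3) + quaternion (4)
    twist: np.ndarray   # 6-D: linear + angular end-effector velocity
    wrench: np.ndarray  # 6-D: force + torque

def student_input(obs: Observation) -> np.ndarray:
    """Vision-free input: concatenate pose, twist, and wrench; drop images."""
    return np.concatenate([obs.pose, obs.twist, obs.wrench])

obs = Observation(
    images=np.zeros((2, 128, 128, 3), dtype=np.uint8),
    pose=np.zeros(7),
    twist=np.zeros(6),
    wrench=np.zeros(6),
)
vec = student_input(obs)
```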

Both policies use FC MLP actors/critics (256:256), trained with SAC + RLPD (Reinforcement Learning with Prior Data), mixing policy rollouts and human demonstrations.
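
A minimal sketch of how rollouts and demonstrations might be mixed in each training batch. RLPD describes symmetric sampling, with half of each batch drawn from prior data; whether this project uses exactly a 50/50 split is an assumption, and `sample_mixed_batch` and its parameters are hypothetical names.

```python
import numpy as np

def sample_mixed_batch(online_buffer, demo_buffer, batch_size, rng,
                       demo_fraction=0.5):
    """Draw one batch mixing online rollouts with human demonstrations.

    demo_fraction=0.5 mirrors RLPD-style symmetric sampling (assumed here).
    Buffers are any indexable sequences of transitions.
    """
    n_demo = int(round(batch_size * demo_fraction))
    n_online = batch_size - n_demo
    demo_idx = rng.integers(0, len(demo_buffer), size=n_demo)
    online_idx = rng.integers(0, len(online_buffer), size=n_online)
    return ([demo_buffer[i] for i in demo_idx]
            + [online_buffer[i] for i in online_idx])

rng = np.random.default_rng(0)
online = [("online", i) for i in range(1000)]  # placeholder transitions
demos = [("demo", i) for i in range(200)]
batch = sample_mixed_batch(online, demos, batch_size=256, rng=rng)
```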

RL Hyperparameters & Architecture

| Setting | Value |
|---|---|
| Visual observations | Two 128×128 RGB views |
| Action (translation) | 3×3×3 cm along x, y, z |
| Action (rotation) | 10×10×10° about x, y, z |
| Image encoder | Pretrained ResNet-10 (frozen) |
| Proprioception encoder | FC 64 |
| Actor / Critic networks | FC MLP 256 : 256 |
| RL algorithm | SAC + RLPD |
| Control frequency | 10 Hz (episode: 10 s train / 15 s eval) |
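
The settings above could be collected in a small configuration object, for example to derive the number of control steps per episode. The class and field names are hypothetical; the values come from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RLConfig:
    num_views: int = 2
    image_size: int = 128            # pixels per side, RGB
    proprio_encoder_units: int = 64
    hidden_units: tuple = (256, 256)
    control_hz: int = 10
    train_episode_s: int = 10
    eval_episode_s: int = 15

    def max_steps(self, evaluation: bool = False) -> int:
        """Control steps per episode = control frequency x episode length."""
        seconds = self.eval_episode_s if evaluation else self.train_episode_s
        return self.control_hz * seconds

cfg = RLConfig()
train_steps = cfg.max_steps()                # 100 steps at 10 Hz for 10 s
eval_steps = cfg.max_steps(evaluation=True)  # 150 steps at 10 Hz for 15 s
```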

Task Benchmark — NIST Assembly Board I


Trained on 3 tasks (M Peg, Ethernet, M Gear) chosen for their distinct contact dynamics. Evaluated zero-shot on all 8 connectors of the NIST Assembly Board I replica. Hardware: Franka FR3, two wrist-mounted RealSense D435i cameras, 3D Space Mouse for human teleoperation.


(a) All connectors

(b) Training tasks

(c) Test tasks (unseen in training)

Fig. 3: Benchmark tasks. 1. S gear, 2. M gear, 3. L gear, 4. M peg, 5. L peg, 6. Ethernet, 7. USB, 8. DSUB. Policies are trained on tasks in (b) and evaluated on all 8 tasks.

Experiments

Robustness and Generalization

Baselines: HIL-SERL (Human-In-the-Loop Sample Efficient RL) in a vision variant (VPTW) and a proprioception-only variant (PTW, no distillation); DMP (open-loop motion primitive from a kinesthetic demo); Residual RL (DMP + SAC correction). Each method is tested under three conditions: the training tasks under normal conditions, disturbed (visual distractors + pose noise), and out-of-distribution (OOD, unseen connectors).

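| Method | Input | Train M Peg | Train Eth. | Train M Gear | Dist. M Peg | Dist. Eth. | Dist. M Gear | OOD M Peg | OOD L Peg | OOD Eth. | OOD USB | OOD DSUB | OOD S Gear | OOD M Gear | OOD L Gear | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HIL-SERL | VPTW | 10/10 | 10/10 | 10/10 | 0/10 | 10/10 | 8/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 34.3% |
| HIL-SERL | PTW | 7/10 | 6/10 | 9/10 | 5/10 | 4/10 | 8/10 | 6/10 | 8/10 | 0/10 | 5/10 | 2/10 | 1/10 | 4/10 | 1/10 | 47.1% |
| DMP | P | 9/10 | 5/10 | 9/10 | 4/10 | 2/10 | 2/10 | 4/10 | 3/10 | 5/10 | 0/10 | 0/10 | 0/10 | 4/10 | 3/10 | 35.7% |
| Residual RL | PTW | 9/10 | 10/10 | 10/10 | 8/10 | 9/10 | 10/10 | 9/10 | 9/10 | 9/10 | 6/10 | 6/10 | 7/10 | 10/10 | 8/10 | 85.7% |
| VE2VF (ours) | PTW | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 5/10 | 9/10 | 10/10 | 9/10 | 95.0% |

Column groups: Train = training tasks, Dist. = disturbed tasks, OOD = out-of-distribution tasks.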

Table II: Success rates (trials out of 10) across training, disturbed, and out-of-distribution insertion tasks. Input modalities: V=vision, P=pose, T=twist, W=wrench. Bold = best per column. See paper for full citations.
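
The Overall column is consistent with pooling all 14 task columns, i.e. successes out of 140 trials per method. The short check below recomputes it from the per-task counts in Table II; the variable and function names are illustrative.

```python
# Successes out of 10 trials per task, in Table II column order:
# 3 training tasks, 3 disturbed tasks, 8 out-of-distribution tasks.
results = {
    "HIL-SERL (VPTW)":    [10, 10, 10, 0, 10, 8, 0, 0, 0, 0, 0, 0, 0, 0],
    "HIL-SERL (PTW)":     [7, 6, 9, 5, 4, 8, 6, 8, 0, 5, 2, 1, 4, 1],
    "DMP (P)":            [9, 5, 9, 4, 2, 2, 4, 3, 5, 0, 0, 0, 4, 3],
    "Residual RL (PTW)":  [9, 10, 10, 8, 9, 10, 9, 9, 9, 6, 6, 7, 10, 8],
    "VE2VF (ours, PTW)":  [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5, 9, 10, 9],
}
TRIALS_PER_TASK = 10

def overall_success(successes):
    """Pooled success rate in percent, rounded to one decimal place."""
    total_trials = TRIALS_PER_TASK * len(successes)
    return round(100.0 * sum(successes) / total_trials, 1)

overall = {method: overall_success(s) for method, s in results.items()}
for method, rate in overall.items():
    print(f"{method}: {rate}%")
```

For VE2VF this gives 133 successes in 140 trials, or 95.0%.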


VE2VF reaches 95.0% overall, with perfect scores under disturbance and strong OOD generalization. DSUB is the only exception (5/10 zero-shot) and is addressed by fine-tuning. HIL-SERL VPTW — our vision-enabled teacher — collapses entirely on OOD tasks, confirming that vision-based policies overfit to training appearance.

Input Modality Ablation


  • Pose is critical — sTW (no pose) performs worst despite a longer observation history.
  • All modalities together (sPTW) are needed for disturbance robustness — sP degrades significantly under pose noise.
  • Distillation is essential — training PTW from scratch for 75 min never matches the distilled student; the policy plateaus and overfits.
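
| Method | Train time | Train M Peg | Train Eth. | Train M Gear | Dist. M Peg | Dist. Eth. | Dist. M Gear |
|---|---|---|---|---|---|---|---|
| VPTW | 40 min | 10/10 | 10/10 | 10/10 | 0/10 | 10/10 | 8/10 |
| V | 40 min | 10/10 | 10/10 | 10/10 | 0/10 | 6/10 | 3/10 |
| PTW | 50 min | 7/10 | 6/10 | 9/10 | 5/10 | 4/10 | 8/10 |
| sPTW (ours) | 40+10 min | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 |
| sP | 40+10 min | 10/10 | 10/10 | 10/10 | 2/10 | 7/10 | 8/10 |
| sTW | 40+10 min | 4/10 | 7/10 | 10/10 | 4/10 | 7/10 | 10/10 |

Column groups: Train = training tasks, Dist. = disturbed tasks.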

Table III: Input modality ablation on training tasks under normal and disturbed conditions. Prefix s denotes a student policy distilled for 10 min from a VPTW teacher trained for 40 min. sPTW=pose+twist+wrench; sP=pose only; sTW=twist+wrench (8-step history).
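
The 8-step history used by the sTW variant can be sketched as a fixed-length buffer of recent twist and wrench readings. The class below is a hypothetical illustration; filling the buffer with the first reading at reset is an assumption, as the page does not say how the history is initialized.

```python
from collections import deque

import numpy as np

class HistoryStacker:
    """Keeps the last `horizon` frames and returns them as one flat vector."""

    def __init__(self, horizon=8):
        self.horizon = horizon
        self.frames = deque(maxlen=horizon)

    def reset(self, first_frame):
        # Assumed padding: repeat the first frame until the buffer is full.
        self.frames.clear()
        for _ in range(self.horizon):
            self.frames.append(np.asarray(first_frame, dtype=np.float32))
        return self.stacked()

    def push(self, frame):
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(np.asarray(frame, dtype=np.float32))
        return self.stacked()

    def stacked(self):
        return np.concatenate(list(self.frames))

stacker = HistoryStacker(horizon=8)
first = stacker.reset(np.zeros(12))  # 6-D twist + 6-D wrench per step
second = stacker.push(np.ones(12))   # newest frame sits at the end
```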


Fig. 6: Training progression of the vision-free policy PTW trained from scratch without distillation. Performance plateaus after 50 minutes; longer training leads to overfitting on the gear task.

Fine-tuning for Challenging Tasks


DSUB achieves only 50% zero-shot. Fine-tuning with distillation from a DSUB-specific teacher reaches 100% in 10 minutes. Fine-tuning without distillation drops to 0%, confirming that proprioception alone cannot guide RL exploration from scratch.

DSUB Insertion Success Rate

| Method | Success |
|---|---|
| Residual RL zero-shot | 6/10 |
| VE2VF zero-shot | 5/10 |
| Residual RL fine-tuned | 7/10 |
| VE2VF fine-tuned w/o distillation | 0/10 |
| VE2VF fine-tuned w/ distillation | 10/10 |

Table IV: DSUB insertion success rates before and after fine-tuning.

BibTeX

@article{ve2vf2025,
  title     = {{VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world
               Reinforcement Learning for Robust Contact-Rich Manipulation}},
  author    = {Kowalski, Victor and Li, Chengxi and Lee, Dongheui},
  journal   = {arXiv preprint},
  year      = {2026}
}