When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide
task-relevant information that accelerates learning beyond what proprioception alone can achieve.
However, vision-enabled policies tend to overfit to the visual conditions seen during training,
limiting their robustness and transferability. We present a human-in-the-loop RL framework that
employs teacher-student distillation to achieve robust performance across multiple task variants,
trained entirely in the real world without requiring domain randomization or data augmentation.
A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on
pose, twist (end-effector velocity), and wrench (force and torque) sensing,
thereby combining fast training with strong task generalization.
On the real-world NIST assembly benchmark board, our approach achieves
a 95% overall success rate after approximately 50 minutes of training on
3 representative tasks, including robust generalization to 8 unseen task variants.
Fine-tuning with distillation achieves full success on the most challenging task.
We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.
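
To make the teacher-student idea concrete, the following is a minimal sketch of the distillation step only, written as supervised regression of a vision-free student onto a teacher's actions. It is not the method described above: the linear models, the dimensions, the synthetic data, and every name (`PROPRIO_DIM`, `VISION_DIM`, `teacher_action`, `distill`) are assumptions made for illustration, and the human-in-the-loop RL training of the teacher is replaced by a fixed random stand-in.

```python
import random

# Hypothetical sizes: pose (3 position + 4 quaternion), twist (6), wrench (6).
PROPRIO_DIM = 7 + 6 + 6
VISION_DIM = 4   # stand-in for an image embedding seen only by the teacher
ACTION_DIM = 6


def dot(weights, x):
    """Apply a weight matrix (one row per action dimension) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def distill(proprio_batch, teacher_actions, lr=0.05, epochs=200):
    """Fit a linear vision-free student to teacher actions by gradient descent on MSE."""
    student = [[0.0] * PROPRIO_DIM for _ in range(ACTION_DIM)]
    n = len(proprio_batch)
    losses = []
    for _ in range(epochs):
        grad = [[0.0] * PROPRIO_DIM for _ in range(ACTION_DIM)]
        loss = 0.0
        for x, target in zip(proprio_batch, teacher_actions):
            pred = dot(student, x)
            for a in range(ACTION_DIM):
                err = pred[a] - target[a]
                loss += err * err / n
                for j in range(PROPRIO_DIM):
                    grad[a][j] += 2.0 * err * x[j] / n
        losses.append(loss)  # loss measured before this epoch's update
        for a in range(ACTION_DIM):
            for j in range(PROPRIO_DIM):
                student[a][j] -= lr * grad[a][j]
    return student, losses


rng = random.Random(0)

# Stand-in teacher: a fixed random linear policy over proprioception AND vision.
teacher_w = [[rng.gauss(0.0, 0.1) for _ in range(PROPRIO_DIM + VISION_DIM)]
             for _ in range(ACTION_DIM)]


def teacher_action(proprio, vision):
    return dot(teacher_w, proprio + vision)


# Synthetic "rollout" data: the student is only ever given the proprio part.
proprio_batch = [[rng.gauss(0.0, 1.0) for _ in range(PROPRIO_DIM)] for _ in range(64)]
vision_batch = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(64)]
teacher_actions = [teacher_action(p, v) for p, v in zip(proprio_batch, vision_batch)]

student, losses = distill(proprio_batch, teacher_actions)
print(f"imitation loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this toy setup the vision features are independent noise, so the student cannot match the teacher exactly and the loss plateaus above zero. On a real robot the camera and the pose, twist, and wrench signals observe the same scene, which is what makes a vision-free student plausible.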