Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer
vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to
determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast,
computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models
in robotics perform feature fusion inside the model itself, making it difficult to reuse the learned features across
different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision
struggle in scenarios with limited object visibility. In this work, we address these challenges by
proposing M2R2, a multimodal feature extractor tailored for TAS, which combines
information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy
that enables the reuse of learned features across multiple TAS models. Our method achieves new
state-of-the-art performance on three robotic datasets: REASSEMBLE, (Im)PerfectPour, and JIGSAWS.
Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities
to robotic TAS.