M2R2: MultiModal Robotic Representation for Temporal Action Segmentation

2026 IEEE International Conference on Robotics & Automation (ICRA)
¹Technische Universität Wien (TU Wien)   ²German Aerospace Center (DLR)

Abstract

Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy that enables the reuse of learned features across multiple TAS models. Our method sets a new state of the art on three robotic datasets: REASSEMBLE, (Im)PerfectPour, and JIGSAWS. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.

M2R2 Overview

Fig. 1: Many multimodal approaches follow an end-to-end paradigm (a). Modular approaches train the feature extractor and TAS model separately but rely on vision-only extractors (b). We propose the M2R2 feature extractor, which leverages multimodal information and integrates with diverse TAS models (c).

Contributions

  • Multimodal feature extractor — a deep-learning-based M2R2 feature extractor that fuses vision, audio, and proprioceptive data for robotic temporal action segmentation.
  • Novel training strategy — a modular training approach that decouples feature extraction from the TAS model, enabling reuse of learned features with any state-of-the-art TAS architecture (MSTCN, ASRF, DiffAct, …).
  • Extensive evaluation — state-of-the-art performance on three robotic datasets (REASSEMBLE, (Im)PerfectPour, JIGSAWS) and an ablation study quantifying the contribution of each sensor modality.

Method

M2R2 Feature Extractor Architecture


To compute the multimodal feature at time instant tᵢ, we first process each modality separately: image features are extracted with an ActionCLIP visual transformer; audio features are extracted from the log-mel spectrogram via the Audio Spectrogram Transformer (AST); proprioceptive signals (end-effector pose, twist, gripper width, force/torque) are projected into a shared embedding space and fused over the temporal dimension using a learnable embedding. All modality tokens are then fused with a single self-attention Transformer encoder layer followed by an MLP.
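Below is a minimal PyTorch sketch of this fusion step. The embedding dimensions, module names, and the mean-pooling readout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Project each modality token into the shared embedding space.
        self.proj_vision = nn.Linear(512, d_model)    # e.g. ActionCLIP ViT feature
        self.proj_audio = nn.Linear(768, d_model)     # e.g. AST feature of the log-mel spectrogram
        self.proj_proprio = nn.Linear(256, d_model)   # temporally fused proprioceptive token
        # Single self-attention Transformer encoder layer followed by an MLP.
        self.encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                                  batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, vis, aud, prop):
        # vis: (B, 512), aud: (B, 768), prop: (B, 256) -- one token per modality at time t_i
        tokens = torch.stack([self.proj_vision(vis),
                              self.proj_audio(aud),
                              self.proj_proprio(prop)], dim=1)   # (B, 3, d_model)
        fused = self.encoder(tokens)        # self-attention across the modality tokens
        return self.mlp(fused.mean(dim=1))  # per-frame M2R2 feature, (B, d_model)
```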



Fig. 2: M2R2 Feature Extractor Architecture.

Training Strategy


We train the M2R2 feature extractor using two complementary objectives. First, we sample windows of three consecutive actions and maximize the cosine similarity between the window's average feature embedding and a CLIP text embedding of a sentence describing the action sequence — encouraging the model to capture action order. Second, a Boundary Regression Network predicts the probability of each frame being an action boundary, supervised with a smoothed ground-truth boundary sequence. The total loss is the sum of the action-contrastive loss and the boundary MSE loss.
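The sketch below illustrates the combined objective under the simplifying assumption of an unweighted sum; the exact contrastive formulation, boundary smoothing, and loss weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def m2r2_loss(frame_feats, text_emb, boundary_logits, boundary_gt):
    """frame_feats:     (B, T, D) M2R2 features for a window of three consecutive actions
    text_emb:        (B, D)    CLIP text embedding of the sentence describing the action sequence
    boundary_logits: (B, T)    Boundary Regression Network output
    boundary_gt:     (B, T)    smoothed ground-truth boundary probabilities"""
    # Action-order objective: align the window-averaged feature with the text embedding.
    window_feat = frame_feats.mean(dim=1)                     # (B, D)
    cos = F.cosine_similarity(window_feat, text_emb, dim=-1)  # (B,)
    loss_action = (1.0 - cos).mean()
    # Boundary objective: per-frame regression of boundary probabilities.
    loss_boundary = F.mse_loss(torch.sigmoid(boundary_logits), boundary_gt)
    return loss_action + loss_boundary
```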

After training, the Fusion Transformer, Boundary Regression Network, and CLIP text encoder are discarded. The resulting per-frame M2R2 features are used to train any downstream TAS model.
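As a hypothetical illustration of this modular reuse, precomputed per-frame M2R2 features can be fed directly to any frame-wise TAS head; the linear stand-in head and tensor shapes below are placeholders, not part of the method.

```python
import torch
import torch.nn as nn

# Stand-in frame-wise head; in practice this would be MS-TCN, ASRF, or DiffAct
# trained on the saved M2R2 features (feature dimension and class count are placeholders).
tas_head = nn.Conv1d(in_channels=512, out_channels=24, kernel_size=1)

m2r2_feats = torch.randn(1, 512, 3000)   # (batch, feature dim, frames), e.g. loaded from disk
frame_logits = tas_head(m2r2_feats)      # (batch, number of action classes, frames)
```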

Fig. 3: M2R2 training objectives.

Experiments

REASSEMBLE Dataset

M2R2 features achieve state-of-the-art performance on the REASSEMBLE dataset across all TAS models. M2R2+ASRF attains an F1@50 of 82.4% on fine-grain labels and 97.1% on coarse labels, outperforming the best vision-only baseline by nearly 74 percentage points. Unsupervised proprioception-only baselines (BOCPD, AWE) are surpassed by at least 46.6 pp in F1@50.

| Method | Modalities | Type | Fine F1@10 | Fine F1@25 | Fine F1@50 | Fine Acc | Fine EDIT | Coarse F1@10 | Coarse F1@25 | Coarse F1@50 | Coarse Acc | Coarse EDIT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BOCPD | FT | E2E | 39.2 | 32.6 | 12.8 | 35.0 | 19.7 | – | – | – | – | – |
| AWE | P | E2E | 72.5 | 68.4 | 35.8 | 54.6 | 54.9 | – | – | – | – | – |
| BrP+MSTCN | V | Modular | 12.9 | 10.9 | 8.4 | 7.9 | 18.4 | 32.1 | 28.2 | 20.0 | 65.0 | 34.0 |
| BrP+ASRF | V | Modular | 12.8 | 11.1 | 8.6 | 6.4 | 19.4 | 57.6 | 44.5 | 23.3 | 54.3 | 67.5 |
| BrP+DiffAct | V | Modular | 12.0 | 10.1 | 5.1 | 5.7 | 21.4 | 62.9 | 51.8 | 22.5 | 54.8 | 73.4 |
| M2R2+MSTCN | V,A,P,FT,G | Modular | 83.1 | 82.7 | 80.8 | 82.4 | 79.3 | 97.7 | 97.6 | 96.7 | 96.7 | 95.0 |
| M2R2+ASRF | V,A,P,FT,G | Modular | 83.5 | 83.4 | 82.4 | 82.7 | 82.5 | 98.1 | 98.1 | 97.1 | 96.0 | 96.9 |
| M2R2+DiffAct | V,A,P,FT,G | Modular | 78.1 | 77.7 | 74.6 | 74.9 | 68.7 | 95.1 | 95.0 | 92.1 | 91.7 | 90.3 |

Table I: Results on REASSEMBLE. V – vision, A – audio, P – pose, FT – force/torque, G – gripper width.

(Im)PerfectPour Dataset — Generalization


We evaluate generalization by applying the M2R2 extractor trained on REASSEMBLE to the (Im)PerfectPour bartending dataset. M2R2+DiffAct achieves F1@50 of 89.7%, outperforming the best vision-only baseline by 16.1 percentage points. These results demonstrate that when available modalities are similar across datasets, M2R2 features transfer well to new task domains without retraining.
| Method | Modalities | F1@10 | F1@25 | F1@50 | Acc | EDIT |
|---|---|---|---|---|---|---|
| BrP+MSTCN | V | 81.0 | 73.9 | 63.6 | 76.8 | 88.9 |
| BrP+ASRF | V | 84.1 | 82.1 | 73.6 | 77.7 | 91.2 |
| BrP+DiffAct | V | 81.6 | 79.9 | 71.8 | 74.4 | 83.1 |
| M2R2+MSTCN | V,P,FT,G | 92.6 | 89.8 | 84.6 | 93.9 | 88.9 |
| M2R2+ASRF | V,P,FT,G | 89.8 | 88.5 | 86.0 | 93.7 | 87.9 |
| M2R2+DiffAct | V,P,FT,G | 90.5 | 89.7 | 89.7 | 93.3 | 83.0 |

Table II: Results on (Im)PerfectPour. M2R2 features trained on REASSEMBLE.

JIGSAWS Dataset — Cross-Embodiment

| Method | Modalities | Type | F1@10 | F1@25 | F1@50 | Acc | EDIT |
|---|---|---|---|---|---|---|---|
| Lea et al. | V | E2E | n/r | n/r | n/r | 74.2 | 66.5 |
| Weerasinghe et al. | V,P,G | E2E | 87.3 | 86.5 | 81.1 | 87.1 | 83.9 |
| Atoum et al. | V,P | E2E | n/r | n/r | n/r | 90.3 | 89.0 |
| BrP+MSTCN | V | Modular | 53.2 | 41.7 | 23.0 | 38.7 | 76.8 |
| BrP+ASRF | V | Modular | 53.4 | 43.2 | 24.5 | 39.3 | 73.6 |
| BrP+DiffAct | V | Modular | 59.6 | 49.6 | 30.2 | 44.5 | 81.9 |
| M2R2+MSTCN | V,P,G | Modular | 93.3 | 91.6 | 86.3 | 87.0 | 90.4 |
| M2R2+ASRF | V,P,G | Modular | 93.1 | 91.2 | 86.0 | 85.6 | 90.7 |
| M2R2+DiffAct | V,P,G | Modular | 94.3 | 93.0 | 89.4 | 87.1 | 91.7 |

Table III: Results on JIGSAWS (Suturing, Leave-One-User-Out). n/r – not reported.


On the JIGSAWS surgical gesture recognition dataset, M2R2+DiffAct achieves an F1@50 of 89.4% and an EDIT score of 91.7%, surpassing the prior state of the art by 2.7 points in EDIT. Unlike previous approaches that rely on additional binary gripper signals or invariant proprioceptive representations, M2R2 directly processes raw sensory readings. This demonstrates that M2R2 generalizes across different embodiments and task domains.

Qualitative Results

TAS Performance — REASSEMBLE, (Im)PerfectPour & JIGSAWS


REASSEMBLE — M2R2 features are less sensitive to high-frequency proprioceptive changes that cause heuristic approaches to over-segment, and more precise in differentiating visually similar objects (e.g., round vs. square pegs) compared to vision-only approaches. The segmentation bar shows ground truth (top) and predicted labels (bottom).


Fig. 4: Qualitative evaluation. M2R2 avoids over-segmentation from heuristics (left, middle) and differentiates objects better than vision-only approaches (right).



(Im)PerfectPour — M2R2 features trained on REASSEMBLE generalize to a new task domain (bartending) without retraining, achieving F1@50 of 89.7% and outperforming vision-only baselines by 16.1 pp.


JIGSAWS — M2R2 generalizes to a surgical robotics setting with a different embodiment (two-arm da Vinci). Using only raw kinematic signals (pose, twist, and gripper width) alongside vision, M2R2+DiffAct achieves a state-of-the-art F1@50 of 89.4% on the Suturing task.


Fig. 5: Coarse vs. fine-grain prediction on REASSEMBLE. Large and medium gears are occasionally confused at fine-grain level, but coarse-level segmentation remains correct.

Ablation Study

Modality Contribution


We ablate the contribution of each sensor modality on the REASSEMBLE dataset using DiffAct. Key findings:
  • Vision alone performs worst (F1@50: 21.6%), as many objects are poorly visible due to their small size.
  • Audio alone achieves a high detection rate (80.1%) — action transitions are marked by distinct sounds — but struggles to identify objects.
  • Proprioception alone performs comparably to all modalities combined (F1@50: 74.5%), reflecting the specificity of the REASSEMBLE dataset.
  • Gripper width is the most informative proprioceptive signal — objects can often be distinguished by size.
  • All modalities combined yield the best detection rate (82.4%), showing that multimodal fusion improves boundary precision.
| Configuration | F1@10 | F1@25 | F1@50 | Acc | EDIT | DR |
|---|---|---|---|---|---|---|
| Only vision | 27.2 | 26.1 | 21.6 | 21.4 | 28.9 | 44.3 |
| Only audio | 39.2 | 38.7 | 36.4 | 34.5 | 37.5 | 80.1 |
| Only proprio | 78.0 | 77.5 | 74.5 | 75.4 | 69.2 | 80.3 |
| No gripper | 67.9 | 67.2 | 64.2 | 65.1 | 59.2 | 82.2 |
| No pose/vel | 69.9 | 69.1 | 65.1 | 64.6 | 61.8 | 80.5 |
| No F/T | 72.9 | 72.4 | 69.9 | 70.4 | 65.2 | 80.6 |
| All modalities | 78.1 | 77.7 | 74.6 | 74.9 | 68.7 | 82.4 |

Table IV: Modality ablation on REASSEMBLE with DiffAct (fine-grain labels). DR = Detection Rate.


Note: These ablation results are specific to the REASSEMBLE dataset, where objects are small and often identifiable by size or force profile. The relative importance of each modality is likely to differ in tasks with higher visual variability (e.g., meal preparation, where vision may be essential to distinguish similar-looking objects). Further investigation across more diverse datasets is needed to draw general conclusions about modality contributions in robotic TAS.


(a) Some objects (e.g., waterproof connector) can be identified from audio alone.


(b) Removing gripper information makes object size-based discrimination harder.

Fig. 6: Qualitative predictions for different modality combinations.

BibTeX

@inproceedings{sliwowski2025m2r2,
  title     = {{M2R2: MultiModal Robotic Representation for Temporal Action Segmentation}},
  author    = {Sliwowski, Daniel and Lee, Dongheui},
  booktitle = {2026 IEEE International Conference on Robotics \& Automation (ICRA)},
  year      = {2026}
}