ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation

2026 IEEE International Conference on Advanced Robotics and its Social Impact (ARSO)
Sergej Stanovcic1,*, Daniel Sliwowski1,*, Dongheui Lee1,2
*equal contributions
1Technische Universität Wien (TU Wien)   2German Aerospace Center (DLR)
ATLAS Overview

Fig. 1: ATLAS is a tool for long-horizon robotic action segmentation, supporting time-synchronized visualization of multi-modal data. It handles common robotics dataset formats and can be extended to new ones. The tool allows annotation of action boundaries, classes, and outcomes (success or failure).

Abstract

Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and manipulation policy learning methods. Existing annotation tools, however, are often limited: they are designed primarily for vision-only data, do not natively support synchronized visualization of robot-specific time-series signals (e.g., gripper state or force/torque), or require substantial effort to adapt to different dataset formats. In this paper, we introduce ATLAS, an annotation tool tailored for long-horizon robotic action segmentation. ATLAS provides time-synchronized visualization of multi-modal robotic data, including multi-view video and proprioceptive signals, and supports annotation of action boundaries, action labels, and task outcomes. The tool natively handles widely used robotics dataset formats such as ROS bags and the Reinforcement Learning Datasets (RLDS) format, and provides direct support for specific datasets such as REASSEMBLE. ATLAS can be easily extended to new formats via a modular dataset abstraction layer. Its keyboard-centric interface minimizes annotation effort and improves efficiency. In experiments on a contact-rich assembly task, ATLAS reduced the average per-action annotation time by at least 6% compared to ELAN, while the inclusion of time-series data improved temporal alignment with expert annotations by more than 2.8% and decreased boundary error fivefold compared to vision-only annotation tools.

Contributions

  • Synchronized multi-modal visualization — time-synchronized display of multi-view RGB video streams and robot-specific time-series signals (end-effector pose, gripper state, force/torque) for precise boundary identification.
  • Flexible dataset support — native handling of raw videos, frame-based datasets, ROS bag files (ROS1 & ROS2), the RLDS format (covering 71 datasets and 2.4M+ episodes in the Open X-Embodiment repository), and the REASSEMBLE dataset — no conversion required.
  • Extensible dataset abstraction — a template-method design pattern decouples the visualization and annotation frontend from the underlying data format, allowing new datasets to be integrated with minimal effort.
  • Keyboard-centric annotation interface — customizable shortcuts reduce mouse interaction overhead, achieving the lowest average per-action annotation time among compared tools (11.0 ± 1.2 s for vision-only, 18.5 ± 0.9 s with time-series data).
  • Action segmentation and outcome annotation — support for marking action boundaries, assigning semantic action labels, and recording task outcomes (success or failure).

Tool Overview

Interface & Annotation Workflow


The ATLAS interface is organized into five vertically stacked sections:
  1. Multi-view camera panel — simultaneous display of all configured camera streams.
  2. Data Selector — scrollable menu to choose which time-series signals to visualize.
  3. Time-series plots — interactive PyQtGraph panels with zoom and axis adjustment.
  4. Timeline & Annotation Panel — scrub through the episode; view and edit all action segments, their labels, and success/failure flags.
  5. Control buttons — keyboard shortcut–driven controls for starting/ending segments and navigating the episode at two configurable speeds.

Annotation begins by pressing the Start Action key; after navigating to the end of the action, pressing the same key again opens the label dialog. The annotated segment is immediately added to the annotation panel.
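
Under the hood, the frontend is PyQt5 with PyQtGraph plots. The sketch below is illustrative only (not the ATLAS source) and shows the essence of this workflow: a movable time cursor over a signal plot and a single keyboard shortcut that both starts and ends a segment.

import sys

import numpy as np
import pyqtgraph as pg
from PyQt5.QtGui import QKeySequence
from PyQt5.QtWidgets import QApplication, QMainWindow, QShortcut


class AnnotationWindow(QMainWindow):
    def __init__(self, t, signal):
        super().__init__()
        self.plot = pg.PlotWidget(title="force/torque (z)")
        self.plot.plot(t, signal)
        # Vertical cursor marking the current timestamp; dragging it scrubs through the episode.
        self.cursor = pg.InfiniteLine(pos=float(t[0]), angle=90, movable=True)
        self.plot.addItem(self.cursor)
        self.setCentralWidget(self.plot)

        self.segment_start = None   # pending "Start Action" timestamp
        self.segments = []          # finished (t_start, t_end) pairs
        # The same key starts and ends a segment, mirroring the workflow described above.
        shortcut = QShortcut(QKeySequence("S"), self)
        shortcut.activated.connect(self.toggle_segment)

    def toggle_segment(self):
        t_now = float(self.cursor.value())
        if self.segment_start is None:
            self.segment_start = t_now
        else:
            self.segments.append((self.segment_start, t_now))
            # Shade the finished segment; a full tool would open the label dialog here.
            self.plot.addItem(pg.LinearRegionItem(values=(self.segment_start, t_now), movable=False))
            self.segment_start = None


if __name__ == "__main__":
    app = QApplication(sys.argv)
    t = np.linspace(0.0, 10.0, 500)
    window = AnnotationWindow(t, np.sin(2 * np.pi * 0.5 * t))
    window.show()
    sys.exit(app.exec_())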
ATLAS Interface

Action Label Dialog & Code Architecture

Action Label Dialog

Action dialog: select a label by pressing its number and toggle the success flag before confirming.

ATLAS Code Structure

Code structure: datasets/ houses format-specific handlers inheriting from DatasetBase; gui.py implements the PyQt5 frontend.

Supported Dataset Formats

ATLAS uses a template method design pattern: an abstract DatasetBase class defines a common API for episode retrieval and I/O. Specialized subclasses implement it for each format. New formats can be added by creating a single Python file in the datasets/ folder.
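
As a concrete illustration of this pattern (a minimal sketch: apart from DatasetBase and the datasets/ folder, the class, method, and field names below are assumptions, not the actual ATLAS API), a new handler only implements the format-specific steps while the shared workflow stays in the base class.

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class Episode:
    # Hypothetical container for one demonstration episode.
    videos: Dict[str, np.ndarray] = field(default_factory=dict)      # camera name -> (T, H, W, 3)
    signals: Dict[str, np.ndarray] = field(default_factory=dict)     # signal name -> (T, D)
    timestamps: Dict[str, np.ndarray] = field(default_factory=dict)  # stream name -> (T,)


class DatasetBase(ABC):
    """Template method: get_episode() fixes the workflow, subclasses fill in the steps."""

    def get_episode(self, index: int) -> Episode:
        raw = self._read_episode(index)   # format-specific I/O
        return self._to_episode(raw)      # format-specific conversion to the common container

    @abstractmethod
    def episode_count(self) -> int: ...

    @abstractmethod
    def _read_episode(self, index: int): ...

    @abstractmethod
    def _to_episode(self, raw) -> Episode: ...


class NpzDataset(DatasetBase):
    """Hypothetical handler: one .npz file per episode, added as a single module in datasets/."""

    def __init__(self, paths: List[str]):
        self.paths = paths

    def episode_count(self) -> int:
        return len(self.paths)

    def _read_episode(self, index: int):
        return np.load(self.paths[index])

    def _to_episode(self, raw) -> Episode:
        return Episode(
            signals={"force_torque": raw["force_torque"]},
            timestamps={"force_torque": raw["time"]},
        )

Because the frontend only ever calls the DatasetBase API, adding a format in this way does not touch gui.py.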

Format | Time-series Visualization | Temporal Segmentation Annotation | Notes
Raw Videos (.mp4, .avi, .mkv) | ✗ | ✓ | Auto-detects single-file, flat-directory, and multi-camera layouts
Frame-based datasets | ✗ | ✓ | Same layout detection as the video handler
ROS Bags (ROS1 & ROS2) | ✓ | ✓ | No ROS installation required; parses JointState, WrenchStamped, PoseStamped, …
RLDS (TF Datasets) | ✓ | ✓ | Covers all 71 datasets / 2.4M episodes in Open X-Embodiment
REASSEMBLE (HDF5) | ✓ | ✓ | Background ring-buffer pre-loading for low I/O latency
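
As the table notes, ROS bags are parsed without a ROS installation. Whether ATLAS relies on it internally is not stated here, but the pip-installable rosbags package is one common way to read both ROS1 and ROS2 bags in plain Python; a minimal sketch (the bag path and message type are placeholders):

from pathlib import Path

from rosbags.highlevel import AnyReader

with AnyReader([Path("demo_episode.bag")]) as reader:
    # Select connections carrying wrench (force/torque) messages.
    conns = [c for c in reader.connections
             if c.msgtype == "geometry_msgs/msg/WrenchStamped"]
    for connection, timestamp, rawdata in reader.messages(connections=conns):
        msg = reader.deserialize(rawdata, connection.msgtype)
        print(timestamp, msg.wrench.force.z)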

Comparison with Existing Tools

Tool | Domain | Time-series Visualization | Temporal Segmentation Annotation | Supported Formats
CVAT | Computer Vision | | | Frames, Videos
Foxglove Studio | Robotics | | | MCAP, ROS Bags
ROSAnnotator | Robotics | | | ROS Bags
Anvil | Behavioral | | | Videos, Motion Capture
ELAN | Linguistics | | | Videos, CSV files
ATLAS (ours) | Robotics | ✓ | ✓ | Frames, Videos, ROS Bags, RLDS, REASSEMBLE

Authors provide modification guidelines, but this functionality is not available by default.

Experiments

Annotation Efficiency & Quality

We recorded four long-horizon demonstrations of a gear assembly task (NIST assembly task board) with three camera views and proprioceptive data. Twelve annotators were each assigned to one of four conditions: ROSAnnotator, ELAN, ATLAS (vision-only), or ATLAS (vision + time-series). Expert annotations from two specialists (inter-annotator agreement 99.6%, boundary distance 0.033 s) served as ground truth.
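
For concreteness, one plausible reading of the two quality metrics (the paper's exact definitions are not reproduced here, so treat this as an assumption): the alignment score as per-timestep label agreement with the expert annotation, and the boundary distance as the mean gap between each annotated boundary and its nearest expert boundary.

import numpy as np


def alignment_score(pred_labels: np.ndarray, expert_labels: np.ndarray) -> float:
    """Per-timestep label agreement, in percent (both arrays sampled on a common time grid)."""
    return 100.0 * float(np.mean(pred_labels == expert_labels))


def mean_boundary_distance(pred_bounds: np.ndarray, expert_bounds: np.ndarray) -> float:
    """Average absolute distance (seconds) from each annotated boundary to the closest expert boundary."""
    dists = [np.min(np.abs(expert_bounds - b)) for b in pred_bounds]
    return float(np.mean(dists))


# Toy usage with hypothetical boundaries (seconds):
expert = np.array([2.0, 5.5, 9.1])
annotator = np.array([2.1, 5.4, 9.4])
print(mean_boundary_distance(annotator, expert))   # -> ~0.167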

Condition | Avg. per-action time (s) | Alignment score (%) | Avg. boundary distance (s)
ROSAnnotator | 26.9 ± 6.1 | 85.1 ± 2.3 | 1.59 ± 0.25
ELAN | 19.7 ± 1.1 | 96.7 ± 0.8 | 0.35 ± 0.08
ATLAS (vision-only) | 11.0 ± 1.2 | 97.0 ± 1.4 | 0.31 ± 0.15
ATLAS (vision + time-series) | 18.5 ± 0.9 | 99.4 ± 0.1 | 0.06 ± 0.01

Table I: Quantitative comparison of annotation performance across tools. Bold = best result per metric.


Key takeaways:
  • ATLAS (vision-only) is the fastest at 11.0 ± 1.2 s/action — 44% faster than ELAN and 59% faster than ROSAnnotator.
  • Adding time-series data increases annotation time slightly (18.5 s), but yields a fivefold reduction in boundary error (0.06 s vs. 0.31 s) and near-expert alignment (99.4%).
  • ROSAnnotator's coarse whole-second timeline limits boundary precision, resulting in the lowest alignment score (85.1%).

BibTeX

@inproceedings{stanovcic2026atlas,
  title     = {{ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation}},
  author    = {Stanovcic, Sergej and Sliwowski, Daniel and Lee, Dongheui},
  booktitle = {2026 IEEE International Conference on Advanced Robotics and its Social Impact (ARSO)},
  year      = {2026}
}