REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

Preprint

Abstract

Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.


Data Visualization


We develop a visualization tool based on the ReRun viewer to display synchronized data from the REASSEMBLE dataset. To improve visualization speed, we downsample the recorded data both spatially (external and wrist camera images by a factor of 9, event camera videos by a factor of 4) and temporally (audio by a factor of 2000, proprioceptive data by a factor of 200). Currently, we showcase two samples from the dataset, with plans to expand the selection upon the full dataset’s release. Click on the gifs to open the visualization app!
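For readers who want to reproduce this kind of viewer, below is a minimal sketch of the downsample-then-log pattern, assuming the rerun-sdk Python package (0.13+) and a trial already loaded as NumPy arrays; the entity paths, array names, and strides are illustrative, not the actual tool's code (a factor of 9 in pixel count corresponds to a stride of 3 per axis).

```python
# Minimal sketch, not the released visualization tool. Assumes rerun-sdk
# (>= 0.13) and that rgb_frames, audio, joint_pos, and their timestamp
# arrays have already been read from a trial's H5 file.
import numpy as np
import rerun as rr

rr.init("reassemble_viewer", spawn=True)

def log_downsampled(rgb_frames, rgb_ts, audio, audio_ts, joint_pos, joint_ts):
    # Spatial downsampling: keep every 3rd pixel per axis (~9x fewer pixels).
    for t, frame in zip(rgb_ts, rgb_frames):
        rr.set_time_seconds("time", t)
        rr.log("camera/wrist", rr.Image(frame[::3, ::3]))
    # Temporal downsampling: audio by 2000x, proprioception by 200x.
    for t, sample in zip(audio_ts[::2000], audio[::2000]):
        rr.set_time_seconds("time", t)
        rr.log("audio/gripper_mic", rr.Scalar(float(sample)))
    for t, q in zip(joint_ts[::200], joint_pos[::200]):
        rr.set_time_seconds("time", t)
        for i, qi in enumerate(q):
            rr.log(f"robot/joint_{i}", rr.Scalar(float(qi)))
```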

Experiments

Temporal Action Segmentation


For benchmarking purposes, we evaluate the performance of a state-of-the-art visual TAS model, DiffAct, using the default hyperparameter settings provided for the 50Salads dataset. On the REASSEMBLE dataset, DiffAct achieves 61.5% Accuracy, 47.8% EDIT, 63.3% F1@10, 58.4% F1@25, and 44.1% F1@50. The figure illustrates the TAS performance. In red, we highlight instances where the "Pick" action was not predicted by DiffAct. In blue, we mark areas where similar objects were confused; in this case, DiffAct confused "round peg 1" with "square peg 1" and "square peg 2."
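For context on the metrics, here is a hedged sketch of the segmental F1@k score in the style of the standard TAS evaluation protocol, computed from per-frame label sequences; it is illustrative only, not the exact evaluation code behind the numbers above.

```python
# Sketch of segmental F1@k: predicted segments count as true positives when
# they overlap a same-class ground-truth segment with IoU >= k. The "Idle"
# background label name is an assumption.
import numpy as np

def segments(labels):
    """Collapse a per-frame label sequence into (label, start, end) segments."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def f1_at_k(pred, gt, iou_thresh=0.10, bg_label="Idle"):
    pred_segs = [s for s in segments(pred) if s[0] != bg_label]
    gt_segs = [s for s in segments(gt) if s[0] != bg_label]
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(gt_segs):
            if gl != label or matched[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_iou >= iou_thresh:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = len(gt_segs) - tp
    prec, rec = tp / max(tp + fp, 1), tp / max(tp + fn, 1)
    return 2 * prec * rec / max(prec + rec, 1e-8)
```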

Motion Policy Learning


Using demonstrations from the REASSEMBLE dataset, we train simple motion policies (Dynamic Movement Primitives, DMPs) to complete the gear assembly and disassembly tasks. Each action in the assembly and disassembly was executed 10 times to assess success and failure modes. "Pick" succeeded in 8 trials, with failures due to gear slippage. "Insert" had the lowest success rate, with 7 successful trials, mainly failing when the gear remained misaligned after the spiral search. "Remove" succeeded in 8 trials, with failures caused by the gripper grasping the plate instead of the gear due to tracking errors. "Place" succeeded in all trials.
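As a rough illustration of the policy class, here is a minimal one-dimensional discrete DMP in the standard Ijspeert-style formulation; the gains, basis count, and per-axis decomposition are our assumptions, not necessarily the exact variant trained on REASSEMBLE.

```python
# Minimal sketch of a discrete DMP for one Cartesian axis. Not the paper's
# exact formulation or gains; demonstrations are assumed uniformly sampled.
import numpy as np

class DMP1D:
    def __init__(self, n_basis=25, alpha=25.0, beta=6.25, alpha_x=3.0):
        self.alpha, self.beta, self.alpha_x = alpha, beta, alpha_x
        self.c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))      # centers
        self.h = 1.0 / np.diff(self.c, append=self.c[-1] * 0.9)**2  # widths
        self.w = np.zeros(n_basis)

    def _forcing(self, x, y0, g):
        psi = np.exp(-self.h * (x - self.c) ** 2)
        return (psi @ self.w) / (psi.sum() + 1e-10) * x * (g - y0)

    def fit(self, y, dt):
        """Fit forcing-term weights to one demonstration y sampled at dt."""
        tau = len(y) * dt
        yd = np.gradient(y, dt)
        ydd = np.gradient(yd, dt)
        x = np.exp(-self.alpha_x * np.arange(len(y)) * dt / tau)
        f_target = tau**2 * ydd - self.alpha * (self.beta * (y[-1] - y) - tau * yd)
        psi = np.exp(-self.h * (x[:, None] - self.c) ** 2)          # (T, n)
        xi = x * (y[-1] - y[0])
        self.w = ((psi * xi[:, None]).T @ f_target
                  / ((psi * (xi**2)[:, None]).sum(0) + 1e-10))
        self.y0, self.g, self.tau = y[0], y[-1], tau

    def rollout(self, dt, goal=None):
        """Integrate the DMP, optionally toward a new goal (e.g. a tracked gear)."""
        g = self.g if goal is None else goal
        y, yd, x, out = self.y0, 0.0, 1.0, []
        for _ in range(int(self.tau / dt)):
            ydd = (self.alpha * (self.beta * (g - y) - self.tau * yd)
                   + self._forcing(x, self.y0, g)) / self.tau**2
            yd += ydd * dt
            y += yd * dt
            x += -self.alpha_x * x / self.tau * dt
            out.append(y)
        return np.array(out)
```

Because the goal is a free parameter at rollout time, the same learned weights can drive the motion toward a newly tracked object pose, which is the usual reason DMPs are chosen for tasks like gear insertion.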

Execution Monitoring



Task execution can fail due to policy generalization issues, perception errors, controller limitations, or human interruptions, making error detection and recovery essential. To demonstrate anomaly detection using the REASSEMBLE dataset, we develop an execution monitoring pipeline based on our previous ConditionNET work. Its performance can be seen in the video on the left.
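As a loose illustration only (not the ConditionNET architecture itself), a monitor of this kind typically reduces to thresholding a per-frame probability that the action's preconditions and effects are satisfied; the debouncing rule below is an assumption made for the sketch.

```python
# Hypothetical post-processing for an execution monitor: flag a failure when
# the per-frame "condition satisfied" probability stays low for k consecutive
# frames. Threshold and k are illustrative, not values from the paper.
import numpy as np

def detect_failure(p_satisfied, threshold=0.5, k=5):
    """Return the first frame index where probability stays below
    `threshold` for `k` consecutive frames, or None if no failure."""
    run = 0
    for i, low in enumerate(p_satisfied < threshold):
        run = run + 1 if low else 0
        if run >= k:
            return i - k + 1  # start of the anomalous run
    return None

# A dip shorter than k frames is ignored; a sustained drop triggers.
p = np.concatenate([np.full(50, 0.9), np.full(3, 0.2),
                    np.full(47, 0.9), np.full(20, 0.1)])
print(detect_failure(p))  # -> 100
```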

Data Collection Setup


We collect a comprehensive range of sensory data for task demonstrations, including multi-view RGB video from two external HAMA C-600 Pro webcams and a wrist-mounted Intel RealSense D435i, as well as proprioceptive data such as joint positions, velocities, end-effector position, and gripper width. Audio is captured by three microphones, including an OSA K1T wireless mic on the robot's gripper. Interaction forces and torques are measured using a wrist-mounted 6-axis AIDIN ROBOTICS AFT200-D80-C force-torque sensor. The dataset also features an event camera, providing high-speed, low-latency motion information. For precise camera localization, we use a motion capture system with custom 3D-printed brackets and reflective markers on both the cameras and the robot's base.
Data collection setup

Dataset Statistics

Action Distribution


The REASSEMBLE dataset includes 68 action-object pairs, or 69 with the "Idle" action, covering four actions and 17 objects. To ensure balanced performance, it contains a relatively equal number of demonstrations for each action, with a minimum of 55 for "Remove square peg 2" and a maximum of 86 for "Pick USB" and "Insert BNC." This distribution reduces performance bias toward more frequently occurring actions and supports diverse, balanced learning for downstream models.

Action Demonstration Success Rate


The REASSEMBLE dataset contains demonstrations and labels for both successful and unsuccessful action executions. This allows learning success detectors and implementing execution monitoring pipelines. The number of failed demonstrations per action reflects task difficulty, with complex motions leading to more failures. The "Insert" action has the highest failure rate due to its multi-step process, especially for BNC and bolt 4, which require precise alignment and rotation. Ethernet and USB insertions also fail frequently due to their directional plugs and edge placement on the task board. "Pick" failures occur when the gripper misses or drops the object, while "Remove" failures result from misalignment causing objects to jam. In contrast, "Place" has the fewest failures, though slips leading to obstruction are classified as failures.

Interaction Point Diversity



Increasing the diversity in the interaction points of actions within a dataset enhances generalization and performance on downstream tasks. To achieve high diversity in our data, we instructed the operator to randomize the board and object poses for each trial during data collection. The approximate interaction point of all 4,551 demonstrations was determined by sampling the final end-effector position of each trial. The robot starts at the origin, facing the positive x-axis, with most objects and the board placed in front of the robot within its workspace.
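A hedged sketch of this sampling step, assuming h5py access to the released files; the key name (ee_pose), pose layout, and file paths are placeholders to be checked against the actual schema.

```python
# Illustrative only: collect the final end-effector position of every trial
# as an approximate interaction point. Key names and layout are assumptions.
import glob
import h5py
import numpy as np

points = []
for path in glob.glob("reassemble/*.h5"):
    with h5py.File(path, "r") as f:
        ee = f["ee_pose"][:]           # assumed (T, 7): [x y z qx qy qz qw]
        points.append(ee[-1, :3])      # final Cartesian position of the trial
points = np.stack(points)              # (n_trials, 3), robot base at origin
print(points.mean(0), points.std(0))   # rough spread of interaction points
```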

Patterns in Force & Torque Measurements


To identify patterns in force and torque profiles, we normalize each demonstration by its duration, converting it to a "progress domain" where 100% indicates completion. We then resample all demonstrations to 500 samples using linear interpolation and plot the mean force and torque with standard deviations for each action. This allows us to visually analyze trends.
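A minimal sketch of this progress-domain resampling in NumPy; `demos` is an illustrative list of (T_i, 6) force-torque arrays, one per demonstration.

```python
# Sketch of the procedure described above: map each demo to a 0-100% progress
# axis, resample to 500 points by linear interpolation, then aggregate.
import numpy as np

def to_progress_domain(signal, n=500):
    progress_src = np.linspace(0.0, 100.0, len(signal))
    progress_dst = np.linspace(0.0, 100.0, n)
    # Interpolate each of the 6 F/T channels independently.
    return np.stack([np.interp(progress_dst, progress_src, signal[:, c])
                     for c in range(signal.shape[1])], axis=1)

def action_profile(demos, n=500):
    resampled = np.stack([to_progress_domain(d, n) for d in demos])  # (N, n, 6)
    return resampled.mean(0), resampled.std(0)  # per-channel mean and std
```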

Dataset Structure


Each trial is stored in a separate H5 file. We preserve the native measurement frequency of each sensor, which allows users to perform task-specific synchronization and sampling. To facilitate easy alignment of all messages, we save the timestamps of each sensor reading.
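As an example of the task-specific synchronization this enables, below is a nearest-timestamp alignment sketch; the H5 key names, file name, and the 100 Hz common clock are assumptions, not the actual schema.

```python
# Illustrative synchronization: resample two streams onto a common clock by
# nearest-timestamp lookup. Dataset paths ("ft/wrench", etc.) are placeholders.
import h5py
import numpy as np

def nearest_indices(src_ts, query_ts):
    """For each query time, index of the closest source timestamp."""
    idx = np.clip(np.searchsorted(src_ts, query_ts), 1, len(src_ts) - 1)
    left_closer = query_ts - src_ts[idx - 1] < src_ts[idx] - query_ts
    return idx - left_closer.astype(int)

with h5py.File("trial_0001.h5", "r") as f:            # assumed file name
    ft, ft_ts = f["ft/wrench"][:], f["ft/timestamps"][:]
    q, q_ts = f["robot/joint_pos"][:], f["robot/timestamps"][:]

clock = np.arange(ft_ts[0], ft_ts[-1], 1.0 / 100.0)   # common 100 Hz clock
ft_sync = ft[nearest_indices(ft_ts, clock)]
q_sync = q[nearest_indices(q_ts, clock)]
```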

BibTeX


        Coming soon