The introduction of robots into everyday scenarios necessitates algorithms capable of monitoring the execution of tasks. In this paper, we propose ConditionNET, an approach for learning the preconditions and effects of actions in a fully data-driven manner. We develop an efficient vision-language model and introduce additional optimization objectives during training to ensure consistent feature representations. ConditionNET explicitly models the dependencies between actions, preconditions, and effects, leading to improved performance. We evaluate our model on two robotic datasets, one of which we collected for this paper, containing 406 successful and 138 failed teleoperated demonstrations of a Franka Emika Panda robot performing tasks like pouring and cleaning the counter. We show in our experiments that ConditionNET outperforms all baselines on both anomaly detection and phase prediction tasks. Furthermore, we implement an action monitoring system on a real robot to demonstrate the practical applicability of the learned preconditions and effects. Our results highlight the potential of ConditionNET for enhancing the reliability and adaptability of robots in real-world environments.
We learn the preconditions and effects of actions by framing the problem as state prediction: classify whether an image corresponds to the precondition or the effect of a given action, described in natural language, or satisfies neither. Our approach uses a dataset of skill demonstrations, each containing a sequence of images, action descriptions, and labels indicating success or failure. The demonstrations are segmented into preparation, core, and post-action phases, and the model is trained on triplets of precondition images, effect images, and action descriptions. To enrich the limited data, we apply two augmentations: treating the post-state of one action as the precondition of the action that follows, and generating multiple paraphrased descriptions of each action with a language model.
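As a rough illustration, the sketch below builds training triplets from segmented demonstrations and applies both augmentations. The field names (pre, post, action, task_id) and the paraphrase interface are assumptions made for this sketch, not the actual data schema.

import random

def build_triplets(demos, paraphrase):
    """demos: ordered list of dicts with 'pre' and 'post' image lists and an
    'action' description string; `paraphrase` maps a description to a list of
    reworded variants (e.g. produced offline by a language model)."""
    triplets = []
    for demo in demos:
        pre_img = random.choice(demo["pre"])     # image from the preparation phase
        eff_img = random.choice(demo["post"])    # image from the post-action phase
        for desc in paraphrase(demo["action"]):  # one sample per paraphrase
            triplets.append((pre_img, eff_img, desc))
    # Cross-action augmentation: within the same task, the post-state of one
    # action also serves as the precondition of the action that follows it.
    for prev, nxt in zip(demos, demos[1:]):
        if prev["task_id"] == nxt["task_id"]:
            for desc in paraphrase(nxt["action"]):
                triplets.append((random.choice(prev["post"]),
                                 random.choice(nxt["post"]),
                                 desc))
    return triplets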
The model architecture comprises two transformer stages. The first, the State Transformer, extracts high-level features representing the environment; the second, the Condition Transformer, refines this information by focusing on action-specific details. Visual features are extracted with a pre-trained DINOv2 model, which splits the image into patches and produces a token-based representation, while a frozen CLIP model encodes the natural language descriptions, which are integrated in the second stage to steer the model toward action-relevant features. Training is driven by two main objectives: a cross-entropy loss that classifies the action phase, and a consistency loss that ties the change in visual state to the action description. Together, these losses optimize the model's ability to predict action conditions accurately.
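The sketch below shows one way this two-stage design could look in PyTorch. The layer counts, mean pooling, cosine form of the consistency loss, and the assumption that both token streams are already projected to a shared width are illustrative choices, not the paper's reported configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionNetSketch(nn.Module):
    """Two-stage sketch: a State Transformer over image tokens, then a
    Condition Transformer that cross-attends to language tokens. Assumes
    DINOv2 patch tokens and CLIP text tokens share the width d_model."""
    def __init__(self, d_model=768, n_classes=3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.state_tf = nn.TransformerEncoder(enc, num_layers=2)
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.condition_tf = nn.TransformerDecoder(dec, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)  # precondition / effect / neither

    def forward(self, patch_tokens, text_tokens):
        state = self.state_tf(patch_tokens)           # stage 1: scene-level features
        cond = self.condition_tf(state, text_tokens)  # stage 2: action-specific refinement
        feat = cond.mean(dim=1)                       # pooled representation
        return self.head(feat), feat

def training_losses(model, pre_tokens, eff_tokens, text_tokens, text_emb):
    PRE, EFFECT = 0, 1  # class 2 would be "neither"
    logits_pre, f_pre = model(pre_tokens, text_tokens)
    logits_eff, f_eff = model(eff_tokens, text_tokens)
    b = f_pre.size(0)
    ce = (F.cross_entropy(logits_pre, torch.full((b,), PRE, dtype=torch.long))
          + F.cross_entropy(logits_eff, torch.full((b,), EFFECT, dtype=torch.long)))
    # Consistency: the direction of the visual state change should align
    # with the (pooled) embedding of the action description.
    delta = F.normalize(f_eff - f_pre, dim=-1)
    cons = (1 - F.cosine_similarity(delta, F.normalize(text_emb, dim=-1), dim=-1)).mean()
    return ce + cons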
Our system includes a library of skills, each divided into three phases: pre, core, and effect. The pre phase handles preparation, the core phase executes the main action, and the effect phase verifies successful completion. Motion primitives generate the required motions, and tasks are executed by a Behavior Tree (BT) that selects actions based on current observations. ConditionNET provides the learned preconditions and effects, which we use for anomaly detection by comparing expected and observed states. If an anomaly is detected during the pre phase, the current action halts and alternative behaviors are triggered. During the core phase, anomaly detection is suspended because intermediate states are ambiguous, while in the effect phase we verify task success and initiate recovery when necessary.
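The following sketch illustrates this phase-aware monitoring logic around a single skill. The classify interface (returning "pre", "effect", or "neither"), the robot handle, and the recovery module are hypothetical names for illustration, not the actual system API.

def execute_skill(skill, robot, conditionnet, recovery):
    # Pre phase: check that the scene satisfies the action's precondition.
    if conditionnet.classify(robot.camera(), skill.description) != "pre":
        return recovery.handle(skill, phase="pre")  # halt, trigger alternative behavior

    # Core phase: run the motion primitive; anomaly detection is suspended
    # here because intermediate states are ambiguous.
    robot.run_motion_primitive(skill.core)

    # Effect phase: verify that the expected post-state was reached.
    if conditionnet.classify(robot.camera(), skill.description) != "effect":
        return recovery.handle(skill, phase="effect")  # initiate recovery
    return "success"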
(Im)PerfectPour is a teleoperated dataset collected in the Autonomous Systems Lab at TU Wien. It includes 406 successful and 138 unsuccessful demonstrations of a Franka Emika Panda robot performing tasks composed of four skills: picking, placing, pouring, and wiping.
@ARTICLE{10812068,
  author   = {Sliwowski, Daniel and Lee, Dongheui},
  journal  = {IEEE Robotics and Automation Letters},
  title    = {ConditionNET: Learning Preconditions and Effects for Execution Monitoring},
  year     = {2025},
  volume   = {10},
  number   = {2},
  pages    = {1337--1344},
  keywords = {Robot sensing systems;Monitoring;Feature extraction;Hidden Markov models;Planning;Probabilistic logic;Natural languages;Image segmentation;Anomaly detection;Transformers;Deep learning methods;data sets for robot learning;deep learning for visual perception},
  doi      = {10.1109/LRA.2024.3520916}
}