ConditionNET: Learning Preconditions and Effects for Anomaly Detection and Recovery

IEEE Robotics and Automation Letters
Technische Universität Wien (TU Wien), German Aerospace Center (DLR)

Abstract

The introduction of robots into everyday scenarios necessitates algorithms capable of monitoring the execution of tasks. In this paper, we propose ConditionNET, an approach for learning the preconditions and effects of actions in a fully data-driven manner. We develop an efficient vision-language model and introduce additional optimization objectives during training to ensure consistent feature representations. ConditionNET explicitly models the dependencies between actions, preconditions, and effects, leading to improved performance. We evaluate our model on two robotic datasets, one of which we collected for this paper, containing 406 successful and 138 failed teleoperated demonstrations of a Franka Emika Panda robot performing tasks like pouring and cleaning the counter. We show in our experiments that ConditionNET outperforms all baselines on both anomaly detection and phase prediction tasks. Furthermore, we implement an action monitoring system on a real robot to demonstrate the practical applicability of the learned preconditions and effects. Our results highlight the potential of ConditionNET for enhancing the reliability and adaptability of robots in real-world environments.

General pipeline overview

How do we learn the conditions?

We learn the preconditions and effects of actions by framing the task as a state prediction problem. The goal is to classify whether an image corresponds to the precondition or the effect of a given action, based on a natural language description, or whether it satisfies neither. Our approach uses a dataset of skill demonstrations, each containing a sequence of images, action descriptions, and labels indicating success or failure. The demonstrations are segmented into preparation, core, and post-action phases, and the model is trained using triplets of precondition images, effect images, and action descriptions. To enrich the limited data, we implement data augmentation by treating the post-state of one action as the precondition of another and by generating multiple paraphrased descriptions of each action using a language model.
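The triplet construction and the two augmentations described above can be illustrated with a minimal sketch. It is not the released training code: the `Demo` record, the `build_triplets` helper, the file names, and the specific chaining rule (the post-state frame of the previous action in a sequence is reused as an extra precondition frame) are assumptions made for illustration, and the paraphrases are passed in as a plain dictionary rather than generated by a language model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """One segmented demonstration (hypothetical record layout)."""
    action: str      # natural language description of the action
    pre_frame: str   # frame identifier taken from the preparation phase
    post_frame: str  # frame identifier taken from the post-action phase


def build_triplets(sequence, paraphrases):
    """Return (precondition image, effect image, description) triplets."""
    triplets = []
    for i, demo in enumerate(sequence):
        pre_frames = [demo.pre_frame]
        if i > 0:
            # Assumed chaining rule: the post-state of the previous action
            # doubles as a precondition example for the current action.
            pre_frames.append(sequence[i - 1].post_frame)
        # Original description plus any paraphrases supplied for it.
        texts = [demo.action] + list(paraphrases.get(demo.action, []))
        for pre in pre_frames:
            for text in texts:
                triplets.append((pre, demo.post_frame, text))
    return triplets


sequence = [
    Demo("pick up the cup", "pick_pre.png", "pick_post.png"),
    Demo("pour the cup into the bowl", "pour_pre.png", "pour_post.png"),
]
paraphrases = {"pick up the cup": ["grasp the cup"]}
triplets = build_triplets(sequence, paraphrases)
```

In this toy sequence, the pick action contributes two triplets (one per description) and the pour action contributes two (one per precondition frame).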

General condition learning overview

The model architecture comprises two stages, both using transformers. The first stage, the State Transformer, extracts high-level features representing the environment, while the second stage, the Condition Transformer, refines this information by focusing on the action-specific details. The visual features are extracted using a pre-trained DINOv2 model, which processes the image into patches and produces a token-based representation. A frozen CLIP model is used to encode the natural language descriptions, which are integrated in the second stage to guide the model in focusing on relevant features. The training process is driven by two main objectives: a cross-entropy loss to classify the action phase, and a consistency loss that ensures the model learns the relationship between the change in state and the action description. These losses work together to optimize the model’s ability to predict action conditions accurately.
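To make the interplay of the two objectives concrete, the sketch below combines a three-class cross-entropy term with one possible consistency term. The class names follow the states mentioned on this page, but the form of the consistency term (one minus the cosine similarity between the feature change from precondition to effect and the text embedding), the `weight` parameter, and all function names are assumptions for illustration. The exact formulation used by ConditionNET is given in the paper and may differ.

```python
import math

CLASSES = ("precondition", "effect", "unsatisfied")


def cross_entropy(logits, target_index):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def consistency_loss(pre_feat, effect_feat, text_feat):
    """Assumed form: the state change should align with the action embedding."""
    delta = [e - p for p, e in zip(pre_feat, effect_feat)]
    return 1.0 - cosine_similarity(delta, text_feat)


def total_loss(logits, target_index, pre_feat, effect_feat, text_feat, weight=1.0):
    """Weighted sum of the classification and consistency terms."""
    return cross_entropy(logits, target_index) + weight * consistency_loss(
        pre_feat, effect_feat, text_feat
    )


# Toy values: uniform logits and a state change parallel to the text embedding.
loss = total_loss([0.0, 0.0, 0.0], CLASSES.index("effect"),
                  [0.0, 0.0], [1.0, 0.0], [2.0, 0.0])
```

With uniform logits and a perfectly aligned state change, the consistency term is approximately zero and the total is close to the cross-entropy of a uniform three-way guess.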

Architecture of our system

How do we detect anomalies and recover from them?

Our system includes a library of skills, each divided into three phases: pre, core, and effect. The pre phase involves preparation, the core phase completes the main action, and the effect phase verifies successful task completion. We use motion primitives to generate the required motions, and tasks are executed using a Behavior Tree (BT) that selects actions based on current observations. ConditionNET learns action preconditions and effects, which we use for anomaly detection by comparing expected and observed states. If anomalies are detected during the pre phase, the current action halts, and alternative behaviors are triggered. In the core phase, anomaly detection is suspended due to state ambiguity, while in the effect phase, we verify task success and initiate recovery when necessary.
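The phase-dependent monitoring rule described above can be summarized in a short sketch. The function name `monitor` and the reaction labels are hypothetical, and the sketch assumes that the expected prediction is "precondition" during the pre phase and "effect" during the effect phase. In the real system these decisions are embedded in the Behavior Tree rather than returned as strings.

```python
def monitor(expected_phase, predicted_state):
    """Map the expected phase and the predicted state to a reaction.

    expected_phase: "pre", "core", or "effect" (phase of the running skill)
    predicted_state: "precondition", "effect", or "unsatisfied" (model output)
    """
    if expected_phase == "core":
        # Detection is suspended: the state is ambiguous during the main motion.
        return "continue"
    if expected_phase == "pre":
        if predicted_state == "precondition":
            return "continue"
        # Precondition not met: halt the action and try an alternative behavior.
        return "halt_and_replan"
    if expected_phase == "effect":
        if predicted_state == "effect":
            return "continue"
        # Expected effect not observed: the action failed, so start recovery.
        return "recover"
    raise ValueError(f"unknown phase: {expected_phase!r}")


# A failed grasp: the effect is expected, but the scene still looks like
# the precondition of the action.
reaction = monitor("effect", "precondition")
```

Under these assumptions, a mismatch in the core phase is deliberately ignored and only surfaces once the skill reaches its effect phase.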

Behavior tree example

(Im)PerfectPour

(Im)PerfectPour is a teleoperated dataset collected in the Autonomous Systems Lab at TU Wien. It includes 406 successful and 138 unsuccessful demonstrations of a Franka Emika Panda robot performing tasks such as picking, placing, pouring, and wiping. The dataset consists of 4 skills:

  • pick up O1,
  • pour O1 into O2,
  • place O1 on O2,
  • wipe O1,
where O1 and O2 are the names of the manipulated objects. Each demonstration is recorded using two cameras to increase dataset variability.
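As a small illustration of how the four skill templates turn into action descriptions, the sketch below fills in the object placeholders. The `SKILL_TEMPLATES` tuple mirrors the list above, while the `describe` helper and the object names are hypothetical and not taken from the dataset.

```python
SKILL_TEMPLATES = (
    "pick up {o1}",
    "pour {o1} into {o2}",
    "place {o1} on {o2}",
    "wipe {o1}",
)


def describe(template, o1, o2=None):
    """Fill a skill template with the names of the manipulated objects."""
    if "{o2}" in template and o2 is None:
        raise ValueError("this skill needs a second object")
    # str.format ignores unused keyword arguments, so o2 may be passed always.
    return template.format(o1=o1, o2=o2)


pour_text = describe(SKILL_TEMPLATES[1], "the cup", "the bowl")
wipe_text = describe(SKILL_TEMPLATES[3], "the counter")
```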

Dataset statistics

Qualitative results

Below we present state predictions and anomalies over time. Each hue marks a different expected action, and each saturation denotes a different expected motion phase. In the first case, our approach correctly determines that no anomalies have occurred during the execution. In the second case, two anomalies are identified. The first occurs when a human pulls the robot, causing a spill. At this moment, the model switches from predicting "effect" to "unsatisfied." Since the core-motion phase is undefined for anomaly detection, no anomaly is reported immediately; the anomaly is detected when the phase shifts back to "effect," which triggers recovery. The second anomaly occurs when the robot fails to pick up a cloth on the first attempt. The model correctly detects that the execution remains in the precondition phase while the expected state is "effect," and this mismatch is identified as an anomaly. The robot then retries and successfully completes the task.

Qualitative results

BibTeX


          @ARTICLE{10812068,
            author={Sliwowski, Daniel and Lee, Dongheui},
            journal={IEEE Robotics and Automation Letters}, 
            title={ConditionNET: Learning Preconditions and Effects for Execution Monitoring}, 
            year={2025},
            volume={10},
            number={2},
            pages={1337-1344},
            keywords={Robot sensing systems;Monitoring;Feature extraction;Hidden Markov models;Planning;Probabilistic logic;Natural languages;Image segmentation;Anomaly detection;Transformers;Deep learning methods;data sets for robot learning;deep learning for visual perception},
            doi={10.1109/LRA.2024.3520916}}