When doing laundry, humans naturally envision the future states of clothes as a result of their actions, so that they can sort the clothes more effectively. In this work, we infuse the predictive nature of human manipulation strategies into robot imitation learning by disentangling task-related state transitions from agent-specific inverse dynamics modeling. Using a demonstration dataset, we train a diffusion model to predict future states given historical observations, thereby envisioning how the scene will evolve. We then use an inverse dynamics model to predict the robot actions that achieve the predicted states. Our key insight is that modeling object movement helps in learning policies for coordinated bimanual manipulation tasks. Evaluating our framework across diverse simulation and real-world manipulation setups, including multimodal goal configurations, bimanual manipulation, deformable objects, and multi-object scenes, we find that it consistently outperforms state-of-the-art state-to-action mapping policies. Our method demonstrates a remarkable capacity to navigate multimodal goal configurations and action distributions, to maintain stability across different control modes, and to synthesize a broader range of behaviors than those present in the demonstration dataset.
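To make the two-stage pipeline concrete, below is a minimal sketch of inference: a diffusion model denoises a predicted future state conditioned on an observation history, and an inverse dynamics model maps (current state, predicted state) to an action. The MLP architectures, dimensions, and DDPM hyperparameters here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HISTORY = 16, 8, 4  # hypothetical dimensions
T = 50  # number of diffusion steps (assumed)

class StateDenoiser(nn.Module):
    """Predicts the noise added to a future state, conditioned on history."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + HISTORY * STATE_DIM + 1, 256),
            nn.ReLU(),
            nn.Linear(256, STATE_DIM),
        )

    def forward(self, noisy_future, history, t):
        # t is normalized to [0, 1] and appended as a scalar condition
        x = torch.cat([noisy_future, history.flatten(1), t], dim=-1)
        return self.net(x)

class InverseDynamics(nn.Module):
    """Maps (current state, predicted future state) to a robot action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * STATE_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, current, future):
        return self.net(torch.cat([current, future], dim=-1))

@torch.no_grad()
def predict_action(denoiser, idm, history):
    """DDPM-style reverse sampling of the next state, then action inference."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, STATE_DIM)  # start from pure noise
    for t in reversed(range(T)):
        t_cond = torch.full((1, 1), t / T)
        eps = denoiser(x, history, t_cond)
        # standard DDPM posterior mean update
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    current = history[:, -1]  # most recent observed state
    return idm(current, x)    # action that should realize the predicted state

history = torch.randn(1, HISTORY, STATE_DIM)  # dummy observation history
action = predict_action(StateDenoiser(), InverseDynamics(), history)
print(action.shape)  # torch.Size([1, 8])

At deployment time, only the inverse dynamics model is agent-specific; the state diffusion model captures task-level scene evolution, which is what enables the disentanglement described above.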
[Videos: Push-L, Fruit Holding, Laundry Cleanup, Cluttered Shelf Picking]
@inproceedings{chen2025bimanual,
title={Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models},
author={Haonan Chen and Jiaming Xu and Lily Sheng and Tianchen Ji and Shuijing Liu and Yunzhu Li and Katherine Driggs-Campbell},
booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
year={2025}
}