We propose novel metrics, tasks, visualization tools and methods for “policy mobilization,” the problem of taking a non-mobile manipulation policy and finding a proper initial robot pose from which to execute it on a mobile platform.
Introducing
Given an existing manipulation policy trained on data collected from limited viewpoints, policy mobilization aims to find an optimal robot pose in an unseen environment to successfully execute this policy.
Highlights
Chaining a sequence of manipulation skills in-the-wild.
Generalizing zero-shot to unseen scene layouts.
Operating in large spaces that the policies have not explored during training.
Adapting to novel object heights.
How we approach policy mobilization
Our method for policy mobilization follows four simple steps: you provide a pre-trained robot manipulation policy together with its training data, and the method handles the rest (a minimal code sketch of the pose search follows the list below).
Drive the robot around to capture the test-time scene.
Build a 3D Gaussian Splatting model from the capture.
Use our novel hybrid score function to rate robot poses.
Find the best robot pose with sampling-based optimization.
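To make steps 3 and 4 concrete, here is a minimal sampling-based search in the style of the cross-entropy method. It assumes a score_pose(pose) callable that wraps steps 2 and 3, i.e., renders the 3D Gaussian Splatting model from a candidate base pose (x, y, yaw) and returns the hybrid score; the function names, hyperparameters, and the specific CEM-style update are illustrative, not our exact implementation.

import numpy as np

def find_best_base_pose(score_pose, n_iters=10, n_samples=256, n_elite=32):
    # Sampling-based search over SE(2) base poses (x, y, yaw) that maximizes
    # the pose score. The yaw averaging below is naive; a real implementation
    # should handle angle wrap-around.
    mean = np.zeros(3)                       # initial pose guess
    std = np.array([1.0, 1.0, np.pi / 2])    # broad initial search region
    for _ in range(n_iters):
        poses = np.random.normal(mean, std, size=(n_samples, 3))
        scores = np.array([score_pose(p) for p in poses])
        elite = poses[np.argsort(scores)[-n_elite:]]   # top-scoring candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean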
Comparisons
In real-world experiments, we compare our method with two baselines: BC w/ Nav and a Human baseline. For BC w/ Nav, we train an end-to-end imitation learning policy on combined navigation and manipulation data. For the Human baseline, we verbally instruct users to drive the robot to the starting base pose they believe is best for successfully executing each manipulation task, without telling them anything about how the policy was trained. All videos below are played at 4x speed.
Introducing
To motivate further research on policy mobilization, we propose the Mobi-π framework:
5 simulated tasks based on RoboCasa
3 navigation and imitation learning baselines
2 mobilization feasibility metrics
To effectively study the policy mobilization problem, we develop a suite of simulation environments based on RoboCasa as a benchmarking setup. We pick five single-stage manipulation tasks: Close Door, Close Drawer, Turn on Faucet, Turn on Microwave, and Turn on Stove.
We study three baselines for policy mobilization and divide them into two categories. Baselines in the first category navigate to the object of interest without considering the manipulation policy's capabilities; they are not policy-aware. This category includes LeLaN and VLFM. Baselines in the second category are policy-aware and leverage large-scale data to connect navigation with manipulation. As a representative method in this category, we use BC w/ Nav, a Behavior Transformer trained to jointly perform navigation and manipulation using combined demonstrations.
We quantify how feasible it is to mobilize a policy from both spatial and visual perspectives. We evaluate these feasibility metrics on the simulated tasks and discuss how they correlate with experimental results.
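As a rough illustration of what a spatial feasibility metric can look like (a simplified proxy, not the exact definition used in our paper): measure how often the frozen manipulation policy still succeeds when its base pose is perturbed around a known-good pose, so tasks that tolerate larger perturbations are easier to mobilize. The function and argument names below are hypothetical.

import numpy as np

def spatial_tolerance(rollout_success, nominal_pose, radius=0.3, n=50, seed=0):
    # Illustrative spatial proxy (not the paper's exact metric): success rate
    # of the frozen policy when the base pose (x, y, yaw) is perturbed
    # uniformly within `radius` (meters for x/y, radians for yaw).
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-radius, radius, size=(n, 3))
    return float(np.mean([rollout_success(nominal_pose + o) for o in offsets]))

A visual counterpart could, for instance, compare how similar test-time observations are to the views the policy saw during training.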
Team
This work would not have been possible without this awesome team.
Jingyun completed this work during an internship with Toyota Research Institute.
* Isabella and Brandon contributed equally.
FAQ
Indeed, one approach to learning mobile manipulation is to train a policy from a dataset that includes both navigation and manipulation data. However, training such a policy typically requires large amounts of training data, since the policy needs to not only generalize to different navigation and manipulation scenarios but also seamlessly coordinate navigation and manipulation. We showed in our simulation experiments that an imitation learning baseline that learns an end-to-end policy for navigation and manipulation fails to perform well in unseen room layouts despite learning from 5x more training data. Compared to training an end-to-end mobile manipulation policy, our policy mobilization framework provides a more data-efficient approach to learning mobile manipulation. Meanwhile, our problem formulation complements existing efforts to improve the robustness of manipulation policies and remains compatible with them.
Training a mobile manipulation policy in simulation from 3DGS requires creating an interactive and physically accurate simulation from the 3DGS model, which is an unsolved problem.
Our sim setup includes unseen test objects. We find that our method, which utilizes DINO dense descriptors to score robot poses, is capable of ignoring these irrelevant objects.
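To give a feel for why dense descriptors help here, below is a rough sketch of a patch-level similarity score between a policy training view and a rendered candidate view (assuming a DINOv2 backbone from torch.hub; the exact backbone, image preprocessing, and matching scheme in our method may differ). Because each training-view patch only keeps its best match in the rendered view, patches from irrelevant test-time objects tend not to affect the score.

import torch
import torch.nn.functional as F

# Assumption: DINOv2 ViT-S/14 from torch.hub as a stand-in dense-feature extractor.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def dense_descriptors(img):
    # img: (1, 3, H, W) normalized image with H and W multiples of 14.
    feats = model.forward_features(img)["x_norm_patchtokens"]  # (1, N, C)
    return F.normalize(feats[0], dim=-1)                        # (N, C)

@torch.no_grad()
def view_similarity(train_img, rendered_img):
    # For each training-view patch, take its best cosine match among the
    # rendered-view patches, then average over patches.
    f_train = dense_descriptors(train_img)
    f_render = dense_descriptors(rendered_img)
    sim = f_train @ f_render.T          # (N_train, N_render) cosine similarities
    return sim.max(dim=1).values.mean().item()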
Yes. Our learned 3DGS models have artifacts and surface color inconsistencies, yet the method still performs well.
Please try to first find answers in our paper and supplementary materials. If you still cannot find an answer, you are welcome to contact us via email.
Found our work useful?
@article{yang2025mobipi,
  title={Mobi-$\pi$: Mobilizing Your Robot Learning Policy},
  author={Yang, Jingyun and Huang, Isabella and Vu, Brandon and Bajracharya, Max and Antonova, Rika and Bohg, Jeannette},
  journal={arXiv preprint arXiv:2505.23692},
  year={2025}
}