Phantom: Training Robots Without Robots Using Only Human Videos

Stanford University

Scaling robotics data collection is critical to advancing general-purpose robots. Current approaches often rely on teleoperated demonstrations, which are difficult to scale. We propose a novel data collection method that eliminates the need for robot hardware by leveraging human video demonstrations. By training imitation learning policies on this human data, our approach enables zero-shot deployment on robots without collecting any robot-specific data. To bridge the embodiment gap between human and robot appearances, we apply a data-editing approach to the input observations that aligns the image distributions between human training data and robot test data. Our method significantly reduces the cost of diverse data collection by allowing anyone with an RGBD camera to contribute. We demonstrate that our approach works in diverse, unseen environments and on varied tasks.

Approach


Overview of our data-editing pipeline for learning robot policies from human videos. During training, we first estimate the hand pose in each frame of a human video demonstration and convert it into a robot action. We then remove the human hand using inpainting and overlay a virtual robot in its place. The resulting augmented dataset is used to train an imitation learning policy, π. At test time, we overlay a virtual robot on real robot observations to ensure visual consistency, enabling direct deployment of the learned policy on a real robot.
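To make the pipeline concrete, the following is a minimal Python sketch of the training-time editing loop, assuming per-frame RGBD input. All helper functions (estimate_hand_pose, hand_pose_to_robot_action, inpaint_hand, overlay_virtual_robot) are hypothetical placeholders for the components described above, not the released implementation.

# Minimal sketch of the training-time data-editing loop described above.
# The helpers below are hypothetical placeholders, not the authors' code.
from dataclasses import dataclass
import numpy as np

@dataclass
class EditedStep:
    image: np.ndarray   # edited RGB frame with the virtual robot overlaid
    action: np.ndarray  # robot action derived from the estimated hand pose

def estimate_hand_pose(rgbd_frame: np.ndarray) -> np.ndarray:
    """Placeholder: estimate a 6-DoF hand pose (plus grip width) from one RGBD frame."""
    raise NotImplementedError

def hand_pose_to_robot_action(hand_pose: np.ndarray) -> np.ndarray:
    """Placeholder: retarget the hand pose to an end-effector pose and gripper command."""
    raise NotImplementedError

def inpaint_hand(rgb: np.ndarray) -> np.ndarray:
    """Placeholder: remove the human hand from the frame via inpainting."""
    raise NotImplementedError

def overlay_virtual_robot(rgb: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Placeholder: render a virtual robot at the pose implied by the action."""
    raise NotImplementedError

def edit_human_video(rgbd_frames: list[np.ndarray]) -> list[EditedStep]:
    """Convert one human video into an edited robot demonstration for policy training."""
    demo = []
    for frame in rgbd_frames:
        hand_pose = estimate_hand_pose(frame)
        action = hand_pose_to_robot_action(hand_pose)
        rgb = frame[..., :3]  # assume the last channel is depth; editing uses RGB only
        edited = overlay_virtual_robot(inpaint_hand(rgb), action)
        demo.append(EditedStep(image=edited, action=action))
    return demo

At test time, the same overlay step would be applied to the real robot observations before they are passed to the policy, keeping the train and test image distributions aligned.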

Our method is robot-agnostic. As shown below, each human video can be converted into a robot demonstration for any robot capable of completing the task. In our experiments, we deploy our method on two different robots: Franka and Kinova Gen3 (both with a Robotiq gripper).
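As a hedged illustration of this robot-agnosticism, the sketch below keeps the retargeted action shared across robots and isolates the robot-specific pieces (the overlay asset and the gripper normalization) in a small config. RobotConfig, its field names, the asset paths, and pinch_to_gripper_command are illustrative assumptions, not the paper's interface.

# Illustrative sketch: the same human hand trajectory can target different robots
# by swapping a small robot-specific config. Everything here is an assumption
# made for illustration (the paper does not expose this interface).
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotConfig:
    name: str
    overlay_asset: str        # model used to render the virtual robot overlay
    max_gripper_width: float  # fully open gripper width in meters (0.085 m for a Robotiq 2F-85)

FRANKA = RobotConfig("franka", "assets/franka_robotiq.urdf", 0.085)
KINOVA_GEN3 = RobotConfig("kinova_gen3", "assets/kinova_gen3_robotiq.urdf", 0.085)

def pinch_to_gripper_command(thumb_tip, index_tip, robot: RobotConfig) -> float:
    """Map the thumb-index fingertip distance to a normalized gripper command in [0, 1]."""
    width = float(np.linalg.norm(np.asarray(thumb_tip) - np.asarray(index_tip)))
    return min(width / robot.max_gripper_width, 1.0)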

Real-world experiments

Data-Editing Strategies

We compare three data-editing strategies: (1) Hand Inpaint, inspired by Rovi-Aug; (2) Hand Mask, inspired by Shadow; and (3) Red Line, inspired by EgoMimic. While (3) is already designed for human videos, we adapt the data-editing strategies from (1) and (2), which were originally developed for the simpler robot-to-robot setting, to the human-to-robot setting. We also compare to a Vanilla baseline that does not modify the train or test images in any way.
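For intuition, here is a simplified per-frame sketch of what each strategy could look like. The exact edits used in the paper may differ; the hand mask, the robot overlay, and the fingertip pixel coordinates are assumed inputs, and the Hand Mask variant is reduced here to simply blanking the hand region.

# Simplified stand-ins for the compared data-editing strategies, applied to one frame.
# Inputs (hand mask, robot overlay, fingertip pixels) are assumed to be given.
import cv2
import numpy as np

def edit_frame(rgb, strategy, hand_mask, robot_overlay, thumb_px=(0, 0), index_px=(0, 0)):
    """rgb: HxWx3 uint8 frame; hand_mask: HxW binary mask; robot_overlay: HxWx4 RGBA render."""
    if strategy == "hand_inpaint":
        # Remove the hand via inpainting, then composite the virtual robot in its place.
        inpainted = cv2.inpaint(rgb, hand_mask.astype(np.uint8), 3, cv2.INPAINT_TELEA)
        alpha = robot_overlay[..., 3:] / 255.0
        return (alpha * robot_overlay[..., :3] + (1 - alpha) * inpainted).astype(np.uint8)
    if strategy == "hand_mask":
        # Blank out the hand region instead of inpainting it (simplified here).
        return np.where(hand_mask[..., None] > 0, 0, rgb).astype(np.uint8)
    if strategy == "red_line":
        # Draw a red segment between the thumb and index fingertips.
        out = rgb.copy()
        cv2.line(out, thumb_px, index_px, color=(255, 0, 0), thickness=4)
        return out
    if strategy == "vanilla":
        return rgb  # baseline: leave the frame unmodified
    raise ValueError(f"unknown strategy: {strategy}")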

Single scene

We start by evaluating how well a policy trained exclusively on data-edited human video demonstrations transfers to the real robot in a single scene. We evaluate on five tasks that highlight the diversity of skills our method can learn: deformable object manipulation (rope task), multi-object manipulation (sweep task), object reorientation, insertion, and pick-and-place. Videos are at 5x speed.

Across all tasks, Hand Inpaint and Hand Mask achieve comparable success rates, with Hand Inpaint slightly outperforming Hand Mask. However, Hand Mask takes on average 73% longer to roll out because it must run an additional diffusion model at test time to generate the hand masks. Red Line and Vanilla are unable to complete any of the tasks. Videos are at 5x speed.

Policy        Pick/Place Book   Stack Cups   Tie Rope   Rotate Box
Hand Inpaint  0.92              0.72         0.64       0.72
Hand Mask     0.92              0.52         0.60       0.76
Red Line      0.0               0.0          0.0        0.0
Vanilla       0.0               0.0          0.0        0.0

Policy        Grasp Brush   Sweep > 0   Sweep > 2   Sweep > 4
Hand Inpaint  0.88          0.80        0.72        0.40
Hand Mask     0.75          0.75        0.72        0.68
Red Line      0.0           0.0         0.0         0.0
Vanilla       0.0           0.0         0.0         0.0
Scene generalization

Next, we evaluate how well our method generalizes to new, unseen environments. We collect human video demonstrations of the sweeping task across diverse scenes and evaluate in three out-of-distribution (OOD) environments. Videos are at 5x speed.

OOD Scene #1

OOD Scene #2

OOD Scene #2 + OOD surface

Hand Inpaint achieves high success rates across all three OOD environments. Videos are at 5x speed.

Policy        Outdoor lawn   Indoor lounge   Indoor lounge + OOD surface
Hand Inpaint  0.72           0.84            0.64
Hand Mask     0.52           0.76            0.68

BibTeX

@misc{lepert2025phantomtrainingrobotsrobots,
  title={Phantom: Training Robots Without Robots Using Only Human Videos},
  author={Marion Lepert and Jiaying Fang and Jeannette Bohg},
  year={2025},
  eprint={2503.00779},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2503.00779},
}