Phantom: Training Robots Without Robots Using Only Human Videos

Stanford University

Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations, which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing: we inpaint the human arm and overlay a rendered robot to align the visual domain. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates of up to 92% on a range of tasks, including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning.

Approach


Overview of our data-editing pipeline for learning robot policies from human videos. During training, we first estimate the hand pose in each frame of a human video demonstration and convert it into a robot action. We then remove the human hand using inpainting and overlay a virtual robot in its place. The resulting augmented dataset is used to train an imitation learning policy, π. At test time, we overlay a virtual robot on real robot observations to ensure visual consistency, enabling direct deployment of the learned policy on a real robot.

Our method is robot-agnostic: as shown below, each human video can be converted into a robot demonstration for any robot capable of completing the task. In our experiments, we deploy our method on two different robots, a Franka and a Kinova Gen3 (both with a Robotiq gripper).
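For illustration, here is a minimal sketch of the hand-pose-to-action conversion step described above, assuming a fixed wrist-to-gripper offset, known camera-to-base extrinsics, and a simple aperture threshold for the gripper command. The values and helper names below are placeholders for illustration only.

# Minimal sketch of converting an estimated hand pose into a robot action.
# The offsets, extrinsics, and threshold below are illustrative placeholders.
import numpy as np

def pose_to_matrix(position, rotation):
    """Build a 4x4 homogeneous transform from a 3-vector and a 3x3 rotation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

def hand_to_robot_action(
    wrist_pos_cam,         # (3,) wrist position in the camera frame
    wrist_rot_cam,         # (3, 3) wrist orientation in the camera frame
    finger_aperture,       # estimated thumb-index distance in meters
    T_base_cam,            # (4, 4) camera frame -> robot base frame
    T_wrist_gripper,       # (4, 4) fixed human-wrist -> gripper-frame offset
    close_threshold=0.03,  # aperture below this counts as "closed" (placeholder)
):
    # Chain the transforms to get the end-effector pose in the robot base frame.
    T_cam_wrist = pose_to_matrix(wrist_pos_cam, wrist_rot_cam)
    T_base_gripper = T_base_cam @ T_cam_wrist @ T_wrist_gripper
    # Derive a binary gripper command from the finger aperture.
    gripper_closed = float(finger_aperture < close_threshold)
    return T_base_gripper, gripper_closed

# Example call with placeholder identity extrinsics and offset.
T_ee, grip = hand_to_robot_action(
    wrist_pos_cam=np.array([0.1, 0.0, 0.5]),
    wrist_rot_cam=np.eye(3),
    finger_aperture=0.02,
    T_base_cam=np.eye(4),
    T_wrist_gripper=np.eye(4),
)
print(T_ee[:3, 3], grip)  # end-effector position and gripper command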

Real-world experiments

Data-Editing Strategies

We compare three data-editing strategies: (1) Hand Inpaint, inspired by RoVi-Aug; (2) Hand Mask, inspired by Shadow; and (3) Red Line, inspired by EgoMimic. While (3) is already designed for human videos, we adapt the data-editing strategies from (1) and (2), which were originally developed for the simpler robot-to-robot setting, to the human-to-robot setting. We also compare to a Vanilla baseline that does not modify the train or test images in any way.
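For illustration, here is a minimal sketch of how a precomputed hand mask might be applied under the two strongest strategies, using OpenCV's classical inpainting as a simple stand-in for a learned inpainting model; it is illustrative only and omits the Red Line variant.

# Illustrative contrast between Hand Inpaint and Hand Mask, given an RGB frame
# and a binary hand mask. cv2.inpaint is a classical stand-in for a learned
# inpainting model and is used here for illustration only.
import cv2
import numpy as np

def hand_inpaint(frame, hand_mask):
    """Hand Inpaint: fill the hand region with plausible background."""
    mask_u8 = (hand_mask > 0).astype(np.uint8) * 255
    return cv2.inpaint(frame, mask_u8, 5, cv2.INPAINT_TELEA)

def hand_mask_out(frame, hand_mask):
    """Hand Mask: blank out the hand region instead of filling it."""
    edited = frame.copy()
    edited[hand_mask > 0] = 0
    return edited

# Dummy inputs: a random frame and a rectangular "hand" mask.
frame = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
mask = np.zeros((240, 320), dtype=np.uint8)
mask[100:160, 150:220] = 1
print(hand_inpaint(frame, mask).shape, hand_mask_out(frame, mask).shape)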

Single scene

We start by evaluating how well a policy trained exclusively on data-edited human video demonstrations from a single scene transfers to the real robot. We evaluate on five tasks that highlight the diversity of skills our method can learn: deformable object manipulation (tying a rope), multi-object manipulation (sweeping), object reorientation, insertion, and pick-and-place. Videos are at 5x speed.

Across all tasks, Hand Inpaint and Hand Mask achieve comparable performance, with Hand Inpaint slightly outperforming Hand Mask. However, Hand Mask takes on average 73% longer to roll out because it must run an additional diffusion model at test time to generate the hand masks. Red Line and Vanilla are unable to complete any of the tasks. Videos are at 5x speed.

Policy        Pick/Place Book   Stack Cups   Tie Rope   Rotate Box
Hand Inpaint  0.92              0.72         0.64       0.72
Hand Mask     0.92              0.52         0.60       0.76
Red Line      0.00              0.00         0.00       0.00
Vanilla       0.00              0.00         0.00       0.00

Policy        Grasp Brush   Sweep > 0   Sweep > 2   Sweep > 4
Hand Inpaint  0.88          0.80        0.72        0.40
Hand Mask     0.75          0.75        0.72        0.68
Red Line      0.00          0.00        0.00        0.00
Vanilla       0.00          0.00        0.00        0.00
Scene generalization

Next, we evaluate how well our method generalizes to new, unseen environments. We collect human video demonstrations of a sweeping task across diverse scenes and evaluate in three out-of-distribution (OOD) environments.

OOD Scene #1 (outdoor lawn)

OOD Scene #2 (indoor lounge)

OOD Scene #2 + OOD surface (indoor lounge)

Hand Inpaint achieves high success rates across all three OOD environments. Videos are at 5x speed.

Policy        Outdoor lawn   Indoor lounge   Indoor lounge + OOD surface
Hand Inpaint  0.72           0.84            0.64
Hand Mask     0.52           0.76            0.68

BibTeX

@misc{lepert2025phantomtrainingrobotsrobots,
  title={Phantom: Training Robots Without Robots Using Only Human Videos},
  author={Marion Lepert and Jiaying Fang and Jeannette Bohg},
  year={2025},
  eprint={2503.00779},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2503.00779},
}