3D FlowMatch Actor: Unified 3D Policy for Single- and Dual-Arm Manipulation


1 Carnegie Mellon University    2 NVIDIA    3 National Taiwan University
Equal contribution

3D FlowMatch Actor


We present 3D FlowMatch Actor (3DFA), a 3D policy architecture for robot manipulation that combines flow matching for trajectory prediction with 3D pretrained visual scene representations for learning from demonstration. 3DFA leverages 3D relative attention between action and visual tokens during action denoising, building on prior work in 3D diffusion-based single-arm policy learning. Through a combination of flow matching and targeted system-level and architectural optimizations, 3DFA achieves over 30x faster training and inference than previous 3D diffusion-based policies, without sacrificing performance. On the bimanual PerAct2 benchmark, it establishes a new state of the art, outperforming the next-best method by an absolute margin of 41.4%. In extensive real-world evaluations, it surpasses strong baselines with up to 1000x more parameters and significantly more pretraining. In unimanual settings, it sets a new state of the art on 74 RLBench tasks by directly predicting dense end-effector trajectories, eliminating the need for motion planning. Comprehensive ablation studies underscore the importance of our design choices for both policy effectiveness and efficiency.
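To make the flow-matching component concrete, here is a minimal scalar sketch of the idea, not the 3DFA implementation: training regresses a velocity field along a linear interpolation path between noise and the target action, and inference integrates that field with a few Euler steps. All names and dimensions here are illustrative.

```python
def interpolation_target(x0, x1, t):
    # Linear probability path x_t = (1 - t) * x0 + t * x1; the network's
    # regression target along this path is the constant velocity x1 - x0.
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

def euler_sample(velocity_fn, x0, steps=10):
    # Inference: integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1
    # (action) with a handful of Euler steps.
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy stand-in for a trained velocity network whose target action is 3.0:
target = 3.0
ideal_v = lambda x, t: (target - x) / (1.0 - t)
sampled = euler_sample(ideal_v, x0=-1.0)  # approaches 3.0
```

Because sampling needs only a few integration steps of a deterministic ODE, flow matching is one source of the inference speedup reported below.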

State-of-the-art on PerAct2

We train a multi-task 3D FlowMatch Actor on the 13 bimanual manipulation tasks of PerAct2, using 100 demos per task. 3DFA achieves an absolute performance gain of 41.4% over π0 and 53.1% over the next best competitor. We also show results on a few indicative tasks.


30x faster to train and run inference than 3D Diffuser Actor

3DFA achieves a 30x training and inference speedup over 3D Diffuser Actor, without any performance loss. This is due to a series of architectural and system-level optimizations, including the use of flow matching, a substantially improved data-loading strategy, and a more efficient point-sampling method. We show the impact of each optimization on training and inference speed.
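The page does not spell out the point-sampling routine, so as an illustrative sketch only, here is farthest point sampling (FPS), a standard way to subsample a 3D point cloud while preserving spatial coverage; efficient samplers in 3D policies are typically variants of or replacements for this greedy loop.

```python
def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_sampling(points, k):
    # Greedy FPS: repeatedly pick the point farthest from the chosen set.
    chosen = [0]  # seed with the first point
    d = [dist2(p, points[0]) for p in points]  # distance to chosen set
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: d[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d[i] = min(d[i], dist2(p, points[nxt]))
    return chosen

# Picks the spread-out corners before the point clustered near the seed:
idx = farthest_point_sampling([(0, 0), (10, 0), (0, 10), (1, 1)], 3)
```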


Multi-task 3DFA outperforms policies optimized for single tasks

We trained 3D FlowMatch Actor on all 13 tasks simultaneously, whereas most previous baselines were trained on each task separately and evaluated on a subset of tasks. We evaluate 3DFA on the same 5 tasks that PPI and KStarDiffuser selected, for a direct comparison. We achieve a 13.6% absolute improvement over PPI and a 24% absolute improvement over KStarDiffuser.


Sample execution videos of 3D FlowMatch Actor on the PerAct2 tasks

We show the denoising process of the predicted keyposes as well as the actual execution for the task of pushing a heavy box:

First keypose (behind box)

Second keypose (to target area)

Execution

Real-world bimanual manipulation results

We construct a real-world multi-task bimanual manipulation benchmark with 10 challenging tasks, using Mobile Aloha. Each model is trained on 40 demos per task. 3DFA and baselines are trained to predict closed-loop trajectories, not keyposes.


3DFA largely outperforms π0 and iDP3

3DFA solves many more tasks, while running faster and using 1,000x fewer parameters than π0.


We show successful rollouts of our policy:

Sample execution videos of 3D FlowMatch Actor on the real-world tasks

We show common failure cases of 3DFA in the real world:

3D FlowMatch Actor is a more accessible drop-in replacement for 3D Diffuser Actor - Results on unimanual PerAct

We evaluate our design choices on the unimanual multi-task PerAct benchmark, which contains 18 tasks and 100 demos per task. 3DFA performs on par with 3D Diffuser Actor, even when only two cameras are used, while being 6.5x faster to train and 28x faster at inference.


3D FlowMatch Actor can predict trajectories and solve 74 RLBench tasks

We train 3D FlowMatch Actor on 74 unimanual RLBench tasks. While many of them can be solved by keypose-prediction models, there are tasks that require continuous interaction with the environment. We train 3DFA to jointly predict the next keypose and the trajectory from the current pose to the keypose in a single forward pass, showing the universality of our approach. 3DFA sets a new state of the art on the 74-task benchmark. Notably, 3DFA uses two cameras, while baselines use 3-5 cameras.
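The single-forward-pass design can be pictured as the denoiser emitting one keypose token followed by a sequence of trajectory waypoint tokens. Below is a minimal sketch of unpacking such an output; the 8-D pose layout (xyz position, quaternion, gripper state) and all names are hypothetical, not taken from the paper.

```python
POSE_DIM = 8  # hypothetical: 3 position + 4 quaternion + 1 gripper

def split_prediction(flat, horizon):
    # The network denoises 1 keypose token plus `horizon` waypoint tokens
    # jointly; this helper just unflattens that single output vector.
    tokens = [flat[i * POSE_DIM:(i + 1) * POSE_DIM]
              for i in range(1 + horizon)]
    keypose, trajectory = tokens[0], tokens[1:]
    return keypose, trajectory

# A flat prediction of length (1 + horizon) * POSE_DIM splits into one
# keypose and `horizon` dense waypoints toward it:
keypose, traj = split_prediction(list(range(24)), horizon=2)
```

Predicting both targets jointly is what lets the same model serve keypose-style tasks and tasks that need continuous interaction with the environment.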


We additionally consider a subset of those 74 tasks that previous works have marked as challenging to complete with keypose-prediction models. 3DFA outperforms all prior methods on the tasks that require continuous interaction with the environment.


We show videos of 3DFA solving challenging tasks that previous approaches struggle with.

BibTeX

@article{3d_flowmatch_actor,
  author  = {Gkanatsios, Nikolaos and Xu, Jiahe and Bronars, Matthew and Mousavian, Arsalan and Ke, Tsung-Wei and Fragkiadaki, Katerina},
  title   = {3D FlowMatch Actor: Unified 3D Policy for Single- and Dual-Arm Manipulation},
  journal = {arXiv},
  year    = {2025}
}