UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping


Microsoft Research Asia


Performance comparison of our UniGraspTransformer (red) with UniDexGrasp (blue) and UniDexGrasp++ (green) across the state-based (purple) and vision-based (orange) settings. For each setting, success rates are evaluated on seen objects, unseen objects within seen categories, and entirely unseen objects from unseen categories.



Real-World Deployment of our UniGraspTransformer.



Simulation Results of our UniGraspTransformer.


Abstract

We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. This approach allows UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks to handle thousands of objects with diverse poses. It also generalizes well to both idealized and real-world inputs, as evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects of various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over the state-of-the-art method, UniDexGrasp++, across various object categories, with success-rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects from unseen categories, respectively, in the vision-based setting.
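
To make the architecture described above concrete, here is a minimal PyTorch sketch of a universal grasp policy built from a stack of 12 self-attention blocks. The token layout, embedding width, and action dimension (e.g., wrist pose plus finger joint targets) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class UniGraspPolicy(nn.Module):
    """Sketch: per-step input tokens (hand state, object features, etc.)
    pass through self-attention blocks, then pool into a hand action.
    All dimensions here are assumptions for illustration."""
    def __init__(self, num_tokens=8, dim=256, num_blocks=12,
                 num_heads=8, action_dim=24):
        super().__init__()
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads,
            dim_feedforward=4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=num_blocks)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, tokens):                   # tokens: (B, num_tokens, dim)
        x = self.blocks(tokens)                  # self-attention over tokens
        return self.action_head(x.mean(dim=1))   # pool tokens -> action
```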


UniGraspTransformer


Overview. (a) Dedicated policy network training: each individual RL policy network is trained to grasp a specific object under various initial poses. (b) Grasp trajectory generation: each policy network generates M successful grasp trajectories, which together form a trajectory set D. (c) UniGraspTransformer training: trajectories from D are used to train UniGraspTransformer, a universal grasp network, in a supervised manner (a training sketch follows below). We investigate two settings, state-based and vision-based, whose primary difference is the input representation of the object state and hand-object distances, as indicated by * in the figure.
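
The offline distillation step (c) amounts to behavior cloning on the successful trajectories in D. The sketch below shows one plausible training loop, assuming a data loader that yields (observation tokens, expert action) pairs sampled from D and a plain MSE imitation loss; both are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distill(policy, loader, epochs=10, lr=1e-4, device="cuda"):
    """Offline distillation sketch: supervised regression of the universal
    policy onto expert actions from the RL-generated trajectory set D."""
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    policy.to(device).train()
    for _ in range(epochs):
        for obs_tokens, expert_action in loader:  # batches sampled from D
            pred = policy(obs_tokens.to(device))
            loss = F.mse_loss(pred, expert_action.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Because the distillation is fully offline, the universal network never interacts with the simulator during training, which is what keeps the pipeline simple relative to iterative online distillation.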

Experiment Results



Universal Policy. Comparison with state-of-the-art methods using a universal model for dexterous robotic grasping across both state-based and vision-based settings, evaluated by success rate. Evaluation on unseen objects from either seen or unseen categories assesses the models' generalization capability. "Obj." refers to objects, and "Cat." refers to categories.
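
For clarity on how such numbers are obtained, the sketch below computes a success rate by rolling out the policy several times per object (varying the initial pose) and averaging, reported separately for each split. The `rollout_fn(policy, obj) -> bool` simulator hook and the trial count are hypothetical.

```python
def success_rate(policy, objects, rollout_fn, trials_per_object=5):
    """Evaluation sketch: fraction of successful grasp rollouts."""
    successes = total = 0
    for obj in objects:
        for _ in range(trials_per_object):
            successes += bool(rollout_fn(policy, obj))
            total += 1
    return successes / total

# Reported per split, e.g.:
#   success_rate(policy, seen_objects, rollout)
#   success_rate(policy, unseen_objects_seen_categories, rollout)
#   success_rate(policy, unseen_objects_unseen_categories, rollout)
```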



Quantitative comparison of grasp pose diversity. Compared to UniDexGrasp++, our UniGraspTransformer demonstrates a broader range of grasping strategies, highlighting its ability to generate diverse grasping poses across a variety of objects.
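
One simple way to quantify pose diversity, shown below as an illustrative measure rather than the paper's exact metric, is the mean pairwise distance between final hand joint configurations across grasps: a policy that reuses one canonical pose scores near zero, while varied grasps score higher.

```python
import numpy as np

def pose_diversity(final_joint_angles):
    """Illustrative metric (an assumption, not the paper's definition):
    mean pairwise L2 distance between final joint configurations.
    final_joint_angles: (N, J) array, N grasps with J joints each."""
    q = np.asarray(final_joint_angles)
    n = len(q)
    if n < 2:
        return 0.0
    diffs = q[:, None, :] - q[None, :, :]   # (N, N, J) pairwise deltas
    dists = np.linalg.norm(diffs, axis=-1)  # (N, N) pairwise distances
    return dists.sum() / (n * (n - 1))      # average over ordered pairs
```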



Qualitative comparison of grasp pose diversity. (a) Using iterative online distillation, UniDexGrasp++ tends to grasp different objects with similar poses, with the two middle fingers twisted and floating. (b) Using large-model offline distillation, our UniGraspTransformer adapts to objects of varying shapes, employing a diverse range of grasp poses.