Grasping large, flat objects (e.g., a book or a pan) is often regarded as an ungraspable task, and it poses significant challenges because the required grasp poses are unreachable. Prior research has exploited environmental interactions through extrinsic dexterity, using external structures such as walls or table edges to facilitate grasping. However, these methods are confined to task-specific policies and neglect the semantic perception and planning needed to identify optimal pre-grasp configurations. This limits their versatility and impedes adaptation to varied extrinsic dexterity constraints. In this work, we present ExDiff, a robot manipulation approach for extrinsic dexterity grasping in unrestricted environments. It uses Vision-Language Models (VLMs) to perceive the environmental state and generate instructions, followed by a Goal-Conditioned Action Diffusion (GCAD) model that predicts the sequence of low-level actions. The diffusion model learns the low-level policy conditioned on high-level instructions and cumulative rewards, which improves the generation of robot actions. Simulation experiments and real-world deployments demonstrate that ExDiff effectively performs ungraspable tasks and generalizes to previously unseen target objects and scenes.
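To make the idea of conditioning a diffusion policy on an instruction and a cumulative reward concrete, the following is a minimal sketch of a standard DDPM-style reverse sampling loop over an action sequence. It is not the ExDiff implementation: the instruction embeddings, `toy_target`, `oracle_eps`, the noise schedule, and all dimensions are hypothetical stand-ins, and the oracle noise predictor replaces what would be a trained network.

```python
import numpy as np

# Hypothetical instruction embeddings (assumption): a real system would
# encode the text instruction produced by the VLM.
INSTRUCTION_EMBEDDINGS = {
    "push_to_wall": np.array([1.0, 0.0]),
    "grasp": np.array([0.0, 1.0]),
}


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Linear DDPM noise schedule (illustrative values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def toy_target(goal, ret, horizon):
    """Hypothetical 'clean' action sequence for a given condition."""
    return np.tile(goal * ret, (horizon, 1))


def oracle_eps(x_t, alpha_bar_t, goal, ret):
    """Stand-in for a trained noise predictor eps(x_t, t, goal, ret).

    It inverts x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
    using the toy target as x0, so no learning is involved.
    """
    x0 = toy_target(goal, ret, x_t.shape[0])
    return (x_t - np.sqrt(alpha_bar_t) * x0) / np.sqrt(1.0 - alpha_bar_t)


def sample_actions(instruction, ret, horizon=4, num_steps=50, seed=0):
    """Denoise Gaussian noise into a (horizon, action_dim) action sequence."""
    rng = np.random.default_rng(seed)
    goal = INSTRUCTION_EMBEDDINGS[instruction]
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal((horizon, goal.shape[0]))
    for t in reversed(range(num_steps)):
        eps = oracle_eps(x, alpha_bars[t], goal, ret)
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


actions = sample_actions("push_to_wall", ret=0.5)
```

Because the oracle predictor already knows the toy target, the loop recovers it on the final step. In a learned policy, the predictor would instead be a network trained on demonstration data, with the instruction and the cumulative reward supplied as conditioning inputs.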
The videos here are played at 4x speed, except for the grasping operation, which is shown at 1x speed.