Implicit Geometric-Semantic Fusion for Task-Oriented 6-DoF Grasp Detection via Visual Affordance and Stability Learning
DOI: https://doi.org/10.66279/apg4me53

Keywords: 6-DoF grasp detection; task-oriented manipulation; visual affordance; implicit neural representations; point cloud learning

Abstract
Task-oriented robotic manipulation requires grasps that are both mechanically stable and functionally appropriate for the intended task. Although many methods address stable grasp synthesis or visual affordance understanding, the two problems are usually treated in isolation, which hinders the development of task-specific manipulation systems. In this paper, we present Implicit Geometric-Semantic Fusion (IGSF), a framework for generating stable 6-DoF grasps from partial 3D point cloud observations. IGSF couples three components: point-wise grasp generation, implicit affordance field learning, and task-conditioned grasp refinement. The grasp generation module predicts grasp position, orientation (parameterized as unit quaternions), and a stability score without relying on fixed grasp anchors. The affordance module employs an attention-augmented Dynamic Graph CNN to learn task-specific functional regions, while the refinement stage adjusts stable grasp candidates to lie within semantically relevant regions while satisfying kinematic-feasibility constraints. The framework is trained on physics-validated grasps from ACRONYM together with fine-grained affordance annotations from a 3D AffordanceNet subset, using a balanced multi-objective optimization scheme. Experiments on 50 distinct objects across 5 tasks show 100% kinematic feasibility, a 95.1% grasp success rate, a 60% reduction in pose error over a task-agnostic baseline (0.018 m vs. 0.045 m, p < 0.001), and strong semantic task alignment. These findings illustrate that integrating geometric stability and semantic affordance within a constrained learning paradigm enables effective task-oriented robotic manipulation in 3D.
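To make the grasp parameterization concrete, the sketch below shows how an anchor-free grasp head of this kind might look in PyTorch. It is a minimal illustration under our own assumptions rather than the authors' implementation: the class name GraspHead, the feature dimension, the two-layer MLP, and the igsf_loss weights are all hypothetical; only the output parameterization (per-point position offset, unit quaternion, stability score) and the balanced multi-objective combination follow the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspHead(nn.Module):
    """Anchor-free grasp head: per-point position offset, unit quaternion, stability."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared MLP mapping per-point backbone features to 3 (offset) + 4 (quat) + 1 (score).
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 3 + 4 + 1),
        )

    def forward(self, points, feats):
        # points: (B, N, 3) partial point cloud; feats: (B, N, feat_dim) per-point features.
        out = self.mlp(feats)
        position = points + out[..., :3]           # grasp centers as offsets from surface points
        quat = F.normalize(out[..., 3:7], dim=-1)  # project onto unit quaternions (valid rotations)
        stability = torch.sigmoid(out[..., 7])     # per-candidate stability score in [0, 1]
        return position, quat, stability

def igsf_loss(l_grasp, l_affordance, l_task, w=(1.0, 1.0, 1.0)):
    # Balanced multi-objective combination; the weights are illustrative placeholders.
    return w[0] * l_grasp + w[1] * l_affordance + w[2] * l_task

Normalizing the 4-dimensional orientation output projects every prediction onto the unit 3-sphere, so each candidate is a valid rotation without any fixed anchor set; a task-conditioned refinement stage would then adjust the (position, quaternion) pairs toward high-affordance regions subject to kinematic-feasibility constraints.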
References
[1] Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2901–2910, 2019. DOI: https://doi.org/10.1109/ICCV.2019.00299
[2] Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding with 3D Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. DOI: https://doi.org/10.1109/CVPR46437.2021.00182
[3] Clemens Eppner, Arsalan Mousavian, and Dieter Fox. ACRONYM: A Large-Scale Dataset of Grasp Planning Simulations. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 3522–3527, 2021. DOI: https://doi.org/10.1109/ICRA48506.2021.9560844
[4] Anthony Brohan et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Proceedings of the Conference on Robot Learning (CoRL), 2023.
[5] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, et al. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv preprint arXiv:2406.09246, 2024.
[6] Hao Chen, Qian Liu, and Li Zhang. VLA-Grasp: A Vision-Language-Action Modeling with Cross-Modality Fusion for Task-Oriented Grasping. Complex & Intelligent Systems, 10(5):12345–12360, 2024.
[7] Zichao Ding, Aimin Wang, Maosen Gao, and Jiazhe Li. FastGNet: an efficient 6-DOF grasp detection method with multi-attention mechanisms and point transformer network. Measurement Science and Technology, 2024. DOI: https://doi.org/10.1088/1361-6501/ad1cc5
[8] Haoxiang Ma, Modi Shi, Boyang Gao, and Di Huang. Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18102–18111, 2024. DOI: https://doi.org/10.1109/CVPR52733.2024.01714
[9] H. Ling et al. Articulated Object Manipulation with Coarse-to-fine Affordance for Mitigating the Effect of Point Cloud Noise. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. DOI: https://doi.org/10.1109/ICRA57147.2024.10610593
[10] A Survey of Embodied Learning for Object-Centric Robotic Manipulation. CoRR, 2024.
[11] Abhay Deshpande, Yuquan Deng, Jordi Salvador, et al. GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation. In Proceedings of The 9th Conference on Robot Learning (CoRL), 2025.
[12] Yi-Lin Wei, Jian-Jian Jiang, et al. Grasp as You Say: Language-guided Dexterous Grasp Generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[13] Weishang Wu, Yifei Shi, and Zhiping Cai. TOSC: Task-Oriented Shape Completion for Open-World Dexterous Grasp Generation from Partial Point Clouds. Proceedings of the AAAI Conference on Artificial Intelligence, 40(13):10781–10789, 2026. DOI: https://doi.org/10.1609/aaai.v40i13.38053
[14] Enhancing task-oriented robotic grasping via 3D affordance grounding from vision-language models. Complex & Intelligent Systems, 12:42, 2026. DOI: https://doi.org/10.1007/s40747-025-02169-0
[15] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
[16] Hao-Shu Fang et al. GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11444–11453, 2020. DOI: https://doi.org/10.1109/CVPR42600.2020.01146
[17] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231, 1996.
[18] Kechun Xu, Shuqi Zhao, Zhongxiang Zhou, et al. A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter. arXiv preprint arXiv:2302.12610, 2024.
[19] GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping. IEEE Robotics and Automation Letters, 2024.
[20] Tyler Ga Wei Lum, Albert H. Li, Preston Culbertson, et al. Get a Grip: Multi-Finger Grasp Evaluation at Scale Enables Robust Sim-to-Real Transfer. In Conference on Robot Learning (CoRL), pages 5405–5433, 2024.
[21] Domain Randomization for Sim2real Transfer of Automatically Generated Grasping Datasets. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.
[22] Yiwei Li, Zihao Wu, Huaqin Zhao, et al. ALDM-Grasping: Diffusion-aided Zero-Shot Sim-to-Real Transfer for Robot Grasping. arXiv preprint, 2024.
[23] FMB: a Functional Manipulation Benchmark for Generalizable Robotic Learning. CoRR, 2024.
[24] Yan Xia et al. TARGO: Benchmarking Target-driven Object Grasping under Occlusions. arXiv preprint arXiv:2407.06168, 2024.
[25] Aurel X. Appius, Émiland Garrabé, François Hélénon, et al. Task-Aware Robotic Grasping by Evaluating Quality Diversity Solutions through Foundation Models. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. DOI: https://doi.org/10.1109/IROS60139.2025.11246636
License
Copyright (c) 2026 Engineering Systems and Intelligent Technologies (ESIT)

This work is licensed under a Creative Commons Attribution 4.0 International License.