LUOR: A Framework for Language Understanding in Object Retrieval and Grasping

Cited by: 0
Authors
Dongmin Yoon [1 ]
Seonghun Cha [2 ]
Yoonseon Oh [1 ]
Affiliations
[1] Hanyang University, Department of Artificial Intelligence
[2] Hanyang University, Department of Electronic Engineering
Keywords
Grasp detection; multi-modal learning; robotic object retrieval
DOI
10.1007/s12555-024-0527-7
Abstract
In human-centered environments, assistive robots must understand verbal commands to retrieve and grasp objects in complex scenes. Previous research on natural language object retrieval has mainly focused on commands that explicitly mention an object’s name. In real-world environments, however, responding to implicit commands based on an object’s function is also essential. To address this problem, we introduce a new dataset consisting of 712 verb-object pairs covering 78 verbs for 244 ImageNet classes and 336 verb-object pairs covering 54 verbs for 138 ObjectNet classes. Using this dataset, we propose a language understanding object retrieval (LUOR) module built by fine-tuning the CLIP text encoder. This approach learns the downstream object retrieval task effectively while preserving object classification performance. Additionally, we integrate LUOR with a YOLOv3-based multi-task detection (MTD) module for simultaneous object and grasp pose detection, enabling a robot manipulator to accurately grasp objects specified by verbal commands in complex environments containing multiple objects. Our results demonstrate that LUOR outperforms CLIP on both explicit and implicit retrieval tasks while preserving object classification accuracy on both the ImageNet and ObjectNet datasets. The real-world applicability of the integrated system is also demonstrated through experiments with a Franka Panda manipulator.
Pages: 530-540
Page count: 10
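
The abstract describes scoring candidate objects against explicit or implicit verbal commands using a fine-tuned CLIP text encoder. As a rough illustration only, the minimal sketch below shows how such text-to-text scoring could be wired up with the standard OpenAI CLIP package; the prompt template, the checkpoint path, and the rank_objects helper are hypothetical and are not taken from the paper.

```python
# Minimal sketch (not the paper's code) of ranking detected object labels against
# a verbal command with a CLIP text encoder, as LUOR does after fine-tuning on
# verb-object pairs. Prompt template and checkpoint path below are assumptions.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
# Hypothetical fine-tuned text-encoder weights (not provided by the paper's record):
# model.load_state_dict(torch.load("luor_finetuned_text_encoder.pt"))

def rank_objects(command: str, candidate_labels: list[str]) -> list[tuple[str, float]]:
    """Rank candidate object labels by cosine similarity to the command."""
    with torch.no_grad():
        cmd_tokens = clip.tokenize([command]).to(device)
        obj_tokens = clip.tokenize([f"a photo of a {c}" for c in candidate_labels]).to(device)
        cmd_emb = model.encode_text(cmd_tokens)
        obj_emb = model.encode_text(obj_tokens)
        # Normalize so the dot product is a cosine similarity.
        cmd_emb = cmd_emb / cmd_emb.norm(dim=-1, keepdim=True)
        obj_emb = obj_emb / obj_emb.norm(dim=-1, keepdim=True)
        scores = (cmd_emb @ obj_emb.T).squeeze(0)
    return sorted(zip(candidate_labels, scores.tolist()), key=lambda x: -x[1])

# Example: an implicit, function-based command over classes a detector might return.
print(rank_objects("hand me something to cut this rope", ["scissors", "banana", "mug"]))
```

In the full system described by the abstract, the candidate labels would come from the YOLOv3-based multi-task detection module, and the top-ranked object's grasp pose would be passed to the manipulator.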