TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Cited by: 35
Authors
He, Dailan [1]
Zhao, Yusheng [1]
Luo, Junyu [1]
Hui, Tianrui [2]
Huang, Shaofei [2]
Zhang, Aixi [3]
Liu, Si [4]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[3] Alibaba Grp, Beijing, Peoples R China
[4] Inst Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
3D visual grounding; transformer; entity attention; relation attention
DOI
10.1145/3474085.3475397
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The recently proposed task of fine-grained 3D visual grounding is both essential and challenging: the goal is to identify the 3D object referred to by a natural language sentence among distracting objects of the same category. Existing works usually adopt dynamic graph networks to model the intra- and inter-modal interactions indirectly, which makes it difficult for the model to distinguish the referred object from distractors because the visual and linguistic contents are represented monolithically. In this work, we exploit the Transformer for its natural suitability for permutation-invariant 3D point cloud data and propose a TransRefer3D network that extracts entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Facilitated by a co-attention operation, the EA module matches visual entity features with linguistic entity features, while the RA module matches pair-wise visual relation features with linguistic relation features. We further integrate the EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both the Nr3D and Sr3D datasets demonstrate that the proposed model significantly outperforms existing approaches by up to 10.6% and achieves new state-of-the-art performance. To the best of our knowledge, this is the first work investigating the Transformer architecture for the fine-grained 3D visual grounding task.
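The abstract describes matching visual features against linguistic features through attention, once for single objects (entities) and once for object pairs (relations). The sketch below is only a toy illustration of that general idea using plain scaled dot-product cross-attention. It is not the authors' EA or RA module. The function names, the toy feature values, and the use of a feature difference as a pair-wise relation feature are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Returns (outputs, weights), where weights[i][j] is how strongly
    query i attends to key j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy inputs: 3 object features and 2 word features, both 2-D.
object_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_feats = [[2.0, 0.0], [0.0, 2.0]]

# Entity-level matching (assumed form): objects query the words.
entity_ctx, entity_attn = cross_attention(object_feats, word_feats, word_feats)

# Relation-level matching (assumed form): one feature per ordered object pair,
# here simply the difference of the two object features.
pairs = [(i, j) for i in range(3) for j in range(3) if i != j]
relation_feats = [
    [a - b for a, b in zip(object_feats[i], object_feats[j])] for i, j in pairs
]
relation_ctx, relation_attn = cross_attention(relation_feats, word_feats, word_feats)
```

In this toy setup each row of `entity_attn` is a distribution over the two words, and `relation_attn` holds one such distribution per ordered object pair. A real model would use learned projections for queries, keys, and values, which are omitted here.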
Pages: 2344-2352
Page count: 9