TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Cited by: 35
Authors
He, Dailan [1 ]
Zhao, Yusheng [1 ]
Luo, Junyu [1 ]
Hui, Tianrui [2 ]
Huang, Shaofei [2 ]
Zhang, Aixi [3 ]
Liu, Si [4 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[3] Alibaba Grp, Beijing, Peoples R China
[4] Inst Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
3D visual grounding; transformer; entity attention; relation attention;
DOI
10.1145/3474085.3475397
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The recently proposed fine-grained 3D visual grounding task is essential and challenging: its goal is to identify the 3D object referred to by a natural language sentence among distracting objects of the same category. Existing works usually adopt dynamic graph networks to indirectly model intra/inter-modal interactions, which makes it difficult for the model to distinguish the referred object from distractors because visual and linguistic contents are represented monolithically. In this work, we exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data and propose a TransRefer3D network that extracts entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Facilitated by co-attention, our EA module matches visual entity features with linguistic entity features, while our RA module matches pair-wise visual relation features with linguistic relation features. We further integrate the EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form our TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both the Nr3D and Sr3D datasets demonstrate that our model significantly outperforms existing approaches by up to 10.6% and achieves new state-of-the-art performance. To the best of our knowledge, this is the first work to investigate the Transformer architecture for the fine-grained 3D visual grounding task.
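To make the block structure described in the abstract concrete, below is a minimal PyTorch sketch of one Entity-and-Relation aware Contextual Block (ERCB), written from the abstract alone rather than from the authors' released code. The feature dimension, the way entity/relation word features are assumed to be pre-split from the sentence encoding, and the pooling and fusion choices are all illustrative assumptions; what the sketch does show is the EA step (per-object visual features cross-attending to entity words) and the RA step (pairwise visual relation features cross-attending to relation words), with several blocks stacked for hierarchical context modeling.

```python
# Minimal sketch of an Entity-and-Relation aware Contextual Block (ERCB),
# assuming PyTorch >= 1.9. This is NOT the authors' implementation; the
# dimensions, pooling, and fusion choices below are illustrative only.
import torch
import torch.nn as nn


class ERCB(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Entity-aware Attention (EA): visual object features attend to entity words.
        self.entity_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Relation-aware Attention (RA): pairwise visual relation features attend to relation words.
        self.relation_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Project concatenated object pairs (i, j) into pairwise relation features.
        self.pair_proj = nn.Linear(2 * d_model, d_model)
        # Fuse the entity and relation contexts back into per-object features.
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, obj_feats, lang_entity, lang_relation):
        # obj_feats:     (B, N, d)  per-object visual features from a point-cloud encoder
        # lang_entity:   (B, Le, d) entity-related word features
        # lang_relation: (B, Lr, d) relation-related word features
        B, N, d = obj_feats.shape

        # EA: match visual entity features with linguistic entity features.
        ent_ctx, _ = self.entity_attn(obj_feats, lang_entity, lang_entity)

        # Build pairwise visual relation features for every ordered object pair (i, j).
        pair = torch.cat(
            [obj_feats.unsqueeze(2).expand(B, N, N, d),
             obj_feats.unsqueeze(1).expand(B, N, N, d)],
            dim=-1)
        rel = self.pair_proj(pair).reshape(B, N * N, d)

        # RA: match pairwise visual relations with linguistic relation features,
        # then pool each object's outgoing relations back to a per-object context.
        rel_ctx, _ = self.relation_attn(rel, lang_relation, lang_relation)
        rel_ctx = rel_ctx.reshape(B, N, N, d).mean(dim=2)

        # Residual fusion of the entity and relation contexts.
        return self.norm(obj_feats + self.fuse(torch.cat([ent_ctx, rel_ctx], dim=-1)))


if __name__ == "__main__":
    # Stack several blocks, as the paper does, for hierarchical context modeling.
    blocks = nn.ModuleList([ERCB() for _ in range(3)])
    objs = torch.randn(2, 8, 256)      # 8 candidate objects per scene
    words_e = torch.randn(2, 5, 256)   # entity words
    words_r = torch.randn(2, 4, 256)   # relation words
    for blk in blocks:
        objs = blk(objs, words_e, words_r)
    print(objs.shape)                  # torch.Size([2, 8, 256])
```

In the paper, the refined per-object features produced by the final block would feed a grounding head that scores each candidate object against the sentence; that head is omitted here.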
Pages: 2344 - 2352
Number of pages: 9
Related papers
50 items in total
  • [21] Jointly Optimizing 3D Model Fitting and Fine-Grained Classification
    Lin, Yen-Liang
    Morariu, Vlad I.
    Hsu, Winston
    Davis, Larry S.
    COMPUTER VISION - ECCV 2014, PT IV, 2014, 8692 : 466 - 480
  • [22] Designer alloy enables 3D printing of fine-grained metals
    Clarke, Amy J.
    Nature, 2019, 576 (7785) : 41 - 42
  • [23] A Refined 3D Pose Dataset for Fine-Grained Object Categories
    Wang, Yaming
    Tan, Xiao
    Yang, Yi
    Li, Ziyu
    Liu, Xiao
    Zhou, Feng
    Davis, Larry S.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 2797 - 2806
  • [24] Bootstrapping vision-language transformer for monocular 3D visual grounding
    Lei, Qi
    Sun, Shijie
    Song, Xiangyu
    Song, Huansheng
    Feng, Mingtao
    Wu, Chengzhong
    IET IMAGE PROCESSING, 2025, 19 (01)
  • [25] Revisiting 3D visual grounding with Context-aware Feature Aggregation
    Guo, Peng
    Zhu, Hongyuan
    Ye, Hancheng
    Li, Taihao
    Chen, Tao
    NEUROCOMPUTING, 2024, 601
  • [26] Learning Fine-Grained Segmentation of 3D Shapes without Part Labels
    Wang, Xiaogang
    Sun, Xun
    Cao, Xinyu
    Xu, Kai
    Zhou, Bin
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 10271 - 10280
  • [27] Arms: A Fine-grained 3D AQI Realtime Monitoring System by UAV
    Yang, Yuzhe
    Zheng, Zijie
    Bian, Kaigui
    Jiang, Yun
    Song, Lingyang
    Han, Zhu
    GLOBECOM 2017 - 2017 IEEE GLOBAL COMMUNICATIONS CONFERENCE, 2017,
  • [28] Design and implementation of fine-grained realistic 3D virtual simulation experiment
    Zhang, H.
    Shi, L.
    Wang, J.
    Cao, M.
    Applied Mathematics and Nonlinear Sciences, 2024, 9 (01)
  • [29] DCNet: exploring fine-grained vision classification for 3D point clouds
    Wu, Rusong
    Bai, Jing
    Li, Wenjing
    Jiang, Jinzhe
    VISUAL COMPUTER, 2024, 40 (02) : 781 - 797
  • [30] SketchANIMAR: Sketch-based 3D animal fine-grained retrieval
    Le, Trung-Nghia
    Nguyen, Tam V.
    Le, Minh-Quan
    Nguyen, Trong-Thuan
    Huynh, Viet-Tham
    Do, Trong-Le
    Le, Khanh-Duy
    Tran, Mai-Khiem
    Hoang-Xuan, Nhat
    Nguyen-Ho, Thang-Long
    Nguyen, Vinh-Tiep
    Le-Pham, Nhat-Quynh
    Pham, Huu-Phuc
    Hoang, Trong-Vu
    Nguyen, Quang-Binh
    Nguyen-Mau, Trong-Hieu
    Huynh, Tuan-Luc
    Le, Thanh-Danh
    Nguyen-Ha, Ngoc-Linh
    Truong-Thuy, Tuong-Vy
    Phong, Truong Hoai
    Diep, Tuong-Nghiem
    Ho, Khanh-Duy
    Nguyen, Xuan-Hieu
    Tran, Thien-Phuc
    Yang, Tuan-Anh
    Tran, Kim-Phat
    Hoang, Nhu-Vinh
    Nguyen, Minh-Quang
    Vo, Hoai-Danh
    Doan, Minh-Hoa
    Nguyen, Hai-Dang
    Sugimoto, Akihiro
    Tran, Minh-Triet
    COMPUTERS & GRAPHICS-UK, 2023, 116 : 150 - 161