Visual-Haptic-Kinesthetic Object Recognition with Multimodal Transformer

Cited by: 1
Authors
Zhou, Xinyuan [1]
Lan, Shiyong [1]
Wang, Wenwu [2]
Li, Xinyang [1]
Zhou, Siyuan [1]
Yang, Hongyu [1]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[2] Univ Surrey, Guildford GU2 7XH, Surrey, England
Keywords
Object Recognition; Multimodal Deep Learning; Multimodal Fusion; Attention Mechanism; Tactile Fusion; Network
DOI
10.1007/978-3-031-44195-0_20
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Humans recognize objects by combining multi-sensory information in a coordinated fashion. However, vision-based and haptic-based object recognition remain two separate research directions in robotics. Visual images and haptic time series have different properties, which makes it difficult for robots to fuse them for object recognition as humans do. In this work, we propose an architecture that fuses visual, haptic and kinesthetic data for object recognition, based on multimodal Convolutional Recurrent Neural Networks with a Transformer. We use Convolutional Neural Networks (CNNs) to learn spatial representations, Recurrent Neural Networks (RNNs) to model temporal relationships, and the Transformer's self-attention and cross-attention structures to focus on global and cross-modal information. We propose two fusion methods and conduct experiments on the multimodal AU dataset. The results show that our model achieves higher accuracy than the latest multimodal object recognition methods. We also conduct an ablation study on the individual input components to demonstrate the importance of multimodal information in object recognition. The code will be available at https://github.com/SYLan2019/VHKOR.
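The cross-attention step mentioned in the abstract can be illustrated with a minimal sketch of scaled dot-product attention, in which features from one modality (queries) attend over features from another (keys and values). This is not the authors' implementation: the function names, the toy feature values, and the omission of learned query/key/value projections and multiple heads are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Returns the attended outputs and the attention weights.
    Learned projections are deliberately left out (assumption).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values))
             for c in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical toy features, not taken from the AU dataset:
# three visual tokens (e.g. CNN outputs) and two haptic steps (e.g. RNN states).
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
haptic_steps = [[0.5, 0.5], [1.0, -1.0]]

# Haptic features query the visual features.
fused, attn = cross_attention(haptic_steps, visual_tokens, visual_tokens)
```

Each row of `attn` sums to 1 and shows how strongly one haptic step weights each visual token. Swapping the argument order would let visual features query the haptic sequence instead.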
Pages: 233-245
Number of pages: 13