Exploring and Distilling Cross-Modal Information for Image Captioning

Cited: 0
Authors
Liu, Fenglin [1 ]
Ren, Xuancheng [2 ]
Liu, Yuanxin [3 ]
Lei, Kai [1 ]
Sun, Xu [2 ]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn SECE, Shenzhen Key Lab Informat Centr Networking & Bloc, Beijing, Peoples R China
[2] Peking Univ, Sch EECS, MOE Key Lab Computat Linguist, Beijing, Peoples R China
[3] Beijing Univ Posts & Telecommun, Sch ICE, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still struggle to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 in offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.
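The global-then-local attention idea in the abstract can be illustrated with a minimal sketch: a global pass attends over visual regions and semantic attributes against the caption context to build an aspect vector, and a local pass re-attends to fine-grained features conditioned on that vector. This is an illustrative toy with dot-product attention and random features, not the authors' implementation; all function names, the averaging fusion, and the dimensions are assumptions for demonstration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each key row by its similarity to the query."""
    weights = softmax(keys @ query)    # (N,) attention distribution
    return weights @ keys, weights     # weighted sum over keys, shape (d,)

def global_local_attention(regions, attributes, context):
    """Two-stage sketch: a global pass summarizes each modality against the
    caption context into an aspect vector; a local pass distills fine-grained
    regions and attributes in reference to that aspect vector."""
    # Global exploration: context-conditioned summaries of each modality.
    vis_global, _ = attend(context, regions)
    sem_global, _ = attend(context, attributes)
    aspect = (vis_global + sem_global) / 2   # simple fusion (an assumption)
    # Local distillation: re-attend to fine-grained features via the aspect.
    vis_local, vw = attend(aspect, regions)
    sem_local, sw = attend(aspect, attributes)
    return np.concatenate([vis_local, sem_local]), vw, sw

# Toy example: 5 region features and 4 attribute embeddings of dimension 8.
rng = np.random.default_rng(0)
fused, vw, sw = global_local_attention(
    rng.normal(size=(5, 8)), rng.normal(size=(4, 8)), rng.normal(size=8))
```

The fused vector (here of dimension 16) would feed the decoder's word-selection step; the actual model additionally extracts region groupings and attribute collocations, which this sketch omits.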
Pages: 5095-5101
Page count: 7