Exploring and Distilling Cross-Modal Information for Image Captioning

Cited by: 0
Authors
Liu, Fenglin [1 ]
Ren, Xuancheng [2 ]
Liu, Yuanxin [3 ]
Lei, Kai [1 ]
Sun, Xu [2 ]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn SECE, Shenzhen Key Lab Informat Centr Networking & Bloc, Beijing, Peoples R China
[2] Peking Univ, Sch EECS, MOE Key Lab Computat Linguist, Beijing, Peoples R China
[3] Beijing Univ Posts & Telecommun, Sch ICE, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still struggle to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of the image grounded in caption contexts, by extracting salient region groupings and attribute collocations, and locally extracts fine-grained regions and attributes, guided by the aspect vector, for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 on the offline COCO evaluation, with remarkable efficiency in terms of accuracy, speed, and parameter budget.
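The two-stage scheme the abstract describes (globally explore an aspect vector from the caption context, then locally distill fine-grained regions and attributes with it) can be sketched roughly as follows. This is a minimal illustration under assumed shapes and a single-query scaled dot-product attention form; the variable names and the additive fusion of the two aspect vectors are assumptions for clarity, not the paper's actual GLIED architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Scaled dot-product attention with a single query vector.
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 8
regions = rng.normal(size=(36, d))     # visual region features (e.g. detector outputs)
attributes = rng.normal(size=(20, d))  # semantic attribute embeddings
context = rng.normal(size=(d,))        # caption context at the current decoding step

# Global exploring: summarize each modality into an "aspect vector"
# conditioned on the caption context.
vis_aspect, _ = attend(context, regions, regions)
sem_aspect, _ = attend(context, attributes, attributes)
aspect = vis_aspect + sem_aspect

# Local distilling: re-attend with the aspect vector to pick out the
# fine-grained regions and attributes used for word selection.
fine_regions, w_regions = attend(aspect, regions, regions)
fine_attrs, w_attrs = attend(aspect, attributes, attributes)

print(fine_regions.shape, w_regions.shape)  # (8,) (36,)
```

The point of the sketch is the ordering: the same attention primitive is applied twice, first with the language context as the query (exploring) and then with the resulting cross-modal summary as the query (distilling).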
Pages: 5095-5101 (7 pages)
Related Papers (50 records)
  • [31] Cross-Modal Localization Through Mutual Information
    Alempijevic, Alen
    Kodagoda, Sarath
    Dissanayake, Gamini
    2009 IEEE-RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, 2009, : 5597 - 5602
  • [32] Cross-modal information integration in category learning
    Smith, J. David
    Johnston, Jennifer J. R.
    Musgrave, Robert D.
    Zakrzewski, Alexandria C.
    Boomer, Joseph
    Church, Barbara A.
    Ashby, F. Gregory
    ATTENTION PERCEPTION & PSYCHOPHYSICS, 2014, 76 (05) : 1473 - 1484
  • [33] Mechanism of Cross-modal Information Influencing Taste
    Liang, Pei
    Jiang, Jia-yu
    Liu, Qiang
    Zhang, Su-lin
    Yang, Hua-jing
    CURRENT MEDICAL SCIENCE, 2020, 40 (03) : 474 - 479
  • [34] Mechanism of Cross-modal Information Influencing Taste
    Pei Liang
    Jia-yu Jiang
    Qiang Liu
    Su-lin Zhang
    Hua-jing Yang
    Current Medical Science, 2020, 40 : 474 - 479
  • [35] Information Recovery Technology for Cross-Modal Communications
    Xu J.-B.
    Wei X.
    Zhou L.
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2022, 50 (07): : 1631 - 1642
  • [36] Cross-modal information integration in category learning
    J. David Smith
    Jennifer J. R. Johnston
    Robert D. Musgrave
    Alexandria C. Zakrzewski
    Joseph Boomer
    Barbara A. Church
    F. Gregory Ashby
    Attention, Perception, & Psychophysics, 2014, 76 : 1473 - 1484
  • [37] CROSS2STRA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment
    Wu, Shengqiong
    Fei, Hao
    Ji, Wei
    Chua, Tat-Seng
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 2593 - 2608
  • [38] EXPLORING DUAL STREAM GLOBAL INFORMATION FOR IMAGE CAPTIONING
    Xian, Tiantao
    Li, Zhixin
    Chen, Tianyu
    Ma, Huifang
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4458 - 4462
  • [39] Cross-modal learning using privileged information for long-tailed image classification
    Li, Xiangxian
    Zheng, Yuze
    Ma, Haokai
    Qi, Zhuang
    Meng, Xiangxu
    Meng, Lei
    COMPUTATIONAL VISUAL MEDIA, 2024, 10 (05) : 981 - 992
  • [40] The Cross-Modal and Cross-Cultural Processing of Affective Information
    Esposito, Anna
    Riviello, Maria Teresa
    NEURAL NETS WIRN10, 2011, 226 : 301 - 310