Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards

Cited by: 7
Authors
Wu, Chunlei [1]
Yuan, Shaozu [1]
Cao, Haiwen [1]
Wei, Yiwei [2]
Wang, Leiquan [1]
Affiliations
[1] China Univ Petr, Coll Comp Sci & Technol, Qingdao 266580, Peoples R China
[2] China Univ Petr Beijing Karamay, Sch Petr Engn, Karamay 834000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image caption; reinforcement learning; attention mechanism
DOI
10.1109/ACCESS.2020.2981513
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Image captioning based on reinforcement learning (RL) has achieved significant success recently. Most of these methods take the CIDEr score as the reward of the reinforcement learning algorithm to compute gradients, thereby refining the baseline image captioning model. However, the CIDEr score is not the sole criterion for judging the quality of a generated caption. In this paper, a Hierarchical Attention Fusion (HAF) model is presented as a baseline for RL-based image captioning, in which multi-level feature maps of ResNet are integrated with hierarchical attention. A Revaluation Network (REN) is used to revaluate the CIDEr score by assigning a weight to each word according to its importance in the generated caption. The weighted reward can be regarded as a word-level reward. Moreover, a Scoring Network (SN) is implemented to score the generated sentence against its corresponding ground truth from a batch of captions. This reward benefits from additional unmatched ground truth and acts as a sentence-level reward. Experimental results on the COCO dataset show that the proposed methods achieve competitive performance compared with related image captioning methods.
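The abstract does not give the formula the Revaluation Network uses, so the following is only a minimal sketch of the general idea of turning one sentence-level score into per-word rewards for a policy-gradient update. Everything in it is an assumption made for illustration: the function names, the softmax normalization of importance scores, the scaling that keeps the mean weight at 1, and the single scalar baseline. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_rewards(sentence_reward, importance_scores):
    """Spread one sentence-level reward (e.g. a CIDEr score) over words.

    Hypothetical scheme: weights come from a softmax over per-word
    importance scores and are rescaled so they average to 1, which
    means uniform importance gives every word the original reward.
    """
    weights = softmax(importance_scores)
    n = len(weights)
    return [sentence_reward * w * n for w in weights]


def policy_gradient_loss(log_probs, word_rewards, baseline):
    """REINFORCE-style loss with a per-word reward and a scalar baseline.

    log_probs[t] is the log-probability of the sampled word at step t.
    """
    return -sum((r - baseline) * lp for lp, r in zip(log_probs, word_rewards))


if __name__ == "__main__":
    # Illustrative numbers only; the importance scores are made up.
    rewards = word_level_rewards(1.2, [0.1, 2.0, 0.3, 1.5])
    print(rewards)
    print(policy_gradient_loss([-0.5, -1.0, -0.2, -0.8], rewards, baseline=1.0))
```

With uniform importance scores this reduces to the usual setup in which every word receives the same sentence-level reward, which is the case the abstract contrasts the word-level reward against.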
Pages: 57943-57951
Number of pages: 9