HIST: Hierarchical and sequential transformer for image captioning

Cited: 0
Authors
Lv, Feixiao [1 ,2 ]
Wang, Rui [1 ,2 ]
Jing, Lihua [1 ,2 ]
Dai, Pengwen [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyberspace Secur, Beijing, Peoples R China
[3] Sun Yat Sen Univ, Sch Cyber Sci & Technol, Shenzhen Campus, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
computer vision; feature extraction; learning (artificial intelligence); neural nets;
DOI
10.1049/cvi2.12305
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Image captioning aims to automatically generate a natural-language description of a given image, and most state-of-the-art models adopt an encoder-decoder transformer framework. Such transformer structures, however, show two main limitations in the image-captioning task. First, the traditional transformer decodes only high-level fused features while ignoring features at other levels, resulting in losses of image content. Second, the transformer is weak at modelling the natural word-order characteristics of language. To address these issues, the authors propose a HIerarchical and Sequential Transformer (HIST) structure, which forces each layer of the encoder and decoder to focus on features of a different granularity and strengthens the sequential semantic information. Specifically, to capture details at different feature levels in the image, the authors combine the visual features of multiple regions and divide them into multiple levels in different ways. In addition, to enhance sequential information, a sequential enhancement module in each decoder layer block extracts features at different levels for sequential semantic extraction and expression. Extensive experiments on the public MS-COCO and Flickr30k datasets demonstrate the effectiveness of the proposed method and show that it outperforms most previous state-of-the-art methods. The authors propose hierarchical encoder-decoder blocks in their novel hierarchical and sequential transformer to capture multi-granularity image information, combined with a sequential enhancement module to generate rich and fluent image descriptions. The method demonstrated strong performance in comparison with numerous state-of-the-art methods on the MS-COCO dataset.
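The two ideas the abstract describes, feeding each decoder layer a different granularity of visual features, and enforcing left-to-right (sequential) order during decoding, can be illustrated with a minimal NumPy sketch. This is not the authors' HIST implementation; the level split, dimensions, and single-head attention here are hypothetical simplifications chosen only to make the mechanism concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; masked positions get -1e9 before softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

rng = np.random.default_rng(0)

# Toy setup: 8 region features of dimension 4, split into two "levels"
# (a stand-in for the paper's multi-granularity division of region features).
regions = rng.normal(size=(8, 4))
levels = [regions[:4], regions[4:]]        # hypothetical coarse/fine split

# 5 partially generated word embeddings; a lower-triangular causal mask
# lets each position attend only to itself and earlier words, which is
# the usual way sequential (left-to-right) order is enforced.
words = rng.normal(size=(5, 4))
causal = np.tril(np.ones((5, 5), dtype=bool))
out = attention(words, words, words, causal)

# Each "decoder layer" cross-attends to one feature level, so different
# layers see different granularities instead of one fused representation.
for level in levels:
    out = attention(out, level, level)

print(out.shape)  # (5, 4): one refined representation per word position
```

The causal mask and the per-layer level assignment are the two knobs being demonstrated; in a real model each step would also include residual connections, layer normalisation, and multi-head projections.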
Pages: 1043-1056 (14 pages)
Related Papers (50 in total)
  • [41] Hierarchical Image Generation via Transformer-Based Sequential Patch Selection
    Xu, Xiaogang
    Xu, Ning
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2938 - 2945
  • [42] Memory-enhanced hierarchical transformer for video paragraph captioning
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    NEUROCOMPUTING, 2025, 615
  • [43] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [44] An Augmented Image Captioning Model: Incorporating Hierarchical Image Information
    Funckes, Nathan
    Carrier, Erin
    Wolffe, Greg
    20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021), 2021, : 1608 - 1614
  • [45] External knowledge-assisted Transformer for image captioning
    Li, Zhixin
    Su, Qiang
    Chen, Tianyu
    IMAGE AND VISION COMPUTING, 2023, 140
  • [46] Dual-Spatial Normalized Transformer for image captioning
    Hu, Juntao
    Yang, You
    An, Yongzhi
    Yao, Lu
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [47] Caption TLSTMs: combining transformer with LSTMs for image captioning
    Yan, Jie
    Xie, Yuxiang
    Luan, Xidao
    Guo, Yanming
    Gong, Quanzhi
    Feng, Suru
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (02) : 111 - 121
  • [48] Reinforcement Learning Transformer for Image Captioning Generation Model
    Yan, Zhaojie
    FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [49] Improving Stylized Image Captioning with Better Use of Transformer
    Tan, Yutong
    Lin, Zheng
    Liu, Huan
    Zuo, Fan
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 347 - 358
  • [50] Graph Alignment Transformer for More Grounded Image Captioning
    Tian, Canwei
    Hu, Haiyang
    Li, Zhongjin
    2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL IOT, BIG DATA AND SUPPLY CHAIN, IIOTBDSC, 2022, : 95 - 102