Style-Enhanced Transformer for Image Captioning in Construction Scenes

Cited: 0
Authors
Song, Kani [1 ]
Chen, Linlin [1 ]
Wang, Hengyou [1 ]
Affiliations
[1] Beijing Univ Civil Engn & Architecture, Sch Sci, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
image captioning; construction scene; style feature; transformer;
DOI
10.3390/e26030224
Chinese Library Classification
O4 [Physics];
Discipline classification code
0702;
Abstract
Image captioning is important for improving the intelligence of construction projects and for helping managers keep track of construction site activities. However, few image-captioning models currently target construction scenes, and existing methods do not perform well in complex construction scenes. Based on the characteristics of construction scenes, we annotate a text description dataset built on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP for short. Specifically, we extract grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In the decoder, we integrate the style information into the text features, and the interaction between the image semantic information and the text features generates content-appropriate sentences word by word. Finally, we add a sentence style loss to the total loss function to bring the style of the generated sentences closer to that of the training set. Experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% in CIDEr score on the MOCS dataset and by 3.9% on the MSCOCO dataset.
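The abstract describes a total training objective that combines the usual word-by-word caption loss with an added sentence style loss. A minimal sketch of that combination is below; the record does not give the formulas, so the cross-entropy caption loss, the mean-squared-distance form of the style loss, and the weight `style_weight` are all illustrative assumptions, not the paper's actual definitions.

```python
import math

def caption_cross_entropy(pred_probs, target_ids):
    """Word-by-word negative log-likelihood of the reference caption.

    pred_probs: per-step probability distributions over the vocabulary
    target_ids: reference token index at each step
    """
    nll = -sum(math.log(step[t]) for step, t in zip(pred_probs, target_ids))
    return nll / len(target_ids)

def sentence_style_loss(generated_style, reference_style):
    """Mean squared distance between generated and training-set style vectors
    (an assumed form; the paper only states that a sentence style loss is added)."""
    sq = sum((g - r) ** 2 for g, r in zip(generated_style, reference_style))
    return sq / len(generated_style)

def total_loss(pred_probs, target_ids, gen_style, ref_style, style_weight=0.1):
    # Caption loss plus a weighted style term, pulling generated-sentence
    # style toward the training-set style; style_weight is a hypothetical value.
    return (caption_cross_entropy(pred_probs, target_ids)
            + style_weight * sentence_style_loss(gen_style, ref_style))
```

In practice the style vectors would come from the paper's style encoder and the distributions from the Transformer decoder; here they are plain lists so the arithmetic is easy to inspect.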
Pages: 18
Related Papers
50 records in total
  • [41] Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion
    Zhao, An
    Yang, Wenzhong
    Chen, Danny
    Wei, Fuyuan
    [J]. ELECTRONICS, 2024, 13 (18)
  • [42] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [43] What Happens in Crowd Scenes: A New Dataset About Crowd Scenes for Image Captioning
    Wang, Lanxiao
    Li, Hongliang
    Hu, Wenzhe
    Zhang, Xiaoliang
    Qiu, Heqian
    Meng, Fanman
    Wu, Qingbo
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5400 - 5412
  • [44] Caption TLSTMs: combining transformer with LSTMs for image captioning
    Yan, Jie
    Xie, Yuxiang
    Luan, Xidao
    Guo, Yanming
    Gong, Quanzhi
    Feng, Suru
    [J]. INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (02) : 111 - 121
  • [45] External knowledge-assisted Transformer for image captioning
    Li, Zhixin
    Su, Qiang
    Chen, Tianyu
    [J]. IMAGE AND VISION COMPUTING, 2023, 140
  • [46] Dual-Spatial Normalized Transformer for image captioning
    Hu, Juntao
    Yang, You
    An, Yongzhi
    Yao, Lu
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [47] Graph Alignment Transformer for More Grounded Image Captioning
    Tian, Canwei
    Hu, Haiyang
    Li, Zhongjin
    [J]. 2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL IOT, BIG DATA AND SUPPLY CHAIN, IIOTBDSC, 2022, : 95 - 102
  • [48] Improving Stylized Image Captioning with Better Use of Transformer
    Tan, Yutong
    Lin, Zheng
    Liu, Huan
    Zuo, Fan
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 347 - 358
  • [49] Reinforcement Learning Transformer for Image Captioning Generation Model
    Yan, Zhaojie
    [J]. FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [50] Spiking-Transformer Optimization on FPGA for Image Classification and Captioning
    Udeji, Uchechukwu Leo
    Margala, Martin
    [J]. SOUTHEASTCON 2024, 2024, : 1353 - 1357