Style-Enhanced Transformer for Image Captioning in Construction Scenes

Citations: 0
Authors
Song, Kani [1 ]
Chen, Linlin [1 ]
Wang, Hengyou [1 ]
Affiliations
[1] Beijing Univ Civil Engn & Architecture, Sch Sci, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
image captioning; construction scene; style feature; transformer;
DOI
10.3390/e26030224
Chinese Library Classification
O4 [Physics];
Discipline Code
0702;
Abstract
Image captioning is important for improving the intelligence of construction projects and for helping managers monitor construction site activities. However, few image-captioning models exist for construction scenes at present, and the existing methods do not perform well in complex construction scenes. Based on the characteristics of construction scenes, we annotate a text-description dataset on top of the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP. Specifically, we extract grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In addition, in the decoder, we integrate the style information into the text features. The interaction between the image semantic information and the text features generates content-appropriate sentences word by word. Finally, we add a sentence-style loss to the total loss function to bring the style of the generated sentences closer to that of the training set. The experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% in CIDEr score on the MOCS dataset and by 3.9% on the MSCOCO dataset.
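The abstract describes two concrete mechanisms: integrating a style vector into the text features inside the decoder, and adding a weighted sentence-style loss to the total objective. A minimal NumPy sketch of both ideas is given below; the function names, the additive style fusion, and the weight `lam` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_conditioned_step(text_emb, grid_feats, style_vec):
    """One hedged decoding step: fuse a global style vector into the
    text features, then cross-attend over the image grid features.

    text_emb:   (T, d) text features for T generated tokens
    grid_feats: (N, d) grid features (e.g. from a Swin Transformer)
    style_vec:  (d,)   style embedding from the style encoder
    """
    fused = text_emb + style_vec            # integrate style into text features
    attn = softmax(fused @ grid_feats.T)    # attention weights over grid cells
    context = attn @ grid_feats             # image context per token
    return fused + context                  # (T, d) style- and image-aware features

def total_loss(caption_loss, style_loss, lam=0.2):
    """Total objective: caption loss plus a weighted sentence-style loss."""
    return caption_loss + lam * style_loss
```

For example, with zero text embeddings and a zero style vector, the attention is uniform and each output row equals the mean of the grid features; the loss helper simply realizes the weighted sum described in the abstract.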
Pages: 18