Improving fashion captioning via attribute-based alignment and multi-level language model

Cited by: 1
|
Authors
Tang, Yuhao [1]
Zhang, Liyan [1]
Yuan, Ye [1]
Chen, Zhixian [1]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 211106, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Fashion; Image captioning; E-commerce;
DOI
10.1007/s10489-023-05167-2
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Fashion captioning aims to generate detailed and captivating descriptions based on a group of item images. It requires the model to precisely describe attribute details under the supervision of complex sentences. Existing image captioning methods typically focus on describing a single image and often struggle to capture fine-grained visual representations in the fashion domain. Furthermore, the presence of complex description noise and unbalanced word distribution in fashion datasets limits diverse sentence generation. To alleviate redundancy in raw images, we propose an Attribute-based Alignment Module (AAM). The AAM captures more content-related information to enhance visual representations. Based on this design, we demonstrate that fashion captioning can benefit greatly from grid features with detailed alignment, in contrast to previous success with dense features. To address the inherent word distribution imbalance, we introduce a more balanced corpus called Fashion-Style-27k, collected from various shopping websites. Additionally, we present a pre-trained Fashion Language Model (FLM) that integrates sentence-level and attribute-level language knowledge into the caption model. Experiments on the FACAD and Fashion-Gen datasets show that the proposed AAM-FLM outperforms existing methods. Descriptions in the two datasets differ considerably in length and style, ranging from 21-word detailed descriptions to 30-word template-based sentences, demonstrating the generalization ability of the proposed model.
Pages: 30757 - 30777
Page count: 21
Related papers
50 records total
  • [11] Next Basket Recommendation Model Based on Attribute-Aware Multi-Level Attention
    Liu, Tong
    Yin, Xianrui
    Ni, Weijian
    IEEE ACCESS, 2020, 8 : 153872 - 153880
  • [12] A Multi-level Alignment Training Scheme for Video-and-Language Grounding
    Zhang, Yubo
    Niu, Feiyang
    Ping, Qing
    Thattai, Govind
    2022 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW, 2022, : 958 - 966
  • [13] Middle-Level Attribute-Based Language Retouching for Image Caption Generation
    Guan, Zhibin
    Liu, Kang
    Ma, Yan
    Qian, Xu
    Ji, Tongkai
    APPLIED SCIENCES-BASEL, 2018, 8 (10):
  • [14] Attribute-Guided Multi-Level Attention Network for Fine-Grained Fashion Retrieval
    Xiao, Ling
    Yamasaki, Toshihiko
    IEEE ACCESS, 2024, 12 : 48068 - 48080
  • [15] Attribute Based Signatures for Bounded Multi-level Threshold Circuits
    Kumar, Swarun
    Agrawal, Shivank
    Balaraman, Subha
    Rangan, C. Pandu
    PUBLIC KEY INFRASTRUCTURES, SERVICES AND APPLICATIONS, 2011, 6711 : 141 - 154
  • [16] Mulan: A Multi-Level Alignment Model for Video Question Answering
    Fu, Yu
    Cao, Cong
    Yang, Yuling
    Lu, Yuhai
    Yuan, Fangfang
    Wang, Dakui
    Liu, Yanbing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 5475 - 5489
  • [17] Model based multi-level prototyping
    Bredenfeld, A
    Wilberg, J
    TENTH IEEE INTERNATIONAL WORKSHOP ON RAPID SYSTEMS PROTOTYPING, PROCEEDINGS, 1999, : 190 - 195
  • [18] A multi-level/multi-type model for design-based alignment of instruction, assessment, and testing
    Hickey, DT
    Zuiker, SJ
    McGee, S
    ICLS2004: INTERNATIONAL CONFERENCE OF THE LEARNING SCIENCES, PROCEEDINGS: EMBRACING DIVERSITY IN THE LEARNING SCIENCES, 2004, : 607 - 607
  • [19] MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning
    Chen, Wei
    Niu, Jianwei
    Liu, Xuefeng
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2615 - 2620
  • [20] Single-Stream Multi-level Alignment for Vision-Language Pretraining
    Khan, Zaid
    Kumar, B. G. Vijay
    Yu, Xiang
    Schulter, Samuel
    Chandraker, Manmohan
    Fu, Yun
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 735 - 751