Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Cited: 5
Authors
Yang, Qian [1]
Li, Yunxin [1]
Hu, Baotian [1]
Ma, Lin [2]
Ding, Yuxin [1]
Zhang, Min [1]
Affiliations
[1] Harbin Institute of Technology, Shenzhen, China
[2] Meituan, Beijing, China
Keywords
Visual Entailment; Explanation Generation; Semantic Alignment
DOI
10.1145/3503161.3548284
CLC Number
TP39 [Computer Applications]
Subject Classification
081203; 0835
Abstract
Visual Entailment with natural language explanations aims to infer the relationship between a text-image pair and to generate a sentence explaining the decision-making process. Previous methods rely mainly on a pre-trained vision-language model to perform relation inference and on a language model to generate the corresponding explanation. However, pre-trained vision-language models mainly build token-level alignment between text and image and ignore the high-level semantic alignment between phrases (chunks) and visual contents, which is critical for vision-language reasoning. Moreover, an explanation generator conditioned only on the encoded joint representation does not explicitly consider the critical decision-making points of relation inference, so the generated explanations are less faithful to the vision-language reasoning process. To mitigate these problems, we propose a unified Chunk-aware Alignment and Lexical Constraint based method, dubbed CALeC. It contains a Chunk-aware Semantic Interactor (abbr. CSI), a relation inferrer, and a Lexical Constraint-aware Generator (abbr. LeCG). Specifically, CSI exploits the sentence structure inherent in language and various image regions to build chunk-aware semantic alignment. The relation inferrer uses an attention-based reasoning network to incorporate the token-level and chunk-level vision-language representations. LeCG uses lexical constraints to explicitly incorporate the words or chunks attended to by the relation inferrer into explanation generation, improving the faithfulness and informativeness of the explanations. We conduct extensive experiments on three datasets; the results indicate that CALeC significantly outperforms competing models in both inference accuracy and the quality of the generated explanations.
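The abstract describes the architecture in prose only. As a rough illustration, the following is a minimal sketch (in PyTorch, not the authors' released code) of an attention-based relation inferrer in the spirit described above: it fuses hypothetical token-level and chunk-level vision-language features via cross-attention and exposes attention weights that could mark chunks for lexical constraints. All class names, dimensions, and the mean-pooling readout are assumptions.

```python
# Minimal sketch, NOT the authors' code: an attention-based relation
# inferrer fusing assumed token-level and chunk-level features.
import torch
import torch.nn as nn


class RelationInferrer(nn.Module):
    """Fuses chunk-level and token-level features via cross-attention."""

    def __init__(self, dim: int = 768, num_labels: int = 3):
        super().__init__()
        # Chunk-level features act as queries over token-level features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Three-way relation head: entailment / neutral / contradiction.
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, token_feats: torch.Tensor, chunk_feats: torch.Tensor):
        # token_feats: (B, T, dim) token-level joint representation
        # chunk_feats: (B, C, dim) chunk-level aligned representation
        fused, attn_weights = self.cross_attn(chunk_feats, token_feats, token_feats)
        logits = self.classifier(fused.mean(dim=1))  # pool chunks, then classify
        # attn_weights (B, C, T) indicate which chunks mattered; high-weight
        # chunks could be handed to the generator as lexical constraints.
        return logits, attn_weights


# Toy usage with random features standing in for encoder outputs.
inferrer = RelationInferrer()
logits, weights = inferrer(torch.randn(2, 40, 768), torch.randn(2, 6, 768))
print(logits.shape, weights.shape)  # torch.Size([2, 3]) torch.Size([2, 6, 40])
```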
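Likewise, a hedged sketch of the lexical-constraint idea: the paper's LeCG is not reproduced here. Instead, standard constrained beam search in Hugging Face transformers (the force_words_ids option of generate) shows how chunk words selected by the relation inferrer could be forced into a generated explanation. The model choice (t5-small), the prompt, and the constraint words are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: forcing inferrer-selected chunk words into generation
# via constrained beam search; a stand-in for LeCG, not its implementation.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = "explain: a man is riding a horse on the beach"
constraints = ["horse", "beach"]  # stand-ins for high-attention chunks
force_words_ids = [
    tokenizer(word, add_special_tokens=False).input_ids for word in constraints
]

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    force_words_ids=force_words_ids,  # constrained beam search
    num_beams=5,                      # constraints require num_beams > 1
    max_new_tokens=30,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Forcing surface forms in this way guarantees the constraint words appear, which matches the abstract's faithfulness motivation, at the possible cost of fluency when constraints conflict with the beam's preferred phrasing.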
Pages: 3587-3597
Page Count: 11