Self-supervised Cross-view Representation Reconstruction for Change Captioning

Cited by: 2
Authors
Tu, Yunbin [1]
Li, Liang [2]
Su, Li [1, 3]
Zha, Zheng-Jun [4]
Yan, Chenggang [5, 6]
Huang, Qingming [1, 2, 3]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Chinese Acad Sci, Key Lab Intelligent Informat Proc, ICT, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
[4] Univ Sci & Technol China, Hefei, Peoples R China
[5] Hangzhou Dianzi Univ, Hangzhou, Peoples R China
[6] Hangzhou Dianzi Univ, Lishui Inst, Hangzhou, Peoples R China
DOI
10.1109/ICCV51070.2023.00263
Chinese Library Classification number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Change captioning aims to describe the difference between a pair of similar images. Its key challenge is learning a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning module to improve the quality of the caption. This module reversely models a "hallucination" representation from the caption and the "before" representation. By pushing it closer to the "after" representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show that our method achieves state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.
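To make the idea of token-wise matching combined with cross-view contrastive alignment more concrete, here is a minimal sketch in plain Python. It is not taken from the SCORER code base: the function names, the single-head max-over-tokens matching rule, the one-directional InfoNCE-style loss, and the temperature value are all assumptions chosen for illustration, and the paper's actual multi-head formulation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two token feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def token_wise_similarity(tokens_a, tokens_b):
    """Assumed matching rule: for each token of image A, take its best
    cosine match among the tokens of image B, then average over A."""
    best = [max(cosine(a, b) for b in tokens_b) for a in tokens_a]
    return sum(best) / len(best)


def contrastive_alignment_loss(before_batch, after_batch, temperature=0.5):
    """InfoNCE-style loss (hypothetical, 'before' -> 'after' direction only):
    the i-th "before" image should score highest with the i-th "after"
    image and lower with every other "after" image in the batch."""
    n = len(before_batch)
    total = 0.0
    for i in range(n):
        logits = [
            token_wise_similarity(before_batch[i], after_batch[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two images, each with two 2-dimensional token features.
before = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
after_aligned = [[[0.9, 0.1], [0.1, 0.9]], [[1.0, 0.9], [0.9, -1.0]]]
after_swapped = list(reversed(after_aligned))

print(contrastive_alignment_loss(before, after_aligned))
print(contrastive_alignment_loss(before, after_swapped))
```

In this toy setup the correctly paired batch yields a lower loss than the swapped one, which is the behavior a contrastive alignment objective is meant to reward. A symmetric variant would add the same term in the "after" to "before" direction.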
Pages: 2793-2803
Number of pages: 11