Modal Contrastive Learning Based End-to-End Text Image Machine Translation

Authors
Ma, Cong [1 ,2 ]
Han, Xu [1 ,2 ]
Wu, Linghui [1 ,2 ]
Zhang, Yaping [1 ,2 ]
Zhao, Yang [1 ,2 ]
Zhou, Yu [1 ,2 ]
Zong, Chengqing [1 ,2 ]
Affiliations
[1] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Machine translation; Decoding; Semantics; Pipelines; Text recognition; Task analysis; Text image machine translation; contrastive learning; text image recognition; machine translation; RECOGNITION;
DOI
10.1109/TASLP.2023.3324540
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Text image machine translation (TIMT) aims at directly translating text in the source language embedded in images into the target language. Most existing systems follow the cascaded pipeline diagram from recognition to translation, which suffers from the problem of error propagation, parameter redundancy, and information reduction. The end-to-end model has the potential to alleviate these issues via bridging the recognition and translation models. However, the challenge is the data limitation and modality gap between text and image. In this paper, we propose a novel end-to-end model, namely Modal contrastive learning based End-to-end Text Image Machine Translation (METIMT), which alleviates these issues through end-to-end text image machine translation architecture and modal contrastive learning. Specifically, an image encoder is designed to encode images into the same feature space of corresponding text sentences, with the guidance of an intra-modal and inter-modal contrastive learning module. To further promote the research of text image machine translation, we have constructed one synthetic and two real-world datasets. Extensive experiments show that our lighter, faster model outperforms not only existing pipeline methods but also state-of-the-art end-to-end models on both synthetic and real-world evaluation sets. Our code and dataset will be released to the public.
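The abstract describes pulling image features into the feature space of the corresponding text sentences with contrastive learning. The sketch below illustrates only the inter-modal half of that idea, using a generic symmetric InfoNCE loss in plain Python. It is not the METIMT implementation: the function name `inter_modal_info_nce`, the cosine similarity, and the `temperature` value are all assumptions made for illustration, and the record does not specify the loss the authors actually use.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cosine_matrix(rows_a, rows_b):
    """Pairwise cosine similarity between two lists of vectors."""
    a = [_normalize(v) for v in rows_a]
    b = [_normalize(v) for v in rows_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def inter_modal_info_nce(image_feats, text_feats, temperature=0.1):
    """Symmetric InfoNCE loss (hypothetical helper, not from the paper).

    image_feats[i] and text_feats[i] are assumed to be a matched pair;
    every other item in the batch serves as a negative.
    """
    sim = _cosine_matrix(image_feats, text_feats)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy batch: two image embeddings and their paired text embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9]]
texts_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = inter_modal_info_nce(images, texts_aligned)
loss_shuffled = inter_modal_info_nce(images, texts_shuffled)
```

On this toy batch the loss is small when each image embedding sits close to its own sentence embedding and large when the pairs are shuffled, which is the pressure that would push an image encoder toward the text feature space. An intra-modal term, which the abstract also mentions, would need its own definition of positive pairs and is not sketched here.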
Pages: 2153-2165
Number of pages: 13
Related papers
50 in total
  • [1] Improving End-to-End Text Image Translation From the Auxiliary Text Translation Task
    Ma, Cong
    Zhang, Yaping
    Tu, Mei
    Han, Xu
    Wu, Linghui
    Zhao, Yang
    Zhou, Yu
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1664 - 1670
  • [2] RTNet: An End-to-End Method for Handwritten Text Image Translation
    Su, Tonghua
    Liu, Shuchen
    Zhou, Shengjie
    DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II, 2021, 12822 : 99 - 113
  • [3] End-to-End Network Intrusion Detection Based on Contrastive Learning
    Li, Longlong
    Lu, Yuliang
    Yang, Guozheng
    Yan, Xuehu
    SENSORS, 2024, 24 (07)
  • [4] An End-to-End Discriminative Approach to Machine Translation
    Liang, Percy
    Bouchard-Cote, Alexandre
    Klein, Dan
    Taskar, Ben
    COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, 2006, : 761 - 768
  • [5] Contrastive Learning for improving End-to-end Speaker Verification
    Tang, Yanxi
    Wang, Jianzong
    Qu, Xiaoyang
    Xiao, Jing
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [6] Tell, Imagine, and Search: End-to-end Learning for Composing Text and Image to Image Retrieval
    Zhang, Feifei
    Xu, Mingliang
    Xu, Changsheng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (02)
  • [7] A COMPARATIVE STUDY ON END-TO-END SPEECH TO TEXT TRANSLATION
    Bahar, Parnia
    Bieschke, Tobias
    Ney, Hermann
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 792 - 799
  • [8] SimulSpeech: End-to-End Simultaneous Speech to Text Translation
    Ren, Yi
    Liu, Jinglin
    Tan, Xu
    Zhang, Chen
    Qin, Tao
    Zhao, Zhou
    Liu, Tie-Yan
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 3787 - 3796
  • [9] End-to-End Speech-to-Text Translation: A Survey
    Sethiya, Nivedita
    Maurya, Chandresh Kumar
    Computer Speech and Language, 2025, 90
  • [10] Recognizing Multiple Text Sequences from an Image by Pure End-to-End Learning
    Xu, Zhenlong
    Zhou, Shuigeng
    Bai, Fan
    Cheng, Zhanzhan
    Niu, Yi
    Pu, Shiliang
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 7058 - 7065