I2Transformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning

Cited by: 20
Authors
Tu, Yunbin [1 ,2 ]
Li, Liang [3 ]
Su, Li
Gao, Shengxiang [1 ,2 ]
Yan, Chenggang [4 ]
Zha, Zheng-Jun [5 ]
Yu, Zhengtao [1 ,2 ]
Huang, Qingming [6 ,7 ,8 ]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650500, Yunnan, Peoples R China
[2] Kunming Univ Sci & Technol, Yunnan Prov Key Lab Artificial Intelligence, Kunming 650500, Yunnan, Peoples R China
[3] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[4] Hangzhou Dianzi Univ, Sch Automat, Hangzhou 310018, Peoples R China
[5] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230052, Peoples R China
[6] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China
[7] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[8] Peng Cheng Lab, Shenzhen 518057, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Semantics; Task analysis; Visualization; TV; Electronic mail; Graph neural networks; TV Show captioning; video and subtitle; intra-relation embedding; inter-relation embedding; transformer; VIDEO;
DOI
10.1109/TIP.2022.3159472
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
TV show captioning aims to generate a linguistic sentence based on a video and its associated subtitle. Compared to purely video-based captioning, the subtitle can provide the captioning model with useful semantic clues such as actors' sentiments and intentions. However, using the subtitle effectively is also challenging, because it consists of scrappy pieces of information and has a semantic gap with the visual modality. To organize this scrappy information and yield a powerful omni-representation for all the modalities, an efficient captioning model must understand video contents, subtitle semantics, and the relations between them. In this paper, we propose an Intra- and Inter-relation Embedding Transformer (I2Transformer), consisting of an Intra-relation Embedding Block (IAE) and an Inter-relation Embedding Block (IEE) under the framework of a Transformer. First, the IAE captures the intra-relation within each modality by constructing learnable graphs. Then, the IEE learns cross attention gates and selects useful information from each modality based on their inter-relations, so as to derive the omni-representation as the input to the Transformer. Experimental results on the public dataset show that the I2Transformer achieves state-of-the-art performance. We also evaluate the effectiveness of the IAE and IEE on two other relevant video-with-text tasks, i.e., TV show retrieval and video-guided machine translation. The encouraging performance further validates that the IAE and IEE blocks generalize well. The code is available at https://github.com/tuyunbin/I2Transformer.
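The cross-attention-gate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' exact formulation (their IEE operates inside a full Transformer with learned projections); it only shows the general pattern of attending one modality over another and gating the attended context before fusion. All names (`cross_attention_gate`, `video`, `subtitle`) and the scalar sigmoid gate are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_gate(video, subtitle):
    """Sketch of gated cross-modal fusion.

    video:    (Tv, d) hypothetical video features
    subtitle: (Ts, d) hypothetical subtitle features
    Returns a (Ts, d) fused representation.
    """
    d = video.shape[1]
    # Subtitle tokens attend over video features (scaled dot-product).
    attn = softmax(subtitle @ video.T / np.sqrt(d))   # (Ts, Tv)
    context = attn @ video                            # (Ts, d)
    # A scalar sigmoid gate per token decides how much
    # attended visual context to admit into the fusion.
    gate = 1.0 / (1.0 + np.exp(-(subtitle * context).sum(-1, keepdims=True)))
    return subtitle + gate * context
```

In the paper's actual model, the gated outputs of both directions (video-to-subtitle and subtitle-to-video) would be combined into the omni-representation fed to the Transformer encoder.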
Pages: 3565 - 3577
Number of pages: 13