Towards Good Practices for Multi-modal Fusion in Large-Scale Video Classification

Cited by: 0
Authors
Liu, Jinlai [1]
Yuan, Zehuan [1]
Wang, Changhu [1]
Affiliations
[1] Bytedance AI Lab, Beijing, People's Republic of China
Keywords
Video classification; Multi-modal learning; Bilinear model
DOI
10.1007/978-3-030-11018-5_26
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Leveraging both visual frames and audio has been shown experimentally to improve large-scale video classification. Previous research on video classification has focused mainly on analyzing the visual content of extracted video frames and on aggregating their features over time. Multi-modal data fusion, in contrast, is usually achieved with simple operators such as averaging and concatenation. Inspired by the success of bilinear pooling in vision-language fusion, we introduce multi-modal factorized bilinear pooling (MFB) to fuse visual and audio representations. We combine MFB with different video-level features and explore its effectiveness in video classification. Experimental results on the challenging YouTube-8M v2 dataset demonstrate that MFB significantly outperforms simple fusion methods in large-scale video classification.
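The fusion step named in the abstract can be illustrated with a minimal NumPy sketch of factorized bilinear pooling as it is commonly formulated: project each modality, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalization. This is not the authors' implementation. The function name `mfb_fuse`, the random projection matrices, and the dimensions (1024 for visual, 128 for audio, k = 5, 64 output units) are assumptions chosen for illustration; in a real model the projections would be learned.

```python
import numpy as np


def mfb_fuse(x, y, U, V, k):
    """Fuse two feature vectors with factorized bilinear pooling (sketch).

    x: (dx,) features of modality 1; y: (dy,) features of modality 2.
    U: (dx, k * o) and V: (dy, k * o) projection matrices (hypothetical here).
    Returns a fused vector of shape (o,).
    """
    joint = (x @ U) * (y @ V)                 # element-wise product, shape (k * o,)
    z = joint.reshape(-1, k).sum(axis=1)      # sum pooling over windows of k, shape (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))       # power (signed square-root) normalization
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z        # L2 normalization


# Illustrative dimensions only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
dx, dy, k, o = 1024, 128, 5, 64
visual = rng.standard_normal(dx)
audio = rng.standard_normal(dy)
U = rng.standard_normal((dx, k * o)) / np.sqrt(dx)
V = rng.standard_normal((dy, k * o)) / np.sqrt(dy)

fused = mfb_fuse(visual, audio, U, V, k)
```

Unlike averaging or concatenation, each output unit here depends on products of visual and audio features, so the fused vector can capture pairwise interactions between the two modalities while the factorization keeps the parameter count at (dx + dy) * k * o.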
Pages: 287-296
Page count: 10
Related papers
50 in total
  • [11] Tencent-MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Video Similarity Evaluation
    Zeng, Zhaoyang
    Luo, Yongsheng
    Liu, Zhenhua
    Rao, Fengyun
    Li, Dian
    Guo, Weidong
    Wen, Zhen
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 3128 - 3137
  • [12] Exploring a large-scale multi-modal transportation recommendation system
    Liu, Yang
    Lyu, Cheng
    Liu, Zhiyuan
    Cao, Jinde
    TRANSPORTATION RESEARCH PART C-EMERGING TECHNOLOGIES, 2021, 126
  • [13] Richpedia: A Large-Scale, Comprehensive Multi-Modal Knowledge Graph
    Wang, Meng
    Wang, Haofen
    Qi, Guilin
    Zheng, Qiushuo
    BIG DATA RESEARCH, 2020, 22 (22)
  • [14] Operational planning of a large-scale multi-modal transportation system
    Jansen, B
    Swinkels, PCJ
    Teeuwen, GJA
    de Fluiter, BV
    Fleuren, HA
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2004, 156 (01) : 41 - 53
  • [15] Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation
    Niu, Yulei
    Lu, Zhiwu
    Wen, Ji-Rong
    Xiang, Tao
    Chang, Shih-Fu
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (04) : 1720 - 1731
  • [16] Integrating multi-modal content analysis and hyperbolic visualization for large-scale news video retrieval and exploration
    Luo, H.
    Fan, J.
    Satoh, S.
    Yang, J.
    Ribarsky, W.
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2008, 23 (07) : 538 - 553
  • [17] Improved Sentiment Classification by Multi-modal Fusion
    Gan, Lige
    Benlamri, Rachid
    Khoury, Richard
    2017 THIRD IEEE INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (IEEE BIGDATASERVICE 2017), 2017, : 11 - 16
  • [18] Flexible Online Multi-modal Hashing for Large-scale Multimedia Retrieval
    Lu, Xu
    Zhu, Lei
    Cheng, Zhiyong
    Li, Jingjing
    Nie, Xiushan
    Zhang, Huaxiang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1129 - 1137
  • [19] Towards Good Practice in Large-Scale Learning for Image Classification
    Perronnin, Florent
    Akata, Zeynep
    Harchaoui, Zaid
    Schmid, Cordelia
    2012 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2012, : 3482 - 3489
  • [20] MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation
    Feng, Jiazhan
    Sun, Qingfeng
    Xu, Can
    Zhao, Pu
    Yang, Yaming
    Tao, Chongyang
    Zhao, Dongyan
    Lin, Qingwei
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 7348 - 7363