Towards Good Practices for Multi-modal Fusion in Large-Scale Video Classification

Authors
Liu, Jinlai [1]
Yuan, Zehuan [1]
Wang, Changhu [1]
Affiliations
[1] Bytedance AI Lab, Beijing, People's Republic of China
Keywords
Video classification; Multi-modal learning; Bilinear model
DOI
10.1007/978-3-030-11018-5_26
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Leveraging both visual frames and audio has been shown experimentally to improve large-scale video classification. Previous research on video classification has mainly focused on analyzing the visual content of extracted video frames and aggregating their features over time. In contrast, multi-modal data fusion has been handled with simple operators such as averaging and concatenation. Inspired by the success of bilinear pooling in vision-language fusion, we introduce multi-modal factorized bilinear pooling (MFB) to fuse visual and audio representations. We combine MFB with different video-level features and explore its effectiveness in video classification. Experimental results on the challenging YouTube-8M v2 dataset demonstrate that MFB significantly outperforms simple fusion methods in large-scale video classification.
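The fusion step named in the abstract can be illustrated with a minimal NumPy sketch. It follows the commonly published MFB formulation (project each modality, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalization) and is not necessarily the authors' exact configuration. The function name `mfb_fuse`, the random projection matrices, and all dimensions below are assumptions chosen for illustration; in practice the projections would be learned.

```python
import numpy as np


def mfb_fuse(visual, audio, U, V, k):
    """Fuse two feature batches with factorized bilinear pooling (sketch).

    visual: (batch, dv) video-level visual features
    audio:  (batch, da) video-level audio features
    U:      (dv, o * k) projection for the visual modality (hypothetical weights)
    V:      (da, o * k) projection for the audio modality (hypothetical weights)
    Returns an array of shape (batch, o).
    """
    # Project both modalities into a shared (o * k)-dimensional space
    # and combine them with an element-wise product.
    joint = (visual @ U) * (audio @ V)
    batch = joint.shape[0]
    # Sum-pool over non-overlapping windows of size k, giving o outputs.
    pooled = joint.reshape(batch, -1, k).sum(axis=2)
    # Signed square-root (power) normalization, then L2 normalization.
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))
    norm = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norm, 1e-12)


# Illustrative sizes only; none of these values come from the record above.
rng = np.random.default_rng(0)
dv, da, o, k = 1024, 128, 16, 4
visual = rng.standard_normal((2, dv))
audio = rng.standard_normal((2, da))
U = rng.standard_normal((dv, o * k)) * 0.01
V = rng.standard_normal((da, o * k)) * 0.01

fused = mfb_fuse(visual, audio, U, V, k)
```

For comparison, the simple baselines mentioned in the abstract would be `np.concatenate([visual, audio], axis=1)` or, for features of equal size, an element-wise average; the sketch differs in that every output mixes the two modalities multiplicatively.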
Pages: 287-296
Page count: 10