Deep multimodal feature fusion for micro-video classification

Cited by: 0
Authors
Zhang L. [1 ]
Cui T. [1 ]
Jing P. [1 ]
Su Y. [1 ]
Affiliations
[1] School of Electrical and Information Engineering, Tianjin University, Tianjin
Source
Journal of Beijing University of Aeronautics and Astronautics (BUAA), Vol. 47 | Corresponding author: Jing, Peiguang (pgjing@tju.edu.cn)
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Classification; Deep network; Feature space; Micro-video; Multimodal learning;
DOI
10.13700/j.bh.1001-5965.2020.0457
Abstract
Micro-video has become one of the most representative media products of the new-media era. Its short duration and heavy editing make traditional video classification models unsuitable for the micro-video classification task. Motivated by these characteristics, a micro-video classification algorithm based on deep multimodal feature fusion is proposed. The algorithm feeds the visual and acoustic modality features into a domain separation network, which divides the whole feature space into a shared domain common to all modalities and private domains specific to the acoustic and visual modalities, respectively. By optimizing the domain separation network, both the differences and the similarities among the modality features are preserved to the greatest extent. Experiments on a public micro-video classification dataset show that the proposed algorithm effectively reduces the redundancy of feature fusion and improves the average classification accuracy to 0.813. © 2021, Editorial Board of JBUAA. All rights reserved.
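The shared/private split described in the abstract is in the spirit of a domain separation network. The sketch below is illustrative only (the array shapes, loss forms, and weighting are assumptions, not the paper's exact formulation): it shows the two penalties such a network typically optimizes — an orthogonality ("difference") loss that decorrelates each modality's shared and private features across a batch, and a similarity loss that pulls the shared visual and acoustic representations together.

```python
import numpy as np

def difference_loss(shared: np.ndarray, private: np.ndarray) -> float:
    """Squared Frobenius norm of the batch cross-correlation between
    shared and private features; zero when they are decorrelated."""
    return float(np.sum((shared.T @ private) ** 2))

def similarity_loss(shared_v: np.ndarray, shared_a: np.ndarray) -> float:
    """Squared distance between the mean shared embeddings of the two
    modalities; a simple stand-in for a shared-domain alignment term."""
    return float(np.sum((shared_v.mean(axis=0) - shared_a.mean(axis=0)) ** 2))

# Toy batch: 4 samples with 8-dim visual/acoustic embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
sv, pv = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))  # visual: shared, private
sa, pa = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))  # acoustic: shared, private

# Combined objective: keep modality-specific parts apart, shared parts close.
loss = difference_loss(sv, pv) + difference_loss(sa, pa) + similarity_loss(sv, sa)
print(round(loss, 3))
```

In a full model these penalties would be added to the classification loss and minimized jointly, so the shared domain carries cross-modal information while each private domain retains what is unique to its modality.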
Pages: 478-485
Page count: 7