Hybrid-attention and frame difference enhanced network for micro-video venue recognition

Cited: 2
Authors
Wang, Bing [1 ]
Huang, Xianglin [1 ]
Cao, Gang [1 ]
Yang, Lifang [1 ]
Wei, Xiaolong [1 ]
Tao, Zhulin [1 ]
Affiliations
[1] Commun Univ China, State Key Lab Media Convergence & Commun, Dingfuzhuang 1, Beijing 100024, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Micro-video venue recognition; robust visual features; hybrid attention module; difference enhanced module; SCENE;
DOI
10.3233/JIFS-213191
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many micro-video related applications, such as personalized location recommendation and micro-video verification, can benefit greatly from venue information. Most existing works focus on integrating multi-modal information for exact venue category recognition, since it is important to make full use of the information from different modalities. However, performance may be limited when the acoustic modality or textual descriptions are missing from uploaded micro-videos. Therefore, in this paper the visual modality is explored as the only modality, owing to its rich and indispensable semantic information. To this end, a hybrid-attention and frame difference enhanced network (HAFDN) is proposed to generate a comprehensive venue representation. The network mainly contains two parallel branches: a content branch and a motion branch. Specifically, in the content branch, a domain-adaptive CNN model combined with a temporal shift module (TSM) is employed to extract discriminative visual features. Then, a novel hybrid attention module (HAM) is introduced to enhance the extracted features via three attention mechanisms. In HAM, channel attention and local and global spatial attention mechanisms are used to capture salient visual information from different views. In addition, convolutional Long Short-Term Memory (convLSTM) is applied after HAM to better encode long spatial-temporal dependencies. A difference-enhanced module parallel to HAM is devised to learn the content variations among adjacent frames, which are usually ignored in prior works. Moreover, in the motion branch, 3D-CNNs and LSTM are used to capture movement variation as a supplement to the content branch in a different form. Finally, the features from the two branches are fused to generate robust video-level representations for predicting venue categories. Extensive experimental results on public datasets verify the effectiveness of the proposed micro-video venue recognition scheme.
The source code is available at https://github.com/hs8945/HAFDN.
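To make the abstract's two core ideas concrete, the following is a minimal NumPy sketch of (a) the frame-difference signal the difference-enhanced module builds on (absolute differences between adjacent frames) and (b) a heavily simplified channel-attention gate (global average pooling followed by a per-channel sigmoid reweighting). The function names and the parameter-free gating are illustrative assumptions, not the authors' implementation; see the linked repository for the actual HAFDN code.

```python
import numpy as np

def frame_difference(frames: np.ndarray) -> np.ndarray:
    """Absolute differences between adjacent frames.

    frames: array of shape (T, H, W, C) -> returns (T-1, H, W, C).
    This is the raw content-variation signal a difference-enhanced
    module would further process.
    """
    return np.abs(np.diff(frames.astype(np.float64), axis=0))

def channel_attention(features: np.ndarray) -> np.ndarray:
    """Simplified channel attention for a single frame's feature map.

    features: (H, W, C). Each channel is reweighted by a sigmoid of its
    globally average-pooled activation (a learned MLP is omitted here).
    """
    pooled = features.mean(axis=(0, 1))        # (C,) global average pool
    gate = 1.0 / (1.0 + np.exp(-pooled))       # per-channel sigmoid weight
    return features * gate                     # broadcast over H and W

# Tiny demo: 4 frames of 2x2 spatial resolution with 3 channels.
frames = np.arange(4 * 2 * 2 * 3, dtype=np.float64).reshape(4, 2, 2, 3)
diffs = frame_difference(frames)               # shape (3, 2, 2, 3)
gated = channel_attention(diffs[0])            # shape (2, 2, 3)
print(diffs.shape, gated.shape)
```

The gate leaves feature shapes unchanged, so the reweighted maps can be fed directly into downstream modules such as the convLSTM described in the abstract.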
Pages: 3337-3353
Page count: 17
Related papers
50 records total
  • [1] Attention-enhanced and trusted multimodal learning for micro-video venue recognition
    Wang, Bing
    Huang, Xianglin
    Cao, Gang
    Yang, Lifang
    Wei, Xiaolong
    Tao, Zhulin
    [J]. COMPUTERS & ELECTRICAL ENGINEERING, 2022, 102
  • [2] Attention-enhanced joint learning network for micro-video venue classification
    Wang, Bing
    Huang, Xianglin
    Cao, Gang
    Yang, Lifang
    Tao, Zhulin
    Wei, Xiaolong
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (05) : 12425 - 12443
  • [4] Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction
    Zhang, Yanchao
    Min, Weiqing
    Nie, Liqiang
    Jiang, Shuqiang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 2917 - 2929
  • [5] Attention based consistent semantic learning for micro-video scene recognition
    Guo, Jie
    Nie, Xiushan
    Ma, Yuling
    Shaheed, Kashif
    Ullah, Inam
    Yin, Yilong
    [J]. INFORMATION SCIENCES, 2021, 543 : 504 - 516
  • [6] User-Video Co-Attention Network for Personalized Micro-video Recommendation
    Liu, Shang
    Chen, Zhenzhong
    Liu, Hongyi
    Hu, Xinghai
    [J]. WEB CONFERENCE 2019: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2019), 2019, : 3020 - 3026
  • [7] Multimodal semantic enhanced representation network for micro-video event detection
    Li, Yun
    Liu, Xianyi
    Zhang, Lijuan
    Tian, Haoyu
    Jing, Peiguang
    [J]. KNOWLEDGE-BASED SYSTEMS, 2024, 301
  • [8] Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations
    Liu, Weijia
    Cao, Jiuxin
    Wei, Ran
    Zhu, Xuelin
    Liu, Bo
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5440 - 5451
  • [9] Joint Learning of LSTMs-CNN and Prototype for Micro-video Venue Classification
    Liu, Wei
    Huang, Xianglin
    Cao, Gang
    Song, Gege
    Yang, Lifang
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2018, PT II, 2018, 11165 : 705 - 715
  • [10] Joint Learning of NNeXtVLAD, CNN and Context Gating for Micro-Video Venue Classification
    Liu, Wei
    Huang, Xianglin
    Cao, Gang
    Zhang, Jianglong
    Song, Gege
    Yang, Lifang
    [J]. IEEE ACCESS, 2019, 7 : 77091 - 77099