Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction

Cited by: 3
Authors
Zhang, Yanchao [1 ,2 ]
Min, Weiqing [2 ,3 ]
Nie, Liqiang [1 ]
Jiang, Shuqiang [2 ,3 ]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266000, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Feature extraction; Convolution; Streaming media; Object oriented modeling; Three-dimensional displays; Neural networks; knowledge representation; supervised learning; video signal processing;
DOI
10.1109/TMM.2020.3019714
Chinese Library Classification (CLC)
TP [Automation & Computer Technology];
Discipline Code
0812;
Abstract
Video venue category prediction has been drawing increasing attention in the multimedia community for applications such as personalized location recommendation and video verification. Most existing works draw on information from either multiple modalities or other platforms to strengthen video representations. However, noisy acoustic information, sparse textual descriptions, and incompatible cross-platform data can limit the performance gain and reduce the generality of the model. We therefore focus on extracting discriminative visual features from videos by introducing a hybrid-attention structure. In particular, we propose a novel Global-Local Attention Module (GLAM), which can be inserted into neural networks to generate enhanced visual features from video content. In GLAM, Global Attention (GA) captures contextual scene-oriented information by assigning different weights to channels, while Local Attention (LA) learns salient object-oriented features by allocating different weights to spatial regions. Moreover, GLAM can be extended with multiple GAs and LAs for further visual enhancement. The two types of features captured by the GAs and LAs are integrated via convolution layers and then fed into a convolutional Long Short-Term Memory (convLSTM) network to generate spatial-temporal representations, constituting the content stream. In addition, video motion is exploited to learn long-term movement variations, which also contributes to venue prediction. The content and motion streams together form our proposed Hybrid-Attention Enhanced Two-Stream Fusion Network (HA-TSFN), which merges the features from the two streams into a comprehensive representation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the large-scale Vine dataset. Visualizations further show that the proposed GLAM captures complementary scene-oriented and object-oriented visual features from videos. Our code is available at: https://github.com/zhangyanchao1014/HA-TSFN.
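The abstract's two attention mechanisms, channel-wise GA and spatial LA, can be illustrated with a minimal NumPy sketch. This is a simplified stand-in, not the authors' implementation: pooling-plus-sigmoid gates replace the paper's learned convolutional attention, and plain averaging replaces the learned fusion convolutions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_attention(feat):
    """Scene-oriented attention: weight each channel of a (C, H, W)
    feature map by a gate computed from its global average."""
    weights = sigmoid(feat.mean(axis=(1, 2)))        # shape (C,)
    return feat * weights[:, None, None]

def local_attention(feat):
    """Object-oriented attention: weight each spatial position by a
    gate computed from the channel-averaged activation map."""
    weights = sigmoid(feat.mean(axis=0))             # shape (H, W)
    return feat * weights[None, :, :]

def glam_sketch(feat):
    """Fuse the two enhanced maps; the paper uses convolution layers
    for this step, so simple averaging is only a placeholder."""
    return 0.5 * (global_attention(feat) + local_attention(feat))

# Example: a toy 4-channel 8x8 feature map keeps its shape after GLAM.
feat = np.random.rand(4, 8, 8)
out = glam_sketch(feat)
```

In the full model, the GLAM output for each frame would then be passed to a convLSTM to build the spatial-temporal content stream.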
Pages: 2917-2929
Page count: 13
Related Papers
50 in total
  • [1] Hybrid-attention and frame difference enhanced network for micro-video venue recognition
    Wang, Bing
    Huang, Xianglin
    Cao, Gang
    Yang, Lifang
    Wei, Xiaolong
    Tao, Zhulin
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 43 (03) : 3337 - 3353
  • [2] Two-stream LSTM Network with Hybrid Attention for Vehicle Trajectory Prediction
    Li, Chao
    Liu, Zhanwen
    Zhang, Jiaying
    Wang, Yang
    Ding, Fan
    Zhao, Xiangmo
    [J]. 2022 IEEE 25TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC), 2022, : 1927 - 1934
  • [3] TWO-STREAM HYBRID ATTENTION NETWORK FOR MULTIMODAL CLASSIFICATION
    Chen, Qipin
    Shi, Zhenyu
    Zuo, Zhen
    Fu, Jinmiao
    Sun, Yi
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 359 - 363
  • [4] Structured Two-Stream Attention Network for Video Question Answering
    Gao, Lianli
    Zeng, Pengpeng
    Song, Jingkuan
    Li, Yuan-Fang
    Liu, Wu
    Mei, Tao
    Shen, Heng Tao
    [J]. THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 6391 - 6398
  • [5] Convolutional Two-Stream Network Fusion for Video Action Recognition
    Feichtenhofer, Christoph
    Pinz, Axel
    Zisserman, Andrew
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 1933 - 1941
  • [6] Pornographic Video Detection with Convolutional Two-Stream Network Fusion
    Lee, Wonjae
    Kim, Junghak
    Lee, Nam Kyung
    [J]. 11TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE: DATA, NETWORK, AND AI IN THE AGE OF UNTACT (ICTC 2020), 2020, : 1273 - 1275
  • [7] Compositional Attention Networks With Two-Stream Fusion for Video Question Answering
    Yu, Ting
    Yu, Jun
    Yu, Zhou
    Tao, Dacheng
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 1204 - 1218
  • [8] Two-Stream Attention Network for Pain Recognition from Video Sequences
    Thiam, Patrick
    Kestler, Hans A.
    Schwenker, Friedhelm
    [J]. SENSORS, 2020, 20 (03)
  • [9] A dual-attention feature fusion network for imbalanced fault diagnosis with two-stream hybrid generated data
    Wang, Chenze
    Wang, Han
    Liu, Min
    [J]. JOURNAL OF INTELLIGENT MANUFACTURING, 2024, 35 (04) : 1707 - 1719