Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction

Cited: 3
Authors
Zhang, Yanchao [1 ,2 ]
Min, Weiqing [2 ,3 ]
Nie, Liqiang [1 ]
Jiang, Shuqiang [2 ,3 ]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266000, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Feature extraction; Convolution; Streaming media; Object oriented modeling; Three-dimensional displays; Neural networks; knowledge representation; supervised learning; video signal processing;
DOI
10.1109/TMM.2020.3019714
CLC Number
TP [Automation Technology; Computer Technology];
Discipline Code
0812 ;
Abstract
Video venue category prediction has been drawing increasing attention in the multimedia community for applications such as personalized location recommendation and video verification. Most existing works resort to information from either multiple modalities or other platforms to strengthen video representations. However, noisy acoustic information, sparse textual descriptions and incompatible cross-platform data can limit the performance gain and reduce the universality of the model. Therefore, we focus on discriminative visual feature extraction from videos by introducing a hybrid-attention structure. In particular, we propose a novel Global-Local Attention Module (GLAM), which can be inserted into neural networks to generate enhanced visual features from video content. In GLAM, the Global Attention (GA) captures contextual scene-oriented information by assigning different weights to channels, while the Local Attention (LA) learns salient object-oriented features by allocating different weights to spatial regions. Moreover, GLAM can be extended to variants with multiple GAs and LAs for further visual enhancement. The two types of features respectively captured by GAs and LAs are integrated via convolution layers and then fed into a convolutional Long Short-Term Memory (convLSTM) to generate spatial-temporal representations, constituting the content stream. In addition, video motion is exploited to learn long-term movement variations, which also contributes to video venue prediction. The content and motion streams constitute our proposed Hybrid-Attention Enhanced Two-Stream Fusion Network (HA-TSFN), which finally merges the features from the two streams into a comprehensive representation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the large-scale Vine dataset.
The visualization also shows that the proposed GLAM can capture complementary scene-oriented and object-oriented visual features from videos. Our code is available at: https://github.com/zhangyanchao1014/HA-TSFN.
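The abstract describes GLAM as two parallel branches: a Global Attention that reweights channels to capture scene context, and a Local Attention that reweights spatial positions to highlight salient objects, with the two outputs integrated afterwards. Below is a minimal, parameter-free NumPy sketch of that channel/spatial gating idea; the function names, the sigmoid gates, and the sum-based fusion are illustrative assumptions, not the paper's actual learned modules (which use trained layers and convolutional fusion).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_attention(x):
    """Channel-wise gate (stand-in for GA): pool over space,
    then scale each channel by a sigmoid weight."""
    pooled = x.mean(axis=(1, 2))          # (C,) per-channel descriptor
    weights = sigmoid(pooled)             # one weight per channel, in (0, 1)
    return x * weights[:, None, None]     # reweight channels

def local_attention(x):
    """Spatial gate (stand-in for LA): pool over channels,
    then scale each spatial position by a sigmoid weight."""
    pooled = x.mean(axis=0)               # (H, W) spatial descriptor
    weights = sigmoid(pooled)             # one weight per (h, w) position
    return x * weights[None, :, :]        # reweight spatial regions

def glam_block(x):
    """Combine the two attention-enhanced maps; the paper integrates
    them with convolution layers, approximated here by a simple sum."""
    return global_attention(x) + local_attention(x)

x = np.random.rand(8, 4, 4)   # toy (channels, height, width) feature map
y = glam_block(x)
print(y.shape)                # the block preserves the input shape
```

Both branches keep the feature-map shape, so the block can be dropped between existing convolutional layers, which matches the abstract's claim that GLAM "can be inserted into neural networks".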
Pages: 2917-2929
Page count: 13
Related Papers
50 in total
  • [11] A dual-attention feature fusion network for imbalanced fault diagnosis with two-stream hybrid generated data
    Wang, Chenze
    Wang, Han
    Liu, Min
    [J]. Journal of Intelligent Manufacturing, 2024, 35 : 1707 - 1719
  • [12] Video Saliency Prediction Based on Spatial-Temporal Two-Stream Network
    Zhang, Kao
    Chen, Zhenzhong
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2019, 29 (12) : 3544 - 3557
  • [13] Abnormal event detection for video surveillance using an enhanced two-stream fusion method
    Yang, Yuxing
    Fu, Zeyu
    Naqvi, Syed Mohsen
    [J]. NEUROCOMPUTING, 2023, 553
  • [14] Two-Stream Video Classification with Cross-Modality Attention
    Chi, Lu
    Tian, Guiyu
    Mu, Yadong
    Tian, Qi
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 4511 - 4520
  • [15] Two-stream Graph Attention Convolutional for Video Action Recognition
    Zhang, Deyuan
    Gao, Hongwei
    Dai, Hailong
    Shi, Xiangbin
    [J]. 2021 IEEE 15TH INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (BIGDATASE 2021), 2021, : 23 - 27
  • [16] Intention-convolution and hybrid-attention network for vehicle trajectory prediction
    Li, Chao
    Liu, Zhanwen
    Lin, Shan
    Wang, Yang
    Zhao, Xiangmo
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 236
  • [17] Two-stream network for infrared and visible images fusion
    Liu, Luolin
    Chen, Mulin
    Xu, Mingliang
    Li, Xuelong
    [J]. NEUROCOMPUTING, 2021, 460 : 50 - 58
  • [18] Remote Sensing Image Fusion Algorithm Based on Two-Stream Fusion Network and Residual Channel Attention Mechanism
    Huang, Mengxing
    Liu, Shi
    Li, Zhenfeng
    Feng, Siling
    Wu, Di
    Wu, Yuanyuan
    Shu, Feng
    [J]. WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2022, 2022
  • [19] Remote Sensing Image Fusion Based on Two-Stream Fusion Network
    Liu, Xiangyu
    Wang, Yunhong
    Liu, Qingjie
    [J]. MULTIMEDIA MODELING, MMM 2018, PT I, 2018, 10704 : 428 - 439
  • [20] Remote sensing image fusion based on two-stream fusion network
    Liu, Xiangyu
    Liu, Qingjie
    Wang, Yunhong
    [J]. INFORMATION FUSION, 2020, 55 : 1 - 15