Listen to Look: Action Recognition by Previewing Audio

Cited by: 132
Authors
Gao, Ruohan [1, 2]
Oh, Tae-Hyun [2, 3]
Grauman, Kristen [1, 2]
Torresani, Lorenzo [2]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Facebook AI Res, Austin, TX 78701 USA
[3] POSTECH, Dept EE, Pohang, South Korea
DOI
10.1109/CVPR42600.2020.01047
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an IMGAUD2VID framework that hallucinates clip-level features by distilling from lighter modalities (a single frame and its accompanying audio), reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on IMGAUD2VID, we further propose IMGAUD-SKIMMING, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves the state-of-the-art in terms of both recognition accuracy and speed.
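The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the mean-squared-error distillation loss, the dot-product attention, the feature sizes, and every function name below are assumptions made for illustration only, and random vectors stand in for real image-audio and clip features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def distillation_loss(student, teacher):
    """Assumed loss: mean squared error between the feature 'hallucinated'
    from one frame plus audio (student) and the expensive clip-level
    feature (teacher)."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def attend(query, moment_feats):
    """Assumed soft attention: score each moment against a query vector
    (in a recurrent model this would come from the hidden state) and
    return the weights plus the weighted feature summary."""
    weights = softmax([dot(query, f) for f in moment_feats])
    dim = len(moment_feats[0])
    summary = [sum(w * f[i] for w, f in zip(weights, moment_feats))
               for i in range(dim)]
    return weights, summary


random.seed(0)
num_moments, dim = 8, 4
# Hypothetical cheap image-audio features, one per candidate moment.
moment_feats = [[random.gauss(0, 1) for _ in range(dim)]
                for _ in range(num_moments)]
query = [random.gauss(0, 1) for _ in range(dim)]

weights, summary = attend(query, moment_feats)
most_useful = max(range(num_moments), key=lambda i: weights[i])
print("most attended moment:", most_useful)
```

In this sketch the attention weights play the role of the "useful moment" selector: a skimming model would process only the highest-weighted moments rather than every clip.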
Pages: 10454-10464
Page count: 11