Collaborative Foreground, Background, and Action Modeling Network for Weakly Supervised Temporal Action Localization

Cited by: 9
Authors
Moniruzzaman, Md. [1 ]
Yin, Zhaozheng [2 ]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] SUNY Stony Brook, Dept Comp Sci, Dept Biomed Informat, Stony Brook, NY 11794 USA
Funding
US National Science Foundation
Keywords
Temporal action localization; foreground modeling; background modeling; action modeling;
DOI
10.1109/TCSVT.2023.3272891
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology];
Subject classification codes
0808; 0809
Abstract
In this paper, we explore the problem of Weakly supervised Temporal Action Localization (W-TAL), where the task is to localize the temporal boundaries of all action instances in an untrimmed video with only video-level supervision. Existing W-TAL methods achieve good action localization performance by separating the discriminative action and background frames. However, there is still a large performance gap between weakly and fully supervised methods. The main reason is that, in addition to the discriminative action and background frames, there are many ambiguous action and background frames. Due to the lack of temporal annotations in W-TAL, ambiguous background frames may be localized as foreground and ambiguous action frames may be suppressed as background, resulting in false positives and false negatives, respectively. In this paper, we introduce a novel collaborative Foreground, Background, and Action Modeling Network (FBA-Net) to suppress the background (i.e., both the discriminative and ambiguous background) frames and localize the actual action-related (i.e., both the discriminative and ambiguous action) frames as foreground, for precise temporal action localization. We design our FBA-Net with three branches: the foreground modeling (FM) branch, the background modeling (BM) branch, and the class-specific action and background modeling (CM) branch. The CM branch learns to highlight the video frames related to C action classes and to separate the action-related frames of the C action classes from the (C + 1)th background class. The collaboration between FM and CM regularizes the consistency between the FM and the C action classes of CM, which reduces the false negative rate by localizing the actual action-related (i.e., both the discriminative and ambiguous action) frames in a video as foreground.
On the other hand, the collaboration between BM and CM regularizes the consistency between the BM and the (C + 1)th background class of CM, which reduces the false positive rate by suppressing both the discriminative and ambiguous background frames. Furthermore, the collaboration between FM and BM enforces more effective foreground-background separation. To evaluate the effectiveness of our FBA-Net, we perform extensive experiments on two challenging datasets, THUMOS14 and ActivityNet1.3. The experiments show that our FBA-Net attains superior results.
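The three pairwise collaborations described in the abstract can be read as consistency objectives between branch outputs. Below is a minimal NumPy sketch of one plausible formulation, not the paper's exact losses: `f` and `b` are hypothetical per-frame foreground and background attentions from the FM and BM branches, and `cas` is a (C + 1)-class activation sequence from the CM branch, with the last column as the background class.

```python
import numpy as np

def collaboration_losses(f, b, cas):
    """Pairwise consistency losses between the FM, BM, and CM branches.

    f, b : (T,) per-frame foreground/background attentions in [0, 1]
    cas  : (T, C+1) class activation scores; column C is the background class

    Returns (loss_fc, loss_bc, loss_fb) — assumed mean-squared consistency
    terms plus an overlap penalty; the paper may use different loss forms.
    """
    # Softmax over the C+1 classes for each frame (numerically stabilized).
    e = np.exp(cas - cas.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)

    action = probs[:, :-1].max(axis=1)  # strongest action evidence per frame (CM)
    background = probs[:, -1]           # background-class probability (CM)

    loss_fc = np.mean((f - action) ** 2)      # FM <-> CM action consistency
    loss_bc = np.mean((b - background) ** 2)  # BM <-> CM background consistency
    loss_fb = np.mean(f * b)                  # FM vs BM: discourage overlap
    return loss_fc, loss_bc, loss_fb
```

Under this sketch, all three terms vanish when the foreground attention matches the CM action evidence, the background attention matches the CM background probability, and no frame is claimed by both attentions — which is the consistency the three collaborations aim to enforce.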
Pages: 6939-6951 (13 pages)