Collaborative Foreground, Background, and Action Modeling Network for Weakly Supervised Temporal Action Localization

被引:9
|
作者
Moniruzzaman, Md. [1 ]
Yin, Zhaozheng [2 ]
机构
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] SUNY Stony Brook, Dept Comp Sci, Dept Biomed Informat, Stony Brook, NY 11794 USA
基金
美国国家科学基金会;
关键词
Temporal action localization; foreground modeling; background modeling; action modeling;
D O I
10.1109/TCSVT.2023.3272891
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
In this paper, we explore the problem of Weakly supervised Temporal Action Localization (W-TAL), where the task is to localize the temporal boundaries of all action instances in an untrimmed video with only video-level supervision. The existing W-TAL methods achieve a good action localization performance by separating the discriminative action and background frames. However, there is still a large performance gap between the weakly and fully supervised methods. The main reason comes from that there are plenty of ambiguous action and background frames in addition to the discriminative action and background frames. Due to the lack of temporal annotations in W-TAL, the ambiguous background frames may be localized as foreground and the ambiguous action frames may be suppressed as background, which result in false positives and false negatives, respectively. In this paper, we introduce a novel collaborative Foreground, Background, and Action Modeling Network (FBA Net) to suppress the background (i.e., both the discriminative and ambiguous background) frames, and localize the actual action-related (i.e., both the discriminative and ambiguous action) frames as foreground, for the precise temporal action localization. We design our FBA-Net with three branches: the foreground modeling (FM) branch, the background modeling (BM) branch, and the class-specific action and background modeling (CM) branch. The CM branch learns to highlight the video frames related to C action classes, and separate the action-related frames of C action classes from the (C + 1)th background class. The collaboration between FM and CM regularizes the consistency between the FM and the C action classes of CM, which reduces the false negative rate by localizing different actual-action-related (i.e., both the discriminative and ambiguous action) frames in a video as foreground. On the other hand, the collaboration between BM and CM regularizes the consistency between the BM and the (C + 1)th background class of CM, which reduces the false positive rate by suppressing both the discriminative and ambiguous background frames. Furthermore, the collaboration between FM and BM enforces more effective foreground background separation. To evaluate the effectiveness of our FBA-Net, we perform extensive experiments on two challenging datasets, THUMOS14 and ActivityNet1.3. The experiments show that our FBA-Net attains superior results.
引用
收藏
页码:6939 / 6951
页数:13
相关论文
共 50 条
  • [1] Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization
    Huang, Linjiang
    Wang, Liang
    Li, Hongsheng
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 7982 - 7991
  • [2] Background Suppression Network for Weakly-Supervised Temporal Action Localization
    Lee, Pilhyeon
    Uh, Youngjung
    Byun, Hyeran
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11320 - 11327
  • [3] ACTION COHERENCE NETWORK FOR WEAKLY SUPERVISED TEMPORAL ACTION LOCALIZATION
    Zhai, Yuanhao
    Wang, Le
    Liu, Ziyi
    Zhang, Qilin
    Hua, Gang
    Zheng, Nanning
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 3696 - 3700
  • [4] Action Completeness Modeling with Background Aware Networks for Weakly-Supervised Temporal Action Localization
    Moniruzzaman, Md
    Yin, Zhaozheng
    He, Zhihai
    Qin, Ruwen
    Leu, Ming C.
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2166 - 2174
  • [5] Weakly-supervised Action Localization with Background Modeling
    Phuc Xuan Nguyen
    Ramanan, Deva
    Fowlkes, Charless C.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 5501 - 5510
  • [6] Action Coherence Network for Weakly-Supervised Temporal Action Localization
    Zhai, Yuanhao
    Wang, Le
    Tang, Wei
    Zhang, Qilin
    Zheng, Nanning
    Hua, Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1857 - 1870
  • [7] Weakly-Supervised Temporal Action Localization by Background Suppression
    Liu, Mengxue
    Gao, Xiangjun
    Ge, Fangzhen
    Liu, Huaiyu
    Li, Wenjing
    2021 PROCEEDINGS OF THE 40TH CHINESE CONTROL CONFERENCE (CCC), 2021, : 7074 - 7081
  • [8] Action Unit Memory Network for Weakly Supervised Temporal Action Localization
    Luo, Wang
    Zhang, Tianzhu
    Yang, Wenfei
    Liu, Jingen
    Mei, Tao
    Wu, Feng
    Zhang, Yongdong
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 9964 - 9974
  • [9] Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach
    Liu, Qinying
    Wang, Zilei
    Rong, Shenghai
    Li, Junjie
    Zhang, Yixin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 10399 - 10409
  • [10] Discriminative Action Snippet Propagation Network for Weakly Supervised Temporal Action Localization
    Dang, Yuanjie
    Huang, Chunxia
    Chen, Peng
    Zhao, Dongdong
    Gao, Nan
    Liang, Ronghua
    Huan, Ruohong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (06)