Semantic2Graph: graph-based multi-modal feature fusion for action segmentation in videos

Cited by: 1
Authors
Zhang, Junbin [1 ]
Tsai, Pei-Hsuan [2 ]
Tsai, Meng-Hsun [1 ,3 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 701, Taiwan
[2] Natl Cheng Kung Univ, Inst Mfg Informat & Syst, Tainan 701, Taiwan
[3] Natl Yang Ming Chiao Tung Univ, Dept Comp Sci, Hsinchu 300, Taiwan
Keywords
Video action segmentation; Graph neural networks; Computer vision; Semantic features; Multi-modal fusion; CONVOLUTIONAL NETWORK; LOCALIZATION; ATTENTION;
DOI
10.1007/s10489-023-05259-z
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video action segmentation has been widely applied in many fields. Most previous studies employed video-based vision models for this purpose. However, these models often rely on large receptive fields, LSTMs, or Transformers to capture long-term dependencies within videos, leading to significant computational resource requirements. Graph-based models have been proposed to address this challenge, but previous graph-based models are less accurate. Hence, this study introduces a graph-structured approach named Semantic2Graph that models long-term dependencies in videos, reducing computational costs while improving accuracy. We construct a graph structure of the video at the frame level. Temporal edges model the temporal relations and action order within videos. Additionally, we design positive and negative semantic edges, with corresponding edge weights, to capture both long-term and short-term semantic relationships among video actions. Node attributes comprise a rich set of multi-modal features extracted from video content, graph structure, and label text, covering visual, structural, and semantic cues. To synthesize this multi-modal information effectively, we employ a graph neural network (GNN) model that fuses the multi-modal features for node-level action label classification. Experimental results demonstrate that Semantic2Graph outperforms state-of-the-art methods, particularly on the GTEA and 50Salads benchmark datasets. Multiple ablation experiments further validate the effectiveness of the semantic features in enhancing model performance. Notably, the semantic edges in Semantic2Graph capture long-term dependencies at low cost, affirming its utility in addressing the computational resource constraints of video-based vision models.
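The graph construction described in the abstract (temporal edges linking consecutive frames, plus signed positive/negative semantic edges with weights, followed by GNN message passing over node features) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the edge weights, feature layout, and helper names (`build_frame_graph`, `gnn_layer`) are hypothetical.

```python
def build_frame_graph(num_frames, semantic_pairs):
    """Return weighted edges for a frame-level video graph:
    temporal edges between consecutive frames, plus semantic edges
    between frame pairs. Each semantic pair is (i, j, sign), where
    sign is +1 for a positive (same-action) edge and -1 for a
    negative (different-action) edge; 0.5 is an assumed base weight."""
    edges = []
    for t in range(num_frames - 1):
        edges.append((t, t + 1, 1.0))      # temporal edge, fixed weight
    for (i, j, sign) in semantic_pairs:
        edges.append((i, j, 0.5 * sign))   # signed semantic edge
    return edges

def gnn_layer(features, edges):
    """One weighted message-passing step: each node keeps its own
    feature vector (self-loop) and accumulates weighted neighbor
    features along undirected edges."""
    out = [list(f) for f in features]      # start from self features
    for (i, j, w) in edges:
        for d in range(len(features[0])):  # propagate in both directions
            out[j][d] += w * features[i][d]
            out[i][d] += w * features[j][d]
    return out

# Toy usage: 3 frames with 2-dim node features and one positive
# semantic edge between frames 0 and 2.
edges = build_frame_graph(3, [(0, 2, 1)])
fused = gnn_layer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], edges)
```

In the paper's setting the node features would be the fused multi-modal (visual, structural, textual) vectors, and the aggregated output would feed a classifier that predicts each frame's action label.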
Pages: 2084-2099
Page count: 16
Related papers
50 items total
  • [31] Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos
    Zhang, Zongmeng
    Han, Xianjing
    Song, Xuemeng
    Yan, Yan
    Nie, Liqiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 8265 - 8277
  • [32] AN INTEGRATED GRAPH-BASED FACE SEGMENTATION APPROACH FROM KINECT VIDEOS
    Zhang, Jixia
    Wang, Haibo
    Liu, Shaoguo
    Duan, Jiangyong
    Wang, Ying
    Pan, Chunhong
    2013 20TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP 2013), 2013, : 2733 - 2737
  • [33] Multi-Modal Sensor Fusion-Based Semantic Segmentation for Snow Driving Scenarios
    Vachmanus, Sirawich
    Ravankar, Ankit A.
    Emaru, Takanori
    Kobayashi, Yukinori
    IEEE SENSORS JOURNAL, 2021, 21 (15) : 16839 - 16851
  • [34] Multi-modal feature selection with anchor graph for Alzheimer's disease
    Li, Jiaye
    Xu, Hang
    Yu, Hao
    Jiang, Zhihao
    Zhu, Lei
    FRONTIERS IN NEUROSCIENCE, 2022, 16
  • [35] Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation
    Zhong, Zeyun
    Schneider, David
    Voit, Michael
    Stiefelhagen, Rainer
    Beyerer, Juergen
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 6057 - 6066
  • [36] Graph-based Regional Feature Enhancing for Abdominal Multi-Organ Segmentation in CT
    Yang, Zefan
    Wang, Yi
    2022 IEEE 35TH INTERNATIONAL SYMPOSIUM ON COMPUTER-BASED MEDICAL SYSTEMS (CBMS), 2022, : 125 - 130
  • [37] EISNet: A Multi-Modal Fusion Network for Semantic Segmentation With Events and Images
    Xie, Bochen
    Deng, Yongjian
    Shao, Zhanpeng
    Li, Youfu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8639 - 8650
  • [38] Graph-based cell pattern recognition for merging the multi-modal optical microscopic image of neurons
    Li, Wenwei
    Chen, Wu
    Dai, Zimin
    Chai, Xiaokang
    An, Sile
    Guan, Zhuang
    Zhou, Wei
    Chen, Jianwei
    Gong, Hui
    Luo, Qingming
    Feng, Zhao
    Li, Anan
    COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2024, 256
  • [39] Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System
    Huang, Jun
    Yang, Yang
    Yu, Hang
    Li, Jianguo
    Zheng, Xiao
    2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE, 2023, : 66 - 78
  • [40] Graph-based convolution feature aggregation for retinal vessel segmentation
    Shi, Cao
    Xu, Canhui
    He, Jianfei
    Chen, Yinong
    Cheng, Yuanzhi
    Yang, Qi
    Qiu, Haitao
    SIMULATION MODELLING PRACTICE AND THEORY, 2022, 121