Guide and interact: scene-graph based generation and control of video captions

Cited by: 0
Authors
Xuyang Lu
Yang Gao
Affiliations
[1] Beijing Institute of Technology, The School of Computer Science and Technology
[2] Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications
Source
Multimedia Systems | 2023 / Vol. 29
Keywords
Video captioning; Scene graph; Multi-modal; Text generation;
DOI
Not available
CLC number
Subject classification code
Abstract
Internet videos contain abundant meaningful information. The task of video captioning is to extract and understand the contents of a video and summarize them into a comprehensive description of one or more sentences. Research on video captioning involves challenges from both video understanding and natural language generation. Among the technical obstacles facing video captioning, one of the most critical issues undermining caption quality is that the model tends to generate fictional content, commonly called the “hallucination” problem. In this paper, we present scene-graph guidance and interaction (SGI) to address this problem. The SGI framework is composed of a faithful scene graph generation module and a multi-modal interactive network module. The scene graph generation module extracts a faithful scene graph from the video, which then serves as factual guidance for the text generator. The network module attends to the video features and the scene graph input, lets them interact, and generates a video caption that reflects the faithful video contents. On this basis, we further extend our SGI model to realize controllable video captioning based on user intention, using elaborate scene graphs. We performed experiments on the Charades and ActivityNet Captions datasets; the SGI model achieved state-of-the-art performance on automatic metrics, demonstrating the high quality and outstanding controllability of the generated video captions.
Pages: 797 - 809
Number of pages: 12
Related papers
50 in total
  • [41] Prediction and Generation of 3D Functional Scene Based on Relation Graph
    Sun Q.
    Hu R.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2022, 34 (09): : 1351 - 1361
  • [42] IndVisSGG: VLM-based scene graph generation for industrial spatial intelligence
    Wang, Zuoxu
    Yan, Zhijie
    Li, Shufei
    Liu, Jihong
    ADVANCED ENGINEERING INFORMATICS, 2025, 65
  • [43] Remote sensing scene graph generation for improved retrieval based on spatial relationships
    Tang, Jiayi
    Tong, Xiaochong
    Qiu, Chunping
    Sun, Yuekun
    Song, Haoshuai
    Lei, Yaxian
    Lei, Yi
    Guo, Congzhou
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2025, 220 : 741 - 752
  • [44] PPDL: Predicate Probability Distribution based Loss for Unbiased Scene Graph Generation
    Li, Wei
    Zhang, Haiwei
    Bai, Qijie
    Zhao, Guoqing
    Jiang, Ning
    Yuan, Xiaojie
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19425 - 19434
  • [45] Automatic Question Generation based on MOOC Video Subtitles and Knowledge Graph
    Ma, Lin
    Ma, Yuchun
    PROCEEDINGS OF 2019 7TH INTERNATIONAL CONFERENCE ON INFORMATION AND EDUCATION TECHNOLOGY (ICIET 2019), 2019, : 49 - 53
  • [46] SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation
    Lv, Changsheng
    Qi, Mengshi
    Li, Xia
    Yang, Zhengyuan
    Ma, Huadong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 4035 - 4043
  • [47] GPS-based route graph generation for over-the-horizon scene analysis
    Kamejima, K. (kamejima@is.oit.ac.jp), 1600, ICIC Express Letters Office, Tokai University, Kumamoto Campus, 9-1-1, Toroku, Kumamoto, 862-8652, Japan (07):
  • [48] Towards Open-Vocabulary Scene Graph Generation with Prompt-Based Finetuning
    He, Tao
    Gao, Lianli
    Song, Jingkuan
    Li, Yuan-Fang
    COMPUTER VISION - ECCV 2022, PT XXVIII, 2022, 13688 : 56 - 73
  • [49] Multi-Prototype Space Learning for Commonsense-Based Scene Graph Generation
    Chen, Lianggangxu
    Song, Youqi
    Cai, Yiqing
    Lu, Jiale
    Li, Yang
    Xie, Yuan
    Wang, Changbo
    He, Gaoqi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1129 - 1137
  • [50] Video Scene Title Generation based on Explicit and Implicit Relations among Caption Words
    Son, Jeong-Woo
    Park, Wonjoo
    Lee, Sang-Yun
    Kim, Sun-Joong
    2018 20TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY (ICACT), 2018, : 571 - 573