Fine-Grained Video Retrieval With Scene Sketches

Cited by: 2
Authors
Zuo, Ran [1 ,2 ]
Deng, Xiaoming [1 ,2 ]
Chen, Keqi [1 ,2 ]
Zhang, Zhengming [1 ,2 ]
Lai, Yu-Kun [3 ]
Liu, Fang [4 ]
Ma, Cuixia [1 ,2 ]
Wang, Hao [5 ]
Liu, Yong-Jin [4 ]
Wang, Hongan [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing Key Lab Human Comp Interact, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Dept Comp Sci & Technol, Beijing 101408, Peoples R China
[3] Cardiff Univ, Dept Comp Sci & Informat, Cardiff CF24 4AG, Wales
[4] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[5] Alibaba, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Semantics; Visualization; Convolutional neural networks; Layout; Image coding; Encoding; Fine-grained sketch-based video retrieval; sketch-video dataset; scene sketch; graph convolutional networks;
DOI
10.1109/TIP.2023.3278474
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Benefiting from the intuitiveness and naturalness of sketch interaction, sketch-based video retrieval (SBVR) has received considerable attention in the video retrieval research area. However, most existing SBVR research still lacks the capability of accurate video retrieval with fine-grained scene content. To address this problem, in this paper we investigate a new task, which focuses on retrieving the target video by utilizing a fine-grained storyboard sketch depicting the scene layout and the visual characteristics (e.g., appearance, size, and pose) of the major foreground instances of the video; we call this task "fine-grained scene-level SBVR". The most challenging issue in this task is how to perform scene-level cross-modal alignment between sketch and video. Our solution consists of two parts. First, we construct a scene-level sketch-video dataset called SketchVideo, in which sketch-video pairs are provided and each pair contains a clip-level storyboard sketch and several keyframe sketches (corresponding to video frames). Second, we propose a novel deep learning architecture called Sketch Query Graph Convolutional Network (SQ-GCN). In SQ-GCN, we first adaptively sample the video frames to improve video encoding efficiency, and then construct appearance and category graphs to jointly model visual and semantic alignment between sketch and video. Experiments show that our fine-grained scene-level SBVR framework with the SQ-GCN architecture outperforms state-of-the-art fine-grained retrieval methods. The SketchVideo dataset and SQ-GCN code are available on the project webpage: https://iscas-mmsketch.github.io/FG-SL-SBVR/.
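The retrieval step the abstract describes, matching a sketch query against candidate videos, is commonly done by ranking in a shared embedding space. The following is a minimal sketch of that generic idea only, not the SQ-GCN implementation: it assumes the sketch and each video have already been encoded into fixed-length vectors, and the names `cosine`, `rank_videos`, and the toy `gallery` values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_videos(sketch_emb, video_embs):
    """Return video ids sorted from most to least similar to the sketch embedding."""
    scores = {vid: cosine(sketch_emb, emb) for vid, emb in video_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings (hypothetical values, not produced by any real encoder).
gallery = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.8, 0.3],
    "clip_c": [0.5, 0.5, 0.1],
}
ranking = rank_videos([1.0, 0.2, 0.0], gallery)
print(ranking)
```

In a real system the embeddings would come from the trained sketch and video encoders; the released code on the project webpage is the reference for how SQ-GCN actually computes and compares them.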
Pages: 3136 - 3149
Number of pages: 14