Fine-Grained Video Retrieval With Scene Sketches

Cited by: 2
Authors
Zuo, Ran [1, 2]
Deng, Xiaoming [1, 2]
Chen, Keqi [1, 2]
Zhang, Zhengming [1, 2]
Lai, Yu-Kun [3]
Liu, Fang [4]
Ma, Cuixia [1, 2]
Wang, Hao [5]
Liu, Yong-Jin [4]
Wang, Hongan [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing Key Lab Human Comp Interact, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Dept Comp Sci & Technol, Beijing 101408, Peoples R China
[3] Cardiff Univ, Dept Comp Sci & Informat, Cardiff CF24 4AG, Wales
[4] Tsinghua Univ, BNRist, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[5] Alibaba, Beijing 100102, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Task analysis; Semantics; Visualization; Convolutional neural networks; Layout; Image coding; Encoding; Fine-grained sketch-based video retrieval; sketch-video dataset; scene sketch; graph convolutional networks;
DOI
10.1109/TIP.2023.3278474
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Benefiting from the intuitiveness and naturalness of sketch interaction, sketch-based video retrieval (SBVR) has received considerable attention in the video retrieval research area. However, most existing SBVR research still lacks the capability of accurate video retrieval with fine-grained scene content. To address this problem, in this paper we investigate a new task, which focuses on retrieving the target video by utilizing a fine-grained storyboard sketch depicting the scene layout and major foreground instances' visual characteristics (e.g., appearance, size, and pose) of the video; we call this task "fine-grained scene-level SBVR". The most challenging issue in this task is how to perform scene-level cross-modal alignment between sketch and video. Our solution consists of two parts. First, we construct a scene-level sketch-video dataset called SketchVideo, in which sketch-video pairs are provided and each pair contains a clip-level storyboard sketch and several keyframe sketches (corresponding to video frames). Second, we propose a novel deep learning architecture called Sketch Query Graph Convolutional Network (SQ-GCN). In SQ-GCN, we first adaptively sample the video frames to improve video encoding efficiency, and then construct appearance and category graphs to jointly model visual and semantic alignment between sketch and video. Experiments show that our fine-grained scene-level SBVR framework with the SQ-GCN architecture outperforms state-of-the-art fine-grained retrieval methods. The SketchVideo dataset and SQ-GCN code are available on the project webpage https://iscas-mmsketch.github.io/FG-SL-SBVR/.
Pages: 3136-3149
Number of pages: 14
Related papers
50 in total
  • [41] Hierarchical Memory Learning for Fine-Grained Scene Graph Generation
    Deng, Youming
    Li, Yansheng
    Zhang, Yongjun
    Xiang, Xiang
    Wang, Jian
    Chen, Jingdong
    Ma, Jiayi
    COMPUTER VISION - ECCV 2022, PT XXVII, 2022, 13687 : 266 - 283
  • [42] Fine-Grained Categorization for 3D Scene Understanding
    Stark, Michael
    Krause, Jonathan
    Pepik, Bojan
    Meger, David
    Little, James J.
    Schiele, Bernt
    Koller, Daphne
    PROCEEDINGS OF THE BRITISH MACHINE VISION CONFERENCE 2012, 2012,
  • [43] Fine-grained scalable video caching for heterogeneous clients
    Liu, Jiangchuan
    Xu, Jianliang
    Chu, Xiaowen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2006, 8 (05) : 1011 - 1020
  • [44] Temporal Query Networks for Fine-grained Video Understanding
    Zhang, Chuhan
    Gupta, Ankush
    Zisserman, Andrew
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4484 - 4494
  • [45] Fine-grained talking face generation with video reinterpretation
    Huang, Xin
    Wang, Mingjie
    Gong, Minglun
    VISUAL COMPUTER, 2021, 37 (01): : 95 - 105
  • [46] Spotting Temporally Precise, Fine-Grained Events in Video
    Hong, James
    Zhang, Haotian
    Gharbi, Michael
    Fisher, Matthew
    Fatahalian, Kayvon
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 33 - 51
  • [47] Fine-Grained Video Categorization with Redundancy Reduction Attention
    Zhu, Chen
    Tan, Xiao
    Zhou, Feng
    Liu, Xiao
    Yue, Kaiyu
    Ding, Errui
    Ma, Yi
    COMPUTER VISION - ECCV 2018, PT V, 2018, 11209 : 139 - 155
  • [48] FiGO: Fine-Grained Query Optimization in Video Analytics
    Cao, Jiashen
    Sarkar, Karan
    Hadidi, Ramyad
    Arulraj, Joy
    Kim, Hyesoon
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22), 2022, : 559 - 572
  • [50] Fine-Grained Motion Estimation for Video Frame Interpolation
    Yan, Bo
    Tan, Weimin
    Lin, Chuming
    Shen, Liquan
    IEEE TRANSACTIONS ON BROADCASTING, 2021, 67 (01) : 174 - 184