ExpressEdit: Video Editing with Natural Language and Sketching

Authors
Tilekbay, Bekzat [1]
Yang, Saelyne [1]
Lewkowicz, Michal [2]
Suryapranata, Alex [1]
Kim, Juho [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Comp, Daejeon, South Korea
[2] Yale Univ, Dept Comp Sci, POB 2158, New Haven, CT 06520 USA
Keywords
video editing; human-AI interaction; multimodal input
DOI
10.1145/3640543.3645164
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors edit videos by overlaying text/images or trimming footage to enhance the video quality and make it more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality, specifically natural language (NL) and sketching, which are natural modalities humans use for expression, can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas by generating edits based on users' multimodal edit commands and supporting iterations on the editing commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.
Pages: 515-536
Page count: 22