Efficient Object Annotation via Speaking and Pointing

Authors
Michael Gygli
Vittorio Ferrari
Affiliations
[1] Google Research
Source
Keywords
Speech-based annotation; Object annotation; Multimodal interfaces; Large-scale computer vision
DOI
Not available
Abstract
Deep neural networks deliver state-of-the-art visual recognition, but they rely on large datasets, which are time-consuming to annotate. These datasets are typically annotated in two stages: (1) determining the presence of object classes at the image level and (2) marking the spatial extent for all objects of these classes. In this work we use speech, together with mouse inputs, to speed up this process. We first improve stage one, by letting annotators indicate object class presence via speech. We then combine the two stages: annotators draw an object bounding box via the mouse and simultaneously provide its class label via speech. Using speech has distinct advantages over relying on mouse inputs alone. First, it is fast and allows for direct access to the class name, by simply saying it. Second, annotators can simultaneously speak and mark an object location. Finally, speech-based interfaces can be kept extremely simple, hence using them requires less mouse movement compared to existing approaches. Through extensive experiments on the COCO and ILSVRC datasets we show that our approach yields high-quality annotations at significant speed gains. Stage one takes 2.3× to 14.9× less annotation time than existing methods based on a hierarchical organization of the classes to be annotated. Moreover, when combining the two stages, we find that object class labels come for free: annotating them at the same time as bounding boxes has zero additional cost. On COCO, this makes the overall process 1.9× faster than the two-stage approach.
Pages: 1061 - 1075
Page count: 14