Efficient Object Annotation via Speaking and Pointing

Cited by: 0

Authors:

Michael Gygli

Vittorio Ferrari

Affiliation:

[1] Google Research

Source:

International Journal of Computer Vision | 2020 / Vol. 128

Keywords:

Speech-based annotation; Object annotation; Multimodal interfaces; Large-scale computer vision

DOI:

Not available

Abstract:

Deep neural networks deliver state-of-the-art visual recognition, but they rely on large datasets that are time-consuming to annotate. These datasets are typically annotated in two stages: (1) determining the presence of object classes at the image level and (2) marking the spatial extent of all objects of these classes. In this work we use speech, together with mouse inputs, to speed up this process. We first improve stage one by letting annotators indicate object class presence via speech. We then combine the two stages: annotators draw an object bounding box with the mouse and simultaneously provide its class label via speech. Using speech has distinct advantages over relying on mouse inputs alone. First, it is fast and gives direct access to the class name: the annotator simply says it. Second, annotators can speak and mark an object location at the same time. Finally, speech-based interfaces can be kept extremely simple, so they require less mouse movement than existing approaches. Through extensive experiments on the COCO and ILSVRC datasets we show that our approach yields high-quality annotations at significant speed gains. Stage one takes 2.3× to 14.9× less annotation time than existing methods based on a hierarchical organization of the classes to be annotated. Moreover, when combining the two stages, we find that object class labels come for free: annotating them at the same time as bounding boxes has zero additional cost. On COCO, this makes the overall process 1.9× faster than the two-stage approach.
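The abstract does not say how a spoken label is paired with the box drawn at the same time. As a minimal sketch, assuming a speech recognizer that returns word-level timestamps and assuming single-word class names, one could assign each box the in-vocabulary word whose utterance overlaps the box-drawing interval the most. The names `Box`, `Word`, `overlap`, and `label_boxes` are hypothetical, and this is an illustration of temporal matching, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    # Hypothetical record of one mouse-drawn box: pixel corners plus the
    # time span (in seconds) during which the annotator was drawing it.
    x0: float
    y0: float
    x1: float
    y1: float
    t_start: float
    t_end: float


@dataclass
class Word:
    # Hypothetical speech-recognizer output: one transcribed token with its time span.
    text: str
    t_start: float
    t_end: float


def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the intersection of intervals [a0, a1] and [b0, b1] (0 if disjoint)."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def label_boxes(boxes: List[Box], words: List[Word], vocabulary: List[str]) -> List[Optional[str]]:
    """Give each box the in-vocabulary word that overlaps its drawing time the most.

    Words outside the class vocabulary (fillers, misrecognitions) are ignored.
    Boxes with no overlapping class word get None. Assumes single-word class names.
    """
    vocab = {v.lower() for v in vocabulary}
    labels: List[Optional[str]] = []
    for box in boxes:
        best: Optional[str] = None
        best_ov = 0.0
        for w in words:
            name = w.text.lower()
            if name not in vocab:
                continue
            ov = overlap(box.t_start, box.t_end, w.t_start, w.t_end)
            if ov > best_ov:
                best, best_ov = name, ov
        labels.append(best)
    return labels


# Usage: two boxes drawn one after the other while the annotator speaks.
boxes = [Box(10, 20, 120, 200, 0.0, 1.5), Box(200, 40, 320, 180, 2.0, 3.2)]
words = [Word("dog", 0.4, 0.9), Word("um", 1.6, 1.8), Word("bicycle", 2.3, 2.9)]
print(label_boxes(boxes, words, ["dog", "cat", "bicycle"]))  # ['dog', 'bicycle']
```

Matching on time overlap rather than on utterance order keeps the sketch robust to filler words and to an annotator who starts speaking slightly before or after the drag begins; a real system would also need to handle multi-word class names and recognition errors.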


Pages: 1061–1075

Number of pages: 14
