Evolution of a Web-Scale Near Duplicate Image Detection System

Cited by: 1
Authors
Gusev, Andrey [1]
Xu, Jiajing [1]
Affiliations
[1] Pinterest, San Francisco, CA 94107 USA
Keywords
near-duplicate detection; recommendation systems; locality sensitive hashing; transfer learning; clustering
DOI
10.1145/3366423.3380031
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Detecting near duplicate images is fundamental to the content ecosystem of photo sharing web applications. However, such a task is challenging when involving a web-scale image corpus containing billions of images. In this paper, we present an efficient system for detecting near duplicate images across 8 billion images. Our system consists of three stages: candidate generation, candidate selection, and clustering. We also demonstrate that this system can be used to greatly improve the quality of recommendations and search results across a number of real-world applications. In addition, we include the evolution of the system over the course of six years, bringing out experiences and lessons on how new systems are designed to accommodate organic content growth as well as the latest technology. Finally, we are releasing a human-labeled dataset of approximately 53,000 pairs of images introduced in this paper.
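The abstract names the three stages but gives no implementation details, so the following is only a toy sketch of how such a pipeline could fit together, not the authors' system. It assumes each image is already represented by an embedding vector. Random-hyperplane hashing stands in for the locality sensitive hashing listed in the keywords, a fixed cosine threshold stands in for whatever selection model the paper uses, and union-find stands in for the clustering step. All function names and parameters (`near_dup_clusters`, `n_tables`, `n_bits`, `threshold`) are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_key(vec, planes):
    """Sign pattern of vec against each random hyperplane (one bucket key)."""
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def near_dup_clusters(embeddings, n_tables=8, n_bits=6, threshold=0.95, seed=0):
    """Group indices of near-duplicate embeddings (illustrative only)."""
    rng = random.Random(seed)
    dim = len(embeddings[0])

    # Stage 1, candidate generation: pairs sharing a bucket in any hash table.
    candidates = set()
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for idx, vec in enumerate(embeddings):
            buckets[lsh_key(vec, planes)].append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))

    # Stage 2, candidate selection: keep pairs that pass a similarity check.
    # Assumption: a plain cosine threshold replaces any learned classifier.
    selected = [
        (i, j) for i, j in candidates
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]

    # Stage 3, clustering: merge accepted pairs into connected components.
    parent = list(range(len(embeddings)))
    for i, j in selected:
        parent[find(parent, i)] = find(parent, j)
    clusters = defaultdict(list)
    for idx in range(len(embeddings)):
        clusters[find(parent, idx)].append(idx)
    return sorted(clusters.values())
```

Splitting the work this way means the expensive pairwise check in stage 2 only runs on pairs that collide in at least one hash table, rather than on every pair in the corpus.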
Pages: 2733 - 2739
Page count: 7
Related papers
50 in total
  • [21] Web-Scale Generic Object Detection at Microsoft Bing
    Chen, Stephen Xi
    Mukherjee, Saurajit
    Phadke, Unmesh
    Wang, Tingting
    Park, Junwon
    Yada, Ravi Theja
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 2674 - 2682
  • [22] Source Retrieval for Web-Scale Text Reuse Detection
    Hagen, Matthias
    Potthast, Martin
    Adineh, Payam
    Fatehifar, Ehsan
    Stein, Benno
    CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017, : 2091 - 2094
  • [23] Aligning codebooks for near duplicate image detection
    Battiato, Sebastiano
    Farinella, Giovanni Maria
    Puglisi, Giovanni
    Ravi, Daniele
    MULTIMEDIA TOOLS AND APPLICATIONS, 2014, 72 (02) : 1483 - 1506
  • [24] Aligning codebooks for near duplicate image detection
    Sebastiano Battiato
    Giovanni Maria Farinella
    Giovanni Puglisi
    Daniele Ravì
    Multimedia Tools and Applications, 2014, 72 : 1483 - 1506
  • [25] A Conceptual Model for a Web-Scale Entity Name System
    Bouquet, Paolo
    Palpanas, Themis
    Stoermer, Heiko
    Vignolo, Massimiliano
    SEMANTIC WEB, PROCEEDINGS, 2009, 5926 : 46 - 60
  • [26] Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages
    Wei, Yongzhuang
    Wang, Shuai
    Yuan, Chunfeng
    Huang, Yihua
    2012 13TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS, AND TECHNOLOGIES (PDCAT 2012), 2012, : 523 - 528
  • [27] Duplicate image detection in a stream of web visual data
    Gadeski, Etienne
    Le Borgne, Herve
    Popescu, Adrian
    2015 13TH INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING (CBMI), 2015
  • [28] Candidate Document Retrieval for Web-Scale Text Reuse Detection
    Hagen, Matthias
    Stein, Benno
    STRING PROCESSING AND INFORMATION RETRIEVAL, 2011, 7024 : 356 - 367
  • [29] An Integrated Approach to Near-duplicate Image Detection
    Yang, Heesung
    Park, Hyeyoung
    2023 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE IN INFORMATION AND COMMUNICATION, ICAIIC, 2023, : 425 - 428
  • [30] Web-scale Knowledge Collection
    Lockard, Colin
    Shiralkar, Prashant
    Dong, Xin Luna
    Hajishirzi, Hannaneh
    PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM '20), 2020, : 888 - 889