Binary Embedding-based Retrieval at Tencent

Cited by: 2
Authors
Gan, Yukang [1]
Ge, Yixiao [1]
Zhou, Chang [2]
Su, Shupeng [1]
Xu, Zhouchuan [3]
Xu, Xuyuan [2]
Hui, Quanchao [3]
Chen, Xiang [3]
Wang, Yexin [2]
Shan, Ying [1,3]
Affiliations
[1] Tencent PCG, ARC Lab, Shenzhen, Peoples R China
[2] Tencent Video, PCG, Shenzhen, Peoples R China
[3] Tencent Search, PCG, Shenzhen, Peoples R China
Keywords
embedding-based retrieval; embedding binarization; backward compatibility
DOI
10.1145/3580305.3599782
Chinese Library Classification: TP [Automation technology; computer technology]
Discipline code: 0812
Abstract
Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, the EBR system aims to identify relevant information from a corpus of documents that may be tens or hundreds of billions in size. With massive documents and highly concurrent queries, storage and computation become expensive and inefficient, making it difficult to scale up further. To tackle this challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, generally formulated as float vectors, into a composition of multiple binary vectors using a lightweight transformation model with residual multi-layer perceptron (MLP) blocks. The number of bits in the transformed binary vectors is jointly determined by the output dimension of the MLP blocks (termed m) and the number of residual blocks (termed u), i.e., m × (u + 1). We can therefore tailor the number of bits for different applications to trade off accuracy loss against cost savings. Importantly, we enable task-agnostic, efficient training of the binarization model using a new embedding-to-embedding strategy, e.g., only 2 V100 GPU hours are required to train on millions of vectors. We also exploit compatible training of binary embeddings so that the BEBR engine can support indexing across multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC), which achieves lower response time than Hamming codes by exploiting the Single Instruction Multiple Data (SIMD) units widely available in current CPUs. We have successfully deployed BEBR for web search and copyright detection in Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm generalizes seamlessly to various tasks and modalities, for instance, natural language processing (NLP) and computer vision (CV). Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, saving 30%~50% of index costs with almost no loss of accuracy at the system level.
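The abstract describes compressing a float embedding into (u + 1) binary codes of m bits each via residual MLP blocks, trained embedding-to-embedding. The following is a minimal PyTorch sketch of that idea only; the class names, layer sizes, sign-based binarizer with a straight-through estimator, and reconstruction loss are illustrative assumptions, not the authors' exact architecture or training recipe.

import torch
import torch.nn as nn

class SignSTE(torch.autograd.Function):
    """Sign binarizer with a straight-through estimator so gradients pass through."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class RecurrentBinarizer(nn.Module):
    """Hypothetical residual-MLP head: (u + 1) blocks, each emitting an m-bit code."""
    def __init__(self, dim: int, m: int, u: int):
        super().__init__()
        # Each block maps the current residual to an m-bit binary code.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, m))
             for _ in range(u + 1)]
        )
        # Decoders project each code back to the embedding space so the next
        # block can encode what remains unexplained (the residual).
        self.decoders = nn.ModuleList([nn.Linear(m, dim) for _ in range(u + 1)])

    def forward(self, x):
        codes, residual = [], x
        for enc, dec in zip(self.blocks, self.decoders):
            b = SignSTE.apply(enc(residual))   # m-bit code in {-1, +1}
            codes.append(b)
            residual = residual - dec(b)       # pass the remaining residual onward
        return torch.cat(codes, dim=-1)        # m * (u + 1) bits per vector

# Embedding-to-embedding training sketch: regress the original float embedding
# from the concatenated binary codes (no task labels needed).
if __name__ == "__main__":
    dim, m, u = 128, 64, 3                     # 64 * (3 + 1) = 256 bits per vector
    model = RecurrentBinarizer(dim, m, u)
    head = nn.Linear(m * (u + 1), dim)         # hypothetical reconstruction head
    emb = torch.randn(32, dim)
    loss = nn.functional.mse_loss(head(model(emb)), emb)
    loss.backward()

Because the total code length is m × (u + 1), accuracy versus storage can be traded off per application by adjusting m (bits per block) and u (number of residual blocks), as the abstract states.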
Pages: 4056-4067
Page count: 12
Related Papers (50 total)
  • [31] Explanations for Network Embedding-Based Link Predictions
    Kang, Bo
    Lijffijt, Jefrey
    De Bie, Tijl
    MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021, PT I, 2021, 1524 : 473 - 488
  • [32] MEAL: Manifold Embedding-based Active Learning
    Sreenivasaiah, Deepthi
    Otterbach, Johannes
    Wollmann, Thomas
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 1029 - 1037
  • [33] An Embedding-Based Topic Model for Document Classification
    Seifollahi, Sattar
    Piccardi, Massimo
    Jolfaei, Alireza
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, 20 (03)
  • [34] An Embedding-Based Approach to Repairing Question Semantics
    Zhou, Haixin
    Wang, Kewen
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS: DASFAA 2021 INTERNATIONAL WORKSHOPS, 2021, 12680 : 107 - 122
  • [35] Embedding-based approximate query for knowledge graph
    Qiu, Jingyi
    Zhang, Duxi
    Song, Aibo
    Wang, Honglin
    Zhang, Tianbo
    Jin, Jiahui
    Fang, Xiaolin
    Li, Yaqi
    Journal of Southeast University (English Edition), 2024, 40 (04) : 417 - 424
  • [36] EMBEDDING-BASED INTERPOLATION ON THE SPECIAL ORTHOGONAL GROUP
    Gawlik, Evan S.
    Leok, Melvin
    SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2018, 40 (02): : A721 - A746
  • [37] An Embedding-based Approach to Recommending SPARQL Queries
    Zhang, Lijing
    Zhang, Xiaowang
    Feng, Zhiyong
    2018 IEEE 30TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2018, : 991 - 998
  • [38] SEMS: Scalable Embedding Memory System for Accelerating Embedding-Based DNNs
    Kim, Sejin
    Kim, Jungwoo
    Jang, Yongjoo
    Kung, Jaeha
    Lee, Sungjin
    IEEE COMPUTER ARCHITECTURE LETTERS, 2022, 21 (02) : 157 - 160
  • [39] Word Embedding-Based Topic Similarity Measures
    Terragni, Silvia
    Fersini, Elisabetta
    Messina, Enza
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2021), 2021, 12801 : 33 - 45
  • [40] Neural embedding-based indices for semantic search
    Lashkari, Fatemeh
    Bagheri, Ebrahim
    Ghorbani, Ali A.
    INFORMATION PROCESSING & MANAGEMENT, 2019, 56 (03) : 733 - 755