With the rapid development of Internet and multimedia technologies, the volume and dimensionality of data have grown dramatically, creating a need for effective management of large-scale, high-dimensional data in many applications. Approximate nearest neighbor search is one of the most fundamental problems in this setting: given a query vector, how can its nearest neighbors in a large-scale dataset be retrieved quickly and accurately? The study of this problem faces several bottlenecks. Among them, building an efficient index structure that reorganizes the original dataset so that the resulting index table is balanced and fits the dataset's underlying distribution is one of the most important tasks. In addition, owing to the curse of dimensionality, efficiently estimating distances between high-dimensional vectors is another bottleneck of nearest neighbor search. To address these problems, we propose a nearest neighbor search method based on product quantization within clusters. First, we recombine the hierarchical relationship between vector quantization and product quantization to generate a more compact and balanced index structure that better fits the original large-scale, high-dimensional dataset and significantly reduces the empty-bucket rate of the index table. By adjusting the number of centroids at each layer, the intra-cluster quantization tree can be flexibly applied to different nearest neighbor search tasks on different large-scale datasets. Second, we design a new approach to generating neighbor clusters based on a greedy queue. Because the probability that the cluster nearest to a query contains the query's true neighbors is low, a certain number of neighboring clusters is needed to supply enough candidate vectors for the re-ranking stage. The proposed greedy queue generates sufficient neighbor clusters for a query more quickly than conventional greedy algorithms. Third, our approach includes an improved re-ranking method, the plane quantization approximation method, inspired by line quantization. Computing distances between the candidate set and the query vector is a key part of the re-ranking phase and directly affects query accuracy; however, exact Euclidean distance computation over large-scale, high-dimensional datasets is time consuming and sometimes infeasible with limited physical resources, so approximate distance computation is necessary. Our method computes distances between high-dimensional vectors more efficiently and with smaller approximation error than state-of-the-art quantization approaches, which significantly improves retrieval accuracy. In the experiments, we evaluate our method on several large-scale, high-dimensional datasets described by the SIFT and GIST descriptors. Compared with the product quantization tree, first-recall accuracy increases by 57.7% and the empty-bucket rate is reduced by more than 50%. Compared with locally optimized product quantization, the recall can also reach 0.97 while the query time is reduced by a factor of 8. The experimental results show that the proposed method based on product quantization within clusters significantly improves the performance of nearest neighbor search and is an effective solution for nearest neighbor search on large-scale, high-dimensional datasets. © 2020, Science Press.
All rights reserved.
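To make the product quantization building block concrete, the following is a minimal NumPy sketch of standard product quantization training and encoding, on which the proposed in-cluster scheme builds. It is not the paper's index structure; the function and parameter names (train_pq, encode_pq, n_subvectors, n_centroids) are illustrative assumptions.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(0)
    return centroids, labels

def train_pq(data, n_subvectors=8, n_centroids=64):
    """Split vectors into sub-vectors and learn one codebook per sub-space."""
    sub_dim = data.shape[1] // n_subvectors
    codebooks = []
    for m in range(n_subvectors):
        sub = data[:, m * sub_dim:(m + 1) * sub_dim]
        centroids, _ = kmeans(sub, n_centroids)
        codebooks.append(centroids)
    return codebooks

def encode_pq(data, codebooks):
    """Replace each sub-vector by the index of its nearest codeword."""
    n_subvectors = len(codebooks)
    sub_dim = data.shape[1] // n_subvectors
    codes = np.empty((len(data), n_subvectors), dtype=np.int32)
    for m, cb in enumerate(codebooks):
        sub = data[:, m * sub_dim:(m + 1) * sub_dim]
        d = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes[:, m] = d.argmin(1)
    return codes

if __name__ == "__main__":
    # Small synthetic example standing in for a SIFT-like dataset.
    X = np.random.default_rng(1).normal(size=(2000, 128)).astype(np.float32)
    codebooks = train_pq(X, n_subvectors=8, n_centroids=64)
    codes = encode_pq(X, codebooks)  # (2000, 8) compact codes
```

Each base vector is thus stored as a short code of sub-space centroid indices instead of its full floating-point representation, which is what makes indexing billions of high-dimensional vectors feasible in memory.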
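The candidate-collection step can be illustrated with a generic priority-queue (best-first) multi-probe over coarse clusters; this is a standard stand-in, not the paper's greedy-queue construction, and the names build_coarse_index, collect_candidates, and min_candidates are assumptions made for the sketch.

```python
import heapq
import numpy as np

def build_coarse_index(data, centroids):
    """Assign every base vector to its nearest coarse centroid (inverted lists)."""
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    return {c: np.where(labels == c)[0] for c in range(len(centroids))}

def collect_candidates(query, centroids, inverted_lists, min_candidates=1000):
    """Pop coarse clusters from a min-heap ordered by distance to the query
    until enough candidate vectors have been gathered for re-ranking."""
    d = ((centroids - query) ** 2).sum(1)
    heap = [(float(dist), c) for c, dist in enumerate(d)]
    heapq.heapify(heap)
    candidates = []
    while heap and len(candidates) < min_candidates:
        _, c = heapq.heappop(heap)
        candidates.extend(inverted_lists[c].tolist())
    return candidates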
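Finally, the re-ranking stage relies on approximate distances between the uncompressed query and the compressed candidates. Below is a sketch of the standard asymmetric distance computation (ADC) estimator with per-sub-space lookup tables; the plane quantization refinement proposed in the paper is not reproduced here, and the helper names adc_distances and rerank are illustrative.

```python
import numpy as np

def adc_distances(query, codes, codebooks):
    """Approximate squared distances from one query to many PQ-encoded vectors."""
    n_subvectors = len(codebooks)
    sub_dim = query.shape[0] // n_subvectors
    # One table per sub-space: distance from the query sub-vector to every codeword.
    tables = [((codebooks[m] - query[m * sub_dim:(m + 1) * sub_dim]) ** 2).sum(1)
              for m in range(n_subvectors)]
    # Sum the looked-up partial distances over the sub-spaces.
    dists = np.zeros(len(codes))
    for m in range(n_subvectors):
        dists += tables[m][codes[:, m]]
    return dists

def rerank(query, candidate_ids, codes, codebooks, top_k=10):
    """Re-rank candidates by ADC distance and return the top_k vector ids."""
    cand_codes = codes[candidate_ids]
    d = adc_distances(query, cand_codes, codebooks)
    order = np.argsort(d)[:top_k]
    return [candidate_ids[i] for i in order]
```

Because each estimated distance is assembled from a handful of table lookups rather than a full Euclidean computation over the original dimensions, re-ranking large candidate sets remains cheap; the approximation error of this step is exactly what the paper's improved re-ranking method aims to reduce.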