Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets

Cited by: 64

Authors:

Heikamp, Kathrin [1]

Bajorath, Juergen [1]

Affiliation:

[1] Univ Bonn, Dept Life Sci Informat, B IT, LIMES Program Unit Chem Biol & Med Chem, D-53113 Bonn, Germany

Source:

JOURNAL OF CHEMICAL INFORMATION AND MODELING | 2011 / Vol. 51 / Issue 08

Keywords:

FINGERPRINTS; RECOMBINATION;

DOI:

10.1021/ci200199u

Chinese Library Classification:

R914 [Medicinal Chemistry];

Discipline classification code:

100701 ;

Abstract:

A large-scale similarity search investigation has been carried out on 266 well-defined compound activity classes extracted from the ChEMBL database. The analysis was performed using two widely applied two-dimensional (2D) fingerprints that mark opposite ends of the current performance spectrum of these types of fingerprints, i.e., MACCS structural keys and the extended connectivity fingerprint with bond diameter four (ECFP4). For each fingerprint, three nearest neighbor search strategies were applied. On the basis of these search calculations, a similarity search profile of the ChEMBL database was generated. Overall, the fingerprint search campaign was surprisingly successful. In 203 of 266 test cases (~76%), a compound recovery rate of at least 50% was observed with at least the better performing fingerprint and one search strategy. The similarity search profile also revealed several general trends. For example, fingerprint searching was often characterized by an early enrichment of active compounds in database selection sets. In addition, compound activity classes have been categorized according to different similarity search performance levels, which helps to put the results of benchmark calculations into perspective. Therefore, a compendium of activity classes falling into different search performance categories is provided. On the basis of our large-scale investigation, the performance range of state-of-the-art 2D fingerprinting has been delineated for compound data sets directed against a wide spectrum of pharmaceutical targets.

Pages: 1831-1839

Page count: 9
