Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets

Cited by: 63
Authors
Heikamp, Kathrin [1]
Bajorath, Juergen [1]
Affiliations
[1] Univ Bonn, Dept Life Sci Informat, B IT, LIMES Program Unit Chem Biol & Med Chem, D-53113 Bonn, Germany
Keywords
FINGERPRINTS; RECOMBINATION
DOI
10.1021/ci200199u
Chinese Library Classification number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
A large-scale similarity search investigation has been carried out on 266 well-defined compound activity classes extracted from the ChEMBL database. The analysis was performed using two widely applied two-dimensional (2D) fingerprints that mark opposite ends of the current performance spectrum of these types of fingerprints, i.e., MACCS structural keys and the extended connectivity fingerprint with bond diameter four (ECFP4). For each fingerprint, three nearest neighbor search strategies were applied. On the basis of these search calculations, a similarity search profile of the ChEMBL database was generated. Overall, the fingerprint search campaign was surprisingly successful. In 203 of 266 test cases (~76%), a compound recovery rate of at least 50% was observed with at least the better performing fingerprint and one search strategy. The similarity search profile also revealed several general trends. For example, fingerprint searching was often characterized by an early enrichment of active compounds in database selection sets. In addition, compound activity classes have been categorized according to different similarity search performance levels, which helps to put the results of benchmark calculations into perspective. Therefore, a compendium of activity classes falling into different search performance categories is provided. On the basis of our large-scale investigation, the performance range of state-of-the-art 2D fingerprinting has been delineated for compound data sets directed against a wide spectrum of pharmaceutical targets.
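The abstract refers to nearest neighbor searching with multiple reference compounds and to a compound recovery rate, but it does not spell out the similarity measure or the three strategies. The following is only a minimal sketch of one common setup, under stated assumptions: fingerprints are represented as sets of on-bit indices, similarity is the Tanimoto coefficient, a database compound is scored by the mean of its k highest similarities to the reference set, and recovery rate is the fraction of known actives found in a fixed-size selection set. The function names, the toy fingerprints, and the selection size are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def knn_score(candidate: Set[int], references: Sequence[Set[int]], k: int = 1) -> float:
    """Mean of the k highest similarities to the reference set (k=1 is 1-NN)."""
    if not references:
        raise ValueError("at least one reference compound is required")
    sims = sorted((tanimoto(candidate, ref) for ref in references), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def recovery_rate(
    database: Dict[str, Set[int]],
    references: Sequence[Set[int]],
    active_ids: Set[str],
    k: int = 1,
    selection_size: int = 100,
) -> float:
    """Fraction of known actives found among the top-ranked database compounds."""
    ranked = sorted(
        database,
        key=lambda cid: knn_score(database[cid], references, k),
        reverse=True,
    )
    selected = ranked[:selection_size]
    hits = sum(1 for cid in selected if cid in active_ids)
    return hits / len(active_ids)


# Toy data (hypothetical): two reference actives, two hidden actives, two decoys.
references = [{1, 2, 3, 4}, {2, 3, 4, 5}]
database = {
    "act1": {1, 2, 3, 5},
    "act2": {2, 3, 4, 6},
    "dec1": {10, 11, 12},
    "dec2": {1, 20, 21, 22},
}
active_ids = {"act1", "act2"}

rate = recovery_rate(database, references, active_ids, k=1, selection_size=2)
print(f"recovery rate in top 2: {rate:.2f}")
```

In a real setting the bit sets would come from a cheminformatics toolkit (for example MACCS keys or a circular fingerprint), and the selection size would be chosen relative to the size of the screening database.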
Pages: 1831-1839
Page count: 9