Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets

Cited by: 64
Authors
Heikamp, Kathrin [1]
Bajorath, Juergen [1]
Affiliations
[1] Univ Bonn, Dept Life Sci Informat, B IT, LIMES Program Unit Chem Biol & Med Chem, D-53113 Bonn, Germany
Keywords
FINGERPRINTS; RECOMBINATION;
DOI
10.1021/ci200199u
Chinese Library Classification
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
A large-scale similarity search investigation has been carried out on 266 well-defined compound activity classes extracted from the ChEMBL database. The analysis was performed using two widely applied two-dimensional (2D) fingerprints that mark opposite ends of the current performance spectrum of these types of fingerprints, i.e., MACCS structural keys and the extended connectivity fingerprint with bond diameter four (ECFP4). For each fingerprint, three nearest neighbor search strategies were applied. On the basis of these search calculations, a similarity search profile of the ChEMBL database was generated. Overall, the fingerprint search campaign was surprisingly successful. In 203 of 266 test cases (approximately 76%), a compound recovery rate of at least 50% was observed with at least the better performing fingerprint and one search strategy. The similarity search profile also revealed several general trends. For example, fingerprint searching was often characterized by an early enrichment of active compounds in database selection sets. In addition, compound activity classes have been categorized according to different similarity search performance levels, which helps to put the results of benchmark calculations into perspective. Therefore, a compendium of activity classes falling into different search performance categories is provided. On the basis of our large-scale investigation, the performance range of state-of-the-art 2D fingerprinting has been delineated for compound data sets directed against a wide spectrum of pharmaceutical targets.
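As a rough illustration of the kind of calculation the abstract describes, the sketch below ranks database compounds by fingerprint similarity to a set of reference compounds and reports the compound recovery rate for a selection set. It is not the authors' implementation. The abstract does not define the three nearest neighbor strategies, so the k-nearest-neighbor averaging used here, the Tanimoto coefficient, the function names, and the toy bit sets are all assumptions made for illustration.

```python
from typing import FrozenSet, Sequence

# A fingerprint is modeled as the set of bit positions that are switched on.
# (Assumption: real MACCS or ECFP4 fingerprints would come from a cheminformatics toolkit.)
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_score(candidate: Fingerprint, references: Sequence[Fingerprint], k: int) -> float:
    """Mean of the k highest similarities to the references (references must be non-empty)."""
    sims = sorted((tanimoto(candidate, ref) for ref in references), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def recovery_rate(
    database: Sequence[Fingerprint],
    is_active: Sequence[bool],
    references: Sequence[Fingerprint],
    k: int,
    selection_size: int,
) -> float:
    """Fraction of all active database compounds found in the top-ranked selection set."""
    ranked = sorted(
        range(len(database)),
        key=lambda i: knn_score(database[i], references, k),
        reverse=True,
    )
    selected = ranked[:selection_size]
    return sum(is_active[i] for i in selected) / sum(is_active)


# Toy data, invented purely for illustration.
references = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5})]
database = [
    frozenset({1, 2, 3, 5}),   # active
    frozenset({2, 3, 4, 6}),   # active
    frozenset({7, 8, 9}),      # inactive
    frozenset({1, 8, 9, 10}),  # inactive
]
is_active = [True, True, False, False]

rate = recovery_rate(database, is_active, references, k=1, selection_size=2)
print(f"Recovery rate: {rate:.0%}")
```

With k=1 each database compound is scored by its single most similar reference compound; raising k averages over more references, which is one common way to vary a nearest neighbor search strategy.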
Pages: 1831-1839
Page count: 9