Visualization of very large high-dimensional data sets as minimum spanning trees

Cited by: 0
Authors
Daniel Probst
Jean-Louis Reymond
Affiliation
[1] University of Bern, Department of Chemistry and Biochemistry
Source
Journal of Cheminformatics, 2020, 12 (01)
Keywords
Data visualization; Chemistry databases; Algorithms; Big data; Dimensionality reduction
DOI
Not available
Abstract
The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.
Related papers
  • [1] Visualization of very large high-dimensional data sets as minimum spanning trees
    Probst, Daniel
    Reymond, Jean-Louis
    [J]. JOURNAL OF CHEMINFORMATICS, 2020, 12 (01)
  • [2] Very Fast Interactive Visualization of Large Sets of High-dimensional Data
    Dzwinel, Witold
    Wcislo, Rafal
    [J]. INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, ICCS 2015 COMPUTATIONAL SCIENCE AT THE GATES OF NATURE, 2015, 51 : 572 - 581
  • [3] Outlier mining in large high-dimensional data sets
    Angiulli, F
    Pizzuti, C
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (02) : 203 - 215
  • [4] High-dimensional data visualization
    Tang, Lin
    [J]. NATURE METHODS, 2020, 17 (02) : 129 - 129
  • [5] High-dimensional data visualization
    Lin Tang
    [J]. Nature Methods, 2020, 17 : 129 - 129
  • [6] GAUSSIAN PROCESSES FOR HIGH-DIMENSIONAL, LARGE DATA SETS: A REVIEW
    Jiang, Mengrui
    Pedrielli, Giulia
    Szu Hui Ng
    [J]. 2022 WINTER SIMULATION CONFERENCE (WSC), 2022, : 49 - 60
  • [7] Approximate minimum spanning tree clustering in high-dimensional space
    Lai, Chih
    Rafa, Taras
    Nelson, Dwight E.
    [J]. INTELLIGENT DATA ANALYSIS, 2009, 13 (04) : 575 - 597
  • [8] Dynamic visualization of high-dimensional data
    Eric D. Sun
    Rong Ma
    James Zou
    [J]. Nature Computational Science, 2023, 3 : 86 - 100
  • [9] Visualization for high-dimensional data: VisHD
    Yang, CC
    Chiang, CC
    Hung, YP
    Lee, GC
    [J]. Ninth International Conference on Information Visualisation, Proceedings, 2005, : 692 - 696
  • [10] Visualization and data mining of high-dimensional data
    Inselberg, A
    [J]. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2002, 60 (1-2) : 147 - 159