Measuring the quality of projections of high-dimensional labeled data

Cited by: 1

Authors:

Benato, Barbara C. [1]

Falcao, Alexandre X. [1]

Telea, Alexandru C. [2]

Affiliations:

[1] Univ Estadual Campinas, Inst Comp, Ave Albert Einstein 1251, BR-13083852 Campinas, Brazil

[2] Univ Utrecht, Fac Sci, Dept Informat & Comp Sci, Utrecht, Netherlands

Source:

COMPUTERS & GRAPHICS-UK | 2023 / Volume 116

Funding:

São Paulo Research Foundation, Brazil;

Keywords:

Quality of projections; Labeled data; Pseudo labeling; REDUCTION; ALGORITHMS;

DOI:

10.1016/j.cag.2023.08.023

Chinese Library Classification number:

TP31 [Computer software];

Subject classification code:

081202 ; 0835 ;

Abstract:

Dimensionality reduction techniques, also called projections, are one of the main tools for visualizing high-dimensional data. To compare such techniques, several quality metrics have been proposed. However, such metrics may not capture the visual separation among groups/classes of samples in a projection, i.e., having groups of similar (same label) points far from other (distinct label) groups of points. For this, we propose a pseudo-labeling mechanism to assess visual separation using the performance of a semi-supervised optimum-path forest classifier (OPFSemi), measured by Cohen's Kappa. We argue that lower label propagation errors by OPFSemi in projections are related to higher data/visual separation. OPFSemi explores local and global information of data distribution when computing optimum connectivity between samples in a projection for label propagation. It is parameter-free, fast to compute, easy to implement, and generically handles any high-dimensional quantitative labeled dataset and projection technique. We compare our approach with four commonly used scalar metrics in the literature for 18 datasets and 39 projection techniques. Our results show that our proposed metric consistently scores values in line with the perceived visual separation, surpassing existing projection-quality metrics in this respect. (c) 2023 Elsevier Ltd. All rights reserved.
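
As a rough illustration of the idea in the abstract, the sketch below scores a 2D projection by propagating labels from a few labeled seed points to the remaining points and measuring agreement with the true labels using Cohen's Kappa. It does not implement OPFSemi: the nearest-seed propagation rule and the names `propagate_nearest` and `separation_score` are hypothetical stand-ins chosen only to keep the example short and dependency-free.

```python
import math
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement between true and propagated labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: one class on both sides
    return (p_o - p_e) / (1.0 - p_e)


def propagate_nearest(points, labels, seed_idx):
    """Assumed stand-in for OPFSemi (NOT the paper's classifier):
    each non-seed point takes the label of its nearest seed point."""
    seeds = sorted(seed_idx)
    propagated = []
    for i, p in enumerate(points):
        if i in seed_idx:
            propagated.append(labels[i])
        else:
            nearest = min(seeds, key=lambda s: math.dist(p, points[s]))
            propagated.append(labels[nearest])
    return propagated


def separation_score(points, labels, seed_idx):
    """Kappa of propagated vs. true labels, evaluated on non-seed points only."""
    seed_idx = set(seed_idx)
    propagated = propagate_nearest(points, labels, seed_idx)
    rest = [i for i in range(len(points)) if i not in seed_idx]
    return cohen_kappa([labels[i] for i in rest], [propagated[i] for i in rest])


# Toy projection: two compact groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
separated = [0, 0, 0, 0, 1, 1, 1, 1]  # labels follow the visual groups
mixed = [0, 1, 1, 0, 1, 0, 0, 1]      # labels scattered across both groups
print(separation_score(points, separated, {0, 4}))
print(separation_score(points, mixed, {0, 4}))
```

Under these assumptions, a projection whose labels follow the visual groups yields a higher kappa than one whose labels are scattered across groups, which mirrors the relation between propagation error and visual separation described above.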


Pages: 287 - 297

Number of pages: 11

