Large-Scale Data Classification System Based on Galaxy Server and Protected from Information Leak

Cited by: 7
Authors
Fujarewicz, Krzysztof [1 ,2 ]
Student, Sebastian [1 ,2 ]
Zielanski, Tomasz [1 ,2 ]
Jakubczak, Michal [1 ,2 ]
Pieter, Justyna [1 ,2 ]
Pojda, Katarzyna [1 ,2 ]
Swierniak, Andrzej [1 ,2 ]
Affiliations
[1] Silesian Tech Univ, Inst Automat Control, Ul Akad 16, PL-44100 Gliwice, Poland
[2] Silesian Tech Univ, Biotechnol Ctr, Ul Krzywoustego 8, PL-44100 Gliwice, Poland
Keywords
Machine learning; Information leak; Galaxy Server; Classification; Feature selection; Model validation; Model selection; Large-scale data; Small-sample data; Genomic data; Proteomic data; PAPILLARY THYROID-CARCINOMA; SUPPORT VECTOR MACHINES; GENE-EXPRESSION DATA; DNA MICROARRAY DATA; SELECTION; CANCER
DOI
10.1007/978-3-319-54430-4_73
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this work we present SPICY (SPecialized Classification sYstem), an application for supervised data analysis (feature selection, classification, model validation and model selection) whose structure protects the data-processing workflow from the so-called information leak. An information leak may result in an optimistically biased assessment of classification quality, especially for large-scale, small-sample data sets. The application runs in the Galaxy Server environment, which by default lets the user process data manually and does not protect against information leak. The way the user builds the classification model, together with the specific structure of all implemented methods, makes information leak impossible. The absence of information leak in the presented supervised data analysis tool is demonstrated on numerical examples using synthetic and real data sets.
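The optimistic bias described in the abstract can be illustrated with a minimal sketch. This is not SPICY's implementation and does not use Galaxy; it is a pure-Python toy under stated assumptions: synthetic Gaussian noise with labels unrelated to the features, a mean-difference feature ranking, a nearest-centroid classifier, and leave-one-out validation. All function names and parameter values are hypothetical. Selecting features on the full data set before validation leaks test information into the model, while repeating the selection inside each validation fold does not.

```python
import random


def make_noise_data(n_samples, n_features, seed):
    """Pure-noise data: balanced labels that carry no information about X."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def select_features(X, y, k):
    """Return indices of the k features with the largest |class-mean difference|."""
    n1 = sum(y)
    n0 = len(y) - n1
    scores = []
    for j in range(len(X[0])):
        m0 = sum(row[j] for row, lab in zip(X, y) if lab == 0) / n0
        m1 = sum(row[j] for row, lab in zip(X, y) if lab == 1) / n1
        scores.append((abs(m1 - m0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid (on the chosen features) is closer."""
    dist = {}
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in features]
        dist[label] = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
    return 0 if dist[0] <= dist[1] else 1


def loo_accuracy(X, y, k, select_inside_loop):
    """Leave-one-out accuracy, with feature selection inside or outside the loop."""
    # Leaky variant: features are chosen once, using every sample (test included).
    global_features = None if select_inside_loop else select_features(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        if select_inside_loop:
            # Leak-free variant: the held-out sample never influences selection.
            features = select_features(X_train, y_train, k)
        else:
            features = global_features
        correct += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return correct / len(X)


# Small-sample, large-scale setting: 30 samples, 1000 noise features.
X, y = make_noise_data(n_samples=30, n_features=1000, seed=0)
leaky_acc = loo_accuracy(X, y, k=10, select_inside_loop=False)
honest_acc = loo_accuracy(X, y, k=10, select_inside_loop=True)
print(f"selection outside validation loop: {leaky_acc:.2f}")
print(f"selection inside validation loop:  {honest_acc:.2f}")
```

Because the labels are unrelated to the features, no classifier can truly do better than chance here. The leaky estimate is expected to come out far above that level, while the leak-free estimate should stay near it, which is the kind of bias a leak-protected workflow is meant to rule out.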
Pages: 765-773
Number of pages: 9