Coping with high dimensionality in massive datasets

Cited by: 1
Author
Kettenring, Jon R. [1]
Affiliation
[1] Drew Univ, Charles A Dana Res Inst Scientists Emeriti RISE, Madison, NJ 07940 USA
Keywords
variable selection; reduction of dimensionality; exploratory data analysis
DOI
10.1002/wics.141
Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
A massive dataset is characterized by its size and complexity. In its most basic form, such a dataset can be represented as a collection of n observations on p variables. Aggravation or even impasse can result if either number is huge. The more difficult challenge is usually associated with the case of very high dimensionality or 'big p'. There is a fast-growing literature on how to handle such challenges, but most of it is in a supervised learning context involving a specific objective function, as in regression or classification. Much less is known about effective strategies for more exploratory data analytic activities. The purpose of this article is to put into historical perspective much of the recent research on dimensionality reduction and variable selection in such problems. Examples of applications that have stimulated this research are discussed along with a sampling of the latest methodologies to illustrate the onslaught of creative ideas that have surfaced. From a practitioner's perspective, the most effective strategy may be to emphasize the role of interdisciplinary teamwork with decisions on how best to grapple with high dimensionality emerging from a mixture of statistical thinking and consideration of the circumstances of the application. (C) 2011 John Wiley & Sons, Inc.
Pages: 95-103
Page count: 9