Wrangling Categorical Data in R

Authors
McNamara, Amelia [1]
Horton, Nicholas J. [2]
Affiliations
[1] Smith Coll, Program Stat & Data Sci, 215 Burton Hall, Northampton, MA 01063 USA
[2] Amherst Coll, Dept Math & Stat, Amherst, MA 01002 USA
Source
AMERICAN STATISTICIAN | 2018 / Vol. 72 / Issue 01
Keywords
Data derivation; Data management; Data science; Statistical computing;
DOI
10.1080/00031305.2017.1356375
Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract
Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This article discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the "tidyverse." We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis. Supplementary materials for this article are available online.
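As an illustration of the kind of factor pitfall the abstract refers to, here is a minimal sketch; it is not taken from the article. The variable names (`age`, `party`) and the data values are hypothetical, and the choice of `forcats::fct_collapse()` as the tidyverse counterpart is an assumption about which function one might reach for. The sketch shows a numeric-looking factor being converted incorrectly and then correctly, followed by a level-collapsing recode in base R with an optional tidyverse equivalent.

```r
# Hypothetical data: ages stored as a factor (e.g., after a careless import).
age <- factor(c("25", "40", "25", "61"))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the values shown when the factor is printed.
codes <- as.numeric(age)

# Base R: convert through character first to recover the actual values.
age_num <- as.numeric(as.character(age))

# Defensive check: stop if any value failed to parse as a number.
stopifnot(!anyNA(age_num))

# Base R recoding: collapse levels by assigning a named list to levels().
party <- factor(c("Dem", "Rep", "Ind", "Dem"))
levels(party) <- list(Major = c("Dem", "Rep"), Other = "Ind")

# Tidyverse counterpart (assumed choice), run only if forcats is installed.
if (requireNamespace("forcats", quietly = TRUE)) {
  party_tidy <- forcats::fct_collapse(
    factor(c("Dem", "Rep", "Ind", "Dem")),
    Major = c("Dem", "Rep"),
    Other = "Ind"
  )
}
```

The `stopifnot()` call is one simple form of defensive coding: if an updated version of the data introduces a non-numeric entry, the script halts instead of silently producing `NA` values.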
Pages: 97-104
Number of pages: 8