Wrangling Categorical Data in R

Authors
McNamara, Amelia [1]
Horton, Nicholas J. [2]
Affiliations
[1] Smith Coll, Program Stat & Data Sci, 215 Burton Hall, Northampton, MA 01063 USA
[2] Amherst Coll, Dept Math & Stat, Amherst, MA 01002 USA
Source
AMERICAN STATISTICIAN | 2018 / Vol. 72 / Issue 01
Keywords
Data derivation; Data management; Data science; Statistical computing
DOI
10.1080/00031305.2017.1356375
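Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract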
Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This article discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the "tidyverse." We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis. Supplementary materials for this article are available online.
Pages: 97-104
Page count: 8