Large-scale text processing pipeline with Apache Spark

Authors
Svyatkovskiy, A. [1, 2, 3]
Imai, K. [1]
Kroeger, M. [1]
Shiraito, Y. [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
[2] Princeton Univ, Dept Polit, Princeton, NJ 08544 USA
[3] Princeton Univ, Ctr Stat & Machine Learning, Princeton, NJ 08544 USA
Keywords
Spark; Avro; Spark ML; Spark GraphFrames; INNOVATIONS; DIFFUSION; STATES
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case is the detection of policy diffusion across state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills because of the computational cost; as a substitute, scholars have studied single topic areas. We implement this analysis workflow as a distributed text processing pipeline using Spark DataFrames and the Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
Pages: 3928-3935
Page count: 8