Large-scale text processing pipeline with Apache Spark

Authors
Svyatkovskiy, A. [1, 2, 3]
Imai, K. [1]
Kroeger, M. [1]
Shiraito, Y. [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
[2] Princeton Univ, Dept Polit, Princeton, NJ 08544 USA
[3] Princeton Univ, Ctr Stat & Machine Learning, Princeton, NJ 08544 USA
Keywords
Spark; Avro; Spark ML; Spark GraphFrames; INNOVATIONS; DIFFUSION; STATES;
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case is the detection of policy diffusion across the state legislatures of the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills because of the computational cost; as a substitute, scholars have studied single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline using Spark DataFrames and the Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
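To illustrate why an all-pairs comparison between bills becomes expensive, the following is a minimal single-machine sketch in Python. It is not the authors' implementation, which the abstract describes as built on Spark and Scala. The similarity measure (Jaccard similarity over word 3-grams), the function names, and the sample bill texts are all assumptions made for illustration; the abstract does not say which measure the paper uses. The point is only that n bills produce n(n-1)/2 pairs, which motivates distributing the work.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def all_pairs_similarity(bills, n=3, threshold=0.5):
    """Compare every pair of bills and keep pairs at or above the threshold.

    `bills` maps a (hypothetical) bill id to its text. The loop visits
    len(bills) * (len(bills) - 1) / 2 pairs, so cost grows quadratically.
    """
    sets = {bill_id: shingles(text, n) for bill_id, text in bills.items()}
    pairs = []
    for left, right in combinations(sorted(sets), 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs


# Hypothetical sample data, invented for this sketch.
bills = {
    "A": "an act relating to the regulation of firearms in public schools",
    "B": "an act relating to the regulation of firearms in public parks",
    "C": "a bill concerning water rights for agricultural use",
}
result = all_pairs_similarity(bills, n=3, threshold=0.5)
```

Here only bills "A" and "B" share enough shingles to pass the threshold. In a distributed setting the same pairwise step would be partitioned across workers, and the resulting pairs could be treated as edges of a similarity graph, in line with the graph processing the abstract mentions.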
Pages: 3928 - 3935
Page count: 8