Large-scale text processing pipeline with Apache Spark

Cited by: 0

Authors:

Svyatkovskiy, A. [1, 2, 3]

Imai, K. [1]

Kroeger, M. [1]

Shiraito, Y. [1]

Affiliations:

[1] Princeton Univ, Princeton, NJ 08544 USA

[2] Princeton Univ, Dept Polit, Princeton, NJ 08544 USA

[3] Princeton Univ, Ctr Stat & Machine Learning, Princeton, NJ 08544 USA

Source:

2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2016

Keywords:

Spark; Avro; Spark ML; Spark GraphFrames; INNOVATIONS; DIFFUSION; STATES

DOI:

Not available

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark DataFrames and the Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.

Pages: 3928-3935

Page count: 8

50 items in total

[41] Large-scale MHC peptidomics provides a new outlook on the antigen-processing pipeline
Komov, L.
Kadosh, D. Melamed
Barnea, E.
Admon, A.
FEBS JOURNAL, 2017, 284 : 74 - 75
[42] Data Processing Pipeline of Short-Term Depression Detection with Large-Scale Dataset
Lee, Yonggeon
Noh, Youngtae
Lee, Uichin
2023 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING, BIGCOMP, 2023, : 391 - 392
[43] Automated pipeline framework for processing of large-scale building energy time series data
Khalilnejad, Arash
Karimi, Ahmad M.
Kamath, Shreyas
Haddadian, Rojiar
French, Roger H.
Abramson, Alexis R.
PLOS ONE, 2020, 15 (12):
[44] A Visualization Pipeline for Large-Scale Tractography Data
Kress, James
Anderson, Erik
Childs, Hank
2015 IEEE 5TH SYMPOSIUM ON LARGE DATA ANALYSIS AND VISUALIZATION (LDAV), 2015, : 115 - 123
[45] Emptying of Large-Scale Pipeline by Pressurized Air
Laanearu, Janek
Annus, Ivar
Koppel, Tiit
Bergant, Anton
Vuckovic, Saso
Hou, Qingzhi
Tijsseling, Arris S.
Anderson, Alexander
van't Westende, Jos M. C.
JOURNAL OF HYDRAULIC ENGINEERING, 2012, 138 (12) : 1090 - 1100
[46] Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing
Tiedemann, Jorg
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2014, PT I, 2014, 8403 : 102 - 112
[47] Distributed Classification of Text Documents on Apache Spark Platform
Semberecki, Piotr
Maciejewski, Henryk
ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, ICAISC 2016, 2016, 9692 : 621 - 630
[48] Appraising SPARK on Large-Scale Social Media Analysis
Belcastro, Loris
Marozzo, Fabrizio
Talia, Domenico
Trunfio, Paolo
EURO-PAR 2017: PARALLEL PROCESSING WORKSHOPS, 2018, 10659 : 483 - 495
[49] Parallelism and Partitioning in Large-Scale GAs using Spark
Alterkawi, Laila
Migliavacca, Matteo
PROCEEDINGS OF THE 2019 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE (GECCO'19), 2019, : 736 - 744
[50] Topic modeling for large-scale text data
Li, Xi-ming
Ouyang, Ji-hong
Lu, You
FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2015, 16 (06) : 457 - 465
