Large-Scale Network Embedding in Apache Spark

Cited by: 10

Authors:

Lin, Wenqing [1]

Affiliation:

[1] Tencent, Interact Entertainment Grp, Shenzhen, Guangdong, Peoples R China

Source:

KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING | 2021

Keywords:

network embedding; distributed computing; graph partitioning

DOI:

10.1145/3447548.3467136

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Network embedding has been widely used in social recommendation and network analysis, for example in recommendation systems and anomaly detection on graphs. However, most previous approaches cannot handle large graphs efficiently, because (i) computation on graphs is often costly and (ii) the graph itself or the intermediate vector results can be prohibitively large, making them difficult to process on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark. The algorithm recursively partitions a graph into several small subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the node embeddings at linear cost. We demonstrate in various experiments that the proposed approach can handle graphs with billions of edges within a few hours and is at least 4 times faster than state-of-the-art approaches. In addition, it achieves improvements of up to 4.25% and 4.27% on link prediction and node classification tasks, respectively. Finally, we deploy the proposed algorithms in two online games at Tencent for friend recommendation and item recommendation, where they improve on competing methods by up to 91.11% in running time and up to 12.80% in the corresponding evaluation metrics.
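The partition, embed-in-parallel, and aggregate pipeline described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm and does not use Apache Spark: the halving partitioner, the two-dimensional "embedding" (internal and external degree), the thread pool, and the function names `recursive_partition`, `embed_subgraph`, and `embed_graph` are all assumptions chosen only to show the control flow on a single machine.

```python
from concurrent.futures import ThreadPoolExecutor


def recursive_partition(nodes, max_size):
    """Halve a node list recursively until every part has at most max_size nodes.

    Stand-in partitioner (assumption): a real system would cut along graph structure.
    """
    if len(nodes) <= max_size:
        return [nodes]
    mid = len(nodes) // 2
    return (recursive_partition(nodes[:mid], max_size)
            + recursive_partition(nodes[mid:], max_size))


def embed_subgraph(part, edges):
    """Toy per-subgraph embedding (assumption): [internal degree, external degree]."""
    members = set(part)
    emb = {v: [0, 0] for v in part}
    for u, v in edges:
        # Visit each undirected edge from both endpoints.
        for a, b in ((u, v), (v, u)):
            if a in members:
                # Index 0 counts neighbors inside the part, index 1 neighbors outside.
                emb[a][0 if b in members else 1] += 1
    return emb


def embed_graph(edges, max_size=2):
    """Partition, embed each part in parallel, then merge the per-part results."""
    nodes = sorted({v for e in edges for v in e})
    parts = recursive_partition(nodes, max_size)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: embed_subgraph(p, edges), parts))
    merged = {}
    for r in results:
        # Parts are disjoint, so merging is one pass over all nodes.
        merged.update(r)
    return merged


if __name__ == "__main__":
    cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(embed_graph(cycle, max_size=2))
```

Because the parts are disjoint, the final merge touches each node once, which mirrors the linear-cost aggregation step the abstract mentions; everything else about the sketch is illustrative only.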

Citation

Pages: 3271-3279

Page count: 9
