Large-Scale Network Embedding in Apache Spark

Cited by: 10

Authors:

Lin, Wenqing [1]

Affiliations:

[1] Tencent, Interact Entertainment Grp, Shenzhen, Guangdong, Peoples R China

Source:

KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING | 2021

Keywords:

network embedding; distributed computing; graph partitioning

DOI:

10.1145/3447548.3467136

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Network embedding has been widely used in social recommendation and network analysis, for example in recommendation systems and graph-based anomaly detection. However, most previous approaches cannot handle large graphs efficiently, because (i) computation on graphs is often costly and (ii) the graph or the intermediate vector results can be prohibitively large, making them difficult to process on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark. The algorithm recursively partitions a graph into several small subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the node embeddings at linear cost. We demonstrate in various experiments that the proposed approach handles graphs with billions of edges within a few hours and is at least 4 times faster than state-of-the-art approaches. It also achieves improvements of up to 4.25% and 4.27% on link prediction and node classification tasks, respectively. Finally, we deploy the proposed algorithms in two Tencent online games for friend recommendation and item recommendation, where they improve on competing methods by up to 91.11% in running time and up to 12.80% in the corresponding evaluation metrics.
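The partition, embed-in-parallel, aggregate pipeline described in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's algorithm: the BFS-based bisection, the toy neighbor-averaging embedding, the function names, and the parameters `max_size`, `dim`, and `rounds` are all assumptions made for illustration, node IDs are assumed to be integers, and a thread pool stands in for Spark executors.

```python
import random
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bfs_order(nodes, adj):
    """Order nodes by BFS restricted to `nodes`, so nearby nodes stay close."""
    node_set, seen, order = set(nodes), set(), []
    for start in sorted(node_set):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u]):
                if v in node_set and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return order


def partition(nodes, adj, max_size):
    """Recursively bisect until every part holds at most max_size nodes."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if len(nodes) <= max_size:
        return [list(nodes)]
    order = bfs_order(nodes, adj)
    mid = len(order) // 2
    return partition(order[:mid], adj, max_size) + partition(order[mid:], adj, max_size)


def embed_subgraph(part, adj, dim=4, rounds=2, seed=0):
    """Toy embedding (placeholder): average random vectors over in-part edges."""
    inside = set(part)
    vecs = {}
    for u in part:
        rng = random.Random(seed * 1_000_003 + u)  # assumes integer node IDs
        vecs[u] = [rng.uniform(-1, 1) for _ in range(dim)]
    for _ in range(rounds):
        smoothed = {}
        for u in part:
            group = [vecs[u]] + [vecs[v] for v in adj[u] if v in inside]
            smoothed[u] = [sum(col) / len(group) for col in zip(*group)]
        vecs = smoothed
    return vecs


def embed_graph(edges, max_size=3, dim=4):
    """Partition the graph, embed each part in parallel, then merge the results."""
    adj = build_adjacency(edges)
    parts = partition(sorted(adj), adj, max_size)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: embed_subgraph(p, adj, dim), parts))
    embeddings = {}
    for part_vecs in results:  # parts are disjoint, so merging touches each node once
        embeddings.update(part_vecs)
    return parts, embeddings


# Two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
parts, embeddings = embed_graph(edges, max_size=3, dim=4)
```

Because the parts are disjoint, the merge step visits each node once, which mirrors the linear-cost aggregation the abstract mentions. The sketch ignores cross-partition edges during embedding, so it does not capture the external structural information that the paper's method is described as preserving.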


Pages: 3271-3279

Page count: 9

50 items in total

[41] Large-Scale Text Similarity Computing with Spark
Bao, Xiaoan
Dai, Shichao
Zhang, Na
Yu, Chenghai
[J]. INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2016, 9 (04): 95-100
[42] A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
N. Ahmed
Andre L. C. Barczak
Teo Susnjak
Mohammed A. Rashid
[J]. Journal of Big Data, 7
[43] A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
Ahmed, N.
Barczak, Andre L. C.
Susnjak, Teo
Rashid, Mohammed A.
[J]. JOURNAL OF BIG DATA, 2020, 7 (01)
[44] Understanding Coarsening for Embedding Large-Scale Graphs
Akyildiz, Taha Atahan
Aljundi, Amro Alabsi
Kaya, Kamer
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020: 2937-2946
[45] Decentralized Embedding Framework for Large-Scale Networks
Imran, Mubashir
Yin, Hongzhi
Chen, Tong
Shao, Yingxia
Zhang, Xiangliang
Zhou, Xiaofang
[J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2020), PT III, 2020, 12114: 425-441
[46] Gaussian Embedding of Large-Scale Attributed Graphs
Hettige, Bhagya
Li, Yuan-Fang
Wang, Weiqing
Buntine, Wray
[J]. DATABASES THEORY AND APPLICATIONS, ADC 2020, 2020, 12008: 134-146
[47] Large-Scale Clustering through Functional Embedding
Ratle, Frederic
Weston, Jason
Miller, Matthew L.
[J]. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PART II, PROCEEDINGS, 2008, 5212: 266-+
[48] Large-scale prediction of adverse drug reactions-related proteins with network embedding
Park, Jaesub
Lee, Sangyeon
Kim, Kwansoo
Jung, Jaegyun
Lee, Doheon
[J]. BIOINFORMATICS, 2023, 39 (01)
[49] Gated Multi-channel Network Embedding for Large-scale Mobile App Clustering
Yoon, Yeo-Chan
Kim, Soo Kyun
[J]. KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2023, 17 (06): 1620-1634
[50] Train rescheduling for large-scale disruptions in a large-scale railway network
Zhang, Chuntian
Gao, Yuan
Cacchiani, Valentina
Yang, Lixing
Gao, Ziyou
[J]. TRANSPORTATION RESEARCH PART B-METHODOLOGICAL, 2023, 174
