Faster Parallel Training of Word Embeddings

Cited by: 0
Authors
Wszola, Eliza [1 ]
Jaggi, Martin [2 ]
Püschel, Markus [1 ]
Affiliations
[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland
[2] Ecole Polytech Fed Lausanne, Sch Comp & Commun Sci, Lausanne, Switzerland
Keywords
machine learning; natural language processing; parallel computing; performance; word2vec; fasttext;
DOI
10.1109/HiPC53243.2021.00017
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Word embeddings have gained increasing popularity in recent years due to the Word2vec library and its extension fastText, which uses subword information. In this paper, we aim at improving the execution speed of fastText training on homogeneous multi- and manycore CPUs while maintaining accuracy. We present a novel open-source implementation that flexibly incorporates various algorithmic variants, including negative sample sharing, batched updates, and a byte-pair encoding-based alternative for subword units. We build these novel variants over a fastText implementation that we carefully optimized for the architecture, memory hierarchy, and parallelism of current manycore CPUs. Our experiments on three languages demonstrate 3-20x speed-up in training time at competitive semantic and syntactic accuracy.
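To illustrate one of the algorithmic variants the abstract names, the sketch below shows skip-gram training with negative sampling where a single set of negative samples is drawn per batch and reused for every (center, context) pair in that batch. This is a minimal, illustrative rendering of the negative-sample-sharing idea only; the function and variable names are hypothetical and it is not the paper's optimized implementation (which targets manycore CPUs and fastText subwords).

```python
import math
import random

random.seed(0)

VOCAB, DIM = 20, 8
# Input ("center") and output ("context") embedding tables.
W_in = [[random.uniform(-0.5, 0.5) / DIM for _ in range(DIM)] for _ in range(VOCAB)]
W_out = [[0.0] * DIM for _ in range(VOCAB)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_batch(pairs, num_neg=5, lr=0.05):
    """SGD update for a batch of (center, context) word-index pairs.

    One set of negative samples is drawn per batch and shared by
    all pairs -- the 'negative sample sharing' variant. (A real
    implementation would sample negatives from a unigram^0.75
    distribution; uniform sampling here keeps the sketch short.)
    """
    negatives = [random.randrange(VOCAB) for _ in range(num_neg)]
    for center, context in pairs:
        v = W_in[center]
        grad_v = [0.0] * DIM
        # The true context word gets label 1; shared negatives get label 0.
        for target, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = W_out[target]
            score = sigmoid(sum(vi * ui for vi, ui in zip(v, u)))
            g = lr * (label - score)
            for i in range(DIM):
                grad_v[i] += g * u[i]   # accumulate gradient w.r.t. center vector
                u[i] += g * v[i]        # update output vector in place
        for i in range(DIM):
            v[i] += grad_v[i]

# Tiny usage example: three pairs share one negative sample set.
train_batch([(1, 2), (1, 3), (4, 2)])
```

Sharing negatives amortizes the cost of sampling and of touching the corresponding `W_out` rows across the whole batch, which improves cache locality at a small cost in gradient noise.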
Pages: 31-41
Page count: 11
Related Papers
50 records
  • [1] Parallel Data-Local Training for Optimizing Word2Vec Embeddings for Word and Graph Embeddings
    Moon, Gordon E.
    Newman-Griffis, Denis
    Kim, Jinsung
    Sukumaran-Rajam, Aravind
    Fosler-Lussier, Eric
    Sadayappan, P.
    [J]. PROCEEDINGS OF 2019 5TH IEEE/ACM WORKSHOP ON MACHINE LEARNING IN HIGH PERFORMANCE COMPUTING ENVIRONMENTS (MLHPC 2019), 2019, : 44 - 55
  • [2] Learning Word Embeddings in Parallel by Alignment
    Zubair, Sahil
    Zubair, Mohammad
    [J]. 2017 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2017, : 566 - 571
  • [3] Faster Training by Selecting Samples Using Embeddings
    Gonzalez, Santiago
    Landgraf, Joshua
    Miikkulainen, Risto
    [J]. 2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [4] Multilingual Training of Crosslingual Word Embeddings
    Duong, Long
    Kanayama, Hiroshi
    Ma, Tengfei
    Bird, Steven
    Cohn, Trevor
    [J]. 15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2017), VOL 1: LONG PAPERS, 2017, : 894 - 904
  • [5] A survey on training and evaluation of word embeddings
    Torregrossa, Francois
    Allesiardo, Robin
    Claveau, Vincent
    Kooli, Nihel
    Gravier, Guillaume
    [J]. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2021, 11 (02) : 85 - 103
  • [7] Training Temporal Word Embeddings with a Compass
    Di Carlo, Valerio
    Bianchi, Federico
    Palmonari, Matteo
    [J]. THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 6326 - 6334
  • [8] Quantifying Context Overlap for Training Word Embeddings
    Zhuang, Yimeng
    Xie, Jinghui
    Zheng, Yinhe
    Zhu, Xuan
    [J]. 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 587 - 593
  • [9] Unsupervised Joint Training of Bilingual Word Embeddings
    Marie, Benjamin
    Fujita, Atsushi
    [J]. 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3224 - 3230
  • [10] Asynchronous Training of Word Embeddings for Large Text Corpora
    Anand, Avishek
    Khosla, Megha
    Singh, Jaspreet
    Zab, Jan-Hendrik
    Zhang, Zijian
    [J]. PROCEEDINGS OF THE TWELFTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'19), 2019, : 168 - 176