Scaling Word2Vec on Big Corpus

Cited by: 0
Authors
Bofang Li
Aleksandr Drozd
Yuhe Guo
Tao Liu
Satoshi Matsuoka
Xiaoyong Du
Institutions
[1] Renmin University of China
[2] Tokyo Institute of Technology
[3] AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory
[4] RIKEN Center for Computational Science
Source
Data Science and Engineering
Keywords
Machine learning; Natural language processing; High performance computing; Word embeddings
DOI
Not available
Abstract
Word embedding is well established as an important feature in natural language processing (NLP). In particular, the Word2Vec model learns high-quality word embeddings and is widely used in a variety of NLP tasks. Training Word2Vec is sequential on a CPU because of the strong dependencies between word–context pairs. In this paper, we aim to scale Word2Vec on a GPU cluster. The main challenge is to reduce the dependencies inside a large training batch. We heuristically design a variation of Word2Vec which ensures that each word–context pair contains a non-dependent word and a uniformly sampled contextual word. During batch training, we "freeze" the context part and update only the non-dependent part to reduce conflicts. This variation also controls the number of training iterations directly, by fixing the number of samples, and treats high-frequency and low-frequency words equally. We conduct extensive experiments over a range of NLP tasks. The results show that the proposed model achieves a 7.5-fold speedup on 16 GPUs with no drop in accuracy. Moreover, using the high-level Chainer deep learning framework, we can easily implement Word2Vec variations such as CNN-based subword-level models and achieve similar scaling results.
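The abstract describes the key trick only at a high level, so the sketch below illustrates one batch step of the described variant in plain NumPy. Everything in it (the matrix names W_tgt and W_ctx, the hyperparameters, and the standard skip-gram negative-sampling gradient) is an illustrative assumption layered on the abstract's description, not the authors' implementation, which used Chainer on GPUs.

    # Sketch of one batch step of the Word2Vec variant from the abstract:
    # each target word is paired with one uniformly sampled contextual word,
    # and the context side is "frozen" within the batch so that only the
    # non-dependent (target) side is written, avoiding update conflicts.
    # All names and hyperparameters here are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim, neg_k, lr = 10_000, 100, 5, 0.025

    W_tgt = rng.normal(0.0, 0.01, (vocab_size, dim))  # target vectors (updated)
    W_ctx = rng.normal(0.0, 0.01, (vocab_size, dim))  # context vectors (frozen per batch)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_batch(targets, contexts):
        """targets, contexts: int arrays of shape (B,), one uniformly
        sampled contextual word per target word."""
        B = len(targets)
        # One positive context plus neg_k uniformly drawn negatives per pair.
        neg = rng.integers(0, vocab_size, size=(B, neg_k))
        ctx_ids = np.concatenate([contexts[:, None], neg], axis=1)  # (B, 1+neg_k)
        labels = np.zeros((B, 1 + neg_k))
        labels[:, 0] = 1.0

        v_t = W_tgt[targets]          # (B, dim)
        v_c = W_ctx[ctx_ids]          # (B, 1+neg_k, dim), read-only this batch
        scores = sigmoid(np.einsum("bd,bkd->bk", v_t, v_c))
        grad = labels - scores        # negative-sampling gradient signal

        # Only the target side is written; since no row writes to W_ctx,
        # the batch rows can be processed in parallel (e.g., on a GPU)
        # without conflicts. How W_ctx is refreshed between batches is not
        # specified in the abstract, so it is omitted from this sketch.
        np.add.at(W_tgt, targets, lr * np.einsum("bk,bkd->bd", grad, v_c))

Under this reading, drawing a fixed number of uniformly sampled contexts per word, rather than walking the corpus, is what lets the total sample count be fixed in advance: the iteration count is controlled directly, and frequent and rare words receive the same number of updates, matching the abstract's claim.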
Pages: 157-175
Number of pages: 18
Related Papers
50 records in total
  • [1] Scaling Word2Vec on Big Corpus
    Li, Bofang
    Drozd, Aleksandr
    Guo, Yuhe
    Liu, Tao
    Matsuoka, Satoshi
    Du, Xiaoyong
    [J]. DATA SCIENCE AND ENGINEERING, 2019, 4 (02) : 157 - 175
  • [2] Using Word2Vec to Process Big Text Data
    Ma, Long
    Zhang, Yanqing
    [J]. PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2015, : 2895 - 2897
  • [3] Modelling of Topic from Hindi Corpus using Word2Vec
    Panigrahi, Sabitra Sankalp
    Panigrahi, Narayan
    Paul, Biswajit
    [J]. 2018 SECOND INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, CONTROL AND COMMUNICATION TECHNOLOGY (IAC3T), 2018, : 97 - 100
  • [4] The Spectral Underpinning of word2vec
    Jaffe, Ariel
    Kluger, Yuval
    Lindenbaum, Ofir
    Patsenker, Jonathan
    Peterfreund, Erez
    Steinerberger, Stefan
    [J]. FRONTIERS IN APPLIED MATHEMATICS AND STATISTICS, 2020, 6
  • [5] Emerging Trends Word2Vec
    Church, Kenneth Ward
    [J]. NATURAL LANGUAGE ENGINEERING, 2017, 23 (01) : 155 - 162
  • [6] Stability of Word Embeddings Using Word2Vec
    Chugh, Mansi
    Whigham, Peter A.
    Dick, Grant
    [J]. AI 2018: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, 11320 : 812 - 818
  • [7] Word2vec for Arabic Word Sense Disambiguation
    Laatar, Rim
    Aloulou, Chafik
    Belghuith, Lamia Hadrich
    [J]. NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2018), 2018, 10859 : 308 - 311
  • [8] Corpus specificity in LSA and word2vec: the role of out-of-domain documents
    Altszyler, Edgar
    Sigman, Mariano
    Slezak, Diego Fernandez
    [J]. REPRESENTATION LEARNING FOR NLP, 2018, : 1 - 10
  • [9] Considerations about learning Word2Vec
    Di Gennaro, Giovanni
    Buonanno, Amedeo
    Palmieri, Francesco A. N.
    [J]. THE JOURNAL OF SUPERCOMPUTING, 2021, 77 (11) : 12320 - 12335