Scaling Word2Vec on Big Corpus

Cited by: 0
Authors
Bofang Li
Aleksandr Drozd
Yuhe Guo
Tao Liu
Satoshi Matsuoka
Xiaoyong Du
Affiliations
[1] Renmin University of China
[2] Tokyo Institute of Technology
[3] AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory
[4] RIKEN Center for Computational Science
Keywords
Machine learning; Natural language processing; High performance computing; Word embeddings
Abstract
Word embedding is well established as an important feature in natural language processing (NLP). In particular, the Word2Vec model learns high-quality word embeddings and is widely used in a variety of NLP tasks. Training Word2Vec is sequential on a CPU because of strong dependencies between word–context pairs. In this paper, we aim to scale Word2Vec on a GPU cluster. The main challenge is reducing dependencies within a large training batch. We heuristically design a variation of Word2Vec that ensures each word–context pair contains a non-dependent word and a uniformly sampled contextual word. During batch training, we "freeze" the context part and update only the non-dependent part, which reduces conflicts. This variation also directly controls the number of training iterations by fixing the number of samples, and it treats high-frequency and low-frequency words equally. We conduct extensive experiments over a range of NLP tasks. The results show that our proposed model achieves a 7.5-fold speedup on 16 GPUs with no drop in accuracy. Moreover, using the high-level Chainer deep learning framework, we can easily implement Word2Vec variations such as CNN-based subword-level models and achieve similar scaling results.
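The "frozen context" idea in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the vocabulary size, embedding dimension, learning rate, and uniform negative sampling are illustrative assumptions. Within a batch, the context matrix is treated as read-only, and only the rows of the input matrix indexed by the batch's target words are written, so parallel workers handling disjoint targets do not conflict.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 64                      # vocabulary size, embedding dimension (illustrative)
W_in = rng.normal(0, 0.1, (V, D))    # target ("non-dependent") word vectors
W_out = rng.normal(0, 0.1, (V, D))   # context word vectors, frozen within a batch

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_update(targets, contexts, n_neg=5, lr=0.025):
    """One skip-gram-with-negative-sampling batch step.

    W_out is read but never written during the batch ("frozen"),
    so only W_in rows indexed by `targets` are updated.
    """
    for t, c in zip(targets, contexts):
        # one positive context plus uniformly sampled negatives
        samples = np.concatenate(([c], rng.integers(0, V, n_neg)))
        labels = np.zeros(len(samples))
        labels[0] = 1.0
        ctx = W_out[samples]              # frozen snapshot of context vectors
        scores = sigmoid(ctx @ W_in[t])   # predicted probabilities
        grad = (labels - scores) @ ctx    # gradient w.r.t. W_in[t] only
        W_in[t] += lr * grad              # write only the target row

targets = rng.integers(0, V, 128)
contexts = rng.integers(0, V, 128)
W_in_before, W_out_before = W_in.copy(), W_out.copy()
batch_update(targets, contexts)
```

After the batch, `W_out` is bit-identical to its snapshot and only the target rows of `W_in` have moved, which is what makes the update safe to distribute across GPUs.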
Pages: 157-175 (18 pages)