Scaling Word2Vec on Big Corpus

Authors:

Bofang Li

Aleksandr Drozd

Yuhe Guo

Tao Liu

Satoshi Matsuoka

Xiaoyong Du

Affiliations:

[1] Renmin University of China

[2] Tokyo Institute of Technology

[3] AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory

[4] RIKEN Center for Computational Science

Source:

Data Science and Engineering | 2019 / Volume 4

Keywords:

Machine learning; Natural language processing; High performance computing; Word embeddings

Abstract:

Word embeddings are well accepted as an important feature in natural language processing (NLP). In particular, the Word2Vec model learns high-quality word embeddings and is widely used in various NLP tasks. Training Word2Vec is sequential on a CPU because of strong dependencies between word–context pairs. In this paper, we aim to scale Word2Vec on a GPU cluster. The main challenge is reducing dependencies inside a large training batch. We heuristically design a variation of Word2Vec which ensures that each word–context pair contains a non-dependent word and a uniformly sampled contextual word. During batch training, we “freeze” the context part and update only the non-dependent part to reduce conflicts. This variation also directly controls the number of training iterations by fixing the number of samples, and it treats high-frequency and low-frequency words equally. We conduct extensive experiments over a range of NLP tasks. The results show that our proposed model achieves a 7.5 times acceleration on 16 GPUs without a drop in accuracy. Moreover, by using the high-level Chainer deep learning framework, we can easily implement Word2Vec variations such as CNN-based subword-level models and achieve similar scaling results.
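
The batch update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a skip-gram negative-sampling loss, uses NumPy rather than Chainer, and the names `batch_step`, `W`, `C`, `words`, `contexts`, and `negatives` are hypothetical. Each word appears at most once per batch, the context matrix `C` is left untouched, and only the word rows of `W` are updated.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def batch_step(W, C, words, contexts, negatives, lr=0.05):
    """One batch update that modifies word vectors W in place; C stays frozen.

    words:     (B,) distinct word ids, so no two pairs write to the same row of W
    contexts:  (B,) one sampled context id per word
    negatives: (B, K) negative context ids per pair (assumed loss, see lead-in)
    """
    words = np.asarray(words)
    assert len(np.unique(words)) == len(words), "words in a batch must be distinct"

    w = W[words]            # (B, D) copies of the word vectors
    c_pos = C[contexts]     # (B, D) frozen positive context vectors
    c_neg = C[negatives]    # (B, K, D) frozen negative context vectors

    # Gradient of -log s(w.c) - sum_k log s(-w.n_k) with respect to w only.
    g_pos = (sigmoid(np.sum(w * c_pos, axis=1)) - 1.0)[:, None] * c_pos
    s_neg = sigmoid(np.einsum("bd,bkd->bk", w, c_neg))
    g_neg = np.einsum("bk,bkd->bd", s_neg, c_neg)

    # Rows are distinct, so these writes cannot conflict within the batch.
    W[words] -= lr * (g_pos + g_neg)
    return W


# Toy usage: vocabulary of 10 words, 4-dimensional embeddings, batch of 4 pairs.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))
C = rng.normal(size=(10, 4))
words = rng.choice(10, size=4, replace=False)   # each word at most once
contexts = rng.integers(0, 10, size=4)          # one sampled context per word
negatives = rng.integers(0, 10, size=(4, 3))    # 3 negatives per pair
batch_step(W, C, words, contexts, negatives)
```

Because every pair in the batch owns a distinct row of `W` and `C` is read-only, the per-pair updates are independent of one another, which is the property that makes a large batch safe to process in parallel.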

Pages: 157-175

Page count: 18
