Improving Word Embeddings for Low Frequency Words by Pseudo Contexts

Cited by: 1

Authors:

Li, Fang [1]

Wang, Xiaojie [1]

Affiliations:

[1] Beijing Univ Posts & Telecommun, Sch Comp, Beijing, Peoples R China

Source:

CHINESE COMPUTATIONAL LINGUISTICS AND NATURAL LANGUAGE PROCESSING BASED ON NATURALLY ANNOTATED BIG DATA, CCL 2017 | 2017 / Vol. 10565

Keywords:

Word embedding; Low frequency word;

DOI:

10.1007/978-3-319-69005-6_4

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper investigates the relation between word semantic density and word frequency. Semantic density is measured by a word's average similarity, which is defined on distributed representations. We find that low-frequency words consistently have higher average similarity than high-frequency words, and that the average similarity stabilizes once word frequency reaches about 400. The finding holds across changes in the size of the training corpus, the dimension of the distributed representations, and the number of negative samples in the skip-gram model. It also holds across 17 different languages. Based on this finding, we propose a pseudo-context skip-gram model, which uses the context words of a target word's semantic nearest neighbors. Experimental results show that the model achieves significant improvements on both word similarity and word analogy tasks.
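The abstract does not give the formula for word average similarity, so the following is only a minimal sketch under an assumption: a word's average similarity is taken to be the mean cosine similarity between its embedding and the embeddings of all other vocabulary words. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_similarity(word, embeddings):
    """Mean cosine similarity between `word` and every other word.

    Assumption: this is one plausible reading of "word average similarity";
    the paper's exact definition may differ (for example, it may average
    over a sample of words rather than the full vocabulary).
    """
    others = [w for w in embeddings if w != word]
    total = sum(cosine(embeddings[word], embeddings[w]) for w in others)
    return total / len(others)


# Toy, hand-made vectors used only to exercise the function.
toy_embeddings = {
    "cat": [1.0, 0.0],
    "dog": [1.0, 0.0],
    "car": [0.0, 1.0],
}

for w in toy_embeddings:
    print(w, average_similarity(w, toy_embeddings))
```

Under this reading, a word with a higher score sits in a denser region of the embedding space, which is the property the abstract relates to word frequency.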

Citation

Pages: 37-47

Page count: 11

