Word Embeddings as Statistical Estimators

Cited: 0
|
Authors
Dey, Neil [1 ]
Singer, Matthew [1 ]
Williams, Jonathan P. [1 ,2 ]
Sengupta, Srijan [1 ]
Affiliations
[1] North Carolina State Univ, Dept Stat, Raleigh, NC 27695 USA
[2] Norwegian Acad Sci & Letters, Ctr Adv Study, Oslo, Norway
Funding
U.S. National Institutes of Health;
Keywords
Copula; Word2Vec; distributed representation; statistical linguistics; language modeling; missing value SVD;
DOI
10.1007/s13571-024-00331-1
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject Classification Codes
020208; 070103; 0714;
Abstract
Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that, under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation method for the theoretical pointwise mutual information (PMI). We further illustrate the utility of this statistical model by using it to develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon the truncation-based method proposed by Levy and Goldberg (Adv. Neural Inf. Process. Syst., 27, 2177-2185, 2014). The resulting estimator is also comparable to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set and a part-of-speech tagging task on the OntoNotes data set.
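The PMI quantity at the center of the abstract is simple to compute from a co-occurrence table, which may help fix ideas: Levy and Goldberg's result, referenced above, is that skip-gram Word2Vec implicitly factorizes a (shifted) PMI matrix. The following minimal sketch uses an invented toy co-occurrence matrix purely for illustration; the function name `pmi_matrix` is not from the paper.

```python
import numpy as np

def pmi_matrix(cooc):
    """Empirical PMI_ij = log( p(i,j) / (p(i) * p(j)) ) from co-occurrence counts."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_joint = cooc / total               # joint word/context probabilities
    p_word = cooc.sum(axis=1) / total    # marginal of target words
    p_ctx = cooc.sum(axis=0) / total     # marginal of context words
    with np.errstate(divide="ignore"):   # zero counts give PMI = -inf
        pmi = np.log(p_joint / np.outer(p_word, p_ctx))
    return pmi

# Toy 2-vocabulary example (counts are made up):
cooc = [[10, 2],
        [2, 6]]
pmi = pmi_matrix(cooc)
```

A positive entry means the pair co-occurs more often than independence would predict; the estimators discussed in the abstract differ in how they handle the empty (zero-count) cells, where the naive plug-in estimate above diverges to negative infinity.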
Pages: 27