Word Embeddings as Statistical Estimators

Cited: 0
|
Authors
Dey, Neil [1 ]
Singer, Matthew [1 ]
Williams, Jonathan P. [1 ,2 ]
Sengupta, Srijan [1 ]
Affiliations
[1] North Carolina State Univ, Dept Stat, Raleigh, NC 27695 USA
[2] Norwegian Acad Sci & Letters, Ctr Adv Study, Oslo, Norway
Funding
U.S. National Institutes of Health;
Keywords
Copula; Word2Vec; distributed representation; statistical linguistics; language modeling; missing value SVD;
DOI
10.1007/s13571-024-00331-1
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject Classification Codes
020208; 070103; 0714;
Abstract
Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that, under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation method for the theoretical pointwise mutual information (PMI). We further illustrate the utility of this statistical model by using it to develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon the truncation-based method proposed by Levy and Goldberg (Adv. Neural Inf. Process. Syst., 27, 2177-2185, 2014). The resulting estimator is also comparable to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set and a part-of-speech tagging task on the OntoNotes data set.
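The PMI quantity at the center of the abstract is simple to compute from a co-occurrence table, which may help fix ideas: Levy and Goldberg's result, referenced above, is that skip-gram Word2Vec implicitly factorizes a (shifted) PMI matrix. The following minimal sketch uses an invented toy co-occurrence matrix purely for illustration; the function name `pmi_matrix` is not from the paper.

```python
import numpy as np

def pmi_matrix(cooc):
    """Empirical PMI_ij = log( p(i,j) / (p(i) * p(j)) ) from co-occurrence counts."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_joint = cooc / total               # joint word/context probabilities
    p_word = cooc.sum(axis=1) / total    # marginal of target words
    p_ctx = cooc.sum(axis=0) / total     # marginal of context words
    with np.errstate(divide="ignore"):   # zero counts give PMI = -inf
        pmi = np.log(p_joint / np.outer(p_word, p_ctx))
    return pmi

# Toy 2-vocabulary example (counts are made up):
cooc = [[10, 2],
        [2, 6]]
pmi = pmi_matrix(cooc)
```

A positive entry means the pair co-occurs more often than independence would predict; the estimators discussed in the abstract differ in how they handle the empty (zero-count) cells, where the naive plug-in estimate above diverges to negative infinity.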
Pages: 27