A Comparative Study of Methods for Visualizable Semantic Embedding of Small Text Corpora

Cited by: 3
Authors
Choudhary, Rishabh [1]
Doboli, Simona [2]
Minai, Ali A. [1]
Affiliations
[1] Univ Cincinnati, Dept Elect Engn & Comp Sci, Cincinnati, OH 45221 USA
[2] Hofstra Univ, Dept Comp Sci, Hempstead, NY 11550 USA
Keywords
semantic spaces; text embedding; language models; semantic visualization; REPRESENTATIONS; BRAIN;
DOI
10.1109/IJCNN52387.2021.9534250
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text embedding has recently emerged as a very useful and successful method for semantic representation. Following initial word-level embedding methods such as Latent Semantic Analysis (LSA) and topic-based bag-of-words approaches like Latent Dirichlet Allocation (LDA), the focus has turned to language models and text encoders implemented as neural networks - ranging from word-level models to those embedding whole documents. The distinctive feature of these models is their ability to infer semantic spaces at all levels based purely on data, with no need for complexities such as syntactic analysis or ontology building. Many of these models are available pretrained on enormous amounts of data, providing downstream applications with general-purpose semantic spaces. In particular, embedding models at the sentence level or higher are most useful in applications because the meaning of text only becomes clear at that level. Most text embedding methods produce text embeddings in high-dimensional spaces, with a dimensionality ranging from a few hundred to thousands. However, it is often useful to visualize semantic spaces in very low dimension, which requires the use of dimensionality reduction methods. It is not clear what language models and what method of dimensionality reduction would work well in these cases. In this paper, we compare four text embedding methods in combination with three methods of dimensionality reduction to map three related real-world datasets comprising textual descriptions of items in a particular domain (sports) to a 2-dimensional semantic visualization space. The results provide several insights into the utility of these methods for data of this type.
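The pipeline described in the abstract (embed each text in a high-dimensional space, then reduce to two dimensions for plotting) can be illustrated with a minimal sketch. The abstract does not name the four embedding methods or the three reduction methods that were compared, so the choices here are stand-ins and should not be read as the authors' setup: an LSA-style embedding (TF-IDF followed by truncated SVD) and PCA as the reduction step. The toy sentences, the function names, and the value `k=3` are all hypothetical.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return (matrix, vocabulary) with one TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    X = np.zeros((n_docs, len(vocab)))
    for i, toks in enumerate(tokenized):
        for tok, count in Counter(toks).items():
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0  # smoothed IDF
            X[i, index[tok]] = count * idf
    return X, vocab


def lsa_embed(X, k):
    """LSA-style embedding: keep the top-k singular directions of X."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]


def pca_2d(E):
    """Project embeddings onto their first two principal components."""
    centered = E - E.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T


# Hypothetical toy corpus standing in for short item descriptions.
docs = [
    "the striker scored a goal in the football match",
    "the goalkeeper saved a penalty in the football final",
    "the pitcher threw a fastball in the baseball game",
    "the batter hit a home run in the baseball game",
]

X, vocab = tfidf_matrix(docs)   # high-dimensional representation
E = lsa_embed(X, k=3)           # assumed intermediate embedding size
coords = pca_2d(E)              # 2-D visualization coordinates

for doc, (x, y) in zip(docs, coords):
    print(f"{x:+.3f} {y:+.3f}  {doc}")
```

Each printed row gives the 2-D coordinates of one description, ready to pass to a scatter plot. Swapping `lsa_embed` for a pretrained sentence encoder, or `pca_2d` for a nonlinear reducer, would follow the same two-stage shape.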
Pages: 8