A Comparative Study of Methods for Visualizable Semantic Embedding of Small Text Corpora

Cited by: 3
Authors
Choudhary, Rishabh [1]
Doboli, Simona [2]
Minai, Ali A. [1]
Affiliations
[1] Univ Cincinnati, Dept Elect Engn & Comp Sci, Cincinnati, OH 45221 USA
[2] Hofstra Univ, Dept Comp Sci, Hempstead, NY 11550 USA
Keywords
semantic spaces; text embedding; language models; semantic visualization; REPRESENTATIONS; BRAIN;
DOI
10.1109/IJCNN52387.2021.9534250
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text embedding has recently emerged as a useful and successful method for semantic representation. Following initial word-level embedding methods such as Latent Semantic Analysis (LSA) and topic-based bag-of-words approaches such as Latent Dirichlet Allocation (LDA), the focus has turned to language models and text encoders implemented as neural networks, ranging from word-level models to those that embed whole documents. The distinctive feature of these models is their ability to infer semantic spaces at all levels purely from data, with no need for complexities such as syntactic analysis or ontology building. Many of these models are available pretrained on enormous amounts of data, providing downstream applications with general-purpose semantic spaces. In particular, embedding models at the sentence level or higher are the most useful in applications, because the meaning of text only becomes clear at that level. Most text embedding methods produce embeddings in high-dimensional spaces, with dimensionality ranging from a few hundred to several thousand. However, it is often useful to visualize semantic spaces in very low dimensions, which requires dimensionality reduction. It is not clear which language models and which dimensionality reduction methods work well in these cases. In this paper, we compare four text embedding methods in combination with three dimensionality reduction methods to map three related real-world datasets, each comprising textual descriptions of items in a particular domain (sports), to a 2-dimensional semantic visualization space. The results provide several insights into the utility of these methods for data of this type.
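The embed-then-reduce pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not name the four embedding methods or the three dimensionality reduction methods that were compared, so the choices here are assumptions made only to keep the example dependency-free: a TF-IDF bag-of-words vector stands in for a text embedding model, and PCA (computed by power iteration) stands in for the dimensionality reduction step. The function names `tfidf_embed` and `pca_2d` and the sample sentences are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def tfidf_embed(texts):
    """Stand-in embedding (assumption): one TF-IDF vector per text."""
    docs = [text.lower().split() for text in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Number of documents in which each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[tok] / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok in vocab
        ])
    return vectors


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_2d(vectors, n_iter=200, seed=0):
    """Stand-in reduction (assumption): project onto the top two
    principal directions, found by power iteration."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            # Multiply w by X^T X without forming the d x d matrix.
            scores = [dot(row, w) for row in centered]
            w = [sum(scores[i] * centered[i][j] for i in range(n))
                 for j in range(d)]
            # Keep w orthogonal to the directions already found.
            for c in components:
                proj = dot(w, c)
                w = [a - proj * b for a, b in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break  # no variance left in the remaining directions
            w = [a / norm for a in w]
        components.append(w)
    return [[dot(row, c) for c in components] for row in centered]


# Invented sample descriptions in the sports domain (not the paper's data).
texts = [
    "the striker scored a goal in the football match",
    "the goalkeeper saved a goal in the football match",
    "the pitcher threw a fast ball in the baseball game",
    "the batter hit a home run in the baseball game",
]
coords = pca_2d(tfidf_embed(texts))
for text, (x, y) in zip(texts, coords):
    print(f"{x:+.3f} {y:+.3f}  {text}")
```

Each text ends up as one point in a 2-dimensional space that can be passed to any plotting tool. Swapping in a pretrained sentence encoder or a nonlinear reduction method would change only the two functions, which is the kind of combination the study compares.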
Pages: 8