WORD DISTINCTIVITY - QUANTIFYING IMPROVEMENT OF TOPIC MODELING RESULTS FROM N-GRAMMING

Cited by: 0
Authors
Chai, Christine P. [1]
Affiliations
[1] Microsoft Corp, Washington, DC 20001 USA
Keywords
latent Dirichlet allocation; text mining; topic modeling; n-gramming; data cleaning; quantification; RETRIEVAL;
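DOI
Not available
Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code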
020208 ; 070103 ; 0714 ;
Abstract
Text data cleaning is an important but often overlooked step in text mining because its contribution is difficult to quantify. We therefore propose word distinctivity to measure the improvement in topic modeling results from n-gramming, which preserves special phrases in a corpus. Word distinctivity evaluates the signal strength of a word's topic assignments: a high distinctivity means a high posterior probability that the word comes from a certain topic. We implemented latent Dirichlet allocation for topic modeling and found that some special phrases show an increase in word distinctivity, which reduces uncertainty in topic identification.
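The abstract ties distinctivity to the posterior probability that a word comes from a given topic, but this record does not give the paper's formula. As a rough illustration only, the sketch below applies Bayes' rule to per-topic word probabilities and uses the largest posterior as a stand-in score. The function names, the choice of the maximum posterior, and all numbers are assumptions made for this example, not the author's definition or data.

```python
def topic_posteriors(word_given_topic, topic_prior):
    """Return P(topic | word) for each topic via Bayes' rule.

    word_given_topic[k] is P(word | topic k) and topic_prior[k] is P(topic k).
    Both inputs are hypothetical; a fitted LDA model would supply them.
    """
    joint = [p_w * p_t for p_w, p_t in zip(word_given_topic, topic_prior)]
    total = sum(joint)
    if total == 0:
        raise ValueError("word has zero probability under every topic")
    return [j / total for j in joint]


def max_posterior(word_given_topic, topic_prior):
    """Largest topic posterior, used here as an assumed stand-in score."""
    return max(topic_posteriors(word_given_topic, topic_prior))


# Made-up numbers: a single word spread evenly over two topics,
# and a phrase kept as one token that concentrates in the first topic.
prior = [0.5, 0.5]
single_word_score = max_posterior([0.02, 0.02], prior)
phrase_score = max_posterior([0.03, 0.01], prior)
print(single_word_score, phrase_score)
```

In this toy setup the phrase token scores higher than the single word, which matches the direction of the effect the abstract reports for some special phrases after n-gramming.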
Pages: 199-220
Number of pages: 22