Unsupervised Anomaly Detection in Multi-Topic Short-Text Corpora

Cited by: 0
Authors
Ait-Saada, Mira [1 ,2 ]
Nadif, Mohamed [1 ]
Affiliations
[1] Univ Paris Cite, Ctr Borelli, F-75006 Paris, France
[2] Caisse Depots & Consignat, F-75013 Paris, France
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Unsupervised anomaly detection seeks to identify deviant data samples in a dataset without using labels. It is a challenging task, particularly when the majority class is heterogeneous. This paper addresses the problem for textual data and aims to determine whether a text sample is an outlier within a potentially multi-topic corpus. To this end, it is crucial to capture the semantics of words, particularly in short texts, since samples containing only a few words are difficult to discriminate syntactically. We therefore use word embeddings to represent each sample as a dense vector that efficiently captures the underlying semantics. We then rely on a mixture model approach to detect which samples deviate the most from the underlying distributions of the corpus. Experiments on real datasets show the effectiveness of the proposed approach compared with state-of-the-art techniques, in terms of both performance and time efficiency, especially when more than one topic is present in the corpus.
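The pipeline outlined in the abstract (embed each short text as a dense vector, fit a mixture model, flag the samples the mixture finds least likely) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not specify the embedding model or the mixture family, so the 2-D toy word vectors, the mean pooling, the diagonal-covariance Gaussian mixture fitted by EM, and all function names below are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical 2-D word vectors; a real run would load pretrained embeddings.
WORD_VECTORS = {
    "bank": [1.0, 0.1], "loan": [0.9, 0.0], "credit": [1.1, 0.2], "rate": [1.0, -0.1],
    "team": [0.0, 1.0], "goal": [0.1, 0.9], "match": [-0.1, 1.1], "league": [0.2, 1.0],
    "volcano": [-1.0, -1.0], "lava": [-0.9, -1.1],
}

def embed(texts, word_vectors):
    """Represent each text by the mean of its known word vectors (assumed pooling)."""
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for text in texts:
        vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.array(rows, dtype=float)

def log_gauss_diag(X, means, variances):
    """Log density of every row of X under every diagonal Gaussian; shape (n, k)."""
    diff = X[:, None, :] - means[None, :, :]
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return -0.5 * (quad + log_det[None, :])

def fit_gmm(X, init_idx, n_iter=50, reg=1e-3):
    """Fit a diagonal Gaussian mixture with EM, starting from the rows in init_idx.

    The fixed initialisation keeps the sketch reproducible; k-means or several
    random restarts would be the usual choice in practice.
    """
    n, _ = X.shape
    k = len(init_idx)
    means = X[list(init_idx)]
    variances = np.tile(X.var(axis=0) + reg, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample.
        log_p = log_gauss_diag(X, means, variances) + np.log(weights)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means and variances (with a variance floor).
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        diff2 = (X[:, None, :] - means[None, :, :]) ** 2
        variances = np.einsum("nk,nkd->kd", resp, diff2) / nk[:, None] + reg
    return weights, means, variances

def anomaly_scores(X, weights, means, variances):
    """Negative mixture log-likelihood: larger means more anomalous."""
    log_p = log_gauss_diag(X, means, variances) + np.log(weights)
    return -np.logaddexp.reduce(log_p, axis=1)

# Two topics (finance, sport) plus one off-topic text at the end.
texts = [
    "bank loan", "team goal", "credit rate", "match league",
    "loan credit", "goal match", "bank rate", "team league",
    "volcano lava",
]
X = embed(texts, WORD_VECTORS)
weights, means, variances = fit_gmm(X, init_idx=[0, 1])
scores = anomaly_scores(X, weights, means, variances)
ranking = np.argsort(-scores)  # indices sorted from most to least anomalous
print([texts[i] for i in ranking[:3]])
```

Scoring by negative log-likelihood under the whole mixture, rather than by distance to a single centroid, is what lets a sample count as normal when it fits any one of several topics. The number of components and the covariance structure are free choices in this sketch.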
Pages: 1392-1403
Number of pages: 12