Finite State Automata on Multi-Word Units for Efficient Text-Mining

Cited by: 1
Authors
Postiglione, Alberto [1]
Affiliations
[1] Univ Salerno, Dept Business Sci & Management & Innovat Syst, Via San Giovanni Paolo 2, I-84084 Fisciano, Italy
Keywords
text mining; knowledge extraction; finite automata; ontology; multi-word units; natural language processing; INTERNET; THINGS;
DOI
10.3390/math12040506
Chinese Library Classification number
O1 [Mathematics];
Discipline classification code
0701; 070101;
Abstract
Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.
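The pipeline described in the abstract (merge the selected ontologies into one automaton, then scan the text character by character and report multi-word units even when they overlap) can be illustrated with a minimal sketch. The abstract does not specify the automaton's construction, so this sketch substitutes the classic Aho-Corasick technique (a trie with failure links); it is not the author's implementation. The function names `build_automaton` and `find_units`, the sample ontologies, the lower-casing, and the word-boundary check are all assumptions made for illustration.

```python
from collections import deque


def build_automaton(ontologies):
    """Merge all ontologies into one Aho-Corasick automaton (assumed design).

    ontologies: dict mapping a domain label to an iterable of multi-word units.
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # out[state] -> list of (unit, domain) recognised at that state
    for domain, units in ontologies.items():
        for unit in units:
            state = 0
            for ch in unit.lower():
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append((unit, domain))
    # Breadth-first pass: compute failure links and inherit outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_units(text, automaton):
    """Scan text one character at a time; return (start, unit, domain) hits."""
    goto, fail, out = automaton
    lowered = text.lower()  # assumes lower-casing keeps the string length
    state = 0
    hits = []
    for i, ch in enumerate(lowered):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            start = i - len(unit) + 1
            # Assumed rule: a unit must begin and end on a word boundary.
            left_ok = start == 0 or not lowered[start - 1].isalnum()
            right_ok = i + 1 == len(lowered) or not lowered[i + 1].isalnum()
            if left_ok and right_ok:
                hits.append((start, unit, domain))
    return hits


# Hypothetical ontologies, invented for this example.
ontologies = {
    "finance": ["credit card", "credit card fraud"],
    "security": ["card fraud detection"],
}
automaton = build_automaton(ontologies)
hits = find_units("Credit card fraud detection is hard.", automaton)
```

In this sketch the three units overlap in the sample sentence and are all reported in a single left-to-right pass. Updating an ontology means rebuilding the automaton in the pre-processing step, while the scanning code stays unchanged.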
Pages: 20
Related papers
  • [1] Multiflex: A Multilingual Finite-State Tool for Multi-Word Units
    Savary, Agata
    IMPLEMENTATION AND APPLICATION OF AUTOMATA, PROCEEDINGS, 2009, 5642 : 237 - 240
  • [2] Efficient Multi-word Parameterized Matching on Compressed Text
    Prasad, Rajesh
    Garg, Rama
    PROCEEDINGS OF THE 2014 IEEE 6TH INTERNATIONAL CONFERENCE ON ADAPTIVE SCIENCE AND TECHNOLOGY (ICAST 2014), 2014,
  • [3] Chunks, multi-word units et cetera: The role of multi-word units in second language acquisition
    Aguado, Karin
    DEUTSCH ALS FREMDSPRACHE-ZEITSCHRIFT ZUR THEORIE UND PRAXIS DES FACHES DEUTSCH ALS FREMDSPRACHE, 2024, 61 (01):
  • [4] Phonological similarity in multi-word units
    Gries, Stefan Th.
    COGNITIVE LINGUISTICS, 2011, 22 (03) : 491 - 510
  • [5] Text classification using multi-word features
    Zhang, Wen
    Yoshida, Taketoshi
    Tang, Xijin
    2007 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-8, 2007, : 3740 - +
  • [6] Inclusion strategies for multi-word units in monolingual dictionaries
    Louw, Phillip
    LEXIKOS, 2006, 16 : 95 - 103
  • [7] Corpus analysis and phraseology: Transfer of multi-word units
    Peromingo, Juan Pedro Rica
    LINGUISTICS AND THE HUMAN SCIENCES, 2010, 6 (1-3) : 321 - 343
  • [8] The role of multi-word units in interactive information retrieval
    Vechtomova, O
    ADVANCES IN INFORMATION RETRIEVAL, 2005, 3408 : 403 - 420
  • [9] Probabilistic multi-word spotting in handwritten text images
    Alejandro H. Toselli
    Enrique Vidal
    Joan Puigcerver
    Ernesto Noya-García
    Pattern Analysis and Applications, 2019, 22 : 23 - 32
  • [10] "The Song of Words" Teaching Multi-Word Units with Songs
    Tomczak, Ewa
    Lew, Robert
    3L-LANGUAGE LINGUISTICS LITERATURE-THE SOUTHEAST ASIAN JOURNAL OF ENGLISH LANGUAGE STUDIES, 2019, 25 (04) : 16 - 33