Finite State Automata on Multi-Word Units for Efficient Text-Mining

Cited by: 1
Authors
Postiglione, Alberto [1]
Affiliations
[1] Univ Salerno, Dept Business Sci & Management & Innovat Syst, Via San Giovanni Paolo 2, I-84084 Fisciano, Italy
Keywords
text mining; knowledge extraction; finite automata; ontology; multi-word units; natural language processing; INTERNET; THINGS;
DOI
10.3390/math12040506
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.
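The abstract describes the pipeline only at a high level: merge the selected ontologies into one automaton, then scan the text one character at a time and report multi-word units even when they overlap. The sketch below illustrates that idea with a standard Aho-Corasick automaton, used here as a stand-in because this record does not specify the paper's actual construction. The function names (`build_automaton`, `find_units`), the sample ontologies, and the lowercase normalization are all assumptions made for illustration. The sketch also ignores word boundaries, so a unit can match inside a longer word.

```python
from collections import deque

def build_automaton(ontologies):
    """Merge all multi-word units from all ontologies into one automaton.

    `ontologies` (hypothetical format) maps a domain label to a list of units.
    Returns (goto, fail, out): per-state transitions, failure links, and the
    (unit, domain) pairs recognized on entering each state.
    """
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every unit into a shared character trie.
    for domain, units in ontologies.items():
        for unit in units:
            state = 0
            for ch in unit.lower():
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append((unit, domain))
    # Phase 2: breadth-first pass that sets failure links, so the scan can
    # fall back to the longest suffix that is still a prefix of some unit.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the same position (overlaps).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_units(text, automaton):
    """Scan `text` character by character; return (start, unit, domain) hits."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text.lower()):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            hits.append((i - len(unit) + 1, unit, domain))
    return hits

# Illustrative data only; not taken from the paper.
ontologies = {
    "finance": ["credit card", "card holder agreement"],
    "computing": ["finite automaton"],
}
automaton = build_automaton(ontologies)
text = "The credit card holder agreement was parsed by a finite automaton."
hits = find_units(text, automaton)
domains = {domain for _, _, domain in hits}
```

In this example the scan reports both "credit card" and "card holder agreement", even though they share the word "card", because the failure links let the automaton continue from the overlapping suffix instead of restarting. Under this design, updating an ontology means rebuilding the automaton in the pre-processing step; the scanning loop itself is unchanged.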
Pages: 20
Related papers
50 in total
  • [21] Simple and Effective Multi-word Query Spotting in Handwritten Text Images
    Noya-Garcia, Ernesto
    Toselli, Alejandro H.
    Vidal, Enrique
    PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2017), 2017, 10255 : 76 - 84
  • [22] SOME PROBLEMS WITH THE DESCRIPTION OF PARADIGMS OF POLISH VERBAL MULTI-WORD UNITS
    Przybyszewski, Sebastian
    BEITRAGE ZUM 18. ARBEITSTREFFEN DER EUROPAISCHEN SLAVISTISCHEN LINGUISTIK (POLYSLAV), 2015, 57 : 213 - 223
  • [23] Eye-tracking multi-word units: some methodological questions
    Carrol, Gareth
    Conklin, Kathy
    JOURNAL OF EYE MOVEMENT RESEARCH, 2014, 7 (05):
  • [24] Orwell's 1984-From Simple to Multi-word Units
    Krstev, Cvetana
    Vitas, Dusko
    Trtovac, Aleksandra
    HUMAN LANGUAGE TECHNOLOGY CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS, 2014, 8387 : 276 - 287
  • [25] State Complexity of Partial Word Finite Automata
    Kutrib, Martin
    Wendlandt, Matthias
    DESCRIPTIONAL COMPLEXITY OF FORMAL SYSTEMS, DCFS 2021, 2021, 13037 : 113 - 124
  • [26] State Complexity of Partial Word Finite Automata
    Kutrib, Martin
    Wendlandt, Matthias
    INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2025,
  • [27] Short Text Entity Disambiguation Algorithm Based on Multi-Word Vector Ensemble
    Zhang, Qin
    Xiang, Xuyu
    Qin, Jiaohua
    Tan, Yun
    Liu, Qiang
    Xiong, Neal N.
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2021, 30 (01) : 227 - 241
  • [28] The Impact of User Personality Traits on Word of Mouth: Text-Mining Social Media Platforms
    Adamopoulos, Panagiotis
    Ghose, Anindya
    Todri, Vilma
    INFORMATION SYSTEMS RESEARCH, 2018, 29 (03) : 612 - 640
  • [29] Combine at Will? Body-based Description of prepositional Multi-word Units in Language Comparison
    Steyer, Kathrin
    Hein, Katrin
    PROCEEDINGS OF THE XVII EURALEX INTERNATIONAL CONGRESS: LEXICOGRAPHY AND LINGUISTIC DIVERSITY, 2016, : 402 - 408
  • [30] Prepositional constituents in multi-word units: an experimental reading study of the French preposition de
    Hennecke, Inga
    LINGUISTICS, 2022, 60 (06) : 1785 - 1810