Finite State Automata on Multi-Word Units for Efficient Text-Mining

Cited by: 1
Author
Postiglione, Alberto [1]
Affiliation
[1] Univ Salerno, Dept Business Sci & Management & Innovat Syst, Via San Giovanni Paolo 2, I-84084 Fisciano, Italy
Keywords
text mining; knowledge extraction; finite automata; ontology; multi-word units; natural language processing; INTERNET; THINGS
DOI
10.3390/math12040506
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.
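The abstract's core idea (merge the multi-word units of several ontologies into one automaton, then scan the text character by character and report overlapping units) can be illustrated with a minimal sketch. This is not the paper's construction, which this record does not describe; it substitutes the classic Aho-Corasick automaton. The names `build_automaton` and `find_units`, the sample ontologies, and the word-boundary check are all assumptions made for illustration.

```python
from collections import Counter, deque


def build_automaton(ontologies):
    """Merge all multi-word units into one Aho-Corasick automaton.

    ontologies: dict mapping a (hypothetical) domain name to its units.
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # (unit, domain) pairs recognized on entering a state
    for domain, units in ontologies.items():
        for unit in units:
            state = 0
            for ch in unit.lower():
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append((unit, domain))
    # Breadth-first pass: compute failure links and inherit outputs.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_units(text, automaton):
    """Scan text one character at a time; return (start, unit, domain) hits."""
    goto, fail, out = automaton
    lowered = text.lower()  # assumes lowercasing keeps the length (e.g. ASCII)
    state, hits = 0, []
    for i, ch in enumerate(lowered):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            start = i - len(unit) + 1
            # Assumption: a unit only counts when it sits on word boundaries.
            left_ok = start == 0 or not lowered[start - 1].isalnum()
            right_ok = i + 1 == len(lowered) or not lowered[i + 1].isalnum()
            if left_ok and right_ok:
                hits.append((start, unit, domain))
    return hits


if __name__ == "__main__":
    # Hypothetical ontologies; real ones would be far larger.
    ontologies = {
        "finance": ["credit card", "card holder"],
        "computing": ["credit card reader"],
    }
    automaton = build_automaton(ontologies)
    hits = find_units("The credit card holder used a credit card reader.", automaton)
    for hit in hits:
        print(hit)
    print(Counter(domain for _, _, domain in hits))
```

In this sketch the overlapping units "credit card" and "card holder" are both reported, because the failure links let the scan continue from the longest suffix that is still a prefix of some unit. Changing an ontology means rebuilding the automaton, a step that is linear in the total length of the units.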
Pages: 20
Related papers
50 in total
  • [41] An efficient algorithm for local testability problem of finite state automata
    Kim, SM
    McNaughton, R
    COMPUTING AND COMBINATORICS, 1995, 959 : 597 - 606
  • [42] An Efficient Parallel Determinisation Algorithm for Finite-state Automata
    Hanneforth, Thomas
    Watson, Bruce W.
    PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2012, 2012, : 42 - 52
  • [43] An Efficient Text-Mining Framework of Automatic Essay Grading Using Discourse Macrostructural and Statistical Lexical Features
    Alawadh, Husam M.
    Meraj, Talha
    Aldosari, Lama
    Tayyab Rauf, Hafiz
    SAGE OPEN, 2024, 14 (04):
  • [44] Functionally-defined recurrent multi-word units in English-to-Polish translation A corpus-based study
    Grabowski, Lukasz
    Groom, Nicholas
    REVISTA ESPANOLA DE LINGUISTICA APLICADA, 2022, 35 (01): : 1 - 29
  • [46] Head and state hierarchies for unary multi-head finite automata
    Kutrib, Martin
    Malcher, Andreas
    Wendlandt, Matthias
    ACTA INFORMATICA, 2014, 51 (08) : 553 - 569
  • [47] TEXT-MINING IN ELECTRONIC HEALTHCARE RECORDS FOR EFFICIENT RECRUITMENT AND DATA-COLLECTION IN CARDIOVASCULAR TRIALS: A MULTICENTER VALIDATION STUDY
    Van Dijk, Wouter
    Fiolet, Aernoud
    Schuit, Ewoud
    Sammani, Arjan
    Groenhof, Katrien
    van der Graaf, Rieke
    de Vries, Martine
    Alings, Marco
    Schaap, Jeroen
    Asselbergs, Folkert
    Grobbee, Diederick
    Groenwold, Rolf
    Mosterd, Arend
    JOURNAL OF THE AMERICAN COLLEGE OF CARDIOLOGY, 2020, 75 (11) : 3622 - 3622
  • [48] A word-document model for text mining by multi-objective programming technology
    Lu, Jie
    Shi, Chenggen
    Xue, Huacheng
    PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON INFORMATION AND MANAGEMENT SCIENCES, 2004, 3 : 455 - 460
  • [49] The effects of binge-watching and spacing on learning L2 multi-word units from captioned TV series
    Pattemore, Anastasia
    Munoz, Carmen
    LANGUAGE LEARNING JOURNAL, 2023, 51 (04): : 401 - 415
  • [50] DEVELOPING COLLOCATIONAL AND PHONOLOGICAL COMPETENCES OF EMERGING TEACHERS OF ENGLISH AS A FOREIGN LANGUAGE THROUGH COGNITIVE APPROACH TO PROCESSING MULTI-WORD UNITS
    Berga, Irisa
    SOCIETY, INTEGRATION, EDUCATION, VOL I, 2014, 2014, : 42 - 55