Finite State Automata on Multi-Word Units for Efficient Text-Mining

Cited by: 1
Authors
Postiglione, Alberto [1]
Affiliations
[1] Univ Salerno, Dept Business Sci & Management & Innovat Syst, Via San Giovanni Paolo 2, I-84084 Fisciano, Italy
Keywords
text mining; knowledge extraction; finite automata; ontology; multi-word units; natural language processing; INTERNET; THINGS
DOI
10.3390/math12040506
Chinese Library Classification
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.
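The runtime behavior described in the abstract (several term lists merged into one automaton, then a single character-by-character pass that also reports overlapping units) can be illustrated with a minimal sketch. This is not the paper's algorithm: it swaps in the classic Aho-Corasick construction, and the ontology contents, domain names, and the function names `build_automaton` and `find_units` are all hypothetical. It also ignores word boundaries and assumes lowercasing does not change string length, which holds for ASCII text.

```python
from collections import Counter, deque


def build_automaton(ontologies):
    """Merge all units from all ontologies into one Aho-Corasick automaton.

    ontologies: dict mapping a (hypothetical) domain name to a list of
    multi-word units.
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # (unit, domain) pairs recognized when entering a state

    # Phase 1: insert every unit into a shared trie.
    for domain, units in ontologies.items():
        for unit in units:
            state = 0
            for ch in unit.lower():
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append((unit, domain))

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit units that end at the failure target (overlaps).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_units(text, automaton):
    """Scan text one character at a time; return (start, unit, domain)."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text.lower()):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            matches.append((i - len(unit) + 1, unit, domain))
    return matches


# Hypothetical toy ontologies, for illustration only.
ontologies = {
    "finance": ["credit card", "interest rate"],
    "hardware": ["card reader"],
}
automaton = build_automaton(ontologies)
matches = find_units("The credit card reader failed.", automaton)
domain_counts = Counter(domain for _, _, domain in matches)
print(matches)
print(domain_counts)
```

In this toy input, "credit card" and "card reader" share the word "card", and both are reported from the same left-to-right pass. A fuller version would need a word-boundary check so that a unit is not matched inside a longer word.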
Pages: 20
Related papers
50 in total
  • [31] Populating Sub-entries in Dictionaries with Multi-word Units from Concordance Lines
    Otlogetswe, Thapelo J.
    LEXIKOS, 2009, 19 : 446 - 457
  • [32] Extracting Chinese multi-word units from large-scale balanced corpus
    Liu, JZ
    He, TT
    Xiaohua, LH
    PACLIC 17: Language, Information and Computation, Proceedings, 2003, : 282 - 289
  • [33] Mining Twitter Multi-word Product Opinions with Most Frequent Sequences of Aspect Terms
    Ezeife, C. I.
    Chaturvedi, Ritu
    Nasir, Mahreen
    Manjunath, Vinay
    INFORMATION INTEGRATION AND WEB INTELLIGENCE, IIWAS 2022, 2022, 13635 : 126 - 136
  • [34] Extended multi-word trigger pair language model using data mining technique
    Chen, Y
    Chan, KP
    2003 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-5, CONFERENCE PROCEEDINGS, 2003, : 262 - 267
  • [35] Efficient instruction scheduling using finite state automata
    Vasanth Bala
    Norman Rubin
    International Journal of Parallel Programming, 1997, 25 : 53 - 82
  • [36] Efficient instruction scheduling using finite state automata
    Bala, V
    Rubin, N
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 1997, 25 (02) : 53 - 82
  • [37] Test Model for Stop Word Removal of Devnagari Text Documents Based on Finite Automata
    Pimpalshende, Anjusha
    Mahajan, A. R.
    2017 IEEE INTERNATIONAL CONFERENCE ON POWER, CONTROL, SIGNALS AND INSTRUMENTATION ENGINEERING (ICPCSI), 2017, : 672 - 674
  • [38] CREATIVE USE OF IDIOMS AND OTHER MULTI-WORD LEXICAL UNITS IN THE WORKS OF MLADEN KERSTNER: POSSIBILITIES AND INTENTIONS
    Markovic, Bojana
    JEZIKOSLOVLJE, 2013, 14 (01): : 129 - 159
  • [39] ACRank: a multi-evidence text-mining model for alliance discovery from news articles
    Zhou, Yilu
    Xue, Yuan
    INFORMATION TECHNOLOGY & PEOPLE, 2020, 33 (05) : 1357 - 1380
  • [40] Efficient multi-word lock-free synchronization algorithm based on hardware CAS primitive
    Wu, Hao
    Ji, Zhen-Zhou
    Zhu, Su-Xia
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2013, 41 (11): : 2127 - 2134