Unsupervised learning of mDTD extraction patterns for Web text mining

Cited by: 7
Authors
Kim, D [1]
Jung, HM [1]
Lee, GG [1]
Affiliations
[1] Pohang Univ Sci & Technol, Dept Comp Sci & Engn, Pohang 790784, South Korea
Keywords
Web text mining; information extraction; extraction pattern; document type definition; sequential covering algorithm
DOI
10.1016/S0306-4573(03)00004-9
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents a new extraction pattern, called modified Document Type Definition (mDTD), which relies on analytical interpretation to identify extraction targets in the contents of Web documents. Starting from the conventional DTD used in XML documents, we develop two major extensions: first, we introduce an extended content model with type-specific operators and keywords, and second, we refine the way conventional DTD rules are interpreted. As a result of these two extensions, our mDTD can freely represent HTML structures and extraction targets. The goal of mDTD is to overcome the current major barriers to information extraction, namely domain portability (with minimal human intervention) and high performance. Human experts compose an mDTD as seed rules, and our system then automatically extracts a set of instances from structured documents on the Web using that mDTD. We use the extracted instances as input to the Sequential mDTD Learner (SmL), which generates new mDTD rules based on part-of-speech tags and features for lexical similarity. This process does not require any hand-annotated corpus. We experimented with 330 Korean and 220 English Web documents from audio and video shopping sites. The average extraction precision is 91.3% for Korean and 81.9% for English. (C) 2003 Elsevier Science Ltd. All rights reserved.
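The keywords name a sequential covering algorithm, and the abstract describes learning new rules from automatically extracted instances. The sketch below illustrates only the generic sequential covering loop (learn one rule, remove the examples it covers, repeat); it is not the paper's SmL. The rule form (a single feature test), the feature names `pos` and `prev`, and the `min_precision` threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representations, not taken from the paper:
Example = Tuple[Dict[str, str], bool]  # (token features, is_extraction_target)
Rule = Tuple[str, str]                 # (feature name, required value)


def covers(rule: Rule, features: Dict[str, str]) -> bool:
    """A rule covers an example when the named feature has the required value."""
    name, value = rule
    return features.get(name) == value


def best_rule(examples: List[Example]) -> Optional[Rule]:
    """Return the single-test rule with the highest precision on `examples`,
    breaking ties by the number of positive examples covered."""
    candidates = {(k, v) for feats, label in examples if label
                  for k, v in feats.items()}
    best, best_score = None, (0.0, 0)
    for rule in sorted(candidates):
        covered = [label for feats, label in examples if covers(rule, feats)]
        positives = sum(covered)
        score = (positives / len(covered), positives)
        if score > best_score:
            best, best_score = rule, score
    return best


def sequential_covering(examples: List[Example],
                        min_precision: float = 0.8) -> List[Rule]:
    """Greedily learn rules until no positive examples remain or no
    candidate rule is precise enough."""
    rules: List[Rule] = []
    remaining = list(examples)
    while any(label for _, label in remaining):
        rule = best_rule(remaining)
        if rule is None:
            break
        covered = [label for feats, label in remaining if covers(rule, feats)]
        if sum(covered) / len(covered) < min_precision:
            break
        rules.append(rule)
        # Drop every example the new rule covers before learning the next rule.
        remaining = [(f, l) for f, l in remaining if not covers(rule, f)]
    return rules


# Toy instances with made-up part-of-speech and context features.
toy_examples: List[Example] = [
    ({"pos": "NNP", "prev": "title:"}, True),
    ({"pos": "NNP", "prev": "artist:"}, False),
    ({"pos": "NN", "prev": "title:"}, True),
    ({"pos": "CD", "prev": "price:"}, False),
]
learned_rules = sequential_covering(toy_examples)
```

On the toy data, the test on the preceding token separates targets from non-targets better than the part-of-speech tests alone, so it is chosen first and covers both positive examples.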
Pages: 623 - 637
Page count: 15
Related papers
50 in total
  • [1] The feature extraction of text mining based on Web
    Liu, LZ
    Chen, JJ
    Song, HT
    [J]. ICEMI'2003: PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE ON ELECTRONIC MEASUREMENT & INSTRUMENTS, VOLS 1-3, 2003, : 547 - 550
  • [2] Unsupervised Descriptive Text Mining for Knowledge Graph Learning
    Frisoni, Giacomo
    Moro, Gianluca
    Carbonaro, Antonella
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT (KDIR), VOL 1, 2020, : 316 - 324
  • [3] Research and realization of extraction algorithm on web text mining
    Yin, Shiqun
    Qu, Yuhui
    Ge, Jike
    Lan, Xiaohong
    [J]. IITA 2007: WORKSHOP ON INTELLIGENT INFORMATION TECHNOLOGY APPLICATION, PROCEEDINGS, 2007, : 278 - +
  • [4] News item extraction for text mining in web newspapers
    Norvåg, K
    Oyri, R
    [J]. International Workshop on Challenges in Web Information Retrieval and Integration, Proceedings, 2005, : 195 - 204
  • [5] Text Summarization by Sentence Extraction Using Unsupervised Learning
    Garcia-Hernandez, Rene Arnulfo
    Montiel, Romyna
    Ledeneva, Yulia
    Rendon, Erendira
    Gelbukh, Alexander
    Cruz, Rafael
    [J]. MICAI 2008: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2008, 5317 : 133 - +
  • [6] An Unsupervised Text Mining Method for Relation Extraction from Biomedical Literature
    Quan, Changqin
    Wang, Meng
    Ren, Fuji
    [J]. PLOS ONE, 2014, 9 (07):
  • [7] Text Mining in Qualitative Research Application of an Unsupervised Learning Method
    Janasik, Nina
    Honkela, Timo
    Bruun, Henrik
    [J]. ORGANIZATIONAL RESEARCH METHODS, 2009, 12 (03) : 436 - 460
  • [8] On knowledgeable unsupervised text mining
    Hotho, A
    Maedche, A
    Staab, S
    Zacharias, V
    [J]. TEXT MINING: THEORETICAL ASPECTS AND APPLICATIONS, 2003, : 131 - 152
  • [9] INFORMATION EXTRACTION VERSUS TEXT SEGMENTATION FOR WEB CONTENT MINING
    Fragkou, Pavlina
    [J]. INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2013, 23 (08) : 1109 - 1137
  • [10] Unsupervised Learning of Text Line Segmentation by Differentiating Coarse Patterns
    Barakat, Berat Kurar
    Droby, Ahmad
    Saabni, Raid
    El-Sana, Jihad
    [J]. DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT II, 2021, 12822 : 523 - 537