Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns

Cited by: 0
Authors
Qing Li
Jing Chen
Yipu Wu
Affiliation
[1] City University of Hong Kong, Department of Computer Science
Source
World Wide Web | 2009 / Vol. 12
Keywords
data extraction; semi-structured data; tree edit distance; content feature; loosely structured data record;
DOI
Not available
Abstract
Extracting loosely structured data records (LSDRs) has wide applications in many domains, such as forum pattern recognition, weblog data analysis, and book and news review analysis. Yet existing methods work well only for strongly structured data records (SDRs). In this paper, we address the problem of extracting LSDRs by mining strict patterns. Our method uses both content features and tag tree features to recognize LSDRs, and we propose a new algorithm that extracts the data records (DRs) automatically. The experimental results demonstrate that our algorithm is able to effectively extract LSDRs with higher precision and recall.
Pages: 263-284
Page count: 21
Related papers
27 in total
  • [1] Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns
    Li, Qing
    Chen, Jing
    Wu, Yipu
    [J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2009, 12 (03) : 263 - 284
  • [2] Extracting loosely structured data records through mining strict patterns
    Wu, Yipu
    Chen, Jing
    Li, Qing
    [J]. 2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 1322 - +
  • [3] Title extraction from Loosely Structured Data Records
    Wu, Yi-Pu
    Zhang, Xue-Jie
    Li, Qing
    Chen, Jing
    [J]. PROCEEDINGS OF 2008 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2008, : 2623 - +
  • [4] Extracting semi-structured data through examples
    Ribeiro-Neto, B
    Laender, AHF
    da Silva, AS
    [J]. PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INFORMATION KNOWLEDGE MANAGEMENT, CIKM'99, 1999, : 94 - 101
  • [5] Extracting lists of data records from semi-structured web pages
    Alvarez, Manuel
    Pan, Alberto
    Raposo, Juan
    Bellas, Fernando
    Cacheda, Fidel
    [J]. DATA & KNOWLEDGE ENGINEERING, 2008, 64 (02) : 491 - 509
  • [6] Extracting discriminative patterns from graph structured data using constrained search
    Takabayashi, Kiyoto
    Nguyen, Phu Chien
    Ohara, Kouzou
    Motoda, Hiroshi
    Washio, Takashi
    [J]. ADVANCES IN KNOWLEDGE ACQUISITION AND MANAGEMENT, 2006, 4303 : 64 - +
  • [7] Digging deep into weighted patient data through multiple-level patterns
    Baralis, Elena
    Cagliero, Luca
    Cerquitelli, Tania
    Chiusano, Silvia
    Garza, Paolo
    [J]. INFORMATION SCIENCES, 2015, 322 : 51 - 71
  • [8] An approach to extracting complex knowledge patterns among concepts belonging to structured, semi-structured and unstructured sources in a data lake
    Lo Giudice, Paolo
    Musarella, Lorenzo
    Sofo, Giuseppe
    Ursino, Domenico
    [J]. INFORMATION SCIENCES, 2019, 478 : 606 - 626
  • [9] Cl-GBI: A novel approach for extracting typical patterns from graph-structured data
    Nguyen, PC
    Ohara, K
    Motoda, H
    Washio, T
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2005, 3518 : 639 - 649
  • [10] DTM- Extracting Data Records from Search Engine Results Page using Tree Matching Algorithm
    Hong, Jer Lang
    Siew, Eugene
    Egerton, Simon
    [J]. 2009 INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION, 2009, : 149 - 154