Automatically Extracting Web Data Records

Cited by: 0
Authors
Mundluru, Dheerendranath [1]
Raghavan, Vijay V. [1]
Wu, Zonghuan [1]
Affiliations
[1] IMshopping Inc, Santa Clara, CA USA
Source
ACTIVE MEDIA TECHNOLOGY | 2010 / Vol. 6335
Keywords
Structured data extraction; Web content mining
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
It is essential for Web applications such as e-commerce portals to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To meet this goal, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique to accurately identify the records in a Web page. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We have made the proposed system publicly accessible so that readers can evaluate it.
Pages: 510 / +
Page count: 2
Related papers
50 in total
  • [1] Automatically extracting Web data using tree structure
    Hu, Dong-Dong
    Meng, Xiao-Feng
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2004, 41 (10): 1607-1613
  • [2] Research of Extracting Data from HTML Web Pages Automatically
    王茹
    宋瀚涛
    陆玉昌
    Journal of Beijing Institute of Technology (English Edition), 2003, (English Edition): 104-108
  • [3] Research of Extracting Data from HTML Web Pages Automatically
    王茹
    宋瀚涛
    陆玉昌
    Journal of Beijing Institute of Technology, 2003, (S1): 104-108
  • [4] Visually Extracting Data Records from the Deep Web
    Anderson, Neil
    Hong, Jun
    PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'13 COMPANION), 2013: 1233-1238
  • [5] Finding and Extracting Data Records from Web Pages
    Manuel Álvarez
    Alberto Pan
    Juan Raposo
    Fernando Bellas
    Fidel Cacheda
    Journal of Signal Processing Systems, 2010, 59: 123-137
  • [6] Finding and Extracting Data Records from Web Pages
    Alvarez, Manuel
    Pan, Alberto
    Raposo, Juan
    Bellas, Fernando
    Cacheda, Fidel
    JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2010, 59 (01): 123-137
  • [7] Finding and extracting data records from web pages
    Alvarez, Manuel
    Pan, Alberto
    Raposo, Juan
    Bellas, Fernando
    Cacheda, Fidel
    EMBEDDED AND UBIQUITOUS COMPUTING, PROCEEDINGS, 2007, 4808: 466-478
  • [8] A Pure Visual Approach for Automatically Extracting and Aligning Structured Web Data
    Estuka, Fadwa
    Miller, James
    ACM TRANSACTIONS ON INTERNET TECHNOLOGY, 2019, 19 (04)
  • [9] NET - A system for extracting Web data from flat and nested data records
    Liu, B
    Zhai, YH
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2005, 2005, 3806: 487-495
  • [10] Automatically extracting personal name aliases from the web
    Bollegala, Danushka
    Honma, Taiki
    Matsuo, Yutaka
    Ishizuka, Mitsuru
    ADVANCES IN NATURAL LANGUAGE PROCESSING, PROCEEDINGS, 2008, 5221: 77-88