Extracting Records from the Web Using a Signal Processing Approach

Cited by: 4
Authors
Velloso, Roberto Panerai [1]
Dorneles, Carina F. [1]
Affiliations
[1] Univ Fed Santa Catarina, Florianopolis, SC, Brazil
Keywords
web mining; record extraction; structure detection; information retrieval; record alignment; ALGORITHM;
DOI
10.1145/3132847.3132875
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Extracting records from web pages enables a number of important applications and has immense value due to the amount and diversity of available information that can be extracted. This problem, although widely studied, remains open because it is not a trivial one. Due to the scale of the data, a feasible approach must be both automatic and efficient (and, of course, effective). We present here a novel approach, fully automatic and computationally efficient, that uses signal processing techniques to detect regularities and patterns in the structure of web pages. Our approach segments the web page, detects the data regions within it, identifies the record boundaries, and aligns the records. Results show a high F-score and linearithmic time complexity behaviour.
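
The abstract does not spell out the signal processing steps, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes a page can be encoded as a sequence of integer tag codes and that repeated records appear as a periodic pattern in that sequence, which a plain autocorrelation can pick up. The names `tag_signal`, `autocorrelation`, and `estimate_period` are hypothetical helpers introduced here for illustration.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Turns the start tags of a page into a sequence of integer codes."""

    def __init__(self):
        super().__init__()
        self.codes = {}   # tag name -> integer code, in order of first appearance
        self.signal = []  # one code per start tag, in document order

    def handle_starttag(self, tag, attrs):
        code = self.codes.setdefault(tag, len(self.codes) + 1)
        self.signal.append(code)


def tag_signal(html):
    """Assumed encoding: each distinct tag name maps to a small integer."""
    parser = _TagCollector()
    parser.feed(html)
    parser.close()
    return parser.signal


def autocorrelation(signal, lag):
    """Unnormalised autocorrelation of the mean-centred signal at one lag."""
    mean = sum(signal) / len(signal)
    return sum(
        (signal[i] - mean) * (signal[i + lag] - mean)
        for i in range(len(signal) - lag)
    )


def estimate_period(signal):
    """Returns the lag (in tags) with the strongest self-similarity, or 0."""
    n = len(signal)
    if n < 4:
        return 0  # too short to contain a repeated structure
    lags = range(1, n // 2 + 1)
    return max(lags, key=lambda lag: autocorrelation(signal, lag))


if __name__ == "__main__":
    page = "<ul>" + "<li><a></a><span></span></li>" * 8 + "</ul>"
    print(estimate_period(tag_signal(page)))
```

For the toy page above, each record contributes three start tags (`li`, `a`, `span`), and the estimated period is 3. This direct autocorrelation is quadratic in the sequence length; an FFT-based computation would be one way to reach linearithmic cost, though the abstract does not say how the authors achieve it.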
Pages: 197-206
Page count: 10