DOM Tree Based Approach for Web Content Extraction

Citations: 0
Authors
Mehta, Bhavdeep [1]
Narvekar, Meera [1]
Affiliations
[1] DJ Sanghvi Coll Engn, Dept Comp Engn, Bombay, Maharashtra, India
Keywords
DOM tree; Information extraction; Content extraction techniques etc;
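DOI
Not available
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract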
The World Wide Web plays an important role in searching for information across the data network, and users are constantly exposed to an ever-growing flood of information. Our approach will help retrieve exactly the content relevant to the user from multiple search engines, making search more efficient and reliable. Our framework will extract the relevant result records using two approaches: a stored URL list and a run-time generated URL list. The unique set of records is then displayed on the framework's common search result page. Extraction is performed using the Document Object Model (DOM) tree. The paper introduces a threshold and data filters to detect and remove irrelevant and redundant data from the web page; the data filters will also be used to further improve the similarity check of data records. Our system will be able to extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages, such as blogs, forums, and articles, in a dynamic environment. Our approach shows significant advantages in both precision and recall.
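The abstract does not specify how the threshold or the data filters are defined, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes a text-length threshold, a link-density filter (a common heuristic swapped in here for illustration), and an exact-match duplicate filter. The names `BlockCollector` and `extract_content`, the tag sets, and the default values `min_chars=40` and `max_link_density=0.3` are all hypothetical, and the parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical tag sets chosen for illustration only.
BLOCK_TAGS = {"p", "div", "article", "section", "li", "td"}
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


class BlockCollector(HTMLParser):
    """Collects the text of each block element and counts its link characters."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._stack = []       # open blocks as [tag, text_parts, link_chars]
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._stack.append([tag, [], 0])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS and self._stack and self._stack[-1][0] == tag:
            _, parts, link_chars = self._stack.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            if text:
                self.blocks.append((text, link_chars))

    def handle_data(self, data):
        if self._skip_depth or not self._stack:
            return
        # Text is attributed to the innermost open block only.
        self._stack[-1][1].append(data)
        if self._link_depth:
            self._stack[-1][2] += len(data.strip())


def extract_content(html, min_chars=40, max_link_density=0.3):
    """Return unique text blocks that pass the length and link-density filters."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    seen, kept = set(), []
    for text, link_chars in collector.blocks:
        if len(text) < min_chars:                       # threshold: too little text
            continue
        if link_chars / len(text) > max_link_density:   # filter: mostly links
            continue
        key = text.lower()
        if key in seen:                                 # filter: redundant record
            continue
        seen.add(key)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = (
        "<body><nav><a href='/'>Home</a></nav>"
        "<p>A paragraph with enough running text to pass the length threshold.</p>"
        "<p>Short.</p></body>"
    )
    for block in extract_content(sample):
        print(block)
```

Working on block-level nodes keeps the sketch close to the DOM-tree view described in the abstract: each candidate record is a subtree, and the threshold and filters decide whether that subtree is kept or discarded as noise.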
Pages: 6
Related papers
50 in total
  • [1] Web Content Information Extraction Based on DOM Tree and Statistical Information
    Yu, Xin
    Jin, Zhengping
    [J]. 2017 17TH IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT 2017), 2017, : 1308 - 1311
  • [2] An Adaptive Web Information Extraction Approach Based on STU-DOM Tree
    Wu, Songpu
    Wang, Qing
    [J]. ADVANCED DESIGN AND MANUFACTURING TECHNOLOGY III, PTS 1-4, 2013, 397-400 : 1972 - 1978
  • [3] Using the DOM Tree for Content Extraction
    Lopez, Sergio
    Silva, Josep
    Insa, David
    [J]. ELECTRONIC PROCEEDINGS IN THEORETICAL COMPUTER SCIENCE, 2012, (98): : 46 - 59
  • [4] Learning Web Content Extraction with DOM Features
    Utiu, Nichita
    Ionescu, Vlad-Sebastian
    [J]. 2018 IEEE 14TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTER COMMUNICATION AND PROCESSING (ICCP), 2018, : 5 - 11
  • [5] An Approach of Information Extraction Based on Dom Tree and Weight Value
    Wang, Haitao
    Liu, Shufen
    [J]. INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2016, 9 (10): : 311 - 319
  • [6] Web Article Extraction for Web Printing: a DOM plus Visual based Approach
    Luo, Ping
    Fan, Jian
    Liu, Sam
    Lin, Fen
    Xiong, Yuhong
    Liu, Jerry
    [J]. DOCENG'09: PROCEEDINGS OF THE 2009 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2009, : 66 - 69
  • [7] The Technology of Extracting Content Information from Web Page Based on DOM Tree
    Yuan, Dingrong
    Mo, Zhuoying
    Xie, Bing
    Xie, Yangcai
    [J]. ADVANCED RESEARCH ON ELECTRONIC COMMERCE, WEB APPLICATION, AND COMMUNICATION, PT 2, 2011, 144 : 271 - 278
  • [8] Using the words/leafs ratio in the DOM tree for content extraction
    Insa, David
    Silva, Josep
    Tamarit, Salvador
    [J]. JOURNAL OF LOGIC AND ALGEBRAIC PROGRAMMING, 2013, 82 (08): : 311 - 325
  • [9] Extracting Content for News Web Pages based on DOM
    Geng, Hua
    Gao, Qiang
    Pan, Jingui
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2007, 7 (02): : 124 - 129
  • [10] SVM-based Web Content Mining with Leaf Classification Unit from DOM-tree
    Kim, Yeongsu
    Lee, Seungwoo
    [J]. 2017 9TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SMART TECHNOLOGY (KST), 2017, : 359 - 364