DOM Tree Based Approach for Web Content Extraction

Cited by: 0

Authors:

Mehta, Bhavdeep [1]

Narvekar, Meera [1]

Affiliations:

[1] DJ Sanghvi Coll Engn, Dept Comp Engn, Bombay, Maharashtra, India

Source:

2015 International Conference on Communication, Information & Computing Technology (ICCICT) | 2015

Keywords:

DOM tree; Information extraction; Content extraction techniques etc;

DOI:

Not available

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808; 0809

Abstract:

The World Wide Web plays an important role in searching for information across the data network, and users are constantly exposed to an ever-growing flood of information. Our approach helps find the exact user-relevant content from multiple search engines, making search more efficient and reliable. The framework extracts relevant result records using two approaches: a stored URL list and a run-time generated URL list. The unique set of records is then displayed on a common search result page. Extraction is performed using the Document Object Model (DOM) tree. The paper introduces a threshold and data filters to detect and remove irrelevant and redundant data from the web page; the data filters are also used to further improve the similarity check of data records. The system is expected to extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages such as blogs, forums, and articles in a dynamic environment. The approach shows significant advantages in both precision and recall.
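
The abstract names a DOM tree, a threshold, and data filters for removing irrelevant and redundant data, but it gives no algorithmic detail. As a rough illustration of that general idea, and not the authors' method, the Python sketch below builds a simplified DOM tree with the standard library, keeps leaf blocks whose text passes a length threshold and a link-density filter, and drops duplicate records. The names `Node`, `TreeBuilder`, and `extract_records`, the tag sets, and the threshold values are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Illustrative tag sets (assumptions for this sketch, not taken from the paper).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}
BLOCK_TAGS = {"p", "li", "td", "div", "blockquote"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree; children are Nodes or strings."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def text(self):
        """Whitespace-normalised text of the subtree, skipping noise tags."""
        if self.tag in NOISE_TAGS:
            return ""
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(" ".join(parts).split())

    def link_chars(self):
        """Number of text characters that sit inside <a> elements."""
        if self.tag == "a":
            return len(self.text())
        return sum(c.link_chars() for c in self.children if isinstance(c, Node))


class TreeBuilder(HTMLParser):
    """Builds the Node tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def extract_records(html, min_chars=40, max_link_ratio=0.5):
    """Return unique text blocks that pass the length and link-density filters."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    records, seen = [], set()

    def visit(node):
        if node.tag in NOISE_TAGS:
            return  # filter 1: drop whole noise subtrees
        elems = [c for c in node.children if isinstance(c, Node)]
        is_leaf_block = node.tag in BLOCK_TAGS and not any(
            c.tag in BLOCK_TAGS for c in elems
        )
        if is_leaf_block:
            text = node.text()
            if not text or len(text) < min_chars:
                return  # filter 2: length threshold
            if node.link_chars() / len(text) > max_link_ratio:
                return  # filter 3: mostly link text, likely navigation
            key = text.lower()
            if key not in seen:  # filter 4: remove redundant records
                seen.add(key)
                records.append(text)
            return
        for child in elems:
            visit(child)

    visit(builder.root)
    return records
```

Calling `extract_records(page_html)` on a fetched page returns the surviving blocks in document order. Both thresholds are plain keyword arguments so they can be tuned per site; the exact-match duplicate check is the simplest stand-in for the similarity check the abstract mentions.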


Pages: 6
