The Research of Web Parallel Information Extraction Based on Hadoop

Authors
Ma, Songyu [1]
Shi, Quan [1]
Xu, Lu [1]
Affiliations
[1] Nantong Univ, Sch Comp Sci & Technol, Nantong 226019, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Hadoop; Web information extraction; Crawler; Parallel indexing
DOI
10.1007/978-81-322-1759-6_41
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Big data, driven by three major trends (cloud computing, social computing, and mobile computing), is reshaping business processes, IT infrastructure, and the way enterprise, customer, and Internet information is captured and used. To extract big data from the Internet, an enterprise needs a scalable, flexible, and manageable data infrastructure. This paper therefore analyzes and designs a large-scale data information extraction system based on the Hadoop framework. Measurements show that cluster-based extraction of huge amounts of data performs considerably better than single-node extraction, with high reliability and scalability. Moreover, this research provides better technical solutions for Web information extraction and sensitive information.
Pages: 341-348
Number of pages: 8