Multi-category web object extraction based on relation schema

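Authors
Chen, Xiaowu [1]
Ma, Yongtao [1]
Zhao, Qinping [1]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, VRLab, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
Multi-category Web Objects; Information Extraction; Information Classification; Web Segmentation; Relation Schema
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract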
Web object extraction technology is already widely applied in object-oriented search engines to improve search services in specific domains. However, methods for extracting multi-category Web objects, which may span several domains and hundreds of categories, are lacking. When some categories are described in structured Web pages and others in unstructured Web pages, it is difficult to find a single method that extracts record-level Web objects. Likewise, when hundreds of categories span several domains, it is hard to predefine attribute schemas for extracting attribute-level Web objects. To address this problem, we propose a method for multi-category Web object extraction. First, the method transforms a Web page into an HTML tag tree in which each node's size is set by the amount of text it contains. A node's text-support degree is calculated from the node size and used to find and extract unstructured nodes. Similarly, the size similarity of sibling nodes is computed and used to find and extract structured parent nodes. The extracted node with the largest node size is then selected as the Web object record. Second, the method uses raw Wikipedia data to construct a relation warehouse of multi-category Web objects and extracts a core relation schema of 400 categories through relation weight calculation and iteration. Finally, it assigns the Web object record to the corresponding category by schema matching, and extracts the core Web object and its related objects from the record using a voting strategy and the core relation schema of that category. In experiments on 1000 Web pages from 20 categories in 3 domains (Computer, Art, and Medicine), the method effectively extracted multi-category Web objects from both structured and unstructured Web pages with acceptable performance. Core Web object extraction achieved a precision of 0.724 and a recall of 0.600, and related Web object extraction achieved a precision of 0.932 and a recall of 0.886.
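The record-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give formulas for the text-support degree or the sibling size similarity, so the definitions below (share of a node's text that is its own direct text, and the min/max ratio of child sizes) are assumptions chosen only for illustration, as are the thresholds and all names (`Node`, `build_tree`, `pick_record`). This is not the authors' implementation.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text directly inside this node

    @property
    def size(self):
        # Node size = amount of text in the whole subtree.
        return self.own_text + sum(c.size for c in self.children)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the matching open tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_text += len(data.strip())


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_support(node):
    # ASSUMED definition: share of the node's text that is its own direct text.
    return node.own_text / node.size if node.size else 0.0


def sibling_similarity(node):
    # ASSUMED definition: smallest child size divided by largest child size.
    sizes = [c.size for c in node.children if c.size > 0]
    if len(sizes) < 2:
        return 0.0
    return min(sizes) / max(sizes)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def pick_record(root, support_min=0.8, similarity_min=0.5):
    # Thresholds are hypothetical; the abstract does not state any.
    candidates = [
        n for n in walk(root)
        if n is not root
        and (text_support(n) >= support_min
             or sibling_similarity(n) >= similarity_min)
    ]
    # Keep the extracted node with the biggest node size.
    return max(candidates, key=lambda n: n.size, default=None)


SAMPLE = (
    "<html><body><div><a>Home</a></div>"
    "<ul><li>Alpha item one</li><li>Gamma item two</li>"
    "<li>Delta item six</li></ul></body></html>"
)
record = pick_record(build_tree(SAMPLE))
print(record.tag, record.size)
```

In this toy page the list element is selected because its three children have equal text sizes, which mirrors the idea of detecting a structured parent node through sibling size similarity.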
Pages: 439-452
Page count: 14