Fine-grained Web Content Classification via Entity-level Analytics: The Case of Semantic Fingerprinting

Cited by: 3
Authors
Govind [1]
Alec, Celine [1]
Spaniol, Marc [1]
Affiliations
[1] Univ Caen Normandie, Dept Comp Sci, Campus Cote Nacre, F-14032 Caen, France
Source
JOURNAL OF WEB ENGINEERING | 2018 / Volume 17 / Issue 6-7
Keywords
Fine-grained Web Content Classification; Entity-level Web Analytics; Advanced Web Engineering; Web Semantics; Semantic Fingerprinting; WORDNET
DOI
10.13052/jwe1540-9589.17673
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Approaching three decades of Web contents being created, the amount of heterogeneous data of diverse provenance has become seemingly overwhelming, and its organization is a "continuous battle" against time. In parallel, business, sociological, political, and media analysts require structured access to these contents in order to conduct their studies. To this end, concise and, at the same time, efficient engineering methods are required to classify Web contents accordingly. However, the task is not as simple as classifying something as A or B; rather, it requires assigning the most suitable (sub-)category to each Web content based on a fine-grained classification scheme. In practice, the underlying type hierarchies are commonly excerpts of large-scale ontologies containing several hundreds or even thousands of (sub-)types decomposed into a few top-level types. With such a fine-grained type hierarchy, the engineering task of Web content classification becomes extremely challenging. Our main objective in this work is to investigate whether entity-level analytics can be utilized to characterize a Web content and align it onto a fine-grained hierarchy. We hypothesize that "You know a document by the named entities it contains". To this end, we present a novel concept, called "Semantic Fingerprinting", that allows Web content classification solely based on the information derived from the named entities contained in a Web document. It encodes the semantic nature of a Web content into a concise vector, namely the semantic fingerprint. Thus, we expect that semantic fingerprints, when utilized in combination with machine learning, will enable a fine-grained classification of Web contents. In order to empirically validate the effectiveness of semantic fingerprinting, we perform a case study on the classification of Wikipedia documents.
Furthermore, we thoroughly examine the results by analyzing the performance of Semantic Fingerprinting with respect to the characteristics of the data set used for the experiments. In addition, we investigate performance aspects of the engineered approach by discussing its run time in comparison with competitor baselines. We observe that the semantic fingerprinting approach outperforms the state-of-the-art baselines as it raises Web contents to the entity level and captures their core essence. Moreover, our approach achieves superior run-time performance on the test data in comparison to competitors.
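As a rough illustration of the idea that a document can be encoded as a concise vector derived from its named entities, the following minimal sketch builds a normalized type-frequency vector. It is not the authors' implementation: the type vocabulary TYPES, the lookup table ENTITY_TYPES, and the function semantic_fingerprint are hypothetical names chosen for this example, and the abstract does not specify how the actual fingerprint is computed.

```python
from collections import Counter

# Hypothetical toy type vocabulary (assumption; the paper uses a much
# larger, fine-grained type hierarchy).
TYPES = ["person", "politician", "organization", "location"]

# Hypothetical entity-to-type lookup (assumption; a real system would
# obtain types from a knowledge base after named entity disambiguation).
ENTITY_TYPES = {
    "Angela Merkel": ["person", "politician"],
    "European Union": ["organization"],
    "Berlin": ["location"],
}


def semantic_fingerprint(entities):
    """Return a normalized type-frequency vector for a list of entity mentions."""
    # Count every type of every recognized entity; unknown entities add nothing.
    counts = Counter(t for e in entities for t in ENTITY_TYPES.get(e, []))
    total = sum(counts.values())
    if total == 0:
        # No known entities: return an all-zero vector of the right length.
        return [0.0] * len(TYPES)
    # Order the components by the fixed type vocabulary.
    return [counts[t] / total for t in TYPES]


if __name__ == "__main__":
    print(semantic_fingerprint(["Angela Merkel", "Berlin", "Berlin"]))
```

A vector of this kind could then be passed to any standard classifier as a feature vector; which learner and which features the authors actually use is not stated in the abstract.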
Pages: 449-482
Number of pages: 34