Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks

Cited by: 28
Authors
Wang, Chenguang [1]
Song, Yangqiu [2]
El-Kishky, Ahmed [2]
Roth, Dan [2]
Zhang, Ming [1]
Han, Jiawei [2]
Affiliations
[1] Peking Univ, Sch EECS, Beijing, Peoples R China
[2] Univ Illinois, Dept Comp Sci, Urbana, IL USA
Funding
National Science Foundation (USA); National Natural Science Foundation of China;
Keywords
World Knowledge; Heterogeneous Information Network; Document Clustering; Knowledge Base; Knowledge Graph;
DOI
10.1145/2783258.2783374
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
One of the key obstacles to making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge that is not designed for any specific domain. The key challenges are then how to adapt world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain-dependent document clustering. We provide three ways to specify world knowledge to domains by resolving the ambiguity of entities and their types, and we represent the data together with world knowledge as a heterogeneous information network. We then propose a clustering algorithm that can cluster multiple types and incorporate sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base that is automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
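As a rough illustration of the representation the abstract describes, the sketch below builds a small typed network from documents and their linked entities, then counts the entities each pair of documents shares. It is not the authors' algorithm: the toy documents, the type names, and the functions build_hin and shared_entity_counts are all hypothetical, and the entity linking and typing step is assumed to have been done already.

```python
from collections import defaultdict

# Hypothetical input: documents whose entity mentions have already been
# linked to knowledge-base entities and assigned a type (toy values only).
docs = {
    "d1": [("Barack Obama", "Person"), ("United States", "Country")],
    "d2": [("United States", "Country"), ("NASA", "Organization")],
    "d3": [("NASA", "Organization")],
}

def build_hin(docs):
    """Return (node_types, edges) for a typed document-entity network."""
    node_types = {}            # node id -> type name
    edges = defaultdict(set)   # node id -> set of neighbor ids (undirected)
    for doc_id, mentions in docs.items():
        node_types[doc_id] = "Document"
        for entity, entity_type in mentions:
            node_types[entity] = entity_type
            edges[doc_id].add(entity)
            edges[entity].add(doc_id)
    return node_types, edges

def shared_entity_counts(node_types, edges):
    """Count the entities shared by each pair of documents."""
    counts = defaultdict(int)
    for node, node_type in node_types.items():
        if node_type == "Document":
            continue
        # Entity nodes only link to documents in this sketch.
        linked_docs = sorted(edges[node])
        for i, first in enumerate(linked_docs):
            for second in linked_docs[i + 1:]:
                counts[(first, second)] += 1
    return dict(counts)

node_types, edges = build_hin(docs)
pair_counts = shared_entity_counts(node_types, edges)
```

Keeping a type on every node is what makes the network heterogeneous; a clustering method could then treat links through Person nodes differently from links through Country nodes, which a plain bag-of-words representation cannot express.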
Pages: 1215 - 1224
Page count: 10
Related papers
50 in total
  • [31] Identification of Important Nodes in Multilayer Heterogeneous Networks Incorporating Multirelational Information
    Wan, Liangtian
    Zhang, Mingyue
    Li, Xiaona
    Sun, Lu
    Wang, Xianpeng
    Liu, Kaihui
    [J]. IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2022, 9 (06) : 1715 - 1724
  • [32] DOCUMENT CLUSTERING WITH BURSTY INFORMATION
    Hoonlor, Apirak
    Szymanski, Boleslaw K.
    Zaki, Mohammed J.
    Chaoji, Vineet
    [J]. COMPUTING AND INFORMATICS, 2012, 31 (06) : 1533 - 1555
  • [33] Clustering on heterogeneous networks
    Huang, Yue
    Gao, Xuedong
    [J]. WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2014, 4 (03) : 213 - 233
  • [34] Evaluation of Knowledge Acquisition from Document Clustering Based on Information Retrieval Scales
    Ochikubo, Shu
    Komiya, Kano
    Saitoh, Fumiaki
    Ishizu, Syohei
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND ENGINEERING MANAGEMENT (IEEM), 2017, : 220 - 224
  • [35] Triangular clustering in document networks
    Cheng, Xue-Qi
    Ren, Fu-Xin
    Zhou, Shi
    Hu, Mao-Bin
    [J]. NEW JOURNAL OF PHYSICS, 2009, 11
  • [36] Document clustering with neural networks
    Lencses, R
    [J]. STATE OF THE ART IN COMPUTATIONAL INTELLIGENCE, 2000, : 296 - 301
  • [37] Incorporating temporal information for document classification
    Luo, Xiao
    Zincir-Heywood, Nur
    [J]. 2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOP, VOLS 1-2, 2007, : 780 - +
  • [38] A fast clustering algorithm based on embedding technology for heterogeneous information networks
    Chen, Li-Min
    Yang, Jing
    Zhang, Jian-Pei
    [J]. Dianzi Yu Xinxi Xuebao/Journal of Electronics and Information Technology, 2015, 37 (11) : 2634 - 2641
  • [39] Social Influence Based Clustering and Optimization over Heterogeneous Information Networks
    Zhou, Yang
    Liu, Ling
    [J]. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2015, 10 (01)
  • [40] Structure and Semantic Contrastive Learning for Nodes Clustering in Heterogeneous Information Networks
    Yu, Yiwei
    Zhou, Lihua
    Liu, Chao
    Wang, Lizhen
    Chen, Hongmei
    [J]. SPATIAL DATA AND INTELLIGENCE, SPATIALDI 2024, 2024, 14619 : 57 - 65