Extracting Information Networks from the Blogosphere

Cited by: 10
Authors
Merhav, Yuval [1]
Mesquita, Filipe [2]
Barbosa, Denilson [2]
Yee, Wai Gen [3]
Frieder, Ophir [4]
Affiliations
[1] IIT, Dept Comp Sci, Informat Retrieval Lab, Chicago, IL 60616 USA
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2M7, Canada
[3] Orbitz Worldwide, Chicago, IL 60661 USA
[4] Georgetown Univ, Washington, DC 20057 USA
Keywords
Algorithms; Experimentation; Performance; open information extraction; relation extraction; named entities; domain frequency; clustering
DOI
10.1145/2344416.2344418
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We study the problem of automatically extracting information networks, formed by recognizable entities and the relations among them, from social media sites. Our approach uses state-of-the-art natural language processing tools to identify entities and extract sentences that relate such entities, and then applies text-clustering algorithms to identify the relations within the information network. We propose a new term-weighting scheme that significantly improves on the state of the art in the task of relation extraction, both when used in conjunction with the standard tf·idf scheme and when used as a pruning filter. We describe an effective method for identifying benchmarks for open information extraction that relies on a curated online database that is comparable to the hand-crafted evaluation datasets in the literature. From this benchmark, we derive a much larger dataset that mimics realistic conditions for the task of open information extraction. We report on extensive experiments on both datasets, which shed light not only on the accuracy levels achieved by state-of-the-art open information extraction tools, but also on how to tune such tools for better results.
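The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each entity pair has already been reduced to a bag of context terms, weights those terms with plain tf·idf, and groups pairs with a simple greedy single-pass clustering. The proposed domain-frequency weighting is not specified in this record and is not modeled. The function names, the similarity threshold, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(contexts):
    """Map each entity pair to a tf-idf weighted bag of its context terms."""
    # Document frequency: number of entity pairs whose context contains the term.
    df = Counter()
    for terms in contexts.values():
        df.update(set(terms))
    n = len(contexts)
    vectors = {}
    for pair, terms in contexts.items():
        tf = Counter(terms)
        vectors[pair] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(vectors, threshold=0.3):
    """Greedy single-pass clustering (an assumed stand-in for the paper's method).

    Each pair joins the first cluster that has a member at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for pair, vec in vectors.items():
        for members in clusters:
            if any(cosine(vec, vectors[m]) >= threshold for m in members):
                members.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters

# Toy, made-up contexts: terms found in sentences mentioning both entities.
contexts = {
    ("CompanyA", "CompanyB"): ["acquired", "bought", "deal"],
    ("CompanyC", "CompanyD"): ["acquired", "deal"],
    ("PersonX", "CityY"): ["born", "in"],
    ("PersonZ", "CityW"): ["born", "in", "raised"],
}

clusters = cluster(tfidf_vectors(contexts))
for members in clusters:
    print(members)
```

On this toy input the two company pairs end up in one cluster and the two person-city pairs in another, so each cluster can be read as one candidate relation.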
Pages: 33