Information Extraction in Illicit Web Domains

Cited by: 20
Authors
Kejriwal, Mayank [1]
Szekely, Pedro [1]
Affiliations
[1] USC Viterbi Sch Engn, Informat Sci Inst, Los Angeles, CA 90089 USA
Keywords
Information Extraction; Named Entity Recognition; Illicit Domains; Feature-agnostic; Distributional Semantics
DOI
10.1145/3038912.3052642
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.
Pages: 997-1006
Page count: 10