Information Extraction in Illicit Web Domains

Cited by: 20

Authors:

Kejriwal, Mayank [1]

Szekely, Pedro [1]

Affiliation:

[1] USC Viterbi School of Engineering, Information Sciences Institute, Los Angeles, CA 90089 USA

Source:

PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'17) | 2017

Keywords:

Information Extraction; Named Entity Recognition; Illicit Domains; Feature-agnostic; Distributional Semantics;

DOI:

10.1145/3038912.3052642

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.

Citation

Pages: 997-1006

Page count: 10
