Towards Unveiling Dark Web Structured Data

Cited by: 0
Authors
Shams, Montasir [1]
Pavia, Sophie [1]
Khan, Rituparna [1]
Pyayt, Anna [2]
Gubanov, Michael [1]
Affiliations
[1] Florida State Univ, Dept Comp Sci, Tallahassee, FL 32306 USA
[2] Univ S Florida, Dept Med Engn, Tampa, FL 33620 USA
DOI
10.1109/BigData52589.2021.9671367
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Anecdotal evidence suggests that Web search engines, together with knowledge graphs and knowledge bases such as YAGO [46], DBPedia [13], Freebase [16], and Google Knowledge Graph [52], provide rapid access to most structured information on the Web. A closer look, however, reveals a so-called "knowledge gap" [18] that remains largely in the dark. For example, a person searching for a relevant job opening has to spend at least 3 hours per week for several months [2] just searching job postings on numerous online job-search engines and employer websites. This seemingly simple task cannot be completed by typing a few keyword queries into a search engine and getting all relevant results in seconds instead of hours, because access to structured data on the Web is still rudimentary. When searching for a job we have many parameters in mind: not just the job title but usually also location, salary range, a remote-work option (given the recent shift to hybrid workplaces), and many others. Ideally, we would write a SQL-style query selecting all job postings that satisfy our requirements, but this is currently impossible, because job postings (like all other Web tables) are structured in many different ways and scattered all over the Web. There is neither a Web-scale generalizable algorithm nor a system to locate and normalize all relevant tables in a category of interest from millions of sources. Here we describe a new scalable, iterative training-data generation algorithm and evaluate it on a corpus of hundreds of millions of Web tables [39]. The algorithm produces the high-quality training data required to train deep-learning and machine-learning models capable of generalizing to Web scale. Models trained on such enriched training data deal efficiently with Web-scale heterogeneity, in contrast to the poor generalization performance of models trained without enrichment [20], [25], [38]. Such models are instrumental in bridging the knowledge gap for structured data on the Web.
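To make the "SQL-style query" idea in the abstract concrete, here is a minimal sketch of what such a query could look like once job postings from many sources have been normalized into a single table. The table name `job_postings`, its columns, the sample rows, and the helper `find_postings` are all illustrative assumptions, not the schema, data, or method of the paper.

```python
import sqlite3

# Hypothetical, already-normalized job postings. The schema and rows are
# illustrative assumptions, not data from the paper.
POSTINGS = [
    ("Data Engineer", "Tallahassee, FL", 95000, 120000, 1),
    ("Data Engineer", "Tampa, FL", 80000, 100000, 0),
    ("ML Researcher", "Remote", 110000, 150000, 1),
    ("Web Developer", "Tampa, FL", 60000, 85000, 1),
]


def find_postings(title_keyword, min_salary, remote_only):
    """Return (title, location) pairs that match all search parameters."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_postings ("
        "title TEXT, location TEXT, "
        "salary_min INTEGER, salary_max INTEGER, remote INTEGER)"
    )
    conn.executemany("INSERT INTO job_postings VALUES (?, ?, ?, ?, ?)", POSTINGS)
    rows = conn.execute(
        "SELECT title, location FROM job_postings "
        "WHERE title LIKE ? "              # job title contains the keyword
        "AND salary_max >= ? "             # pay range reaches the minimum
        "AND (remote = 1 OR ? = 0) "       # remote filter applies only if requested
        "ORDER BY salary_max DESC",
        (f"%{title_keyword}%", min_salary, int(remote_only)),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(find_postings("Engineer", 90000, remote_only=True))
```

The query itself is trivial; the hard part the abstract points to is producing such a uniform table from millions of differently structured Web sources in the first place.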
Pages: 5275-5282
Number of pages: 8
Related papers
  • [1] Unveiling the dark web
    [J]. Bradbury, D., 1600, Elsevier Ltd (2014):
  • [2] Recent Progress Towards an Ecosystem of Structured Data on the Web
    Gupta, Nitin
    Halevy, Alon Y.
    Harb, Boulos
    Lam, Heidi
    Lee, Hongrae
    Madhavan, Jayant
    Wu, Fei
    Yu, Cong
    [J]. 2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2013, : 5 - 8
  • [3] Structured Data on the Web
    Cafarella, Michael J.
    Halevy, Alon
    Madhavan, Jayant
    [J]. COMMUNICATIONS OF THE ACM, 2011, 54 (02) : 72 - 79
  • [4] Dark Web: Exploring and Data Mining the Dark Side of the Web
    Cloete, Linda
    [J]. ONLINE INFORMATION REVIEW, 2012, 36 (06) : 932 - 933
  • [5] Structured Data in Web Search
    Halevy, Alon
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 7 - 7
  • [6] An Analysis of Structured Data on the Web
    Dalvi, Nilesh
    Machanavajjhala, Ashwin
    Pang, Bo
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (07) : 680 - 691
  • [7] Towards Entity Summarisation on Structured Web Markup
    Yu, Ran
    Gadiraju, Ujwal
    Zhu, Xiaofei
    Fetahu, Besnik
    Dietze, Stefan
    [J]. SEMANTIC WEB, ESWC 2016, 2016, 9989 : 69 - 73
  • [8] Analysis of approaches to structured data on the web
    Pohorec, Sandi
    Zorman, Milan
    Kokol, Peter
    [J]. COMPUTER STANDARDS & INTERFACES, 2013, 36 (01) : 256 - 262
  • [9] Annotating structured data of the deep Web
    Lu, Yiyao
    He, Hai
    Zhao, Hongkun
    Meng, Weiyi
    Yu, Clement
    [J]. 2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2007, : 351 - +
  • [10] A comprehensive data quality methodology for web and structured data
    Batini, Carlo
    Cabitza, Federico
    Cappiello, Cinzia
    Francalanci, Chiara
    [J]. International Journal of Innovative Computing and Applications, 2008, 1 (03) : 205 - 218