Extracting information from newspaper archives in Africa

Cited by: 1

Authors:

Zeni, M. [1]

Weldemariam, K. [2]

Affiliations:

[1] Univ Trento, I-38123 Trento, TN, Italy

[2] IBM Res, Nairobi, Kenya

Source:

IBM JOURNAL OF RESEARCH AND DEVELOPMENT | 2017 / Vol. 61 / No. 06

Keywords:

Alternative source - Digital archives - Digital sources - Extracting information - Proof of concept - Public services - Research problems - Sub-Saharan Africa

DOI:

10.1147/JRD.2017.2742706

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

In sub-Saharan Africa, lack of useful information for the public good is one obstacle to the development of public services (public safety, education, healthcare, etc.). This makes the extraction of data from digital archives (e.g., analog sources such as printed newspaper archives and born-digital sources like native PDF) an interesting alternative source of data to increase the amount and diversity of potentially useful information. Printed newspapers contain various multiarticle page layouts, wherein articles in the newspaper are designed to allow readers to define their own reading. The title of an article, the introductory story of the title, and related images are mostly grouped together. However, subsequent paragraphs and images are spread across various pages of the newspaper in a somewhat unpredictable manner. This, together with the poor quality of existing archives, makes the extracting of data from archived newspapers a daunting research problem. To solve these challenges, we present a system that extracts, detects, and clusters articles in newspapers from digital archives (mainly containing scanned newspaper archives from which the information is extracted). Finally, we also describe our proof-of-concept service using the extracted data.

Pages: 12
