Extracting information from newspaper archives in Africa

Cited by: 1
Authors
Zeni, M. [1]
Weldemariam, K. [2]
Affiliations
[1] Univ Trento, I-38123 Trento, TN, Italy
[2] IBM Res, Nairobi, Kenya
Keywords
Alternative source - Digital archives - Digital sources - Extracting information - Proof of concept - Public services - Research problems - Sub-Saharan Africa
DOI
10.1147/JRD.2017.2742706
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In sub-Saharan Africa, lack of useful information for the public good is one obstacle to the development of public services (public safety, education, healthcare, etc.). This makes the extraction of data from digital archives (e.g., analog sources such as printed newspaper archives and born-digital sources like native PDF) an interesting alternative source of data to increase the amount and diversity of potentially useful information. Printed newspapers contain various multiarticle page layouts, wherein articles in the newspaper are designed to allow readers to define their own reading. The title of an article, the introductory story of the title, and related images are mostly grouped together. However, subsequent paragraphs and images are spread across various pages of the newspaper in a somewhat unpredictable manner. This, together with the poor quality of existing archives, makes extracting data from archived newspapers a daunting research problem. To address these challenges, we present a system that extracts, detects, and clusters articles in newspapers from digital archives (mainly containing scanned newspaper archives from which the information is extracted). Finally, we also describe our proof-of-concept service using the extracted data.
Pages: 12
Related papers
50 in total
  • [1] Extracting structured subject information from digital document archives
    Liu, Jyi-Shane
    Lee, Ching-Ying
    [J]. Digital Libraries: Achievements, Challenges and Opportunities, Proceedings, 2006, 4312 : 141 - 150
  • [2] Extracting information about proper nouns from Arabic newspaper text
    Abuleil, S
    Evens, M
    [J]. COMPUTERS AND THEIR APPLICATIONS, 2001, : 374 - 378
  • [3] Extracting brief note from Internet Newspaper
    Karale, Suraj B.
    Patil, G. A.
    [J]. PROCEEDINGS OF THE 10TH INDIACOM - 2016 3RD INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT, 2016, : 401 - 406
  • [4] Newspaper archives in Germany
    Pankratz, M
    [J]. ZEITSCHRIFT FUR BIBLIOTHEKSWESEN UND BIBLIOGRAPHIE, 1999, 46 (01): : 12 - 20
  • [5] Extracting an Arabic lexicon from Arabic newspaper text
    Abuleil, S
    Evens, M
    [J]. COMPUTERS AND THE HUMANITIES, 2002, 36 (02): : 191 - 221
  • [6] Extracting an Arabic Lexicon from Arabic Newspaper Text
    Saleem Abuleil
    Martha Evens
    [J]. Computers and the Humanities, 2002, 36 : 191 - 221
  • [7] Newspaper archives on the semantic web
    Castells, P
    Perdrix, F
    Pulido, E
    Rico, M
    Fuentes, JM
    Benjamins, R
    Contreras, J
    Piqué, E
    Cal, J
    Lorés, J
    Granollers, T
    [J]. HCI RELATED PAPERS OF INTERACCION 2004, 2006, : 267 - +
  • [8] Microfilms: The future for newspaper archives
    Wo der film fuer die zeitung noch zukunft bedeutet
    [J]. 2005, Deutscher Drucker Verlag International (41):
  • [9] Extracting and visualizing knowledge from film and video archives
    Wactlar, HD
    [J]. JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2002, 8 (06) : 602 - 612
  • [10] Extracting prehistories of software refactorings from version archives
    Hayashi, Shinpei
    Saeki, Motoshi
    [J]. LARGE-SCALE KNOWLEDGE RESOURCES: CONSTRUCTION AND APPLICATION, 2008, 4938 : 82 - 89