Insights to the State-of-the-Art PDF Extraction Techniques

Authors
Hashmi, Ahmer Maqsood [1]
Qayyum, Faiza [1]
Afzal, Muhammad Tanvir [1]
Affiliations
[1] Capital Univ Sci & Technol, Dept Comp Sci, Islamabad, Pakistan
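Source
Keywords
Key information extraction; Research papers; PDF parser; Regular expression; XML and plain-text formats
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract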
Digitized documents have become a ubiquitous medium of information, and the number of scholarly documents on the web is growing rapidly. Digital libraries such as Google Scholar, CiteSeer, and MAS store these documents in different formats, but most of the scientific literature is stored in Portable Document Format (PDF). PDF documents have a complex structure, which makes understanding them and extracting useful information from them a challenging task. In response, the research community has proposed various rule-based and machine-learning-based techniques over the past several years. We believe that accurate and efficient information extraction from PDF files is an important problem, because a major portion of scholarly literature is stored in PDF. This study presents a rigorous analysis of the contemporary state of the art in PDF data extraction. Approaches from the past few years are summarized, with the primary objective of informing the scientific community about current trends in PDF extraction techniques. The study also presents a critical analysis of some of the approaches and suggests future directions for them.
Pages: 60-67
Number of pages: 8