Data File Layout Inference Using Content-Based Oracles

Authors
Phillips, Reid A. [1]
Li, Wing-Ning [1]
Thompson, Craig [1]
Deneke, Wesley [1]
Affiliations
[1] Univ Arkansas, Comp Sci & Comp Engn Dept, Fayetteville, AR 72701 USA
Keywords
domain-specific software architecture; file processing; extract-transform-load (ETL); file layout inference; content type; combinatoric approach; sampling; meta-data discovery
DOI
10.1109/CSE.2013.150
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Data file layout inference refers to the problem of identifying the organizational characteristics associated with a structured text file, where every record in a text file shares the same structural properties. These properties include: character encoding, record length, field length (indicated by delimiting characters or fixed length), field position, and field semantic content. Within this paper, the above information is referred to as the layout of a file. This structural layout information is required to extract, transform, and load files into workflows within various data warehouse and data mining applications. A common need, layout inference is a manual, labor-intensive process requiring human expertise whenever a file's layout is unavailable, miscommunicated, or changed. This paper proposes an automated methodology for solving the layout inference problem by discovering the metadata of a structured text file and reports the results of a prototype system for real data files from a customer data integration and management application.
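
As a rough illustration of the problem the abstract describes, the following minimal sketch infers a toy layout from a handful of sample records. It is not the paper's methodology, which this record does not detail. The candidate delimiter list, the regular-expression "oracles", and all function names are assumptions made up for this example.

```python
import re

# Assumption: a small, fixed set of candidate delimiters to try.
CANDIDATE_DELIMITERS = [",", "\t", "|", ";"]

# Hypothetical content oracles: a field gets a type label only if every
# sampled value in that position fully matches the pattern.
ORACLES = {
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "us_state": re.compile(r"[A-Z]{2}"),
    "integer": re.compile(r"-?\d+"),
}


def infer_delimiter(records):
    """Return (delimiter, field_count) if some candidate splits every
    record into the same number (> 1) of fields, else None."""
    for delim in CANDIDATE_DELIMITERS:
        counts = {len(r.split(delim)) for r in records}
        if len(counts) == 1:
            n = counts.pop()
            if n > 1:
                return delim, n
    return None


def infer_field_types(records, delim):
    """Label each field position with the first oracle that accepts all
    sampled values in that position, or 'unknown' if none does."""
    columns = zip(*(r.split(delim) for r in records))
    types = []
    for col in columns:
        label = "unknown"
        for name, pattern in ORACLES.items():
            if all(pattern.fullmatch(v.strip()) for v in col):
                label = name
                break
        types.append(label)
    return types


def infer_layout(records):
    """Guess whether the sample is delimited or fixed-length."""
    found = infer_delimiter(records)
    if found is not None:
        delim, n = found
        return {
            "kind": "delimited",
            "delimiter": delim,
            "field_count": n,
            "field_types": infer_field_types(records, delim),
        }
    lengths = {len(r) for r in records}
    if len(lengths) == 1:
        return {"kind": "fixed_length", "record_length": lengths.pop()}
    return {"kind": "unknown"}


sample = ["Fayetteville|AR|72701|42", "New York|NY|10001|7"]
layout = infer_layout(sample)
```

For the two sample records, the sketch settles on `|` as the delimiter with four fields and labels the second and third positions as a state and a ZIP code. A real system would also need to handle character encoding, quoting, and field boundaries inside fixed-length records, all of which are outside this sketch.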
Pages: 1029-1035
Page count: 7