A heuristic approach to handling missing data in biologics manufacturing databases

Cited by: 0

Authors:

Jeanet Mante

Nishanthi Gangadharan

David J. Sewell

Richard Turner

Ray Field

Stephen G. Oliver

Nigel Slater

Duygu Dikicioglu

Affiliations:

[1] Pembroke College,Department of Chemical Engineering and Biotechnology

[2] University of Cambridge,Cell Sciences, Biopharmaceutical Development

[3] MedImmune,Cambridge Systems Biology Centre

[4] University of Cambridge,Department of Biochemistry

[5] University of Cambridge

Source:

Bioprocess and Biosystems Engineering | 2019 / Volume 42

Keywords:

Biologics manufacturing data; Missing data; Imputation; Parameter recurrence; Data pre-processing

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

The biologics sector has amassed a wealth of data over the past three decades, in line with bioprocess development and manufacturing guidelines. Precise analysis of these data is expected to reveal behavioural patterns in cell populations that can be used to predict how future culture processes might behave. Historical bioprocessing data are likely to comprise experiments that were conducted with different cell lines, produced different products, and may be years apart; this causes inter-batch variability, and human- and instrument-associated technical oversights leave data points missing. These unavoidable complications necessitate a pre-processing step prior to data mining. This study investigated the efficiency of mean imputation and multivariate regression for filling in the missing information in historical bio-manufacturing datasets, and evaluated their performance using symbolic regression models and Bayesian non-parametric models in subsequent data processing. Mean substitution was shown to be a simple and efficient imputation method for relatively smooth, non-dynamical datasets, whereas regression imputation was effective in dynamical datasets with less than 30% missing data whilst maintaining the existing standard deviation and shape of the distribution. The nature of the missing information, whether Missing Completely At Random, Missing At Random or Missing Not At Random, emerged as the key feature for selecting the imputation method.


Pages: 657-663

Page count: 6
