A practical tool for public health surveillance: Semi-automated coding of short injury narratives from large administrative databases using Naive Bayes algorithms
Cited by: 32
Authors:
Marucci-Wellman, Helen R. [1]
Lehto, Mark R. [2]
Corns, Helen L. [1]
Affiliations:
[1] Liberty Mutual Res Inst Safety, Ctr Injury Epidemiol, Hopkinton, MA 01748 USA
[2] Purdue Univ, Sch Ind Engn, W Lafayette, IN 47907 USA
Public health surveillance programs in the U.S. are undergoing landmark changes with the availability of electronic health records and advancements in information technology. Injury narratives gathered from hospital records, workers' compensation claims, or national surveys can be very useful for identifying antecedents to injury or emerging risks. However, classifying narratives manually can become prohibitively resource-intensive for large datasets. The purpose of this study was to develop a human-machine system that could be tailored relatively easily to routinely and accurately classify injury narratives from large administrative databases such as workers' compensation. We used a semi-automated approach based on two Naive Bayesian algorithms to classify 15,000 workers' compensation narratives into two-digit Bureau of Labor Statistics (BLS) event (leading to injury) codes. Narratives were filtered out for manual review if the algorithms disagreed or made weak predictions. This approach resulted in an overall accuracy of 87%, with consistently high positive predictive values across all two-digit BLS event categories, including the very small categories (e.g., exposure to noise, needle sticks). The Naive Bayes algorithms were able to identify and accurately machine-code most narratives, leaving only 32% (4,853 narratives) for manual review. This strategy substantially reduces the need for resources compared with manual review alone. © 2015 Published by Elsevier Ltd.
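The filtering rule described above (machine-code a narrative only when both classifiers agree and neither prediction is weak) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the study's implementation: the abstract does not say how the two Naive Bayes models differ, so the sketch assumes one works on single words and the other on word pairs; the 0.8 confidence threshold, the add-one smoothing, the four toy narratives, and the labels "fall" and "struck" (placeholders, not real BLS event codes) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def unigrams(text):
    """Single-word features (assumed feature set for the first model)."""
    return text.lower().split()


def bigrams(text):
    """Adjacent word-pair features (assumed feature set for the second model)."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, narratives, codes):
        for text, code in zip(narratives, codes):
            feats = self.featurize(text)
            self.class_counts[code] += 1
            self.feature_counts[code].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        """Return (most probable code, its posterior probability)."""
        # Features never seen in training are ignored.
        feats = [f for f in self.featurize(text) if f in self.vocab]
        total = sum(self.class_counts.values())
        log_scores = {}
        for code, n in self.class_counts.items():
            denom = sum(self.feature_counts[code].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                score += math.log((self.feature_counts[code][f] + 1) / denom)
            log_scores[code] = score
        # Convert log scores to normalized posteriors in a numerically stable way.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def triage(text, model_a, model_b, threshold=0.8):
    """Return the agreed code, or None to route the narrative to manual review."""
    code_a, p_a = model_a.predict(text)
    code_b, p_b = model_b.predict(text)
    if code_a != code_b or min(p_a, p_b) < threshold:
        return None  # disagreement or weak prediction: a human coder decides
    return code_a


# Hypothetical toy training data; labels are placeholders, not BLS codes.
train_text = [
    "slipped on wet floor and fell",
    "fell from ladder while painting",
    "struck by falling box",
    "hit by forklift in warehouse",
]
train_code = ["fall", "fall", "struck", "struck"]

uni = NaiveBayes(unigrams).fit(train_text, train_code)
bi = NaiveBayes(bigrams).fit(train_text, train_code)

for narrative in ["slipped on wet floor", "worker injured at site"]:
    result = triage(narrative, uni, bi)
    print(narrative, "->", result if result is not None else "manual review")
```

Raising the threshold sends more narratives to manual review and lowering it sends fewer, so in a sketch like this the threshold is the knob that trades reviewer workload against machine-coding accuracy.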