Combining population-based administrative health records and electronic medical records for disease surveillance

Background: Administrative health records (AHRs) and electronic medical records (EMRs) are two key sources of population-based data for disease surveillance, but misclassification errors in the data can bias disease estimates. Methods that combine information from error-prone data sources can build on the strengths of AHRs and EMRs. We compared bias and error for four data-combining methods and applied them to estimate hypertension prevalence.

Methods: Our study included the rule-based OR and AND methods, which identify disease cases from either or both data sources, respectively; a rule-based sensitivity-specificity adjusted (RSSA) method, which corrects for inaccuracies using a deterministic rule; and a probabilistic-based sensitivity-specificity adjusted (PSSA) method, which corrects for error using a statistical model. Computer simulation was used to estimate relative bias (RB) and mean square error (MSE) under varying conditions of population disease prevalence, correlation amongst data sources, and amount of misclassification error. AHRs and EMRs for Manitoba, Canada were used to estimate hypertension prevalence using validated case definitions and multiple disease markers.

Results: The OR method had the lowest RB and MSE when population disease prevalence was 10%, and the RSSA method had the lowest RB and MSE when population prevalence increased to 20%. As the correlation between data sources increased, the OR method resulted in the lowest RB and MSE. Estimates of hypertension prevalence for AHRs and EMRs alone were 30.9% (95% CI: 30.6–31.2) and 24.9% (95% CI: 24.6–25.2), respectively. The combined estimates were 21.4% (95% CI: 21.1–21.7) for the AND method, 34.4% (95% CI: 34.1–34.8) for the OR method, 32.2% (95% CI: 31.8–32.6) for the RSSA method, and ranged from 34.3% (95% CI: 34.1–34.5) to 35.9% (95% CI: 35.7–36.1) for the PSSA method, depending on the statistical model.

Conclusions: The OR and AND methods are influenced by correlation amongst the data sources, while the RSSA method is dependent on the accuracy of prior sensitivity and specificity estimates. The PSSA method performed well when population prevalence was high and the average correlation amongst disease markers was low. This study will guide researchers to select a data-combining method that best suits their data characteristics.
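
The rule-based methods and the two performance measures can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, and the adjustment shown is the standard sensitivity-specificity correction of an observed prevalence (often called the Rogan-Gladen estimator), used here as an assumption because the abstract does not state the exact RSSA rule. The PSSA method relies on a statistical model and is not sketched.

```python
def combine_or(ahr, emr):
    """OR rule: a person is a case if flagged in either source."""
    return [int(a or e) for a, e in zip(ahr, emr)]


def combine_and(ahr, emr):
    """AND rule: a person is a case only if flagged in both sources."""
    return [int(a and e) for a, e in zip(ahr, emr)]


def prevalence(flags):
    """Proportion of individuals flagged as cases."""
    return sum(flags) / len(flags)


def adjust_prevalence(observed, sensitivity, specificity):
    """Correct an observed prevalence using prior sensitivity and specificity.

    Assumption: standard correction
        (observed + specificity - 1) / (sensitivity + specificity - 1),
    clamped to [0, 1]. The paper's RSSA rule may differ.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (observed + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


def relative_bias(estimates, truth):
    """Relative bias (%) of simulated estimates against the true prevalence."""
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - truth) / truth


def mean_square_error(estimates, truth):
    """Mean squared deviation of simulated estimates from the true prevalence."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


if __name__ == "__main__":
    # Toy case flags for four people (1 = hypertension case, 0 = not a case).
    ahr = [1, 0, 1, 0]
    emr = [1, 1, 0, 0]
    print(prevalence(combine_or(ahr, emr)))   # share flagged in either source
    print(prevalence(combine_and(ahr, emr)))  # share flagged in both sources
    print(adjust_prevalence(0.30, sensitivity=0.80, specificity=0.90))
```

The sketch makes the dependence noted in the conclusions visible: the adjusted estimate is only as good as the sensitivity and specificity values supplied to it, whereas the OR and AND estimates depend on how often the two sources agree.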
