New algorithms for disease outbreak detection are being developed to take advantage of full electronic medical records (EMRs) that contain a wealth of patient information. However, due to privacy concerns, even anonymized EMRs cannot be shared among researchers, resulting in great difficulty in comparing the effectiveness of these algorithms. To bridge the gap between novel bio-surveillance algorithms operating on full EMRs and the lack of non-identifiable EMR data, a method for generating complete and synthetic EMRs was developed.
This paper describes a novel methodology for generating complete synthetic EMRs both for an outbreak illness of interest (tularemia) and for background records. The method developed has three major steps: 1) synthetic patient identity and basic information generation; 2) identification of care patterns that the synthetic patients would receive based on the information present in real EMR data for similar health problems; 3) adaptation of these care patterns to the synthetic patient population.
We generated EMRs, including visit records, clinical activity, laboratory orders/results and radiology orders/results for 203 synthetic tularemia outbreak patients. Validation of the records by a medical expert revealed problems in 19% of the records; these were subsequently corrected. We also generated background EMRs for over 3000 patients in the 4-11 year age group. Validation of those records by a medical expert revealed problems in fewer than 3% of these background patient EMRs, and the errors were subsequently rectified.
A data-driven method was developed for generating fully synthetic EMRs. The method is general and can be applied to any data set that has similar data elements (such as laboratory and radiology orders and results, clinical activity, prescription orders). The pilot synthetic outbreak records were for tularemia but our approach may be adapted to other infectious diseases. The pilot synthetic background records were in the 4-11 year old age group. The adaptations that must be made to the algorithms to produce synthetic background EMRs for other age groups are indicated.
Despite the current push to adopt electronic medical records (EMRs) as the standard for patient records, research that uses all the information in EMRs is hampered because legal restrictions and privacy concerns limit access to the full records to a small number of academic and industrial research institutions. For example, an algorithm designed to work on EMRs can only be tested on the specific set of records available to its developers, so there is no consistent set of test data that all interested parties can use to compare the efficacy of different algorithms. To avoid compromising patient privacy and hospital proprietary concerns, the Synthetic Electronic Medical Records Generator (EMERGE) project aims to develop a methodology for creating synthetic EMRs from a set of real EMRs. Using EMERGE, test beds could be synthesized, creating EMRs for both background records and artificial outbreaks or emergencies that are not present in the real data. The availability of a standardized set of test data would allow comparison of different algorithms and procedures that operate on EMRs as well as provide a set of records for the development of such algorithms.
There are privacy and proprietary concerns with the dissemination of medical record data [
The first EMERGE product provided us with valuable lessons that were used in the methodology described in this paper. This earlier product was a set of synthetic EMRs of infected patients who were exposed to a fictitious bioterrorism event, the release of airborne tularemia in the restrooms at a sporting event [
We tested, analyzed, and refined methods to extract meaningful information from real EMRs to produce synthetic EMRs, as well as ways to represent the information in a mathematically consistent fashion. After this analysis, we devised a three-part method that would allow automation of the synthesis process with as much fidelity as possible in information content but without compromising details that defined any original patient's identity. This three-part method consists of: 1) the synthesis of patient identities; 2) the identification of models-of-care in real EMRs; and 3) the adaptation of models-of-care to the synthetic patients. We will describe the processes in detail both for the set of injected tularemia patients and the set of synthetic background patients. Because the pilot age group for the background records was 4-11 year olds, we will suggest adaptations to this method that we believe are necessary for the creation of data sets beyond this single age group.
There has been considerable research utilizing the information contained in EMRs. Some recent results occur in bio-surveillance [
There have been various efforts at de-identification of medical record information, including the Realistic But Not Real (RBNR) project [
As a recent effort to model the progression of chronic disease in an individual [
There are many different models for the spread of infectious disease through a population. From a simple lognormal curve [
The injection of artificial disease outbreaks into real time-series data, so-called hybrid data, is commonly used to test the effectiveness of disease outbreak detection or clustering algorithms. Typically a series of outbreaks or events is calculated according to an epidemic model, and case counts are added to the real data to simulate the additional cases that would occur as a result of the outbreak. Algorithms can then be tested with and without this outbreak data to gauge their sensitivity (for a general reference see, e.g. [
The addition of EMR to the available data sources has fueled recent advances in surveillance methods [
The rest of this paper is organized as follows: Section II presents an overview of the method developed, a description of the data set used, and details of the major steps of the methodology: Synthetic Patient Identities and basic information generation, Identification of Closest Patient Care Models and Descriptors, and Adaptation of Patient Care Models. Section III presents the results, and we finish with discussion and conclusions in Section IV.
We have developed two methods, one for generating synthetic EMRs for patients included in an infectious disease outbreak (Figure
The synthesis of EMRs for either an injected disease outbreak or for a set of background medical records begins with the determination of who becomes ill (or injured), when they become ill (or injured), and what diseases (or injuries) are the underlying causes of their seeking medical care. In this paper, we will refer to these simulated patients with the disease of interest as the victims. Therefore, the two techniques may be called respectively the
Once the basic information about patients has been synthesized (date of birth, gender, etc.), the next step is the identification of the care patterns that the synthetic patients would receive. This care pattern is defined as the sequence of health-care events that the patient experiences and it is used to create entries in the synthetic EMR. These EMR entries may include laboratory test orders/results, radiology orders/results, and prescription orders, as well as the clinical history such as working and final diagnoses.
After an appropriate care pattern is identified from the care patterns present in the real EMR data set - the method is described in detail in [
The simulated disease outbreak is called an
The synthetic patients who are infected with the outbreak disease of interest will be called
The EMRs of patients other than those with the disease of interest will be called
A Care Model is defined as the sequence of health-care events that a patient experiences. These care models are identified from real EMR data and described in detail in section II.e.
The abbreviation
We developed the synthetic EMR creation techniques using a dataset that contained 14 months of EMRs from the BioSense [
These BioSense data included seven tables defined as follows: 1) the Analysis Visit Table, which includes patient and visit identifier numbers, age and gender information, a summary of clinical activity, and syndrome and sub-syndrome; 2) the Clinical Activity Table, which includes patient identifiers and detailed records of chief complaint/reason for visit, working diagnoses and final diagnoses; 3) Laboratory Orders; 4) Laboratory Results; 5) Radiology Orders; 6) Radiology Results; and 7) Rx (defined as prescription) Orders. All tables for a particular patient were linked via a patient identification number and a visit identification number. The tables were in the format used by the SAS system [
Different subsets of the BioSense data set were used for
For the
The first major step of our method (see Figure
In a simulated disease outbreak scenario, the disease is typically modeled to infect a target population, which may be defined geographically, demographically, and by the scenario itself. The method of infection and type of disease dictates the epidemic curve, the type and severity of symptoms, and the disease progression. For example, in the simulated bioterrorist release of tularemia, the release of airborne tularemia occurred in the restrooms near the luxury boxes at a summer sporting event. The timelines for the initial prodrome and then full-blown illness were taken from values in the literature [
The timelines and illness progression for the synthetic tularemia victims were dependent on the dose of bacterial particles to which each victim was exposed, age, and other disease information found in the literature [
Using the syndromes and sub-syndromes assigned to the injected patients and a distribution of additional patient attributes that we compiled from the literature, we produced a victim descriptor for each patient. The victim descriptor included the basic characteristics of the injected patient: age, gender, race, ethnic group, and certain syndromes and sub-syndromes related to tularemia (described in the previous paragraph). These victim descriptors were then passed to the next step of the EMR synthesis procedure, the identification of the closest care models.
It is worth noting that the real EMRs we used did not contain any diagnoses of tularemia. Thus, using expert medical opinion, we searched for patterns of care that matched the sequence of symptoms that were generated in the course of simulating the illness progression in each synthetic victim. This procedure is described in detail in section II.e. In some cases patterns of care could be found that were quite close to what was expected for tularemia and in some cases there were discrepancies; we describe the required adjustments to the data in section II.f.
There is no single scenario or target population when the intention is to produce synthetic background records. The underlying causes of symptoms and reasons for seeking care are not known and need to be imposed on the synthetic patients in an approximation of those that appear in the real EMRs. However, the real EMR is an inexact reflection of the underlying condition of a patient. Even the most carefully entered EMR has been filtered by medical personnel subject to insurance rules and hospital or office protocols (see, e.g. [
To implement our data-driven approach to create synthetic EMRs from real EMRs, we needed to select a driving data element (i.e., an independent variable) that could be used to determine the other linked patient information for the background patients. For this driving data element, we considered using syndrome classification, sub-syndrome classification, 3-digit ICD-9 final diagnosis code [
Data elements present in the real data set.
| Data Element | % of visits in model data |
|---|---|
| Chief Complaint or Reason for Visit | 85.16 |
| Sub-syndrome | 83.52 |
| Syndrome | 54.41 |
| Final diagnosis ICD-9 code | 99.72 |
Possible ICD-9 codes associated with the chief complaint "sore throat."
| ICD-9 Code | Description |
|---|---|
| 034.0 | 034.0 STREP SORE THROAT |
| 079.99 | 079.99 VIRAL INFECTION NOS |
| 382.9 | 382.9 OTITIS MEDIA NOS |
| 462 | 462 ACUTE PHARYNGITIS |
| 463 | 463 ACUTE TONSILLITIS |
| 465.9 | 465.9 ACUTE URI NOS |
| 466.0 | 466.0 ACUTE BRONCHITIS |
| 473.9 | 473.9 CHRONIC SINUSITIS NOS |
| 486 | 486 PNEUMONIA, ORGANISM NOS |
| 528.0 | 528.0 STOMATITIS |
| 786.07 | 786.07 WHEEZING |
This does not include misspellings or abbreviations.
Another, even more relevant, consideration is the definition of the mapping from the driving data element to illness. By the mapping from data element to illness, we mean the identification of a single value of a data element with a particular illness, injury or condition. Ideally, the data element used to define the patients would have a clear and easily recognized relationship to underlying illness or injury. The mapping from data element to illness is "well defined" in the real data if the patient visits grouped by a single value of this data element differ very little in the underlying illness or condition that can be inferred from the information in the medical record. This is akin but not equal to the notion of specificity; a well-defined mapping minimizes the inclusion of dissimilar records associated with the data element. This mapping also has to be well-defined in the inverse sense, meaning that there is little
First let us consider the use of chief complaint, syndrome or sub-syndrome as the driving data element. Because of the variations in the spelling and abbreviations in chief complaints, the additional step of natural language processing (e.g. [
There are different possible illnesses or injuries that may be associated with a chief complaint, syndrome or sub-syndrome and the inverse is also true: there are many possible chief complaints for each underlying illness (or injury). To illustrate this, we extracted all patient visits with a single final diagnosis code of strep throat (ICD-9 code 034.0) from both emergency and outpatient cases in the subset of 4-11 year old patients from the real data set. Among emergency cases, there were 49 different chief complaint strings, taken from a list of 28 different single complaints (e.g. fever, nausea, neck pain, cold, cough, stuffy nose, abdominal pain, vomiting, sore throat or variations such as throat pain, throat sore, etc. as well as others). Although sore throat (and variations) appeared quite often in the strings, there were 26 complaints (53%) that did not contain sore throat, throat pain, or other such variations anywhere in the complaint text. In the outpatient cases, there were 62 different reasons-for-visit, with 30 different single complaints. Of these, 9 (14.5%) did not contain sore throat, throat pain or other such variations anywhere in the reason-for-visit text. This ill-defined mapping translates to the following problem with using chief complaint as the driving data element. If patients were synthesized only from timelines of particular chief complaints, the variety of underlying causes would have to be synthesized independent from, but consistent with, the chief complaints. This would require either considerable expert input on the possible conditions associated with those chief complaints or the categorization of ICD-9 codes and/or syndromes and sub-syndromes by chief complaint.
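The kind of count reported above (complaint strings that never mention a sore throat) can be illustrated with a simple text scan. The sketch below is only an illustration: the regular expression, the function name `share_without_mention`, and the sample strings are assumptions, not the authors' procedure or data.

```python
import re

# Hypothetical pattern for "sore throat" and a few variations. A real study
# would need a vetted list of spellings and abbreviations.
SORE_THROAT = re.compile(r"sore\s*throat|throat\s*(?:pain|sore)", re.IGNORECASE)


def share_without_mention(complaints, pattern=SORE_THROAT):
    """Return (count, fraction) of distinct complaint strings lacking the pattern."""
    # Normalize case and whitespace so repeated strings are counted once.
    distinct = {c.strip().lower() for c in complaints if c.strip()}
    if not distinct:
        return 0, 0.0
    missing = [c for c in distinct if not pattern.search(c)]
    return len(missing), len(missing) / len(distinct)


# Made-up chief complaint strings for visits that share one final diagnosis.
example = ["FEVER", "Sore throat, fever", "THROAT PAIN", "abd pain vomiting", "fever"]
count, fraction = share_without_mention(example)
```

A large fraction returned for a single final diagnosis code is one symptom of the ill-defined mapping discussed above.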
An additional difficulty with using chief complaints, syndromes or sub-syndromes is that temporal variations in specific conditions would be inherited from the general category of data elements and may not match those for specific illnesses in the real data. For example,
Next, let us consider using ICD-9 code as the driving data element. Although the final diagnosis ICD-9 code does not always completely describe or define the underlying illness or injury of a patient [
The BioSense dataset contained 6426 clinical activity records for Emergency (ER) patients and 3248 records for outpatient (OP) patients. Of the 6426 ER visit records, 65% had one of 29 final diagnosis ICD-9 codes that each appeared in 100 or more patient visits. The five codes with the most patient visits among these 29 were 873, 382, 493, 465, and 034; together they comprised 28% of the patient visit records. In total, however, we reconstructed timelines for 295 ICD-9 final diagnosis codes for the synthetic ER patients. Of the 3248 OP visits, 43% had one of 14 final diagnosis ICD-9 codes that each appeared in 60 or more patient visits. These 14 included codes 382, 079, 462, and 034.
Because the primary diagnosis and any secondary diagnoses were not differentiated in the BioSense dataset we received, we used a convention to define the primary diagnosis. For each patient in the real data set, we sorted the final diagnosis codes in numeric order, with the exception that specific injury codes (prefixed E) were excluded from occurring as primary diagnoses. The first code (after alphanumerically sorting) was defined to be the primary diagnosis. The time series of healthcare events related to this primary diagnosis were mimicked to produce the synthetic patient timelines. Furthermore, we verified that this set of primary diagnosis ICD-9 codes did not contain diagnoses that were either considered rare by a subject matter expert or never appeared as a single primary diagnosis. This procedure thereby assured that sorting the ICD-9 codes and using the first code as the primary diagnosis did not omit any other potential primary diagnoses.
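The sorting convention above can be sketched in a few lines. This is a minimal illustration, assuming that codes are stored as zero-padded strings (so that plain string sorting agrees with numeric order) and that injury codes are recognized by a leading "E"; the function name `primary_diagnosis` is hypothetical.

```python
def primary_diagnosis(final_dx_codes):
    """Return the primary diagnosis under the sorting convention, or None.

    Assumes ICD-9 codes are zero-padded strings such as "034.0" or "465.9".
    """
    # Injury codes prefixed with "E" may not serve as the primary diagnosis.
    candidates = [
        code.strip()
        for code in final_dx_codes
        if code.strip() and not code.strip().upper().startswith("E")
    ]
    if not candidates:
        return None
    # The first code after alphanumeric sorting is taken as primary.
    return sorted(candidates)[0]


primary = primary_diagnosis(["465.9", "E917.0", "034.0"])
```

Keeping the codes as strings also preserves leading zeros, which would be lost if the codes were parsed as numbers.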
Rare diagnoses, diagnoses that are unusual in this age group, and diagnoses that included multiple congenital conditions were excluded from consideration because an individual could be identified from such a diagnosis together with knowledge of the particular region. Producing multiple records from these ICD-9 codes in order to remove the possibility of identification would yield a higher-than-usual incidence of rare or unusual conditions in the synthetic data. Although it could be argued that exclusion of rare diagnoses reduces the fidelity and realism of the synthetic data, a data set without the
We extracted the timelines of truncated (without detail codes after the decimal point) final diagnosis ICD-9 codes from the clinical activity records of the real data and reconstructed similar (but not identical) timelines in two ways, dependent on the sparseness or richness of the data stream. For time series with on average more than 1 case per week, we performed a Haar wavelet-2 [
Step Two performs the identification of the medical care that each of the synthetic patients would receive. Basic characteristics of the synthetic patient are coming from the
Step 2 for
Before the above steps can be executed, Patient Care Models, Patient Care Descriptors, and Analysis Visit Descriptors need to be extracted from the EMR data set. This process is performed only once for the whole data set and is described in detail later in this section. Computing the Patient Care Models and Patient Care Descriptors is expensive: it takes about 30 seconds per patient, so processing 10,000 patients takes about 80 hours on a 32-bit PC.
A
The AVisit contains the information about a given patient visit including patient's demographic data, visit identification number, visit date, syndromes, and sub-syndromes. For
Example Analysis Visit Descriptor (for Visit 2307262).
| 127151 | |
| 2307262 | |
| 12AUG2006:20:15:00.000 | |
| E | |
| 4-11 Years | |
| F | |
| White | |
| Not Hispanic or Latino | |
| Respiratory | |
| ... | |
| Otitis media | |
| ... | |
| 388.7 | |
| 382.7 | |
| 1 | |
| 1 | |
Example Analysis Visit Descriptor (for Visit 3102841).
| 127151 | |
| 3102841 | |
| 10JAN2007:08:41:07.000 | |
| E | |
| 4-11 Years | |
| F | |
| White | |
| Not Hispanic or Latino | |
| Fever | |
| Respiratory | |
| ... | |
| Fever | |
| Headache | |
| ... | |
| 462 | |
| 462 | |
| 465.9 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 2 |
For
Patient Care Descriptor for Patient 127151.
| 127151 | |
| 4-11 Years | |
| F | |
| White | |
| Not Hispanic or Latino | |
| 2 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 1 | |
| 2 | |
The goal of the methodology developed is to derive, from the available real EMR data, a care model of how the patients are treated. This method has the following main steps (Figures
1) Build
2) Build
Figure
For Synthetic Inject Records Generation (Figure
The information coming from Step One into Step Two is different for the Injected Victim Generation (Figure
In the case of
For
A distance measure is used to identify the closest (minimum distance) Patient Care Descriptor to the desired inject. Gower's General Similarity Coefficient [
The Euclidean distance operates on the following 23 attributes: Syndrome attributes - Fever, Gastrointestinal, Rash, Respiratory; and Sub-syndrome attributes - Abdominal pain, Alteration of consciousness, Chest pain, Convulsions, Cough, Diarrhea, Dyspnea, Headache, Hemoptysis, Hemorrhage, Influenza-like illness, Lymphadenopathy, Neoplasms, Malaise and fatigue, Nausea and vomiting, Respiratory failure, Septicemia and bacteremia, Upper respiratory infections, Severe illness or death.
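A nearest-descriptor search over these 23 attributes might look like the sketch below. It assumes a plain, unweighted Euclidean distance over 0/1 indicators; any weighting or scaling used in the original formula is not reproduced here, and the names `to_vector`, `euclidean`, and `closest_descriptor` are hypothetical.

```python
import math

SYNDROMES = ["Fever", "Gastrointestinal", "Rash", "Respiratory"]
SUB_SYNDROMES = [
    "Abdominal pain", "Alteration of consciousness", "Chest pain",
    "Convulsions", "Cough", "Diarrhea", "Dyspnea", "Headache", "Hemoptysis",
    "Hemorrhage", "Influenza-like illness", "Lymphadenopathy", "Neoplasms",
    "Malaise and fatigue", "Nausea and vomiting", "Respiratory failure",
    "Septicemia and bacteremia", "Upper respiratory infections",
    "Severe illness or death",
]
ATTRIBUTES = SYNDROMES + SUB_SYNDROMES  # 23 attributes in a fixed order


def to_vector(present):
    """Encode a collection of attribute names as a 0/1 vector (an assumption)."""
    present = set(present)
    unknown = present - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [1.0 if name in present else 0.0 for name in ATTRIBUTES]


def euclidean(u, v):
    """Unweighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_descriptor(victim_attrs, descriptors):
    """Return (descriptor_id, distance) for the minimum-distance descriptor.

    `descriptors` maps a descriptor id to the attribute names it exhibits.
    """
    target = to_vector(victim_attrs)
    scored = (
        (desc_id, euclidean(target, to_vector(attrs)))
        for desc_id, attrs in descriptors.items()
    )
    return min(scored, key=lambda pair: pair[1])


# Made-up descriptors, for illustration only.
best = closest_descriptor(
    {"Fever", "Cough", "Dyspnea"},
    {"A": {"Fever", "Cough"}, "B": {"Rash"}},
)
```

With binary indicators the squared distance is just the number of attributes on which the two descriptors disagree, so the minimum-distance descriptor is the one with the fewest mismatches.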
The Euclidean distance,
Where
For the Background Patient Generation, the procedure described below is performed for every synthetic visit of every synthetic patient. The goal is to find, for each synthetic visit, the visit in the real data that is as similar as possible. Figure
We define a distance measure between a synthetic visit and an Analysis Visit Descriptor that is a combination of weighted Euclidean distance and Jaccard distance [
The Jaccard index is a useful measure of similarity in cases of sets with binary attributes because it takes into account not only how many attributes agree but also how many disagree. If two sets, A and B, have
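For reference, the textbook Jaccard index of two sets is the size of their intersection divided by the size of their union, and the Jaccard distance is one minus that value. The sketch below implements only this standard definition; it does not reproduce the combined distance measure used in this work, and treating two empty sets as identical is a convention chosen here.

```python
def jaccard_index(a, b):
    """Textbook Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 minus the Jaccard index."""
    return 1.0 - jaccard_index(a, b)


# Two hypothetical visits described by the sub-syndromes they exhibit.
similarity = jaccard_index({"Fever", "Cough"}, {"Fever", "Headache"})
```

Attributes absent from both sets do not enter the ratio, which is why the index suits sparse binary descriptors where most attributes are absent for any given visit.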
We defined and used the following distance measure to identify the closest Analysis Visit Descriptor to a given synthetic visit descriptor:
where
Next we examine the search methods and rationale behind assignment of a specific care model to a synthetic patient. These processes differed both in automation level and in specific procedures for the synthetic tularemia victims and for the synthetic background patients.
The third step in the process (see Figure
A medical expert inspected the closest care models that were identified for the synthetic tularemia victims and recommended tests that needed to be deleted or added, under the assumption that the attending physician might not realize that the underlying illness was tularemia. We added either rapid strep tests or influenza tests to some records, but adjusted any results to reflect the absence of either strep or influenza. If respiratory or blood cultures were taken, we modified results to indicate no growth, reflecting the fastidious nature of tularemia in routine cultures. In some cases, we added chest x-rays to the records of patients whose care models did not include them to suggest a physician's possible response to an unexplained severity or persistence of respiratory symptoms. Chest x-ray results were modified to include evidence of pathology that would be common for pneumonic tularemia patients. Because the care models had to be inspected and edited individually and the patients assessed based on the symptoms generated by the injection algorithm, little automation was possible for this phase of the synthetic tularemia injection method. This is the biggest drawback of the proposed methodology.
In the case of Background Patient Generation, the distance measure used to find care models identified up to ten care model candidates for each synthetic background patient. We automated the process of choosing the most appropriate care model from those identified by examining specific ICD-9 codes and exact syndromes and sub-syndromes in the patient visit records and in the care models. The hierarchy was first to try to match all exact ICD-9 final diagnosis codes in the patient visit record. If multiple care models still matched the patient visit record, the algorithm chose by specific sub-syndrome and syndrome codes. If multiple care models still matched after this step, a care model was chosen randomly from the remaining candidates, with equal probability given to each.
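The selection hierarchy described above can be sketched as a sequence of filters followed by a uniform random draw. The record layout (dictionaries with `icd9`, `sub_syndromes`, and `syndromes` sets), the function name `choose_care_model`, and the fallback of keeping the previous pool when a filter matches nothing are all assumptions made for illustration.

```python
import random


def choose_care_model(visit, candidates, rng=random):
    """Pick one care model for a synthetic visit, or None if there are none.

    `visit` and each candidate are dicts holding sets under the keys
    "icd9", "sub_syndromes", and "syndromes" (a hypothetical layout).
    """
    pool = list(candidates)
    if not pool:
        return None

    # 1) Prefer candidates whose final diagnosis codes match exactly.
    exact = [c for c in pool if set(c["icd9"]) == set(visit["icd9"])]
    if exact:
        pool = exact

    # 2) If several remain, narrow by sub-syndrome and syndrome codes.
    if len(pool) > 1:
        narrowed = [
            c for c in pool
            if set(c["sub_syndromes"]) == set(visit["sub_syndromes"])
            and set(c["syndromes"]) == set(visit["syndromes"])
        ]
        if narrowed:
            pool = narrowed

    # 3) Break any remaining tie uniformly at random.
    return rng.choice(pool)


# Made-up visit and candidates, for illustration only.
visit = {"icd9": {"462"}, "sub_syndromes": {"Fever"}, "syndromes": {"Fever"}}
candidates = [
    {"id": 1, "icd9": {"462"}, "sub_syndromes": {"Cough"}, "syndromes": {"Respiratory"}},
    {"id": 2, "icd9": {"462"}, "sub_syndromes": {"Fever"}, "syndromes": {"Fever"}},
    {"id": 3, "icd9": {"034"}, "sub_syndromes": {"Fever"}, "syndromes": {"Fever"}},
]
chosen = choose_care_model(visit, candidates, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the final tie-break reproducible, which helps when regenerating the same synthetic data set.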
The final part of Step Three of the methodology is the adaptation of the entire EMR to the particular patient to assure that there is no exact match between a synthetic and a real patient. This procedure is described in the next section.
The injection algorithm unites the patient visit record and the care model to produce a consistent EMR for a synthetic patient so that the EMR has visit-linked entries in the six tables - analysis visit, clinical activity, radiology orders, radiology results, laboratory orders and laboratory results. Because the Rx orders in the real data set were incomplete and sporadic, we did not generate the synthetic Rx orders table. The injection algorithm time-stamps the entries subsequent to the visit date by using the time intervals found in the care model with randomly chosen (uniform) variation. Any radiology or laboratory orders are assigned unique identification numbers, and these numbers are carried through to the radiology and laboratory results if present. The algorithm also produces summaries of clinical activity and sub-syndrome/syndrome information in the format found in the model data and writes these in the analysis visit record. All formats found in the original records are duplicated by the injection algorithm. However, patient zip code, health department identifier, and location are not synthesized for the synthetic background records. The algorithm writes the tables to comma-separated-value (csv) files, preserving leading zeros on ICD-9 codes and producing SAS date formats as well as MS-EXCEL-readable formats.
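Three mechanical pieces of this step are time-stamping with uniform variation, producing SAS-style datetime strings, and writing CSV files that keep leading zeros. The sketch below illustrates them under stated assumptions: intervals are taken as minute offsets from the visit start, the size of the variation (`jitter`, here plus or minus 10%) is a placeholder because the source does not give it, and all function names are hypothetical.

```python
import csv
import io
import random
from datetime import datetime, timedelta

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def sas_datetime(dt):
    """Format a datetime like 12AUG2006:20:15:00.000."""
    clock = dt.strftime("%H:%M:%S")
    millis = dt.microsecond // 1000
    return f"{dt.day:02d}{MONTHS[dt.month - 1]}{dt.year}:{clock}.{millis:03d}"


def stamp_events(visit_start, intervals_minutes, jitter=0.1, rng=random):
    """Time-stamp events after the visit start with uniform variation.

    `intervals_minutes` is a list of (event_name, minutes_after_visit_start)
    pairs taken from a care model. `jitter` is an assumed placeholder value.
    """
    stamped = []
    for name, minutes in intervals_minutes:
        varied = minutes * rng.uniform(1 - jitter, 1 + jitter)
        stamped.append((name, visit_start + timedelta(minutes=varied)))
    return stamped


def rows_to_csv(rows, fieldnames):
    """Write dict rows to CSV text. Codes stay strings, so "034.0" keeps its zero."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


start = datetime(2006, 8, 12, 20, 15, 0)
events = stamp_events(start, [("lab_order", 30), ("lab_result", 180)],
                      rng=random.Random(1))
text = rows_to_csv(
    [{"event": name, "time": sas_datetime(when), "icd9": "034.0"}
     for name, when in events],
    ["event", "time", "icd9"],
)
```

Quoting every field keeps the codes as text in the file itself; whether a spreadsheet program preserves the leading zeros on import depends on how the file is opened.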
The simulated tularemia outbreak resulted in 203 synthetic victims. Of these victims, 19 individuals sought care for prodrome symptoms but none of these synthetic patients were admitted to a hospital at that time. All 203 victims sought care for a severe respiratory illness after 18-22 days; there were 57 synthetic patient admissions and 17 synthetic victim deaths.
As part of the validation process, a medical expert reviewed all the synthetic records and determined that 42 records (i.e., 19% of the 221 inject visit records) had content problems or inconsistencies. The predominant problem was a string of ICD-9 codes that did not match any of the expected symptoms of tularemia. For example, a patient would be assigned ICD-9 codes of "250.00 DMII WO CMP NT ST UNCNTR", "272.4 HYPERLIPIDEMIA NEC/NOS", "410.71 SUBENDO INFARCT, INITIAL" as well as the ICD-9 code of "486 PNEUMONIA, NOS." Because no cases of tularemia were found in the real data, it was expected that some difficulty would arise with matching the synthetic patient descriptors to an appropriate care model. Of the 221 visit records (for 203 patients), 36 were deemed unsuitable for this reason. Other problems included incompatible ICD-9 codes (2 records), for example, both "786.50 CHEST PAIN NOS" and "786.52 PAINFUL RESPIRATION"; exact duplication of real EMR information (2 records); and odd or incorrect ICD-9 codes (2 records), such as a diagnosis of "263.9 PROTEIN-CAL MALNUTR NOS." These 42 records were adjusted manually by editing the fields that were deemed inconsistent, odd, or erroneous. The medical expert noted that the presentation of tularemia in the remaining synthetic records was as expected.
The synthesized electronic medical records included syndrome and sub-syndrome classifications for the injected patients. The severe respiratory illness imposed on the victims was manifested in EMRs that included multiple syndrome and sub-syndrome classifications. Of the 221 visits for 203 victims, over 92% exhibited the fever syndrome, nearly 80% exhibited the respiratory syndrome, and nearly 12% had severe illness or death. In addition, over 10% of visits showed hemorrhagic illness and over 12% showed gastrointestinal syndrome. Sub-syndrome classifications for the injected patients included over 44% with cough, nearly 40% with dyspnea, over 47% with fever, and 19% with pneumonia or lung abscess. Many of the age 50+ patients also exhibited cardiac dysrhythmias and mental disorders. There were also more than 12% with respiratory failure and nearly 10% with shock (see Table
Syndromes assigned to synthetic tularemia injects.
| Syndrome | % of 221 visits with listed syndrome |
|---|---|
| Fever | 92.76 |
| Gastrointestinal | 12.22 |
| Hemorrhagic Illness | 10.86 |
| Localized Cutaneous Lesion | 0.45 |
| Lymphadenitis | 1.81 |
| Neurological | 12.67 |
| Rash | 4.52 |
| Respiratory | 79.64 |
| Severe Illness or Death | 11.76 |
| Specific Infection | 7.24 |
Of the 203 patients, 107 had from 1 to 28 laboratory orders. The most common test was a blood culture, accounting for nearly 40% of the laboratory tests. The next most common test was respiratory culture and smear, at nearly 13% of the tests (see Table
Sub-syndromes assigned to synthetic tularemia injects.
| Sub-Syndrome | % of 221 visits with listed Sub-Syndrome | Sub-Syndrome | % of 221 visits with listed Sub-Syndrome |
|---|---|---|---|
| Abdominal Pain | 8.14 | Heart disease, ischemic | 2.26 |
| Alteration of Consciousness | 12.67 | Hemoptysis | 7.24 |
| Anemia | 10.86 | Hypotension | 1.81 |
| Asthma | 3.62 | Influenza-like Illness | 8.14 |
| Bronchitis and Bronchiolitis | 11.31 | Intestinal infections, ill-defined | 0.45 |
| Cardiac dysrhythmias | 14.48 | Lymphadenopathy | 1.81 |
| Chest pain | 9.95 | Malaise and fatigue | 0.45 |
| Coagulation defects | 0.45 | Mental disorders | 10.86 |
| Coma | 8.14 | Migraine | 0.45 |
| COPD | 1.36 | Nausea and vomiting | 9.95 |
| Cough | 44.34 | Pleurisy | 0.45 |
| Cyanosis and hypoxemia | 2.71 | Pneumonia and lung abscess | 19.00 |
| Death | 0.45 | Purpura and petechiae | 1.36 |
| Dehydration | 1.81 | Rash | 10.41 |
| Diabetes mellitus | 1.81 | Respiratory failure | 12.67 |
| Diarrhea | 8.60 | Septicemia and bacteremia | 4.52 |
| Dizziness | 0.45 | Shock | 9.95 |
| Dyspnea | 39.82 | Skin infection | 0.45 |
| Edema | 2.71 | Syncope and collapse | 0.45 |
| Fever | 47.51 | Upper respiratory infections | 2.26 |
| Gastrointestinal hemorrhage | 2.71 | Urinary tract infections | 1.81 |
| Headache | 0.90 | Viral infection, unspecified | 11.31 |
Of the 112 synthetic patients who had radiology orders, over 80% had chest x-rays (see Table
Laboratory Tests for Synthetic Tularemia Injects.
| Ordered Test Name Local (Laboratory Test) | Percent of Tests Ordered | Ordered Test Name Local (Laboratory Test) | Percent of Tests Ordered |
|---|---|---|---|
| ASO Titer(ASO) | 0.66 | Influenza Antigen(FLUAG) | 3.45 |
| Aerobic | 0.16 | Legionella Ag Urine Culture(LEGEIA) | 0.82 |
| Blood Culture (BLC) | 39.90 | Lyme IgG | 0.66 |
| Blood Culture Isolator(BLDC) | 0.66 | Mono Test (MONO) | 0.16 |
| C Difficile Toxin A | 11.33 | Ova and Parasite Exam (OVAP) | 2.46 |
| C Reactive Protein (CRP) | 3.28 | Prealbumin (PAB) | 1.64 |
| CMV AB IgG (CMVGAB) | 0.66 | Reproductive Culture | 0.66 |
| CMV AB IgM(CMVMAB) | 0.66 | Respiratory Culture and Smear(RTCS) | 12.97 |
| Chlamydia/GC by Amplified Probe(CGPT) | 0.66 | Respiratory Viral Panel Acute(RVPA) | 0.66 |
| Enteric Pathogen Culture(ENPC) | 2.46 | Strep Group A Screen Rapid(RSTREP) | 2.63 |
| Epstein Barr Virus Antibody Screen (EBVSRN) | 0.66 | Urine Culture (URC) | 10.67 |
| Gram Smear (GRAS) | 0.16 | Urine Culture and Smear (URCS) | 0.16 |
| Haptoglobin (HAPT) | 1.15 | | |
| Herpes Virus Six Culture(HHV6Q) | 0.66 | | |
There were 609 tests ordered.
Radiology Orders for synthetic tularemia injects.
| Ordered Test Name Local (Radiology Order) | Percent of Tests Ordered |
|---|---|
| DX Abdomen 2 View | .64 |
| DX Abdomen AP | 13.74 |
| DX Abdomen Acute | .64 |
| DX Chest 1 View AP | 34.50 |
| DX Chest 1 View NR | 1.60 |
| DX Chest 2 View | 23.64 |
| DX Chest Special Vi | .32 |
| DX Small Bowel Series | .64 |
| PX Abdomen Portable | 2.88 |
| PX Chest 1 V Portable | 20.77 |
| PX Cholangiogram In | .64 |
For illustrative purposes, we will now follow two of the synthetic tularemia patients through the various records of the EMR, as seen in Table
Electronic medical records for two synthetic tularemia inject patients.
| A) Analysis Visit Table | |||||
|---|---|---|---|---|---|
| Analysis Visit ID | Patient ID | Analysis Visit Date | Analysis Visit End Date | AVPatClass | |
| 214973 | 03AUG2006:05:27:01 | 03AUG2006:05:27:01 | E | ||
| 210042 | 18AUG2006:07:49:21 | 27AUG2006:18:00:33 | O | ||
| (Analysis Visit Table continued...) | |||||
| 25 | 20-49 | Male | 2106-3 | White | |
| 47 | 20-49 | Female | 2106-3 | White | |
| (Analysis Visit Table continued...) | |||||
| Not Hispanic or Latino | 1 | Discharged to home or self care (routine discharge) | |||
| Not Hispanic or Latino | 1 | Discharged to home or self care (routine discharge) | |||
| (Analysis Visit Table continued...) | |||||
| ER | 03AUG2006:05:27:01 | E | |||
| RA3 | 19AUG2006:02:41:11 | E | |||
| (Analysis Visit Table continued...) | |||||
| |Emergency - Chief Complaint| | |Fever |Influenza-like illness| | |Fever|Respiratory | |||
| |Outpatient - Final Diagnosis|Outpatient - Reason for Visit|Outpatient - Working Diagnosis | |Cough|Fever| | |Respiratory|Fever| | |||
| B) Clinical Activity Table | |||||
| 214973 | 03AUG2006:05:27:01 | E | CC | PV2 | |
| 210042 | 18AUG2006:07:49:21 | O | CC | PV2 | |
| 210042 | 18AUG2006:07:49:21 | O | DX | DG1 | |
| 210042 | 18AUG2006:07:49:21 | O | DX | DG1 | |
| 210042 | 18AUG2006:07:49:21 | O | DX | DG1 | |
| (Clinical Activity Table continued...) | |||||
| FEVER OTHER FLU LIKE SYMPTOMS | 03AUG2006:05:27:01 | Hospital | |||
| COUGH AND FEVER | 18AUG2006:07:49:21 | Hospital | |||
| 786.2 COUGH | 18AUG2006:07:49:21 | F | Hospital | ||
| 780.6 FEVER | 18AUG2006:07:49:21 | F | Hospital | ||
| 786.2 COUGH | 18AUG2006:07:49:21 | A | Hospital | ||
| (Clinical Activity Table continued...) | |||||
| 03AUG2006:06:18:51 | 03AUG2006:05:27:01 | 25 | Year | ||
| 18AUG2006:22:04:34 | 18AUG2006:07:49:23 | 47 | Year | ||
| 27AUG2006:10:31:30 | 27AUG2006:03:32:34 | 47 | Year | ||
| 27AUG2006:19:47:07 | 18AUG2006:07:49:21 | 47 | Year | ||
| 27AUG2006:17:57:09 | 18AUG2006:07:49:21 | 47 | Year | ||
| (Clinical Activity Table continued...) | |||||
| |Fever|Influenza-like Illness| | |Fever|Respiratory| | |Fever|Influenza-like Illness| | |||
| |Cough|Fever| | |Respiratory|Fever| | |Cough|Fever| | |||
| |Cough| | |Respiratory| | |Cough|| | |||
| |Fever| | |Fever| | |Fever| | |||
| |Respiratory| | |Respiratory| | |Cough| | |||
| (Clinical Activity Table continued...) | |||||
| |Fever|Respiratory| | |2324|3309| | ||||
| |Respiratory|Fever| | |3298|2324| | ||||
| |Respiratory| | |2885| | ||||
| C) There are no laboratory orders for these two patients. | |||||
| D) There are no laboratory results for these two patients. | |||||
| E) Radiology Order Table | |||||
| 536120 | NW | 18AUG2006:07:49:21 | RT | ||
| (Radiology Orders continued...) | |||||
| 44404327 | RAD | 19AUG2006:03:01:01 | P | ||
| (Radiology Orders continued...) | |||||
| 15488699 | DX Chest 2 View | COUGH AND FEVER | |||
| F) Radiology Results Table | |||||
| 536120 | 1 | 18AUG2006:07:49:21 | |||
| 536120 | 1 | 18AUG2006:07:49:21 | |||
| 536120 | 1 | 18AUG2006:07:49:21 | |||
| (Radiology Results continued...) | |||||
| O | 444004327 | RAD | 19AUG2006:06:06:32 | P | |
| O | 444004327 | RAD | 19AUG2006:06:55:41 | F | |
| O | 444004327 | RAD | 19AUG2006:08:59:32 | F | |
| (Radiology Results continued...) | |||||
| 15488699 | DX Chest 2 View | COUGH AND FEVER | |||
| 15488699 | DX Chest 2 View | COUGH AND FEVER | |||
| 15488699 | DX Chest 2 View | COUGH AND FEVER | |||
| (Radiology Results continued...) | |||||
Some fields have been omitted for clarity; most of the omitted fields are blank.
Impressions
PA AND LATERAL VIEWS OF CHEST Findings: Multiple parenchymal infiltrates seen bilaterally. There are bilateral pleural effusions
PA AND LATERAL VIEWS OF CHEST Findings: Multiple parenchymal infiltrates seen bilaterally. There are bilateral pleural effusions
PA AND LATERAL VIEWS OF CHEST Findings: Multiple parenchymal infiltrates seen bilaterally. There are bilateral pleural effusions
We generated approximately 3000 synthetic background patient EMRs for the 4-11 year age group. The generation was repeated several times to ensure that the algorithms performed consistently. There were 295 different primary diagnosis ICD-9 codes for the group of synthetic emergency patients and 294 different primary diagnosis ICD-9 codes for the group of synthetic outpatients.
These ICD-9 codes included both illness and injury. In fact, the most common primary diagnosis for the emergency room patients in this age group was head laceration. To illustrate, we will follow the two synthetic background patient EMRs described in Table
Electronic medical records for two synthetic background patients.
| A) Analysis Visit Table | ||||
|---|---|---|---|---|
| 60343 | 1 | 72835 | 15JUL2006:18:46:59 | |
| 62834 | 1 | 75326 | 23JUL2007:19:26:24 | |
| (Analysis Visit Table continued...) | ||||
| E | 23JUN2002:00:00:00 | 4 | 4-11 Years | |
| E | 12SEP1995:00:00:00 | 11 | 4-11 Years | |
| (Analysis Visit Table continued...) | ||||
| 2106-3 | White | 2186-5 | Not Hispanic or Latino | |
| 2106-3 | White | 2186-5 | Not Hispanic or Latino | |
| (Analysis Visit Table continued...) | ||||
| Discharged to home or self care (routine discharge) | 15JUL2006:20:14:07 | 15JUL2006:21:07:21 | ||
| Discharged to home or self care (routine discharge) | 23AUG2007:16:57:59 | 23AUG2007:18:03:51 | ||
| (Analysis Visit Table continued...) | ||||
| SORE THROAT;|462|462 | ||||
| LEFT CLAVICULAR PAIN;|810.02|E884.0|786.59 | ||||
| (Analysis Visit Table continued...) | ||||
| Upper_respiratory_infections | Respiratory | |||
| Falls|Fractures_and_dislocation | ||||
| B) Clinical Activity Table | ||||
| 32096443 | 60343 | 15JUL2006:18:46:59 | E | |
| 32096444 | 60343 | 15JUL2006:18:46:59 | E | |
| 32096445 | 60343 | 15JUL2006:18:46:59 | E | |
| 32106099 | 62834 | 23JUL2007:19:26:24 | E | |
| 32106100 | 62834 | 23JUL2007:19:26:24 | E | |
| 32106101 | 62834 | 23JUL2007:19:26:24 | E | |
| 32106102 | 62834 | 23JUL2007:19:26:24 | E | |
| (Clinical Activity Table continued...) | ||||
| PV2 | SORE THROAT; | |||
| DG1 | 462 | 462 ACUTE PHARYNGITIS | ||
| DG1 | 462 | 462 ACUTE PHARYNGITIS | ||
| PV2 | LEFT CLAVICULAR PAIN | |||
| DG1 | 810.02 | 810.02 FX CLAVICLE SHAFT-CLOSED | ||
| DG1 | E884.0 | E884.0 FALL FROM PLAYGROUND EQUIPMENT | ||
| DG1 | 786.59 | 786.59 CHEST PAIN NEC | ||
| (Clinical Activity Table continued...) | ||||
| Chief Complaint | 4 | Year | F | |
| Final Diagnosis | 4 | Year | F | |
| Working Diagnosis | 4 | Year | F | |
| Chief Complaint | 11 | Year | M | |
| Final Diagnosis | 11 | Year | M | |
| Final Diagnosis | 11 | Year | M | |
| Working Diagnosis | 11 | Year | M | |
| C) Laboratory Orders Table | ||||
| 72835 | 1 | NW | ||
| (Laboratory Orders Table continued...) | ||||
| 609251 | 15JUL2006:00:00 | RT | E | |
| (Laboratory Orders Table continued...) | ||||
| MB | 15JUL2006:21:50:03 | |||
| (Laboratory Orders Table continued...) | ||||
| STTH | ||||
| (Laboratory Orders Table continued...) | ||||
| THT | ||||
| D) No Laboratory Results for these patients | ||||
| E) Radiology Orders Table | ||||
| 75326 | 1 | NW | ||
| (Radiology Orders Table continued...) | ||||
| 611742 | 25JUL2007:19:26:24 | RT | E | |
| (Radiology Orders Table continued...) | ||||
| RAD | 26JUL2007:19:13:30 | P | ||
| (Radiology Orders Table continued...) | ||||
| 26JUL2007:21:05:29 | DX Clavicle LEFT | Trauma | ||
| F) Radiology Results Table | ||||
| 75326 | 5887812 | 1 | RE | |
| 75326 | 5887812 | 1 | RE | |
| (Radiology Results Table continued...) | ||||
| 611742 | 25JUL2007:19:26:24 | RT | E | |
| 611742 | 25JUL2007:19:26:24 | RT | E | |
| (Radiology Results Table continued...) | ||||
| RAD | 27JUL2007:05:42:35 | F | ||
| RAD | 26JUL2007:20:41:53 | F | ||
| (Radiology Results Table continued...) | ||||
| 15488740 | DX Clavicle LEFT | Trauma | ||
| 15488740 | DX Clavicle LEFT | Trauma | ||
| (Radiology Results Table continued...) | ||||
Some fields have been omitted for clarity; most of the omitted fields are blank.
Impressions
CLINICAL HISTORY: Pain following trauma. Technique: Two views were obtained. Comparison: None. Findings: There is a superiorly apex angulated, nondisplaced fracture of the mid clavicle. No pneumothorax is seen based on the view submitted. IMPRESSION: Nondisplaced fracture of the mid clavicle.
CLINICAL HISTORY: Pain following trauma. Technique: Two views were obtained. Comparison: None. Findings: There is a superiorly apex angulated, nondisplaced fracture of the mid clavicle. No pneumothorax is seen based on the view submitted. IMPRESSION: Nondisplaced fracture of the mid clavicle.
The second synthetic patient is an 11-year-old boy with a chief complaint of clavicular pain. The analysis visit record has sub-syndromes of falls and of fractures and dislocation. There are four clinical activity records: one for the chief complaint, one for the working diagnosis (chest pain) and two for the final diagnosis (closed clavicle fracture and fall from playground equipment). This synthetic patient record has one radiology order, for a clavicle x-ray, and two radiology results, reflecting the diagnosis of clavicle fracture. This synthetic patient had a routine discharge.
The descriptors for the synthetic background patients were built from ICD-9 codes truncated to 3 or 4 digits. As a result, we occasionally encountered a mismatch between the detailed ICD-9 codes of the synthetic patient and the detailed ICD-9 codes of the closest care model for that patient. These mismatched records accounted for the majority of the errors that were present in the synthetic EMRs before we corrected them. For example, a synthetic patient's final diagnosis ICD-9 code for a dog bite was matched to the care model for a non-venomous insect bite. However, these errors were present in fewer than 3% of the synthetic patient records (see Table
Validation of Synthetic background electronic medical records.
| Error | Number of Errors found | Percent of total Records |
|---|---|---|
| Care Model did not match ICD-9 codes completely. | 80 | 2.4% |
| ICD-9 Codes contradictory | 8 | .24% |
| Gender or Age ICD-9 mismatch | 6 | .18% |
| Sub-syndrome or syndrome assignment inconsistent with ICD-9 or chief complaint | 4 | .12% |
| Typographical or formatting error | 4 | .12% |
| Total | 91 | 2.8% |
There were 3272 visit records.
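The most common error above, a care model that shares a truncated ICD-9 code with the synthetic patient but differs in the detailed code, can be screened for automatically. The following is a minimal sketch under our own assumptions, not the validation procedure used in this study: the function names are hypothetical, truncation is taken to mean keeping the part of the code before the decimal point, and the example codes (E906.0 for a dog bite, E906.4 for a bite of a nonvenomous arthropod) are chosen for illustration because the source does not state which codes were involved.

```python
def icd9_category(code: str) -> str:
    """Return the part of an ICD-9 code before the decimal point."""
    return code.strip().upper().split(".")[0]


def detailed_mismatches(patient_codes, model_codes):
    """List (patient_code, model_code) pairs that agree after truncation
    but differ in their detailed form."""
    pairs = []
    for p in patient_codes:
        for m in model_codes:
            if icd9_category(p) == icd9_category(m) and p != m:
                pairs.append((p, m))
    return pairs


# Illustrative codes only (assumption): both truncate to E906.
patient = ["E906.0"]      # dog bite
care_model = ["E906.4"]   # bite of nonvenomous arthropod
flagged = detailed_mismatches(patient, care_model)
print(flagged)
```

Pairs flagged this way would still need review by a medical expert, since some detailed-code differences within a category are clinically harmless.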
Overall, there were few errors in the synthetic background patients. As we see from Table
We discussed the reproduction of various statistical and temporal patterns in the synthetic data in [
The complete synthetic tularemia and background data sets can be found on the public health grid [
Due to privacy concerns, even sanitized and anonymized EMRs cannot be shared among researchers developing bio-surveillance algorithms, methods for improving the quality of patients' care or investigating adverse drug effects. The work presented in this paper aims at removing this obstacle (i.e., the lack of non-identifiable EMR data). These new areas of research can only thrive when abundant, shareable and complete EMR data are available for use.
We have developed a data-driven method for generating full synthetic EMRs of tularemia patients as well as of background patients. The method has three major steps: 1) synthetic patient identity and basic information generation; 2) identification of care patterns that the synthetic patients would receive; and 3) adaptation of patient care models. The techniques described herein are data-driven, meaning that they mine the data in existing real EMRs in order to extract information about the patients' patterns of care and the frequencies of ICD-9 codes, syndromes, and sub-syndromes. The synthetic EMRs need to mimic rather than duplicate the real EMRs. That is, no synthetic patient can match a real patient exactly in age, gender, demographic variables and visit information, although as a group the age, gender, demographic variables and diagnoses need to display the same statistical distributions as the real EMRs.
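The "mimic rather than duplicate" requirement can be illustrated with a small sketch. It is not the identity-generation algorithm used in this work; it assumes, for illustration only, that a record is reduced to a hypothetical (age, gender, visit day) tuple, that each field is drawn independently from its empirical frequencies in the real data, and that any candidate identical to a real record is rejected.

```python
import random


def sample_synthetic(real_records, n, seed=0, max_attempts=10_000):
    """Draw n synthetic (age, gender, visit_day) tuples that match no real record."""
    rng = random.Random(seed)
    real_set = set(real_records)
    # Empirical marginals: each list preserves the real frequencies of one field.
    ages = [r[0] for r in real_records]
    genders = [r[1] for r in real_records]
    days = [r[2] for r in real_records]
    synthetic = []
    attempts = 0
    while len(synthetic) < n:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not draw enough non-matching records")
        candidate = (rng.choice(ages), rng.choice(genders), rng.choice(days))
        if candidate not in real_set:  # mimic, never duplicate
            synthetic.append(candidate)
    return synthetic


# Hypothetical toy data, not taken from any real EMR.
real_records = [(4, "F", 1), (5, "M", 2), (7, "F", 3), (11, "M", 4)]
synthetic = sample_synthetic(real_records, n=50)
```

Rejecting exact matches slightly perturbs the sampled distributions, so in practice the group-level statistics of the synthetic set would need to be checked against the real data afterwards.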
In the case of tularemia inject generation, 203 synthetic victim records were synthesized. Nineteen victims sought care for the prodrome, and all the victims sought care for the full-blown illness. Examples of full EMRs of two synthetic patients were described in detail. The method developed lends itself to generating EMRs of patients with illnesses other than tularemia. If such illnesses are present in the data set, the methodology will be the same as the one developed for the synthetic background data. For illnesses not present in the data set (e.g., those which are bioterrorism related), the illness needs to be studied using case reports and other information found in the medical literature. Expert medical opinion also needs to be taken into consideration (similar to what we present here for tularemia) in order to find patterns of care in the data that match the sequence of symptoms that will be generated in the course of modeling the illness's progression in each synthetic victim. For each new illness not present in the data, this may be a time-consuming process.
For the most part, patients 4-11 years old become sick or injured, get treated, and get well. For this reason, the 4-11 age group was the least complicated with which to develop the synthesis algorithms. Although there are quite a few patients in this age group with chronic conditions such as asthma, there were few with co-morbid conditions that cause repeated visits and extended hospital stays. Thus the visits are rarely related to one another, which allowed us to develop a methodology that performs matching on visits, instead of matching on the full Patient Care Model.
The pilot synthetic background data set was a starting point for the evolution of ideas; other methods will be necessary to synthesize patients in other age groups. It will be necessary to separate patients into categories of one-time visits that concern one or a few transient illnesses or injuries, and multiple related visits with dependent causes and co-morbid conditions. The methodology of matching Visit Care Models that we developed for the pilot age group will be insufficient to create reasonable synthetic records for the older groups of patients (especially 50+). For these patients, Patient Care Models will need to be used for matching instead of Visit Care Models. Thus the methodology for the older age groups will resemble the method developed for synthetic injects in which the matching was done on full Patient Care Models. We will need to develop care models for coexistent illnesses (e.g., diabetes and hypertension) that vary in manifestations of both or either underlying illness. In this case, the publications based on the Archimedes model [
We highlight the usefulness of this method with regard to the injection of electronic medical records of victims of bioterrorism or naturally occurring outbreaks of infectious disease. Our three-step method of producing visit records of the synthetic victims, matching them to the closest model of care, and adapting the closest model of care to the specific disease (or injury) can be used to test and develop many classes of algorithms and monitoring systems that operate on the entire electronic medical record.
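The matching step, in which a synthetic visit descriptor is compared with the descriptors of real care models and the closest ones are retrieved, can be sketched as follows. The distance measures defined for this work are not reproduced in this section, so the measure below is a hypothetical stand-in: a Jaccard distance over sets of truncated ICD-9 codes plus a weighted age difference. The descriptor fields (`id`, `age`, `codes`) and the `age_weight` parameter are likewise assumptions made for illustration.

```python
import heapq


def jaccard_distance(a: set, b: set) -> float:
    """1 minus the overlap ratio of two code sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def descriptor_distance(x, y, age_weight=0.05):
    """Hypothetical distance: code-set dissimilarity plus a weighted age gap."""
    return jaccard_distance(x["codes"], y["codes"]) + age_weight * abs(x["age"] - y["age"])


def closest_descriptors(query, library, k=10):
    """Return the k library descriptors nearest to the query, nearest first."""
    return heapq.nsmallest(k, library, key=lambda d: descriptor_distance(query, d))


# Toy descriptors built from truncated codes that appear in the example records.
library = [
    {"id": "A", "age": 10, "codes": {"810", "E884"}},
    {"id": "B", "age": 4, "codes": {"462"}},
    {"id": "C", "age": 11, "codes": {"810"}},
]
query = {"age": 11, "codes": {"810", "E884"}}
nearest = closest_descriptors(query, library, k=2)
print([d["id"] for d in nearest])
```

The default of `k=10` mirrors the retrieval of the closest ten descriptors; the care model ultimately adapted for a synthetic patient would be chosen from among these candidates.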
The authors declare that they have no competing interests.
ALB created the concepts of patient care model, patient care descriptor, and analysis visit descriptor, as well as the algorithms for creation of all the descriptors. She defined the distance measures and ran the algorithms for retrieving the closest ten descriptors. She co-authored the manuscript.
SB served as a medical expert on the creation of synthetic tularemia records and assisted in the validation of the synthetic tularemia records. He performed the validation of the synthetic background records and assisted in the editing of the manuscript.
LM was the project's technical lead. She created the algorithms and executed the majority of the programs for creation of synthetic patient identities for both injected tularemia patients and synthetic background patients. She wrote and executed the care model adaptation and injection algorithms. She performed validation of statistical properties of the records. She co-authored the manuscript.
All authors read and approved the final manuscript.
This work was supported by Grant Number P01-HK000028-02 from the US Centers for Disease Control and Prevention (CDC). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC. We would like to thank Brian Feighner, Joe Lombardo, Lang Hung and Michael Dorko of the Johns Hopkins University Applied Physics Laboratory for their contributions to this project and Jerome Tokars and John Copeland of the US Centers for Disease Control and Prevention for assistance with understanding of the data.