<!DOCTYPE article
PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with MathML3 v1.2 20190208//EN" "JATS-archivearticle1-mathml3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article"><?properties open_access?><?properties manuscript?><front><journal-meta><journal-id journal-id-type="nlm-journal-id">9711271</journal-id><journal-id journal-id-type="pubmed-jr-id">20660</journal-id><journal-id journal-id-type="nlm-ta">Pac Symp Biocomput</journal-id><journal-id journal-id-type="iso-abbrev">Pac Symp Biocomput</journal-id><journal-title-group><journal-title>Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing</journal-title></journal-title-group><issn pub-type="ppub">2335-6928</issn><issn pub-type="epub">2335-6936</issn></journal-meta><article-meta><article-id pub-id-type="pmid">31797589</article-id><article-id pub-id-type="pmc">7043281</article-id><article-id pub-id-type="manuscript">HHSPA1061138</article-id><article-categories><subj-group subj-group-type="heading"><subject>Article</subject></subj-group></article-categories><title-group><article-title>Automated phenotyping of patients with non-alcoholic fatty liver disease reveals clinically relevant disease subtypes</article-title></title-group><contrib-group><contrib contrib-type="author"><name><surname>Vandromme</surname><given-names>Maxence</given-names></name><xref ref-type="aff" rid="A1">1</xref><xref ref-type="aff" rid="A3">3</xref><xref rid="FN1" ref-type="author-notes">*</xref></contrib><contrib contrib-type="author"><name><surname>Jun</surname><given-names>Tomi</given-names></name><xref ref-type="aff" rid="A2">2</xref><xref rid="FN1" ref-type="author-notes">*</xref></contrib><contrib contrib-type="author"><name><surname>Perumalswami</surname><given-names>Ponni</given-names></name><xref ref-type="aff" rid="A1">1</xref></contrib><contrib contrib-type="author"><name><surname>Dudley</surname><given-names>Joel T.</given-names></name><xref ref-type="aff" rid="A3">3</xref></contrib><contrib 
contrib-type="author"><name><surname>Branch</surname><given-names>Andrea</given-names></name><xref ref-type="aff" rid="A1">1</xref><xref rid="CR1" ref-type="corresp">&#x02020;</xref></contrib><contrib contrib-type="author"><name><surname>Li</surname><given-names>Li</given-names></name><xref ref-type="aff" rid="A3">3</xref><xref ref-type="aff" rid="A4">4</xref><xref rid="CR1" ref-type="corresp">&#x02020;</xref></contrib></contrib-group><aff id="A1"><label>1</label>Division of Liver Diseases, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA</aff><aff id="A2"><label>2</label>Division of Hematology and Medical Oncology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA</aff><aff id="A3"><label>3</label>Institute for Next Generation Healthcare, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA</aff><aff id="A4"><label>4</label>Sema4, a Mount Sinai Venture, Stamford, CT 06902, USA</aff><author-notes><fn fn-type="equal" id="FN1"><label>*</label><p id="P1">These authors contributed equally to this work</p></fn><corresp id="CR1"><label>&#x02020;</label>Corresponding author <email>li.li@mssm.edu</email> or <email>andrea.branch@mssm.edu</email></corresp></author-notes><pub-date pub-type="nihms-submitted"><day>27</day><month>12</month><year>2019</year></pub-date><pub-date pub-type="ppub"><year>2020</year></pub-date><pub-date pub-type="pmc-release"><day>26</day><month>2</month><year>2020</year></pub-date><volume>25</volume><fpage>91</fpage><lpage>102</lpage><permissions><license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/"><license-p>Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.</license-p></license></permissions><abstract id="ABS1"><p id="P2">Non-alcoholic fatty liver disease (NAFLD) is a complex heterogeneous disease 
which affects more than 20% of the population worldwide. Some subtypes of NAFLD have been clinically identified using hypothesis-driven methods. In this study, we used data mining techniques to search for subtypes in an unbiased fashion. Using electronic signatures of the disease, we identified a cohort of 13,290 patients with NAFLD from a hospital database. We gathered clinical data from multiple sources and applied unsupervised clustering to identify five subtypes among this cohort. Descriptive statistics and survival analysis showed that the subtypes were clinically distinct and were associated with different rates of death, cirrhosis, hepatocellular carcinoma, chronic kidney disease, cardiovascular disease, and myocardial infarction. Novel disease subtypes identified in this manner could be used to risk-stratify patients and guide management.</p></abstract><kwd-group><kwd>clustering</kwd><kwd>subtypes definition</kwd><kwd>survival analysis</kwd><kwd>NAFLD</kwd></kwd-group></article-meta></front><body><sec id="S1"><label>1.</label><title>Introduction</title><p id="P3">Non-alcoholic fatty liver disease (NAFLD) is estimated to affect 25% of the global population.<sup><xref rid="R1" ref-type="bibr">1</xref></sup> NAFLD is a chronic liver disease associated with the metabolic syndrome that can progress to cirrhosis and hepatocellular carcinoma (HCC). In the United States, NAFLD-related liver failure has become the second most common indication for liver transplants, after chronic hepatitis C.<sup><xref rid="R2" ref-type="bibr">2</xref>,<xref rid="R3" ref-type="bibr">3</xref></sup> This trend is expected to continue, with NAFLD prevalence rising to 33.5% of the adult US population by 2030, and driving increases in both cirrhosis and HCC.<sup><xref rid="R4" ref-type="bibr">4</xref></sup></p><p id="P4">NAFLD is a heterogeneous disease which has been associated with a variety of adverse outcomes. 
Besides cirrhosis and HCC, NAFLD has also been associated with cardiovascular disease (CVD)<sup><xref rid="R5" ref-type="bibr">5</xref>,<xref rid="R6" ref-type="bibr">6</xref></sup> and chronic kidney disease (CKD).<sup><xref rid="R7" ref-type="bibr">7</xref></sup> In some cohorts, CVD is the leading cause of death among NAFLD patients, followed by malignancy and liver-related mortality.<sup><xref rid="R8" ref-type="bibr">8</xref>&#x02013;<xref rid="R10" ref-type="bibr">10</xref></sup></p><p id="P5">Some NAFLD subtypes and prognostic factors have been identified. Patients with both steatosis and inflammation (i.e. nonalcoholic steatohepatitis, NASH) have worse outcomes than those with bland steatosis.<sup><xref rid="R11" ref-type="bibr">11</xref>,<xref rid="R12" ref-type="bibr">12</xref></sup> Similarly, patients with NAFLD-associated cirrhosis have worse outcomes than those without cirrhosis.<sup><xref rid="R8" ref-type="bibr">8</xref></sup> Interestingly, although cirrhosis strongly predicts HCC, some NAFLD patients develop HCC in the absence of cirrhosis.<sup><xref rid="R13" ref-type="bibr">13</xref></sup> Hispanic populations tend to have higher rates of NAFLD;<sup><xref rid="R14" ref-type="bibr">14</xref></sup> a variant in <italic>PNPLA3</italic> associated with hepatic steatosis and NASH has been identified and is more common among Hispanic individuals.<sup><xref rid="R15" ref-type="bibr">15</xref></sup></p><p id="P6">Given the clinical variability among NAFLD patients, we hypothesized that there may be clinically relevant patient subtypes which could be identified using unbiased machine learning algorithms. 
The identification of such subtypes could enable more precise prognostication and management for NAFLD patients.</p></sec><sec id="S2"><label>2.</label><title>Methods</title><sec id="S3"><label>2.1.</label><title>NAFLD definition</title><p id="P7">To define NAFLD, we developed an algorithm based on two published electronic medical record (EMR)-based algorithms.<sup><xref rid="R16" ref-type="bibr">16</xref>,<xref rid="R17" ref-type="bibr">17</xref></sup> First, we identified patients with liver disease based on persistent alanine aminotransferase (ALT) elevation or ICD codes for chronic non-specific or non-alcoholic liver disease (ICD-9: 571.5, 571.8, 571.9; ICD-10: K75.81, K76.0, K76.9). Persistent ALT elevation was defined as two or more ambulatory measurements of ALT &#x02265; 40 IU/L for men, or &#x02265; 31 IU/L for women, taken more than 6 months apart. Then, we excluded patients with viral hepatitis, alcoholic liver disease, or other chronic liver disease. These conditions were identified via ICD codes, as enumerated in the eMerge algorithm. Viral hepatitis cases were also identified using lab values (HBV surface antigen, HCV RNA). Next, we excluded patients on steatogenic medications (defined in eMerge). Finally, we required evidence of hepatic steatosis on imaging or biopsy, or documented in a clinical note. This evidence was found using natural language processing (NLP) to detect mentions of hepatic steatosis and related terms.</p></sec><sec id="S4"><label>2.2.</label><title>Natural language processing</title><p id="P8">The eMerge algorithm requires mention of hepatic steatosis in a free-form text document (imaging or biopsy report, or clinical note). We developed a tool to extract this information from the database, using the following steps:</p><list list-type="bullet" id="L1"><list-item><p id="P9">build a list of synonyms for the term of interest, e.g. 
<italic>steatohepatitis</italic>, <italic>fatty liver</italic></p></list-item><list-item><p id="P10">query the SQL database for documents containing any of these terms</p></list-item><list-item><p id="P11">parse the documents to remove negative results (e.g. <italic>absence of steatohepatitis</italic>), occurrences in family history, and other false-positive patterns</p></list-item></list><p id="P12">This process was adapted to look for mentions of patient death (see <xref rid="S6" ref-type="sec">Section 2.4</xref>), to find patients with cirrhosis (see <xref rid="S12" ref-type="sec">Section 2.6</xref>), and to gather Model for End-Stage Liver Disease (MELD) scores (see <xref rid="T1" ref-type="table">Table 1</xref>).</p></sec><sec id="S5"><label>2.3.</label><title>Data collection</title><p id="P13">The cohort for this study was created using the criteria defined in <xref rid="S3" ref-type="sec">Section 2.1</xref>. The EMR data were obtained from the database of a large metropolitan hospital in New York City. We chose to consider only patients who met the criteria for NAFLD after December 31, 2012, and up to January 31, 2019. We defined the <italic>NAFLD diagnosis date</italic> as the earliest such date for each patient.</p><p id="P14">We found 13,290 patients matching these criteria in the database. Below, we describe, for each type of information, the data collection and pre-processing steps that were taken. In order to build a dataset usable by machine learning algorithms, we transformed the information contained in the database into binary features. When possible, we reduced the number of resulting features. Feature selection has been shown to improve the quality of results in machine learning applications.<sup><xref rid="R18" ref-type="bibr">18</xref></sup> This process is usually done using statistics- or heuristics-based algorithms. In practical applications, however, domain knowledge can be used instead. 
We took advantage of established knowledge to reduce the number of features by mapping to higher-level concepts or by discarding infrequent features.</p></sec><sec id="S6"><label>2.4.</label><title>Clinical feature standardization and quality control</title><sec id="S7"><label>2.4.1.</label><title>Demographic data</title><list list-type="bullet" id="L2"><list-item><p id="P15">Age: ten mutually exclusive binary attributes corresponding to the following age groups:</p><p id="P16">[18&#x02013;20], [21&#x02013;30], [31&#x02013;40], [41&#x02013;50], [51&#x02013;60], [61&#x02013;70], [71&#x02013;80], [81&#x02013;90], [91&#x02013;100], [101 and more].</p></list-item><list-item><p id="P17">Race: Asian, Black, Indian/Native, Pacific Islander, White, Hispanic, Other, Unknown</p></list-item><list-item><p id="P18">Ethnicity: Hispanic or not</p></list-item><list-item><p id="P19">Deceased: obtained through patient records and parsing clinical notes for mentions of death</p></list-item></list></sec><sec id="S8"><label>2.4.2.</label><title>Diagnoses, procedures, medications</title><p id="P20">A large proportion of clinical data (diagnoses, procedures, and medications) is recorded using standardized coding systems. We applied the following pre-processing steps:</p><list list-type="bullet" id="L3"><list-item><p id="P21">Diagnoses used the International Classification of Diseases, versions 9 and 10 (ICD-9 and ICD-10) systems. These systems contain tens of thousands of different codes, often describing the same disease with minor variations. In order to reduce the number of features, we mapped ICD codes to the <italic>phecode</italic> system from the Phenome-Wide Association Studies (PheWAS).<sup><xref rid="R19" ref-type="bibr">19</xref></sup> We kept only phecodes with at least 0.1% prevalence, which left 148 features for ICD codes.</p></list-item><list-item><p id="P22">Procedures used the Current Procedural Terminology (CPT) coding system. 
We mapped the CPT codes to their respective second-level group code. For example, the group containing all CPT codes from 33010 to 37799 describes surgeries of the cardiovascular system. This process grouped the codes into 115 categories that translated directly into features.</p></list-item><list-item><p id="P23">Medication prescriptions or administrations. We mapped the medication names to the corresponding RxNorm drug concepts, and again kept those that occurred in at least 0.1% of the cohort. We only considered drugs which had at least two prescriptions separated by 6 months or more, in order to discard drugs only used acutely (e.g. post-surgery) which do not reflect a patient&#x02019;s regular medications. Using this process, we obtained 293 clinical drugs.</p></list-item></list></sec><sec id="S9"><label>2.4.3.</label><title>Laboratory tests</title><p id="P24">As opposed to the previous data types, which were well-formatted and standardized, laboratory tests could be either qualitative or quantitative, and were often reported in free-text form. For qualitative tests, we parsed the result and searched for terms that indicated if it was abnormal, such as <italic>abnormal</italic>, <italic>low</italic>, <italic>below average</italic>, <italic>reactive</italic>. For quantitative tests, we searched the results for numeric values that fell outside the normal range.</p><p id="P25">We obtained 533 distinct laboratory tests, which translated to as many binary features. For example, feature <italic>platelets</italic> means <italic>abnormal result for platelets test</italic>. A shortcoming of this approach is that abnormally low and high values are grouped in the same feature, even though they have different medical significance. However, since one laboratory test can use different units, and thus different normal ranges (e.g. 
normal and log scales), automatically classifying a value as <italic>low</italic> or <italic>high</italic> cannot always be done reliably.</p></sec><sec id="S10"><label>2.4.4.</label><title>Vital signs</title><p id="P26">As with laboratory tests, we searched for abnormal values of the standard vital signs collected in clinical settings, using the following criteria:</p><list list-type="bullet" id="L4"><list-item><p id="P27">body temperature: &#x0003e; 39&#x000b0;C (Celsius) or 102&#x000b0;F (Fahrenheit).</p></list-item><list-item><p id="P28">blood pressure: systolic/diastolic blood pressure (SBP/DBP) &#x0003e; 130/80 mmHg.</p></list-item><list-item><p id="P29">heart rate: &#x0003e; 130 beats per minute.</p></list-item><list-item><p id="P30">respiratory rate: &#x0003e; 40 breaths per minute.</p></list-item><list-item><p id="P31">pain: values of 9 or 10 on a [1&#x02013;10] pain scale.</p></list-item></list></sec></sec><sec id="S11"><label>2.5.</label><title>Patient pairwise distance and clustering</title><p id="P32">In order to identify different subtypes, we computed the patient distance matrix and applied an unsupervised clustering algorithm to it. Unsupervised clustering is well-suited for exploratory tasks in applied research.<sup><xref rid="R20" ref-type="bibr">20</xref></sup> First, the results can be validated using expert knowledge. In the present study, the findings were reviewed and interpreted by medical experts. Second, the &#x0201c;unsupervised&#x0201d; aspect allows discovery of new, potentially unexpected insights from the analysis of a large number of features.</p><p id="P33">Many clustering algorithms have been developed. Finding the &#x0201c;best one&#x0201d; remains an open problem,<sup><xref rid="R21" ref-type="bibr">21</xref></sup> since unsupervised learning tasks lack objective measures to assess their performance. 
Several measures have been proposed to evaluate the quality of a set of clusters,<sup><xref rid="R22" ref-type="bibr">22</xref></sup> but the general guideline is that the best algorithm and parameters differ from one data set to another.</p><p id="P34">We chose a hierarchical clustering algorithm, using the Manhattan distance to measure pairwise dissimilarity between patients and, as linkage criterion, the minimal increase in variance during cluster merging (also known as Ward&#x02019;s criterion). Hierarchical clustering is a standard algorithm, and it has been used previously in a study looking for comorbidity clusters in autism disorders.<sup><xref rid="R23" ref-type="bibr">23</xref></sup> We used the R <italic>hclust</italic> implementation of this algorithm, with <italic>ward.D2</italic> as the agglomeration method.<sup><xref rid="R24" ref-type="bibr">24</xref></sup> We chose five subtypes (clusters) as a balance between granularity and cluster size. These parameters were chosen empirically, after qualitative validation of the results obtained with various combinations.</p></sec><sec id="S12"><label>2.6.</label><title>Statistical analysis</title><sec id="S13"><label>2.6.1.</label><title>Descriptive statistics</title><p id="P35">Categorical features were summarized as proportions and compared using the chi-squared test. Continuous features were summarized as means &#x000b1; standard deviation and compared using ANOVA, or as medians and interquartile ranges compared using the Wilcoxon rank-sum test. Comparisons for each subtype were made against patients in all remaining subtypes. Significance was defined as a false discovery rate <italic>&#x0003c;</italic>0.001.</p></sec><sec id="S14"><label>2.6.2.</label><title>Survival analysis</title><p id="P36">The primary outcome was overall survival. Secondary outcomes were HCC, cirrhosis, CKD, CVD, and acute myocardial infarction (MI). 
In all cases, survival time was defined as the time from NAFLD diagnosis to the earliest evidence of the outcome. HCC cases were first identified using ICD codes (ICD-9 155.0, 155.2; ICD-10 C22.0, C22.7&#x02013;C22.9), then confirmed through chart review. Cirrhosis was identified using natural language processing to find mentions of cirrhosis in clinical notes, imaging reports or biopsy reports. Chronic kidney disease was defined using corresponding ICD codes (ICD-9 585&#x02013;586; ICD-10 N18&#x02013;N19) and CPT codes for dialysis (90935 to 90999). Cardiovascular disease was defined using ICD codes for any ischemic heart disease (ICD-9 410&#x02013;414; ICD-10 I20&#x02013;I25). Acute MI was a subset of the CVD outcome (ICD-9 410; ICD-10 I21&#x02013;I22).</p><p id="P37">The primary predictor in survival analyses was subtype. Secondary predictors included age, gender, race and FIB-4 category. Race and ethnicity were combined for the purposes of this analysis, with Hispanic ethnicity given precedence and mapped to the Hispanic race category. All survival analyses were done in R 3.6.0. For the outcome of overall survival, Kaplan-Meier curves were created using the <italic>ggplot2</italic><sup><xref rid="R25" ref-type="bibr">25</xref></sup> and <italic>survminer</italic><sup><xref rid="R26" ref-type="bibr">26</xref></sup> packages; univariate and multivariate Cox proportional hazards models were constructed using the <italic>survival</italic> package.<sup><xref rid="R27" ref-type="bibr">27</xref></sup> For non-death outcomes, only incident cases were included in the analysis. Cases diagnosed prior to or within 6 months of NAFLD diagnosis were treated as prevalent. Death was treated as a competing risk. 
The cumulative incidence function was calculated for each outcome using the <italic>cmprsk</italic> package<sup><xref rid="R28" ref-type="bibr">28</xref></sup> and plotted using <italic>ggplot2</italic>. The <italic>cmprsk</italic> package was also used to fit univariate and multivariate Fine-Gray proportional subdistribution hazards regression models for the non-death outcomes.</p><p id="P38">This study was reviewed and approved by the Mount Sinai Hospital institutional review board (GCO 10&#x02013;0032 and 16&#x02013;1437).</p></sec></sec></sec><sec id="S15"><label>3.</label><title>Results</title><sec id="S16"><label>3.1.</label><title>Descriptive statistics for the cohort</title><p id="P39">Merging the data from the different sources described above, we obtained a data set containing 13,290 patients with NAFLD, described by 1,145 binary features (<xref rid="T1" ref-type="table">Table 1</xref>). The mean age at NAFLD diagnosis was 53 &#x000b1; 14.7 years (median = 53.9), and 50.6% of patients were female. The cohort was racially and ethnically diverse: 41.4% Caucasian, 17% Hispanic ethnicity, 9.6% African American, 5.9% Asian, and 27.3% unknown/other. Metabolic comorbidities such as obesity (53.8%), diabetes (32.9%), and hypertension (53.5%) were common. Median length of follow-up was 1.6 years (IQR 0.6&#x02013;2.9).</p></sec><sec id="S17"><label>3.2.</label><title>Identification of NAFLD subtypes</title><p id="P40">The two largest subtypes (1 and 3) encompassed 87% of patients, while the remaining patients were divided among three smaller subtypes (<xref rid="T1" ref-type="table">Table 1</xref>). All findings reported below are for the comparison of subtype members versus all other patients, and were significant after correction for multiple hypothesis testing at a level of p<italic>&#x0003c;</italic>0.001. Values associated with medications are omitted for brevity.</p><p id="P41">Patients in subtype 1 were more likely to be female and either Hispanic or African American. 
Obesity, hypertension, and hyperlipidemia (30.05% vs 24.8%) were more common among subtype 1 patients, while diabetes was less common. Subtype 1 patients had low MELD and FIB-4 scores at NAFLD diagnosis. Other diagnoses more common in subtype 1 patients included: vitamin D deficiency (14.2% vs 9.2%), asthma (11.4% vs 7.5%), and gastroesophageal reflux (18.7% vs 12.7%). Medications that were more common in this subtype included: omeprazole, metformin, atorvastatin, and fluticasone. Overall, subtype 1 patients had metabolic comorbidities, with some evidence of liver inflammation, but minimal liver fibrosis.</p><p id="P42">Patients in subtype 2 were more likely to be Hispanic or African American. They did not have significantly higher MELD or FIB-4 scores at baseline, but they were more likely than other patients to have labs suggestive of liver inflammation and dysfunction, such as elevated ALT, low platelets, elevated bilirubin, elevated INR and low albumin. Notable comorbidities included: diabetes, hypertension, hyperlipidemia (37.2% vs 27.8%), obstructive sleep apnea (11.9% vs 6.0%), gastroesophageal reflux (27.2% vs 16.1%), tobacco use (19.5% vs 4.8%), asthma (22.1% vs 9.5%), anxiety (13.0% vs 5.6%), depression (17.0% vs 6.8%), urinary tract infection (11.5% vs 3.9%), and respiratory infection (10.6% vs 3.6%). Medications more commonly prescribed in this subtype included cardiac medications such as aspirin, lisinopril, amlodipine, metoprolol, and atorvastatin; diabetes medications such as metformin and insulin; pain medications such as acetaminophen, gabapentin, oxycodone, and morphine; respiratory medications such as albuterol and fluticasone; antacid medications such as omeprazole and famotidine; and vitamin D. Subtype 2 patients were also more likely to have had digestive surgery (40.1% vs 16.8%). 
Overall, subtype 2 patients had metabolic syndrome with signs of developing liver dysfunction and were high healthcare utilizers.</p><p id="P43">Patients in subtype 3 tended to be younger and Caucasian, and had the fewest inpatient admissions and the fewest prescriptions on average. Subtype 3 patients had fewer comorbidities than other patients, and were unlikely to have abnormal lab values associated with liver dysfunction. Overall, subtype 3 patients were relatively healthy compared to the rest of the cohort.</p><p id="P44">Patients in subtype 4 were more likely to be older, male and Caucasian. They had high FIB-4 scores at baseline and were likely to have abnormal labs suggesting liver synthetic dysfunction. These patients were less likely to be obese or to have hyperlipidemia (20.8% vs 28.7%), though diabetes and hypertension were common. Overall, subtype 4 patients likely had liver fibrosis at baseline and had labs suggesting progression to cirrhosis.</p><p id="P45">Patients in subtype 5 were more likely to be older, and Hispanic or African American. They had high FIB-4 and MELD scores at baseline, and had high rates of abnormal lab values consistent with liver inflammation and dysfunction. Obesity was less common in this group, but diabetes and hypertension were prevalent. Other comorbidities included: malignancy (15.2% vs 2.0%), atrial fibrillation (11.4% vs 1.6%), tobacco use (28.7% vs 4.7%), depression (17.1% vs 6.9%), urinary tract infection (16.8% vs 3.8%), pneumonia (10.3% vs 1.9%), and sepsis (25.2% vs 0.3%). Commonly prescribed medications included: cardiac medications such as aspirin, metoprolol, and furosemide; pain medications such as acetaminophen, oxycodone, hydromorphone, fentanyl, and morphine; antacid medications such as pantoprazole and famotidine; and insulin. Subtype 5 patients were also more likely to have had cardiovascular (31.4% vs 7.4%), respiratory (16.5% vs 4.6%) or digestive surgery (50.0% vs 16.9%). 
Overall, subtype 5 patients had significant liver disease at baseline, had significant cardiac, infectious and neoplastic comorbidities, and were high healthcare utilizers.</p></sec><sec id="S18"><label>3.3.</label><title>Identification of distinct outcomes by NAFLD subtype</title><p id="P46">Univariate analyses showed that the risk of outcomes varied by subtype membership (<xref rid="F1" ref-type="fig">Figures 1</xref> and <xref rid="F2" ref-type="fig">2</xref>). Subtype 1 was chosen as the reference group since it was the largest. Compared to subtype 1, subtype 5 was significantly and strongly associated with an increased risk of all outcomes; risk of death was particularly high (hazard ratio [HR] 139; 95% confidence interval [CI] 86&#x02013;226, p<italic>&#x0003c;</italic>0.001). Subtype 4 was strongly associated with both cirrhosis (HR 42; 95% CI 12&#x02013;154, p<italic>&#x0003c;</italic>0.001) and HCC (HR 91; 95% CI 27&#x02013;302, p<italic>&#x0003c;</italic>0.001). Subtype 2 was associated with MI (HR 6.6; 95% CI 3.3&#x02013;13.3, p<italic>&#x0003c;</italic>0.001) and CKD (HR 3.4; 95% CI 2.3&#x02013;5.1, p<italic>&#x0003c;</italic>0.001). Subtype 3 was associated with a lower risk of CVD (HR 0.19; 95% CI 0.10&#x02013;0.37, p<italic>&#x0003c;</italic>0.001) and CKD (HR 0.51; 95% CI 0.31&#x02013;0.86, p=0.01). There were no incident cirrhosis or HCC events in subtype 3.</p><p id="P47">In multivariate analyses accounting for age, gender, race and baseline FIB-4, subtype membership remained an independent predictor of outcomes (<xref rid="F3" ref-type="fig">Figure 3</xref>). 
With subtype 1 as the reference, subtype 5 was independently associated with the highest risks for death (HR 46.7; 95% CI 33.3&#x02013;65.3, p<italic>&#x0003c;</italic>0.001), CKD (HR 4.3; 95% CI 2.7&#x02013;6.7, p<italic>&#x0003c;</italic>0.001), CVD (HR 2.2; 95% CI 1.1&#x02013;4.1, p=0.02), MI (HR 5.9; 95% CI 2.3&#x02013;15.0, p<italic>&#x0003c;</italic>0.001) and cirrhosis (HR 36.2; 95% CI 5.8&#x02013;224.4, p<italic>&#x0003c;</italic>0.001) among all subtypes, while subtype 4 was independently associated with a high risk for cirrhosis (HR 14.0; 95% CI 1.9&#x02013;105.6, p=0.01) and the highest risk for HCC (HR 28.0; 95% CI 4.8&#x02013;164.8, p<italic>&#x0003c;</italic>0.001). Subtype 2 was also independently associated with an elevated risk of death (HR 3.7; 95% CI 2.4&#x02013;5.6, p<italic>&#x0003c;</italic>0.001), MI (HR 4.7; 95% CI 1.8&#x02013;12.1, p<italic>&#x0003c;</italic>0.001) and CKD (HR 2.5; 95% CI 1.6&#x02013;3.7, p<italic>&#x0003c;</italic>0.001). Subtype 2 was the only subtype other than subtype 5 to be independently associated with MI and CKD.</p></sec><sec id="S19"><label>3.4.</label><title>Internal cross-validation of the subtypes discovered</title><p id="P48">Formal validation of the results is inherently complicated for unsupervised clustering, where no &#x0201c;true label&#x0201d; exists for any patient. In order to assess the robustness of our results, we performed internal cross-validation on our dataset, as we had no access to EMR data from other medical centers. We randomly selected 90% of the samples, ran the clustering process on this subset, and repeated the procedure 10 times. We identified enriched clinical features and disease comorbidities similar to those of the subtypes discovered previously. 
The full results are reported in <xref rid="SD1" ref-type="supplementary-material">supplementary table 1</xref>, hosted at <ext-link ext-link-type="uri" xlink:href="https://github.com/mv50/psb20_mat">https://github.com/mv50/psb20_mat</ext-link>.</p></sec></sec><sec id="S20"><label>4.</label><title>Conclusion</title><p id="P49">In this study, we combined two existing signatures of NAFLD and used them to gather a cohort of 13,290 patients with confirmed NAFLD. We used unsupervised clustering to identify five subtypes of patients. These subtypes had different clinical characteristics and different outcomes: the two largest groups had fewer comorbidities and better outcomes, while a minority of the cohort (in the three smaller subtypes) had more serious comorbidities and worse outcomes. To our knowledge, this study is the first to use an artificial intelligence approach to delineate clinically relevant subtypes of NAFLD.</p><p id="P50">Our findings are consistent with prior studies reporting higher rates of NAFLD among Hispanic patients.<sup><xref rid="R14" ref-type="bibr">14</xref></sup> In addition, the subtypes reveal that Hispanic patients with NAFLD are on a continuum of risk, with some exhibiting the metabolic syndrome but having good outcomes (subtype 1), others experiencing predominantly non-liver adverse outcomes (subtype 2), and some with severe liver disease and at risk for multiple adverse outcomes (subtype 5).</p><p id="P51">Our study of heterogeneity among NAFLD patients was strengthened by the diverse patient population within Mount Sinai&#x02019;s catchment area and the comprehensive use of EMR data. We gathered data from various sources to build the features: vital signs, diagnoses, procedures, prescriptions, laboratory results, and radiology and pathology reports. Our approach is generalizable and could be applied by local or regional healthcare systems to define disease subtypes within their own patient populations. 
Such efforts could help guide resource allocation at the local level, in contrast to national or international guidelines, which may not be relevant to all localities and patient populations.</p><p id="P52">The limitations of our study are common to EMR-based projects. ICD codes are prone to miscoding and may not accurately represent a patient&#x02019;s medical condition. We used phecodes to map ICD codes to higher-level disease concepts in order to improve statistical power and to consolidate multiple related ICD codes. The pre-processing and cleaning of the data remain open to improvement. Additionally, more systematic incorporation of data from unstructured clinical notes could bring valuable new information.</p><p id="P53">In conclusion, we defined an EMR-based algorithm for identifying NAFLD patients and showed that unsupervised clustering can be used to identify clinically relevant disease subtypes with distinct patterns of adverse outcomes. If prospectively validated, these disease subtypes could help guide patient management and screening initiatives.</p></sec><sec sec-type="supplementary-material" id="SM1"><title>Supplementary Material</title><supplementary-material content-type="local-data" id="SD1"><label>1</label><media xlink:href="NIHMS1061138-supplement-1.pdf" orientation="portrait" id="d36e650" position="anchor"/></supplementary-material></sec></body><back><ref-list><label>5.</label><title>References</title><ref id="R1"><label>1.</label><mixed-citation publication-type="journal"><name><surname>Younossi</surname><given-names>ZM</given-names></name>, <name><surname>Koenig</surname><given-names>AB</given-names></name>, <name><surname>Abdelatif</surname><given-names>D</given-names></name>, <name><surname>Fazel</surname><given-names>Y</given-names></name>, <name><surname>Henry</surname><given-names>L</given-names></name> and <name><surname>Wymer</surname><given-names>M</given-names></name>, <article-title>Global epidemiology of nonalcoholic fatty 
liver disease&#x02014;meta-analytic assessment of prevalence, incidence, and outcomes</article-title>, <source>Hepatology</source>
<volume>64</volume>, <fpage>73</fpage> (<year>2016</year>).<pub-id pub-id-type="pmid">26707365</pub-id></mixed-citation></ref><ref id="R2"><label>2.</label><mixed-citation publication-type="journal"><name><surname>Goldberg</surname><given-names>D</given-names></name>, <name><surname>Ditah</surname><given-names>IC</given-names></name>, <name><surname>Saeian</surname><given-names>K</given-names></name>, <name><surname>Lalehzari</surname><given-names>M</given-names></name>, <name><surname>Aronsohn</surname><given-names>A</given-names></name>, <name><surname>Gorospe</surname><given-names>EC</given-names></name> and <name><surname>Charlton</surname><given-names>M</given-names></name>, <article-title>Changes in the prevalence of hepatitis C virus infection, nonalcoholic steatohepatitis, and alcoholic liver disease among patients with cirrhosis or liver failure on the waitlist for liver transplantation</article-title>, <source>Gastroenterology</source>
<volume>152</volume>, <fpage>1090</fpage> (<year>2017</year>).<pub-id pub-id-type="pmid">28088461</pub-id></mixed-citation></ref><ref id="R3"><label>3.</label><mixed-citation publication-type="journal"><name><surname>Wong</surname><given-names>RJ</given-names></name>, <name><surname>Aguilar</surname><given-names>M</given-names></name>, <name><surname>Cheung</surname><given-names>R</given-names></name>, <name><surname>Perumpail</surname><given-names>RB</given-names></name>, <name><surname>Harrison</surname><given-names>SA</given-names></name>, <name><surname>Younossi</surname><given-names>ZM</given-names></name> and <name><surname>Ahmed</surname><given-names>A</given-names></name>, <article-title>Nonalcoholic steatohepatitis is the second leading etiology of liver disease among adults awaiting liver transplantation in the United States</article-title>, <source>Gastroenterology</source>
<volume>148</volume>, <fpage>547</fpage> (<year>2015</year>).<pub-id pub-id-type="pmid">25461851</pub-id></mixed-citation></ref><ref id="R4"><label>4.</label><mixed-citation publication-type="journal"><name><surname>Estes</surname><given-names>C</given-names></name>, <name><surname>Razavi</surname><given-names>H</given-names></name>, <name><surname>Loomba</surname><given-names>R</given-names></name>, <name><surname>Younossi</surname><given-names>Z</given-names></name> and <name><surname>Sanyal</surname><given-names>AJ</given-names></name>, <article-title>Modeling the epidemic of nonalcoholic fatty liver disease demonstrates an exponential increase in burden of disease</article-title>, <source>Hepatology</source>
<volume>67</volume>, <fpage>123</fpage> (<year>2018</year>).<pub-id pub-id-type="pmid">28802062</pub-id></mixed-citation></ref><ref id="R5"><label>5.</label><mixed-citation publication-type="journal"><name><surname>Motamed</surname><given-names>N</given-names></name>, <name><surname>Rabiee</surname><given-names>B</given-names></name>, <name><surname>Poustchi</surname><given-names>H</given-names></name>, <name><surname>Dehestani</surname><given-names>B</given-names></name>, <name><surname>Hemasi</surname><given-names>GR</given-names></name>, <name><surname>Khonsari</surname><given-names>MR</given-names></name>, <name><surname>Maadi</surname><given-names>M</given-names></name>, <name><surname>Saeedian</surname><given-names>FS</given-names></name> and <name><surname>Zamani</surname><given-names>F</given-names></name>, <article-title>Non-alcoholic fatty liver disease (NAFLD) and 10-year risk of cardiovascular diseases</article-title>, <source>Clinics and research in hepatology and gastroenterology</source>
<volume>41</volume>, <fpage>31</fpage> (<year>2017</year>).<pub-id pub-id-type="pmid">27597641</pub-id></mixed-citation></ref><ref id="R6"><label>6.</label><mixed-citation publication-type="journal"><name><surname>Wu</surname><given-names>S</given-names></name>, <name><surname>Wu</surname><given-names>F</given-names></name>, <name><surname>Ding</surname><given-names>Y</given-names></name>, <name><surname>Hou</surname><given-names>J</given-names></name>, <name><surname>Bi</surname><given-names>J</given-names></name> and <name><surname>Zhang</surname><given-names>Z</given-names></name>, <article-title>Association of non-alcoholic fatty liver disease with major adverse cardiovascular events: a systematic review and meta-analysis</article-title>, <source>Scientific reports</source>
<volume>6</volume>, p. <comment>33386</comment> (<year>2016</year>).</mixed-citation></ref><ref id="R7"><label>7.</label><mixed-citation publication-type="journal"><name><surname>Musso</surname><given-names>G</given-names></name>, <name><surname>Gambino</surname><given-names>R</given-names></name>, <name><surname>Tabibian</surname><given-names>JH</given-names></name>, <name><surname>Ekstedt</surname><given-names>M</given-names></name>, <name><surname>Kechagias</surname><given-names>S</given-names></name>, <name><surname>Hamaguchi</surname><given-names>M</given-names></name>, <name><surname>Hultcrantz</surname><given-names>R</given-names></name>, <name><surname>Hagstr&#x000f6;m</surname><given-names>H</given-names></name>, <name><surname>Yoon</surname><given-names>SK</given-names></name>, <name><surname>Charatcharoenwitthaya</surname><given-names>P</given-names></name>
<etal/>, <article-title>Association of non-alcoholic fatty liver disease with chronic kidney disease: a systematic review and meta-analysis</article-title>, <source>PLoS medicine</source>
<volume>11</volume>, p. <comment>e1001680</comment> (<year>2014</year>).</mixed-citation></ref><ref id="R8"><label>8.</label><mixed-citation publication-type="journal"><name><surname>Adams</surname><given-names>LA</given-names></name>, <name><surname>Lymp</surname><given-names>JF</given-names></name>, <name><surname>Sauver</surname><given-names>JS</given-names></name>, <name><surname>Sanderson</surname><given-names>SO</given-names></name>, <name><surname>Lindor</surname><given-names>KD</given-names></name>, <name><surname>Feldstein</surname><given-names>A</given-names></name> and <name><surname>Angulo</surname><given-names>P</given-names></name>, <article-title>The natural history of nonalcoholic fatty liver disease: a population-based cohort study</article-title>, <source>Gastroenterology</source>
<volume>129</volume>, <fpage>113</fpage> (<year>2005</year>).<pub-id pub-id-type="pmid">16012941</pub-id></mixed-citation></ref><ref id="R9"><label>9.</label><mixed-citation publication-type="journal"><name><surname>Dam-Larsen</surname><given-names>S</given-names></name>, <name><surname>Becker</surname><given-names>U</given-names></name>, <name><surname>Franzmann</surname><given-names>M-B</given-names></name>, <name><surname>Larsen</surname><given-names>K</given-names></name>, <name><surname>Christoffersen</surname><given-names>P</given-names></name> and <name><surname>Bendtsen</surname><given-names>F</given-names></name>, <article-title>Final results of a long-term, clinical follow-up in fatty liver patients</article-title>, <source>Scandinavian journal of gastroenterology</source>
<volume>44</volume>, <fpage>1236</fpage> (<year>2009</year>).<pub-id pub-id-type="pmid">19670076</pub-id></mixed-citation></ref><ref id="R10"><label>10.</label><mixed-citation publication-type="journal"><name><surname>S&#x000f6;derberg</surname><given-names>C</given-names></name>, <name><surname>St&#x000e5;l</surname><given-names>P</given-names></name>, <name><surname>Askling</surname><given-names>J</given-names></name>, <name><surname>Glaumann</surname><given-names>H</given-names></name>, <name><surname>Lindberg</surname><given-names>G</given-names></name>, <name><surname>Marmur</surname><given-names>J</given-names></name> and <name><surname>Hultcrantz</surname><given-names>R</given-names></name>, <article-title>Decreased survival of subjects with elevated liver function tests during a 28-year follow-up</article-title>, <source>Hepatology</source>
<volume>51</volume>, <fpage>595</fpage> (<year>2010</year>).<pub-id pub-id-type="pmid">20014114</pub-id></mixed-citation></ref><ref id="R11"><label>11.</label><mixed-citation publication-type="journal"><name><surname>Dam-Larsen</surname><given-names>S</given-names></name>, <name><surname>Franzmann</surname><given-names>M</given-names></name>, <name><surname>Andersen</surname><given-names>I</given-names></name>, <name><surname>Christoffersen</surname><given-names>P</given-names></name>, <name><surname>Jensen</surname><given-names>L</given-names></name>, <name><surname>S&#x000f8;rensen</surname><given-names>T</given-names></name>, <name><surname>Becker</surname><given-names>U</given-names></name> and <name><surname>Bendtsen</surname><given-names>F</given-names></name>, <article-title>Long term prognosis of fatty liver: risk of chronic liver disease and death</article-title>, <source>Gut</source>
<volume>53</volume>, <fpage>750</fpage> (<year>2004</year>).<pub-id pub-id-type="pmid">15082596</pub-id></mixed-citation></ref><ref id="R12"><label>12.</label><mixed-citation publication-type="journal"><name><surname>Ekstedt</surname><given-names>M</given-names></name>, <name><surname>Franz&#x000e9;n</surname><given-names>LE</given-names></name>, <name><surname>Mathiesen</surname><given-names>UL</given-names></name>, <name><surname>Thorelius</surname><given-names>L</given-names></name>, <name><surname>Holmqvist</surname><given-names>M</given-names></name>, <name><surname>Bodemar</surname><given-names>G</given-names></name> and <name><surname>Kechagias</surname><given-names>S</given-names></name>, <article-title>Long-term follow-up of patients with NAFLD and elevated liver enzymes</article-title>, <source>Hepatology</source>
<volume>44</volume>, <fpage>865</fpage> (<year>2006</year>).<pub-id pub-id-type="pmid">17006923</pub-id></mixed-citation></ref><ref id="R13"><label>13.</label><mixed-citation publication-type="journal"><name><surname>Jun</surname><given-names>TW</given-names></name>, <name><surname>Yeh</surname><given-names>M-L</given-names></name>, <name><surname>Yang</surname><given-names>JD</given-names></name>, <name><surname>Chen</surname><given-names>VL</given-names></name>, <name><surname>Nguyen</surname><given-names>P</given-names></name>, <name><surname>Giama</surname><given-names>NH</given-names></name>, <name><surname>Huang</surname><given-names>C-F</given-names></name>, <name><surname>Hsing</surname><given-names>AW</given-names></name>, <name><surname>Dai</surname><given-names>C-Y</given-names></name>, <name><surname>Huang</surname><given-names>J-F</given-names></name>
<etal/>, <article-title>More advanced disease and worse survival in cryptogenic compared to viral hepatocellular carcinoma</article-title>, <source>Liver International</source>
<volume>38</volume>, <fpage>895</fpage> (<year>2018</year>).<pub-id pub-id-type="pmid">29045023</pub-id></mixed-citation></ref><ref id="R14"><label>14.</label><mixed-citation publication-type="journal"><name><surname>Browning</surname><given-names>JD</given-names></name>, <name><surname>Szczepaniak</surname><given-names>LS</given-names></name>, <name><surname>Dobbins</surname><given-names>R</given-names></name>, <name><surname>Nuremberg</surname><given-names>P</given-names></name>, <name><surname>Horton</surname><given-names>JD</given-names></name>, <name><surname>Cohen</surname><given-names>JC</given-names></name>, <name><surname>Grundy</surname><given-names>SM</given-names></name> and <name><surname>Hobbs</surname><given-names>HH</given-names></name>, <article-title>Prevalence of hepatic steatosis in an urban population in the United States: impact of ethnicity</article-title>, <source>Hepatology</source>
<volume>40</volume>, <fpage>1387</fpage> (<month>12</month>
<year>2004</year>).<pub-id pub-id-type="pmid">15565570</pub-id></mixed-citation></ref><ref id="R15"><label>15.</label><mixed-citation publication-type="journal"><name><surname>Romeo</surname><given-names>S</given-names></name>, <name><surname>Kozlitina</surname><given-names>J</given-names></name>, <name><surname>Xing</surname><given-names>C</given-names></name>, <name><surname>Pertsemlidis</surname><given-names>A</given-names></name>, <name><surname>Cox</surname><given-names>D</given-names></name>, <name><surname>Pennacchio</surname><given-names>LA</given-names></name>, <name><surname>Boerwinkle</surname><given-names>E</given-names></name>, <name><surname>Cohen</surname><given-names>JC</given-names></name> and <name><surname>Hobbs</surname><given-names>HH</given-names></name>, <article-title>Genetic variation in PNPLA3 confers susceptibility to nonalcoholic fatty liver disease</article-title>, <source>Nature genetics</source>
<volume>40</volume>, p. <fpage>1461</fpage> (<year>2008</year>).<pub-id pub-id-type="pmid">18820647</pub-id></mixed-citation></ref><ref id="R16"><label>16.</label><mixed-citation publication-type="journal"><name><surname>Kanwal</surname><given-names>F</given-names></name>, <name><surname>Kramer</surname><given-names>JR</given-names></name>, <name><surname>Mapakshi</surname><given-names>S</given-names></name>, <name><surname>Natarajan</surname><given-names>Y</given-names></name>, <name><surname>Chayanupatkul</surname><given-names>M</given-names></name>, <name><surname>Richardson</surname><given-names>PA</given-names></name>, <name><surname>Li</surname><given-names>L</given-names></name>, <name><surname>Desiderio</surname><given-names>R</given-names></name>, <name><surname>Thrift</surname><given-names>AP</given-names></name>, <name><surname>Asch</surname><given-names>SM</given-names></name>
<etal/>, <article-title>Risk of hepatocellular cancer in patients with non-alcoholic fatty liver disease</article-title>, <source>Gastroenterology</source>
<volume>155</volume>, <fpage>1828</fpage> (<year>2018</year>).<pub-id pub-id-type="pmid">30144434</pub-id></mixed-citation></ref><ref id="R17"><label>17.</label><mixed-citation publication-type="journal"><name><surname>Kirby</surname><given-names>JC</given-names></name>, <name><surname>Speltz</surname><given-names>P</given-names></name>, <name><surname>Rasmussen</surname><given-names>LV</given-names></name>, <name><surname>Basford</surname><given-names>M</given-names></name>, <name><surname>Gottesman</surname><given-names>O</given-names></name>, <name><surname>Peissig</surname><given-names>PL</given-names></name>, <name><surname>Pacheco</surname><given-names>JA</given-names></name>, <name><surname>Tromp</surname><given-names>G</given-names></name>, <name><surname>Pathak</surname><given-names>J</given-names></name>, <name><surname>Carrell</surname><given-names>DS</given-names></name>
<etal/>, <article-title>PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability</article-title>, <source>Journal of the American Medical Informatics Association</source>
<volume>23</volume>, <fpage>1046</fpage> (<year>2016</year>).<pub-id pub-id-type="pmid">27026615</pub-id></mixed-citation></ref><ref id="R18"><label>18.</label><mixed-citation publication-type="journal"><name><surname>Hall</surname><given-names>MA</given-names></name>, <source>Correlation-based feature selection for machine learning</source> (<year>1999</year>).</mixed-citation></ref><ref id="R19"><label>19.</label><mixed-citation publication-type="journal"><name><surname>Denny</surname><given-names>JC</given-names></name>, <name><surname>Bastarache</surname><given-names>L</given-names></name>, <name><surname>Ritchie</surname><given-names>MD</given-names></name>, <name><surname>Carroll</surname><given-names>RJ</given-names></name>, <name><surname>Zink</surname><given-names>R</given-names></name>, <name><surname>Mosley</surname><given-names>JD</given-names></name>, <name><surname>Field</surname><given-names>JR</given-names></name>, <name><surname>Pulley</surname><given-names>JM</given-names></name>, <name><surname>Ramirez</surname><given-names>AH</given-names></name>, <name><surname>Bowton</surname><given-names>E</given-names></name>
<etal/>, <article-title>Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data</article-title>, <source>Nature biotechnology</source>
<volume>31</volume>, p. <fpage>1102</fpage> (<year>2013</year>).</mixed-citation></ref><ref id="R20"><label>20.</label><mixed-citation publication-type="journal"><name><surname>Li</surname><given-names>L</given-names></name>, <name><surname>Cheng</surname><given-names>WY</given-names></name>, <name><surname>Glicksberg</surname><given-names>BS</given-names></name>, <name><surname>Gottesman</surname><given-names>O</given-names></name>, <name><surname>Tamler</surname><given-names>R</given-names></name>, <name><surname>Chen</surname><given-names>R</given-names></name>, <name><surname>Bottinger</surname><given-names>EP</given-names></name> and <name><surname>Dudley</surname><given-names>JT</given-names></name>, <article-title>Identification of type 2 diabetes subgroups through topological analysis of patient similarity</article-title>, <source>Sci Transl Med</source>
<volume>7</volume>, p. <comment>311ra174</comment> (<month>10</month>
<year>2015</year>).</mixed-citation></ref><ref id="R21"><label>21.</label><mixed-citation publication-type="journal"><name><surname>Estivill-Castro</surname><given-names>V</given-names></name>, <article-title>Why so many clustering algorithms: a position paper</article-title>, <source>SIGKDD explorations</source>
<volume>4</volume>, <fpage>65</fpage> (<year>2002</year>).</mixed-citation></ref><ref id="R22"><label>22.</label><mixed-citation publication-type="journal"><name><surname>Pfitzner</surname><given-names>D</given-names></name>, <name><surname>Leibbrandt</surname><given-names>R</given-names></name> and <name><surname>Powers</surname><given-names>D</given-names></name>, <article-title>Characterization and evaluation of similarity measures for pairs of clusterings</article-title>, <source>Knowledge and Information Systems</source>
<volume>19</volume>, p. <fpage>361</fpage> (<year>2009</year>).</mixed-citation></ref><ref id="R23"><label>23.</label><mixed-citation publication-type="journal"><name><surname>Doshi-Velez</surname><given-names>F</given-names></name>, <name><surname>Ge</surname><given-names>Y</given-names></name> and <name><surname>Kohane</surname><given-names>I</given-names></name>, <article-title>Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis</article-title>, <source>Pediatrics</source>
<volume>133</volume>, <fpage>e54</fpage> (<year>2014</year>).<pub-id pub-id-type="pmid">24323995</pub-id></mixed-citation></ref><ref id="R24"><label>24.</label><mixed-citation publication-type="journal"><name><surname>Murtagh</surname><given-names>F</given-names></name> and <name><surname>Legendre</surname><given-names>P</given-names></name>, <article-title>Ward&#x02019;s hierarchical agglomerative clustering method: which algorithms implement Ward&#x02019;s criterion?</article-title>, <source>Journal of classification</source>
<volume>31</volume>, <fpage>274</fpage> (<year>2014</year>).</mixed-citation></ref><ref id="R25"><label>25.</label><mixed-citation publication-type="book"><name><surname>Wickham</surname><given-names>H</given-names></name>, <source>ggplot2: Elegant Graphics for Data Analysis</source> (<publisher-name>Springer-Verlag</publisher-name>
<publisher-loc>New York</publisher-loc>, <year>2016</year>).</mixed-citation></ref><ref id="R26"><label>26.</label><mixed-citation publication-type="journal"><name><surname>Kassambara</surname><given-names>A</given-names></name> and <name><surname>Kosinski</surname><given-names>M</given-names></name>, <source>survminer: Drawing Survival Curves using &#x02018;ggplot2&#x02019;</source>, (<year>2019</year>). <comment>R package version 0.4.4.</comment></mixed-citation></ref><ref id="R27"><label>27.</label><mixed-citation publication-type="book"><name><surname>Therneau</surname><given-names>TM</given-names></name> and <name><surname>Grambsch</surname><given-names>PM</given-names></name>, <source>Modeling Survival Data: Extending the Cox Model</source> (<publisher-name>Springer</publisher-name>, <publisher-loc>New York</publisher-loc>, <year>2000</year>).</mixed-citation></ref><ref id="R28"><label>28.</label><mixed-citation publication-type="journal"><name><surname>Gray</surname><given-names>B</given-names></name>, <source>cmprsk: Subdistribution Analysis of Competing Risks</source>, (<year>2019</year>). <comment>R package version 2.2-8.</comment></mixed-citation></ref></ref-list></back><floats-group><fig id="F1" orientation="portrait" position="float"><label>Fig. 1.</label><caption><p id="P54">Survival and hazard curves for outcomes of interest, by 5 subtypes. (A) Overall survival, (B) Chronic kidney disease, (C) Cirrhosis, (D) Hepatocellular carcinoma, (E) Cardiovascular disease, (F) Myocardial infarction.</p></caption><graphic xlink:href="nihms-1061138-f0001"/></fig><fig id="F2" orientation="portrait" position="float"><label>Fig. 2.</label><caption><p id="P55">Univariate hazard ratios for outcomes of interest, by 5 subtypes.</p></caption><graphic xlink:href="nihms-1061138-f0002"/></fig><fig id="F3" orientation="portrait" position="float"><label>Fig. 3.</label><caption><p id="P56">Multivariate analyses for outcomes of interest. 
Darker shades of red indicate increased risk of the outcome, while darker shades of green indicate reduced risk of the outcome. Only hazard ratios with p<italic>&#x0003c;</italic>0.05 are color coded; non-significant findings are shown in grey.</p></caption><graphic xlink:href="nihms-1061138-f0003"/></fig><table-wrap id="T1" position="float" orientation="portrait"><label>Table 1.</label><caption><p id="P57">Baseline characteristics, selected features of interest, and outcomes by subtype</p></caption><table frame="void" rules="none"><colgroup span="1"><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/><col align="left" valign="middle" span="1"/></colgroup><tbody><tr><td colspan="7" align="left" valign="top" rowspan="1"><graphic xlink:href="nihms-1061138-t0004"/></td></tr></tbody></table></table-wrap></floats-group></article>