<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="abstract"><?properties open_access?><front><journal-meta><journal-id journal-id-type="nlm-ta">Online J Public Health Inform</journal-id><journal-id journal-id-type="iso-abbrev">Online J Public Health Inform</journal-id><journal-id journal-id-type="publisher-id">OJPHI</journal-id><journal-title-group><journal-title>Online Journal of Public Health Informatics</journal-title></journal-title-group><issn pub-type="epub">1947-2579</issn><publisher><publisher-name>University of Illinois at Chicago Library</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="pmc">6088011</article-id><article-id pub-id-type="publisher-id">ojphi-10-e11</article-id><article-id pub-id-type="doi">10.5210/ojphi.v10i1.8326</article-id><article-categories><subj-group subj-group-type="heading"><subject>ISDS 2018 Conference Abstracts</subject></subj-group></article-categories><title-group><article-title>Exploring the Value of Learned Representations for Automated
Syndromic Definitions</article-title></title-group><contrib-group><contrib contrib-type="author"><name><surname>Lee</surname><given-names>Scott</given-names></name><xref ref-type="corresp" rid="cor1">*</xref><xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref></contrib><contrib contrib-type="author"><name><surname>Levin</surname><given-names>Drew</given-names></name><xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref></contrib><contrib contrib-type="author"><name><surname>Thomas</surname><given-names>Jason</given-names></name><xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref></contrib><contrib contrib-type="author"><name><surname>Finley</surname><given-names>Patrick</given-names></name><xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref></contrib><contrib contrib-type="author"><name><surname>Heilig</surname><given-names>Charles</given-names></name><xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref></contrib><aff id="aff1"><label>1</label>Centers for Disease Control and Prevention, Decatur,
GA, <country>USA</country>;</aff><aff id="aff2"><label>2</label>Sandia National Laboratories, <addr-line>Albuquerque, NM</addr-line>, <country>USA</country></aff></contrib-group><author-notes><corresp id="cor1"><label>*</label>Scott Lee E-mail: <email xlink:href="yle4@cdc.gov">yle4@cdc.gov</email></corresp></author-notes><pub-date pub-type="epub"><day>30</day><month>5</month><year>2018</year></pub-date><pub-date pub-type="collection"><year>2018</year></pub-date><volume>10</volume><issue>1</issue><elocation-id>e11</elocation-id><permissions><license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by-nc/3.0/"><license-p>ISDS Annual Conference Proceedings 2018. This is an Open Access
article distributed under the terms of the Creative Commons
Attribution-Noncommercial 3.0 Unported License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0/">http://creativecommons.org/licenses/by-nc/3.0/</ext-link>), permitting all
non-commercial use, distribution, and reproduction in any medium, provided the
original work is properly cited.</license-p></license></permissions><kwd-group kwd-group-type="author"><title>Keywords </title><kwd>Word embeddings</kwd><kwd>Deep learning</kwd><kwd>Syndrome definitions</kwd><kwd>ICD codes</kwd></kwd-group></article-meta></front><body><sec><title>Objective</title><p>To better define and automate biosurveillance syndrome categorization using modern
unsupervised vector embedding techniques.</p></sec><sec sec-type="intro"><title>Introduction</title><p>Comprehensive medical syndrome definitions are critical for outbreak investigation,
disease trend monitoring, and public health surveillance. However, because current
definitions are based on keyword string-matching, they may miss important
distributional information in free text and medical codes that could be used to
build a more general classifier. Here, we explore the idea that individual ICD codes
can be categorized by examining their contextual relationships across all other ICD
codes. We extend previous work in representation learning with medical data [1] by
generating dense vector embeddings of the ICD codes found in emergency department
(ED) visit records. The resulting representations capture information about disease
co-occurrence that would typically require subject matter expert (SME) involvement, and they support the
development of more robust syndrome definitions.</p></sec><sec sec-type="methods"><title>Methods</title><p>We evaluate our method on anonymized ED visit records obtained from the New York City
Department of Health and Mental Hygiene. The data set consists of approximately 3
million records spanning January 2016 to December 2016, each containing one to
ten ICD-9 or ICD-10 codes. We use these data to embed each ICD code into a
high-dimensional vector space following the techniques described by Mikolov et al. [2],
colloquially known as word2vec. We define an individual code&#x02019;s context window
as the entire visit record in which it appears. Final vector embeddings are generated
using the gensim machine learning library in Python. We generate 300-dimensional
embeddings using a skip-gram network for qualitative evaluation. We use the
TensorFlow Embedding Projector to visualize the resulting embedding space. We
generate a three-dimensional t-SNE visualization with a perplexity of 32 and a
learning rate of 10, run for 1,000 iterations (Figure 1). Finally, we use cosine
distance to identify the nearest neighbors of common ICD-10 codes and thereby evaluate the
consistency of the generated vector embeddings (Table 1).</p></sec><sec sec-type="results"><title>Results</title><p>The t-SNE visualization of the generated vector embeddings supports our hypothesis that
ICD codes can be contextually grouped into distinct syndrome clusters (Figure 1).
Manual examination of the resulting embeddings confirms consistency across codes
from the same top-level category and also reveals cross-category relationships that
a strictly hierarchical analysis would miss (Table 1). For example, the method
not only discovers the close relationship between influenza
codes J10.1 and A49.2 but also reveals a link between asthma code J45.20 and obesity
code E66.09. We believe these learned relationships will be useful for both refining
existing syndrome categories and developing new ones.</p></sec><sec sec-type="conclusions"><title>Conclusions</title><p>The embedding structure supports the hypothesis of distinct syndrome clusters, and
nearest-neighbor results expose relationships between codes from different categories
that appear appropriate upon examination. The method runs automatically, without the need for
SME analysis, and provides an objective, data-driven baseline for the development
and refinement of syndrome definitions.</p><fig id="fa" fig-type="figure" orientation="portrait" position="float"><caption><p>Table 1</p></caption><graphic xlink:href="ojphi-10-e11-g001"/></fig><fig id="f1" fig-type="figure" orientation="portrait" position="float"><label>Figure 1</label><caption><p>t-SNE visualization of 300-dimensional skip-gram embeddings of ICD codes.
The heterogeneous structure suggests distinct syndrome clusters. Image
generated using Google&#x02019;s online TensorFlow Embedding Projector.</p></caption><graphic xlink:href="ojphi-10-e11-g002"/></fig></sec></body><back><ack><title>Acknowledgments</title><p>This work was supported by Laboratory Directed Research and Development funding from
Sandia National Laboratories. Sandia National Laboratories is a multimission
laboratory managed and operated by National Technology and Engineering Solutions of
Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S.
Department of Energy&#x02019;s National Nuclear Security Administration under
contract DE-NA0003525.</p></ack><ref-list><title>References</title><ref id="r1"><mixed-citation publication-type="other">Choi Y, Chiu CY-I, Sontag D. Learning
Low-Dimensional Representations of Medical Concepts. AMIA Summits on
Translational Science Proceedings. 2016;2016:41-50.</mixed-citation></ref><ref id="r2"><mixed-citation publication-type="book">Mikolov T, Sutskever I, Chen K, Corrado
GS, Dean J. Distributed representations of words and phrases and their
compositionality. In: Advances in Neural Information Processing Systems. 2013. p.
3111-3119.</mixed-citation></ref></ref-list></back></article>