Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease

South, Brett R; Shen, Shuying; Jones, Makoto; Garvin, Jennifer; Samore, Matthew H; Chapman, Wendy W; Gundlapalli, Adi V

doi:10.1186/1471-2105-10-S9-S12

i

Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease

Supporting Files

Sep 17 2009
By South, Brett R ; Shen, Shuying ; Jones, Makoto ; ...

Details

Alternative Title:

BMC Bioinformatics
Personal Author:

South, Brett R ; Shen, Shuying ; Jones, Makoto ; Garvin, Jennifer ; Samore, Matthew H ; Chapman, Wendy W ; Gundlapalli, Adi V
Description:

Background

Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free text clinical notes. A key required step for IE is the manual annotation of clinical corpora and the creation of a reference standard for (1) training and validation tasks and (2) to focus and clarify NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers.

Methods

Using a set of clinical documents from the VA EMR for a particular use case of interest we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types.

Results

Annotator agreement at the document level for documents that contained concepts of interest for IBD using estimated Kappa statistic (95% CI) was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield.

Conclusion

Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use case specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents.
Subjects:

Computational Biology Humans Inflammatory Bowel Diseases Phenotype Proceedings Vocabulary, Controlled
Source:

BMC Bioinformatics. 2009; 10(Suppl 9):S12.
Pubmed ID:

19761566
Pubmed Central ID:

PMC2745683
Document Type:

Journal Article
Funding:

1 PO1 CD000284-01/CD/ODCDC CDC HHS/United States
Volume:

10
Collection(s):

CDC Public Access
Main Document Checksum:

urn:sha256:855fc95e8adb8b77bbfaca9dc54a1b8dcb6f30540b32a261be87d234519b7963
Download URL:

https://stacks.cdc.gov/view/cdc/22940/cdc_22940_DS1.pdf
File Type:

[PDF - 783.26 KB ]

1471-2105-10-S9-S12-5.gif

Download gif
1471-2105-10-S9-S12-5.jpg

Download jpeg
1471-2105-10-S9-S12-6.gif

Download gif
1471-2105-10-S9-S12-6.jpg

Download jpeg
1471-2105-10-S9-S12-7.gif

Download gif
1471-2105-10-S9-S12-7.jpg

Download jpeg
1471-2105-10-S9-S12.nxml

Download txt
license.txt

Download txt
1471-2105-10-S9-S12-1.gif

Download gif
1471-2105-10-S9-S12-1.jpg

Download jpeg
1471-2105-10-S9-S12-2.gif

Download gif
1471-2105-10-S9-S12-2.jpg

Download jpeg
1471-2105-10-S9-S12-3.gif

Download gif
1471-2105-10-S9-S12-3.jpg

Download jpeg
1471-2105-10-S9-S12-4.gif

Download gif
1471-2105-10-S9-S12-4.jpg

Download jpeg

ON THIS PAGE

Details Supporting Files

CDC STACKS serves as an archival repository of CDC-published products including scientific findings, journal articles, guidelines, recommendations, or other public health information authored or co-authored by CDC or funded partners.

As a repository, CDC STACKS retains documents in their original published format to ensure public access to scientific information.