A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification
Supporting Files
-
6 2022
-
File Language:
English
Details
-
Alternative Title:IEEE J Biomed Health Inform
-
Personal Author:
-
Description:Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominates. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
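The description says that each training batch yields a loss with contributions from both raw data and keywords. A minimal sketch of that idea follows, under stated assumptions: the names `mean_cross_entropy`, `combined_loss`, and `uniform_model` are hypothetical, cross-entropy is assumed as the per-example loss, and the `keyword_weight` factor is an illustrative assumption, since the description does not say how the two contributions are balanced. It is not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[str], int]             # (tokens, class index)
Model = Callable[[Sequence[str]], List[float]]  # tokens -> class probabilities


def mean_cross_entropy(model: Model, batch: Sequence[Example]) -> float:
    """Average negative log-likelihood of the true class over a batch."""
    return sum(-math.log(model(tokens)[label]) for tokens, label in batch) / len(batch)


def combined_loss(
    model: Model,
    doc_batch: Sequence[Example],
    keyword_batch: Sequence[Example],
    keyword_weight: float = 1.0,  # assumption: not specified in the description
) -> float:
    """Loss with one term from raw documents and one from class keywords."""
    doc_term = mean_cross_entropy(model, doc_batch)
    keyword_term = mean_cross_entropy(model, keyword_batch)
    return doc_term + keyword_weight * keyword_term


def uniform_model(tokens: Sequence[str]) -> List[float]:
    """Placeholder two-class model that ignores its input (illustration only)."""
    return [0.5, 0.5]
```

In a real training loop, the value returned by `combined_loss` would be the quantity minimized at each step, so that keywords for a rare class contribute a gradient signal even when few documents of that class appear in the batch.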
-
Subjects:
-
Keywords:
-
Source:IEEE J Biomed Health Inform. 26(6):2796-2803
-
Pubmed ID:35020599
-
Pubmed Central ID:PMC9533247
-
Document Type:
-
Funding:HHSN261201800013C/CA/NCI NIH HHS United States/ ; HHSN261201800016I/CA/NCI NIH HHS United States/ ; HHSN261201800014I/CA/NCI NIH HHS United States/ ; U58 DP003907/DP/NCCDPHP CDC HHS United States/ ; HHSN261201800013I/CA/NCI NIH HHS United States/ ; HHSN261201800007C/CA/NCI NIH HHS United States/ ; HHSN261201800014C/CA/NCI NIH HHS United States/ ; HHSN261201800016C/CA/NCI NIH HHS United States/ ; P30 CA177558/CA/NCI NIH HHS United States/ ; HHSN261201300021C/CA/NCI NIH HHS United States/
-
Volume:26
-
Issue:6
-
Collection(s):
-
Main Document Checksum:urn:sha-512:687a780b2beefd69e4d9db76a24efb66c3c10bd6dfd2443615484f492046ff7963664d43f0ee01b81055a1466fcf2ef1283f2e0be88dbfa9134dd1608be80f03
-
Download URL:
-
File Type:
CDC STACKS serves as an archival repository of CDC-published products including scientific findings, journal articles, guidelines, recommendations, or other public health information authored or co-authored by CDC or funded partners.
As a repository, CDC STACKS retains documents in their original published format to ensure public access to scientific information.