Optimal Vocabulary Selection Approaches for Privacy-Preserving Deep NLP Model Training for Information Extraction and Cancer Epidemiology
2022
-
File Language:
English
Details
-
Alternative Title:Cancer Biomark
-
Description:BACKGROUND:
As artificial intelligence and machine learning techniques are adopted in biomedical informatics, the security and privacy of the data and of subject identities have become an important concern and an essential research topic. Without intentional safeguards, machine learning models may improve task performance by finding patterns and features that are associated with private personal information.
OBJECTIVE:
The privacy vulnerability of deep learning models for information extraction from medical text needs to be quantified, since the models are exposed to private health information and personally identifiable information. The objective of this study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients’ information to mitigate confidentiality breaches.
METHODS:
The target model is a multi-task convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple participating state cancer registries. This study proposes two schemes for collecting vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments.
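The two vocabulary schemes named above can be illustrated with a minimal Python sketch. This is not the study's implementation: the abstract does not specify the tokenization, the registry threshold, or how mutual information was computed, so the whitespace tokenizer, the `min_registries` parameter, the function names, and the toy reports below are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def words_in_multiple_registries(docs_by_registry, min_registries=2):
    """Scheme (a): keep words seen in reports from at least `min_registries` registries."""
    registry_count = Counter()
    for docs in docs_by_registry.values():
        seen = set()
        for doc in docs:
            seen.update(doc.lower().split())  # assumed whitespace tokenization
        registry_count.update(seen)           # each registry counts a word once
    return {w for w, n in registry_count.items() if n >= min_registries}


def mutual_information(docs, labels):
    """Scheme (b): MI in nats between word presence (binary) and the task label."""
    n = len(docs)
    label_count = Counter(labels)
    word_label = defaultdict(Counter)  # word -> label -> number of docs containing it
    for doc, y in zip(docs, labels):
        for w in set(doc.lower().split()):
            word_label[w][y] += 1
    scores = {}
    for w, by_label in word_label.items():
        n_w = sum(by_label.values())
        mi = 0.0
        for y, n_y in label_count.items():
            # Sum over both outcomes: word present and word absent.
            for present, joint in ((True, by_label[y]), (False, n_y - by_label[y])):
                if joint == 0:
                    continue  # a zero joint count contributes nothing
                n_x = n_w if present else n - n_w
                mi += (joint / n) * math.log(joint * n / (n_x * n_y))
        scores[w] = mi
    return scores


# Hypothetical toy reports; "smithville" stands in for a registry-specific token.
docs_by_registry = {
    "registry_a": ["invasive ductal carcinoma breast", "benign tissue smithville"],
    "registry_b": ["ductal carcinoma in situ breast", "benign lung tissue"],
}
labels_by_registry = {
    "registry_a": ["malignant", "benign"],
    "registry_b": ["malignant", "benign"],
}

shared = words_in_multiple_registries(docs_by_registry)

all_docs = [d for docs in docs_by_registry.values() for d in docs]
all_labels = [y for ys in labels_by_registry.values() for y in ys]
scores = mutual_information(all_docs, all_labels)
```

In this toy data, a token that occurs in only one registry is dropped by scheme (a), and scheme (b) ranks words that separate the labels above words that do not. Either set could then be used to restrict the model's input vocabulary before training.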
RESULTS:
The comparison suggests that the proposed vocabulary selection methods lower privacy vulnerability while maintaining the same level of clinical task performance.
-
Source:Cancer Biomark. 33(2):185-198
-
Pubmed ID:35213361
-
Pubmed Central ID:PMC9377550
-
Funding: HHSN261201800032C/CA/NCI NIH HHS/United States; HHSN261201800009C/CA/NCI NIH HHS/United States; NU58DP006344/DP/NCCDPHP CDC HHS/United States; HHSN261201800015I/CA/NCI NIH HHS/United States; HHSN261201800013C/CA/NCI NIH HHS/United States; HHSN261201800016I/CA/NCI NIH HHS/United States; HHSN261201800014I/CA/NCI NIH HHS/United States; HHSN261201800032I/CA/NCI NIH HHS/United States; U58 DP003907/DP/NCCDPHP CDC HHS/United States; HHSN261201800015C/CA/NCI NIH HHS/United States; HHSN261201800013I/CA/NCI NIH HHS/United States; HHSN261201800014C/CA/NCI NIH HHS/United States; HHSN261201800016C/CA/NCI NIH HHS/United States; P30 CA177558/CA/NCI NIH HHS/United States; HHSN261201300021C/CA/NCI NIH HHS/United States; HHSN261201800009I/CA/NCI NIH HHS/United States; HHSN261201800007C/CA/NCI NIH HHS/United States
-
Volume:33
-
Issue:2
-
Main Document Checksum:urn:sha-512:8eb2e8b8c97d6527071f79aa24555e23aff1fc73fe2b06adf832eea68859f249e8b6841e3edeb391cbbf53ef23b9e185bc96d67ba439bd25c8029f1c099b6f2c
CDC STACKS serves as an archival repository of CDC-published products including scientific findings, journal articles, guidelines, recommendations, or other public health information authored or co-authored by CDC or funded partners.
As a repository, CDC STACKS retains documents in their original published format to ensure public access to scientific information.