Using Event-Based Web-Scraping Methods and Bidirectional Transformers to Characterize COVID-19 Outbreaks in Food Production and Retail Settings
Public Domain
-
2021/06/15
-
-
Series: Mining Publications
Details
-
Personal Author:
-
Description:Current surveillance methods may not capture the full extent of COVID-19 spread in high-risk settings like food establishments. Thus, we propose a new method for surveillance that identifies COVID-19 cases among food establishment workers from news reports via web-scraping and natural language processing (NLP). First, we used web-scraping to identify a broad set of articles (n = 67,078) related to COVID-19 based on keyword mentions. In this dataset, we used an open-source NLP platform (ClarityNLP) to automatically extract location, industry, and case and death counts. These articles were vetted and validated by CDC subject matter experts (SMEs) to identify those containing COVID-19 outbreaks in food establishments. CDC and Georgia Tech Research Institute SMEs provided a human-labeled test dataset containing 388 articles to validate our algorithms. Then, to improve quality, we fine-tuned a pretrained RoBERTa instance, a bidirectional transformer language model, to classify articles reporting >= 1 positive COVID-19 case in food establishments. The application of RoBERTa decreased the number of articles from 67,078 to 1,112 and classified articles reporting >= 1 positive COVID-19 case in food establishments with 88% accuracy on the human-labeled test dataset. Therefore, by automating the pipeline of web-scraping and COVID-19 case prediction using RoBERTa, we enable an efficient human-in-the-loop process in which COVID-19 data can be manually collected from articles flagged by our model, reducing the human labor required. Furthermore, our approach could be used to predict and monitor locations of COVID-19 development by geography and could also be extended to other industries and news article datasets of interest. [Description provided by NIOSH]
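The first and last steps of the pipeline described above (keyword-based article selection and accuracy scoring against human labels) can be illustrated with a minimal sketch. This is not the authors' code: the keyword list, the sample articles, and the function names are hypothetical, since the record does not state the actual search terms or implementation. The RoBERTa classification step is omitted here.

```python
import re

# Hypothetical keyword list; the record does not state the actual search terms.
COVID_KEYWORDS = ("covid-19", "covid", "coronavirus", "sars-cov-2")
_PATTERN = re.compile("|".join(re.escape(k) for k in COVID_KEYWORDS), re.IGNORECASE)


def mentions_covid(text):
    """Return True if the article text contains any of the keywords."""
    return _PATTERN.search(text) is not None


def filter_articles(articles):
    """Keep only articles whose text mentions a COVID-19 keyword."""
    return [a for a in articles if mentions_covid(a["text"])]


def accuracy(predicted, labeled):
    """Fraction of predictions that match the human labels."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    if not labeled:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == y for p, y in zip(predicted, labeled))
    return correct / len(labeled)


# Made-up sample articles, for illustration only.
articles = [
    {"id": 1, "text": "Ten workers at a meatpacking plant tested positive for COVID-19."},
    {"id": 2, "text": "Local bakery wins regional award."},
    {"id": 3, "text": "Grocery store closes after coronavirus cases among staff."},
]
candidates = filter_articles(articles)
print([a["id"] for a in candidates])  # prints [1, 3]
```

In a fuller pipeline, the filtered candidates would be passed to a fine-tuned classifier, and `accuracy` would compare its predictions with the labels in the human-labeled test set.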
-
Subjects:
-
Keywords:
-
Series:
-
ISBN:9783030772109
-
ISSN:0302-9743
-
Publisher:
-
Document Type:
-
Genre:
-
Place as Subject:
-
CIO:
-
Division:
-
Topic:
-
Location:
-
Pages in Document:187-198
-
Volume:12721
-
NIOSHTIC Number:nn:20063307
-
Citation:Artificial Intelligence in Medicine: 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, June 15-18, 2021, virtual event. Lecture notes in computer science, volume 12721. Tucker A, Henriques Abreu P, Cardoso J, Pereira Rodrigues P, Riaño D eds. Cham, Switzerland: Springer, 2021 Jun; 12721:187-198
-
Contact Point Address:Charity Hilton, Georgia Tech Research Institute, Atlanta, GA 30318, USA
-
Email:Charity.Hilton@gtri.gatech.edu
-
Editor(s):
-
Federal Fiscal Year:2021
-
Peer Reviewed:False
-
Source Full Name:Artificial Intelligence in Medicine: 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, June 15-18, 2021, virtual event. Lecture notes in computer science, volume 12721
-
Collection(s):
-
Main Document Checksum:urn:sha-512:63aaf81a88031c6ecfa52375cd7440dae166bc43efd9a0ceb1c0b7944014750e76adb9c29aea62b69a33951e597f7c894fb9fdc8473fad43edc66c6965ed74a2
-
Download URL:
-
File Type:
CDC STACKS serves as an archival repository of CDC-published products including scientific findings, journal articles, guidelines, recommendations, or other public health information authored or co-authored by CDC or funded partners.
As a repository, CDC STACKS retains documents in their original published format to ensure public access to scientific information.