Using Event-Based Web-Scraping Methods and Bidirectional Transformers to Characterize COVID-19 Outbreaks in Food Production and Retail Settings

Flynn M; Gangrade V; Hilton C; Miano J; Pomeroy M; Siven J; Tilashalski F

Public Domain

2021/06/15
Series: Mining Publications

Details

Personal Author:

Flynn M ; Gangrade V ; Hilton C ; Miano J ; Pomeroy M ; Siven J ; Tilashalski F
Description:

Current surveillance methods may not capture the full extent of COVID-19 spread in high-risk settings such as food establishments. We therefore propose a surveillance method that identifies COVID-19 cases among food establishment workers from news reports using web-scraping and natural language processing (NLP). First, we used web-scraping to collect a broad set of articles (n = 67,078) related to COVID-19 based on keyword mentions. From this dataset, we used an open-source NLP platform (ClarityNLP) to automatically extract location, industry, case counts, and death counts. These articles were vetted and validated by CDC subject matter experts (SMEs) to identify those describing COVID-19 outbreaks in food establishments. CDC and Georgia Tech Research Institute SMEs provided a human-labeled test dataset of 388 articles to validate our algorithms. Then, to improve quality, we fine-tuned a pretrained RoBERTa instance, a bidirectional transformer language model, to classify articles reporting at least one positive COVID-19 case in food establishments. Applying RoBERTa reduced the number of articles from 67,078 to 1,112 and classified articles (those reporting at least one positive COVID-19 case in food establishments) with 88% accuracy on the human-labeled test dataset. By automating the pipeline of web-scraping and COVID-19 case prediction using RoBERTa, we enable an efficient human-in-the-loop process in which COVID-19 data can be manually collected from articles flagged by our model, reducing human labor requirements. Furthermore, our approach could be used to predict and monitor locations of COVID-19 development by geography and could be extended to other industries and news article datasets of interest. [Description provided by NIOSH]
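
The two-stage triage described above (a keyword filter followed by a classifier that flags articles for human review) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the keyword list, the food-setting terms, the threshold, and the `toy_classifier` function are hypothetical stand-ins, and the toy classifier is a simple rule, not the fine-tuned RoBERTa model or ClarityNLP used by the authors.

```python
import re
from typing import Callable, Iterable, List, Sequence

# Hypothetical keyword pattern; the record does not list the actual search terms.
COVID_TERMS = re.compile(r"\b(covid-19|coronavirus|sars-cov-2)\b", re.IGNORECASE)

# Hypothetical food-setting terms used only by the toy classifier below.
FOOD_TERMS = ("meat processing", "grocery", "restaurant", "food")


def keyword_filter(articles: Iterable[str]) -> List[str]:
    """Stage 1: keep articles that mention any COVID-19 keyword."""
    return [text for text in articles if COVID_TERMS.search(text)]


def toy_classifier(text: str) -> float:
    """Stand-in scorer (NOT RoBERTa): 1.0 if the text mentions a food
    setting and the word 'positive', otherwise 0.0."""
    lowered = text.lower()
    has_food_setting = any(term in lowered for term in FOOD_TERMS)
    return 1.0 if has_food_setting and "positive" in lowered else 0.0


def flag_for_review(
    articles: Iterable[str],
    classifier: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Stage 2: keep articles whose classifier score meets the threshold,
    producing the queue a human reviewer would read."""
    return [text for text in articles if classifier(text) >= threshold]


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the human labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def reduction_fraction(before: int, after: int) -> float:
    """Share of articles removed from the review queue."""
    return 1 - after / before


if __name__ == "__main__":
    sample = [
        "Ten workers at a meat processing plant tested positive for COVID-19.",
        "City council debates new parking rules.",
        "Coronavirus vaccine trial begins at university hospital.",
    ]
    candidates = keyword_filter(sample)
    print(flag_for_review(candidates, toy_classifier))
    print(round(reduction_fraction(67078, 1112), 3))
```

Applied to the figures reported in the description, `reduction_fraction(67078, 1112)` shows that the flagged set is about 1.7% of the scraped articles, so roughly 98.3% of articles never reach a human reviewer. In a real pipeline the `classifier` argument would wrap a trained model's predicted probability rather than a rule.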
Subjects:

Communicable Diseases ; Epidemiology ; Food Handling ; Food Industry ; Occupational Health ; Public Health ; Safety ; Software
Keywords:

Author Keywords: COVID-19 ; Computer Models ; Coronavirus ; COVID-19 ; Disease Transmission ; Food Services ; Infectious Diseases ; Natural Language Processing ; Public Health ; Retail Workers ; Surveillance ; Web-scraping
Series:

Mining Publications
ISBN:

9783030772109
ISSN:

0302-9743
Publisher:

Springer
Document Type:

Text
Genre:

Proceedings
Place as Subject:

Georgia ; Ohio ; OSHA Region 3 ; OSHA Region 4 ; OSHA Region 5 ; Pennsylvania
CIO:

National Institute for Occupational Safety and Health (NIOSH)
Division:

PMRD - Pittsburgh Mining Research Division ; DSI - Division of Science Integration
Topic:

Workplace Safety & Health
Location:

North America
Pages in Document:

187-198
Volume:

12721
NIOSHTIC Number:

nn:20063307
Citation:

Artificial Intelligence in Medicine: 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, June 15-18, 2021, virtual event. Lecture notes in computer science, volume 12721. Tucker A, Henriques Abreu P, Cardoso J, Pereira Rodrigues P, Riaño D eds. Cham, Switzerland: Springer, 2021 Jun; 12721:187-198
Contact Point Address:

Charity Hilton, Georgia Tech Research Institute, Atlanta, GA 30318, USA
Email:

Charity.Hilton@gtri.gatech.edu
Editor(s):

Tucker A; Henriques Abreu P; Cardoso J; Pereira Rodrigues P; Riaño D
Federal Fiscal Year:

2021
Peer Reviewed:

False
Source Full Name:

Artificial Intelligence in Medicine: 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, June 15-18, 2021, virtual event. Lecture notes in computer science, volume 12721
Collection(s):

National Institute for Occupational Safety and Health
Main Document Checksum:

urn:sha-512:63aaf81a88031c6ecfa52375cd7440dae166bc43efd9a0ceb1c0b7944014750e76adb9c29aea62b69a33951e597f7c894fb9fdc8473fad43edc66c6965ed74a2
Download URL:

https://stacks.cdc.gov/view/cdc/215438/cdc_215438_DS1.pdf
File Type:

[PDF - 794.73 KB]
