Social media are widely used by the general public and by public health and health care professionals. Emerging evidence suggests engagement with public health information on social media may influence health behavior. However, the volume of data accumulating daily on Twitter and other social media is a challenge for researchers with limited resources to further examine how social media influence health. To address this challenge, we used crowdsourcing to facilitate the examination of topics associated with engagement with diabetes information on Twitter.
We took a random sample of 100 tweets that included the hashtag “#diabetes” from each day during a constructed week in May and June 2014. Crowdsourcing through Amazon’s Mechanical Turk platform was used to classify tweets into 9 topic categories and their senders into 3 Twitter user categories. Descriptive statistics and Tweedie regression were used to identify tweet and Twitter user characteristics associated with 2 measures of engagement, “favoriting” and “retweeting.”
Classification was reliable for tweet topics and Twitter user type. The most common tweet topics were medical and nonmedical resources for diabetes. Tweets that included information about diabetes-related health problems were positively and significantly associated with engagement. Tweets about diabetes prevalence, nonmedical resources for diabetes, and jokes or sarcasm about diabetes were significantly negatively associated with engagement.
Crowdsourcing is a reliable, quick, and economical option for classifying tweets. Public health practitioners aiming to engage constituents around diabetes may want to focus on topics positively associated with engagement.
Diabetes is a major public health problem projected to reach rates as high as 1 in 3 adults in the United States by 2050 (
Social media have emerged as popular channels for health information-seeking and sharing; approximately 80% of US adult Internet users have searched online for health information (
Social media are unique communication and dissemination tools with interaction, or audience engagement, being a central feature. Social media engagement has been defined as “establishing a connection with others to contribute to a common good” (
Twitter is one of the top 3 social media applications and is used by 19% of all adults and 23% of online adults in the United States (
Twitter is an application for “microblogging,” or sending and receiving brief (140 characters or fewer), direct messages (ie, “tweets”) (
Applications such as Amazon’s Mechanical Turk allow the crowdsourcing of small online tasks, also known as Human Intelligence Tasks (HITs). Crowdsourcing is the use of large groups of people, often on the Internet, to do a specific task. HITs are tasks a computer is unable to perform alone; HITs are performed through the use of an open network of workers, also known as “turkers.” A researcher can post HITs that include classification, transcribing, image tagging, and other tasks, which are then completed by turkers, who earn anywhere from half a cent to tens of dollars per HIT completed.
Turkers can work from anywhere in the world; a 2010 study found most turkers reside in the United States (47%) or India (34%). As of April 2014, the percentage of turkers in the United States was 51.5%, and 33% were in India (
The widespread use of social media to find health information, including diabetes information, and the potential for social media engagement to influence health behavior present an opportunity to better understand engagement with diabetes information online. However, the volume of Twitter data accumulating daily presents a challenge for social scientists with limited human and financial resources. To address the opportunity and the challenge, we sought to 1) examine engagement with diabetes information on Twitter and 2) examine Amazon Mechanical Turk as a new tool to aid public health researchers working with social media data.
As with traditional news sources, Twitter use varies by day of the week (
Three authors (J.K.H., A.M., S.M.R.) reviewed the tweets about diabetes and worked together to develop a classification scheme for each tweet and tweet sender. The classification scheme has 9 topic statements and 3 Twitter user types (
| Topic and User Characteristic | Example Tweet | ICC (95% CI) | Total Tweets, n (%) | Tweets Favorited, n (%) | Tweets Retweeted, n (%) |
|---|---|---|---|---|---|
| Number or percentage of people with diabetes | @CDCgov estimates that 1 in 3 US adults will have #diabetes by 2050. There’s hope. | .82 (.80–.84) | 37 (8.3) | 7 (18.9) | 7 (18.9) |
| Diabetes-related joke or sarcasm | My crack dealer #wcw #littledebbie #diabetes @LittleDebbie | .82 (.80–.84) | 58 (12.9) | 16 (27.6) | 3 (5.2) |
| Diabetes-related event (for example: walk or 5k, conference, awareness month) | This goofy bunch raised over $2,500 to help find a cure for #diabetes. Way to go #TeamReasonRiders! #TourDeCureIndy | .82 (.80–.84) | 53 (11.8) | 15 (28.3) | 23 (43.4) |
| A person’s success story (for example: good blood glucose, exercise) | Holy Crap!! My blood glucose hasn't been at my goal of 130 in years!! Woo go me:p #diabetes #diabetic | .62 (.57–.67) | 37 (8.3) | 9 (24.3) | 8 (21.6) |
| A person’s failure or challenge (for example: bad blood glucose, eating candy) | That moment when u eat lunch then realize you forgot to bolus! DOH!! #diabetes #type1 #type2 #organic . . . | .67 (.63–.71) | 44 (9.8) | 9 (20.5) | 3 (6.8) |
| Children with diabetes | #Diabetes among kids is on the rise #GLV | .83 (.81–.85) | 24 (5.4) | 6 (25.0) | 4 (16.7) |
| Nonmedical resources for diabetes (eg, recipes, cookbooks, weight loss tips) | Everyone, especially those with #diabetes, need to avoid these 10 processed foods | .70 (.66–.74) | 124 (27.7) | 26 (21.0) | 18 (14.5) |
| Medical resources for diabetes (eg, new drug, alternative therapy, screening) | Gastric banding: new ammunition in the fight against type 2 diabetes | .72 (.68–.75) | 130 (29.0) | 23 (17.7) | 24 (18.5) |
| Diabetes-related health problems (eg, heart disease, cancer, amputation, anxiety) | Dr Lane on #diabetes complications: microalbuminuria is a marker for cardiovascular disease risk #APCU2014 | .66 (.61–.70) | 57 (12.7) | 11 (19.3) | 10 (17.5) |
| Twitter user type | Example user description | .84 (.81–.86) | NA | NA | NA |
| Person | Type1 Diabetic, organic enthusiast, stay-at-home dad, blogger | NA | 246 (54.9) | 54 (22.0) | 39 (15.9) |
| Organization | Therapeutics initiative: providing physicians and pharmacists with up-to-date, evidence-based, practical information on prescription drug therapy | NA | 180 (40.2) | 37 (20.6) | 43 (23.9) |
| Sender description is blank | NA | | 22 (4.9) | 3 (13.6) | 2 (9.1) |
Abbreviations: ICC, intraclass correlation coefficient; CI, confidence interval; #, Twitter hashtag; NA, not applicable.
A screen capture of an example tweet and the description of the Twitter user who sent the tweet along with the instructions for classifying the tweet into topic and user categories. At the bottom is the submit button.
| Instructions |
| Choose the categories that best describe the tweet content and tweet sender shown below. If a link is included please click on it to help you classify the tweet and tweet sender accurately. |
| Tweet: #Diabetes rates skyrocket in kids and teens – USA TODAY |
| The tweet includes information about . . . (Choose all that apply): |
| The number or percentage of people with diabetes |
| Diabetes-related joke or sarcasm |
| Diabetes-related event (for example: walk or 5k, conference, awareness month) |
| A person’s success story (for example: good blood sugar, exercise) |
| A person’s failure or challenge (for example: bad blood sugar, eating candy) |
| Children with diabetes |
| Non-medical resources for diabetes (for example: recipes, cookbooks, weight loss tips) |
| Medical resources for diabetes (for example: new drug, alternative therapy, screening) |
| Diabetes related health problems (for example: heart disease, cancer, amputation, anxiety) |
| Tweet sender description: Read The news Without #Ads By Replacing 757.no-ip. Biz With bit.ly |
| The sender of the tweet seems to be a(n) . . . |
| Person |
| Organization |
| Sender description is blank |
| Submit |
To ensure reliable classification, we followed Hipp et al (
To examine reliability of the classification system we used a 1-way random model for absolute agreement (
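The 1-way random-effects intraclass correlation named above can be illustrated with a short sketch. This is not the authors' code (the analyses were run in SPSS). It assumes a hypothetical layout in which each tweet is a row holding the same number of ratings, and it reports both the single-measure and the average-measure form because the text does not say which one was used. The labels for good and excellent agreement follow the cutoffs given in the Results.

```python
def icc_one_way(ratings):
    """Return (single-measure, average-measure) ICC from a 1-way random-effects ANOVA.

    ratings: list of rows, one row per rated item (for example, a tweet),
    each row holding the same number k of ratings. The layout is an
    assumption made for this illustration.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 items, each with the same k >= 2 ratings")

    row_means = [sum(row) / k for row in ratings]
    grand_mean = sum(sum(row) for row in ratings) / (n * k)

    # Mean square between items and mean square within items
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    if ms_between == 0:
        raise ValueError("items do not differ; ICC is undefined")

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


def reliability_label(icc):
    """Label an ICC using the cutoffs reported in the Results section."""
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    return "below good"


if __name__ == "__main__":
    # Hypothetical data: 4 tweets, 3 raters, 1 = topic present, 0 = absent
    example = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
    single, average = icc_one_way(example)
    print(f"single-measure ICC: {single:.3f} ({reliability_label(single)})")
    print(f"average-measure ICC: {average:.3f} ({reliability_label(average)})")
```

The average-measure form is the more natural choice when the pooled judgment of several raters is the value carried into later analysis, but that is a design choice, not something the text confirms.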
We used descriptive statistics and Tweedie regression to examine tweet and Twitter user characteristics associated with engagement. The 2 indicators of engagement, number of favorites and number of retweets, are count variables. Poisson models are often used to model count variables; however, each tweet was favorited a mean of 0.74 times (variance, 52.23) and retweeted a mean of 0.74 times (variance, 32.03). The magnitude of the variance in relation to the mean violates the Poisson regression assumption that the mean and variance are equal. A variance that is very large in relation to the mean indicates that the data are overdispersed. In addition, these data included many zeros for both favoriting (n = 363) and retweeting (n = 367). Tweedie regression accounts for overdispersed count data with a large number of zeros.
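The argument against a Poisson model can be checked with simple arithmetic on the reported summary statistics. The sketch below is illustrative only. It assumes a total of 448 tweets, a figure taken from the Twitter user counts in Table 1 (246 + 180 + 22) and not stated in this paragraph. It compares the variance-to-mean ratio with the value of 1 that a Poisson distribution implies, and the observed zeros with the number a Poisson distribution with the same mean would predict.

```python
import math

# Assumption: total sample size derived from the user-type counts in Table 1
N_TWEETS = 246 + 180 + 22

# Summary statistics reported in the text
REPORTED = {
    "favorites": {"mean": 0.74, "variance": 52.23, "zeros": 363},
    "retweets": {"mean": 0.74, "variance": 32.03, "zeros": 367},
}


def dispersion_index(mean, variance):
    """Variance-to-mean ratio; a Poisson distribution implies a value of 1."""
    return variance / mean


def expected_poisson_zeros(mean, n):
    """Number of zeros expected among n Poisson counts with the given mean."""
    return n * math.exp(-mean)


if __name__ == "__main__":
    for outcome, stats in REPORTED.items():
        ratio = dispersion_index(stats["mean"], stats["variance"])
        zeros = expected_poisson_zeros(stats["mean"], N_TWEETS)
        print(
            f"{outcome}: variance/mean = {ratio:.1f}, "
            f"Poisson-expected zeros = {zeros:.0f}, observed zeros = {stats['zeros']}"
        )
```

Both checks point the same way: the variance is far larger than the mean, and there are many more zeros than a Poisson model with that mean would produce. Tweedie models with a variance power between 1 and 2 place positive probability on exact zeros, which is one reason they suit outcomes like these.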
We built the regression models in 2 steps. We started with reduced models that included only predictors shown in prior studies to be associated with engagement. Specifically, reduced models included presence of a link in the tweet, the number of followers of the tweet sender, the number of followees of the tweet sender, and the age of the sender’s Twitter account. Although hashtags have been shown to be important to engagement, we did not include them as a predictor because all tweets included the hashtag #diabetes as a result of the data collection process. To develop the full model, we then added topic and type of Twitter user to the reduced model.
We used the Akaike Information Criterion (AIC) to determine whether model fit improved from the reduced to the full model. A lower AIC indicates a better-fitting model. In addition, we examined leverage and Cook’s D values to identify and assess outlying and influential values. Analyses were conducted using IBM SPSS version 22 (IBM Corp).
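The AIC comparison described above can be sketched in a few lines. The function uses the standard definition, AIC = 2k − 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood. The helper name `prefer_lower_aic` and the dictionary layout are hypothetical, and the AIC values are those reported in Table 2. Treating a tie as favoring the reduced model is an assumption made here for parsimony, not a rule stated in the text.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def prefer_lower_aic(aic_reduced, aic_full):
    """Return the name of the better-fitting model and the AIC difference.

    A positive difference means the full model has the lower (better) AIC.
    Ties go to the reduced model (an assumption made for parsimony).
    """
    delta = aic_reduced - aic_full
    return ("full" if delta > 0 else "reduced"), delta


# AIC values as reported in Table 2: (reduced model, full model)
TABLE2_AIC = {
    "favorites": (853.49, 842.70),
    "retweets": (812.16, 804.15),
}

if __name__ == "__main__":
    for outcome, (reduced, full) in TABLE2_AIC.items():
        winner, delta = prefer_lower_aic(reduced, full)
        print(f"{outcome}: {winner} model preferred (AIC difference = {delta:.2f})")
```

Because the penalty term grows with the number of parameters, a lower AIC for the full model suggests that adding topic and user type improved fit by more than the added complexity cost.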
Tweets were sent by Twitter users with a median of 631.5 followers (range, 7–242,646), and following a median of 613.5 others (followees range, 0–76,742), with accounts open a mean of 1,132 days (standard deviation [SD], 645). The most common diabetes tweet topics were medical resources for diabetes (n = 130, 29.0%) and nonmedical resources for diabetes (n = 124, 27.7%). The least common tweet topic was children with diabetes (n = 24, 5.4%). Tweets about events were most likely to be favorited and retweeted. The percentage of tweets favorited varied little across tweet topics. The least favorited topic, medical resources for diabetes, was favorited 17.7% of the time, whereas the most favorited topic, diabetes-related event, was favorited 28.3% of the time. The range was much wider for retweeting: from 5.2% of tweets containing a diabetes-related joke or sarcasm and 6.8% of tweets about a person’s failure or challenge to 43.4% of tweets about a diabetes-related event. Just over half the tweets were sent by a person (54.9%), 40.2% were sent by an organization, and 4.9% had a blank user description. Interrater reliability was good (0.60–0.74) for half the measures and excellent (0.75–1.00) for the other half.
There was 1 extreme outlying case for both outcomes and 1 additional outlier for the number of favorites model. The extreme case was an individual with the most followers (n = 262,646) of any of the Twitter users in the data but whose tweets were not favorited and were only retweeted once. The outlier for the favoriting model had the highest value for the number of favorites outcome. Because the 2 cases appeared legitimate, we retained them in the data set.
Reduced and full models were significantly better than null models at explaining the outcomes (
| Characteristic | Number of Favorites, Reduced Model | Number of Favorites, Full Model | Number of Retweets, Reduced Model | Number of Retweets, Full Model |
|---|---|---|---|---|
| | −.174 (.379) | −.518 (.588) | −.677 (.426) | −.003 (.590) |
| | .002 (.001) | .001 (.001) | .002 (.001) | .001 (.001) |
| | .003 (.002) | .001 (.003) | .001 (.002) | .001 (.002) |
| | −.072 (.022) | −.026 (.023) | −.049 (.023) | −.019 (.025) |
| | .382 (.379) | .466 (.358) | .771 (.380) | .558 (.400) |
| Twitter user type | | | | |
| Organization | — | 1 [Reference] | — | 1 [Reference] |
| Person | — | .393 (.333) | — | .052 (.322) |
| No user description | — | −1.286 (1.043) | — | −.085 (.866) |
| Tweet topic | | | | |
| Prevalence | — | −1.344 (.672) | — | −1.259 (.699) |
| Sarcasm/joke | — | −.037 (.518) | — | −2.964 (.828) |
| Event | — | −.454 (.556) | — | −.098 (.506) |
| Success | — | −.084 (.587) | — | −.416 (.636) |
| Failure | — | −.948 (.589) | — | −1.153 (.768) |
| Children | — | −.204 (.749) | — | −.864 (.819) |
| Nonmedical resources | — | −.839 (.433) | — | −1.441 (.465) |
| Medical resources | — | −.702 (.443) | — | −.576 (.453) |
| Health problems | — | 1.062 (.454) | — | .388 (.483) |
| χ2 | 31.76 ( | 64.56 ( | 25.37 ( | 55.38 ( |
| AIC | 853.49 | 842.70 | 812.16 | 804.15 |
Abbreviations: AIC, Akaike Information Criterion; SE, standard error; —, variable not included in the model.
Significance calculated using χ2.
Likewise, there was a positive and significant relationship between having a large number of followers and retweeting. However, there were negative associations between retweeting and the topics of number or percentage of people with diabetes, diabetes-related joke or sarcasm, and nonmedical resources for diabetes. In addition, although the proportion of tweets retweeted and favorited was highest overall for tweets about events, once other tweet characteristics were accounted for, the event topic was not significantly associated with favoriting or retweeting. Finally, contrary to the results of prior studies, the full models indicated that number of followees, account age, and including a URL did not influence engagement (
Through an examination of a sample of tweets about diabetes using crowdsourcing for data classification, we learned 2 things that may aid public health researchers and practitioners working with social media: 1) the Mechanical Turk may be a reliable, quick, and economical way for researchers to code large amounts of complex social media data; and 2) tweet topics may be associated with tweet engagement in public health. Consistent with Hipp et al (
Research that examined tweet characteristics associated with engagement has primarily relied on methods from computer science including data mining and machine learning. These tools are useful in identifying patterns in social media data related to tweet topic, sentiment (such as sarcasm), and parts of speech. However, the tools have 2 limitations: 1) they require specialized skills not always the purview of social scientists and 2) machine learning algorithms have some limitations in the types of classification they can accurately handle, although methods are increasingly sophisticated and able to handle complex tasks. In contrast, the Mechanical Turk system requires minimal technical skill for use by researchers and provides access to a large population of people with the ability to reliably code many complex topics.
An analysis of tweets classified through Mechanical Turk identified several tweet topics associated with 2 forms of tweet engagement, retweeting and favoriting. Specifically, the topic “nonmedical resources for diabetes” had a significant negative relationship with both favoriting and retweeting. An examination of tweets classified as nonmedical resources indicated that some of these tweets may lack credibility or appear to be spam. For example, this tweet was not favorited or retweeted a single time despite being sent by a Twitter user with more than 20,000 followers: “
In addition, retweeting and favoriting were significantly lower for tweets about the number or percentage of people with diabetes, whereas favoriting was higher for tweets about health problems associated with diabetes. This may indicate that Twitter users are engaging with health information specific to their personal health situation but not with general information. Finally, retweeting was significantly lower for tweets that included a diabetes-related joke or sarcasm.
Public health professionals working in diabetes and other areas may wish to consider how Twitter topics influence engagement. Tweet strategies often include guidance on features (eg, hashtags, URLs) to include in a tweet, tweet timing, and other nontopical strategies for increasing engagement. However, our results demonstrated that, controlling for tweet and tweet sender characteristics, tweet topic is influential in whether a tweet is favorited or retweeted.
Our study has several limitations, including the use of a hashtag for data collection. Tweets about diabetes may not contain #diabetes, so we may have missed some important tweets or patterns of relationships. An emerging body of work on hashtag use on Twitter (
This article was made possible by grant no. 1P30DK092950 from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIDDK. We acknowledge the support of the Washington University Institute for Public Health for co-sponsoring, with the Washington University Center for Diabetes Translation Research, the Next Steps in Public Health event that led to the development of this article.