Infodemiology Study of Reddit Discussions

The spread of vaccine misinformation has become an important emerging public health problem. This has been particularly true during the COVID-19 pandemic, but it has also affected acceptance of other vaccines, such as the measles, mumps, and rubella (MMR) and human papillomavirus (HPV) vaccines.


The Melax Tech team recently took part in a study, led by the University of Texas School of Biomedical Informatics, on the feasibility of using machine learning and deep learning technologies to combat misinformation on social media. As a proof of concept of the approach and methods, the team focused on the HPV vaccine. HPV infection is a highly prevalent, preventable sexually transmitted disease that has been shown to cause approximately 33,700 cases of cancer every year in the United States, in both males and females. A vaccine against the most common HPV subtypes, intended to prevent cancer and precancerous lesions, has been available since 2006. Unfortunately, uptake of the HPV vaccine has suffered from misinformation on social media. Automated methods to help detect and curb the spread of this misinformation would be an important step in addressing this public health crisis. To our knowledge, there is no prior work on automated identification of vaccine-related misinformation on social media.


In this study, we aimed to develop and evaluate an intelligent automated protocol for identifying and classifying HPV vaccine misinformation on social media using machine learning (ML)-based methods. To do this, we applied five ML algorithms, three traditional ML approaches and two deep learning-based approaches, to 28,121 Reddit posts that contained keywords related to HPV vaccination and were posted by more than 16,633 unique users from 2007 to 2017. A support vector machine, logistic regression, extremely randomized trees, a convolutional neural network, and a recurrent neural network were each used to identify vaccine misinformation. We then applied topic modeling, using the biterm topic model, to determine the major categories of HPV vaccine misinformation and to provide a visualization tool.

This approach classified 7,207 (25.63%) of the 28,121 Reddit posts as vaccine misinformation. Among these, posts about general safety issues were the leading type, accounting for 36.99% of misinformation posts. All told, topic modeling followed by qualitative manual review identified seven categories of misinformation: general vaccine adverse events (promoting general misinformation about vaccine safety); conspiracy theories (promoting conspiracy theories about the vaccine and/or fraud by a third party such as the government or a drug company); the citing of unfounded studies (appearing to cite scientific studies from sources that are not, in fact, scientifically peer reviewed); vaccine deaths and serious reactions (propagating claims of vaccine-induced death and serious adverse reactions); aluminum-containing adjuvants (misinformation on the safety of aluminum-containing compounds in vaccines); vaccines and autism (misinformation on the discredited causal link between vaccines and autism); and the category "other."
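To give a feel for the topic modeling step, the sketch below shows the basic unit that a biterm topic model works with: a "biterm," an unordered pair of words that co-occur in the same short text. This is only an illustration, not the study's implementation. The function names, the simple regex tokenizer, and the two toy posts are all hypothetical, and a real pipeline would add stop-word removal and then fit the topic model on the collected biterms.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase a post and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_biterms(tokens):
    """Return every unordered word pair from one short text.

    Each pair is sorted so that (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(posts, stopwords=frozenset()):
    """Count biterms across a collection of short posts."""
    counts = Counter()
    for post in posts:
        tokens = [t for t in tokenize(post) if t not in stopwords]
        counts.update(extract_biterms(tokens))
    return counts


if __name__ == "__main__":
    # Toy posts invented for illustration; they are not from the study's data.
    toy_posts = ["vaccine safety concerns", "vaccine safety study"]
    for biterm, n in count_biterms(toy_posts).most_common(3):
        print(biterm, n)
```

Because short posts contain few words, counting co-occurring pairs across the whole collection gives a topic model more signal than per-post word counts alone, which is one reason biterm-style models are often chosen for short social media text.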

This preliminary work can be applied to other vaccines or other pertinent health-related topics and, because it relies on ML technology, scales readily to large social media datasets. The work has applications for both policy makers and industry as a tool to help analyze and understand the prevalence and spread of misinformation on social media. We believe our approach is generalizable to other social media platforms and can be used for both real-time and retrospective analysis of feeds from social media sites. This technology could also be used as a mechanism to prevent the spread of misinformation, although the authors believe that ethical review needs to be considered for such uses.


A full description of our methodology and results is given in: Jingcheng Du, Sharice Preston, et al. “Using Machine Learning–Based Approaches for the Detection and Classification of Human Papillomavirus Vaccine Misinformation: Infodemiology Study of Reddit Discussions.” Journal of Medical Internet Research. 2021 Aug;23(8):e26478. DOI: 10.2196/26478.