Project

Status:

Active/Ongoing

Linked to group(s)/challenge(s):

Qualitative analysis of Tweets on Vaccination

This project is geared towards a qualitative analysis of both real-time and historical data from Twitter on Vaccination.

Introduction:

There is much more that we can learn from social media sentiment and data than mere likes and shares, especially where health care is concerned. For the vaccine hesitancy challenge, we believe it is important to capture the views and trends of the public, and social media sites like Twitter provide a good window into this area.

Dashboard APP:

The app itself can be accessed through this link.

The app collects all the tweets related to vaccination posted and visualizes some statistics. The first panel on the left-hand side is a line plot of the word-count trend, showing the changing pattern of the top-5 most frequently mentioned words. The top panel on the right-hand side is a bar chart showing the top-10 words for better comparison. The figure below is a time-shifting scatter plot of the averaged real-time sentiment score for all the tweets grouped by the top-5 mentioned words.
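The statistics behind these panels can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual code: the sample tweets, their sentiment scores, and the function names `top_words` and `mean_sentiment_by_word` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: (tweet text, sentiment score in [-1, 1]).
tweets = [
    ("vaccine safety data looks good", 0.6),
    ("vaccine mandate debate", -0.2),
    ("flu vaccine clinic open today", 0.2),
    ("safety concerns about mandate", -0.4),
]


def top_words(records, k):
    """Return the k most frequent words across all tweet texts."""
    counts = Counter(
        word for text, _ in records for word in text.lower().split()
    )
    return counts.most_common(k)


def mean_sentiment_by_word(records, words):
    """Average the sentiment of every tweet that mentions each word."""
    scores = defaultdict(list)
    for text, score in records:
        tokens = set(text.lower().split())
        for word in words:
            if word in tokens:
                scores[word].append(score)
    return {word: sum(vals) / len(vals) for word, vals in scores.items()}


top = top_words(tweets, 3)
averages = mean_sentiment_by_word(tweets, [word for word, _ in top])
print(top)
print(averages)
```

The word counts would feed the line plot and bar chart, and the per-word averages would feed the sentiment scatter plot.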

The predefined tracking keywords are:

Vaccine, Vaccination and Antivaxx

Elevator pitch / Abstract

Social media today has become a very popular communication tool among Internet users. Millions of messages are appearing daily in popular websites that provide services for microblogging such as Twitter, Tumblr, Facebook.

Authors of those messages write about their life, share opinions on variety of topics and discuss current issues. Because of a free format of messages and an easy accessibility of these platforms, Internet users tend to shift from traditional communication tools (such as traditional blogs or mailing lists) to these services. As more and more users post about products and services they use, or express their political and religious views, microblogging web-sites become valuable sources of people’s opinions and sentiments. Such data can be efficiently used for marketing or social studies. (Pak, Alexander, and Patrick Paroubek. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining." LREC, Vol. 10, 2010.)

Vaccine hesitancy is not new, but today – as the use of social media continues to increase – opportunities to spread misinformation and pseudoscience are unprecedented. Facebook, Twitter and other networks are taking steps to limit anti-vax content – but is it enough? And is there a role to harness social media for the powers of good?

Twitter carries a large flow of information every second, which should be fully utilized if one wants to explore the real-time interaction between communities and real-life events. This is the motivation for developing a tool that is capable of collecting, storing, analyzing, and finally visualizing Twitter data to glean real-time information on vaccination.

How to contribute

Contributing:

Needs:

Data cleaning and preprocessing for NLP on a dataset of tweets from 2006 to 2019
Exploratory data analysis and Visualization
Topic modeling from the dataset
Graph analysis
Machine/deep learning models.

Contributing on GitHub:

The code and datasets for this project are openly available and we are eager for collaborators. Please use the project's GitHub repo or JOGL project page.

On GitHub:

File an issue to notify us about what you're working on.
Fork the repo, download the dataset, develop and test your code.
Make sure that your commit messages clearly describe your work.
Send a pull request.

File an Issue

Use the issue tracker on GitHub to start the discussion.

Another team or individual may already be working on your idea, your approach may not be quite right, or what you wish to work on may already exist. The ticket you file in the issue tracker will be used to sort that all out.

Style Guides

Write in UTF-8 in Python 3
Use a modular architecture to group similar functions, classes, etc.
Always use 4 spaces for indentation (don't use tabs)
Try to limit line length to 80 characters
Class names should always be capitalized
Function names should always be lowercase
Look at the existing style and adhere accordingly
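As an illustration of these guidelines, here is a short hypothetical module (it is not taken from the repository, and the names `KeywordCounter` and `count_keywords` are invented for the example). It uses 4-space indentation, lines under 80 characters, a capitalized class name and a lowercase function name.

```python
"""Helpers for counting keyword mentions (illustrative module)."""


class KeywordCounter:
    """Count how many texts mention each tracked keyword."""

    def __init__(self, keywords):
        self.keywords = [keyword.lower() for keyword in keywords]
        self.counts = {keyword: 0 for keyword in self.keywords}

    def update(self, text):
        """Add one text's keyword mentions to the running totals."""
        lowered = text.lower()
        for keyword in self.keywords:
            if keyword in lowered:
                self.counts[keyword] += 1


def count_keywords(texts, keywords):
    """Return a dict mapping each keyword to the number of matching texts."""
    counter = KeywordCounter(keywords)
    for text in texts:
        counter.update(text)
    return counter.counts
```

Existing code in the repository remains the reference; where it differs from this sketch, follow the repository.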

Fork the Repository

Be sure to add the relevant tests before making the pull request. Docs will be updated automatically when we merge to master, but you should also build the docs yourself and make sure they're readable.

Make the Pull Request

Once you have made all your changes, added tests, and updated the documentation, make a pull request to move everything back into the main branch of the repository. Be sure to reference the original issue in the pull request. Expect some back-and-forth with regard to style and compliance with these rules.

Problem Statement

The fatal impact of vaccine hesitancy

In 2018, there were over 82,000 cases of measles confirmed in the EU – three times more than in 2017 – and measles led to 72 deaths. Cases of measles are affecting all unvaccinated groups, adults and children alike, with large numbers of cases and fatalities in countries which had previously eliminated the disease.

Vaccine hesitancy is a key reason for this worrying trend. Europe is the most vaccine-hesitant region in the world, and we are now witnessing the results. Last year’s wide-ranging survey of vaccine confidence in Europe, led by Heidi Larson and her colleagues from the Vaccine Confidence Project, found that the picture in the EU is complex with varying levels of vaccine confidence between countries.

The role of social media

At the core of social media is the ability for us to share ideas and content with our peers. While this freedom of information is what makes social media so appealing, it is also what can make it dangerous. Social media is not the cause of vaccine hesitancy, but it has certainly played a role in making anti-vaccination arguments and pseudoscience accessible to a wider audience.

Objectives & Methodology

Objectives:

Create a real-time Twitter vaccination data analysis dashboard.
Qualitative analysis of historical Twitter data on vaccination from 2006 to 2019.

1. Creating the real-time Twitter sentiment analysis dashboard

1.1 Basic Overview

Application Framework:

For the live streaming app, the traditional workflow is divided into two independent pipelines that work together (Figure 1). Data processing starts as soon as the first record is received and is followed by data analysis and visualization of the results. While the visualization server (Plotly Dash) handles data processing, the streaming server (Tweepy) brings in the next newly generated record. Each pipeline keeps looping at its own pace.

Figure 1 shows the underlying framework of all the processes described:

As shown above, to make the application react promptly, we break the traditional data science pipeline into two modules, each handled by its own local server. Specifically, Tweepy is responsible for streaming data, i.e., controlling the flow of tweets, and we add extra functionality to its trigger behavior, such as pushing data into a database or (optionally) deleting old records from the database. This extra functionality is triggered whenever the server receives a tweet that matches the predefined tracking conditions. Therefore, for the Tweepy server, the intervals between one operation and the next are random, as shown in the figure.
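The trigger behavior on the streaming side can be sketched without any Twitter connection. The example below is an assumption-laden illustration: `handle_tweet`, `open_store`, the table layout and the `MAX_ROWS` cap are hypothetical, SQLite stands in for whatever database the app uses, and in the real app a function like `handle_tweet` would be called from the Tweepy stream callback rather than from a list.

```python
import sqlite3

TRACK_TERMS = ("vaccine", "vaccination", "antivaxx")
MAX_ROWS = 3  # tiny cap so the pruning step is visible in this demo


def open_store():
    """Create an in-memory table; a real app would use a persistent store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweets "
        "(id INTEGER PRIMARY KEY, created_at REAL, text TEXT)"
    )
    return conn


def handle_tweet(conn, created_at, text):
    """Store a matching tweet, then prune the oldest rows beyond MAX_ROWS."""
    if not any(term in text.lower() for term in TRACK_TERMS):
        return False
    conn.execute(
        "INSERT INTO tweets (created_at, text) VALUES (?, ?)",
        (created_at, text),
    )
    conn.execute(
        "DELETE FROM tweets WHERE id NOT IN "
        "(SELECT id FROM tweets ORDER BY id DESC LIMIT ?)",
        (MAX_ROWS,),
    )
    conn.commit()
    return True


conn = open_store()
# Hypothetical arrivals at irregular times (seconds since start).
incoming = [
    (1.0, "New vaccine trial announced"),
    (1.7, "Lunch was great"),
    (2.1, "Vaccination centre opens"),
    (4.9, "antivaxx claims debunked"),
    (5.0, "Vaccine supply update"),
]
stored = [handle_tweet(conn, ts, text) for ts, text in incoming]
remaining = [
    row[0] for row in conn.execute("SELECT text FROM tweets ORDER BY id")
]
```

Only tweets matching a tracking term are written, and each write also trims the table, so the work happens exactly when a tweet arrives rather than on a fixed schedule.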

Meanwhile, data processing and visualization are carried out by another platform, Dash by Plotly. Dash controls the rendering and refreshing of the visualization in the browser; hence, a constant time interval for this trigger is preferred. Depending on the data throughput, the time window can range from 0.5 seconds to 5 seconds (2 seconds for this app).

In the big picture, two running servers execute their loops independently, and data flows between them at the predefined time interval to update the application’s graphical interface.
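The interplay between irregular arrivals and a fixed refresh interval can be simulated as follows. This is a sketch under stated assumptions: `batches_per_refresh` and the arrival times are hypothetical, and no real timer is used. In Dash, a periodic trigger of this kind is typically provided by the `dcc.Interval` component, though the source does not state which mechanism the app uses.

```python
REFRESH_SECONDS = 2.0  # refresh interval stated above for this app


def batches_per_refresh(arrival_times, total_seconds,
                        interval=REFRESH_SECONDS):
    """Group irregular arrival times by the fixed refresh tick that sees them."""
    arrivals = sorted(arrival_times)
    batches = []
    position = 0
    tick = interval
    while tick <= total_seconds:
        batch = []
        # Everything that arrived up to this tick and was not yet displayed.
        while position < len(arrivals) and arrivals[position] <= tick:
            batch.append(arrivals[position])
            position += 1
        batches.append((tick, batch))
        tick += interval
    return batches


# Hypothetical arrival times (seconds) of tweets stored by the streaming side.
arrivals = [0.3, 0.4, 1.9, 4.5, 5.2, 5.3, 5.9]
for tick, batch in batches_per_refresh(arrivals, total_seconds=6.0):
    print(f"t={tick:.1f}s -> {len(batch)} new tweet(s)")
```

A refresh may find several new tweets or none at all; the display simply redraws from whatever the database holds at that moment.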

1.2 Sentiment Analysis using VADER:

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It combines both approaches: a sentiment lexicon and a set of rules.

A sentiment lexicon is a list of lexical features (e.g., words) which are generally labelled according to their semantic orientation as either positive or negative.

VADER has been found to be quite successful when dealing with social media texts, NY Times editorials, movie reviews, and product reviews. This is because VADER reports not only whether a text is positive or negative but also how positive or negative it is.
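To make the idea of a valence lexicon plus rules concrete, here is a toy scorer. It is not VADER: the lexicon values are made up, and the single negation rule is a deliberate simplification (VADER dampens rather than fully flips negated words and has many more rules). The normalisation `x / sqrt(x^2 + 15)` and the ±0.05 cut-offs do follow the conventions documented for VADER's compound score.

```python
import math

# Made-up valence values for illustration; NOT the real VADER lexicon.
TOY_LEXICON = {
    "good": 2.0, "great": 3.0, "safe": 1.5, "bad": -2.5, "scary": -2.0,
}
NEGATIONS = {"not", "never", "no"}
ALPHA = 15  # normalisation constant used by VADER's compound score


def compound_score(text):
    """Sum word valences, flipping the word right after a negation."""
    total = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        valence = TOY_LEXICON.get(word, 0.0)
        total += -valence if negate else valence
        negate = False
    # Squash the unbounded sum into the open interval (-1, 1).
    return total / math.sqrt(total * total + ALPHA)


def label(score, threshold=0.05):
    """Map a compound score to a coarse class using +/-0.05 cut-offs."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


for text in ("The vaccine is good", "The vaccine is great", "not good"):
    score = compound_score(text)
    print(f"{text!r}: {score:+.3f} ({label(score)})")
```

With the real library, the equivalent call is `SentimentIntensityAnalyzer().polarity_scores(text)` from the `vaderSentiment` package, which returns `neg`, `neu`, `pos` and `compound` values.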

It is fully open-sourced under the MIT License. The developers of VADER used Amazon’s Mechanical Turk to get most of their ratings. You can find complete details on their GitHub page.

1.2.1 Advantages of using VADER

VADER has a lot of advantages over traditional methods of Sentiment Analysis, including:

It works exceedingly well on social media type text, yet readily generalizes to multiple domains
It doesn’t require any training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon
It is fast enough to be used online with streaming data, and
It does not severely suffer from a speed-performance tradeoff.

The source of this section is a very readable paper published by the creators of the VADER library. You can read the paper here.

2. Historical analysis of tweets on vaccination (2006 to 30/11/2019)

2.1: Data Gathering:

We collected all tweets containing the search string: vaccination. Along with the tweet text, we downloaded the date and time when the tweet was published, and the location of the user (if provided). We also downloaded the user id, follower ids, and friend ids. The followers of a user A are those users who will receive messages from user A. The friends of a user A are those users from whom user A receives messages. Thus, information flows from a user to their followers. We collected tweets using the open source information tool TWINT (https://github.com/twintproject) and a Python script.

In contrast to the open Twitter Search API, which only allows one to query tweets posted within the last seven days, Twint makes it possible to collect a much larger sample of Twitter posts, spanning several years. We queried Twint for different key terms that relate to the topic of vaccination, covering the period from 2006 to 30th November 2019, and stored the results in an aggregated CSV file.
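Because the same tweet can match more than one key term, aggregation needs a de-duplication step. The sketch below shows one way to do it with the standard library; it is not the project's script. The function names and the column names `id`, `date` and `tweet` are assumptions, and in-memory strings stand in for the per-keyword CSV files.

```python
import csv
import io


def merge_tweet_csvs(sources, key="id"):
    """Merge CSV streams, keeping the first copy of each tweet id."""
    seen = set()
    merged = []
    for source in sources:
        for row in csv.DictReader(source):
            if row[key] in seen:
                continue
            seen.add(row[key])
            merged.append(row)
    return merged


def write_csv(rows, target, fieldnames):
    """Write the merged rows, with a header, to an open text stream."""
    writer = csv.DictWriter(target, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical per-keyword exports; tweet 2 matches both key terms.
vaccine_csv = io.StringIO(
    "id,date,tweet\n"
    "1,2019-11-01,Get your vaccine\n"
    "2,2019-11-02,Vaccine and vaccination news\n"
)
vaccination_csv = io.StringIO(
    "id,date,tweet\n"
    "2,2019-11-02,Vaccine and vaccination news\n"
    "3,2019-11-30,Vaccination drive\n"
)

rows = merge_tweet_csvs([vaccine_csv, vaccination_csv])
aggregated = io.StringIO()
write_csv(rows, aggregated, ["id", "date", "tweet"])
print(aggregated.getvalue())
```

For real files, the `io.StringIO` objects would be replaced by files opened with `newline=""`, as the `csv` module documentation recommends.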

2.2: Data Analysis:

The aggregated dataset was cleaned and preprocessed for natural language processing using Python 3 scripts. The process is described fully in the Jupyter notebooks hosted in the GitHub repo.
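For readers unfamiliar with tweet preprocessing, a typical cleaning step looks roughly like the sketch below. It is a generic illustration, not the pipeline from the project's notebooks: the function `clean_tweet`, the small stopword set and the exact rules are assumptions.

```python
import re

# A deliberately tiny stopword set; real pipelines use a fuller list.
STOPWORDS = {
    "a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for",
    "at", "rt",
}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet, strip URLs, mentions and symbols, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Removes '#', digits and punctuation; the hashtag word itself is kept.
    text = NON_LETTER_RE.sub(" ", text)
    return [
        token for token in text.split()
        if token not in STOPWORDS and len(token) > 1
    ]


sample = "RT @WHO: Vaccines are SAFE! Read more at https://example.org/x #VaccinesWork"
print(clean_tweet(sample))
```

The resulting token lists are the usual input for word counts, word clouds and topic models.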

2.3: Creation of an online Platform:

Features:

Welcome page detailing the project and documentation on the available APIs.
Locations: the user can select a date range, a specific keyword, a sentiment category and a specific region.
Wordclouds visualize the most frequent words and hashtags used in the posts we have already collected. The higher the frequency, the larger the font size. The user can select a date range in this section as well. Once the user has selected the date range, they choose the category of the keywords.
Influencer analysis. A social media influencer is a user who has established credibility in a certain topic. Influencers usually have access to a large audience and can persuade others by virtue of their authenticity and reach (Pixlee). In order to discover these users, we will create a social graph based on the retweets, mentions and replies between all the users whose posts have been collected. By applying specific algorithms to this graph, we will identify the top-100 influencers.
The user can also view the most propagated tweets. Three options are available: top tweets, top URLs and topic modelling.
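The influencer analysis described above can be illustrated with a very simple ranking. The source does not name the algorithms that will be used, so the sketch below swaps in a plain stand-in: counting the interactions each user receives from others (in-degree in the social graph). The usernames, the `interactions` list and the function `top_influencers` are hypothetical.

```python
from collections import Counter

# Hypothetical interactions: (source user, target user, kind).
interactions = [
    ("ann", "drsmith", "retweet"),
    ("bob", "drsmith", "mention"),
    ("cat", "drsmith", "reply"),
    ("ann", "bob", "reply"),
    ("drsmith", "healthorg", "retweet"),
    ("bob", "healthorg", "retweet"),
    ("bob", "bob", "reply"),  # self-interaction, ignored below
]


def top_influencers(edges, k):
    """Rank users by how many interactions they receive from other users."""
    received = Counter()
    for source, target, _kind in edges:
        if source != target:
            received[target] += 1
    return received.most_common(k)


print(top_influencers(interactions, 2))
```

More refined measures, such as weighting retweets differently from replies or using a centrality score over the whole graph, would slot into the same structure.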

2.4: Continuing work

The dataset will be used for further analysis in future stages of this project, including:

Topic modeling from the dataset
Graph analysis
Machine/deep learning models.
Descriptive analysis of twitter vaccination data with epidemiological data.
Model simulations for assessment of effects of changing vaccine sentiment on outbreaks and disease spread.
Extract high-quality content from the tweets of users that have been identified as key influencers by our system and use it to train an LDA model, which will then be used to classify other users.
Extract topics using topic modelling per location.
Provide a filtering process for identifying polarising tweets.
Develop an iterative methodology that will be built upon the intelligence extracted by the already available high-quality content (top tweets – top URLs) to identify new trends and dynamically update the keywords used to track tweets of specific content.

State of the art

To our knowledge, no active program is currently carrying out qualitative analysis of Twitter data for sentiment associated with vaccination. However, a number of studies have analysed Twitter for social media trends on vaccination.

Our project stands out by creating a dashboard for real-time analysis and by creating datasets that can be used to assess historical trends.

We aim to

Progress report

Our project is subdivided into four distinct phases:

Scoping
Research
Development
Deployment

1. In the scoping phase, we identified the project need as: 'we needed a way to carry out a quantitative analysis of social media sentiment on Vaccination'. We identified Twitter as the target for data gathering and analysis. [two weeks]

2. In the research phase, we reviewed both academic literature and existing code and came up with a strategy for data collection, with a tentative plan for analysis and exploration. [two weeks]

3. Development.

We are currently in the development phase.

Phase one was the creation of the first version of a live dashboard with the capability for Twitter sentiment analysis. [one to two weeks]
A second iteration of the dashboard is currently being created with better visual appeal and interaction, based on user feedback.
Phase two was gathering data for historical analysis of Twitter data. [this took 16 days]
Phase three is the ongoing analysis of the collected data.
Phase four is the development of an online portal to provide end users with an interactive tool to explore the dataset and visualise trends and historical data on vaccination on Twitter. [this is estimated to last approximately 2 to 3 months]

4. Deployment.

The updated live dashboard is estimated to be user-ready by 17th December 2019.
The online tool should be deployed by the end of January 2020.
Epidemiological research work will be ready for dissemination based on our work with other teams.

Stakeholders

Health research institutes, governments and not for profit organizations are only a few of the stakeholders concerned with the rising impact of social media trends on public health.

We aim to partner with researchers from these organisations to make the tools we create available to them for aggregation with epidemiological data.

Ethical considerations

Twitter and social media trends and analysis

Sustainability and scalability

The project is currently run by a volunteer.

Additional team members are actively being sought to support the running of the project.

Communication and dissemination strategy

All the datasets and code from the project will be made openly available on JOGL and GitHub.

A web app with an interactive dashboard for real time analysis of vaccination sentiment on twitter is deployed here.

An open platform for historical analysis of twitter data on vaccination will be developed. The final version will be publicly available by January 2020.

Research papers based on the analysis of the datasets collected will be shared with the public and peers through relevant channels.

Funding

This project has not received any external funding.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Additional information

Short Name: #TwitterVac
Last update: April 17, 2020

Keywords

data science

data visualisation

data analysis

Associated SDGs

Good Health and Well-being