Sentiment Tweets Analysis - Multi-Cloud - Spark ML Model - Prediction & Dashboard

One of the most valuable tools for mining opinions or appraisals of a given subject is the subjective analysis of comments in social media conversations: following the conversation threads to extract the referential words known as tokens. These tokens can be detected, weighted, tabulated, correlated, and modeled, and ultimately used for prediction.

In this way, the opinion of a given audience can be evaluated and even anticipated ("predicted"), with the tokens representing the sentiment expressed in the text of a message or comment. This makes it possible to classify the sentiment of any text.

Within the framework of the Data Science Bootcamp, WeCloudData provided access for this project to an AWS S3 bucket with a dataset as a CSV file containing 1,578,627 tweets, each tagged as "0" for negative or "1" for positive sentiment.

It should be noted that this volume of data ("big data") represents a challenge, since significant computing capacity is required to obtain results in a reasonable time, which may exceed the resources available locally. This calls for a free-access cloud cluster that allows the data to be accessed and processed, a model to be generated, and the results to be saved in a new repository with the same features as the original.

To display the findings from data exploration, analysis, and visualization, a dashboard needs to be developed on a cloud platform or service able to access the repository.

Objectives

  • Use Spark ML on Databricks to train a sentiment classification model.
  • Read the tweets collected from AWS s3 into Databricks, use the trained model to create predictions, and then save them back to s3.
  • Use AWS Athena to create tables.
  • Use AWS QuickSight to build a dashboard.
  • Add the sentiment predictions to the QuickSight dashboard.

Architecture

AWS s3

  • WCD Bucket: WeCloudData gave us access to a CSV file containing a huge set of tweets, each with its text and a label indicating the sentiment it represents. They also gave us valid credentials for their S3 bucket in the AWS cloud.

  • My Bucket: In the same way, the results of our pipeline were stored in our own s3 bucket in the AWS cloud. This includes three files: one containing the clean data from the tweets, a second with the predictions estimated by our model, and a third with the model's performance metrics.

Databricks

  • Spark Cluster: A free-access cluster computing service became necessary due to the large volume of information handled (big data). Databricks Community Edition was selected to control and execute the processes of data access, loading, cleaning, formatting, preprocessing, modeling, estimation, evaluation, and storage.

INSTANCE SETTINGS
1 Driver : 15.3 GB Memory, 2 Cores, 1 DBU 
Runtime : 12.2 LTS (Scala 2.12, Spark 3.3.2) 
Spark config : spark.databricks.rocksDB.fileManager.useCommitService false
Environment variables : PYSPARK_PYTHON=/databricks/python3/bin/python3        

  • Notebook: With the instance running, a notebook was created following a two-stage, multi-step structure.

The first stage is "From WCD s3 to DB", where 4 steps are carried out.

  • In STEP.0 the PySpark session is created.

  • In STEP.1 the dataset is loaded from the WCD s3 bucket.
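The notebook code appears in the original article only as screenshots. A minimal sketch of STEP.0 and STEP.1, assuming s3a access is already configured and using a hypothetical bucket and file name, might look like this:

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` already exists; building one
# explicitly keeps this sketch self-contained outside the platform too.
spark = SparkSession.builder.appName("tweet-sentiment").getOrCreate()

# Hypothetical path: the real WCD bucket name and key are not shown here.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3a://wcd-bucket/tweets.csv"))
raw.printSchema()
```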

  • In STEP.2 the text is cleaned and preprocessed.
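The exact cleaning rules are only visible in the screenshot, so here is a plain-Python sketch of typical tweet cleaning (lowercasing, stripping URLs, mentions, and punctuation); the same regular expressions can be applied at scale with Spark's regexp_replace:

```python
import re

URL_RE     = re.compile(r"https?://\S+")   # links
MENTION_RE = re.compile(r"@\w+")           # @user handles
NON_ALPHA  = re.compile(r"[^a-z\s]")       # punctuation, digits, emoji
SPACES     = re.compile(r"\s+")            # runs of whitespace

def clean_tweet(text: str) -> str:
    """Lowercase, strip URLs/mentions/punctuation, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA.sub(" ", text)
    return SPACES.sub(" ", text).strip()

print(clean_tweet("@user I LOVE this!! http://t.co/xyz"))  # → "i love this"
```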

  • In STEP.3 the pipeline is executed to create the model and obtain the results. A logistic regression classifier is trained to predict the sentiment; to make the model reusable for predicting the sentiment of future tweets, all the transformers and estimators, including NGram, VectorAssembler, and ChiSqSelector, are placed in a single Pipeline object.
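A sketch of such a pipeline, with assumed column names (`clean_text`, `label`), an assumed feature budget, and a cleaned DataFrame `df` as input, could look like:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import (ChiSqSelector, CountVectorizer, NGram,
                                StopWordsRemover, Tokenizer, VectorAssembler)

# Unigram and bigram term counts, assembled and pruned by chi-square.
tokenizer = Tokenizer(inputCol="clean_text", outputCol="words")
remover   = StopWordsRemover(inputCol="words", outputCol="filtered")
bigrams   = NGram(n=2, inputCol="filtered", outputCol="bigrams")
cv_uni    = CountVectorizer(inputCol="filtered", outputCol="uni_tf")
cv_bi     = CountVectorizer(inputCol="bigrams", outputCol="bi_tf")
assembler = VectorAssembler(inputCols=["uni_tf", "bi_tf"],
                            outputCol="raw_features")
selector  = ChiSqSelector(numTopFeatures=2000, featuresCol="raw_features",
                          outputCol="features", labelCol="label")
lr        = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, remover, bigrams,
                            cv_uni, cv_bi, assembler, selector, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)  # df: cleaned tweets
model = pipeline.fit(train)
pred  = model.transform(test)
```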

Once the predictions were obtained, the model could be evaluated by calculating the "Accuracy" and "ROC-AUC" metrics.
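With Spark ML this evaluation is typically one call per metric; a sketch, assuming the prediction DataFrame `pred` and column names from the pipeline above:

```python
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# Share of correct predictions.
acc = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction",
    metricName="accuracy").evaluate(pred)

# Area under the ROC curve, from the raw scores.
auc = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction",
    metricName="areaUnderROC").evaluate(pred)

print(f"Accuracy: {acc:.3f}  ROC-AUC: {auc:.3f}")
```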

The second stage is "From DB to GAB s3", where 5 steps are carried out.

  • First, "Connect with My s3" establishes the secure connection with my s3 bucket.
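One common way to do this from a Databricks notebook is to set the s3a credentials on the Hadoop configuration; the key names below are placeholders, and in practice secrets should come from `dbutils.secrets.get` rather than being hard-coded:

```python
# Placeholder credentials -- never hard-code real keys in a notebook.
ACCESS_KEY = "<my-access-key>"
SECRET_KEY = "<my-secret-key>"

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", ACCESS_KEY)
hconf.set("fs.s3a.secret.key", SECRET_KEY)

# Quick sanity check: list the bucket before writing to it.
display(dbutils.fs.ls("s3a://my-bucket/"))
```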

  • The processed data is then saved as tweets.CSV in my S3 Bucket.

  • The predictions are then saved as pred.parquet in my S3 Bucket.

  • In the same way, the model's performance metrics are saved as perform.CSV in my S3 Bucket.
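The three writes above can be sketched as follows, with hypothetical bucket and prefix names; note that Spark writes each output as a directory of part files rather than a single file:

```python
# Clean tweets as CSV.
clean_df.write.mode("overwrite").option("header", "true") \
        .csv("s3a://my-bucket/tweets/")

# Predictions as Parquet (keeps types, queries efficiently in Athena).
pred.select("text", "label", "prediction") \
    .write.mode("overwrite").parquet("s3a://my-bucket/pred/")

# Performance metrics as CSV.
metrics_df.write.mode("overwrite").option("header", "true") \
          .csv("s3a://my-bucket/perform/")
```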

  • Finally, the performance of the model is visualized.

AWS Athena

  • Database: At this point, our s3 bucket held three big-data files: the clean original dataset, the model predictions, and the model performance metrics. The next step was to use "AWS Athena", whose Hive-compatible SQL interface manages the storage and querying of large amounts of data directly in s3.
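A sketch of creating one such table programmatically; the database, table, and column names are illustrative (the real schema follows the files written to s3), and the DDL could equally be run in the Athena console:

```python
import boto3

# Hive-style DDL for the Parquet predictions file (illustrative schema).
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_db.pred (
  text STRING,
  label INT,
  prediction DOUBLE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/pred/'
"""

athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "tweets_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```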

  • Table -> MySQL Workbench: Along these lines, the SQL code used to create the data tables was preserved as a MySQL Workbench file, documenting the format and structure of the big data.

AWS QuickSight

  • Data

First, a “word cloud” was created from the tweets' text, where each word's size indicates how many times it appears in the dataset. In addition, two top-3 lists were presented: one for the original data and another for the predictions.

In the second element, another “word cloud” was created showing the tokens extracted from the tweets, together with their total counts. The top three and bottom three tokens found were also presented.

Between these sections, two ring charts were presented, showing the percentages of positive and negative sentiment, both for the actual data and for the predictions.

Below is a "Matrix Plot" of predicted versus actual sentiment. It shows the cross-tabulation of the positive and negative options, and its color scale makes it easy to see which combinations are most frequent.

In the same vein, a "Sankey Plot" shows how many of the actual cases kept their label in the prediction (Neg to Neg or Pos to Pos), as well as how many changed (Neg to Pos or Pos to Neg).

  • Calculated Fields

To finish this “Dashboard”, it is necessary to show some performance metrics that were not obtained directly from the model but can be calculated from the predicted data.

These are "Recall", "Precision", and "F1 Score", which can be computed because it was possible to count, for both sentiment options, the number of cases where the predicted sentiment changed and where it did not.
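The logic behind those QuickSight calculated fields can be sketched in plain Python from the confusion-matrix counts (the counts below are illustrative, not the project's actual figures):

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from confusion-matrix counts.

    tp: positives predicted positive; fp: negatives predicted positive;
    fn: positives predicted negative.
    """
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only:
print(classification_metrics(tp=80, fp=20, fn=20))
```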

  • Dashboard

In this way, the development of the “Dashboard” could be completed, leaving the final result as shown below.

Challenges

Human Limitations

As the person responsible for the successful execution of this project, the experience gained throughout the Data Science Bootcamp gave me the skills and expertise needed to complete it on time. Thanks to that, the task was feasible even within the time limitations of a part-time participant.

Databricks Limitations

One of the most important limitations was that I couldn't access the menu bar in the new version of Databricks until I identified the best browser to use.

Likewise, the free Community Edition allows only one hour of running time on its EC2 instances, which was not enough to process the big data. Because of that, the code was optimized to reduce the time required to finish.

Another situation to deal with was that %run magic commands are not supported in multi-language notebook cells. This issue required adapting the code to the libraries and functions allowed on the platform.

It was also necessary to follow a connect, test-the-connection, disconnect procedure with AWS, because only one S3 bucket could be accessed at a time; it is not possible to keep active connections to multiple S3 buckets.

Athena & QuickSight

The first time you configure s3 as a data source with Athena, the process is more complex than expected.

When the data finally became accessible, the array format became the next challenge, because data management turns tedious under Athena's restriction of one SQL query per tab. This was solved by using the "parquet" format for the data where more efficient handling was required.

After testing different ways to set up data sources and visualize the data in QuickSight, it was possible to connect Athena properly and select the set of graphs and visual elements that allow the data to be understood, summarizing the dataset and the model's performance and conveying the information neatly.

Conclusions

The potential of cloud resources is vast, and a multi-cloud heterogeneous architecture can be deployed effectively. The sentiment, both actual and predicted, is predominantly positive, and the prediction quality is good overall, with an accuracy of 0.86 and an F1 score of 0.79. Now more than ever, with the necessary knowledge, it is possible to take advantage of the key features and services that providers offer.

Next Steps

For this project, we started with a dataset provided by WeCloudData whose records were labeled as positive or negative.

The first option in this sense is to try to detect the sentiment of each text and classify it including a "neutral sentiment" option.

The second initiative is to create a new dashboard on another platform such as "Tableau".

A third idea is to include elements in the architecture that allow data to be acquired, processed, and displayed dynamically.

Acknowledgment

As a part-time participant in the WeCloudData Data Science Bootcamp, time is a significant constraint, and meeting it reflects my effort, commitment, discipline, and confidence. It also took best practices and patience from the instructors, the TA team (always on call), and my partners on this journey to success. Without them, this result would not have been achieved.

Sincerely,

THANKS FOR ALL

