Tweet Sentiment Analysis - Multi-Cloud - Spark ML Model - Prediction & Dashboard
One of the most valuable ways to mine opinions about a given subject is the subjective analysis of comments on social media, following the conversation threads to extract the key words known as tokens. These tokens can be detected, weighted, tabulated, correlated, and modeled, and finally used for prediction.
In this way, the opinion of a given audience can be evaluated and even anticipated ("predicted"), with the tokens representing the sentiment expressed in the text of a message or comment. This makes it possible to classify the sentiment of any text.
Within the framework of the Data Science Bootcamp, WeCloudData provided access for this project to an AWS S3 bucket holding the dataset as a CSV file of 1,578,627 tweets, each tagged "0" for negative or "1" for positive sentiment.
It should be noted that this volume of data ("big data") represents a challenge, since significant computing capability is required to obtain results in a reasonable time, and that may exceed the resources available locally. This calls for a free-access cloud cluster that can read the data, process it, generate a model, and save the results in a new repository with the same features as the original.
To display the findings in data exploration, analysis, and visualization, a dashboard needs to be developed using a cloud platform or service with the ability to access the repository.
Objectives
Architecture
AWS s3
Databricks
INSTANCE SETTINGS
1 Driver : 15.3 GB Memory, 2 Cores, 1 DBU
Runtime : 12.2 LTS (Scala 2.12, Spark 3.3.2)
Spark config : spark.databricks.rocksDB.fileManager.useCommitService false
Environment variables : PYSPARK_PYTHON=/databricks/python3/bin/python3
The first stage is "From WCD S3 to DB", where 4 steps are carried out.
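A minimal PySpark sketch of this stage, assuming a Spark environment with S3 access and hypothetical names for the bucket path and columns (`s3a://wcd-bucket/tweets.csv`, `text`, `sentiment`):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# 1. Read the labeled tweets CSV from the WCD S3 bucket (path is illustrative).
df = spark.read.csv("s3a://wcd-bucket/tweets.csv", header=True, inferSchema=True)

# 2. Split into training and held-out sets.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# 3. Build a text-classification pipeline: tokenize, vectorize, fit a classifier.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="sentiment"),
])
model = pipeline.fit(train)

# 4. Generate predictions on the held-out set.
predictions = model.transform(test)
```

The classifier shown is a placeholder; any Spark ML classifier would slot into the same pipeline.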
Once the predictions were found, the model could be evaluated by calculating "Accuracy" and "ROC-AUC" metrics.
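Both metrics can be computed directly from (label, score) pairs. A minimal pure-Python sketch of what they measure (the sample data below is illustrative, not the project's results):

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation:
    the probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(labels, preds))  # 0.6
print(roc_auc(labels, scores))  # 0.8333333333333334
```

In the actual pipeline these values come from Spark ML evaluators over the full predictions DataFrame, but the arithmetic is the same.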
The second stage is "From DB to GAB S3", where 5 steps are carried out.
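The core of this stage is persisting the results to the second bucket. A sketch, assuming a `predictions` DataFrame from the model stage and an illustrative bucket path:

```python
# Write the predictions back to the destination S3 bucket (path and column
# names are illustrative). Parquet preserves types and is efficient to query
# later from Athena.
(predictions
    .select("text", "sentiment", "prediction")
    .write
    .mode("overwrite")
    .parquet("s3a://gab-bucket/tweet-predictions/"))
```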
AWS Athena
AWS QuickSight
First, a “word cloud” was created from the tweets' text, where each word's size indicates how many times it appears in the dataset. In addition, two top-3 lists were presented: one for the original data and another for the predictions.
In the second element, another “word cloud” was created showing the tokens extracted from the tweets, along with their total count. The 3 top-ranked and 3 bottom-ranked tokens were also presented.
Between these sections, two ring charts show the percentages of positive and negative sentiment, both for the actual data and for the predictions.
Below is a "Matrix Plot" of predicted versus actual sentiment. It shows the cross-connections between the positive and negative options, and a color scale makes it easy to see which combinations are most frequent.
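The counts behind such a matrix can be tallied directly from (actual, predicted) pairs. A pure-Python sketch with illustrative data:

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) combination, e.g. ('Pos', 'Neg')."""
    return Counter(zip(actual, predicted))

actual    = ["Pos", "Pos", "Neg", "Neg", "Pos", "Neg"]
predicted = ["Pos", "Neg", "Neg", "Neg", "Pos", "Pos"]

matrix = confusion_counts(actual, predicted)
for (a, p), n in sorted(matrix.items()):
    print(f"{a} -> {p}: {n}")
# Neg -> Neg: 2
# Neg -> Pos: 1
# Pos -> Neg: 1
# Pos -> Pos: 2
```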
In the same vein, a "Sankey Plot" shows how many of the actual cases kept their label in the prediction (Neg to Neg or Pos to Pos), as well as how many changed (Neg to Pos or Pos to Neg).
To finish the “Dashboard”, some performance metrics are needed that were not obtained directly from the model but can be calculated from the predicted data.
These are "Recall", "Precision", and "F1 Score"; they follow from counting, for each sentiment option, how many cases kept their label and how many changed.
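Given those counts, the three metrics reduce to simple ratios. A sketch with illustrative figures (not the project's numbers):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics from confusion-matrix counts:
    tp = true positives, fp = false positives, fn = false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.889 0.842
```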
In this way, the development of the “Dashboard” could be completed, with the final result shown below.
CHALLENGES
Human Limitations
As the person responsible for the successful execution of this project, the experience gained in the Data Science Bootcamp gave me the skills and expertise needed to complete it on time. Thanks to that, the task was feasible even within the time limitations of a part-time participant.
Databricks Limitations
One of the most important limitations was that I could not access the menu bar in the new version of Databricks until I identified the best browser to use.
Likewise, the free Community edition of Databricks allows only 1 hour of running time on its EC2 instances, which was not enough to process the big data. Because of that, the code was optimized to reduce the time required to finish.
Another situation to deal with was that the %run magic command does not work in multi-language notebook cells. This issue required adapting the code to the libraries and functions allowed on the platform.
It was necessary to follow a connect, test-the-connection, and disconnect process with AWS, because only one S3 bucket could be accessed at a time; there cannot be active connections to multiple S3 buckets simultaneously.
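A sketch of that connect-test-disconnect cycle using Databricks utilities; the bucket names, mount points, and secret-scope keys are hypothetical, and `dbutils` is only available inside a Databricks notebook:

```python
# Databricks-notebook sketch: mount one bucket, verify it, work with it,
# unmount it, then repeat for the next bucket (all names are illustrative).
ACCESS_KEY = dbutils.secrets.get(scope="aws", key="access_key")
SECRET_KEY = dbutils.secrets.get(scope="aws", key="secret_key")  # must be URL-encoded

def use_bucket(bucket, mount_point):
    # Connect: mount the bucket under DBFS.
    dbutils.fs.mount(
        source=f"s3a://{ACCESS_KEY}:{SECRET_KEY}@{bucket}",
        mount_point=mount_point,
    )
    # Test: listing the mount confirms the connection works.
    display(dbutils.fs.ls(mount_point))

def release_bucket(mount_point):
    # Disconnect before switching to the other bucket.
    dbutils.fs.unmount(mount_point)

use_bucket("wcd-bucket", "/mnt/wcd")
# ... read the source data ...
release_bucket("/mnt/wcd")

use_bucket("gab-bucket", "/mnt/gab")
# ... write the predictions ...
release_bucket("/mnt/gab")
```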
Athena & QuickSight
The first time S3 is configured as a data source for Athena, the process is somewhat complex.
Once the data is finally accessible, the array format becomes the next challenge: data management turns tedious because Athena allows only one SQL query per tab. This was solved by using the Parquet format for the data wherever more efficient handling was required.
After testing different ways to set up data sources and visualize the data in QuickSight, it was possible to configure Athena properly and select the set of graphs and visual elements that make the data understandable, summarizing the dataset and the model's performance and conveying the information neatly.
Conclusions
The potential of cloud resources is limitless, and a multi-cloud, heterogeneous architecture can be deployed effectively. The sentiment, both actual and predicted, is predominantly positive. The overall sentiment prediction is good, with an accuracy of 0.86 and an F1 score of 0.79. Now more than ever, with the necessary knowledge, it is possible to take advantage of the key features and services that providers offer.
Next Steps
For this project, we started with a dataset provided by WeCloudData with records labeled as positive or negative.
The first option in this sense is to try to detect the sentiment of each text and classify it, including a “neutral sentiment” option.
The second initiative is to create a new Dashboard to include another platform like “Tableau”.
A third idea is to include elements in the architecture that allow data to be acquired, processed, and displayed dynamically.
Acknowledgment
As a part-time participant in the WeCloudData Data Science Bootcamp, time is a significant constraint that tested my effort, commitment, discipline, and confidence. It also required best practices and patience from the instructors, the TA team (always on call), and my partners on this journey to success. Without them, this result would not have been achieved.
Sincerely
THANKS TO ALL