Text analysis on WEB 3.0 using R & Python.
Let me take you to a situation: you are a group of non-tech people with no prior coding experience, and you have been assigned your first business analytics project on text analysis, which includes the following tasks:
· Extraction of the dataset by scraping or using structured data available on the internet for your topic.
· Data cleaning and pre-processing.
· Extracting and formatting 5 keywords.
· Identifying unique words in the vocabulary.
· Using Association rule mining (ARM) to understand the relations between the keywords.
This was our first-hand experience with Python and business analytics.
So, we had two options: panic or work hard!
The topic we decided on for our dataset was Web 3.0. We could not find any structured dataset on the internet for our topic, so we went with data scraping for data extraction. After many failed attempts and tools, we came across Snscrape, which helped us scrape the required dataset.
Okay, enough talking; now let's get into the process, code, and analytical part of this project.
Tools we used:
· Python
o Data extraction
o Data cleaning and pre-processing
o Extracting and formatting 5 keywords
· R programming
o Association rule mining (ARM)
Then we installed and used the following packages:
· pandas: A Python library for data manipulation and analysis, with tools for working with tabular data, time series data, and more.
· snscrape: A Python library for scraping social media data from platforms like Twitter, Instagram, and Reddit.
· numpy: A popular Python library for numerical computing, with tools for working with large arrays and matrices of numeric data.
· collections: A built-in Python library providing useful data structures, including deque, defaultdict, Counter, and OrderedDict.
· gensim.models: A Python library for topic modeling and natural language processing, with tools for building and training various types of models.
· matplotlib.pyplot: A module within the matplotlib library providing tools for creating data visualizations in Python.
· spacy: A Python library for natural language processing, with tools for tokenization, parsing, named entity recognition, and more.
· nltk: The Natural Language Toolkit, a Python library for working with human language data, providing tools for tokenization, stemming, tagging, parsing, and more.
· itertools: A built-in Python library providing tools for working with iterators and iterable objects, including functions for generating combinations, permutations, and other useful operations.
· multiprocessing: A built-in Python library providing a way to run multiple processes in parallel, allowing for efficient use of multi-core processors and speeding up certain types of computations.
· sklearn.decomposition: A module within scikit-learn that provides implementations of matrix factorization techniques such as PCA, NMF, and ICA, for feature extraction and dimensionality reduction in machine learning.
· sklearn.manifold: A module within the scikit-learn library providing implementations of manifold learning techniques for dimensionality reduction and visualization of high-dimensional datasets.
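If you are setting up from scratch, the third-party packages above can be installed with pip; a typical setup (package names as they appear on PyPI) looks like this:
pip install pandas snscrape numpy gensim matplotlib spacy nltk scikit-learn
# spaCy also needs an English language model for lemmatization
python -m spacy download en_core_web_sm
The built-in modules (collections, itertools, multiprocessing) ship with Python and need no installation.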
Extraction of dataset
We came across Snscrape, a great package that lets you extract data from various social media platforms.
Following is the code for writing the search query and scraping the tweets.
import snscrape.modules.twitter as sntwitter
import pandas as pd

query = "web3.0 lang:en since:2022-01-01 until:2022-01-31"
tweets = []
limit = 1000

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    # print(vars(tweet))  # uncomment to inspect all available tweet attributes
    # break
    if len(tweets) == limit:
        break
    else:
        tweets.append([tweet.content])

# store the scraped tweets in a DataFrame; the cleaning steps below use the 'Text' column
dataframe = pd.DataFrame(tweets, columns=['Text'])
In our case we extracted data from Twitter, but you can change that. In the query we describe the requirements for the dataset: we collected 1,000 data points in the English language, and we only asked for the tweet text because we did not need any other information for our text analysis. You can make the necessary changes to the query and the collected fields as per your requirements, for example user id, date-time, likes, and comments.
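As a rough sketch of what collecting extra fields could look like (the attribute names below are what snscrape's Tweet objects typically expose; treat them as assumptions and check print(vars(tweet)) for your version):
import snscrape.modules.twitter as sntwitter
import pandas as pd

query = "web3.0 lang:en since:2022-01-01 until:2022-01-31"
rows = []
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(rows) == 1000:
        break
    # hypothetical extra fields; verify the exact attribute names for your snscrape version
    rows.append([tweet.date, tweet.user.username, tweet.content, tweet.likeCount])

richer_dataframe = pd.DataFrame(rows, columns=['Date', 'User', 'Text', 'Likes'])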
Data cleaning and pre-processing
Further in the process, we used various libraries such as pandas and NLTK for text cleaning and pre-processing.
Following is the code for:
o Removing URLs
o Removing numbers
o Removing Twitter handles (@user)
o A function to remove patterns in the input text
o Removing special characters, numbers, and punctuation
o Lemmatizing and removing stopwords
import re
import numpy as np

# removing URLs
def cleaning_URLs(data):
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', data)

dataframe['Text'] = dataframe['Text'].apply(lambda x: cleaning_URLs(x))

# removing numbers
def cleaning_numbers(data):
    return re.sub('[0-9]+', '', data)

dataframe['Text'] = dataframe['Text'].apply(lambda x: cleaning_numbers(x))

# function to remove a pattern in the input text
def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for word in r:
        input_txt = re.sub(word, "", input_txt)
    return input_txt

# remove twitter handles (@user)
dataframe['Text'] = np.vectorize(remove_pattern)(dataframe['Text'], r"@[\w]*")

# remove special characters, numbers, and punctuation
dataframe['Text'] = dataframe['Text'].str.replace("[^a-zA-Z#]", " ", regex=True)
dataframe.head()
def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec uses context words to learn the vector representation of a target word;
    # if a sentence is only one or two words long, the benefit of the training is very small
    if len(txt) > 2:
        return ' '.join(txt)

brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in dataframe['Text'])
The provided code includes several functions that perform common text pre-processing tasks to clean and standardize text data in the "Text" column of a data frame for text analysis. The functions remove URLs, numeric characters, and Twitter handles using regular expressions, replace special characters and punctuation with spaces, and perform lemmatization and stop word removal using spaCy. These pre-processing steps help to remove irrelevant or noisy information from the text and prepare it for further analysis. However, the cleaning function is defined but not used in the provided code.
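If you do want to apply the cleaning function (the vocabulary step later reads a df_clean['clean'] column), a minimal sketch using spaCy's nlp.pipe could look like this, assuming the small English model is installed:
import spacy
import pandas as pd

# load a small English pipeline; the parser and NER are not needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

# run the lightly cleaned strings through spaCy in batches and lemmatize them
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=500)]

# keep only the non-empty results in a new data frame with a 'clean' column
df_clean = pd.DataFrame({'clean': txt}).dropna().drop_duplicates()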
· Applying stopwords
Then we used NLTK's stopwords list to remove unnecessary words.
Following is the code:
import csv
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# read the cleaned tweets
with open('/content/new file of Analytics Project.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

stop_words = set(stopwords.words('english'))
# add additional words to exclude
stop_words.update(['one', 'two', 'to', 'three', 'this', 'are', 'the'])

filtered_data = []
for row in data:
    filtered_row = [word for word in row if word.lower() not in stop_words]
    filtered_data.append(filtered_row)

with open('/content/stopper file.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(filtered_data)
· Extracting and formatting 5 keywords
After removing stopwords, we were mostly left with keywords. Here we extracted 5 keywords per tweet and formatted them.
The following code was used to extract and format the keywords.
import csv
import re

with open('/content/stopper file.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

def extract_keywords(text):
    keywords = []
    # modify the regular expression to match the format of your keywords
    pattern = r'\b(keyword1|keyword2|keyword3|keyword4|keyword5)\b'
    matches = re.findall(pattern, text, re.IGNORECASE)
    for match in matches:
        keywords.append(match.lower())
    return keywords

keyword_data = []
for row in data:
    keywords = extract_keywords(' '.join(row))
    keyword_data.append(keywords)

with open('/content/KEYWORD.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(keyword_data)
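For example, with hypothetical Web3-related keywords (purely illustrative; swap in whichever terms matter for your analysis), the pattern could be:
pattern = r'\b(web3|blockchain|crypto|nft|metaverse)\b'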
# preview the extracted keywords for each row using the same extract_keywords function
for row in data:
    keywords = extract_keywords(' '.join(row))
    print(keywords)
import csv
import re
from collections import Counter
from nltk.corpus import stopwords

with open('/content/stopper file.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

def extract_top_keywords(text):
    stop_words = set(stopwords.words('english'))
    words = re.findall(r'\w+', text.lower())
    filtered_words = [word for word in words if word not in stop_words and not word.isnumeric()]
    keyword_count = Counter(filtered_words)
    # keep the 5 most frequent keywords per row
    return [keyword for keyword, count in keyword_count.most_common(5)]

with open('/content/NEW FILE.csv', mode='w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for row in data:
        top_keywords = extract_top_keywords(' '.join(row))
        writer.writerow(top_keywords)
· Identifying unique words in the vocabulary
We also identified the unique words in the vocabulary and how often they were repeated across the tweets.
The following code was used:
import itertools
import matplotlib.pyplot as plt

# How many unique words are in the vocabulary?
all_words = " ".join([sentence for sentence in df_clean['clean']])
all_words = all_words.split()

freq_dict = {}
for word in all_words:
    # set the default value to 0
    freq_dict.setdefault(word, 0)
    # increment the value by 1
    freq_dict[word] += 1

voc_freq_dict = dict(sorted(freq_dict.items(), key=lambda item: item[1], reverse=True))
print(len(voc_freq_dict))

# top 10 words with frequency
hist_plot = dict(itertools.islice(voc_freq_dict.items(), 10))
plt.bar(hist_plot.keys(), hist_plot.values(), width=0.5, color='g')
plt.xticks(rotation=90)
plt.show()
This code creates a bar plot of the top 10 words and their frequencies from the dictionary "voc_freq_dict". The "itertools.islice()" call slices the dictionary to select the top 10 key-value pairs by frequency. "plt.bar()" then builds the bar plot, with the selected words on the x-axis, their frequencies on the y-axis, and green bars. "plt.xticks(rotation=90)" rotates the x-axis tick labels by 90 degrees for readability, and "plt.show()" displays the plot.
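Since collections.Counter is already in the package list above, the same frequency dictionary can also be built more concisely; a small sketch:
from collections import Counter

voc_freq = Counter(all_words)               # word -> frequency
print(len(voc_freq))                        # number of unique words
hist_plot = dict(voc_freq.most_common(10))  # top 10 words for the bar plot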
Result
We used R for visualization, plotting, and Association rule mining (ARM).
· We used R code to perform data manipulation and visualization tasks on a data frame.
· To filter incomplete cases.
· To convert columns to appropriate data types and create new columns.
· Used ggplot2 to create histograms and bar charts to visualize the data.
· Finally, we grouped the data by specific columns, summarized it using row counts, and visualized the results using bar charts.
· The aim was to explore the relationships between different variables in the data frame.
Following is the R code.
## Association Rule
library(plyr)
library(arules)
library(arulesViz)

NEW_FILE_sorted <- NEW_FILE2[order(NEW_FILE2$text),]

# collapse the keywords of each tweet into a single comma-separated basket
itemList <- ddply(NEW_FILE2, c("text", "Date"),
                  function(df1) paste(df1$X.3, collapse = ","))

itemList$text <- NULL
itemList$Date <- NULL
colnames(itemList) <- c("items")
write.csv(itemList, "NEW_FILE.csv", quote = FALSE, row.names = TRUE)

# read the baskets back in as transactions
tr <- read.transactions('NEW_FILE.csv', format = 'basket', sep = ',')
tr
summary(tr)
itemFrequencyPlot(tr, topN = 20, type = 'absolute')

# first pass: high support, very low confidence threshold
rules <- apriori(tr, parameter = list(supp = 0.05, conf = 0.0001))
rules <- sort(rules, by = 'confidence', decreasing = TRUE)
summary(rules)
inspect(rules[1])

# second pass: lower support, higher confidence threshold
rules <- apriori(tr, parameter = list(supp = 0.005, conf = 0.1))
rules <- sort(rules, by = 'confidence', decreasing = FALSE)
summary(rules)
inspect(rules[1:10])

rules <- sort(rules, by = 'lift', decreasing = TRUE)
inspect(rules[1:10])
length(rules)

topRules <- rules[1:10]
plot(topRules)
plot(rules)
plot(topRules, method = "graph")
plot(rules, method = "graph")
plot(topRules, method = "grouped")
Association rule mining (ARM)
ARM discovers if-then rules between items that frequently occur together in transaction data; in our case, each tweet's keyword set is a transaction, and a rule such as {blockchain} => {web3} would mean that tweets containing "blockchain" also tend to contain "web3". Rules are evaluated with three standard metrics: support (the fraction of transactions containing all items in the rule), confidence (how often the right-hand side appears given the left-hand side), and lift (how much more often the two sides occur together than expected if they were independent). The apriori() calls above search for all rules meeting the chosen support and confidence thresholds, which lets us focus on interpreting the relationships between keywords rather than enumerating combinations by hand.
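A tiny worked example of these metrics, using made-up numbers purely for illustration:
# hypothetical counts: 1000 tweets, 200 contain "blockchain", 150 contain "web3",
# and 120 contain both (illustrative numbers, not results from our dataset)
n_total = 1000
n_lhs = 200      # tweets with "blockchain"
n_rhs = 150      # tweets with "web3"
n_both = 120     # tweets with both

support = n_both / n_total                # 0.12
confidence = n_both / n_lhs               # 0.60
lift = confidence / (n_rhs / n_total)     # 0.60 / 0.15 = 4.0

print(support, confidence, lift)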
While ARM is a powerful tool, it is not a replacement for human judgment. Domain knowledge is still necessary to choose sensible support and confidence thresholds, interpret the resulting rules, and decide whether they are meaningful for a particular application rather than artifacts of the data.
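If you would rather stay entirely in Python, the same kind of analysis can be sketched with the mlxtend library (not what we used; a rough equivalent, assuming keyword lists like the ones written to KEYWORD.csv):
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

# each tweet's keyword list is one "transaction" (illustrative input)
transactions = [['web3', 'blockchain'], ['web3', 'nft'], ['blockchain', 'crypto', 'web3']]

# one-hot encode the transactions
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# mine frequent itemsets, derive rules, and rank them by lift
itemsets = apriori(onehot, min_support=0.05, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.1)
print(rules.sort_values('lift', ascending=False).head(10))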