From the course: AI Projects with Python, TensorFlow, and NLTK
Increasing the accuracy of your analysis
- [Instructor] In this video, we'll be creating a more advanced sentiment analysis model instead of using the pre-trained analyzer. So again, we'll start off by importing the required modules. We import nltk along with the movie_reviews module, a corpus of movie reviews included with nltk, which we can use to train and test our classifier. Then we import the NaiveBayesClassifier, a class within nltk.classify, which we'll use to classify the documents. We also import accuracy, which will help us evaluate how well our classifier performs.

Let's move on now and prepare the documents we need. If you remember, we imported the movie_reviews module, which has all the documents we need. So all this part of the code does is prepare the data set by creating a list of documents, where each document is represented as a tuple: the first element of the tuple is the list of words in that document, and the second element is the category of the document, whether it's positive or negative.

Next, we shuffle the documents. We do this to randomize their order, so that we reduce any bias that comes from the way the corpus is organized.

Then we define the feature extractor by creating the document_features function. The feature extractor is a function that takes a document as input and returns a dictionary of that document's features. In this case, the features are the 2,000 most common words in the movie_reviews corpus, and the value of each feature is whether or not that word appears in the document.

Now we can actually start to train our classifier. To do so, we first create a list of feature sets, where each feature set is a tuple consisting of a dictionary of features, which we created earlier, and a category.
We then split the feature sets into a training set and a test set. Finally, we train the NaiveBayesClassifier on the training set. Now, you may be wondering what the NaiveBayesClassifier is. It's simply a probabilistic classifier based on applying Bayes' theorem, which you can look at in more detail if you're interested. Essentially, it works by making strong independence assumptions between the features we've created.

Then we can go ahead and test the accuracy of our classifier. We've trained the classifier, and now we can check how accurate it actually is. The test data also comes from the movie_reviews data set, which we imported earlier. Then we'll print out the most informative features, that is, the features that are most effective at distinguishing between the categories. For each feature, the output shows the ratio of its occurrence in each category, along with which category it's more indicative of.

Let's go ahead and run this to see what we have done. First of all, we see the accuracy. This is a number from 0 to 1, and it represents the proportion of the test set that our classifier was able to label correctly. In our case, that was 86%, which is pretty decent. Then we see the most informative features, which we printed out right at the end, and which help us distinguish between the categories. We see the ratio of occurrence in each category, along with which category each feature is most indicative of. So we see that if a review includes the word outstanding, there's a very high chance that it's actually positive. And if it includes the word wonderfully, again, there's a really high chance that it's positive. Whereas if it includes the word seagull, there's a really high chance that it's actually negative, because the ratio of negative to positive occurrences is much higher.
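The training and evaluation calls can be illustrated on a tiny hand-made data set. The toy feature dicts below are invented purely to show the API; the course code trains on the movie-review feature sets instead, which is why its accuracy (86%) and informative features differ.

```python
from nltk.classify import NaiveBayesClassifier, accuracy

# Toy feature sets standing in for the movie-review ones.
train_set = [
    ({'contains(great)': True,  'contains(awful)': False}, 'pos'),
    ({'contains(great)': True,  'contains(awful)': False}, 'pos'),
    ({'contains(great)': False, 'contains(awful)': True},  'neg'),
    ({'contains(great)': False, 'contains(awful)': True},  'neg'),
]
test_set = [
    ({'contains(great)': True,  'contains(awful)': False}, 'pos'),
    ({'contains(great)': False, 'contains(awful)': True},  'neg'),
]

# Train on the training set, then measure accuracy on the held-out test set.
classifier = NaiveBayesClassifier.train(train_set)
print('Accuracy:', accuracy(classifier, test_set))

# Features ranked by how strongly they separate the categories,
# shown as a ratio of occurrence (e.g. pos : neg = 4.0 : 1.0).
classifier.show_most_informative_features(2)
```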