Let’s do some sentiment classification with logistic regression! We’ll use the scikit-learn package (sklearn) to fit the logistic regression, as well as for train-test splitting.
import nltk
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

nltk.download("movie_reviews")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("vader_lexicon")
Approach 1: Sentiment Analyzer
For our first approach, we’ll use a sentiment analyzer to extract some features for the logistic regression. According to the VADER documentation, the sentiment scores we get back will be the proportion of words in a review that were in the positive, negative, and neutral lexicon lists.
from nltk.corpus import movie_reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer
We’ll grab all of the review ids and labels, to then split them into training and test sets.
all_ids = list(movie_reviews.fileids())
all_labels = [x.split("/")[0] for x in all_ids]
The train_test_split() function will give us nice train/test divisions of the data.
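As a rough sketch of what that split might look like: the toy ids below stand in for the real all_ids, and the variable names (train_ids, test_ids, train_labels, test_labels), the 80/20 split, and the random_state value are all assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for all_ids and all_labels (hypothetical file ids).
toy_ids = [f"pos/review_{i}.txt" for i in range(5)] + [
    f"neg/review_{i}.txt" for i in range(5)
]
toy_labels = [x.split("/")[0] for x in toy_ids]

# Hold out 20% of the reviews for testing; stratify keeps the
# pos/neg balance the same in both halves.
train_ids, test_ids, train_labels, test_labels = train_test_split(
    toy_ids,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,
    random_state=517,
)

print(len(train_ids), len(test_ids))
```

Fixing random_state makes the split reproducible, which helps when comparing different feature sets on the same held-out reviews.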
What we’ll need to fit the logistic regression is an array of feature values with 1 row per review, and one column per sentiment value. Here’s a function that’ll do that.
def make_features(X):
    whole_reviews = [movie_reviews.raw(fileids=id) for id in X]
    review_words = [movie_reviews.words(fileids=id) for id in X]
    sent_analyzer = SentimentIntensityAnalyzer()
    all_sentiments = [sent_analyzer.polarity_scores(r) for r in whole_reviews]
    neg_array = np.array([a_sentiment["neg"] for a_sentiment in all_sentiments])
    pos_array = np.array([a_sentiment["pos"] for a_sentiment in all_sentiments])
    neu_array = np.array([a_sentiment["neu"] for a_sentiment in all_sentiments])
    all_features = np.vstack([neg_array, pos_array, neu_array]).T
    return all_features
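To see what the np.vstack(...).T step is doing to the shape of the array, here is a small sketch with made-up score dictionaries. The values in toy_sentiments are hypothetical, only shaped like what polarity_scores() returns.

```python
import numpy as np

# Hypothetical polarity-score dictionaries, shaped like the output of
# SentimentIntensityAnalyzer.polarity_scores() (values are made up).
toy_sentiments = [
    {"neg": 0.10, "neu": 0.70, "pos": 0.20, "compound": 0.80},
    {"neg": 0.25, "neu": 0.65, "pos": 0.10, "compound": -0.60},
]

neg_array = np.array([s["neg"] for s in toy_sentiments])
pos_array = np.array([s["pos"] for s in toy_sentiments])
neu_array = np.array([s["neu"] for s in toy_sentiments])

# vstack gives one row per score type (3 x n_reviews); transposing
# flips that to one row per review (n_reviews x 3).
stacked = np.vstack([neg_array, pos_array, neu_array])
toy_features = stacked.T

print(stacked.shape)       # (3, 2)
print(toy_features.shape)  # (2, 3)
```

The column order (neg, pos, neu) is set by the order of the arrays passed to vstack, which matters later when reading the coefficients.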
We’ll also need a function that will turn “pos” and “neg” into 1 and 0.
def make_outcomes(y):
    binary = [1 if label == "pos" else 0 for label in y]
    return np.array(binary)
To set up the logistic regression, we create a model object with all of the settings we want. For example, here we’ll use L2 regularization.
model = LogisticRegression(penalty="l2")
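One setting that goes hand in hand with the penalty is C, which scikit-learn treats as the inverse of the regularization strength. The sketch below uses synthetic data (not the movie reviews) and relies on scikit-learn's default L2 penalty, just to show the effect of C on the size of the coefficients. The data and the two C values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 rows, 3 features, outcome driven by the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# In scikit-learn, C is the inverse regularization strength:
# smaller C means a stronger L2 penalty.
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

# The strongly penalized model should have coefficients closer to zero.
print(np.abs(strong.coef_).sum(), np.abs(weak.coef_).sum())
```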
Then, it’s just a matter of fitting the model.
model.fit(train_features, train_outcomes)
Since we have relatively few features, we can make some kind of sense of the coefficients. They come back in the same order as the feature columns: neg, pos, neu.
array([[-5.24518667, 6.06030818, -0.72804943]])
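To get a feel for what these numbers mean, here is a small sketch that applies them to two made-up reviews. The column order (neg, pos, neu) follows make_features(). The feature values are hypothetical, and since the intercept isn't shown here, the results are only the features' contribution to the log-odds of "pos", not full predictions.

```python
import numpy as np

# Coefficients copied from the output above; columns are neg, pos, neu.
coefs = np.array([-5.24518667, 6.06030818, -0.72804943])

# Two hypothetical reviews (made-up proportions, each summing to 1).
mostly_positive = np.array([0.05, 0.20, 0.75])  # neg, pos, neu
mostly_negative = np.array([0.20, 0.05, 0.75])

# The intercept is not shown in the text, so these are only the
# feature contributions to the log-odds of "pos", not full predictions.
score_pos = coefs @ mostly_positive
score_neg = coefs @ mostly_negative

print(score_pos, score_neg)
```

A larger share of negative-lexicon words pulls the log-odds of "pos" down, and a larger share of positive-lexicon words pushes it up, which is the direction we'd hope for.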
Evaluating the classifier
To evaluate the classifier, we need to create our test features and test outcomes.
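Here is a minimal sketch of that evaluation step. It uses synthetic arrays in place of the real output of make_features() and make_outcomes(), so the data, the test_features, test_outcomes, and test_pred names, and the split settings are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the (n_reviews x 3) feature array and 0/1 outcomes.
rng = np.random.default_rng(1)
features = rng.normal(size=(300, 3))
outcomes = (features[:, 1] - features[:, 0] > 0).astype(int)

train_features, test_features, train_outcomes, test_outcomes = train_test_split(
    features, outcomes, test_size=0.2, random_state=1
)

model = LogisticRegression()
model.fit(train_features, train_outcomes)

# Predict on the held-out rows and compare with the true labels.
test_pred = model.predict(test_features)
accuracy = (test_pred == test_outcomes).mean()
print(accuracy)
```

The key point is that the model only ever sees the training rows during fit(), and the test rows are used purely for scoring.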
How do we need to adjust the code above to get the precision array?
precision_array = np.array([])
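One possible way to fill this in, assuming the array is meant to hold one value per class and that recall was computed per class in a similar way. The labels and predictions below are made up, and the recall_array name is an assumption.

```python
import numpy as np

# Made-up true labels and predictions (1 = "pos", 0 = "neg").
test_outcomes = np.array([1, 1, 1, 0, 0, 0, 1, 0])
test_pred = np.array([1, 1, 1, 0, 1, 1, 1, 0])

# Recall: of the reviews that truly belong to a class,
# what share did the model label with that class?
recall_array = np.array([
    (test_pred[test_outcomes == label] == label).mean()
    for label in [0, 1]
])

# Precision: of the reviews the model labelled with a class,
# what share truly belong to it? Same code, with the roles swapped.
precision_array = np.array([
    (test_outcomes[test_pred == label] == label).mean()
    for label in [0, 1]
])

print(recall_array, precision_array)
```

The only change between the two is which array does the selecting and which one gets checked against the label.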
Doing it with word counts
from sklearn.feature_extraction.text import CountVectorizer
def make_token_features(X):
    raw_reviews = [movie_reviews.raw(id) for id in X]
    count_vec = CountVectorizer(analyzer="word", stop_words="english", binary=True)
    X = count_vec.fit_transform(raw_reviews)
    return count_vec, X
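The function hands back the fitted vectorizer along with the features, presumably so it can be reused on the test reviews. Here is a sketch of that idea with a few made-up reviews. The train_reviews and test_reviews texts and the train_X and test_X names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up reviews standing in for the raw movie review texts.
train_reviews = [
    "a wonderful, charming film",
    "dull plot and terrible acting",
    "wonderful acting, wonderful cast",
]
test_reviews = ["a terrible, dull sequel with a wonderful soundtrack"]

count_vec = CountVectorizer(analyzer="word", stop_words="english", binary=True)

# fit_transform learns the vocabulary from the training reviews only.
train_X = count_vec.fit_transform(train_reviews)

# transform reuses that vocabulary, so the test columns line up with
# the training columns; unseen words (e.g. "sequel") are dropped.
test_X = count_vec.transform(test_reviews)

print(sorted(count_vec.vocabulary_))
print(train_X.shape, test_X.shape)
```

With binary=True, each cell just records whether a word appears in a review, so "wonderful" showing up twice in one review still counts as 1.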
