Let’s do some sentiment classification with logistic regression! We’ll use the scikit-learn package (sklearn) to fit the logistic regression, as well as for train-test splitting.
python
import nltk
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

nltk.download("movie_reviews")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("vader_lexicon")
[nltk_data] Downloading package movie_reviews to
[nltk_data] /Users/joseffruehwald/nltk_data...
[nltk_data] Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data] /Users/joseffruehwald/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/joseffruehwald/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/joseffruehwald/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
True
Approach 1: Sentiment Analyzer
For our first approach, we’ll use a sentiment analyzer to extract some features for the logistic regression. According to the VADER documentation, the sentiment scores we get back will be the proportion of words in a review that were in the positive, negative, and neutral lexicon lists.
python
from nltk.corpus import movie_reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer
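We’ll grab all of the review ids and their labels so that we can split them into training and test sets. Each id starts with its label (pos or neg) followed by a slash, so we can get the label by splitting the id on the slash.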
python
all_ids = list(movie_reviews.fileids())
all_labels = [x.split("/")[0] for x in all_ids]
The train_test_split() function will give us nice train/test divisions of the data.
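The split could look something like the sketch below. To keep it runnable without downloading the corpus, it uses a handful of made-up ids in the same label/filename shape as the real ones. The names toy_ids and toy_labels, the 80/20 split, the stratification, and the random_state value are all illustrative assumptions, not necessarily the settings behind the results on this page.

```python
from sklearn.model_selection import train_test_split

# Made-up ids shaped like the movie_reviews file ids ("neg/..." or "pos/...").
toy_ids = (
    [f"neg/cv{i:03d}.txt" for i in range(10)]
    + [f"pos/cv{i:03d}.txt" for i in range(10)]
)
toy_labels = [x.split("/")[0] for x in toy_ids]

# Hold out 20% of the reviews for testing. stratify keeps the pos/neg
# balance the same in both parts; random_state makes the split repeatable.
train_ids, test_ids, train_labels, test_labels = train_test_split(
    toy_ids,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,
    random_state=517,
)

print(len(train_ids), len(test_ids))
```

With the real data, all_ids and all_labels would take the place of the toy lists.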
What we’ll need to fit the logistic regression is an array of feature values with one row per review and one column per sentiment score (negative, positive, and neutral). Here’s a function that’ll do that.
python
def make_features(X):
    whole_reviews = [movie_reviews.raw(fileids=id) for id in X]
    review_words = [movie_reviews.words(fileids=id) for id in X]
    sent_analyzer = SentimentIntensityAnalyzer()
    all_sentiments = [sent_analyzer.polarity_scores(r) for r in whole_reviews]
    neg_array = np.array([a_sentiment["neg"] for a_sentiment in all_sentiments])
    pos_array = np.array([a_sentiment["pos"] for a_sentiment in all_sentiments])
    neu_array = np.array([a_sentiment["neu"] for a_sentiment in all_sentiments])
    all_features = np.vstack([neg_array, pos_array, neu_array]).T
    return all_features
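To see what the np.vstack(...).T step is doing, here is a small sketch that skips the corpus and the analyzer entirely. The dictionaries are hand-written stand-ins for what polarity_scores() returns, and every number in them is invented for illustration.

```python
import numpy as np

# Hand-written stand-ins for polarity_scores() output for four reviews.
# The values are invented; only the keys mirror the real output.
toy_sentiments = [
    {"neg": 0.10, "neu": 0.70, "pos": 0.20, "compound": 0.95},
    {"neg": 0.25, "neu": 0.65, "pos": 0.10, "compound": -0.90},
    {"neg": 0.05, "neu": 0.80, "pos": 0.15, "compound": 0.60},
    {"neg": 0.30, "neu": 0.60, "pos": 0.10, "compound": -0.75},
]

neg_array = np.array([s["neg"] for s in toy_sentiments])
pos_array = np.array([s["pos"] for s in toy_sentiments])
neu_array = np.array([s["neu"] for s in toy_sentiments])

# vstack gives 3 rows (one per score type) by 4 columns (one per review).
# Transposing flips that to one row per review, one column per score type.
toy_features = np.vstack([neg_array, pos_array, neu_array]).T

print(toy_features.shape)
print(toy_features[1])
```

The columns come out in the order neg, pos, neu, which is worth remembering when reading the model coefficients.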
We’ll also need a function that will turn “pos” and “neg” into 1 and 0.
python
def make_outcomes(y):
    binary = [1 if label == "pos" else 0 for label in y]
    return np.array(binary)
To set up the logistic regression, we create a model object with all of the settings we want. For example, here we’ll use L2 regularization.
python
model = LogisticRegression(penalty ="l2")
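For a feel of what the regularization setting does, here is a rough sketch on synthetic data. Everything in it (the toy_X and toy_y arrays, the two C values) is made up for illustration. It relies on two facts about sklearn’s LogisticRegression: L2 is the default penalty, and C is the inverse of the regularization strength, so a smaller C means a stronger penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 rows, 3 features, outcome driven mostly by feature 0.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Same model, two regularization strengths (L2 is the default penalty).
strong_reg = LogisticRegression(C=0.01).fit(toy_X, toy_y)
weak_reg = LogisticRegression(C=100.0).fit(toy_X, toy_y)

# Stronger regularization (smaller C) shrinks the coefficients toward zero.
strong_norm = np.linalg.norm(strong_reg.coef_)
weak_norm = np.linalg.norm(weak_reg.coef_)
print(strong_norm, weak_norm)
```

The model in this post leaves C at its default value.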
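Then, it’s just a matter of fitting the model to the training features and outcomes, that is, the arrays that make_features() and make_outcomes() return for the training split.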
python
model.fit(train_features, train_outcomes)
LogisticRegression()
Since we have relatively few features, we can make some kind of sense of the coefficients.
python
model.coef_
array([[-5.24518667, 6.06030818, -0.72804943]])
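One way to read these numbers is to compare two reviews. The sketch below takes the coefficients printed above, in the neg, pos, neu column order that make_features() produces, and applies them to two invented reviews. The intercept isn’t shown on this page, but it cancels out when we take the difference between two reviews, so the comparison doesn’t need it.

```python
import numpy as np

# Coefficients printed above, in the column order make_features() uses.
coef = np.array([-5.24518667, 6.06030818, -0.72804943])  # neg, pos, neu

# Two invented reviews that differ only in how the non-neutral share
# is divided between negative and positive words.
review_a = np.array([0.10, 0.20, 0.70])  # more positive
review_b = np.array([0.20, 0.10, 0.70])  # more negative

# Difference in the log-odds of "pos" between the two reviews.
# The intercept would appear in both and cancel, so it is left out.
log_odds_diff = coef @ (review_a - review_b)

# Exponentiating turns a log-odds difference into an odds ratio.
odds_ratio = np.exp(log_odds_diff)

print(round(float(log_odds_diff), 3), round(float(odds_ratio), 2))
```

A positive difference means the model gives review_a higher odds of being a positive review than review_b, which matches the signs of the neg and pos coefficients.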
Evaluating the classifier
To evaluate the classifier, we need to create our test features and test outcomes.
How do we need to adjust the code above to get the precision array?
python
precision_array = np.array([])
python
precision_array.mean()
/var/folders/sf/84lgr1c10pggym7_8p8f54r40000gp/T/ipykernel_54013/2512464149.py:1: RuntimeWarning: Mean of empty slice.
precision_array.mean()
/Users/joseffruehwald/Documents/courses/Lin511-2024.github.io/renv/python/virtualenvs/renv-python-3.11/lib/python3.11/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
nan
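The empty array gives nan because there is nothing to average. One possible way to fill it in is sketched below on invented data. It assumes the predictions would come from calling model.predict() on the test features; the names toy_outcomes and toy_predictions are hypothetical, and this is just one approach, not necessarily the intended answer.

```python
import numpy as np

# Invented true labels and predictions for eight reviews (1 = "pos").
toy_outcomes = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_predictions = np.array([1, 1, 1, 0, 1, 1, 0, 0])

# Recall: of the reviews that really are positive,
# what share did the model predict as positive?
recall_array = toy_predictions[toy_outcomes == 1]

# Precision: of the reviews the model predicted as positive,
# what share really are positive?
precision_array = toy_outcomes[toy_predictions == 1]

print(recall_array.mean(), precision_array.mean())
```

The two arrays are built the same way with the roles of outcomes and predictions swapped: recall filters on the true labels, precision filters on the predicted ones.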
Doing it with word counts
python
from sklearn.feature_extraction.text import CountVectorizer
python
def make_token_features(X):
    raw_reviews = [movie_reviews.raw(id) for id in X]
    count_vec = CountVectorizer(analyzer="word", stop_words="english", binary=True)
    X = count_vec.fit_transform(raw_reviews)
    return count_vec, X
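Here’s a small sketch of what the vectorizer does, using invented mini reviews in place of the corpus text (the toy_* names are hypothetical). It also suggests why it’s handy that the function returns count_vec along with the features: the fitted vectorizer can be reused on the test reviews with transform(), so the test columns line up with the training columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini "reviews" standing in for the raw corpus text.
toy_train_docs = ["wonderful wonderful film", "awful film with an awful plot"]
toy_test_docs = ["wonderful plot, surprisingly"]

count_vec = CountVectorizer(analyzer="word", stop_words="english", binary=True)

# Learn the vocabulary from the training documents only...
toy_train_X = count_vec.fit_transform(toy_train_docs)
# ...then reuse it for the test documents. Words that were not seen
# during fitting are simply ignored.
toy_test_X = count_vec.transform(toy_test_docs)

print(sorted(count_vec.vocabulary_))
print(toy_train_X.toarray())
print(toy_test_X.shape)
```

Because binary=True, each cell records only whether a word occurs in a review, so the repeated "wonderful" still shows up as a 1.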