Training and evaluating a logistic regression

Author: Josef Fruehwald

Published: March 27, 2024

Modified: April 8, 2024

Let’s do some sentiment classification with logistic regression! We’ll use the scikit-learn package (sklearn) to fit the logistic regression, as well as for train-test splitting.

python
import nltk 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import numpy as np

nltk.download("movie_reviews")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("vader_lexicon")
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
True

Approach 1: Sentiment Analyzer

For our first approach, we’ll use a sentiment analyzer to extract some features for the logistic regression. According to the VADER documentation, the sentiment scores we get back will be the proportion of words in a review that were in the positive, negative, and neutral lexicon lists.

python
from nltk.corpus import movie_reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer

We’ll grab all of the review ids and labels, to then split them into training and test sets.

python
all_ids = list(
  movie_reviews.fileids()
  )

all_labels  = [x.split("/")[0] for x in all_ids]
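The label extraction relies on each file id starting with its category, followed by a slash. Here is a minimal sketch of what the split does and how one might check the class balance. The ids below are made up for illustration and are not copied from the corpus listing.

```python
from collections import Counter

# Illustrative ids in the "<label>/<file name>" shape; hypothetical names
example_ids = [
    "neg/cv000_example.txt",
    "neg/cv001_example.txt",
    "pos/cv000_example.txt",
]

# Everything before the first "/" is the label
example_labels = [file_id.split("/")[0] for file_id in example_ids]

# Counter tallies how many reviews carry each label
label_counts = Counter(example_labels)

print(example_labels)
print(label_counts)
```

Running the same `Counter` call on `all_labels` would show whether the corpus is balanced between "pos" and "neg".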

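The train_test_split() function will give us nice train/test divisions of the data. Here, train_size=0.8 puts 80% of the reviews in the training set and the remaining 20% in the test set. Because no random_state is set, the split (and so every number below) will differ from run to run.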

python
X_train, X_test, y_train, y_test = train_test_split(all_ids, all_labels, train_size=0.8)
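If a reproducible split is wanted, `train_test_split()` also accepts `random_state` and `stratify` arguments. The sketch below uses toy ids and labels rather than the movie reviews, and the seed value is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the review ids and labels: 5 positive, 5 negative
toy_ids = [f"review_{i}" for i in range(10)]
toy_labels = ["pos"] * 5 + ["neg"] * 5

# random_state fixes the shuffle; stratify keeps the pos/neg ratio in both parts
ids_train, ids_test, labels_train, labels_test = train_test_split(
    toy_ids,
    toy_labels,
    train_size=0.8,
    random_state=511,  # arbitrary seed, chosen only for illustration
    stratify=toy_labels,
)

print(len(ids_train), len(ids_test))
print(sorted(labels_test))
```

With stratification, the two held-out toy items are one "pos" and one "neg", so neither class is missing from the test set.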

Looking at the Sentiment Analyzer

Let’s load up the sentiment analyzer and see what it does to one movie review.

python
sent = SentimentIntensityAnalyzer()
one_review = movie_reviews.raw(X_train[0])
python
sent.polarity_scores(one_review)
{'neg': 0.15, 'neu': 0.702, 'pos': 0.149, 'compound': -0.3162}
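As a quick sanity check on reading these scores as proportions, the "neg", "neu" and "pos" values should add up to roughly 1, up to rounding. The "compound" value is a separate summary score normalized to lie between -1 and 1. The sketch below just redoes that arithmetic on the dictionary printed above.

```python
# Scores copied from the output above
scores = {'neg': 0.15, 'neu': 0.702, 'pos': 0.149, 'compound': -0.3162}

# The three proportions should sum to about 1 (each is rounded to 3 decimals)
proportion_total = scores["neg"] + scores["neu"] + scores["pos"]

print(round(proportion_total, 3))
```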

What we’ll need to fit the logistic regression is an array of feature values with 1 row per review, and one column per sentiment value. Here’s a function that’ll do that.

python
def make_features(X):
    # Raw text of each review, looked up by its file id
    whole_reviews = [movie_reviews.raw(fileids = file_id) for file_id in X]
    sent_analyzer = SentimentIntensityAnalyzer()
    # One dict of VADER scores ("neg", "neu", "pos", "compound") per review
    all_sentiments = [sent_analyzer.polarity_scores(r) for r in whole_reviews]

    neg_array = np.array([
        a_sentiment["neg"]
        for a_sentiment in all_sentiments
        ])

    pos_array = np.array([
        a_sentiment["pos"]
        for a_sentiment in all_sentiments
        ])

    neu_array = np.array([
        a_sentiment["neu"]
        for a_sentiment in all_sentiments
        ])
    
    all_features = np.vstack([neg_array, pos_array, neu_array]).T

    return all_features

We’ll also need a function that will turn “pos” and “neg” into 1 and 0.

python
def make_outcomes(y):
    binary = [
        1 
        if label == "pos"
        else
        0
        for label in y
    ]

    return np.array(binary)
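The same conversion can also be written as a vectorized comparison. This is an alternative sketch on toy labels, not a replacement for the function above.

```python
import numpy as np

toy_labels = ["pos", "neg", "neg", "pos"]

# Comparing the array to "pos" gives booleans; astype(int) maps True/False to 1/0
toy_outcomes = (np.array(toy_labels) == "pos").astype(int)

print(toy_outcomes)
```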
python
train_features = make_features(X_train)
train_outcomes = make_outcomes(y_train)
python
train_features.shape
(1600, 3)

Setting up the logistic regression

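To set up the logistic regression, we create a model object with all of the settings we want. For example, here we’ll use L2 regularization. This is also scikit-learn’s default penalty, which is why the fitted model below prints as plain LogisticRegression().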

python
model = LogisticRegression(penalty = "l2")
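How strongly the penalty is applied is controlled by the `C` argument, which is the inverse of the regularization strength (default 1.0). The sketch below uses made-up one-feature data and the default L2 penalty to show that a smaller `C` shrinks the coefficient toward zero. The two `C` values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, well-separated toy data: one feature, four observations
toy_X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
toy_y = np.array([0, 0, 1, 1])

# Smaller C means a stronger penalty on large coefficients
strong_penalty = LogisticRegression(C=0.01).fit(toy_X, toy_y)
weak_penalty = LogisticRegression(C=100.0).fit(toy_X, toy_y)

print(strong_penalty.coef_, weak_penalty.coef_)
```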

Then, it’s just a matter of fitting the model.

python
model.fit(train_features, train_outcomes)
LogisticRegression()

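Since we have relatively few features, we can make some kind of sense of the coefficients. They come in the same order as the feature columns (neg, pos, neu), so a higher negative score lowers the log-odds of a review being classified as positive, and a higher positive score raises them.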

python
model.coef_
array([[-5.24518667,  6.06030818, -0.72804943]])
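To see how coefficients like these turn into a prediction, here is a rough worked example using the logistic function. The fitted intercept (`model.intercept_`) is not shown above, so the sketch assumes a placeholder intercept of 0. The resulting probability is illustrative only, and in practice `model.predict_proba()` does this calculation with the real intercept.

```python
import numpy as np

# Coefficients copied from the output above, in column order (neg, pos, neu)
coefs = np.array([-5.24518667, 6.06030818, -0.72804943])

# ASSUMPTION: the real intercept is not shown in the notes; 0.0 is a placeholder
intercept = 0.0

# Scores for the single review inspected earlier, in the same column order
one_review_features = np.array([0.15, 0.149, 0.702])

# Weighted sum of the features gives the log-odds of the "pos" class
log_odds = one_review_features @ coefs + intercept

# The logistic function squeezes the log-odds into a probability
prob_positive = 1 / (1 + np.exp(-log_odds))

print(log_odds, prob_positive)
```

Under that zero-intercept assumption the log-odds come out negative, so the probability of "pos" falls below 0.5.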

Evaluating the classifier

To evaluate the classifier, we need to create our test features and test outcomes.

python
test_features = make_features(X_test)
test_outcomes = make_outcomes(y_test)
python
predictions = model.predict(test_features)

Now, let’s get the recall score: out of all the reviews that really are positive, what proportion did the model label as positive?

python
recall_array = np.array([
    pred
    for pred, label in zip(predictions, test_outcomes)
    if label == 1
])
python
recall_array.mean()
0.5849056603773585
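The mean works here because the predictions are 0s and 1s: among truly positive reviews, every 1 is a true positive and every 0 is a false negative. A small sketch on made-up predictions shows that this agrees with the usual counts-based definition of recall.

```python
import numpy as np

# Made-up predictions and true outcomes for six items
toy_predictions = np.array([1, 0, 1, 1, 0, 0])
toy_outcomes = np.array([1, 1, 1, 0, 0, 0])

# Counts-based definition: TP / (TP + FN)
true_positives = np.sum((toy_predictions == 1) & (toy_outcomes == 1))
false_negatives = np.sum((toy_predictions == 0) & (toy_outcomes == 1))
recall_from_counts = true_positives / (true_positives + false_negatives)

# Mean-based version: average prediction among the truly positive items
recall_from_mean = toy_predictions[toy_outcomes == 1].mean()

print(recall_from_counts, recall_from_mean)
```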

How do we need to adjust the code above to get the precision array?

python
precision_array = np.array([

])
python
precision_array.mean()
/var/folders/sf/84lgr1c10pggym7_8p8f54r40000gp/T/ipykernel_54013/2512464149.py:1: RuntimeWarning: Mean of empty slice.
  precision_array.mean()
/Users/joseffruehwald/Documents/courses/Lin511-2024.github.io/renv/python/virtualenvs/renv-python-3.11/lib/python3.11/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
nan
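The empty array is what produces the `nan` and the warnings. One possible way to fill it in, sketched here on made-up values rather than the real predictions, is to flip the roles of the two variables: keep the true label for every item the model predicted as positive.

```python
import numpy as np

# Made-up predictions and true outcomes for six items
toy_predictions = np.array([1, 0, 1, 1, 0, 0])
toy_outcomes = np.array([1, 1, 1, 0, 0, 0])

# Precision asks: of the items predicted positive, how many really are positive?
toy_precision_array = np.array([
    label
    for pred, label in zip(toy_predictions, toy_outcomes)
    if pred == 1
])

print(toy_precision_array.mean())
```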

Approach 2: Word Counts

python
from sklearn.feature_extraction.text import CountVectorizer
python
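def make_token_features(X):
    raw_reviews = [movie_reviews.raw(file_id) for file_id in X]
    # binary=True records whether a word occurs in a review, not how often
    count_vec = CountVectorizer(analyzer="word", stop_words= "english", binary=True)
    features = count_vec.fit_transform(raw_reviews)

    # Return the fitted vectorizer too, so the test set can reuse its vocabulary
    return count_vec, features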
python
vectorizer, train_features2 = make_token_features(X_train)
python
model2 = LogisticRegression(penalty = "l2")
python
model2.fit(train_features2, train_outcomes)
LogisticRegression()
python
raw_reviews = [movie_reviews.raw(file_id) for file_id in X_test]
# transform(), not fit_transform(): reuse the vocabulary learned from the training set
test_features2 = vectorizer.transform(raw_reviews)
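The test reviews go through the vectorizer that was already fitted on the training reviews, so the test matrix has exactly the same columns the model was trained on. A toy sketch, with made-up texts, of what happens to a word that only appears in the test data:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["wonderful acting", "boring plot"]
toy_test = ["wonderful soundtrack"]  # "soundtrack" never appears in toy_train

toy_vec = CountVectorizer(binary=True)
toy_train_features = toy_vec.fit_transform(toy_train)

# transform() keeps the training vocabulary; unseen words are silently dropped
toy_test_features = toy_vec.transform(toy_test)

print(toy_train_features.shape, toy_test_features.shape)
print(toy_test_features.toarray())
```

Calling `fit_transform()` on the test data instead would build a different vocabulary, and the columns would no longer line up with the model's coefficients.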
python
predictions2 = model2.predict(test_features2)
python
recall_array = np.array([

])
recall_array.mean()
/var/folders/sf/84lgr1c10pggym7_8p8f54r40000gp/T/ipykernel_54013/1754276131.py:4: RuntimeWarning: Mean of empty slice.
  recall_array.mean()
/Users/joseffruehwald/Documents/courses/Lin511-2024.github.io/renv/python/virtualenvs/renv-python-3.11/lib/python3.11/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
nan
python
precision_array = np.array([
 
])
precision_array.mean()
/var/folders/sf/84lgr1c10pggym7_8p8f54r40000gp/T/ipykernel_54013/1530356477.py:4: RuntimeWarning: Mean of empty slice.
  precision_array.mean()
nan
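As a cross-check on hand-built arrays like these, scikit-learn also provides `recall_score()` and `precision_score()` in `sklearn.metrics`. The sketch below runs them on made-up predictions, not on the output of `model2`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up predictions and true outcomes for six items
toy_predictions = np.array([1, 0, 1, 1, 0, 0])
toy_outcomes = np.array([1, 1, 1, 0, 0, 0])

# Both functions take the true labels first, then the predictions
toy_recall = recall_score(toy_outcomes, toy_predictions)
toy_precision = precision_score(toy_outcomes, toy_predictions)

print(toy_recall, toy_precision)
```

For the real models, the same calls would take `test_outcomes` as the first argument and `predictions` or `predictions2` as the second.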

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Training and Evaluating a Logistic Regression},
  date = {2024-03-27},
  url = {https://lin511-2024.github.io/notes/programming/07_logistic.html},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2024. “Training and Evaluating a Logistic Regression.” March 27, 2024. https://lin511-2024.github.io/notes/programming/07_logistic.html.