Training and evaluating a logistic regression

python

Author

Published

March 27, 2024

Modified

April 8, 2024

Let’s do some sentiment classification with logistic regression! We’ll use the scikit-learn package (sklearn) both to fit the logistic regression and to split the data into training and test sets.

python

import nltk 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import numpy as np

nltk.download("movie_reviews")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("vader_lexicon")

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/joseffruehwald/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

True

Approach 1: Sentiment Analyzer

For our first approach, we’ll use a sentiment analyzer to extract some features for the logistic regression. According to the VADER documentation, the sentiment scores we get back will be the proportion of words in a review that were in the positive, negative, and neutral lexicon lists.

python

from nltk.corpus import movie_reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer

We’ll grab all of the review ids and labels, to then split them into training and test sets.

python

all_ids = list(
  movie_reviews.fileids()
  )

all_labels  = [x.split("/")[0] for x in all_ids]
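
The label extraction works because each file id in the corpus starts with its category folder. A minimal sketch of that split, using made-up ids in the same category/filename shape (the specific filenames here are illustrative, not taken from the corpus):

```python
# Hypothetical file ids in the "category/filename" shape used by movie_reviews
example_ids = ["neg/cv000_00000.txt", "pos/cv001_00001.txt", "pos/cv002_00002.txt"]

# Everything before the first "/" is the label
example_labels = [file_id.split("/")[0] for file_id in example_ids]

print(example_labels)
```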

The train_test_split() function will give us nice train/test divisions of the data.

python

X_train, X_test, y_train, y_test = train_test_split(all_ids, all_labels, train_size=0.8)
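
Because train_test_split() shuffles the data randomly, each run gives a different split, and so slightly different results downstream. If a repeatable split is wanted, one that also keeps the pos/neg balance the same in both halves, something like the following sketch may help. The toy ids and the seed value are placeholders, not part of the original analysis.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the review ids and labels (10 "neg", 10 "pos")
toy_ids = [f"neg/{i}.txt" for i in range(10)] + [f"pos/{i}.txt" for i in range(10)]
toy_labels = [x.split("/")[0] for x in toy_ids]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_ids,
    toy_labels,
    train_size=0.8,
    random_state=511,     # arbitrary seed; fixes the shuffle
    stratify=toy_labels,  # keep the same pos/neg ratio in train and test
)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_test.count("pos"), toy_y_test.count("neg"))
```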

Looking at the Sentiment Analyzer

Let’s load up the sentiment analyzer and see what it does to one movie review.

python

sent = SentimentIntensityAnalyzer()
one_review = movie_reviews.raw(X_train[0])

python

sent.polarity_scores(one_review)

{'neg': 0.15, 'neu': 0.702, 'pos': 0.149, 'compound': -0.3162}

What we’ll need to fit the logistic regression is an array of feature values with one row per review and one column per sentiment value. Here’s a function that’ll do that.

python

def make_features(X):
    # Get the full text of each review
    whole_reviews = [movie_reviews.raw(fileids = file_id) for file_id in X]
    # Score each review with VADER
    sent_analyzer = SentimentIntensityAnalyzer()
    all_sentiments = [sent_analyzer.polarity_scores(r) for r in whole_reviews]

    neg_array = np.array([
        a_sentiment["neg"]
        for a_sentiment in all_sentiments
        ])

    pos_array = np.array([
        a_sentiment["pos"]
        for a_sentiment in all_sentiments
        ])

    neu_array = np.array([
        a_sentiment["neu"]
        for a_sentiment in all_sentiments
        ])
    
    all_features = np.vstack([neg_array, pos_array, neu_array]).T

    return all_features
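
To see the shape of what make_features() returns, here is a small sketch using hand-written score dictionaries in place of real VADER output (the numbers are invented). It builds the same neg, pos, neu column order in one step.

```python
import numpy as np

# Invented scores in the same format that polarity_scores() returns
toy_sentiments = [
    {"neg": 0.10, "neu": 0.70, "pos": 0.20, "compound": 0.50},
    {"neg": 0.30, "neu": 0.60, "pos": 0.10, "compound": -0.40},
]

# One row per review, columns in the order neg, pos, neu
feature_keys = ["neg", "pos", "neu"]
toy_features = np.array([
    [s[key] for key in feature_keys]
    for s in toy_sentiments
])

print(toy_features.shape)
```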

We’ll also need a function that will turn “pos” and “neg” into 1 and 0.

python

def make_outcomes(y):
    binary = [
        1 
        if label == "pos"
        else
        0
        for label in y
    ]

    return np.array(binary)
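
The same conversion can also be written as an array comparison. This is just an alternative sketch with toy labels; it should give the same result as make_outcomes().

```python
import numpy as np

toy_labels = ["pos", "neg", "neg", "pos"]

# Comparing an array to "pos" gives True/False; astype(int) turns that into 1/0
toy_outcomes = (np.array(toy_labels) == "pos").astype(int)

print(toy_outcomes)
```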

python

train_features = make_features(X_train)
train_outcomes = make_outcomes(y_train)

python

train_features.shape

(1600, 3)

Setting up the logistic regression

To set up the logistic regression, we create a model object with all of the settings we want. For example, here we’ll use L2 regularization.

python

model = LogisticRegression(penalty = "l2")
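
With an L2 penalty, the strength of the regularization is controlled by the C argument, where smaller values mean stronger regularization (the default is C=1.0). The sketch below uses randomly generated toy data, not the movie reviews, and relies on L2 being the default penalty. It is only meant to illustrate that a smaller C pulls the coefficients toward zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 200 points, 3 features, outcome driven mostly by the first feature
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Same model, two regularization strengths (L2 is the default penalty)
strong_reg = LogisticRegression(C=0.01).fit(toy_X, toy_y)
weak_reg = LogisticRegression(C=100.0).fit(toy_X, toy_y)

# Compare the overall size of the two coefficient vectors
strong_size = np.linalg.norm(strong_reg.coef_)
weak_size = np.linalg.norm(weak_reg.coef_)
print(strong_size, weak_size)
```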

Then, it’s just a matter of fitting the model.

python

model.fit(train_features, train_outcomes)

LogisticRegression()


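Since we have only three features, we can make some kind of sense of the coefficients. The columns are in the order neg, pos, neu (the order they were stacked in make_features()), so the first coefficient belongs to the negative score and the second to the positive score.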

python

model.coef_

array([[-5.24518667,  6.06030818, -0.72804943]])
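
One way to read these numbers: the model multiplies each feature by its coefficient, adds the intercept, and passes the total through the logistic function to get the probability of a positive review. The sketch below checks that relationship on toy data (the feature values are random, not VADER scores).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random toy features and outcomes, only for illustration
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(100, 3))
toy_y = (toy_X[:, 1] - toy_X[:, 0] + rng.normal(size=100) > 0).astype(int)

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Weighted sum of the features plus the intercept (the log-odds)
log_odds = toy_X @ toy_model.coef_[0] + toy_model.intercept_[0]

# The logistic function turns log-odds into a probability
manual_probs = 1 / (1 + np.exp(-log_odds))

# Column 1 of predict_proba() is the probability of class 1
sklearn_probs = toy_model.predict_proba(toy_X)[:, 1]

print(np.allclose(manual_probs, sklearn_probs))
```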

Evaluating the classifier

To evaluate the classifier, we need to create our test features and test outcomes.

python

test_features = make_features(X_test)
test_outcomes = make_outcomes(y_test)

python

predictions = model.predict(test_features)

Now, let’s get the recall score.

python

recall_array = np.array([
    pred
    for pred, label in zip(predictions, test_outcomes)
    if label == 1
])

python

recall_array.mean()

0.5849056603773585
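
Because the predictions are 0s and 1s, the mean of the predictions for the truly positive reviews is the proportion of positive reviews the model caught, which is the recall. A toy version with made-up predictions and labels:

```python
import numpy as np

# Made-up predictions and true labels for six reviews
toy_predictions = np.array([1, 0, 1, 1, 0, 0])
toy_labels = np.array([1, 1, 1, 0, 0, 1])

# Keep the predictions only where the true label is positive
toy_recall_array = toy_predictions[toy_labels == 1]

# 2 of the 4 truly positive reviews were predicted positive
toy_recall = toy_recall_array.mean()
print(toy_recall)  # 0.5
```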

How do we need to adjust the code above to get the precision array?

python

precision_array = np.array([
    # left empty as an exercise: which values should be kept, and under what condition?
])

python

precision_array.mean()

/var/folders/sf/84lgr1c10pggym7_8p8f54r40000gp/T/ipykernel_54013/2512464149.py:1: RuntimeWarning: Mean of empty slice.
  precision_array.mean()
/Users/joseffruehwald/Documents/courses/Lin511-2024.github.io/renv/python/virtualenvs/renv-python-3.11/lib/python3.11/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)

nan
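
The nan and the warnings above come from taking the mean of an empty array. For comparison, here is one possible way to compute precision on made-up data: where recall keeps the predictions for truly positive reviews, precision keeps the true labels for reviews that were predicted positive. The safe_mean() helper is a hypothetical convenience, not something from numpy or scikit-learn.

```python
import numpy as np

# Made-up predictions and true labels for six reviews
toy_predictions = np.array([1, 0, 1, 1, 0, 0])
toy_labels = np.array([1, 1, 1, 0, 0, 1])

# Keep the true labels only where the model predicted positive
toy_precision_array = toy_labels[toy_predictions == 1]

def safe_mean(values):
    """Hypothetical helper: return nan quietly instead of warning on empty input."""
    values = np.asarray(values, dtype=float)
    return values.mean() if values.size > 0 else float("nan")

# 2 of the 3 reviews predicted positive were truly positive
toy_precision = safe_mean(toy_precision_array)
print(toy_precision)
```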

Approach 2: Word counts

python

from sklearn.feature_extraction.text import CountVectorizer

python

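def make_token_features(X):
    # Get the full text of each review
    raw_reviews = [movie_reviews.raw(file_id) for file_id in X]
    # binary=True records whether a word occurs in a review, not how many times
    count_vec = CountVectorizer(analyzer="word", stop_words="english", binary=True)
    # Learn the vocabulary and build the review-by-word matrix
    features = count_vec.fit_transform(raw_reviews)

    return count_vec, features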

python

vectorizer, train_features2 = make_token_features(X_train)

python

model2 = LogisticRegression(penalty = "l2")

python

model2.fit(train_features2, train_outcomes)

LogisticRegression()

python

raw_reviews = [movie_reviews.raw(id) for id in X_test]
test_features2 = vectorizer.transform(raw_reviews)

python

predictions2 = model2.predict(test_features2)

python

recall_array = np.array([
    # left empty as an exercise: reuse the recall code from above with predictions2
])
recall_array.mean()

/var/folders/sf/84lgr1c10pggym7_8p8f54r40000gp/T/ipykernel_54013/1754276131.py:4: RuntimeWarning: Mean of empty slice.
  recall_array.mean()
/Users/joseffruehwald/Documents/courses/Lin511-2024.github.io/renv/python/virtualenvs/renv-python-3.11/lib/python3.11/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)

nan

python

precision_array = np.array([
    # left empty as an exercise: reuse your precision code with predictions2
])
precision_array.mean()

/var/folders/sf/84lgr1c10pggym7_8p8f54r40000gp/T/ipykernel_54013/1530356477.py:4: RuntimeWarning: Mean of empty slice.
  precision_array.mean()

nan
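
Once the hand-built versions make sense, scikit-learn's own metric functions can serve as a cross-check. This sketch uses made-up labels and predictions; with the real data, test_outcomes and predictions2 would take their place.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predictions for six reviews
toy_labels = np.array([1, 1, 1, 0, 0, 1])
toy_predictions = np.array([1, 0, 1, 1, 0, 0])

# Both functions take the true labels first, then the predictions
toy_recall = recall_score(toy_labels, toy_predictions)
toy_precision = precision_score(toy_labels, toy_predictions)

print(toy_recall, toy_precision)
```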

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:

@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Training and Evaluating a Logistic Regression},
  date = {2024-03-27},
  url = {https://lin511-2024.github.io/notes/programming/07_logistic.html},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2024. “Training and Evaluating a Logistic Regression.” March 27, 2024. https://lin511-2024.github.io/notes/programming/07_logistic.html.