Term Frequency - Inverse Document Frequency
Describing a document by its words
Binary coding
One way to represent the content of a document, like a movie review, is with a binary code: 1 if a word appears in it, 0 if it does not.
python
import numpy as np
import nltk
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
all_ids = movie_reviews.fileids()
all_words = [movie_reviews.words(id) for id in all_ids]
python
english_stop = stopwords.words("english")
print(english_stop[0:10])
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
python
all_words_filtered = [
    [
        word
        for word in review
        if word not in english_stop
    ]
    for review in all_words
]
python
review_has_good = np.array([
    1 if "good" in review else 0
    for review in all_words_filtered
])

review_has_excellent = np.array([
    1 if "excellent" in review else 0
    for review in all_words_filtered
])

review_has_bad = np.array([
    1 if "bad" in review else 0
    for review in all_words_filtered
])
Token Counts
Or, we could count how often each word appears in a review. If a review has the word “good” in it 6 times, that word is probably more important to that review than to one where it appears just once.
python
from collections import Counter
good_count = np.array([
    Counter(review)["good"]
    for review in all_words_filtered
])

excellent_count = np.array([
    Counter(review)["excellent"]
    for review in all_words_filtered
])

bad_count = np.array([
    Counter(review)["bad"]
    for review in all_words_filtered
])
Since the reviews are different lengths, we can “normalize” the counts by dividing by each review’s length.
python
total_review = np.array([
    len(review)
    for review in all_words_filtered
])

good_norm = good_count / total_review
excellent_norm = excellent_count / total_review
bad_norm = bad_count / total_review
Document Frequency
On the other hand, it looks like “good” and “bad” appear in lots of reviews.
python
review_has_good.mean()
0.591
python
review_has_bad.mean()
0.3865
Whereas the word “excellent” doesn’t appear in that many reviews overall.
python
review_has_excellent.mean()
0.0725
Maybe, when the word “excellent” appears in a review, it should be taken more seriously, since it appears in so few of them.
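To make “doesn’t appear in that many” concrete, here is a quick check (a sketch using the indicator arrays defined above) of the raw document counts behind those means.
python
# Summing the 0/1 indicator arrays gives the number of reviews
# each word appears in (its document frequency).
print(review_has_good.sum())
print(review_has_bad.sum())
print(review_has_excellent.sum())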
TF-IDF
- TF: Term Frequency
- IDF: Inverse Document Frequency
The TF-IDF value tries to describe the words that appear in a document by how important they are to that document.
| If a word appears __ in this document | that appears __ in documents | then… |
|---|---|---|
| often | rarely | it’s probably important |
| often | very often | it might not be that important |
Term Frequency
If \(C(w)\) is the number of times a word appears in the document, then
\[ tf = \log \left( C(w) + 1 \right) \]
Why log?
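One way to see the effect (a small aside, not from the original code): the log grows much more slowly than the raw count, so a word that appears 50 times isn’t treated as 50 times more important than a word that appears once.
python
# Raw counts versus their log-transformed values:
# the log compresses large counts.
counts = np.array([0, 1, 2, 5, 10, 50])
print(np.log(counts + 1))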
python
good_tf = np.log(good_count + 1)
excellent_tf = np.log(excellent_count + 1)
bad_tf = np.log(bad_count + 1)
Inverse Document Frequency
If \(n\) is the total number of documents, and \(df\) is the number of documents a word appears in, then
\[ idf = \log \frac{n}{df} \]
python
n_documents = len(all_words_filtered)

good_idf = np.log(
    n_documents / review_has_good.sum()
)
excellent_idf = np.log(
    n_documents / review_has_excellent.sum()
)
bad_idf = np.log(
    n_documents / review_has_bad.sum()
)
TF-IDF
We just multiply these two values together.
python
good_tf_idf = good_tf * good_idf
excellent_tf_idf = excellent_tf * excellent_idf
bad_tf_idf = bad_tf * bad_idf
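To see how this plays out, here is a rough worked example using the document frequencies above (and taking the corpus to be its full 2,000 reviews, which is consistent with the means printed earlier). The word “good” appears in about 59% of reviews, so \(idf = \log(2000/1182) \approx 0.53\); even a review that uses it six times gets \(tf = \log 7 \approx 1.95\), for a tf-idf of only about \(1.02\). The word “excellent” appears in about 7% of reviews, so \(idf = \log(2000/145) \approx 2.62\); a single occurrence gives \(tf = \log 2 \approx 0.69\), for a tf-idf of about \(1.82\). One “excellent” outweighs six “good”s.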
Document “vectors”
Another way to look at these reviews is as “vectors”: rows of numbers that place each review along, say, a “good” axis and a “bad” axis.
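For example (a sketch using the arrays computed above), we can stack two of the tf-idf scores into a two-column array, so that each review becomes a point with a “good” coordinate and a “bad” coordinate.
python
# One row per review; columns are the "good" and "bad" tf-idf scores.
review_vectors = np.stack([good_tf_idf, bad_tf_idf], axis=1)
print(review_vectors.shape)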
Doing it with sklearn
python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
Setting up the data.
python
raw_reviews = [
    movie_reviews.raw(id)
    for id in all_ids
]

# fileids start with "pos/" or "neg/", so the label is the part before the "/"
labels = [
    id.split("/")[0]
    for id in all_ids
]

binary = np.array([
    1 if label == "pos" else 0
    for label in labels
])
Test-train split
python
X_train, X_test, y_train, y_test = train_test_split(
    raw_reviews,
    binary,
    train_size = 0.8
)
Making the tf-idf “vectorizer”
python
vectorizor = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizor.fit_transform(X_train)
X_test_vec = vectorizor.transform(X_test)
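The result is a sparse matrix with one row per review and one column per word in the training vocabulary. A quick way to check (the exact number of columns depends on the fitted vocabulary):
python
# Rows: reviews; columns: vocabulary items learned from the training set.
print(X_train_vec.shape)
print(X_test_vec.shape)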
Fitting a logistic regression
python
model = LogisticRegression(penalty = "l2")
model.fit(X_train_vec, y_train)
LogisticRegression()
Testing the logistic regression
python
preds = model.predict(X_test_vec)
Accuracy
python
(preds == y_test).mean()
0.8425
Recall
python
# Of the reviews that are actually positive (label == 1),
# what fraction did the model predict as positive?
recall_array = np.array([
    pred
    for pred, label in zip(preds, y_test)
    if label == 1
])
recall_array.mean()
0.86
Precision
python
# Of the reviews the model predicted as positive (pred == 1),
# what fraction are actually positive?
precision_array = np.array([
    label
    for pred, label in zip(preds, y_test)
    if pred == 1
])
precision_array.mean()
0.8309178743961353
F score
python
precision = precision_array.mean()
recall = recall_array.mean()

2 * ((precision * recall) / (precision + recall))
0.8452088452088452
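For comparison (a sketch, not part of the original notes), sklearn.metrics provides these same measures directly; the values should match the hand computations above for the same train/test split.
python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y_test, preds))
print(precision_score(y_test, preds))
print(recall_score(y_test, preds))
print(f1_score(y_test, preds))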
python
tokens = vectorizor.get_feature_names_out()
tokens[model.coef_.argmax()]
'life'
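As a quick follow-up sketch (not in the original notes), the same idea in the other direction picks out the word whose coefficient pushes hardest toward the negative class.
python
# The most negative coefficient corresponds to the word the model
# most strongly associates with negative reviews.
tokens[model.coef_.argmin()]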
Citation
@online{fruehwald2024,
author = {Fruehwald, Josef},
title = {Term {Frequency} - {Inverse} {Document} {Frequency}},
date = {2024-04-02},
url = {https://lin511-2024.github.io/notes/meetings/12_tf-idf.html},
langid = {en}
}