Bayes Theorem
Before we get started with “Naive Bayes Classification”, we should begin with “What is Bayes Theorem?”
Bayes Theorem
Bayes Theorem describes how we should use evidence or data as an indicator of some kind of outcome, or hypothesis. Specifically:
We need to consider how likely the data would be given the outcome (a conditional probability, \(P(d|h)\))
The probability of the outcome, generally \(P(h)\).
The probability of the data, generally \(P(d)\).
To use a linguistic example, let’s say we were listening to a recording with a lot of static, and we couldn’t even make out which language we were listening to. But then, we hear a very distinct [ð]. What is the probability that we’re listening to English?
The probability that we would hear [ð] if it were English is pretty high. It appears at the start of a lot of frequent function words.
The probability we’re listening to a recording of English, generally, depends on the context. Are we pulling random recordings from the internet? Are we listening to a collection of specifically cross-linguistic data?
The probability of hearing [ð], regardless of the language, would require some more data exploration. It’s not that common a phoneme, cross-linguistically, but it is an allophone in many languages (e.g. Spanish).
The formula for calculating the probability that we’re listening to English would be:
\[ P(\text{English} | ð) = \frac{P(ð|\text{English}) P(\text{English})}{P(ð)} \]
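To make the formula concrete, here is a minimal sketch with made-up numbers. The three probabilities are purely illustrative assumptions, not estimates from any corpus, and `posterior` is a hypothetical helper name.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
    return likelihood * prior / evidence

# All three values are invented for illustration only.
p_eth_given_english = 0.9  # P(ð | English): chance of hearing [ð] in an English recording
p_english = 0.2            # P(English): share of English recordings in our collection
p_eth = 0.3                # P(ð): chance of hearing [ð] in any recording

p_english_given_eth = posterior(p_eth_given_english, p_english, p_eth)
print(round(p_english_given_eth, 3))
```

Under these assumed values, hearing [ð] raises the probability of English from 0.2 to about 0.6.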
The really important thing to remember is that inverting a conditional probability is hard.
A Bayes Failure
Wieling et al. (2016) found that many different languages were undergoing a change from preferring “UH” type filled pauses to “UM” type filled pauses. And, like many language changes where women lead the change:
\[ P(\text{UM} | \text{woman}) \gt P(\text{UM} | \text{man}) \]
The Daily Mail ran a headline that said something like “Did you say ‘Uh’? You probably support UKIP.”
This involved the inversion of a few different conditional probabilities.
\[ P(\text{UKIP support} | \text{man}) = \frac{P(\text{man} | \text{UKIP support})P(\text{UKIP support})}{P(\text{man})} \]
and
\[ P(\text{man}|\text{Uh}) = \frac{P(\text{Uh}|\text{man})P(\text{man})}{P(\text{Uh})} \]
They really only had access to the two conditional probabilities on the right hand side of the equations! And, additionally, the \(P(\text{UKIP support})\) probability was pretty low!
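A sketch with invented numbers shows how far apart a conditional probability and its inverse can be. None of these values come from Wieling et al. (2016) or from any polling data. They are assumptions chosen only to illustrate the effect of a low base rate.

```python
# Invented values for illustration only.
p_uh_given_ukip = 0.7   # P(Uh | UKIP support): assumed high
p_ukip = 0.05           # P(UKIP support): assumed low base rate
p_uh_given_other = 0.5  # P(Uh | no UKIP support): assumed

# The law of total probability gives the denominator P(Uh).
p_uh = p_uh_given_ukip * p_ukip + p_uh_given_other * (1 - p_ukip)

# Bayes' theorem inverts the conditional.
p_ukip_given_uh = p_uh_given_ukip * p_ukip / p_uh
print(round(p_ukip_given_uh, 3))
```

Even with the assumed \(P(\text{Uh} | \text{UKIP support})\) at 0.7, the inverted probability stays under 0.1, because the base rate is so low.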
An example for document classification
python
import nltk
import numpy as np
from collections import Counter
nltk.download("movie_reviews")
python
from nltk.corpus import movie_reviews
The movie_reviews object is kind of idiosyncratic. Here’s how we access its contents.
Movie Reviews setup
Getting all File IDs
python
# To get all review ids:
all_ids = list(
    movie_reviews.fileids()
)
all_ids[0:10]
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
Getting sentiments from file ids
python
from collections import Counter
Counter([x.split("/")[0] for x in all_ids])
Counter({'neg': 1000, 'pos': 1000})
So, we know \(P(\text{positive}) = P(\text{negative}) = 0.5\).
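The step from label counts to prior probabilities can be sketched as below. The file ids are hypothetical stand-ins in the same `label/filename` shape as the corpus ids, so the snippet runs without downloading anything.

```python
from collections import Counter

# Hypothetical file ids in the same "label/filename" shape as the corpus.
toy_ids = ["neg/a.txt", "neg/b.txt", "pos/c.txt", "pos/d.txt"]

# Count documents per label, then divide by the number of documents.
label_counts = Counter(file_id.split("/")[0] for file_id in toy_ids)
n_docs = sum(label_counts.values())
priors = {label: count / n_docs for label, count in label_counts.items()}
print(priors)
```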
Getting the words from a single review
python
movie_reviews.words(
    all_ids[1]
)
['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...]
Getting words from negative or positive reviews
python
movie_reviews.words(categories = ["neg"])
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
python
movie_reviews.words(categories = ["pos"])
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
Getting ready to classify
Let’s say we wanted to know whether or not a movie review was negative given the fact it has the word “barely” in it. That is, we want to know:
\[ P(\text{negative}|barely) \gt P(\text{positive} | barely) \]
or
\[ P(\text{negative} | barely) \lt P(\text{positive}|barely) \]
Using Bayes theorem, that means we need to find out:
\[ P(\text{sentiment}|barely) = \frac{P(barely | \text{sentiment})P(\text{sentiment})}{P(barely)} \]
Getting Counts
Our probabilities will be derived from counts:
python
# from collections import Counter
C_neg = Counter(
    movie_reviews.words(
        categories = ["neg"]
    )
)
C_pos = Counter(
    movie_reviews.words(
        categories = ["pos"]
    )
)
C_all = Counter(
    movie_reviews.words()
)
Calculating probabilities
python
# Conditional probabilities
# (Counter.total() requires Python 3.10 or later)
P_barely_pos = C_pos["barely"] / C_pos.total()
P_barely_neg = C_neg["barely"] / C_neg.total()
# Base probability
P_barely = C_all["barely"] / C_all.total()
# Sentiment probabilities
P_pos = 0.5
P_neg = 0.5
Calculating the posterior
python
P_pos_barely = (P_barely_pos * P_pos) / P_barely
P_neg_barely = (P_barely_neg * P_neg) / P_barely
python
P_pos_barely
0.3891150491952352
python
P_neg_barely
0.6228859645471293
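The two posteriors above sum to slightly more than 1 (about 1.012). This happens when the denominator is estimated from the pooled token counts, which weights each category by its share of tokens, while the priors are fixed at 0.5. The sketch below shows one alternative: computing the denominator with the law of total probability so that it matches the priors. The toy counts are invented, and the `_toy` names are assumptions for illustration.

```python
from collections import Counter

# Invented toy counts; the two categories deliberately differ in size.
C_pos_toy = Counter({"barely": 2, "great": 10, "the": 88})   # 100 tokens
C_neg_toy = Counter({"barely": 6, "awful": 14, "the": 180})  # 200 tokens

P_pos, P_neg = 0.5, 0.5
P_barely_pos = C_pos_toy["barely"] / sum(C_pos_toy.values())
P_barely_neg = C_neg_toy["barely"] / sum(C_neg_toy.values())

# Denominator from the law of total probability, consistent with the priors.
P_barely = P_barely_pos * P_pos + P_barely_neg * P_neg

P_pos_barely = P_barely_pos * P_pos / P_barely
P_neg_barely = P_barely_neg * P_neg / P_barely
print(P_pos_barely, P_neg_barely)
```

With this denominator the two posteriors sum to 1 (up to floating point error). The comparison between them does not change, because both are divided by the same number.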
Naive Bayes
Now, this is the measure of how a single word contributes to the classification of a document. But let’s grab just one of the reviews with “barely” in it.
python
review_words = [
    list(movie_reviews.words(id))
    for id in all_ids
]
barely_reviews = [
    review for review in review_words
    if "barely" in review
]
len(barely_reviews[0])
1025
There were 1025 tokens in the review, and nearly all of them were something other than “barely”. Surely they each contributed something to the total meaning.
“Bag of Words”
The most basic Naive Bayes model treats each review like a “bag of words.”
python
Counter(barely_reviews[0]).most_common(10)
[('.', 55), ('the', 54), (',', 51), ("'", 22), ('"', 22), ('and', 18), ('a', 16), ('of', 15), ('-', 14), ('(', 14)]
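What “bag of words” throws away can be seen in a small sketch. The two token lists below are invented: they contain the same words in a different order, so their bags are identical even though the meanings differ.

```python
from collections import Counter

# Two invented token lists with the same words in a different order.
review_a = ["not", "good", ",", "just", "bad"]
review_b = ["not", "bad", ",", "just", "good"]

bag_a = Counter(review_a)
bag_b = Counter(review_b)

# A Counter ignores order, so the two bags compare as equal.
print(bag_a == bag_b)
```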
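Then, for each token, we calculate \(P(w | \text{sentiment})\) and multiply them all together, along with the prior \(P(\text{sentiment})\). The denominator \(P(W)\) is the same for every sentiment, so it can be dropped when comparing categories.
\[ P(\text{sentiment} | W) \propto P(\text{sentiment})\prod_{w\in W}P(w|\text{sentiment}) \]
Or, to do it in log-space, where that denominator becomes a constant that is the same for every sentiment:
\[ \log P(\text{sentiment}|W) = \log P(\text{sentiment}) + \sum_{w\in W}\log P(w|\text{sentiment}) - \log P(W) \]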
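The log-space calculation can be sketched on toy data. Everything here is an assumption for illustration: the training tokens are invented, `log_score` is a hypothetical helper, and the test document only contains words seen in both categories, so no smoothing is needed. The score leaves out the denominator \(P(W)\), since it is the same for both categories.

```python
import math
from collections import Counter

# Invented toy training tokens for each sentiment.
train = {
    "pos": ["good", "good", "fun", "plot", "bad"],
    "neg": ["bad", "bad", "dull", "plot", "good"],
}
priors = {"pos": 0.5, "neg": 0.5}
counts = {label: Counter(tokens) for label, tokens in train.items()}

def log_score(doc, label):
    """log P(label) + sum of log P(w | label) over the tokens in doc."""
    total = sum(counts[label].values())
    score = math.log(priors[label])
    for w in doc:
        # Assumes every token was seen in this category (no smoothing yet).
        score += math.log(counts[label][w] / total)
    return score

doc = ["good", "plot", "good"]
scores = {label: log_score(doc, label) for label in priors}
best = max(scores, key=scores.get)
print(best)
```

Summing logs rather than multiplying probabilities avoids numerical underflow on long documents, where the product of many small numbers would round to 0.0.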
Practicalities
All of the same issues that arose for n-gram models which required “smoothing” also arise for these Naive Bayes models, including:
Out of vocabulary items in the document you’re trying to classify.
A token of interest \(w\) not appearing in one of the categories, leading to a 0.0 probability.
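One common response to the zero-probability problem is add-one (Laplace) smoothing. The sketch below is an illustration under stated assumptions, not the method these notes prescribe: the counts are invented, and `smoothed_prob` and its `alpha` parameter are hypothetical names.

```python
from collections import Counter

# Invented toy counts; "barely" never occurs in the positive category.
C_pos_toy = Counter({"good": 3, "plot": 1})
C_neg_toy = Counter({"bad": 2, "barely": 1, "plot": 1})
vocab = set(C_pos_toy) | set(C_neg_toy)

def smoothed_prob(word, counts, vocab, alpha=1.0):
    """Add-alpha estimate of P(word | category) over a fixed vocabulary."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocab))

# Without smoothing this would be 0 / 4 = 0.0.
print(smoothed_prob("barely", C_pos_toy, vocab))
```

The smoothed estimates still sum to 1 over the vocabulary. Tokens outside `vocab` need separate handling, such as skipping them or mapping them to an unknown-word symbol.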
“Feature Engineering”.
Another thing to take into account is that you might want to adjust or create new features based on the text to do your classification, which is commonly known as “feature engineering.”
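As one example of an engineered feature, sentiment classifiers sometimes mark the tokens that follow a negation word, so that “like” and “not like” count as different features. The sketch below is a simplified illustration of that idea. The function name, the negator list, and the punctuation list are all assumptions.

```python
def mark_negation(tokens, negators=("not", "no", "never"),
                  stops=(".", ",", "!", "?")):
    """Prefix tokens that follow a negator with NOT_, until punctuation."""
    marked = []
    negating = False
    for tok in tokens:
        if tok in stops:
            # Punctuation ends the scope of the negation.
            negating = False
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
            if tok in negators:
                negating = True
    return marked

print(mark_negation(["i", "did", "not", "like", "it", ".", "fine"]))
```

The marked tokens can then be counted in the same way as the raw tokens.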
References
Citation
@online{fruehwald2024,
author = {Fruehwald, Josef},
title = {Bayes {Theorem}},
date = {2024-03-19},
url = {https://lin511-2024.github.io/notes/meetings/10_naive_bayes1.html},
langid = {en}
}