Bayes Theorem

compling

Author

Josef Fruehwald

Published

March 19, 2024

Modified

April 8, 2024

Before we can get started with “Naive Bayes Classification”, we should maybe begin with “What is Bayes Theorem”.

Bayes Theorem

Thomas Bayes. Public domain, via Wikimedia

Bayes Theorem describes how we should use evidence or data as an indicator of some kind of outcome, or hypothesis. Specifically:

We need to consider how likely the data would be given the outcome (a conditional probability, \(P(d|h)\))
The probability of the outcome, generally \(P(h)\).
The probability of the data, generally \(P(d)\).

To use a linguistic example, let’s say we were listening to a recording with a lot of static, and we couldn’t make out even which language we were listening to. But then, we hear a very distinct [ð]. What is the probability that we’re listening to English?

The probability that we would hear [ð] if it were English is pretty high. It appears at the start of a lot of frequent function words.
The probability we’re listening to a recording of English, generally, depends on the context. Are we pulling random recordings from the internet? Are we listening to collection of specifically cross-linguistic data?
The probability of hearing [ð], regardless of the language, would require some more data exploration. It’s not that common a phoneme, cross-linguistically, but it is an allophone in many languages (e.g. Spanish).

The formula for calculating the probability that we’re listening to English would be:

\[ P(\text{English} | ð) = \frac{P(ð|\text{English}) P(\text{English})}{P(ð)} \]

The really important thing to remember is that inverting a conditional probability is hard.

A Bayes Failure

Wieling et al. (2016) found that many different languages were undergoing a change from preferring “UH” type filled pauses to “UM” type filled pauses. And, like many language changes where women lead the change:

\[ P(\text{UM} | \text{woman}) \gt P(\text{UM} | \text{man}) \]

The Daily Mail ran a headline that said something like “Did you say ‘Uh’? You probably support UKIP.”

This involved the inversion of a few different conditional probabilities.

\[ P(\text{UKIP support} | \text{man}) = \frac{P{(\text{man} | \text{UKIP support})P(\text{UKIP support)}}}{P(\text{man})} \]

and

\[ P(\text{man}|\text{Uh}) = \frac{P(\text{Uh}|\text{man})P(\text{man})}{P(\text{UH})} \]

They really only had access to the two conditional probabilities on the right hand side of the equations! And, additionally, the \(P(\text{UKIP support})\) probability was pretty low!

An example for document classification

python

import nltk
import numpy as np
from collections import Counter
nltk.download("movie_reviews")

python

from nltk.corpus import movie_reviews

The movie_reviews object is kind of idiosyncractic. Here’s how we access its contents.

Movie Reviews setup

Getting all File IDs

python

# To get all review ids:
all_ids = list(
  movie_reviews.fileids()
  )

all_ids[0:10]

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']

Getting sentiments from file ids

python

from collections import Counter
Counter([x.split("/")[0] for x in all_ids])

Counter({'neg': 1000, 'pos': 1000})

So, we know \(P(\text{positive}) = P(\text{negative}) = 0.5\).

Getting the words from a single review

python

movie_reviews.words(
  all_ids[1]
)

['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...]

Getting words from negative or positive reviews

python

movie_reviews.words(categories = ["neg"])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

python

movie_reviews.words(categories = ["pos"])

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

Getting ready to classify

Let’s say we wanted to know whether or not a movie review was negative given the fact it has the word “barely” in it. That is, we want to know:

\[ P(\text{negative}|barely) \gt P(\text{positive} | barely) \]

\[ P(\text{negative} | barely) \lt P(\text{positive}|barely) \]

Using Bayes theorem, that means we need to find out:

\[ P(\text{sentiment}|barely) = \frac{P(barely | \text{sentiment})P(\text{sentiment})}{P(barely)} \]

Getting Counts

Our probabilities will be derived from counts:

python

# from collections import Counter
C_neg = Counter(
  movie_reviews.words(
    categories = ["neg"]
    )
  )

C_pos = Counter(
  movie_reviews.words(
    categories = ["pos"]
    )
  )  

C_all = Counter(
  movie_reviews.words()
)

Calculating probabilities

python

# Conditional probabilities
P_barely_pos = C_pos["barely"] / C_pos.total()
P_barely_neg = C_neg["barely"] / C_neg.total()

# Base probability
P_barely = C_all["barely"] / C_all.total()

# Sentiment probabilities
P_pos = 0.5
P_neg = 0.5

Calculating the posterior

python

P_pos_barely = (P_barely_pos * P_pos) / P_barely

P_neg_barely = (P_barely_neg * P_neg) / P_barely

python

P_pos_barely

0.3891150491952352

python

P_neg_barely

0.6228859645471293

Naive Bayes

Now, this is the measure of how a single word contributes the the classification of a document. But let’s grab just one of the reviews with “barely” in it.

python

review_words = [
  list(movie_reviews.words(id)) 
  for id in all_ids
  ]

barely_reviews = [
  review 
  for review in review_words 
  if "barely" in review
  ] 

len(barely_reviews[0])

There were 1025 other words in the review. Surely they each contributed some to the total meaning.

“Bag of Words”

The most basic Naive Bayes model treats each review like a “bag of words.”

python

Counter(barely_reviews[0]).most_common(10)

[('.', 55), ('the', 54), (',', 51), ("'", 22), ('"', 22), ('and', 18), ('a', 16), ('of', 15), ('-', 14), ('(', 14)]

Then, for each token, we calculate the \(P(\text{sentiment} | w)\) and multiply them together.

\[ P(\text{sentiment} | W) = P(\text{sentiment)}\prod_{w\in W}P(w|\text{sentiment)} \]

Or, to do it in log-space

\[ \log P(\text{sentiment}|W) = \log P(\text{sentiment)} + \sum_{w\in W}\log P(w|\text{sentiment}) \]

Practicalities

All of the same issues that arose for n-gram models which required “smoothing” also arise for these Naive Bayes models, including:

Out of vocabulary items in the document you’re trying to classify.
A token if interest \(w\) not appearing in one of the categories, leading to a 0.0 probability.

“Feature Engineering”.

Another thing to take into account is that you might want to adjust or create new features based on the text to do your classification, which is commonly known as “feature engineering.”

References

Wieling, Martijn, Jack Grieve, Gosse Bouma, Josef Fruehwald, John Coleman, and M. Liberman. 2016. “Variation and Change in the Use of Hesitation Markers in Germanic Languages.” Language Dynamics and Change 6 (2): 199–234. https://doi.org/10.1163/22105832-00602001.

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:

@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Bayes {Theorem}},
  date = {2024-03-19},
  url = {https://lin511-2024.github.io/notes/meetings/10_naive_bayes1.html},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2024. “Bayes Theorem.” March 19, 2024. https://lin511-2024.github.io/notes/meetings/10_naive_bayes1.html.