Making and Counting ngrams in python

python

Author

Published

February 15, 2024

Modified

April 8, 2024

Here, we’ll work through some of the practicalities of creating and counting ngrams from text. Let’s grab the book first.

book-getting function

from gutenbergpy.textget \
    import get_text_by_id,\
           strip_headers

def get_clean_book(book_id):
    """Get the cleaned book

    Args:
        book_id (str|int): The book id

    Returns:
        (str): The full book
    """
    raw_book = get_text_by_id(book_id)
    book_byte = strip_headers(raw_book)
    book_clean = book_byte.decode("utf-8")

    return book_clean

moby_dick = get_clean_book(2701)

First, unigrams

First step will be getting the “unigram” frequencies.

words in context	words to predict	total	name
0	1	1	unigram
1	1	2	bigram
2	2	3	trigram

To get any counts, we need to tokenize.

from nltk.tokenize import word_tokenize

moby_words = word_tokenize(moby_dick)

Next, we can count with collections.Counter()

from collections import Counter

moby_count = Counter(moby_words)
moby_count.most_common(10)

[(',', 19211),
 ('the', 13717),
 ('.', 7164),
 ('of', 6563),
 ('and', 6009),
 ('to', 4515),
 ('a', 4507),
 (';', 4178),
 ('in', 3915),
 ('that', 2918)]

The “unigram” probability of a word: \[ P(w_i) = \frac{C(w_i)}{\sum_{i=1}^n C(w_i)} \]

We can get \(C(w_i)\) from the counter dictionary

moby_count["whale"]

The trickier thing to get is \(\sum_{i=1}^n C(w_i)\). One way to do it is with a for loop.

total_freq = 0
for word in moby_count:
    total_freq += moby_count[word]

total_freq

With total_freq, we can calculate the probability of each token, which we can store in another dictionary.

moby_prob = {}
for word in moby_count:
    moby_prob[word] = moby_count[word] / total_freq

moby_prob["whale"]

0.0030122129411856635

Introducing numpy

numpy is a python package that allows you to do numeric computation more easilly.

## if you need to install it:
# ! pip install numpy

import numpy as np

If we just had a python list of numbers, we couldn’t quickly divide each number by the sum.

sample_list = [0, 1, 3]
sample_list / sum(sample_list)

TypeError: unsupported operand type(s) for /: 'list' and 'int'

We can do this with a numpy array.

sample_array = np.array([0, 1, 3])
sample_array / sample_array.sum()

array([0.  , 0.25, 0.75])

There’s lots of numpy methods to make life easier when working with numbers.

[
    sample_array.min(), 
    sample_array.max()
]

[0, 3]

Relating tokens, counts and probabilities

While the dictionary moby_count is convenient for quickly getting the count of a token, we’ll need separate lists and arrays for:

the text of each token
the count of each token
the probability of each token

# a list of the text of each token
word_list = [w for w in moby_count]

# an array of the count of each token
count_array = np.array(
    [
        moby_count[w] 
        for w in word_list
    ]
)

# a array of the probability of each token
prob_array = count_array / count_array.sum()

A thing to think about is how the mathematical formula on the left is being implemented in the code on the right.

\[ \frac{C(w_i)}{\sum_{i=1}^n w_i} \]

count_array / count_array.sum()

We can get a specific word’s probability like so:

prob_array[
    word_list.index("whale")
]

0.0030122129411856635

“Sampling” random words

We can sample a random word from the unigram distribution like so:

np.random.choice(
    word_list, 
    size = 10, 
    p = prob_array
)

array(['it', 'letter', 'compasses', 'and', 'cut', 'lot', 'a', 'thus',
       'flew', 'With'], dtype='<U28')

Making Bigrams

Making bigrams is a bit more complex. We need to get counts of each sequence of two tokens. Fortunately, nltk has a nice and convenient function for this.

from nltk import ngrams
sent1 = ["Call", "me", "Ishmael", "."]

list(
    ngrams(sent1, n = 2)
)

[('Call', 'me'), ('me', 'Ishmael'), ('Ishmael', '.')]

This is a list of “tuples”. Tuples are kind of like lists, but you’re not able to edit them after you create them. We can use Counter() on a list of tuples just like we did a list of tokens.

moby_bigram_count = Counter(
    ngrams(moby_words, 2)
)

moby_bigram_count.most_common(10)

[((',', 'and'), 2630),
 (('of', 'the'), 1869),
 (('’', 's'), 1784),
 (('in', 'the'), 1122),
 ((',', 'the'), 916),
 ((';', 'and'), 857),
 (('to', 'the'), 715),
 (('.', 'But'), 596),
 (('.', '“'), 594),
 ((',', 'that'), 583)]

bigram_list = [bigram for bigram in moby_bigram_count]
bigram_count = np.array(
    [moby_bigram_count[bigram] for bigram in bigram_list]
)

Getting conditional probabilities

Now, getting the probability of a single token isn’t as straightforward, since we’re looking at conditional probabilities.

\[ P(w_i | w_{i-1}) = \frac{C(w_{i-1}w_i)}{\sum C(w_{i-1}w)} \]

To use concrete words for a second:

\[ P(\text{whale} | \text{the}) = \frac{C(\text{the whale})}{\sum C(\text{the }w)} \]

So, to calculate the conditional probability, we need to get

The count of the bigram “the whale”
A list (or array) of the count of every bigram that begins with “the”.
The sum of this second list.

# the first word in the
# 2 word sequence
w_0 = "the"

w_0w = [
    bigram 
    for bigram in moby_bigram_count
    if bigram[0] == w_0
]

w_0w[0:10]

[('the', 'Whale'),
 ('the', 'Monstrous'),
 ('the', 'Less'),
 ('the', 'True'),
 ('the', 'Hand'),
 ('the', 'Arsacides'),
 ('the', 'Carpenter'),
 ('the', 'Cabin'),
 ('the', 'End'),
 ('the', 'First')]

C_w_0w = np.array(
    [
        moby_bigram_count[bigram]
        for bigram in w_0w
    ]
)

total_w_0 = C_w_0w.sum()

total_w_0

We can now get the specific conditional probability for the word “whale”

moby_bigram_count[
    ("the", "whale")
]

moby_bigram_count[
    ("the", "whale")
] / total_w_0

0.02369322738208063

To randomly generate any word following “the”, we need to get the probability distribution across bigrams.

P_w_0w = C_w_0w / C_w_0w.sum()

w_1 = [bigram[1] for bigram in w_0w]

np.random.choice(
    w_1,
    size = 4,
    p = P_w_0w
)

array(['hammers', 'steak', 'chances', 'house'], dtype='<U21')

To generate a sequence, starting with a specific word, we need to encapsulate our logic above into a single function.

def generate_sequence(
        bigram_count:dict,
        w_0:str = "The", 
        n:int = 10
        )->list[str]:
    """Generate a sequence of words from 
    a bigram model

    Args:
        w_0 (str): The initial token
        n (int): The size of the sequence
            to generate

    Returns:
        (list[str]): The generated sequence
    """
    ## start out with the seed token
    sequence = [w_0]

    for i in range(n):
        ## The new seed token should be
        ## the last one added
        w_0 = sequence[-1]

        ## get all bigrams beginning 
        ## with the seed token
        w_0w = [
            bigram 
            for bigram in bigram_count
            if bigram[0] == w_0
        ]

        ## get the counts of all bigrams
        C_w_0w = np.array([
            bigram_count[bigram]
            for bigram in w_0w
        ])

        ## get the probabilities of all bigrams
        P_w_0w = C_w_0w / C_w_0w.sum()

        ## get the second token 
        ## from every bigram
        w_1 = [
            bigram[1]
            for bigram in w_0w
        ]

        ## sample a new token
        chosen = np.random.choice(
            w_1,
            size = 1,
            p = P_w_0w
        )

        ## add the sampled token to the 
        ## sequence
        sequence.append(chosen[0])

    return sequence

generate_sequence(moby_bigram_count)

['The',
 'heavers',
 'singing',
 'in',
 'his',
 'suit',
 ',',
 'filled',
 'for',
 'their',
 'spirits']

Padding

The bigram sequence generator above has to start out on a specific word, and it will keep going for as many loops as we tell it.

If we wanted a generator that doesn’t need a specific word to start on, and will auto-stop when it gets to the end of a sentence, we’ll need to pre-process our data differently, so that there are special “start” and “stop” symbols, or “padding”.

from nltk.tokenize import sent_tokenize
moby_sentences = sent_tokenize(moby_dick)

print(moby_sentences[500])

Deep into distant woodlands
winds a mazy way, reaching to overlapping spurs of mountains bathed in
their hill-side blue.

moby_sent_words = [
    word_tokenize(sentence) 
    for sentence in moby_sentences
]

moby_sent_words[500]

['Deep',
 'into',
 'distant',
 'woodlands',
 'winds',
 'a',
 'mazy',
 'way',
 ',',
 'reaching',
 'to',
 'overlapping',
 'spurs',
 'of',
 'mountains',
 'bathed',
 'in',
 'their',
 'hill-side',
 'blue',
 '.']

moby_sent_words_padded = [
    ["<s>"] + sent + ["</s>"]
    for sent in moby_sent_words
]

moby_sent_words_padded[500]

['<s>',
 'Deep',
 'into',
 'distant',
 'woodlands',
 'winds',
 'a',
 'mazy',
 'way',
 ',',
 'reaching',
 'to',
 'overlapping',
 'spurs',
 'of',
 'mountains',
 'bathed',
 'in',
 'their',
 'hill-side',
 'blue',
 '.',
 '</s>']

moby_words2 = [
    token 
    for sent in moby_sent_words_padded
        for token in sent
]

moby_bigrams2 = ngrams(moby_words2, n = 2)
moby_bigram_count2 = Counter(moby_bigrams2)

generate_sequence(
    moby_bigram_count2, 
    w_0 = "<s>", 
    n = 20
)

['<s>',
 'With',
 'what',
 'business',
 ',',
 'Starbuck',
 'caught',
 'one',
 'foot',
 'part—what',
 'a',
 'chess-man',
 'beside',
 'the',
 'rolling',
 'on',
 'the',
 'boats',
 'and',
 'sunk',
 '!']

Doing it faster with nltk

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

ngram_size = 3

train, vocab = padded_everygram_pipeline(ngram_size, moby_sent_words)
lm = MLE(ngram_size)
lm.fit(train, vocab)

sequence = ["<s>", "<s>"]
for i in range(100):
    w_0 = sequence[-2:]
    new = lm.generate(num_words=1, text_seed=w_0)
    sequence.append(new)
    if new == "</s>":
        break

sequence

['<s>',
 '<s>',
 'had',
 'turned',
 ',',
 'and',
 'continually',
 'set',
 'in',
 'a',
 'calm—give',
 'us',
 'a',
 'blue',
 'hanging',
 'tester',
 'of',
 'smoke',
 ',',
 'illuminated',
 'by',
 'the',
 'terms',
 'of',
 'my',
 'own',
 'voice',
 '.',
 '</s>']

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:

@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Making and {Counting} Ngrams in Python},
  date = {2024-02-15},
  url = {https://lin511-2024.github.io/notes/programming/05_ngrams.html},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2024. “Making and Counting Ngrams in Python.” February 15, 2024. https://lin511-2024.github.io/notes/programming/05_ngrams.html.