Here, we’ll work through some of the practicalities of creating and counting ngrams from text. Let’s grab the book first.
A book-getting function:

```python
from gutenbergpy.textget import get_text_by_id, strip_headers

def get_clean_book(book_id):
    """Get the cleaned book

    Args:
        book_id (str|int): The book id

    Returns:
        (str): The full book
    """
    raw_book = get_text_by_id(book_id)
    book_byte = strip_headers(raw_book)
    book_clean = book_byte.decode("utf-8")
    return book_clean
```
```python
moby_dick = get_clean_book(2701)
```
First, unigrams
The first step will be getting the “unigram” frequencies.
| words in context | words to predict | total | name    |
|------------------|------------------|-------|---------|
| 0                | 1                | 1     | unigram |
| 1                | 1                | 2     | bigram  |
| 2                | 1                | 3     | trigram |
To get any counts, we need to tokenize.
```python
from nltk.tokenize import word_tokenize

moby_words = word_tokenize(moby_dick)
```
Next, we can count with `collections.Counter()`:
```python
from collections import Counter

moby_count = Counter(moby_words)
moby_count.most_common(10)
```
There are lots of numpy methods that make life easier when working with numbers.

```python
import numpy as np

## `sample_array` is assumed to have been defined earlier,
## e.g. sample_array = np.array([0, 1, 2, 3])
[sample_array.min(), sample_array.max()]
```
```
[0, 3]
```
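A few more of those methods, using the same assumed `sample_array` (these are standard numpy calls; the commented values hold for `np.array([0, 1, 2, 3])`):

```python
[sample_array.sum(), sample_array.mean(), sample_array.argmax()]
## 6, 1.5, and 3 for the assumed array
```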
Relating tokens, counts and probabilities
While the dictionary `moby_count` is convenient for quickly getting the count of a token, we’ll need separate lists and arrays for:

- the text of each token
- the count of each token
- the probability of each token
```python
## a list of the text of each token
word_list = [w for w in moby_count]

## an array of the count of each token
count_array = np.array([moby_count[w] for w in word_list])

## an array of the probability of each token
prob_array = count_array / count_array.sum()
```
A thing to think about is how the mathematical formula below is implemented in the code that follows it.
\[
\frac{C(w_i)}{\sum_{j=1}^{n} C(w_j)}
\]
```python
count_array / count_array.sum()
```
We can get a specific word’s probability like so:
```python
prob_array[word_list.index("whale")]
```
```
0.0030122129411856635
```
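Since `prob_array` was built by dividing each count by the total, the probabilities should sum to one. A quick sanity check (not from the original text):

```python
prob_array.sum()
## 1.0, up to floating point rounding
```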
“Sampling” random words
We can sample a random word from the unigram distribution like so:
```python
np.random.choice(word_list, size=10, p=prob_array)
```
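One aside (not in the original): these draws differ from run to run. If you want reproducible samples, numpy’s seeded `default_rng` generator also has a `choice()` method:

```python
rng = np.random.default_rng(42)  ## 42 is an arbitrary seed
rng.choice(word_list, size=10, p=prob_array)
```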
Making bigrams is a bit more complex. We need to get counts of each sequence of two tokens. Fortunately, nltk has a nice and convenient function for this.
```python
from nltk import ngrams

sent1 = ["Call", "me", "Ishmael", "."]
list(ngrams(sent1, n=2))
## [('Call', 'me'), ('me', 'Ishmael'), ('Ishmael', '.')]
```
This is a list of “tuples”. Tuples are kind of like lists, but you can’t edit them after you create them. We can use `Counter()` on a list of tuples just like we did on a list of tokens.
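For example, a minimal sketch (the names `moby_bigrams` and `moby_bigram_count` are my assumptions, not from the original text):

```python
## count every two-token sequence in the whole book
moby_bigrams = ngrams(moby_words, n=2)
moby_bigram_count = Counter(moby_bigrams)
moby_bigram_count.most_common(5)
```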
To generate a sequence starting with a specific word, we need to encapsulate our logic above into a single function.
```python
def generate_sequence(
    bigram_count: dict,
    w_0: str = "The",
    n: int = 10
) -> list[str]:
    """Generate a sequence of words from a bigram model

    Args:
        bigram_count (dict): The counts of each bigram
        w_0 (str): The initial token
        n (int): The size of the sequence to generate

    Returns:
        (list[str]): The generated sequence
    """
    ## start out with the seed token
    sequence = [w_0]
    for i in range(n):
        ## the new seed token should be
        ## the last one added
        w_0 = sequence[-1]
        ## get all bigrams beginning
        ## with the seed token
        w_0w = [
            bigram
            for bigram in bigram_count
            if bigram[0] == w_0
        ]
        ## get the counts of all bigrams
        C_w_0w = np.array([
            bigram_count[bigram]
            for bigram in w_0w
        ])
        ## get the probabilities of all bigrams
        P_w_0w = C_w_0w / C_w_0w.sum()
        ## get the second token
        ## from every bigram
        w_1 = [
            bigram[1]
            for bigram in w_0w
        ]
        ## sample a new token
        chosen = np.random.choice(w_1, size=1, p=P_w_0w)
        ## add the sampled token to the sequence
        sequence.append(chosen[0])
    return sequence
```
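For example, assuming the `moby_bigram_count` dictionary from the sketch above:

```python
## generate ten tokens, seeded with "The"
generate_sequence(moby_bigram_count, w_0="The", n=10)
```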
The bigram sequence generator above has to start out on a specific word, and it will keep going for as many loops as we tell it.
If we wanted a generator that doesn’t need a specific word to start on, and that stops on its own when it reaches the end of a sentence, we’ll need to pre-process our data differently, so that there are special “start” and “stop” symbols, or “padding”.
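As a minimal sketch of what that padding looks like, using nltk’s `pad_both_ends` (an assumption on my part; the original pre-processing may differ):

```python
from nltk.lm.preprocessing import pad_both_ends

## with n=2, one "<s>" and one "</s>" get added around each sentence
list(pad_both_ends(["Call", "me", "Ishmael", "."], n=2))
## ['<s>', 'Call', 'me', 'Ishmael', '.', '</s>']
```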
```python
from nltk.tokenize import sent_tokenize

moby_sentences = sent_tokenize(moby_dick)
print(moby_sentences[500])
```
```
Deep into distant woodlands
winds a mazy way, reaching to overlapping spurs of mountains bathed in
their hill-side blue.
```
```python
moby_sent_words = [
    word_tokenize(sentence)
    for sentence in moby_sentences
]
moby_sent_words[500]
```
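The next block expects a padded version of these tokenized sentences. Here is a minimal sketch that would produce `moby_sent_words_padded`, again using `pad_both_ends` (assumed, not necessarily the original pre-processing):

```python
## pad each tokenized sentence with "<s>" and "</s>" symbols
moby_sent_words_padded = [
    list(pad_both_ends(sent, n=2))
    for sent in moby_sent_words
]
```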
```python
moby_words2 = [
    token
    for sent in moby_sent_words_padded
    for token in sent
]
moby_bigrams2 = ngrams(moby_words2, n=2)
moby_bigram_count2 = Counter(moby_bigrams2)
```
```python
generate_sequence(moby_bigram_count2, w_0="<s>", n=20)
```
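As written, `generate_sequence()` still runs for exactly `n` loops. To get the auto-stop behavior described above, one option (a sketch, not the original code; `generate_sentence` is a name I’m introducing) is to break out of the loop once the stop symbol is sampled:

```python
def generate_sentence(
    bigram_count: dict,
    w_0: str = "<s>",
    max_n: int = 100
) -> list[str]:
    """Generate tokens until "</s>" is sampled, or max_n is reached"""
    sequence = [w_0]
    for i in range(max_n):
        w_0 = sequence[-1]
        ## all bigrams beginning with the current token
        w_0w = [bigram for bigram in bigram_count if bigram[0] == w_0]
        C_w_0w = np.array([bigram_count[bigram] for bigram in w_0w])
        P_w_0w = C_w_0w / C_w_0w.sum()
        w_1 = [bigram[1] for bigram in w_0w]
        chosen = np.random.choice(w_1, size=1, p=P_w_0w)
        sequence.append(chosen[0])
        ## stop at the end-of-sentence symbol
        if sequence[-1] == "</s>":
            break
    return sequence

generate_sentence(moby_bigram_count2)
```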