Working with Tokenizers

Categories: python
Author: Josef Fruehwald
Published: February 8, 2024
Modified: April 8, 2024

Let’s get a little bit practical with tokenizers.

For this lesson, we’re going to use gutenbergpy and nltk, but if you try to import them right now, the way they were imported in the course notes, you’re going to get an error.

---------------------------------------------------------
ModuleNotFoundError      Traceback (most recent call last)
Cell In[2], line 1
----> 1 import nltk

ModuleNotFoundError: No module named 'nltk'

Installing gutenbergpy

We’ll need to install these packages. We’ll start with gutenbergpy.

python
! pip install gutenbergpy

Now we can import the functions to get Project Gutenberg books. The URL for Moby Dick on Project Gutenberg is https://www.gutenberg.org/ebooks/2701. That last part of the URL is the ID of the book, which we can pass to get_text_by_id() to download the book.

python
from gutenbergpy.textget import get_text_by_id, strip_headers

book_id = 2701

raw_book = get_text_by_id(book_id)

raw_book contains the book along with all of its legal headers and footers. We can remove the headers and footers with strip_headers().

python
book_byte = strip_headers(raw_book)

One last hitch here has to do with “character encoding”: what strip_headers() gives back is still a bytes object, so we need to “decode” it into an ordinary string.

python
book_clean = book_byte.decode("utf-8")

Let’s wrap that up into one function we can re-run on new IDs.

python
def get_clean_book(book_id):
    """Get the cleaned book

    Args:
        book_id (str|int): The book id

    Returns:
        (str): The full book
    """
    raw_book = get_text_by_id(book_id)
    book_byte = strip_headers(raw_book)
    book_clean = book_byte.decode("utf-8")

    return book_clean

Go ahead and point get_clean_book() at another book id.
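
For example, something like this should work (84 is the ID taken from https://www.gutenberg.org/ebooks/84, which should be Frankenstein; the variable name is just illustrative):

python
# 84 is the ID at the end of the URL for Frankenstein
frankenstein = get_clean_book(84)
frankenstein[0:100]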

NLTK tokenization

Let’s tokenize one of our books with nltk.tokenize.word_tokenize().

Steps

  1. Install nltk.
  2. Try tokenizing your book.

It might not go right at first. You can double check what to do here in the course notes.
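
If it doesn’t go right, here’s a sketch of what the whole process might look like (book_clean is the book from get_clean_book() above; the usual hitch is that word_tokenize() also needs tokenizer data downloaded separately):

python
! pip install nltk
python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize() relies on the "punkt" tokenizer models;
# in newer versions of nltk the resource is called "punkt_tab"
nltk.download("punkt")

moby_tokens = word_tokenize(book_clean)
moby_tokens[0:10]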

Let’s try spacy

To work with spacy, we need to:

  1. Install spacy
  2. Install one of the spacy models.

The steps

  1. Go to the spacy website
  2. Can you find the code to successfully install it and its language model? (One possible answer is sketched below.)
python
## Installation
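
In case you get stuck, the commands from the spacy install quickstart look something like this (en_core_web_sm is the small English model used below):

python
! pip install -U spacy
! python -m spacy download en_core_web_sm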

Let’s tokenize a book.

python
import spacy
nlp = spacy.load("en_core_web_sm")
python
import re
first_para = re.findall(
    r"Call me Ishmael.*?\n\n", 
    book_clean, 
    re.DOTALL)[0]
python
para_doc = nlp(first_para)

The output of nlp is actually a complex object enriched with a lot of information that we can access a few different ways.

python
para_doc
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world. It is a way I have of driving off the spleen and
regulating the circulation. Whenever I find myself growing grim about
the mouth; whenever it is a damp, drizzly November in my soul; whenever
I find myself involuntarily pausing before coffin warehouses, and
bringing up the rear of every funeral I meet; and especially whenever
my hypos get such an upper hand of me, that it requires a strong moral
principle to prevent me from deliberately stepping into the street, and
methodically knocking people’s hats off—then, I account it high time to
get to sea as soon as I can. This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in this. If they
but knew it, almost all men in their degree, some time or other,
cherish very nearly the same feelings towards the ocean with me.

To get any particular token out, you can do ordinary indexing.

python
para_doc[2]
Ishmael

To get the actual text of a token, we need to get its .text attribute.

python
para_doc[2].text
'Ishmael'

There’s lots of great stuff we can get out, like each sentence.

python
list(para_doc.sents)[0]
Call me Ishmael.

Or the parts of speech of each token.

python
first_sent = list(para_doc.sents)[0]
[x.pos_ for x in first_sent]
['VERB', 'PRON', 'PROPN', 'PUNCT']
python
[x.morph for x in first_sent]
[VerbForm=Inf,
 Case=Acc|Number=Sing|Person=1|PronType=Prs,
 Number=Sing,
 PunctType=Peri]

Byte Pair Encoding

We can install and use the byte pair encoder from OpenAI.
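
The install step presumably looks like the earlier ones (the package on PyPI is just called tiktoken):

python
! pip install tiktoken

Then we can load the encoding used for a particular model: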

python
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
python
first_encoded = enc.encode(first_para)
first_encoded[0:10]
[7368, 757, 57704, 1764, 301, 13, 4427, 1667, 4227, 2345]

This looks like a bunch of numbers, because this is actually saying “The first word is the 7368th token in the vocabulary list.” To get the actual text of this token, we need to “decode” it.

python
enc.decode([7368])
'Call'

You can grab any token out of the vocabulary like this:

python
enc.decode([2024])
' ter'

Training your own byte pair encoding

We can train our own byte pair encoder with the sentencepiece library.
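
As with the other packages, the install step (if you need it) is presumably just:

python
! pip install sentencepiece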

python
import sentencepiece as spm
from pathlib import Path
python
# SentencePieceTrainer reads its training data from a file,
# so write the cleaned book out to disk first
output = Path("book_clean.txt")
output.write_text(book_clean)
1218929
python
# Train a BPE model with a 1,000-piece vocabulary.
# This writes mine.model and mine.vocab (named from model_prefix)
# into the working directory.
spm.SentencePieceTrainer.train(
    input = output,
    model_prefix = "mine",
    vocab_size = 1000,
    model_type = "bpe"
)
python
# Load the newly trained model from the mine.model file
my_spm = spm.SentencePieceProcessor(model_file='mine.model')
python
my_para = my_spm.encode_as_pieces(first_para)
my_para[0:20]
['▁C',
 'all',
 '▁me',
 '▁I',
 'sh',
 'm',
 'a',
 'el',
 '.',
 '▁S',
 'ome',
 '▁years',
 '▁ag',
 'o',
 '—',
 'n',
 'ever',
 '▁mind',
 '▁how',
 '▁long']
python
my_spm.encode_as_pieces("Who is Josef Fruehwald")
['▁Wh', 'o', '▁is', '▁J', 'ose', 'f', '▁F', 'r', 'ue', 'h', 'w', 'a', 'ld']
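
As with tiktoken, the trained model can also map text to integer vocabulary IDs and decode a list of IDs back into a string; decoding should give back (very nearly) the original text. A quick sketch, using encode_as_ids() and decode() from the SentencePieceProcessor API:

python
ids = my_spm.encode_as_ids("Who is Josef Fruehwald")
my_spm.decode(ids)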

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Working with {Tokenizers}},
  date = {2024-02-08},
  url = {https://lin511-2024.github.io/notes/programming/04_tokenizers.html},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2024. “Working with Tokenizers.” February 8, 2024. https://lin511-2024.github.io/notes/programming/04_tokenizers.html.