Working with Tokenizers

Categories: python
Author: Josef Fruehwald
Published: February 8, 2024
Modified: April 8, 2024

Let’s get a little bit practical with tokenizers.

For this lesson, we’re going to use gutenbergpy and nltk, but if you try to import them right now, the way they were imported in the course notes, you’re going to get an error.

---------------------------------------------------------
ModuleNotFoundError      Traceback (most recent call last)
Cell In[2], line 1
----> 1 import nltk

ModuleNotFoundError: No module named 'nltk'

Installing gutenbergpy

We’ll need to install these packages. We’ll start with gutenbergpy.

python
! pip install gutenbergpy

Now we can import the functions to get Project Gutenberg books. The URL for Moby Dick on Project Gutenberg is https://www.gutenberg.org/ebooks/2701. That last part of the URL is the ID of the book, which we can pass to get_text_by_id() to download the book.

python
from gutenbergpy.textget import get_text_by_id, strip_headers

book_id = 2701

raw_book = get_text_by_id(book_id)

raw_book contains the book along with all of its legal headers and footers. We can remove the headers and footers with strip_headers().

python
book_byte = strip_headers(raw_book)

One last hitch here has to do with “character encoding”: what strip_headers() gives back is still a bytes object, so we need to “decode” it into an ordinary string.

python
book_clean = book_byte.decode("utf-8")

Let’s wrap that up into one function we can re-run on new IDs.

python
def get_clean_book(book_id):
    """Get the cleaned book

    Args:
        book_id (str|int): The book id

    Returns:
        (str): The full book
    """
    raw_book = get_text_by_id(book_id)
    book_byte = strip_headers(raw_book)
    book_clean = book_byte.decode("utf-8")

    return book_clean

Go ahead and point get_clean_book() at another book id.
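
For example, something like this should work (84 is the ID taken from https://www.gutenberg.org/ebooks/84, which should be Frankenstein; the variable name is just illustrative):

python
# 84 is the ID at the end of the URL for Frankenstein
frankenstein = get_clean_book(84)
frankenstein[0:100]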

NLTK tokenization

Let’s tokenize one of our books with nltk.tokenize.word_tokenize().

Steps

  1. Install nltk.
  2. Try tokenizing your book.

It might not go right at first. You can double check what to do here in the course notes.
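
If it doesn’t go right, here’s a sketch of what the whole process might look like (book_clean is the book from get_clean_book() above; the usual hitch is that word_tokenize() also needs tokenizer data downloaded separately):

python
! pip install nltk
python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize() relies on the "punkt" tokenizer models;
# in newer versions of nltk the resource is called "punkt_tab"
nltk.download("punkt")

moby_tokens = word_tokenize(book_clean)
moby_tokens[0:10]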

Let’s try spacy

To work with spacy, we need to:

  1. Install spacy
  2. Install one of the spacy models.

The steps

  1. Go to the spacy website
  2. Can you find the code to successfully install it and its language model? (One possible answer is sketched below.)
python
## Installation
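
In case you get stuck, the commands from the spacy install quickstart look something like this (en_core_web_sm is the small English model used below):

python
! pip install -U spacy
! python -m spacy download en_core_web_sm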

Let’s tokenize a book.

python
import spacy
nlp = spacy.load("en_core_web_sm")
python
import re
first_para = re.findall(
    r"Call me Ishmael.*?\n\n", 
    book_clean, 
    re.DOTALL)[0]
python
para_doc = nlp(first_para)

The output of nlp is actually a complex object enriched with a lot of information that we can access a few different ways.

python
para_doc
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world. It is a way I have of driving off the spleen and
regulating the circulation. Whenever I find myself growing grim about
the mouth; whenever it is a damp, drizzly November in my soul; whenever
I find myself involuntarily pausing before coffin warehouses, and
bringing up the rear of every funeral I meet; and especially whenever
my hypos get such an upper hand of me, that it requires a strong moral
principle to prevent me from deliberately stepping into the street, and
methodically knocking people’s hats off—then, I account it high time to
get to sea as soon as I can. This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in this. If they
but knew it, almost all men in their degree, some time or other,
cherish very nearly the same feelings towards the ocean with me.

To get any particular token out, you can do ordinary indexing.

python
para_doc[2]
Ishmael

To get the actual text of a token, we need to get its .text attribute.

python
para_doc[2].text
'Ishmael'

There’s lots of great stuff we can get out, like each sentence.

python
list(para_doc.sents)[0]
Call me Ishmael.

Or the parts of speech of each token.

python
first_sent = list(para_doc.sents)[0]
[x.pos_ for x in first_sent]
['VERB', 'PRON', 'PROPN', 'PUNCT']
python
[x.morph for x in first_sent]
[VerbForm=Inf,
 Case=Acc|Number=Sing|Person=1|PronType=Prs,
 Number=Sing,
 PunctType=Peri]

Byte Pair Encoding

We can install and use the byte pair encoder from OpenAI.
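
The install step presumably looks like the earlier ones (the package on PyPI is just called tiktoken):

python
! pip install tiktoken

Then we can load the encoding used for a particular model: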

python
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
python
first_encoded = enc.encode(first_para)
first_encoded[0:10]
[7368, 757, 57704, 1764, 301, 13, 4427, 1667, 4227, 2345]

This looks like a bunch of numbers, because this is actually saying “The first word is the 7368th token in the vocabulary list.” To get the actual text of this token, we need to “decode” it.

python
enc.decode([7368])
'Call'

You can grab any token out of the vocabulary like this:

python
enc.decode([2024])
' ter'

Training your own byte pair encoding

We can train our own byte pair encoder with the sentencepiece library.
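
As with the other packages, the install step (if you need it) is presumably just:

python
! pip install sentencepiece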

python
import sentencepiece as spm
from pathlib import Path
python
# SentencePieceTrainer reads its training data from a file,
# so write the cleaned book out to disk first
output = Path("book_clean.txt")
output.write_text(book_clean)
1218929
python
# Train a BPE model with a 1,000-piece vocabulary.
# This writes mine.model and mine.vocab (named from model_prefix)
# into the working directory.
spm.SentencePieceTrainer.train(
    input = output,
    model_prefix = "mine",
    vocab_size = 1000,
    model_type = "bpe"
)
python
# Load the newly trained model from the mine.model file
my_spm = spm.SentencePieceProcessor(model_file='mine.model')
python
my_para = my_spm.encode_as_pieces(first_para)
my_para[0:20]
['▁C',
 'all',
 '▁me',
 '▁I',
 'sh',
 'm',
 'a',
 'el',
 '.',
 '▁S',
 'ome',
 '▁years',
 '▁ag',
 'o',
 '—',
 'n',
 'ever',
 '▁mind',
 '▁how',
 '▁long']
python
my_spm.encode_as_pieces("Who is Josef Fruehwald")
['▁Wh', 'o', '▁is', '▁J', 'ose', 'f', '▁F', 'r', 'ue', 'h', 'w', 'a', 'ld']
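
As with tiktoken, the trained model can also map text to integer vocabulary IDs and decode a list of IDs back into a string; decoding should give back (very nearly) the original text. A quick sketch, using encode_as_ids() and decode() from the SentencePieceProcessor API:

python
ids = my_spm.encode_as_ids("Who is Josef Fruehwald")
my_spm.decode(ids)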

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Working with {Tokenizers}},
  date = {2024-02-08},
  url = {https://lin511-2024.github.io/notes/programming/04_tokenizers.html},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2024. “Working with Tokenizers.” February 8, 2024. https://lin511-2024.github.io/notes/programming/04_tokenizers.html.