Let’s get a little bit practical with tokenizers. For this lesson, we’re going to use `gutenbergpy` and `nltk`, but if you try to import them right now, the way the course notes do, you’re going to get an error.

```
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 import nltk

ModuleNotFoundError: No module named 'nltk'
```

We’ll need to install these packages. We’ll start with `gutenbergpy`.

```python
!pip install gutenbergpy
```
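And the same for `nltk`:

```python
!pip install nltk
```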
Now, we can import the functions to get Project Gutenberg books. The URL for Moby Dick on Project Gutenberg is https://www.gutenberg.org/ebooks/2701. The last part of the URL is the ID of the book, which we can pass to `get_text_by_id()` to download the book.
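A minimal sketch of that step, using `gutenbergpy`’s `textget` module:

```python
from gutenbergpy.textget import get_text_by_id

# 2701 is the Project Gutenberg ID for Moby Dick
raw_book = get_text_by_id(2701)
```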
`raw_book` contains the book with all of its legal headers and footers. We can remove the headers and footers with `strip_headers()`.
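In code, that looks something like:

```python
from gutenbergpy.textget import strip_headers

# strip_headers() takes the raw download and returns just the body of the book
book_bytes = strip_headers(raw_book)
```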
One last hitch here has to do with “character encoding”: what we have so far is raw bytes, not a string, so we need to “decode” it.
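Assuming the usual UTF-8 encoding of Project Gutenberg’s plain-text files:

```python
# decode the bytes into an ordinary Python string
book = book_bytes.decode("utf-8")
```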
Let’s wrap that up into one function we can re-run on new IDs.
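A sketch that assembles the steps above under the name the notes use (the variable at the end is our own):

```python
from gutenbergpy.textget import get_text_by_id, strip_headers

def get_clean_book(book_id):
    """Download a Project Gutenberg book and return its decoded text."""
    raw_book = get_text_by_id(book_id)
    book_bytes = strip_headers(raw_book)
    # assumption: the file is UTF-8 encoded
    return book_bytes.decode("utf-8")

moby_dick = get_clean_book(2701)
```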
Go ahead and point `get_clean_book()` at another book ID.
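For example:

```python
# assumption: 345 is the Project Gutenberg ID for Dracula
dracula = get_clean_book(345)
```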
Let’s tokenize one of our books with `nltk.tokenize.word_tokenize()`. It might not go right at first. You can double-check what to do here in the course notes.
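A sketch:

```python
import nltk

# If word_tokenize() raises a LookupError, the usual fix is:
# nltk.download("punkt")
moby_words = nltk.tokenize.word_tokenize(moby_dick)
moby_words[:10]
```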
To work with `spacy`, we need to install it, download a pretrained language model, and load that model.
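A sketch of that setup (the specific model is an assumption; `en_core_web_sm` is spacy’s small English model):

```python
# !pip install spacy
# !python -m spacy download en_core_web_sm
import spacy

# load the pretrained pipeline
nlp = spacy.load("en_core_web_sm")
```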
Let’s tokenize a book.
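One catch: spacy’s default `max_length` is 1,000,000 characters, and the full text of Moby Dick runs longer than that, so a sketch that parses just the opening of the book:

```python
# run the whole pipeline (tokenizer, tagger, parser, ...) over a slice
doc = nlp(moby_dick[:5000])
```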
The output of `nlp` is actually a complex object enriched with a lot of information that we can access a few different ways.
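Printing the `Doc` object, for instance, gives back its text (the output below shows just the opening paragraph):

```python
print(doc)
```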
```
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world. It is a way I have of driving off the spleen and
regulating the circulation. Whenever I find myself growing grim about
the mouth; whenever it is a damp, drizzly November in my soul; whenever
I find myself involuntarily pausing before coffin warehouses, and
bringing up the rear of every funeral I meet; and especially whenever
my hypos get such an upper hand of me, that it requires a strong moral
principle to prevent me from deliberately stepping into the street, and
methodically knocking people’s hats off—then, I account it high time to
get to sea as soon as I can. This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in this. If they
but knew it, almost all men in their degree, some time or other,
cherish very nearly the same feelings towards the ocean with me.
```
To get any particular token out, you can do ordinary indexing.
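For example:

```python
doc[0]    # the first Token in the Doc
```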
To get the actual text of a token, we need to get its `.text` attribute.
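Like so:

```python
doc[0].text    # a plain string
```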
There’s lots of great stuff we can get out, like each sentence.
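A sketch using the `Doc.sents` generator:

```python
# .sents yields one Span per sentence
sentences = list(doc.sents)
sentences[0].text
```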
Or the parts of speech of each token.
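A sketch that matches the output below, using each token’s `.pos_` attribute:

```python
# .pos_ is the coarse, universal part-of-speech tag
[token.pos_ for token in doc[:4]]
```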
```
['VERB', 'PRON', 'PROPN', 'PUNCT']
```
We can install and use the byte pair encoder from OpenAI like so:
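A sketch with the `tiktoken` library; which encoding the notes used is an assumption here:

```python
# !pip install tiktoken
import tiktoken

# assumption: cl100k_base, the GPT-3.5/GPT-4-era encoding
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(moby_dick)
tokens[:10]
```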
```
[7368, 757, 57704, 1764, 301, 13, 4427, 1667, 4227, 2345]
```
This looks like a bunch of numbers because that is what the encoding really is: the first number is saying “the first word is the 7368th token in the vocabulary list.” To get the actual text of this token, we need to “decode” it.
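`decode()` takes a list of token IDs back to text:

```python
enc.decode([7368])       # just the first token
enc.decode(tokens[:10])  # the first ten tokens together
```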
You can also just grab random tokens from the vocabulary like this:
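A sketch (sampling below 100,000 keeps us inside the assigned IDs of `cl100k_base`):

```python
import random

# pick a few random token IDs and decode each one on its own
random_ids = random.sample(range(100_000), 5)
[enc.decode([token_id]) for token_id in random_ids]
```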
We can train our own byte pair encoder with the `sentencepiece` library.
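A sketch of the usual workflow, where the file names and `vocab_size` are all our own assumptions:

```python
# !pip install sentencepiece
import sentencepiece as spm

# sentencepiece trains from a text file, so write the book out first
with open("moby_dick.txt", "w") as f:
    f.write(moby_dick)

spm.SentencePieceTrainer.train(
    input="moby_dick.txt",   # training text
    model_prefix="moby",     # writes moby.model and moby.vocab
    vocab_size=2000,         # assumption: a small subword vocabulary
    model_type="bpe",        # byte pair encoding
)
```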
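Then we can load the trained model and encode text into subword pieces, which is where the output below comes from:

```python
sp = spm.SentencePieceProcessor(model_file="moby.model")

# out_type=str returns the pieces themselves rather than their IDs
sp.encode("Call me Ishmael. Some years ago—never mind how long", out_type=str)
```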
```
['▁C',
 'all',
 '▁me',
 '▁I',
 'sh',
 'm',
 'a',
 'el',
 '.',
 '▁S',
 'ome',
 '▁years',
 '▁ag',
 'o',
 '—',
 'n',
 'ever',
 '▁mind',
 '▁how',
 '▁long']
```
```bibtex
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Working with {Tokenizers}},
  date = {2024-02-08},
  url = {https://lin511-2024.github.io/notes/programming/04_tokenizers.html},
  langid = {en}
}
```