More Python: Poetry, Lists, Loops

python
Author
Published

January 30, 2024

Modified

April 8, 2024

This lesson has two goals:

  1. Begin walking you through processing text in Python, with an eye towards our next computational linguistics topic: ngram models.
  2. To introduce you to additional datastructures that we didn’t get to cover in the original python workthrough.

Finding and reading in files

In order to read in a file, first we need to tell python where a file is. We’ll do this with pathlib.Path(). For a recap of what paths are, see Understanding Directories.

python
from pathlib import Path

Let’s just call Path() and see what it gives us.

python
here = Path()
here
PosixPath('.')

Some detail of what gets printed out for here will depend on your operating system, but one thing that should be the same across the board is that '.' is a placeholder meaning “the dirtory you’re currently in.

To see the full path to your current directory, we’ll use the .resolve() method.

python
here.resolve()
PosixPath('/workspaces/Lin511-2024.github.io/notes/programming')

What this full path looks like really depends on where you are running your code.

Getting directory contents

To get the full contents of a directory, we’ll use the the .glob() method. To get readable output, we need to wrap list() around everything.

python
list(
    here.glob(pattern = "*")
)
[PosixPath('03_python_plus.ipynb'),
 PosixPath('assets'),
 PosixPath('_02_Paths.ipynb'),
 PosixPath('01_Python_basics.ipynb'),
 PosixPath('00_setup.qmd'),
 PosixPath('poems'),
 PosixPath('02_regex.ipynb')]

*Lists (a new data structure)

The code above created a “list”, which is, like the name suggests, a list of values.

python
here_list = list(
    here.glob(pattern = "*")
)

We can create our own lists by hand, like so:

python
my_list = ["a", "b", "c", "d", "e"] 

Doing things with lists

Getting their length

To get the length of anything, including lists, len().

python
len(here_list)
7
python
len(my_list)
5

Getting a value from a list

To get a value out of a list in python, we need to “index” it. To get just the first value from a list, we’ll use the index 0.

python
here_list[0]
PosixPath('03_python_plus.ipynb')
python
my_list[0]
'a'

To get a range of values, we’ll use a “slice” operator :.

python
my_list[0:3]
['a', 'b', 'c']
python
my_list[1:4]
['b', 'c', 'd']

How to think about indices

The index for 'a' is 0, and the index for 'c' is 2.

python
my_list.index("a")
0
python
my_list.index("c")
2

But to get out every value between 'a' and 'c', we need to use 0:3.

python
my_list[0:3]
['a', 'b', 'c']

The way to conceptualize these numeric indices is that they come between the values in a list.

Index locations

When we give just a single number as an index, we get back the value immediately to the right of the index.

Single value index

When we give an index slice, we get back every value between the start of the slice and the end.

Slice indexing
Negative indexing

Sometimes, you know you’ll want the last, or second to last value from a list, so there are also negative indicies defined.

Negative indexing

These work in the same way. A single index returns the value immediately to the left, a slice returns the values between the first and the second.

python
my_list[-1]
'e'
python
my_list[-3:-1]
['c', 'd']
python
my_list[-2:]
['d', 'e']

Reading in a text file

I’ve put two poems in the poems directory right next to the current notebook. We can get a list of them by

  • adding poems/ to the here path
  • Globbing the new path.
python
poems_dir = here.joinpath("poems")
poems_files = list(poems_dir.glob("*"))
poems_files
[PosixPath('poems/the_tiger.txt'), PosixPath('poems/this_is_just_to_say.txt')]

Let’s read in poems/this_is_just_to_say.txt.

python
plums_path = poems_files[1]
plums_path
PosixPath('poems/this_is_just_to_say.txt')

Right now plums_path just contains information about where the file is. In order to read its contents into python, we need to

  1. Open it.
  2. Read it.
  3. Close it.

We’ll manage all three steps with the following code.

python
plums_poem = plums_path.read_text()
print(plums_poem)
This Is Just To Say
By William Carlos Williams

I have eaten
the plums
that were in
the icebox

and which
you were probably
saving
for breakfast

Forgive me
they were delicious
so sweet
and so cold

Doing things to strings

Right now, plums_poem is just one long string. When we print it out, we get some nice formatting, but if we look at the unformatted output, you’ll see it’s one long string with newline characters: \n.

python
plums_poem
'This Is Just To Say\nBy William Carlos Williams\n\nI have eaten\nthe plums\nthat were in\nthe icebox\n\nand which\nyou were probably\nsaving\nfor breakfast\n\nForgive me\nthey were delicious\nso sweet\nand so cold'

To start doing computational linguistics things to the poem, we’ll need to start splitting it up into pieces, like separate lines, or separate words.

Upper and Lower case

One thing that can make life a little easier is to “case fold” text data, which we can do with the .lower() and .upper() methods.

python
plums_lower = plums_poem.lower()
print(plums_lower[0:46])
this is just to say
by william carlos williams

Splitting into lines

We can also split the poem into separate lines with the .split() method, which will return a list with the poem split by whatever value we give it.

python
# split according to newlines
plums_lines = plums_poem.split("\n")
plums_lines
['This Is Just To Say',
 'By William Carlos Williams',
 '',
 'I have eaten',
 'the plums',
 'that were in',
 'the icebox',
 '',
 'and which',
 'you were probably',
 'saving',
 'for breakfast',
 '',
 'Forgive me',
 'they were delicious',
 'so sweet',
 'and so cold']

Now, we can get each individual line by indexing.

python
plums_lines[0]
'This Is Just To Say'

Looping through the poem

Above, I ran .split() on the non-case folded poem. What if I wanted to lowercase each line of the poem now, after the fact. We can use a for loop to do this. Let’s build up to it.

for-loop skeleton

python
for line in plums_lines:
    myspace_line = "xXx " + line + " xXx"
    print(myspace_line)
xXx This Is Just To Say xXx
xXx By William Carlos Williams xXx
xXx  xXx
xXx I have eaten xXx
xXx the plums xXx
xXx that were in xXx
xXx the icebox xXx
xXx  xXx
xXx and which xXx
xXx you were probably xXx
xXx saving xXx
xXx for breakfast xXx
xXx  xXx
xXx Forgive me xXx
xXx they were delicious xXx
xXx so sweet xXx
xXx and so cold xXx

Let’s unpack the first line of the for loop:

for line in plums_line:

This tells python to take each value in plums_line, assign it to a new variable called line, then do something. After it has done something, it goes and grabs the next value from plums_line, assigns it to a variable called line, and starts over.

Some common misconceptions
  1. What we tell python to name the variable it creates does not affect the way the for loop works. We called it line to make the code readable. If we really wanted to pull out every noun from the poem, and said for noun in plums_lines:, it is not going to pull out every noun.
python
for noun in plums_lines[0:2]:
    print(noun)
This Is Just To Say
By William Carlos Williams
  1. What we do inside of the loop doen’t have to involve the line variable at all.
python
for line in plums_lines:
    print("plum")
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
plum
  1. What the for-loop did does not get saved anywhere. The only “state” that gets preserved at all is the final one.
python
myspace_line
'xXx and so cold xXx'

This last point is super important. To save, or remember, what happened inside of a for-loop, you need to do two things.

  1. Create a variable outside of the loop.
  2. Update the variable inside of the loop.

For example, if we wanted to get the total number of characters in the peom, we’d need to

  1. Create a collector variable outside of the loop.
  2. Add update its total value inside of the loop.
python
total_chars = 0

print(f"The total number of chars is now {total_chars}")

for line in plums_lines:
    line_len = len(line)
    total_chars = total_chars + line_len
    print(f"The total number of chars is now {total_chars}")

print(f"The final number of chars is {total_chars}")
The total number of chars is now 0
The total number of chars is now 19
The total number of chars is now 45
The total number of chars is now 45
The total number of chars is now 57
The total number of chars is now 66
The total number of chars is now 78
The total number of chars is now 88
The total number of chars is now 88
The total number of chars is now 97
The total number of chars is now 114
The total number of chars is now 120
The total number of chars is now 133
The total number of chars is now 133
The total number of chars is now 143
The total number of chars is now 162
The total number of chars is now 170
The total number of chars is now 181
The final number of chars is 181

This use of a for-loop makes the “loop” part clearer.

stateDiagram
    state "for line in lines:" as for
    state if_line <<choice>>
    state fork <<fork>>
    state fork1 <<fork>>
    [*] --> fork1
    fork1 --> total_chars: set to 0
    fork1 --> for
    for --> if_line
    if_line --> fork: If another line
    if_line --> [*]: If no more lines
    fork --> total_chars: +len(line)
    fork --> for

To actually lowercase the poem now, we need to

  1. Create an empty list outside of the loop.
  2. Add the lowercased line to this list inside of the loop.
python
lower_lines = []

for line in plums_lines:
    lower_lines.append(
        line.lower()
    )

lower_lines
['this is just to say',
 'by william carlos williams',
 '',
 'i have eaten',
 'the plums',
 'that were in',
 'the icebox',
 '',
 'and which',
 'you were probably',
 'saving',
 'for breakfast',
 '',
 'forgive me',
 'they were delicious',
 'so sweet',
 'and so cold']

Filtering the poem

There are a few lines of the poem that are blank and have a length of 0.

python
print(lower_lines[2])
python
len(lower_lines[2])
0

We can create a new list without these lines by

  1. Creating an empty list outside of the loop.
  2. Adding lines to that list inside the loop, only if they have a length greater than 0.

This introduces an if control block.

python
full_lines = []

for line in lower_lines:
    if len(line) > 0:
        full_lines.append(line)
    else:
        print("dumped an empty line")
dumped an empty line
dumped an empty line
dumped an empty line
python
full_lines
['this is just to say',
 'by william carlos williams',
 'i have eaten',
 'the plums',
 'that were in',
 'the icebox',
 'and which',
 'you were probably',
 'saving',
 'for breakfast',
 'forgive me',
 'they were delicious',
 'so sweet',
 'and so cold']

The if block:

  1. Checks to see if the logical statement is True or False.
  2. If True, it executes the code inside.

In this case, I’ve also included an else block. This executes its code if the if block above was False.

Nesting loops

We can nest for loops and if statements to an arbitrary level, but good programming practice tries to keep it to a minumum. For example, to get every word in the poem, based on where the spaces are, we can

  1. Create an empty words list outside of a loop.
  2. Split each line using spaces.
  3. Start a loop over that list.
  4. Update the words list inside this loop.
python
all_words = []

for line in full_lines:
    words = line.split(" ")
    for word in words:
        all_words.append(word)

print(all_words[0:19])
['this', 'is', 'just', 'to', 'say', 'by', 'william', 'carlos', 'williams', 'i', 'have', 'eaten', 'the', 'plums', 'that', 'were', 'in', 'the', 'icebox']

A slightly nicer way

A slightly nicer way to do some of the operations we worked on here is with “list comprehensions”. These are special instructions to build a list that wind up saving us a bit of typing.

Here’s the code we used to make all of the lines lowercase again.

python
lower_lines = []

for line in plums_lines:
    lower_lines.append(line.lower())

With a list comprehension, we could rewrite this for-loop like so:

python
lower_lines2 = [
    line.lower() 
    for line in plums_lines
]
lower_lines2
['this is just to say',
 'by william carlos williams',
 '',
 'i have eaten',
 'the plums',
 'that were in',
 'the icebox',
 '',
 'and which',
 'you were probably',
 'saving',
 'for breakfast',
 '',
 'forgive me',
 'they were delicious',
 'so sweet',
 'and so cold']

We can even include the empty line filtering into the list comprehension.

python
full_lines2 = [
    line.lower() 
    for line in plums_lines 
    if len(line)>0
]

full_lines2
['this is just to say',
 'by william carlos williams',
 'i have eaten',
 'the plums',
 'that were in',
 'the icebox',
 'and which',
 'you were probably',
 'saving',
 'for breakfast',
 'forgive me',
 'they were delicious',
 'so sweet',
 'and so cold']

Nested for statements also work inside of list comprehensions.

python
all_words2 = [
    word
    for line in full_lines2
    for word in line.split(" ")
]

all_words[0:19]
['this',
 'is',
 'just',
 'to',
 'say',
 'by',
 'william',
 'carlos',
 'williams',
 'i',
 'have',
 'eaten',
 'the',
 'plums',
 'that',
 'were',
 'in',
 'the',
 'icebox']
Back to top

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {More {Python:} {Poetry,} {Lists,} {Loops}},
  date = {2024-01-30},
  url = {https://lin511-2024.github.io/notes/programming/03_python_plus.html},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2024. “More Python: Poetry, Lists, Loops.” January 30, 2024. https://lin511-2024.github.io/notes/programming/03_python_plus.html.