Regular Expressions, Quickly

python

Author

Published

January 25, 2024

Modified

April 8, 2024

Look at GitHub

The real content is in your regex-in-class-<username> repository on GitHub. This page is meant to be more of a reference sheet.

This is just a support function that displays each regex as a link to its diagram on regexper.com.

import urllib.parse
from IPython.display import display, Markdown

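def get_regex_url(regex: str):
    """Display the regex as a Markdown link to its regexper.com diagram."""
    base = "https://regexper.com/#"
    # percent-encode the regex so it is safe to put in a URL
    url = base + urllib.parse.quote(regex)
    url_markdown = f"[{regex}]({url})"
    #return url_markdown
    display(Markdown(url_markdown))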

Setting up for using regular expressions in python

We’ll need to import the re module.
Unlike simple strings, we’ll write our regular expressions as raw strings, with a preceding r, so that backslashes reach the regex engine unchanged.

import re

r"regex"

'regex'
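A minimal sketch of why the preceding r matters. The variable names here are only for illustration. In a normal string Python itself interprets "\b" as a single backspace character, while the raw string keeps the backslash for the regex engine.

```python
import re

# In a normal string, "\b" is one character: a backspace.
plain = "\b"
# In a raw string, r"\b" is two characters: a backslash and a "b".
raw = r"\b"

print(len(plain))  # 1
print(len(raw))    # 2

# Only the raw version reaches the regex engine as a word boundary.
with_raw = re.findall(r"\bthe\b", "the other")
without_raw = re.findall("\bthe\b", "the other")

print(with_raw)     # ['the']
print(without_raw)  # []
```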

Important `re` functions

Two ways to use re to search strings are

re.search(): Return structured information about where the regex first matches, or None if it doesn't match at all.
re.findall(): Return all of the actual matching substrings.

sentence1 = "The speaker is speaking."

`re.search()`

re.search(r"speak", sentence1)

<re.Match object; span=(4, 9), match='speak'>

sentence1[4:9]

'speak'
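The match object can also be asked for its pieces directly, rather than slicing by hand. This is a small sketch using the same sentence; the variable names are just for illustration.

```python
import re

sentence1 = "The speaker is speaking."

match = re.search(r"speak", sentence1)

first_span = match.span()   # start and end index of the first match
first_text = match.group()  # the matched text itself

print(first_span)  # (4, 9)
print(first_text)  # speak

# When nothing matches, re.search() returns None instead of a match object,
# so it is worth checking before calling .span() or .group().
no_match = re.search(r"listen", sentence1)
if no_match is None:
    print("no match found")
```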

`re.findall()`

re.findall(r"speak", sentence1)

['speak', 'speak']

Simple character searches

As in the examples above, literal characters in a regex match themselves.

speak_regex = r"speak"
get_regex_url(speak_regex)

speak

Options

If you want to match any one character from a set of options, place the options in [].

vowels_regex = r"[aeiou]"
get_regex_url(vowels_regex)

[aeiou]

re.findall(r"[aeiou]", sentence1)

['e', 'e', 'a', 'e', 'i', 'e', 'a', 'i']

the_regex = r"[Tt]he"
get_regex_url(the_regex)

[Tt]he

re.findall(the_regex, sentence1)

['The']

Ranges

Ranges of characters or numbers can be given inside [] like so

get_regex_url(r"[a-z]")
get_regex_url(r"[A-Z]")
get_regex_url(r"[0-9]")
get_regex_url(r"[A-Za-z]")
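To see what each of these ranges picks out, here is a short sketch run on a made-up string (the string and variable names are hypothetical, not from the class materials).

```python
import re

# hypothetical example string
sample = "Room 42B, Floor 7"

lowercase = re.findall(r"[a-z]", sample)
uppercase = re.findall(r"[A-Z]", sample)
digits = re.findall(r"[0-9]", sample)
letters = re.findall(r"[A-Za-z]", sample)

print(lowercase)  # ['o', 'o', 'm', 'l', 'o', 'o', 'r']
print(uppercase)  # ['R', 'B', 'F']
print(digits)     # ['4', '2', '7']
print(letters)    # ['R', 'o', 'o', 'm', 'B', 'F', 'l', 'o', 'o', 'r']
```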

“Metacharacters”

\w == [A-Za-z0-9_]
- word characters
\W == [^A-Za-z0-9_]
- non-word characters
\d == [0-9]
- digits
\D == [^0-9]
- non-digits
\s == [ \t\n\r\f\v]
- any whitespace character
\S == [^ \t\n\r\f\v]
- non-whitespace
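A quick sketch of a few of these on a made-up string (hypothetical, not from the class materials). The bracketed equivalents above describe the ASCII case. For ordinary Python strings, \w and \d also match letters and digits from other scripts.

```python
import re

# hypothetical example string
sample = "Call 555-0199 now!"

digits = re.findall(r"\d", sample)
non_word = re.findall(r"\W", sample)
whitespace = re.findall(r"\s", sample)
word_chars = re.findall(r"\w", sample)

print(digits)           # ['5', '5', '5', '0', '1', '9', '9']
print(non_word)         # [' ', '-', ' ', '!']
print(whitespace)       # [' ', ' ']
print(len(word_chars))  # 14
```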

Any Character

To match any character (letter, number, punctuation, space, etc.) except a newline, use . or “dot”

re.findall(
    # return every word character and 
    # the following character
    r"\w.",
    sentence1
)

['Th', 'e ', 'sp', 'ea', 'ke', 'r ', 'is', 'sp', 'ea', 'ki', 'ng']

Escaping special symbols

If you wanted to find the actual period in sentence1, you’d have to “escape” the . with a preceding \.

# compare
get_regex_url(r".")
get_regex_url(r"\.")

re.findall(
    r"\.",
    sentence1
)

['.']
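The difference between the escaped and unescaped dot shows up more clearly on a string where the two disagree. The string below is made up for illustration. The last step uses re.escape(), a standard re function that adds the backslashes for you, which is one possible convenience rather than something these notes rely on.

```python
import re

# hypothetical example string
text = "3.14 and 3x14"

# An unescaped dot matches any character, so "3x14" matches too.
unescaped = re.findall(r"3.14", text)
# An escaped dot only matches a literal period.
escaped = re.findall(r"3\.14", text)

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']

# re.escape() builds the escaped pattern from a plain string.
auto_pattern = re.escape("3.14")
auto_escaped = re.findall(auto_pattern, text)
print(auto_escaped)  # ['3.14']
```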

Modifiers

Modifiers (often called quantifiers) come after the definition of a single character, and define how many times that character can appear.

a? = zero or one a
a+ = one or more a
a* = zero or more a

get_regex_url(r"bana?na")
get_regex_url(r"bana+na")
get_regex_url(r"bana*na")

bana?na

bana+na

bana*na
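To compare the three modifiers, here is a sketch that runs each regex over a made-up string containing zero, one, and two a's in the modified position (the string is hypothetical).

```python
import re

# "banna" has zero a's in the modified slot, "banana" one, "banaana" two
words = "banna banana banaana"

zero_or_one = re.findall(r"bana?na", words)
one_or_more = re.findall(r"bana+na", words)
zero_or_more = re.findall(r"bana*na", words)

print(zero_or_one)   # ['banna', 'banana']
print(one_or_more)   # ['banana', 'banaana']
print(zero_or_more)  # ['banna', 'banana', 'banaana']
```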

Grouping

You can define groupings within regular expressions. The effect of these groupings depends on what kind of regex function you’re using. For re.findall(), it’ll find the whole string, but return just the text from the grouping.

sentence2 = "The big bear and the small bear ran away."

get_regex_url(r"[Tt]he (\w+) bear")

[Tt]he (\w+) bear

re.findall(
    r"[Tt]he (\w+) bear",
    sentence2
)

['big', 'small']
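For comparison, here is a sketch of how the same grouping behaves with re.search(), and what re.findall() returns when there is more than one group. The second regex, with two groups, is an illustrative variation rather than one from the class materials.

```python
import re

sentence2 = "The big bear and the small bear ran away."

# With re.search(), the match object keeps both the whole match and the group.
match = re.search(r"[Tt]he (\w+) bear", sentence2)
whole = match.group(0)  # the entire matched text
inner = match.group(1)  # just the first grouping

print(whole)  # The big bear
print(inner)  # big

# With more than one group, re.findall() returns a tuple per match.
pairs = re.findall(r"([Tt]he) (\w+) bear", sentence2)
print(pairs)  # [('The', 'big'), ('the', 'small')]
```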

Boundaries

^the == Finds “the” at the start of a string.
the$ == Finds “the” at the end of a string.
\bthe\b == Finds “the” in between word boundaries.

get_regex_url(r"^the ")
get_regex_url(r" the$")
get_regex_url(r"\bthe\b")

^the

the$

\bthe\b

sentence3 = "I saw the other bear."
re.findall(
    r"the",
    sentence3
)

['the', 'the']

The second “the” there comes from inside “other”.

re.findall(
    r"\bthe\b",
    sentence3
)

['the']

sentence3

'I saw the other bear.'
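The ^ and $ boundaries can be tried out the same way. This is a small sketch; the first two strings are made up for illustration.

```python
import re

sentence3 = "I saw the other bear."

# ^ only lets "the" match at the very start of the string.
at_start = re.findall(r"^the", "the bear saw the other bear")
# $ only lets "the" match at the very end of the string.
at_end = re.findall(r"the$", "I like the")
# sentence3 doesn't start with "the", so nothing is found.
not_at_start = re.findall(r"^the", sentence3)

print(at_start)      # ['the']
print(at_end)        # ['the']
print(not_at_start)  # []
```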

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:

@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Regular {Expressions,} {Quickly}},
  date = {2024-01-25},
  url = {https://lin511-2024.github.io/notes/programming/02_regex.html},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2024. “Regular Expressions, Quickly.” January 25, 2024. https://lin511-2024.github.io/notes/programming/02_regex.html.
