Regular Expressions, Quickly

Published: January 25, 2024
Modified: April 8, 2024

Look at GitHub

The real content is in your regex-in-class-<username> repository on GitHub. This page is meant to be more of a reference sheet.

import urllib.parse
from IPython.display import display, Markdown

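def get_regex_url(regex: str) -> None:
    """Display a Markdown link to the regexper.com diagram for `regex`."""
    base = "https://regexper.com/#"
    # Percent-encode the regex so it is safe to put in a URL
    url = base + urllib.parse.quote(regex)
    url_markdown = f"[{regex}]({url})"
    display(Markdown(url_markdown))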

Setting up for using regular expressions in python

  • We’ll need to import the re module
  • Unlike simple strings, we’ll write our regular expressions as raw strings, with a preceding r, so that Python passes any backslashes through to the regex engine unchanged
import re

r"regex"
'regex'
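
To see why the preceding r matters, here is a minimal sketch comparing a plain string with a raw string. The variable names are just for illustration, and the example sentence is borrowed from the Boundaries section further down.

```python
import re

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees it.
plain = "\bthe\b"
raw = r"\bthe\b"

print(len(plain))  # 5: backspace, t, h, e, backspace
print(len(raw))    # 7: backslash, b, t, h, e, backslash, b

text = "I saw the other bear."
plain_matches = re.findall(plain, text)
raw_matches = re.findall(raw, text)
print(plain_matches)  # []
print(raw_matches)    # ['the']
```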

Important re functions

Two ways to use re to search strings are:

re.search()
Returns a Match object with structured information about the first place the regex matches, or None if it matches nowhere.
re.findall()
Returns a list of all the matching substrings.
sentence1 = "The speaker is speaking."

re.search()

re.search(r"speak", sentence1)
<re.Match object; span=(4, 9), match='speak'>
sentence1[4:9]
'speak'

re.findall()

re.findall(r"speak", sentence1)
['speak', 'speak']
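
One practical difference between the two functions is what happens when nothing matches. A short sketch, reusing sentence1 from above ("listen" is just a made-up pattern that doesn't occur in it):

```python
import re

sentence1 = "The speaker is speaking."

# re.search() only reports the first match, as a Match object.
match = re.search(r"speak", sentence1)
if match is not None:
    print(match.group())  # the matched text: 'speak'
    print(match.span())   # (start, end) indices: (4, 9)

# When the regex matches nowhere, re.search() returns None...
no_match = re.search(r"listen", sentence1)
print(no_match)  # None

# ...while re.findall() returns an empty list.
no_matches = re.findall(r"listen", sentence1)
print(no_matches)  # []
```

Checking for None before calling .group() avoids an AttributeError on strings that don't match.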

Simple character searches

As in the examples above, ordinary characters in a regex match themselves literally.

speak_regex = r"speak"
get_regex_url(speak_regex)

Options

If you want some characters to be chosen from a set of options, place them in [].

vowels_regex = r"[aeiou]"
get_regex_url(vowels_regex)
re.findall(r"[aeiou]", sentence1)
['e', 'e', 'a', 'e', 'i', 'e', 'a', 'i']
the_regex = r"[Tt]he"
get_regex_url(the_regex)
re.findall(the_regex, sentence1)
['The']

Ranges

Ranges of letters or digits can be given inside [], like so:

get_regex_url(r"[a-z]")
get_regex_url(r"[A-Z]")
get_regex_url(r"[0-9]")
get_regex_url(r"[A-Za-z]")
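
As a rough illustration of what each of these ranges picks out, here they are with re.findall(). The string room is a made-up example, not one of the sentences from these notes.

```python
import re

# Hypothetical example string
room = "Room B12, floor 3"

lower = re.findall(r"[a-z]", room)
upper = re.findall(r"[A-Z]", room)
digits = re.findall(r"[0-9]", room)
letters = re.findall(r"[A-Za-z]", room)

print(lower)    # ['o', 'o', 'm', 'f', 'l', 'o', 'o', 'r']
print(upper)    # ['R', 'B']
print(digits)   # ['1', '2', '3']
print(letters)  # ['R', 'o', 'o', 'm', 'B', 'f', 'l', 'o', 'o', 'r']
```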

“Metacharacters”

  • \w == [A-Za-z0-9_]
    • word characters
  • \W == [^A-Za-z0-9_]
    • non-word characters
  • \d == [0-9]
    • digits
  • \D == [^0-9]
    • non-digits
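  • \s == [ \t\n\r\f\v]
    • Any whitespace character
  • \S == [^ \t\n\r\f\v]
    • non-whitespace
  • In Python 3, these equivalences are exact only with the re.ASCII flag. By default, \w, \d, and \s also match non-ASCII letters, digits, and whitespace.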
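
A quick sketch of these shortcuts in action. A ^ at the start of a set, as in [^0-9], means "any character not in this set". The string sample is made up for illustration.

```python
import re

# Hypothetical example string
sample = "bear_2 ran!"

word_chars = re.findall(r"\w", sample)
non_word_chars = re.findall(r"\W", sample)
digit_chars = re.findall(r"\d", sample)
space_chars = re.findall(r"\s", sample)

print(word_chars)      # ['b', 'e', 'a', 'r', '_', '2', 'r', 'a', 'n']
print(non_word_chars)  # [' ', '!']
print(digit_chars)     # ['2']
print(space_chars)     # [' ']

# With ordinary Python 3 strings, \w also matches accented letters
# unless you pass the re.ASCII flag.
accented_default = re.findall(r"\w", "\u00e9")
accented_ascii = re.findall(r"\w", "\u00e9", flags=re.ASCII)
print(accented_default)  # ['é']
print(accented_ascii)    # []
```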

Any Character

To match any character (letter, number, punctuation, space, etc.) use . or “dot”. By default, the one character it does not match is a newline.

re.findall(
    # return every word character and 
    # the following character
    r"\w.",
    sentence1
)
['Th', 'e ', 'sp', 'ea', 'ke', 'r ', 'is', 'sp', 'ea', 'ki', 'ng']
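
The one thing . skips by default is a newline. Here is a small sketch of that behavior, along with the re.DOTALL flag that turns it off. The string two_lines is made up for illustration.

```python
import re

# Hypothetical two-line string
two_lines = "one\ntwo"

# By default, . matches everything except the newline
default_dots = re.findall(r".", two_lines)
print(default_dots)  # ['o', 'n', 'e', 't', 'w', 'o']

# With re.DOTALL, . matches the newline too
dotall_dots = re.findall(r".", two_lines, flags=re.DOTALL)
print(dotall_dots)  # ['o', 'n', 'e', '\n', 't', 'w', 'o']
```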

Escaping special symbols

If you wanted to find the actual period in sentence1, you’d have to “escape” the . with a preceding backslash: \.

# compare
get_regex_url(r".")
get_regex_url(r"\.")

re.findall(
    r"\.",
    sentence1
)
['.']
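
The same escaping applies to other special symbols, like $ and parentheses, which come up in later sections. A sketch with a made-up string (price is hypothetical), plus re.escape(), which adds the backslashes for you when you want a whole string treated literally:

```python
import re

# Hypothetical example string
price = "It costs $5.00 (plus tax)."

# Unescaped, $ means "end of string", so this finds nothing
unescaped = re.findall(r"$5.00", price)
print(unescaped)  # []

# Escaped, $ and . match themselves
escaped_by_hand = re.findall(r"\$5\.00", price)
print(escaped_by_hand)  # ['$5.00']

# Parentheses need escaping too
parens = re.findall(r"\(plus tax\)", price)
print(parens)  # ['(plus tax)']

# re.escape() builds the escaped pattern from a literal string
escaped_auto = re.findall(re.escape("$5.00"), price)
print(escaped_auto)  # ['$5.00']
```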

Modifiers

Modifiers (also called quantifiers) come after a single character, or a set like [aeiou], and define how many times in a row it can appear.

  • a? = zero or one a
  • a+ = one or more a
  • a* = zero or more a
get_regex_url(r"bana?na")
get_regex_url(r"bana+na")
get_regex_url(r"bana*na")
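
To compare the three side by side, here is a sketch that tests each pattern against a few made-up spellings. It uses re.fullmatch(), which isn't covered above; it only succeeds when the regex matches the entire string.

```python
import re

# Hypothetical test words with zero, one, and three a's in the middle
words = ["banna", "banana", "banaaana"]

results = {}
for pattern in [r"bana?na", r"bana+na", r"bana*na"]:
    # Keep the words that the pattern matches from start to finish
    results[pattern] = [w for w in words if re.fullmatch(pattern, w)]
    print(pattern, results[pattern])

# bana?na ['banna', 'banana']
# bana+na ['banana', 'banaaana']
# bana*na ['banna', 'banana', 'banaaana']
```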

Grouping

You can define groupings within regular expressions by wrapping part of the regex in (). The effect of these groupings depends on what kind of regex function you’re using. For re.findall(), it’ll match the whole regex, but return just the text from the grouping.

sentence2 = "The big bear and the small bear ran away."
get_regex_url(r"[Tt]he (\w+) bear")
re.findall(
    r"[Tt]he (\w+) bear",
    sentence2
)
['big', 'small']
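
For comparison, here is a sketch of how groupings behave with re.search(), and what re.findall() returns when there is more than one grouping. The second regex, with two groupings, is made up for illustration.

```python
import re

sentence2 = "The big bear and the small bear ran away."

# With re.search(), group(0) is the whole match and
# group(1) is the text from the first grouping.
match = re.search(r"[Tt]he (\w+) bear", sentence2)
whole = match.group(0)
first_group = match.group(1)
print(whole)        # 'The big bear'
print(first_group)  # 'big'

# With two groupings, re.findall() returns a list of tuples,
# one tuple per match.
pairs = re.findall(r"([Tt]he) (\w+) bear", sentence2)
print(pairs)  # [('The', 'big'), ('the', 'small')]
```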

Boundaries

  • ^the == Finds “the” at the start of a string.

  • the$ == Finds “the” at the end of a string.

  • \bthe\b == Finds “the” as a whole word, with a word boundary on each side.

get_regex_url(r"^the ")
get_regex_url(r" the$")
get_regex_url(r"\bthe\b")

sentence3 = "I saw the other bear."
re.findall(
    r"the",
    sentence3
)
['the', 'the']

The second “the” comes from inside “other”.

re.findall(
    r"\bthe\b",
    sentence3
)
['the']
sentence3
'I saw the other bear.'
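
The ^ and $ boundaries can be tried out the same way. A sketch with a made-up string (starts is hypothetical) that has “the” at both ends:

```python
import re

# Hypothetical example string
starts = "the bear saw the"

at_start = re.findall(r"^the ", starts)
at_end = re.findall(r" the$", starts)
whole_words = re.findall(r"\bthe\b", starts)

print(at_start)     # ['the ']
print(at_end)       # [' the']
print(whole_words)  # ['the', 'the']

# Matching is case sensitive, so a capitalized "The" is not found
capitalized = re.search(r"^the ", "The bear saw the")
print(capitalized)  # None
```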

Reuse

CC-BY-SA 4.0

Citation

BibTeX citation:
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Regular {Expressions,} {Quickly}},
  date = {2024-01-25},
  url = {https://lin511-2024.github.io/notes/programming/02_regex.html},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2024. “Regular Expressions, Quickly.” January 25, 2024. https://lin511-2024.github.io/notes/programming/02_regex.html.