import urllib.parse
from IPython.display import display, Markdown

def get_regex_url(regex: str):
    base = "https://regexper.com/#"
    url = base + urllib.parse.quote(regex)
    url_markdown = f"[{regex}]({url})"
    #return url_markdown
    display(Markdown(url_markdown))

Regular Expressions, Quickly
The real content is in your regex-in-class-<username> repository on GitHub. This page is meant to be more of a reference sheet.
Setting up for using regular expressions in Python
- We’ll need to import the re module.
- Unlike simple strings, we’ll need to write our regular expressions with a preceding r.
import re
r"regex"
'regex'
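To see why the preceding r matters, here is a small sketch comparing a plain string with a raw string. The variable names are only for illustration and are not part of the original notes.

```python
import re

# In a plain string, "\b" is a single backspace character.
plain = "\b"
# In a raw string, r"\b" is two characters (a backslash and a "b"),
# which is what the regex engine needs to see for a word boundary.
raw = r"\b"

plain_len = len(plain)  # 1
raw_len = len(raw)      # 2

# Only the raw version behaves as a word-boundary pattern.
raw_hits = re.findall(r"\bthe\b", "the other")   # ['the']
plain_hits = re.findall("\bthe\b", "the other")  # []
print(plain_len, raw_len, raw_hits, plain_hits)
```

The plain-string version silently searches for backspace characters, so it finds nothing.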
Important re functions
Two ways to use re to search strings are:
- re.search(): Returns structured information about where the regex first matches.
- re.findall(): Returns all of the actual matching substrings.
sentence1 = "The speaker is speaking."
re.search()
re.search(r"speak", sentence1)
<re.Match object; span=(4, 9), match='speak'>
sentence1[4:9]
'speak'
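The match object can also be unpacked directly instead of reading the span off the printed output. A minimal sketch, using the same sentence1 (the pattern "listen" is just an illustrative non-match):

```python
import re

sentence1 = "The speaker is speaking."

match = re.search(r"speak", sentence1)
if match is not None:
    start, end = match.span()     # where the first match starts and ends
    matched_text = match.group()  # the matched substring itself
    print(start, end, matched_text)

# When nothing matches, re.search() returns None rather than a match object,
# so it is worth checking before calling .span() or .group().
no_match = re.search(r"listen", sentence1)
print(no_match)
```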
re.findall()
re.findall(r"speak", sentence1)
['speak', 'speak']
Simple character searches
As in the examples above, ordinary characters in a regex match themselves literally.
speak_regex = r"speak"
get_regex_url(speak_regex)
Options
If you want some characters to be chosen from a set of options, place them in [].
vowels_regex = r"[aeiou]"
get_regex_url(vowels_regex)
re.findall(r"[aeiou]", sentence1)
['e', 'e', 'a', 'e', 'i', 'e', 'a', 'i']
the_regex = r"[Tt]he"
get_regex_url(the_regex)
re.findall(the_regex, sentence1)
['The']
Ranges
Ranges of characters or numbers can be given inside [] with a hyphen, for example [a-z], [A-Z], or [0-9].
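As a quick illustration of ranges, here is a sketch on a made-up string (the text and variable names are assumptions for the example, not from the original notes):

```python
import re

# Hypothetical example string.
text = "Room B12 opens at 9am."

uppercase = re.findall(r"[A-Z]", text)  # ['R', 'B']
digits = re.findall(r"[0-9]", text)     # ['1', '2', '9']

# Several ranges can be combined inside one set of brackets.
letters_or_digits = re.findall(r"[A-Za-z0-9]", text)
print(uppercase, digits, len(letters_or_digits))
```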
“Metacharacters”
- \w == [A-Za-z0-9_]: word characters
- \W == [^A-Za-z0-9_]: non-word characters
- \d == [0-9]: digits
- \D == [^0-9]: non-digits
- \s == [ \t\n\r\f\v]: any whitespace character
- \S == [^ \t\n\r\f\v]: non-whitespace
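Here is a short sketch of these shorthands on a made-up string. One caveat worth knowing: the bracketed equivalents above describe the ASCII behaviour. For ordinary Python 3 strings, \w and \d also match non-ASCII letters and digits unless the re.ASCII flag is passed.

```python
import re

# Hypothetical example string.
text = "Call 555-0199 at 10 am"

digits = re.findall(r"\d", text)    # every digit, one at a time
spaces = re.findall(r"\s", text)    # the four spaces
non_word = re.findall(r"\W", text)  # spaces plus the hyphen

# Unicode-aware by default, ASCII-only with re.ASCII.
accented = "\u00e9"  # the letter e with an acute accent
unicode_word = re.findall(r"\w", accented)
ascii_word = re.findall(r"\w", accented, flags=re.ASCII)
print(digits, spaces, non_word, unicode_word, ascii_word)
```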
Any Character
To match any character (letter, number, punctuation, space, etc.), use . or “dot”. By default, . matches everything except a newline.
re.findall(
    # return every word character and
    # the following character
    r"\w.",
    sentence1
)
['Th', 'e ', 'sp', 'ea', 'ke', 'r ', 'is', 'sp', 'ea', 'ki', 'ng']
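One detail about the dot: by default it does not match a newline. The sketch below shows the difference the re.DOTALL flag makes, using a made-up two-line string.

```python
import re

# Hypothetical example with a newline in the middle.
two_lines = "ab\ncd"

# Default behaviour: the newline is skipped.
default_dot = re.findall(r".", two_lines)
# With re.DOTALL, the dot matches the newline too.
dotall_dot = re.findall(r".", two_lines, flags=re.DOTALL)
print(default_dot, dotall_dot)
```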
Escaping special symbols
If you wanted to find the actual period in sentence1, you’d have to “escape” the . with a preceding backslash (\).
re.findall(
    r"\.",
    sentence1
)
['.']
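To make the effect of escaping concrete, here is a sketch comparing an unescaped and an escaped dot. It also uses re.escape(), a standard-library helper that adds the backslashes for you. The example strings are made up.

```python
import re

# Hypothetical example strings.
candidates = "3.50 3x50"
price = "It costs $3.50 (roughly)."

# Unescaped, the dot matches any character, so "3x50" sneaks in.
unescaped = re.findall(r"3.50", candidates)  # ['3.50', '3x50']
# Escaped, only a literal period matches.
escaped = re.findall(r"3\.50", candidates)   # ['3.50']

# re.escape() escapes every special character in a literal string.
pattern = re.escape("$3.50")
found = re.findall(pattern, price)           # ['$3.50']
print(unescaped, escaped, pattern, found)
```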
Modifiers
Modifiers come after the definition of a single character, and define how many times that character can appear.
- a? = zero or one a
- a+ = one or more a
- a* = zero or more a
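The three modifiers can be compared side by side. This is a minimal sketch on a made-up string, where each pattern differs only in the modifier after the u:

```python
import re

# Hypothetical example string with zero, one, and two u's.
text = "color colour colouur"

zero_or_one = re.findall(r"colou?r", text)   # ['color', 'colour']
one_or_more = re.findall(r"colou+r", text)   # ['colour', 'colouur']
zero_or_more = re.findall(r"colou*r", text)  # ['color', 'colour', 'colouur']
print(zero_or_one, one_or_more, zero_or_more)
```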
Grouping
You can define groupings within regular expressions with parentheses. The effect of these groupings depends on what kind of regex function you’re using. For re.findall(), the whole pattern has to match, but only the text captured by the grouping is returned.
sentence2 = "The big bear and the small bear ran away."
get_regex_url(r"[Tt]he (\w+) bear")
re.findall(
    r"[Tt]he (\w+) bear",
    sentence2
)
['big', 'small']
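Groupings behave a little differently with re.search(). The sketch below, using the same sentence2, shows how to get both the whole match and the grouped text from a match object, and what re.findall() returns when there is more than one grouping.

```python
import re

sentence2 = "The big bear and the small bear ran away."

first = re.search(r"[Tt]he (\w+) bear", sentence2)
whole = first.group(0)  # the entire match: 'The big bear'
inner = first.group(1)  # just the first grouping: 'big'

# With two groupings, re.findall() returns one tuple per match.
pairs = re.findall(r"([Tt]he) (\w+) bear", sentence2)
print(whole, inner, pairs)
```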
Boundaries
- ^the == Finds “the” at the start of a string.
- the$ == Finds “the” at the end of a string.
- \bthe\b == Finds “the” between word boundaries.
get_regex_url(r"^the ")
get_regex_url(r" the$")
get_regex_url(r"\bthe\b")
sentence3 = "I saw the other bear."
re.findall(
    r"the",
    sentence3
)
['the', 'the']
The second “the” comes from inside “other”.
re.findall(
    r"\bthe\b",
    sentence3
)
['the']
sentence3
'I saw the other bear.'
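The start and end anchors can be tried out the same way. A minimal sketch on a made-up string that has “the” at the start, inside “other”, and at the end:

```python
import re

# Hypothetical example string.
text = "the other cat chased the"

anywhere = re.findall(r"the", text)        # all three occurrences
at_start = re.findall(r"^the", text)       # only the one at the start
at_end = re.findall(r"the$", text)         # only the one at the end
whole_word = re.findall(r"\bthe\b", text)  # skips the one inside "other"
print(len(anywhere), at_start, at_end, len(whole_word))
```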
Citation
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Regular {Expressions,} {Quickly}},
  date = {2024-01-25},
  url = {https://lin511-2024.github.io/notes/programming/02_regex.html},
  langid = {en}
}