import urllib.parse
from IPython.display import display, Markdown
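def get_regex_url(regex: str):
    # Build a regexper.com link that visualizes the regex
    base = "https://regexper.com/#"
    url = base + urllib.parse.quote(regex)
    # Format the link as markdown, with the regex itself as the link text
    url_markdown = f"[{regex}]({url})"
    # return url_markdown
    display(Markdown(url_markdown))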
Regular Expressions, Quickly
The real content is in your regex-in-class-<username> repository on GitHub. This page is meant to be more of a reference sheet.
Setting up for using regular expressions in Python
- We’ll need to import the re module.
- Unlike simple strings, we’ll need to write our regular expressions with a preceding r.
import re
r"regex"
'regex'
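The reason for the preceding r is that Python otherwise interprets some backslash sequences itself before re ever sees them. Here is a minimal sketch of the difference, using the \bthe\b pattern that appears under Boundaries; the variable names are just for illustration.

```python
import re

# In a plain string, "\b" is a single backspace character.
plain = "\bthe\b"
# In a raw string, the backslash is kept, so re receives the \b token.
raw = r"\bthe\b"

plain_len = len(plain)  # backspace + "the" + backspace
raw_len = len(raw)      # \, b, t, h, e, \, b

# The plain version looks for literal backspace characters and finds nothing.
plain_hits = re.findall(plain, "I saw the other bear.")
raw_hits = re.findall(raw, "I saw the other bear.")

print(plain_len, raw_len)    # 5 7
print(plain_hits, raw_hits)  # [] ['the']
```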
Important re functions
Two ways to use re to search strings are
- re.search(): Return structured information about where the regex first matches.
- re.findall(): Return all actual matching substrings.
sentence1 = "The speaker is speaking."
re.search()
re.search(r"speak", sentence1)
<re.Match object; span=(4, 9), match='speak'>
sentence1[4:9]
'speak'
re.findall()
re.findall(r"speak", sentence1)
['speak', 'speak']
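In practice you usually pull values out of the match object rather than reading its printed form, and re.search() returns None when nothing matches. The sketch below shows one way to handle both cases; re.finditer() is a further re function, not covered above, that yields a match object for every match.

```python
import re

sentence1 = "The speaker is speaking."

# re.search() returns a Match object for the first match only.
match = re.search(r"speak", sentence1)
if match is not None:
    start, end = match.span()      # positions of the match
    found = sentence1[start:end]   # slice the original string
    same = match.group()           # the matched text, directly

# When there is no match, re.search() returns None.
no_match = re.search(r"listen", sentence1)

# re.finditer() gives structured information for every match.
all_spans = [m.span() for m in re.finditer(r"speak", sentence1)]

print(start, end, found, same)  # 4 9 speak speak
print(no_match)                 # None
print(all_spans)                # [(4, 9), (15, 20)]
```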
Simple character searches
As in the examples above, literal characters in the regex match themselves.
speak_regex = r"speak"
get_regex_url(speak_regex)
Options
If you want some characters to be chosen from a set of options, place them in [].
vowels_regex = r"[aeiou]"
get_regex_url(vowels_regex)
re.findall(r"[aeiou]", sentence1)
['e', 'e', 'a', 'e', 'i', 'e', 'a', 'i']
the_regex = r"[Tt]he"
get_regex_url(the_regex)
re.findall(the_regex, sentence1)
['The']
Ranges
Ranges of characters or numbers can be given inside [], for example [a-z], [A-Z], or [0-9].
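A short sketch of ranges in use. The example string is hypothetical, chosen only to contain lowercase letters, uppercase letters, and digits.

```python
import re

text = "Room B12, floor 3"  # hypothetical example string

lower = re.findall(r"[a-z]", text)   # lowercase letters only
upper = re.findall(r"[A-Z]", text)   # uppercase letters only
digits = re.findall(r"[0-9]", text)  # digits only

print(lower)   # ['o', 'o', 'm', 'f', 'l', 'o', 'o', 'r']
print(upper)   # ['R', 'B']
print(digits)  # ['1', '2', '3']
```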
“Metacharacters”
\w == [A-Za-z0-9_] - word characters
\W == [^A-Za-z0-9_] - non-word characters
\d == [0-9] - digits
\D == [^0-9] - non-digits
\s == [ \t\n] - any whitespace character
\S == [^ \t\n] - non-whitespace
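The equivalences above can be checked on a small example. One caveat worth knowing: for ordinary Python strings, re treats \w, \d, and \s as Unicode-aware by default, so the bracketed forms are exact only for ASCII text or when the re.ASCII flag is set. The example strings here are hypothetical.

```python
import re

text = "Room 12, floor 3"  # hypothetical example string

digits = re.findall(r"\d", text)    # same result as r"[0-9]" on this text
non_word = re.findall(r"\W", text)  # the spaces and the comma
spaces = re.findall(r"\s", text)

# By default \w also matches non-ASCII letters such as é (U+00E9);
# re.ASCII restricts it to [A-Za-z0-9_].
unicode_word = re.findall(r"\w", "n\u00e9")
ascii_word = re.findall(r"\w", "n\u00e9", flags=re.ASCII)

print(digits)        # ['1', '2', '3']
print(non_word)      # [' ', ',', ' ', ' ']
print(spaces)        # [' ', ' ', ' ']
print(unicode_word)  # ['n', 'é']
print(ascii_word)    # ['n']
```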
Any Character
To match any character (letter, number, punctuation, space, etc.) use . or “dot”.
re.findall(
    # return every word character and
    # the following character
    r"\w.",
    sentence1
)
['Th', 'e ', 'sp', 'ea', 'ke', 'r ', 'is', 'sp', 'ea', 'ki', 'ng']
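One exception to "any character": by default the dot does not match a newline. A minimal sketch, using the re.DOTALL flag (not covered above) to lift that restriction:

```python
import re

two_lines = "a\nb"  # hypothetical two-line string

# Default behaviour: the newline is skipped.
dot_default = re.findall(r".", two_lines)

# With re.DOTALL, the dot matches the newline as well.
dot_all = re.findall(r".", two_lines, flags=re.DOTALL)

print(dot_default)  # ['a', 'b']
print(dot_all)      # ['a', '\n', 'b']
```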
Escaping special symbols
If you wanted to find the actual period in sentence1, you’d have to “escape” the . with a preceding backslash (\).
re.findall(r"\.", sentence1)
['.']
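To see why the escape matters, compare the escaped and unescaped dot on a small string. The sketch also shows two alternatives that are not covered above: placing the dot inside [] and letting re.escape() add the backslashes for you. The example string is hypothetical.

```python
import re

text = "a.b"  # hypothetical example string

unescaped = re.findall(r".", text)   # dot as "any character"
escaped = re.findall(r"\.", text)    # a literal period
in_set = re.findall(r"[.]", text)    # inside [], the dot is also literal

# re.escape() backslash-escapes special characters in a string.
escaped_pattern = re.escape("3.14")

print(unescaped)        # ['a', '.', 'b']
print(escaped)          # ['.']
print(in_set)           # ['.']
print(escaped_pattern)  # 3\.14
```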
Modifiers
Modifiers come after the definition of a single character, and define how many times that character can appear.
a? = zero or one a
a+ = one or more a
a* = zero or more a
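A small sketch comparing the three modifiers on the same text. The example string is hypothetical; each pattern looks for a b followed by some number of a characters.

```python
import re

text = "b ba baa baaa"  # hypothetical example string

zero_or_one = re.findall(r"ba?", text)   # b, optionally followed by one a
one_or_more = re.findall(r"ba+", text)   # b followed by at least one a
zero_or_more = re.findall(r"ba*", text)  # b followed by any number of a

print(zero_or_one)   # ['b', 'ba', 'ba', 'ba']
print(one_or_more)   # ['ba', 'baa', 'baaa']
print(zero_or_more)  # ['b', 'ba', 'baa', 'baaa']
```

Note that the modifier applies only to the character immediately before it, so in ba? the b is always required.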
Grouping
You can define groupings within regular expressions. The effect of these groupings depends on what kind of regex function you’re using. For re.findall(), it matches the whole pattern, but returns just the text from the grouping.
sentence2 = "The big bear and the small bear ran away."
get_regex_url(r"[Tt]he (\w+) bear")
re.findall(r"[Tt]he (\w+) bear", sentence2)
['big', 'small']
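For comparison, here is a sketch of how the same grouping behaves with re.search(), where the match object keeps both the whole match and the group, and what re.findall() returns when a pattern has more than one group.

```python
import re

sentence2 = "The big bear and the small bear ran away."

# With re.search(), group(0) is the whole match and group(1) is the
# text captured by the first set of parentheses.
match = re.search(r"[Tt]he (\w+) bear", sentence2)
whole = match.group(0)
first_group = match.group(1)

# With two groups, re.findall() returns one tuple per match.
pairs = re.findall(r"([Tt]he) (\w+) bear", sentence2)

print(whole)        # The big bear
print(first_group)  # big
print(pairs)        # [('The', 'big'), ('the', 'small')]
```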
Boundaries
^the == Finds “the” at the start of a string.
the$ == Finds “ the” at the end of a string.
\bthe\b == Finds “the” in between word boundaries.
get_regex_url(r"^the ")
get_regex_url(r" the$")
get_regex_url(r"\bthe\b")
[\bthe\b](https://regexper.com/#%5Cbthe%5Cb)
sentence3 = "I saw the other bear."
re.findall(r"the", sentence3)
['the', 'the']
The second “the” there comes from inside “other”.
re.findall(r"\bthe\b", sentence3)
['the']
sentence3
'I saw the other bear.'
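The ^ and $ anchors can be sketched the same way. The example string below is hypothetical, chosen so that “the” occurs at both the start and the end.

```python
import re

text = "the bear saw the"  # hypothetical example string

anywhere = re.findall(r"the", text)     # both occurrences
at_start = re.findall(r"^the ", text)   # only the one at the start
at_end = re.findall(r" the$", text)     # only the one at the end

print(anywhere)  # ['the', 'the']
print(at_start)  # ['the ']
print(at_end)    # [' the']
```

The spaces in the patterns are part of the match, which is why they show up in the returned strings.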
Citation
@online{fruehwald2024,
author = {Fruehwald, Josef},
title = {Regular {Expressions,} {Quickly}},
date = {2024-01-25},
url = {https://lin511-2024.github.io/notes/programming/02_regex.html},
langid = {en}
}