Regular Expressions
Validate phone numbers in a contact list
This function is supposed to accept only strings that are phone numbers and
reject everything else. Call it with "call 555-867-5309 now" and with
"555-867-53090000". Do both results match your expectations?
import re

PATTERN = r"\d{3}-\d{3}-\d{4}"

def is_valid_phone(text):
    return bool(re.search(PATTERN, text))

if __name__ == "__main__":
    tests = [
        "555-867-5309",
        "call 555-867-5309 now",
        "555-867-53090000",
        "not a number",
    ]
    for t in tests:
        print(f"{'Valid' if is_valid_phone(t) else 'Invalid':8s}: {t!r}")
Show explanation
The bug is using re.search instead of re.fullmatch. re.search finds the
pattern anywhere in the string, so a sentence that contains a phone number
passes the check even though it is not a phone number, and a string of extra
digits such as "555-867-53090000" also passes because the pattern matches
the first ten digits and ignores the rest.
Shows: the difference between re.search (match anywhere), re.match
(match at the start), and re.fullmatch (match the entire string), and
why validation functions almost always need re.fullmatch.
To find it: print re.search(PATTERN, "call 555-867-5309 now").span().
The span (5, 17) shows the match starts at index five, not index
zero, confirming that the match is embedded inside the string rather than
being the whole string. Replace re.search with re.fullmatch so that only
strings that are entirely a phone number pass.
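The three matching functions can be compared side by side. The following is a minimal sketch that reuses PATTERN and is_valid_phone from the exercise; the variable name embedded is introduced here only for illustration.

```python
import re

PATTERN = r"\d{3}-\d{3}-\d{4}"
embedded = "call 555-867-5309 now"  # illustrative name, not from the exercise

# re.search scans the whole string and reports where the match sits.
print(re.search(PATTERN, embedded).span())  # (5, 17)

# re.match only tries index 0, where the text is "call", so it fails.
print(re.match(PATTERN, embedded))  # None

def is_valid_phone(text):
    # re.fullmatch succeeds only when the pattern covers the entire string.
    return re.fullmatch(PATTERN, text) is not None

print(is_valid_phone("555-867-5309"))      # True
print(is_valid_phone(embedded))            # False
print(is_valid_phone("555-867-53090000"))  # False
```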
Detect timestamps in log entries
This function is supposed to return True for any log line that contains a
date. Call it with "ERROR on 2024-01-15: disk full". Does it return the
result you expect?
import re

PATTERN = r"\d{4}-\d{2}-\d{2}"

def contains_date(line):
    return bool(re.match(PATTERN, line))

if __name__ == "__main__":
    lines = [
        "2024-01-15: server started",
        "ERROR on 2024-01-15: disk full",
        "no date in this line",
    ]
    for line in lines:
        label = "has date" if contains_date(line) else "no date "
        print(f"{label}: {line!r}")
Show explanation
The bug is using re.match instead of re.search. re.match only checks
whether the pattern appears at the start of the string, so a date that
appears after other text is never found.
Shows: re.match is not a whole-string check; it is an anchored-at-the-start
check. The counterpart to re.match for finding a pattern anywhere is
re.search.
To find it: call re.match(PATTERN, "ERROR on 2024-01-15: disk full") and
print the result. It returns None because "ERROR" does not match
\d{4}-\d{2}-\d{2}. Replace re.match with re.search to check anywhere
in the string.
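A short sketch of the difference, using the PATTERN and the failing log line from the exercise. The variable name line is reused here at module level for illustration only.

```python
import re

PATTERN = r"\d{4}-\d{2}-\d{2}"
line = "ERROR on 2024-01-15: disk full"

# re.match is anchored at index 0, where the text is "ERROR".
print(re.match(PATTERN, line))  # None

# re.search scans forward until it finds the date.
print(re.search(PATTERN, line).group(0))  # 2024-01-15

def contains_date(line):
    return re.search(PATTERN, line) is not None

print(contains_date("no date in this line"))  # False
```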
Count action items in source code
Run this script and compare the count it prints to the number of action items you can see in the source string. Do the counts agree?
import re

PATTERN = r"# TODO:"

def find_todos(source):
    return re.findall(PATTERN, source)

if __name__ == "__main__":
    source = (
        "x = 1 # TODO: replace with config value\n"
        "y = 2 # todo: remove this line\n"
        "z = 3 # Todo: clean up before release\n"
    )
    found = find_todos(source)
    print(f"Found {len(found)} TODO comment(s), expected 3")
Show explanation
The bug is that the pattern # TODO: is case-sensitive, so # todo: and
# Todo: are not matched, and the function reports one instead of three.
Shows: Python's re module is case-sensitive by default, and the same word
spelled with different capitalisation is treated as a different pattern.
To find it: call re.findall(r"# TODO:", "# todo: example") and observe
that it returns an empty list. Pass re.IGNORECASE as the third argument
(flags) to re.findall to make the match case-insensitive.
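One way the corrected function might look is sketched below. Passing the flag by keyword is a stylistic choice made here, not something the exercise prescribes; it makes clear that the flag is not the string being searched.

```python
import re

PATTERN = r"# TODO:"

def find_todos(source):
    # flags comes after the pattern and the string in re.findall.
    return re.findall(PATTERN, source, flags=re.IGNORECASE)

source = (
    "x = 1 # TODO: replace with config value\n"
    "y = 2 # todo: remove this line\n"
    "z = 3 # Todo: clean up before release\n"
)
print(find_todos(source))  # ['# TODO:', '# todo:', '# Todo:']
```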
Find headings in a document
Run this script and count the Markdown headings in the document string by hand. Does the function find all of them?
import re

PATTERN = r"^#{1,3} .+"

def find_headings(text):
    return re.findall(PATTERN, text)

if __name__ == "__main__":
    doc = "# Introduction\n## Background\n### Methods\nParagraph text.\n## Results"
    headings = find_headings(doc)
    print(f"Found {len(headings)} heading(s): {headings}")
    print("Expected 4 headings")
Show explanation
The bug is that ^ without re.MULTILINE matches only the very start of the
entire string, not the start of each line. The first heading "# Introduction"
is found because it appears at position zero; all subsequent headings are
missed because they begin after a newline character, not at the string's start.
Shows: in Python's re module, ^ and $ match the start and end of the
string by default. Pass re.MULTILINE to make them match the start and end
of each line instead.
To find it: print bool(re.search(r"^## Background", doc)) — it returns
False. Then print bool(re.search(r"^## Background", doc, re.MULTILINE))
— it returns True. Add re.MULTILINE to the re.findall call to fix the
function.
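The effect of the flag can be seen by running the same pattern twice on the exercise's document string, as in this small sketch.

```python
import re

PATTERN = r"^#{1,3} .+"
doc = "# Introduction\n## Background\n### Methods\nParagraph text.\n## Results"

# Default: ^ matches only at the start of the whole string.
print(re.findall(PATTERN, doc))
# ['# Introduction']

# re.MULTILINE: ^ also matches immediately after each newline.
print(re.findall(PATTERN, doc, flags=re.MULTILINE))
# ['# Introduction', '## Background', '### Methods', '## Results']
```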
Find error messages in a log file
Run this script on the sample log string. The exception message spans more than one line. Does the function find it?
import re

PATTERN = r"EXCEPTION: (.+) END"

def find_exceptions(log):
    return re.findall(PATTERN, log)

if __name__ == "__main__":
    log = "EXCEPTION: ValueError\n line 42 in process\n line 10 in main\n END"
    results = find_exceptions(log)
    print(f"Found {len(results)} exception(s), expected 1")
    print(f"Matches: {results}")
Show explanation
The bug is that the . metacharacter does not match newline characters by
default, so the pattern never bridges the line break between "ValueError"
and "END".
Shows: . matches any character except a newline unless re.DOTALL is
passed. This is the source of many silent failures when patterns are applied
to multi-line strings.
To find it: print re.search(r".", "\n") — it returns None, confirming
that . does not match \n. Pass re.DOTALL as a third argument to
re.findall to make . match newlines and let the pattern span lines.
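The sketch below shows the behaviour of . with and without re.DOTALL on the exercise's log string. The variable name matches is introduced here for illustration.

```python
import re

PATTERN = r"EXCEPTION: (.+) END"
log = "EXCEPTION: ValueError\n line 42 in process\n line 10 in main\n END"

# By default . refuses to match a newline.
print(re.search(r".", "\n"))  # None

# Without DOTALL, (.+) stops at the first line break and " END" is never reached.
print(re.findall(PATTERN, log))  # []

# With DOTALL, (.+) can run across the line breaks.
matches = re.findall(PATTERN, log, flags=re.DOTALL)
print(len(matches))  # 1
print(repr(matches[0]))
```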
Recognize column names in a CSV header
Run this script and compare the list it prints to the column names in the header string. Which columns are missing from the output?
import re

COLUMN_RE = re.compile(r"name|email|phone")

def find_columns(header):
    return [col for col in header.split(",") if COLUMN_RE.search(col)]

if __name__ == "__main__":
    header = "Name,Email,Phone,Address"
    found = find_columns(header)
    print(f"Recognized columns: {found}")
    print("Expected: ['Name', 'Email', 'Phone']")
Show explanation
The bug is that re.compile is called without re.IGNORECASE, so the
compiled pattern only matches lowercase letters. The header uses title-case
column names ("Name", "Email", "Phone"), none of which match the
all-lowercase pattern.
Shows: flags must be supplied at compile time when using re.compile; they
cannot be added later. A compiled pattern without a flag behaves differently
from one compiled with it.
To find it: call COLUMN_RE.search("Name") and observe it returns None.
Then call re.search(r"name", "Name", re.IGNORECASE) and observe it returns
a match. Pass re.IGNORECASE as a second argument to re.compile to fix
the pattern.
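A minimal sketch of the corrected compile call, using the names from the exercise:

```python
import re

# The flag is supplied when the pattern is compiled.
COLUMN_RE = re.compile(r"name|email|phone", re.IGNORECASE)

def find_columns(header):
    return [col for col in header.split(",") if COLUMN_RE.search(col)]

print(find_columns("Name,Email,Phone,Address"))  # ['Name', 'Email', 'Phone']
```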
Extract prices from a web page
Run this script and count the <amount> tags in the HTML string by hand.
Does the function return the same number of prices?
import re

PATTERN = r"<amount>(.*)</amount>"

def extract_amounts(html):
    return re.findall(PATTERN, html)

if __name__ == "__main__":
    html = "<amount>9.99</amount><tax>1.00</tax><amount>14.99</amount>"
    amounts = extract_amounts(html)
    print(f"Amounts: {amounts}")
    print(f"Expected 2 amounts, got {len(amounts)}")
Show explanation
The bug is that .* is greedy: it matches as many characters as possible.
The pattern <amount>(.*)</amount> starts at the first <amount> and
expands .* until it reaches the last </amount>, capturing everything in
between — including the middle tag and the tax field — as a single match.
Shows: greedy quantifiers extend as far right as possible; lazy quantifiers
(.*?) extend only as far as necessary. Greedy behaviour is the default and
is often wrong when the surrounding delimiters appear more than once.
To find it: print the single item returned by extract_amounts and observe
that it contains the entire middle of the string. Replace .* with .*? to
make the quantifier lazy, so each <amount> tag matches only up to its own
</amount>.
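Running the greedy and lazy versions on the same input makes the difference visible. This sketch uses the HTML string from the exercise.

```python
import re

html = "<amount>9.99</amount><tax>1.00</tax><amount>14.99</amount>"

# Greedy: .* runs to the last </amount> in the string.
print(re.findall(r"<amount>(.*)</amount>", html))
# ['9.99</amount><tax>1.00</tax><amount>14.99']

# Lazy: .*? stops at the first </amount> it can reach.
print(re.findall(r"<amount>(.*?)</amount>", html))
# ['9.99', '14.99']
```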
Filter empty attributes from an HTML element
Run this script and inspect the list it returns. Should an empty id
attribute be included in that list?
import re

PATTERN = r'id="(\w*)"'

def find_ids(html):
    return re.findall(PATTERN, html)

if __name__ == "__main__":
    html = '<div id="main"><p id="">text</p><span id="footer"></span>'
    ids = find_ids(html)
    print(f"IDs: {ids}")
    print("Expected only non-empty IDs: ['main', 'footer']")
Show explanation
The bug is \w*, which matches zero or more word characters. An empty
id="" attribute satisfies \w* (with zero characters), so the empty string
"" is captured and appears in the results.
Shows: * and + are easy to confuse. Use \w+ when the match must contain
at least one character and \w* only when an empty match is acceptable.
To find it: test bool(re.search(r"\w*", "")) — it returns True, confirming
that \w* accepts an empty string. Replace \w* with \w+ in the pattern to
require at least one word character inside the attribute value.
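The two quantifiers can be compared directly on the exercise's input, as in this brief sketch.

```python
import re

html = '<div id="main"><p id="">text</p><span id="footer"></span>'

# * allows zero word characters, so the empty attribute is captured.
print(re.findall(r'id="(\w*)"', html))  # ['main', '', 'footer']

# + demands at least one, so id="" no longer matches.
print(re.findall(r'id="(\w+)"', html))  # ['main', 'footer']
```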
Validate postal codes in an address database
Run this script and check which strings it identifies as ZIP codes. Examine the input and the output side by side.
import re

PATTERN = r"\b\d{4}\b"

def find_zipcodes(text):
    return re.findall(PATTERN, text)

if __name__ == "__main__":
    text = "Ship to 90210 or 02134; apartment number is 4201"
    found = find_zipcodes(text)
    print(f"ZIP codes found: {found}")
    print("Expected: ['90210', '02134']")
Show explanation
The bug is \d{4}, which matches exactly four digits, but US ZIP codes
contain five. None of the valid ZIP codes in the input are found, and the
four-digit apartment number 4201 is matched instead.
Shows: off-by-one errors appear in {n} quantifiers as well as in loop
indices. Always verify the exact length required by the format you are
matching.
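To find it: evaluate re.findall(PATTERN, "90210") and observe that it
returns an empty list: with \b on both sides, a five-digit run can never
satisfy \d{4}. Without the boundaries, re.search(r"\d{4}", "90210") does
succeed, but only by matching the first four digits of the five-digit ZIP
code, not all five. Change {4} to {5} so the pattern requires exactly five
digits.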
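The following sketch runs three variants of the pattern on the exercise's input to show how the count and the word boundaries interact.

```python
import re

text = "Ship to 90210 or 02134; apartment number is 4201"

# Four digits between word boundaries: only the apartment number qualifies.
print(re.findall(r"\b\d{4}\b", text))  # ['4201']

# Four digits without boundaries: partial matches inside the ZIP codes.
print(re.findall(r"\d{4}", text))      # ['9021', '0213', '4201']

# Five digits between word boundaries: the two ZIP codes.
print(re.findall(r"\b\d{5}\b", text))  # ['90210', '02134']
```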
Filter server addresses from a configuration
This function is supposed to accept only valid IP address strings. Call it
with "192X168X1X1". Does it reject that input?
import re

PATTERN = r"\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}"

def is_ip_address(text):
    return bool(re.fullmatch(PATTERN, text))

if __name__ == "__main__":
    candidates = [
        "192.168.1.1",
        "192X168X1X1",
        "10.0.0.256",
        "not-an-ip",
    ]
    for c in candidates:
        print(f"{'Valid' if is_ip_address(c) else 'Invalid':8s}: {c!r}")
Show explanation
The bug is that the dots separating the octets are not escaped. In a regular
expression, . matches any character, so \d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}
accepts any single character between the digit groups — including X.
Shows: literal characters that are also regex metacharacters (. * + ? [ ] { }
( ) ^ $ | \) must be escaped with a backslash to match themselves.
To find it: test bool(re.fullmatch(r".", "X")) — it returns True, showing
that an unescaped dot accepts "X". Replace each . in the pattern with \.
so only a literal period is accepted between the digit groups.
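A sketch of the pattern with escaped dots follows. It addresses only the separator bug described above: the pattern still checks the shape of the string rather than the numeric range of each octet, so "10.0.0.256" continues to pass.

```python
import re

PATTERN = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

def is_ip_address(text):
    return re.fullmatch(PATTERN, text) is not None

print(is_ip_address("192.168.1.1"))  # True
print(is_ip_address("192X168X1X1"))  # False
print(is_ip_address("10.0.0.256"))   # True (octet range is not checked)
print(is_ip_address("not-an-ip"))    # False
```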
Remove markup from user submissions
Run this script and look at what strip_tags returns. Does the HTML it
receives end up as plain text?
import re

PATTERN = r"<[<>]+>"

def strip_tags(html):
    return re.sub(PATTERN, "", html)

if __name__ == "__main__":
    html = "<b>Hello</b>, <i>world</i>!"
    result = strip_tags(html)
    print(f"Result: {result!r}")
    print("Expected: 'Hello, world!'")
Show explanation
The bug is the character class [<>], which matches only the characters <
and >. Because normal HTML tags contain letters, not angle brackets, the
pattern <[<>]+> never matches any real tag, and nothing is removed.
Shows: inside a character class, ^ at the very start negates the class.
[<>] means "either < or >", while [^<>] means "any character that
is not < or >".
To find it: test bool(re.search(r"<[<>]+>", "<b>")) — it returns False,
because the character b is not in [<>]. Change [<>] to [^<>] so the
pattern matches any sequence of non-bracket characters between the two angle
brackets.
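With the negated class in place, the function from the exercise might look like this sketch:

```python
import re

# [^<>]+ means one or more characters that are neither < nor >.
PATTERN = r"<[^<>]+>"

def strip_tags(html):
    return re.sub(PATTERN, "", html)

print(repr(strip_tags("<b>Hello</b>, <i>world</i>!")))  # 'Hello, world!'
```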
Classify links by protocol
Run this script and check the scheme reported for the HTTPS URL. Does the label match the URL?
import re

PATTERN = r"http|https"

def get_scheme(url):
    m = re.match(PATTERN, url)
    return m.group(0) if m else None

if __name__ == "__main__":
    urls = [
        "http://example.com/page",
        "https://secure.example.com/login",
    ]
    for url in urls:
        print(f"scheme={get_scheme(url)!r} url={url!r}")
Show explanation
The bug is that "http" appears before "https" in the alternation. Regular
expression alternation is ordered: the engine tries each alternative from left
to right and stops at the first match. Because "http" is a prefix of
"https", it matches the first four characters of any HTTPS URL and the
engine never tries the longer alternative.
Shows: when one alternative is a prefix of another, put the longer one first.
r"https|http" or the equivalent shorthand r"https?" both work correctly.
To find it: evaluate re.match(r"http|https", "https://example.com").group(0)
— it returns "http", not "https". Swap the order to r"https|http" or
rewrite the pattern as r"https?".
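To compare the three spellings of the alternation, the sketch below gives get_scheme an extra pattern parameter. That parameter is a hypothetical change made for this comparison and is not part of the exercise's function.

```python
import re

def get_scheme(url, pattern):
    # pattern is a hypothetical extra parameter used only for this comparison.
    m = re.match(pattern, url)
    return m.group(0) if m else None

url = "https://secure.example.com/login"

print(get_scheme(url, r"http|https"))  # http  (first alternative wins)
print(get_scheme(url, r"https|http"))  # https
print(get_scheme(url, r"https?"))      # https
print(get_scheme("http://example.com/page", r"https?"))  # http
```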
Count a word in a paragraph
Run this script and count the standalone occurrences of "log" in the text
by hand. Does the function return the same number?
import re

PATTERN = r"log"

def count_word(text, pattern):
    return len(re.findall(pattern, text))

if __name__ == "__main__":
    text = "Check the log file before login; the dialog shows the catalog entry"
    count = count_word(text, PATTERN)
    print(f"Count of 'log': {count}")
    print("Expected: 1 (only the standalone word 'log')")
Show explanation
The bug is that the pattern has no word boundaries, so "log" matches inside
"login", "dialog", and "catalog" as well as the standalone word, giving
a count of four instead of one.
Shows: \b is a zero-width assertion that matches the position between a word
character and a non-word character. Use r"\blog\b" to match the whole word
only.
To find it: print re.findall(r"log", text) and examine the list — you will
see four items. Then print re.findall(r"\blog\b", text) and see only one.
Add \b before and after the word in the pattern.
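The effect of the word boundaries can be checked on the exercise's text with a short sketch:

```python
import re

text = "Check the log file before login; the dialog shows the catalog entry"

# No boundaries: matches inside login, dialog, and catalog as well.
print(len(re.findall(r"log", text)))  # 4

# \b on both sides: only the standalone word.
print(re.findall(r"\blog\b", text))   # ['log']
```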
Reformat names in a mailing list
Run this script and check whether the output names are in "First Last" order. Are they?
import re

PATTERN = r"(\w+),\s*(\w+)"

def reformat_name(name):
    return re.sub(PATTERN, r"\1 \2", name)

if __name__ == "__main__":
    names = ["Smith, Alice", "Jones, Bob", "Garcia, Carmen"]
    for name in names:
        print(f"{name!r} -> {reformat_name(name)!r}")
Show explanation
The bug is that the replacement string r"\1 \2" keeps the groups in their
original order. Group 1 captures the last name (everything before the comma)
and group 2 captures the first name, so r"\1 \2" produces "Last First"
rather than "First Last".
Shows: groups are numbered left to right by opening parenthesis. Drawing a quick table of which group captures which field before writing the replacement string prevents this error.
To find it: add a print statement inside reformat_name to show
re.search(PATTERN, name).groups() for the first input — it returns
('Smith', 'Alice'), confirming that group 1 is the last name. Change the
replacement to r"\2 \1" to put the first name before the last name.
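A small sketch of the check and the corrected replacement, using the pattern from the exercise:

```python
import re

PATTERN = r"(\w+),\s*(\w+)"

# Group 1 is the text before the comma, group 2 the text after it.
print(re.search(PATTERN, "Smith, Alice").groups())  # ('Smith', 'Alice')

def reformat_name(name):
    # Emit group 2 (first name) before group 1 (last name).
    return re.sub(PATTERN, r"\2 \1", name)

print(reformat_name("Smith, Alice"))  # Alice Smith
```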
Redact sensitive fields in a report
Run this script and inspect the repr() of the result. Are the sensitive
labels preserved in the output as expected?
import re

# [\d-]+ keeps the comma after the SSN out of the match, so the expected
# output below is reachable once the replacement string is fixed.
PATTERN = r"(SSN|DOB):\s*[\d-]+"

def redact(text):
    return re.sub(PATTERN, "\1: [redacted]", text)

if __name__ == "__main__":
    text = "Patient SSN: 123-45-6789, DOB: 1990-01-01"
    result = redact(text)
    print(repr(result))
    print("Expected: 'Patient SSN: [redacted], DOB: [redacted]'")
Show explanation
The bug is that the replacement string "\1: [redacted]" is a regular Python
string, not a raw string. Python's string parser converts \1 to the SOH
control character (code point 1) before re.sub ever sees it. The re module
therefore never sees a backreference; it inserts the control character
literally, so the output contains a garbled character instead of the matched
label.
Shows: replacement strings for re.sub that contain backreferences must be
raw strings (prefix r) so Python does not interpret the backslash before
re.sub processes it.
To find it: print repr("\1") — it shows '\x01', confirming the string
contains a control character, not a backslash followed by 1. Change the
replacement to r"\1: [redacted]" to pass the literal two-character sequence
\1 to re.sub.
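The sketch below contrasts the two replacement strings. One assumption to note: the value part of the pattern is narrowed here to digits and hyphens ([\d-]+) so that the comma after the SSN stays outside the match.

```python
import re

# What Python hands to re.sub in each case:
print(repr("\1"))   # '\x01'  (one control character)
print(repr(r"\1"))  # '\\1'   (a backslash followed by 1)

# Assumption for this sketch: values consist of digits and hyphens only.
PATTERN = r"(SSN|DOB):\s*[\d-]+"
text = "Patient SSN: 123-45-6789, DOB: 1990-01-01"

# Regular string: the control character is inserted literally.
print(repr(re.sub(PATTERN, "\1: [redacted]", text)))
# 'Patient \x01: [redacted], \x01: [redacted]'

# Raw string: re.sub sees \1 and substitutes the captured label.
print(repr(re.sub(PATTERN, r"\1: [redacted]", text)))
# 'Patient SSN: [redacted], DOB: [redacted]'
```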
List settings from a configuration string
Run this script and look at the values it finds. What is missing from each item in the list?
import re

PATTERN = r"\w+=(\w+)"

def list_values(text):
    return re.findall(PATTERN, text)

if __name__ == "__main__":
    config = "host=localhost port=8080 debug=true"
    values = list_values(config)
    print(f"Found: {values}")
    print("Expected: ['host=localhost', 'port=8080', 'debug=true']")
Show explanation
The bug is that a capturing group (\w+) is present in the pattern. When
re.findall finds a pattern that contains one or more groups, it returns the
group contents rather than the full match strings. Here every item in the
result is just the value portion, with the key and the = sign missing.
Shows: re.findall has two modes — when there are no groups it returns a list
of full match strings; when there is at least one group it returns a list of
the group contents (or a list of tuples when there are multiple groups).
To find it: evaluate re.findall(r"\w+=(\w+)", "a=1 b=2") — it returns
['1', '2'], not ['a=1', 'b=2']. Remove the parentheses from the pattern
to get full matches, or capture both key and value in separate groups and
handle the tuples in the calling code.
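The three return shapes of re.findall can be seen on the exercise's config string. In this sketch the final dict() call is one possible way of handling the tuples, not something the exercise requires, and the name settings is introduced for illustration.

```python
import re

config = "host=localhost port=8080 debug=true"

# No groups: full match strings.
print(re.findall(r"\w+=\w+", config))
# ['host=localhost', 'port=8080', 'debug=true']

# One group: only that group's contents.
print(re.findall(r"\w+=(\w+)", config))
# ['localhost', '8080', 'true']

# Two groups: a list of (key, value) tuples.
pairs = re.findall(r"(\w+)=(\w+)", config)
print(pairs)
# [('host', 'localhost'), ('port', '8080'), ('debug', 'true')]

# The tuples convert directly into a dictionary.
settings = dict(pairs)
print(settings["port"])  # 8080
```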
Pull dates from a report
Run this script and examine what it prints. Are the values in the list complete dates, or something shorter?
import re

PATTERN = r"(\d{4})-\d{2}-\d{2}"

def find_dates(text):
    return re.findall(PATTERN, text)

if __name__ == "__main__":
    text = "Events scheduled for 2024-01-15 and 2024-03-22."
    dates = find_dates(text)
    print(f"Dates: {dates}")
    print("Expected: ['2024-01-15', '2024-03-22']")
Show explanation
The bug is a capturing group around the year portion of the pattern. Because
the group exists, re.findall returns only what the group captured — the
four-digit year — rather than the full date string that matched.
Shows: even a single capturing group changes what re.findall returns. Use a
non-capturing group (?:...) when grouping is needed for structure but the
full match is what you want to collect.
To find it: compare re.findall(r"(\d{4})-\d{2}-\d{2}", text) with
re.findall(r"\d{4}-\d{2}-\d{2}", text) using the same input. The first
returns years; the second returns full dates. Either remove the parentheses or
replace them with (?:...) to restore the full-match behaviour.
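The sketch below compares the capturing and non-capturing forms on the exercise's text. The last line shows a further option that the explanation above does not mention: re.finditer with group(0), which yields the full match even when the capturing group is kept.

```python
import re

text = "Events scheduled for 2024-01-15 and 2024-03-22."

# Capturing group: findall returns only the year.
print(re.findall(r"(\d{4})-\d{2}-\d{2}", text))
# ['2024', '2024']

# Non-capturing group: findall returns the full match.
print(re.findall(r"(?:\d{4})-\d{2}-\d{2}", text))
# ['2024-01-15', '2024-03-22']

# Alternative: keep the group and read group(0) from each match object.
print([m.group(0) for m in re.finditer(r"(\d{4})-\d{2}-\d{2}", text)])
# ['2024-01-15', '2024-03-22']
```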
Check password strength
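Run this script. Which passwords does it mark as strong? The intended rule is at least eight letters or digits, including at least one digit and at least one uppercase letter. Are any of the accepted passwords actually weak by that rule?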
import re

PATTERN = r"(?=.*\d)[A-Za-z\d]{8,}"

def is_strong(password):
    return bool(re.fullmatch(PATTERN, password))

if __name__ == "__main__":
    passwords = ["Password1", "password1", "PASSWORD1", "12345678"]
    for pw in passwords:
        print(f"{'Strong' if is_strong(pw) else 'Weak':8s}: {pw!r}")
Show explanation
The bug is that the pattern checks for at least one digit using
(?=.*\d) but never checks for an uppercase letter. The intended rule
requires both a digit and an uppercase letter, so "password1" (no
uppercase) and "12345678" (no letter) both pass incorrectly.
Shows: a lookahead (?=...) is a zero-width assertion: it checks a condition
without consuming characters. Each independent requirement needs its own
lookahead, because a lookahead asserts only the pattern written inside it.
To find it: evaluate bool(re.fullmatch(PATTERN, "password1")) — it returns
True even though "password1" contains no uppercase letter. Add a second
lookahead (?=.*[A-Z]) immediately after the first to enforce both
requirements: r"(?=.*\d)(?=.*[A-Z])[A-Za-z\d]{8,}".
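A sketch of the function with both lookaheads, run against the four passwords from the exercise:

```python
import re

# Two zero-width checks (a digit, an uppercase letter), then the body:
# eight or more letters or digits.
PATTERN = r"(?=.*\d)(?=.*[A-Z])[A-Za-z\d]{8,}"

def is_strong(password):
    return re.fullmatch(PATTERN, password) is not None

print(is_strong("Password1"))  # True
print(is_strong("password1"))  # False (no uppercase letter)
print(is_strong("PASSWORD1"))  # True
print(is_strong("12345678"))   # False (no uppercase letter)
```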