REGULAR EXPRESSION
Also known as regex.
It is a sequence of characters that defines a search pattern.
It can be used to check if a string contains the specified search pattern.
While it is a very powerful tool, its complexity makes it easy to write patterns that are hard to read or that match more than intended.
Different languages use different regex engines. This means that a regex written for Python, for example, may be interpreted differently in JavaScript.
Practice Regex here:
Python has a built-in package called `re`, which can be used to work with regular expressions.
It provides functions for searching, splitting, and replacing strings based on regular expression patterns.
Some of the most commonly used functions in the `re` module include:
- `re.search(pattern, string)`: Searches for the first occurrence of the pattern in the string and returns a match object if a match is found, or `None` if no match is found.

```python
import re

text = "The price of the item is $10."
pattern = r"\$\d+"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match found.")
# Output: Match found: $10
```

The pattern represents a dollar sign followed by one or more digits. The `group()` method of the match object returns the matching string.
- `re.findall(pattern, string)`: Returns all non-overlapping occurrences of the pattern in the string as a list of strings. If no matches are found, an empty list is returned.

```python
import re

text = "The email addresses are john@example.com and jane@example.com."
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
emails = re.findall(pattern, text)
print("Email addresses:", emails)
# Output: Email addresses: ['john@example.com', 'jane@example.com']
```

The pattern `r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"` is a regular expression pattern for matching email addresses. Here's how it works:

- `[a-zA-Z0-9._%+-]+`: This part of the pattern matches one or more characters that are either lowercase or uppercase letters, digits, dot, underscore, percent, plus, or hyphen. It represents the local part of an email address (e.g. john in john@example.com).
- `@`: This matches the at symbol (`@`) that separates the local part and the domain part of an email address.
- `[a-zA-Z0-9.-]+`: This part of the pattern matches one or more characters that are either lowercase or uppercase letters, digits, dot, or hyphen. It represents the domain part of an email address (e.g. example in john@example.com).
- `\.`: This matches a dot (`.`) symbol. The backslash (`\`) is used to escape the special meaning of the dot in regular expressions (which is to match any character).
- `[a-zA-Z]{2,}`: This part of the pattern matches two or more characters that are either lowercase or uppercase letters. It represents the top-level domain of an email address (e.g. com in john@example.com).

Together, this pattern matches a string that has the form of an email address, with the local part, the at symbol, the domain part, and the top-level domain separated by the appropriate symbols.
- `re.split(pattern, string)`: Splits the string into a list of strings by separating it at each occurrence of the pattern.

```python
import re

text = "The price of the product is $100.00"
pattern = r"\$\d+\.\d+"
result = re.split(pattern, text)
print(result)
# Output: ['The price of the product is ', '']
```

The pattern `r"\$\d+\.\d+"` is a regular expression pattern that matches strings that represent monetary values:

- The `\$` part of the pattern matches the dollar sign. The backslash is used to escape the special meaning of the dollar sign in regular expressions (which is to match the end of the line).
- The `\d+` part of the pattern matches one or more digits (0-9). The `\d` sequence is a shorthand character class that matches any digit, and the `+` symbol after it means "one or more."
- The `\.` part of the pattern matches a decimal point (`.`). The backslash (`\`) is used to escape the special meaning of the decimal point in regular expressions (which is to match any character).
- The last `\d+` matches one or more digits after the decimal point.
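In practice, `re.split` is more often given a pattern that describes the separators themselves. The following is a minimal sketch under that assumption; the `record` string is made up for illustration, and `maxsplit` is an optional parameter of `re.split`.

```python
import re

# Hypothetical input: fields separated by commas or semicolons,
# with optional whitespace around each separator.
record = "apples, pears;plums ,  figs"

# Split on a comma or semicolon, swallowing any surrounding whitespace.
fields = re.split(r"\s*[,;]\s*", record)
print(fields)

# maxsplit limits the number of splits; the remainder stays in the last item.
head_and_rest = re.split(r"\s*[,;]\s*", record, maxsplit=1)
print(head_and_rest)
```

Because the separators are consumed by the pattern, the resulting fields come back already trimmed.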
- `re.sub(pattern, repl, string)`: Replaces all occurrences of the pattern in the string with the `repl` string and returns the new string.

```python
import re

text = "The price of an item is $10 but the sale price is $5."
pattern = r"\$\d+"
new_text = re.sub(pattern, "USD", text)
print("New text:", new_text)
# Output: New text: The price of an item is USD but the sale price is USD.
```

- The `\$` part of the pattern matches the dollar sign. The backslash is used to escape the special meaning of the dollar sign in regular expressions (which is to match the end of a line).
- The `\d+` part of the pattern matches one or more digits (0-9). The `\d` sequence is a shorthand character class that matches any digit, and the `+` symbol after it means "one or more".
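The replacement does not have to be a fixed string. As a rough sketch, `repl` can also be a function that receives each match object, and the optional `count` argument limits how many replacements are made. The `to_usd` helper below is hypothetical and exists only for this example.

```python
import re

text = "The price of an item is $10 but the sale price is $5."

# Hypothetical helper: rewrite "$10" as "10 USD", keeping the amount.
def to_usd(match):
    amount = match.group()[1:]  # drop the leading "$"
    return f"{amount} USD"

converted = re.sub(r"\$\d+", to_usd, text)
print(converted)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"\$\d+", "USD", text, count=1)
print(first_only)
```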
- `re.compile(pattern)`: Compiles the pattern into a regular expression pattern object.
The compiled pattern object can be reused for multiple searches, making it more efficient than compiling the pattern for each search.

```python
import re

pattern = r"\$\d+\.\d+"
price_pattern = re.compile(pattern)
text = "The price of the product is $100.00"
result = price_pattern.search(text)
print(result.group())
# Output: $100.00
```

In this example, the `re.compile()` function is used to compile the pattern into a regular expression pattern object named `price_pattern`. The `price_pattern` object is then used to search for a match in the text string using the `search()` method. The `group()` method is used to extract the matched string from the result.
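The benefit of compiling shows up when the same pattern is applied to many strings. A small sketch, assuming a made-up list of `lines`; pattern objects expose the same operations as the module-level functions, such as `findall()`.

```python
import re

# Compile once, then reuse the same pattern object on several strings.
price_pattern = re.compile(r"\$\d+\.\d+")

# Hypothetical sample lines, not taken from any real data set.
lines = [
    "The price of the product is $100.00",
    "Shipping is free",
    "Gift wrap costs $2.50 and a card costs $1.25",
]

# One list of prices per input line; lines without a price give [].
all_prices = [price_pattern.findall(line) for line in lines]
for line, prices in zip(lines, all_prices):
    print(line, "->", prices)
```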
- `re.finditer(pattern, string)`: Searches the string for all matches to the pattern and returns an iterator containing match objects for each match found.
It is similar to `re.findall(pattern, string)`, but instead of returning a list of all matches as strings, it returns an iterator of match objects, allowing you to access more information about each match, such as the matched string and its position in the input string.

```python
import re

text = "The email addresses are john@example.com and jane@example.com."
pattern = r"([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})"
emails = re.finditer(pattern, text)
for email in emails:
    print("Groups:", email.groups())
    print("Email:", email.group(0))
    print("Group1:", email.group(1))
    print("Group2:", email.group(2))
    print("Group3:", email.group(3))
```
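Since each item from `re.finditer` is a match object, the position of every match is available as well. The sketch below is one possible way to collect that information; the `positions` list is a name introduced only for this example.

```python
import re

text = "The email addresses are john@example.com and jane@example.com."
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# Collect the matched text together with its start and end index.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

for email, start, end in positions:
    print(email, start, end)
    # Slicing the original text with the reported indices gives the match back.
    assert text[start:end] == email
```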
- `re.match(pattern, string)`: Returns a match object if there is a match at the beginning of the string, and `None` if there is no match.

```python
import re

text = "The date is 12/23/2022"
pattern = r"\d+/\d+/\d+"
result = re.match(pattern, text)
if result:
    print("Match found:", result.group())
    print("Start index:", result.start())
    print("End index:", result.end())
else:
    print("Match not found.")
# Output: Match not found.
```
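The difference between matching at the beginning and searching anywhere is easiest to see side by side. A short sketch using the same date pattern; `re.fullmatch` is included for comparison and requires the whole string to match.

```python
import re

text = "The date is 12/23/2022"
pattern = r"\d+/\d+/\d+"

# match only looks at the beginning, and the text starts with "The".
at_start = re.match(pattern, text)
# search scans forward and finds the date later in the string.
anywhere = re.search(pattern, text)
# fullmatch succeeds only if the entire string matches the pattern.
whole = re.fullmatch(pattern, "12/23/2022")

print(at_start)
print(anywhere.group())
print(whole.group())
```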
A match object is an object returned by `re` module functions such as `search()` and `match()`, and yielded by `finditer()`, representing the result of a successful match of a regular expression pattern in a string. (`findall()` returns plain strings or tuples rather than match objects.)
The object contains information about the search and the result.
If there is no match, the value `None` is returned instead of the match object.
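Because a failed search gives `None`, calling a method on the result without checking it first raises an `AttributeError`. A minimal sketch of the usual guard; `first_price` is a hypothetical helper written for this example.

```python
import re

def first_price(text):
    """Return the first dollar amount in text, or None if there is none."""
    match = re.search(r"\$\d+\.\d+", text)
    # match.group() would fail if match were None, so check first.
    if match is None:
        return None
    return match.group()

print(first_price("The price of the product is $100.00"))
print(first_price("No price listed"))
```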
Here are some common methods available on a match object:

- `group()`: Returns the string that was matched by the regular expression.
- `start()`: Returns the starting index of the match in the string.
- `end()`: Returns the ending index of the match in the string (the index just past the last matched character).
- `span()`: Returns a tuple of the starting and ending indices of the match in the string.
- `groups()`: Returns a tuple of the subgroups matched by the regular expression.

```python
import re

text = "The price of the product is $100.00"
pattern = r"\$\d+\.\d+"
result = re.search(pattern, text)
print("Matched string:", result.group())
print("Start index:", result.start())
print("End index:", result.end())
print("Span:", result.span())
```

This will output:

```
Matched string: $100.00
Start index: 28
End index: 35
Span: (28, 35)
```
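The indices reported by `start()`, `end()`, and `span()` can be used directly to slice the original string. A small sketch of that relationship, reusing the price example; the `before` and `after` names are introduced only for illustration.

```python
import re

text = "The price of the product is $100.00"
result = re.search(r"\$\d+\.\d+", text)

start, end = result.span()
# end is exclusive, so slicing with the span gives back the matched text.
assert text[start:end] == result.group()
assert (result.start(), result.end()) == result.span()

# The text before and after the match can be recovered the same way.
before, after = text[:start], text[end:]
print(repr(before), repr(after))
```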
Capturing groups are a powerful feature of regular expressions in Python. They allow you to extract parts of a matching string as separate groups. A subgroup is a portion of the matched string that is enclosed within a set of parentheses `()` in the regular expression pattern.
A group is therefore created by placing a section of the regex pattern in a set of parentheses.
We can use the `group()` method of a match object to extract each group result separately by specifying a group index.
Group numbering always starts with 1. Index 0, or calling `group()` without an argument, always returns the entire match.

```python
import re

text = "The date is 12/23/2022"
pattern = r"(\d+)/(\d+)/(\d+)"
result = re.search(pattern, text)
print("Groups:", result.groups())
print("Month:", result.group(1))
print("Day:", result.group(2))
print("Year:", result.group(3))
print("Match:", result.group(0))
```

This will output:

```
Groups: ('12', '23', '2022')
Month: 12
Day: 23
Year: 2022
Match: 12/23/2022
```
Note that `group(0)` returns the entire match, and `group(1)`, `group(2)`, etc. return the corresponding subgroup matches.
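Capturing groups also change what some functions return. As a brief sketch: when a pattern has more than one group, `re.findall` returns tuples of the groups instead of whole matches, and a group that did not take part in a match is reported as `None`. The sample strings are made up for illustration, and `(?:...)` is a non-capturing group.

```python
import re

text = "Due dates: 12/23/2022 and 01/05/2023"
pattern = r"(\d+)/(\d+)/(\d+)"

# With several capturing groups, findall returns one tuple per match.
dates = re.findall(pattern, text)
print(dates)

# The year is optional here; (?:...) groups without capturing.
result = re.search(r"(\d+)/(\d+)(?:/(\d+))?", "Meet on 12/23")
# The third group did not participate, so it is None.
print(result.groups())
```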
You can also name your capturing groups, making it easier to refer to them later.
Example:
```python
import re

text = "The date is 12/23/2022"
pattern = r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)"
result = re.search(pattern, text)
print("Month:", result.group("month"))
print("Day:", result.group("day"))
print("Year:", result.group("year"))
```

This will output:

```
Month: 12
Day: 23
Year: 2022
```
The pattern `r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)"` is a regular expression pattern used to match a date in the format month/day/year.

- `\d+`: matches one or more digits.
- `/`: matches the forward slash character.
- `(?P<month>\d+)`: creates a named capturing group `month` that matches the first sequence of digits.
- `(?P<day>\d+)`: creates a named capturing group `day` that matches the second sequence of digits.
- `(?P<year>\d+)`: creates a named capturing group `year` that matches the third sequence of digits.
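Named groups can also be read all at once or referenced in a replacement string. A minimal sketch, reusing the date pattern: `groupdict()` returns the named groups as a dictionary, and `\g<name>` in the `repl` argument of `re.sub` refers back to a named group. The reordered output format is just an illustration.

```python
import re

text = "The date is 12/23/2022"
pattern = r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)"

result = re.search(pattern, text)
# groupdict() maps each group name to the text it captured.
parts = result.groupdict()
print(parts)

# \g<name> in the replacement string refers back to a named group.
iso_text = re.sub(pattern, r"\g<year>-\g<month>-\g<day>", text)
print(iso_text)
```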
Metacharacters are characters in a regular expression that have a special meaning. Some common metacharacters in Python regex include:

- `.` (dot): Matches any single character except a newline character.
- `^` (caret): Matches the start of a string.
- `$` (dollar): Matches the end of a string.
- `*` (asterisk): Matches zero or more occurrences of the preceding character or group.
- `+` (plus): Matches one or more occurrences of the preceding character or group.
- `?` (question mark): Matches zero or one occurrence of the preceding character or group.
- `{m,n}` (curly braces): Matches the preceding character or group m to n times.
- `[]` (square brackets): Defines a character set. Matches any one character in the set.
- `[^]` (negated square brackets): Matches any one character not in the set.
- `|` (vertical bar or pipe): Matches either the preceding or following expression.
- `( )` (parentheses): Defines a group.
- `\` (backslash): Escapes a special character.
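The metacharacters above can be tried out against a small word list. This is only an illustrative sketch: the `words` list and the `matching` helper are made up for the example, and each pattern exercises one or two of the metacharacters.

```python
import re

# Hypothetical sample words chosen to exercise the metacharacters.
words = ["cat", "cot", "coat", "ct", "cart", "scat"]

def matching(pattern):
    """Return the words in which the pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

print(matching(r"c.t"))         # dot: exactly one character between c and t
print(matching(r"^c"))          # caret: word must start with c
print(matching(r"co?t"))        # question mark: the o is optional
print(matching(r"c[ao]t"))      # character set: a or o in the middle
print(matching(r"c[^a]t"))      # negated set: any middle character except a
print(matching(r"ca|co"))       # pipe: either "ca" or "co"
print(matching(r"^c.{2,3}t$"))  # braces: two or three characters between c and t
```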
A special sequence is a `\` followed by a character that has a special meaning.

- `\d`: Matches any decimal digit. Equivalent to `[0-9]` for ASCII text.
- `\D`: Matches any non-digit character. Equivalent to `[^0-9]` for ASCII text.
- `\w`: Matches any word character (letters, digits, and underscores). Equivalent to `[a-zA-Z0-9_]` for ASCII text.
- `\W`: Matches any non-word character. Equivalent to `[^a-zA-Z0-9_]` for ASCII text.
- `\s`: Matches any white-space character, e.g. tab, space, line feed.
- `\S`: Matches any non-white-space character.

```python
import re

text = "The date is 05/07/2021."

# Matching digits
pattern = r"\d+"
dates = re.findall(pattern, text)
print(dates)  # Output: ['05', '07', '2021']

# Matching non-digits
pattern = r"\D+"
words = re.findall(pattern, text)
print(words)  # Output: ['The date is ', '/', '/', '.']

# Matching white-space characters
pattern = r"\s+"
spaces = re.findall(pattern, text)
print(spaces)  # Output: [' ', ' ', ' ']

# Matching non-white-space characters
pattern = r"\S+"
nonspaces = re.findall(pattern, text)
print(nonspaces)  # Output: ['The', 'date', 'is', '05/07/2021.']

# Matching word characters
pattern = r"\w+"
words = re.findall(pattern, text)
print(words)  # Output: ['The', 'date', 'is', '05', '07', '2021']

# Matching non-word characters
pattern = r"\W+"
nonwords = re.findall(pattern, text)
print(nonwords)  # Output: [' ', ' ', ' ', '/', '/', '.']
```
- `\b`: Matches a word boundary. A word boundary is the position between a word character (letters, digits, and underscores) and a non-word character, or between a word character and the start or end of the string. In other words, it matches the empty string, but only at the beginning or end of a word.

```python
import re

text = "Hello world! 123"

# Match the word "Hello" at the beginning of the string
pattern = r"\bHello\b"
match = re.search(pattern, text)
print(match.group())  # Output: Hello

# Match the word "world" anywhere in the string
pattern = r"\bworld\b"
match = re.search(pattern, text)
print(match.group())  # Output: world

# Match the word "123" at the end of the string
pattern = r"\b123\b"
match = re.search(pattern, text)
print(match.group())  # Output: 123

# No match for "1234": the text contains "123", not "1234"
pattern = r"\b1234\b"
match = re.search(pattern, text)
print(match)  # Output: None
```

- `\B`: Matches the empty string, but not at the start or end of a word.
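The contrast between `\b` and `\B` can be sketched with a string in which the same letters appear both at the start of a word and inside one. The sample text is made up for this example.

```python
import re

text = "cat concat category"

# \bcat matches "cat" only where a word starts.
starts = [m.start() for m in re.finditer(r"\bcat", text)]
# \Bcat matches "cat" only when it is preceded by another word character.
inside = [m.start() for m in re.finditer(r"\Bcat", text)]

print(starts)
print(inside)
```

Every occurrence of "cat" falls into exactly one of the two lists, since a position is either a word boundary or it is not.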
It's important to keep in mind that metacharacters need to be escaped if you want to match the characters literally.
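As a rough illustration of escaping, the sketch below searches for the literal text `$5.00`. The backslashes can be written by hand, or `re.escape` can add them; the sample string is made up for this example.

```python
import re

text = "Total: $5.00 (was $6.00)"

# Unescaped, "$" anchors the end of the string and "." matches any character,
# so this pattern cannot find the literal text "$5.00".
unescaped = re.search("$5.00", text)
print(unescaped)

# Escaping by hand or with re.escape matches the characters literally.
by_hand = re.search(r"\$5\.00", text)
escaped = re.search(re.escape("$5.00"), text)
print(by_hand.group(), escaped.group())
```

`re.escape` is mainly useful when the literal text comes from a variable or from user input rather than being typed into the pattern.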