
REGULAR EXPRESSION

Irvine Sunday edited this page Feb 16, 2023 · 11 revisions

INTRO

Also known as regex.
It is a sequence of characters that define a search pattern.
It can be used to check if a string contains the specified search pattern.
While it is a very powerful tool, complex patterns are easy to get wrong and hard to read, so it should be used with care.
Different languages use different regex engines. This means that a regex written for Python, for example, may be interpreted differently in JavaScript.
Practice Regex here:

THE REGEX MODULE

Python has a built-in package called re, which can be used to work with regular expressions.
It provides functions for searching, splitting, and replacing strings based on regular expression patterns.
Some of the most commonly used functions in the re module include:

  • re.search(pattern, string): Searches for the first occurrence of the pattern in the string and returns a match object if a match is found, or None if no match is found.

    import re
    text = "The price of the item is $10."
    pattern = r"\$\d+"
    match = re.search(pattern, text)
    
    if match:
        print("Match found:", match.group())
    else:
        print("No match found.")
    
    #Output: Match found: $10

    The pattern represents a dollar sign followed by one or more digits. The group() method of the match object returns the matching string.

  • re.findall(pattern, string): Returns all non-overlapping occurrences of the pattern in the string as a list of strings. If no matches are found, an empty list is returned.

    import re
    text = "The email addresses are john@example.com and jane@example.com."
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    
    emails = re.findall(pattern, text)
    print("Email addresses:", emails)
    
    # Output: Email addresses: ['john@example.com', 'jane@example.com']

    The pattern r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" is a regular expression pattern for matching email addresses. Here's how it works:

    • [a-zA-Z0-9._%+-]+: This part of the pattern matches one or more characters that are either lowercase or uppercase letters, digits, dot, underscore, percent, plus, or hyphen. This pattern represents the local part of an email address (e.g. john in john@example.com).

    • @: This matches the at symbol (@) that separates the local part and the domain part of an email address.

    • [a-zA-Z0-9.-]+: This part of the pattern matches one or more characters that are either lowercase or uppercase letters, digits, dot, or hyphen. This pattern represents the domain name that precedes the top-level domain (e.g. example in john@example.com).

    • \.: This matches a dot (.) symbol. The backslash (\) is used to escape the special meaning of the dot in regular expressions (which is to match any character).

    • [a-zA-Z]{2,}: This part of the pattern matches two or more characters that are either lowercase or uppercase letters. This pattern represents the top-level domain of an email address (e.g. com in john@example.com).

      Together, this pattern matches a string that has the form of an email address, with the local part, the at symbol, the domain part, and the top-level domain separated by the appropriate symbols.

  • re.split(pattern, string): Splits the string into a list of strings by separating it at the occurrence of the pattern.

    import re
    
    text = "The price of the product is $100.00"
    pattern = r"\$\d+\.\d+"
    result = re.split(pattern, text)
    print(result)
    
    #Output: ['The price of the product is ', '']

    The pattern r"\$\d+\.\d+" is a regular expression pattern that matches strings that represent monetary values:

    • The \$ part of the pattern matches the dollar sign. The backslash is used to escape the special meaning of the dollar sign in regular expressions (which is to match the end of the line).
    • The \d+ part of the pattern matches one or more digits (0-9). The \d sequence is a shorthand character class that matches any digit, and the + symbol after it means "one or more."
    • The \. part of the pattern matches a decimal point (.). The backslash (\) is used to escape the special meaning of the decimal point in regular expressions (which is to match any character).
    • The last \d+ matches one or more digits after the decimal point.
  • re.sub(pattern, repl, string): Replaces all occurrences of the pattern in the string with the repl string and returns the new string.

    import re
    text = "The price of an item is $10 but the sale price is $5."
    pattern = r"\$\d+"
    new_text = re.sub(pattern, "USD", text)
    print("New text:", new_text)
    
    # Output: New text: The price of an item is USD but the sale price is USD.

    • The \$ part of the pattern matches the dollar sign. The backslash is used to escape the special meaning of the dollar sign in regular expressions (which is to match the end of a line).
    • The \d+ part of the pattern matches one or more digits (0-9). The \d sequence is a shorthand character class that matches any digit, and the + symbol after it means "one or more".
  • re.compile(pattern): Compiles the pattern into a regular expression pattern object.
    The compiled pattern object can be reused for multiple searches, which avoids recompiling the pattern for each search.

    import re
    
    pattern = r"\$\d+\.\d+"
    price_pattern = re.compile(pattern)
    text = "The price of the product is $100.00"
    result = price_pattern.search(text)
    print(result.group())
    
    #output: $100.00

    In this example, the re.compile() function is used to compile the pattern into a regular expression pattern object named price_pattern. The price_pattern object is then used to search for a match in the text string using the search() method. The group() method is used to extract the matched string from the result.

  • re.finditer(pattern, string): Searches the string for all non-overlapping matches of the pattern and returns an iterator that yields a match object for each match found.
    It is similar to re.findall(pattern, string), but instead of returning a list of all matches as strings, it returns an iterator of match objects, allowing you to access more information about each match, such as the matched string and its position in the input string.

    import re
    
    text = "The email addresses are john@example.com and jane@example.com."
    
    pattern = r"([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})"
    
    emails = re.finditer(pattern, text)
    
    for email in emails:
        print("Groups:", email.groups())
        print("Email:", email.group(0))
        print("Group1:", email.group(1))
        print("Group2:", email.group(2))
        print("Group3:", email.group(3))
  • re.match(pattern, string): It returns a match object if there is a match at the beginning of the string, and None if there is no match.

import re

text = "The date is 12/23/2022"
pattern = r"\d+/\d+/\d+"

result = re.match(pattern, text)

if result:
    print("Match found:", result.group())
    print("Start index:", result.start())
    print("End index:", result.end())
else:
    print("Match not found.")

#Output: Match not found
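
The example above finds nothing because re.match() only looks at the beginning of the string. As a minimal sketch (the variable names at_start, anywhere and whole are only illustrative), the same pattern can be compared across re.match(), re.search() and re.fullmatch(), which requires the entire string to match:

```python
import re

text = "The date is 12/23/2022"
pattern = r"\d+/\d+/\d+"

# re.match only checks the start of the string, so this is None
at_start = re.match(pattern, text)

# re.search scans the whole string and finds the date
anywhere = re.search(pattern, text)

# re.fullmatch succeeds only if the entire string matches the pattern
whole = re.fullmatch(pattern, "12/23/2022")

print(at_start)          # None
print(anywhere.group())  # 12/23/2022
print(whole.group())     # 12/23/2022
```

If the pattern may appear anywhere in the text, re.search() is usually the function to reach for; re.match() and re.fullmatch() fit validation-style checks.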

MATCH OBJECT METHODS

A match object is an object returned by re module functions such as search() and match(), and yielded by finditer(), representing the result of a successful match of a regular expression pattern in a string. Note that findall() returns plain strings (or tuples of strings), not match objects.
The object contains information about the search and the result.
If there is no match, the value None is returned instead of the match object.
Here are some common methods available on a match object:

  • group(): Returns the string that was matched by the regular expression.

  • start(): Returns the starting index of the match in the string.

  • end(): Returns the ending index of the match in the string. This is the index just past the last matched character, so string[start:end] is the matched text.

  • span(): Returns a tuple of the starting and ending indices of the match in the string.

    import re
    
    text = "The price of the product is $100.00"
    pattern = r"\$\d+\.\d+"
    
    result = re.search(pattern, text)
    
    print("Matched string:", result.group())
    print("Start index:", result.start())
    print("End index:", result.end())
    print("Span:", result.span())

    This will output:

    Matched string: $100.00
    Start index: 28
    End index: 35
    Span: (28, 35)
  • groups(): Returns a tuple of the subgroups matched by the regular expression.
    Capturing groups are a powerful feature of regular expressions in Python. They allow you to extract parts of a matching string as separate groups. A subgroup is the portion of the matched string that corresponds to a part of the pattern enclosed within a set of parentheses ().
    A group is therefore created by placing a section of the regex pattern in a set of parentheses.
    We can use the group() method of a match object to extract each group's result separately by specifying a group index.
    Group numbering starts at 1. Index 0, or calling group() without an argument, always returns the entire match.

    import re
    
    text = "The date is 12/23/2022"
    pattern = r"(\d+)/(\d+)/(\d+)"
    
    result = re.search(pattern, text)
    
    print("Groups:", result.groups())
    print("Month:", result.group(1))
    print("Day:", result.group(2))
    print("Year:", result.group(3))
    print("Match:", result.group(0))

    This will output:

    Groups: ('12', '23', '2022')
    Month: 12
    Day: 23
    Year: 2022
    Match: 12/23/2022

    Note that group(0) returns the entire match, and group(1), group(2), etc. return the corresponding subgroup matches.
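
Capturing groups also change what re.findall() returns. The following is a small sketch, assuming a made-up text with two dates; the variable names full, parts and spans are only illustrative:

```python
import re

text = "Dates: 12/23/2022 and 01/15/2023"

# Without groups, findall returns each full match as a string
full = re.findall(r"\d+/\d+/\d+", text)
print(full)   # ['12/23/2022', '01/15/2023']

# With several groups, findall returns one tuple of subgroups per match
parts = re.findall(r"(\d+)/(\d+)/(\d+)", text)
print(parts)  # [('12', '23', '2022'), ('01', '15', '2023')]

# finditer yields match objects, so group(0) and span() remain available
spans = [(m.group(0), m.span()) for m in re.finditer(r"(\d+)/(\d+)/(\d+)", text)]
print(spans)  # [('12/23/2022', (7, 17)), ('01/15/2023', (22, 32))]
```

When both the full match and its subgroups are needed, re.finditer() tends to be the more convenient choice.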

You can also name your capturing groups, making it easier to refer to them later.
Example:

import re

text = "The date is 12/23/2022"
pattern = r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)"

result = re.search(pattern, text)

print("Month:", result.group("month"))
print("Day:", result.group("day"))
print("Year:", result.group("year"))

This will output:

Month: 12
Day: 23
Year: 2022

The pattern r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)" is a regular expression pattern used to match a date in the format month/day/year.

  • \d+: matches one or more digits.
  • /: matches the forward slash character.
  • (?P<month>\d+): creates a named capturing group month that matches the first sequence of digits.
  • (?P<day>\d+): creates a named capturing group day that matches the second sequence of digits.
  • (?P<year>\d+): creates a named capturing group year that matches the third sequence of digits.
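
Named groups can also be read all at once or reused in a replacement string. The sketch below uses the match object's groupdict() method and the \g<name> syntax of sub(); the variable names date_pattern, fields and iso are only illustrative:

```python
import re

date_pattern = re.compile(r"(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)")
text = "The date is 12/23/2022"

# groupdict() returns every named group as a dictionary
fields = date_pattern.search(text).groupdict()
print(fields)  # {'month': '12', 'day': '23', 'year': '2022'}

# \g<name> refers to a named group inside the replacement string
iso = date_pattern.sub(r"\g<year>-\g<month>-\g<day>", text)
print(iso)     # The date is 2022-12-23
```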

METACHARACTERS

These are characters in a regular expression that have a special meaning. Some common metacharacters in Python regex include:

  • . (dot): Matches any single character except a newline character.
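  • ^ (caret): Matches the start of a string (and the start of each line when the re.MULTILINE flag is used).
  • $ (dollar): Matches the end of a string, or just before a newline at the end of the string (and the end of each line when the re.MULTILINE flag is used).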
  • * (asterisk): Matches zero or more occurrences of the preceding character or group.
  • + (plus): Matches one or more occurrences of the preceding character or group.
  • ? (question mark): Matches zero or one occurrence of the preceding character or group.
  • {m,n} (curly braces): Matches the preceding character or group m to n times.
  • [] (square brackets): Defines a character set. Matches any one character in the set.
  • [^] (negated square brackets): Matches any one character not in the set.
  • | (vertical bar or pipe): Matches either the preceding or following expression.
  • ( ) (parentheses): Defines a group.
  • \ (backslash): Escapes a special character.
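
The metacharacters above can be tried out one at a time. This is a minimal sketch with made-up sample strings; the variable names are only illustrative:

```python
import re

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cot c\nt cut")
print(dot)  # ['cat', 'cot', 'cut']

# ^ and $ anchor the pattern to the start and end of the string
starts = bool(re.search(r"^Hello", "Hello world"))
ends = bool(re.search(r"world$", "Hello world"))
print(starts, ends)  # True True

# *, + and ? set how many times the preceding character may repeat
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")
optional = re.findall(r"ab?", "a ab abb")
print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
print(optional)  # ['a', 'ab', 'ab']

# {m,n} bounds the repetition count
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")
print(bounded)  # ['22', '333', '444']

# [] matches one character from the set, [^] one character not in it
vowels = re.findall(r"[aeiou]", "regex")
others = re.findall(r"[^aeiou]", "regex")
print(vowels, others)  # ['e', 'e'] ['r', 'g', 'x']

# | chooses between alternatives, ( ) groups them
greys = [m.group() for m in re.finditer(r"gr(a|e)y", "gray grey groy")]
print(greys)  # ['gray', 'grey']

# \ escapes a metacharacter so that it is matched literally
dots = re.findall(r"\.", "a.b.c")
print(dots)  # ['.', '.']
```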

SPECIAL SEQUENCE

A special sequence is a \ followed by a character that has a special meaning.

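  • \d: Matches any decimal digit. Equivalent to [0-9] for ASCII text (for str patterns it also matches other Unicode digits unless the re.ASCII flag is set).

  • \D: Matches any non-digit character, the opposite of \d. Equivalent to [^0-9] for ASCII text.

  • \w: Matches any word character (letters, digits, and the underscore). Equivalent to [a-zA-Z0-9_] for ASCII text (for str patterns it also matches Unicode letters and digits unless the re.ASCII flag is set).

  • \W: Matches any non-word character, the opposite of \w. Equivalent to [^a-zA-Z0-9_] for ASCII text.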

  • \s: Matches any white-space character e.g. tab, space, line feed.

  • \S: Matches any non-white-space character.

    import re
    
    text = "The date is 05/07/2021."
    
    # Matching digits
    pattern = r"\d+"
    dates = re.findall(pattern, text)
    print(dates)
    # Output: ['05', '07', '2021']
    
    # Matching non-digits
    pattern = r"\D+"
    words = re.findall(pattern, text)
    print(words)
    # Output: ['The date is ', '/', '/', '.']
    
    # Matching white-space characters
    pattern = r"\s+"
    spaces = re.findall(pattern, text)
    print(spaces)
    # Output: [' ', ' ', ' ']
    
    # Matching non-white-space characters
    pattern = r"\S+"
    nonspaces = re.findall(pattern, text)
    print(nonspaces)
    # Output: ['The', 'date', 'is', '05/07/2021.']
    
    # Matching alphanumeric characters
    pattern = r"\w+"
    words = re.findall(pattern, text)
    print(words)
    # Output: ['The', 'date', 'is', '05', '07', '2021']
    
    # Matching non-alphanumeric characters
    pattern = r"\W+"
    nonwords = re.findall(pattern, text)
    print(nonwords)
    # Output: [' ', ' ', ' ', '/', '/', '.']
  • \b: Matches a word boundary. A word boundary is the position between a word character (a letter, digit, or underscore) and a non-word character, or between a word character and the start or end of the string. In other words, it matches the empty string, but only at the beginning or end of a word.

    import re
    
    text = "Hello world! 123"
    
    # Match the word "Hello" at the beginning of the string
    pattern = r"\bHello\b"
    match = re.search(pattern, text)
    print(match.group())  # Output: Hello
    
    # Match the word "world" anywhere in the string
    pattern = r"\bworld\b"
    match = re.search(pattern, text)
    print(match.group())  # Output: world
    
    # Match the word "123" at the end of the string
    pattern = r"\b123\b"
    match = re.search(pattern, text)
    print(match.group())  # Output: 123
    
    # No match for "12": it only appears inside "123", so it is not followed by a word boundary
    pattern = r"\b12\b"
    match = re.search(pattern, text)
    print(match)  # Output: None
  • \B - matches the empty string, but not at the start or end of a word.
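
To contrast \b and \B, here is a small sketch with a made-up text in which "cat" appears both at the start of a word and in the middle of one; the variable names are only illustrative:

```python
import re

text = "cat concat category"

# \bcat matches "cat" only where it begins a word
word_starts = [m.start() for m in re.finditer(r"\bcat", text)]
print(word_starts)  # [0, 11]

# \Bcat matches "cat" only where it does not begin a word
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]
print(inside_word)  # [7]
```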

It's important to keep in mind that metacharacters need to be escaped if you want to match the characters literally.
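
When the text to match comes from a variable, the re.escape() function can add the backslashes for you. A minimal sketch, assuming a made-up price string and sample text:

```python
import re

price = "$5.00"
text = "Was $5.00, now $5x00 (typo)."

# Unescaped, "$" acts as an end anchor and "." matches any character,
# so the pattern does not find the literal price
unescaped = re.findall(price, text)
print(unescaped)  # []

# re.escape() escapes the metacharacters so the string is matched literally
literal = re.findall(re.escape(price), text)
print(literal)    # ['$5.00']
```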
