Regular Expressions for String Manipulation

 
Strings and Serialization
 
Damian Gordon
 
REGULAR EXPRESSIONS
 
Regular Expressions
 
A regular expression is a sequence of characters that defines a
search pattern, mainly for use in pattern matching with strings,
or string matching.
 
Regular expressions originated in 1956, when mathematician
Stephen Cole Kleene described regular languages using his
mathematical notation called regular sets.
 
Regular Expressions
 
Basic Patterns

Logical OR: A vertical bar separates alternatives. For example,
gray|grey can match "gray" or "grey".

Grouping: Parentheses are used to define the scope and precedence
of the operators. For example, gr(a|e)y matches the same two
strings as gray|grey.

Quantification: A quantifier after a token (such as a character) or
group specifies how often that preceding element is allowed to occur.
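
As a rough illustration of alternation and grouping, the sketch below uses Python's re module. The helper name matches_whole is an assumption made for this example and is not part of the slides; it relies on re.fullmatch so that the whole string has to fit the pattern.

```python
import re

def matches_whole(pattern, text):
    """Return True if the entire text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Alternation: either spelling is accepted
print(matches_whole(r"gray|grey", "gray"))  # True
print(matches_whole(r"gray|grey", "grey"))  # True

# Grouping restricts the alternation to the one letter that differs
print(matches_whole(r"gr(a|e)y", "grey"))   # True
print(matches_whole(r"gr(a|e)y", "groy"))   # False
```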
 
 
Regular Expressions
 
Quantifiers

?: indicates zero or one occurrence of the preceding element. For
example, colou?r matches both "color" and "colour".

*: indicates zero or more occurrences of the preceding element. For
example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+: indicates one or more occurrences of the preceding element. For
example, ab+c matches "abc", "abbc", "abbbc", and so on, but not
"ac".
 
Regular Expressions
 
Quantifiers

{n}: The preceding item is matched exactly n times.

{min,}: The preceding item is matched min or more times.

{min,max}: The preceding item is matched at least min times, but not
more than max times.
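
The quantifiers listed so far (?, *, +, {n}, {min,} and {min,max}) can be tried out with a short sketch like the one below. The helper name accepted and the sample strings are assumptions for illustration only; re.fullmatch is used so that each candidate must fit the pattern from start to end.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that fit the pattern in full (hypothetical helper)."""
    return [text for text in candidates if re.fullmatch(pattern, text)]

print(accepted(r"colou?r", ["color", "colour", "colouur"]))  # ['color', 'colour']
print(accepted(r"ab*c", ["ac", "abc", "abbbc"]))             # ['ac', 'abc', 'abbbc']
print(accepted(r"ab+c", ["ac", "abc", "abbbc"]))             # ['abc', 'abbbc']

samples = ["a", "aa", "aaa", "aaaa", "aaaaa"]
print(accepted(r"a{3}", samples))    # ['aaa']
print(accepted(r"a{3,}", samples))   # ['aaa', 'aaaa', 'aaaaa']
print(accepted(r"a{2,4}", samples))  # ['aa', 'aaa', 'aaaa']
```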
 
Regular Expressions
 
The Python Standard Library module for regular expressions is
called re. For example:
 
# PROGRAM MatchingPatterns:
import re
search_string = "hello world"
pattern       = "hello world"
match = re.match(pattern, search_string)
if match:
    # THEN
    print("regex matches")
# ENDIF;
# END.
 
Regular Expressions
 
Bear in mind that the match function anchors the
pattern to the beginning of the string.
Thus, if the pattern were "ello world", no match
would be found.
Confusingly, the end of the string is not anchored in
the same way: matching stops as soon as the pattern is
satisfied, so the pattern "hello wo" matches successfully.
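
A minimal sketch of this asymmetry follows. It also shows re.fullmatch, which is not mentioned in the slides and is included here only as one possible way to require that the whole string fit the pattern.

```python
import re

search_string = "hello world"

# re.match only tries the pattern at the very start of the string
print(re.match("hello world", search_string) is not None)  # True
print(re.match("ello world", search_string) is not None)   # False

# A pattern that covers only a prefix of the string still matches
prefix = re.match("hello wo", search_string)
print(prefix.group(0))  # hello wo

# re.fullmatch succeeds only when the entire string is consumed
print(re.fullmatch("hello wo", search_string) is None)  # True
```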
 
Regular Expressions
 
So with this code:
 
import re
pattern       = "hello world"
search_string = "hello world"
match = re.match(pattern, search_string)
if match:
    template = "'{}' matches pattern '{}'"
else:
    template = "'{}' does not match pattern '{}'"
# ENDIF;
print(template.format(search_string, pattern))
# END.
 
Regular Expressions
 
For
pattern       = "hello world"
search_string = "hello world"
MATCH

For
pattern       = "hello worl"
search_string = "hello world"
MATCH

For
pattern       = "ello world"
search_string = "hello world"
NO MATCH
 
Matching Single Characters
 
Regular Expressions
 
The period character, when used in a regular expression pattern,
matches any single character (by default, any character except a
newline). Using a period in the pattern means you don't care what the
character is, just that there is a character there.
 
'hello world' matches pattern 'hel.o world'
'helpo world' matches pattern 'hel.o world'
'hel o world' matches pattern 'hel.o world'
'helo world' does not match pattern 'hel.o world'
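
The four results above can be reproduced with a sketch along these lines. The variable names are chosen for this example only; the last two lines use the re.DOTALL flag, which the slides do not cover, to show the newline exception.

```python
import re

pattern = r"hel.o world"
candidates = ["hello world", "helpo world", "hel o world", "helo world"]

# Map each candidate to whether the pattern matches at its start
results = {text: re.match(pattern, text) is not None for text in candidates}
for text, matched in results.items():
    print(repr(text), "matches" if matched else "does not match")

# By default the period does not match a newline; re.DOTALL lifts that limit
print(re.match(r"a.c", "a\nc") is None)                 # True
print(re.match(r"a.c", "a\nc", re.DOTALL) is not None)  # True
```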
 
Regular Expressions
 
The square brackets, when used in a regular expression pattern,
can match any one of a list of single characters.
 
'hello world' matches pattern 'hel[lp]o world'
'helpo world' matches pattern 'hel[lp]o world'
'helPo world' does not match pattern 'hel[lp]o world'
 
Regular Expressions
 
The square brackets, when used in a regular expression pattern,
can match a range of single characters.
 
'hello world' does not match pattern 'hello [a-z] world'
'hello b world' matches pattern 'hello [a-z] world'
'hello B world' matches pattern 'hello [a-zA-Z] world'
'hello 2 world' matches pattern 'hello [a-zA-Z0-9] world'
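
As a small sketch of character sets and ranges, the code below checks a few strings against the widest pattern from the examples above. The extra test string "hello bb world" is an assumption added here to show that a bracketed set stands for exactly one character.

```python
import re

pattern = r"hello [a-zA-Z0-9] world"
candidates = ["hello b world", "hello B world", "hello 2 world",
              "hello world", "hello bb world"]

# A bracketed set matches exactly one character drawn from the set
results = {text: re.match(pattern, text) is not None for text in candidates}
for text, matched in results.items():
    print(repr(text), matched)
```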
 
Regular Expressions
 
But what happens if we want to match the period character or the
square bracket?
 
We use the backslash:
 
'.' matches pattern '\.'
'[' matches pattern '\['
']' matches pattern '\]'
'(' matches pattern '\('
')' matches pattern '\)'
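
The effect of the backslash can be checked with a sketch like the following. The call to re.escape goes beyond what the slides cover; it is shown here as one convenient way to escape text that should be matched literally.

```python
import re

# Unescaped, the period matches any character; escaped, only a literal period
print(re.match(r".", "a") is not None)   # True
print(re.match(r"\.", "a") is not None)  # False
print(re.match(r"\.", ".") is not None)  # True

# re.escape adds the backslashes for us when the literal text is not fixed
literal = "(abc]"
escaped = re.escape(literal)
print(re.fullmatch(escaped, literal) is not None)  # True
```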
 
Regular Expressions
 
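Other backslash sequences:

Character   Description
\n          newline
\t          tab
\s          whitespace character
\w          letters, numbers, and underscores
\d          digit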
 
Regular Expressions
 
So, for example:

'(abc]' matches pattern '\(abc\]'
' 1a' matches pattern '\s\d\w'
'\t5n' does not match pattern '\s\d\w' (here \t is a literal backslash followed by t, not a tab character; a real tab would match \s)
' 5n' matches pattern '\s\d\w'
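
The '\t5n' line above depends on whether \t is a real tab or two separate characters. The sketch below, which assumes the strings are written as Python literals, shows both readings.

```python
import re

pattern = r"\s\d\w"

# A space, a digit, then a word character
print(re.match(pattern, " 1a") is not None)    # True

# In a normal Python string, "\t" is a real tab, which counts as whitespace
print(re.match(pattern, "\t5n") is not None)   # True

# In a raw string, r"\t" is a backslash followed by "t", so \s fails
print(re.match(pattern, r"\t5n") is not None)  # False
```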
 
Matching Multiple Characters
 
Regular Expressions
 
The asterisk (*) character says that the previous character can be
matched zero or more times.
 
'hello' matches pattern 'hel*o'
'heo' matches pattern 'hel*o'
'helllllo' matches pattern 'hel*o'
 
Regular Expressions
 
[a-z]* matches any run of lowercase letters, including the
empty string:
 
'A string.' matches pattern '[A-Z][a-z]* [a-z]*\.'
'No .' matches pattern '[A-Z][a-z]* [a-z]*\.'
'' matches pattern '[a-z]*.*'
 
Regular Expressions
 
The plus (+) sign in a pattern behaves similarly to an asterisk: it
states that the previous character can be repeated one or more
times but, unlike with the asterisk, at least one occurrence is required.

The question mark (?) states that the previous character appears
either zero times or exactly once, but not more.
 
Regular Expressions
 
Some examples:
 
'0.4' matches pattern '\d+\.\d+'
'1.002' matches pattern '\d+\.\d+'
'1.' does not match pattern '\d+\.\d+'
'1%' matches pattern '\d?\d%'
'99%' matches pattern '\d?\d%'
'999%' does not match pattern '\d?\d%'
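
These six results can be reproduced with a sketch like the one below. The names decimal, percent and check are assumptions chosen for this example.

```python
import re

decimal = r"\d+\.\d+"  # one or more digits, a period, one or more digits
percent = r"\d?\d%"    # one or two digits followed by a percent sign

def check(pattern, text):
    """Hypothetical helper: True if the pattern matches at the start of text."""
    return re.match(pattern, text) is not None

for text in ["0.4", "1.002", "1."]:
    print(repr(text), check(decimal, text))  # True, True, False

for text in ["1%", "99%", "999%"]:
    print(repr(text), check(percent, text))  # True, True, False
```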
 
Regular Expressions
 
To check for a repeating sequence of characters, we can enclose
any set of characters in parentheses and treat them as a single
pattern:
 
'abccc' matches pattern 'abc{3}'
'abccc' does not match pattern '(abc){3}'
'abcabcabc' matches pattern '(abc){3}'
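
A short sketch of the difference the parentheses make:

```python
import re

# Without parentheses, {3} applies only to the final "c"
print(re.match(r"abc{3}", "abccc") is not None)        # True

# With parentheses, the whole sequence "abc" must repeat three times
print(re.match(r"(abc){3}", "abccc") is not None)      # False
print(re.match(r"(abc){3}", "abcabcabc") is not None)  # True
```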
 
Regular Expressions
 
Combined with complex patterns, this grouping feature greatly
expands our pattern-matching repertoire:
 
'Eat.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
'Eat more good food.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
'A good meal.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
The first word starts with a capital, followed by zero or more lowercase letters.
Then, we enter a parenthetical that matches a single space followed by a word of
one or more lowercase letters. This entire parenthetical is repeated zero or more
times, and the pattern is terminated with a period. There cannot be any other
characters after the period, as indicated by the $ matching the end of string.
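
The sentence pattern can be exercised with a sketch like this one. The helper name is_simple_sentence and the two failing inputs are assumptions added for illustration.

```python
import re

sentence = r"[A-Z][a-z]*( [a-z]+)*\.$"

def is_simple_sentence(text):
    """Hypothetical helper: True if text fits the sentence pattern."""
    return re.match(sentence, text) is not None

print(is_simple_sentence("Eat."))                     # True
print(is_simple_sentence("Eat more good food."))      # True
print(is_simple_sentence("A good meal."))             # True
print(is_simple_sentence("eat more good food."))      # False: no leading capital
print(is_simple_sentence("Eat more good food. Yum"))  # False: text after the period
```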
 
Regular Expressions
 
Let’s write a Python program to determine if a particular string is a
valid e-mail address or not, and if it is an e-mail address, to return
the domain name part of the e-mail address.
 
In terms of the regular expression for a valid e-mail format:
 
pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"
 
Regular Expressions
 
Python's re module provides an object-oriented interface to
the regular expression engine.

We've been checking whether the re.match function returns
a valid object or not. If a pattern does not match, that function
returns None. If it does match, however, it returns a useful
object that we can introspect for information about the
match.
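
As a brief sketch of what the returned object offers, the example below inspects a match against a two-word pattern. The pattern and input string are assumptions chosen for illustration; group, groups, span and start are methods of the match object.

```python
import re

match = re.match(r"(\w+) (\w+)", "hello world")

print(match.group(0))  # hello world  (the whole matched text)
print(match.groups())  # ('hello', 'world')  (one entry per parenthesised group)
print(match.group(2))  # world
print(match.span())    # (0, 11)  (start and end positions of the match)
print(match.start(2))  # 6  (where the second group begins)
```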
 
Regular Expressions
 
Let’s test which of the following addresses are valid:
 
search_string = "Damian.Gordon@dit.ie"
search_string = "Damian.Gordon@ditie"
search_string = "DamianGordon@dit.ie"
search_string = "Damian.Gordondit.ie"
 
Regular Expressions
 
# PROGRAM DomainDetection:
import re

def DetectDomain(searchstring):
    # The parentheses capture the domain part of the address
    pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"
    match = re.match(pattern, searchstring)

    if match is not None:
        # THEN
        domain = match.groups()[0]
        print("<<", domain, ">>", "is a legitimate domain")
    else:
        print("<<", searchstring, ">>", "is not an e-mail address")
    # ENDIF;
# END DetectDomain
 
Regular Expressions
 
# PROGRAM DomainDetection:
import re

def DetectDomain(searchstring):
    # Regular expression for a valid e-mail address, with the
    # domain element in parentheses
    pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"
    # re.match returns None if there is no match, and a match
    # object otherwise
    match = re.match(pattern, searchstring)

    if match is not None:
        # The regular expression above has the domain element in
        # parentheses, so groups() returns a tuple holding just the domain
        domain = match.groups()[0]
        print("<<", domain, ">>", "is a legitimate domain")
    else:
        print("<<", searchstring, ">>", "is not an e-mail address")
    # ENDIF;
# END DetectDomain
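
As a hedged variation on the program above, the sketch below returns the domain instead of printing it and runs over the four test addresses from the earlier slide. The names EMAIL_PATTERN and extract_domain are assumptions introduced for this example; the pattern itself is the one from the slides, written as a raw string.

```python
import re

# Same pattern as in the slides, written as a raw string
EMAIL_PATTERN = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"

def extract_domain(address):
    """Hypothetical helper: return the domain part, or None if there is no match."""
    match = re.match(EMAIL_PATTERN, address)
    if match is None:
        return None
    # group(1) is the text captured by the first pair of parentheses
    return match.group(1)

addresses = [
    "Damian.Gordon@dit.ie",
    "Damian.Gordon@ditie",
    "DamianGordon@dit.ie",
    "Damian.Gordondit.ie",
]
for address in addresses:
    print(address, "->", extract_domain(address))
```

Under this pattern the first and third addresses yield dit.ie, while the second (no period in the domain) and the fourth (no @ sign) yield None.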
 
Regular Expressions
 
In addition to the match function, the re module provides a
couple of other useful functions: search and findall.

The search function finds the first instance of a matching pattern,
relaxing the restriction that the pattern start at the first letter of the
string.

The findall function behaves similarly to search, except that it
finds all non-overlapping instances of the matching pattern, not just
the first one.
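
A minimal sketch contrasting the three functions on one string (the sample text is an assumption chosen for this example):

```python
import re

text = "say hello world"

# match is tied to the start of the string, so it fails here
print(re.match(r"hello", text))       # None

# search scans forward and reports the first place the pattern fits
found = re.search(r"hello", text)
print(found.start(), found.group(0))  # 4 hello

# findall collects every non-overlapping match as a list of strings
print(re.findall(r"l+", text))        # ['ll', 'l']
```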
 
Regular Expressions
 
>>> import re
 
>>> re.findall('a.', 'abacadefagah')
['ab', 'ac', 'ad', 'ag', 'ah']
 
>>> re.findall('a(.)', 'abacadefagah')
['b', 'c', 'd', 'g', 'h']
 
>>> re.findall('(a)(.)', 'abacadefagah')
[('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'g'), ('a', 'h')]
 
>>> re.findall('((a)(.))', 'abacadefagah')
[('ab', 'a', 'b'), ('ac', 'a', 'c'), ('ad', 'a', 'd'), ('ag', 'a',
'g'), ('ah', 'a', 'h')]
 
etc.
 

Uploaded on Sep 11, 2024

