Regular Expressions and Pattern Matching

http://xkcd.com/208/
 
Basic Text Processing
 
Regular Expressions
 
Regular expressions
 
A formal language for specifying text strings
How can we search for any of these?
woodchuck
woodchucks
Woodchuck
Woodchucks
 
Regular Expressions: Disjunctions
 
Letters inside square brackets []
 
 
 
Ranges
 
[A-Z]
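
A minimal Python sketch of bracket disjunctions and ranges, using the standard `re` module. The variable names and word list are illustrative assumptions, not part of the slides.

```python
import re

# Character class: either capitalization of the first letter.
woodchuck = re.compile(r"[wW]oodchuck")
words = ["woodchuck", "Woodchuck", "woodchucks", "groundhog"]
matched = [w for w in words if woodchuck.search(w)]

# Ranges: the first upper case letter, lower case letter, and digit found.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(matched, first_upper, first_lower, first_digit)
```

Note that `search` finds the pattern anywhere in the string, so "woodchucks" also matches `[wW]oodchuck`.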
 
 
Regular Expressions: Negation in Disjunction
 
Negations
 [^Ss]
Caret means negation only when it is first inside []
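
A short sketch of negated classes in Python's `re` module. One caveat is specific to Python (and is an observation about that engine, not a claim from the slides): outside brackets, `^` is an anchor, so a literal caret between two letters has to be escaped.

```python
import re

# [^Ss] matches any single character that is neither S nor s.
not_s = re.search(r"[^Ss]", "Ssh").group()

# A caret that is not first inside the brackets is an ordinary character:
# [^e^] matches anything that is neither 'e' nor '^'.
neither = re.findall(r"[^e^]", "e^a")

# In Python, an unescaped ^ outside brackets is the start-of-string anchor,
# so r"a^b" cannot match here; escaping the caret makes it literal.
unescaped = re.search(r"a^b", "Look up a^b now")
escaped = re.search(r"a\^b", "Look up a^b now").group()

print(not_s, neither, unescaped, escaped)
```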
 
Regular Expressions: More Disjunction
 
Woodchuck is another name for groundhog!
The pipe | for disjunction
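
The pipe can be sketched in Python as follows. The sample sentence is a made-up illustration; the last lines check that `a|b|c` behaves like `[abc]` on one test word.

```python
import re

# Disjunction of two alternatives, each allowing either capitalization.
pattern = re.compile(r"[gG]roundhog|[Ww]oodchuck")
text = "A Woodchuck is a groundhog; woodchucks dig."
found = pattern.findall(text)

# a|b|c picks out the same single characters as the class [abc].
with_pipe = re.findall(r"a|b|c", "cabbage")
with_class = re.findall(r"[abc]", "cabbage")

print(found, with_pipe == with_class)
```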
 
Regular Expressions: ? * + .
 
 
 
Stephen C. Kleene
 
Kleene *, Kleene +
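
A small sketch of the four operators in Python. The helper `full` is a hypothetical convenience that tests whether the whole string matches, so that each operator's effect is easy to see.

```python
import re

def full(pattern, s):
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# ? makes the previous character optional.
optional = [s for s in ["color", "colour", "colouur"] if full(r"colou?r", s)]

# * means zero or more of the previous character.
star = [s for s in ["h!", "oh!", "ooh!", "oooh!"] if full(r"oo*h!", s)]

# + means one or more of the previous character.
plus = [s for s in ["h!", "oh!", "ooh!"] if full(r"o+h!", s)]
baa = [s for s in ["ba", "baa", "baaa"] if full(r"baa+", s)]

# . matches any single character (except a newline, by default).
dot = [s for s in ["begin", "begun", "beg3n", "begn"] if full(r"beg.n", s)]

print(optional, star, plus, baa, dot)
```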
 
Regular Expressions: Anchors ^ $
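
A brief sketch of the two anchors in Python. The variable names are illustrative; note that the period must be escaped (`\.`) to match a literal period rather than any character.

```python
import re

# ^ anchors the match to the start of the string.
starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_non_letter = re.search(r"^[^A-Za-z]", "1 Hello").group()
no_match = re.search(r"^[A-Z]", "palo alto")

# $ anchors the match to the end of the string.
literal_period = re.search(r"\.$", "The end.").group()
period_vs_question = re.search(r"\.$", "The end?")
any_last_char = [re.search(r".$", s).group() for s in ["The end?", "The end!"]]

print(starts_upper, starts_non_letter, literal_period, any_last_char)
```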
 
Example
 
Find me all instances of the word “the” in a text.
the
    Misses capitalized examples
[tT]he
    Incorrectly returns "other" or "theology"
[^a-zA-Z][tT]he[^a-zA-Z]
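
The three attempts can be compared on a made-up sentence. The final pattern, using the word boundary `\b`, is a swapped-in alternative that does not appear in the slides; it is included only because the bracketed version misses a "The" at the very start of the text, where there is no preceding character to match.

```python
import re

# Hypothetical sample text chosen to trigger each kind of error.
text = "The other day the theology class met; then the bell rang."

v1 = re.findall(r"the", text)                        # misses "The", hits "other"
v2 = re.findall(r"[tT]he", text)                     # still hits "other", "theology"
v3 = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)   # needs a non-letter on both sides

# Alternative (not from the slides): word boundaries consume no characters.
v4 = re.findall(r"\b[tT]he\b", text)

print(len(v1), len(v2), v3, v4)
```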
 
Errors
 
The process we just went through was based on fixing two kinds of errors:
Matching strings that we should not have matched (there, then, other)
    False positives (Type I)
Not matching things that we should have matched (The)
    False negatives (Type II)
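
The two error types can be made concrete with a toy token list. The tokens, the `gold` set, and the `errors` helper are all hypothetical, chosen to mirror the examples above.

```python
import re

# Toy data: candidate tokens and the ones that really are the word "the".
tokens = ["The", "the", "there", "then", "other", "cat"]
gold = {"The", "the"}

def errors(pattern):
    """Return (false positives, false negatives) for pattern on the toy tokens."""
    predicted = {t for t in tokens if re.search(pattern, t)}
    false_positives = predicted - gold   # matched, but should not have been
    false_negatives = gold - predicted   # should have matched, but were not
    return false_positives, false_negatives

fp_naive, fn_naive = errors(r"the")
fp_fixed, fn_fixed = errors(r"^[tT]he$")

print(fp_naive, fn_naive, fp_fixed, fn_fixed)
```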
 
Errors cont.
 
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two antagonistic efforts:
Increasing accuracy or precision (minimizing false positives)
Increasing coverage or recall (minimizing false negatives)
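
A sketch of the two measures using their standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are hypothetical: they assume a six-token sample in which two tokens are truly "the", comparing the pattern `the` with `[tT]he`.

```python
def precision(tp, fp):
    """Fraction of matches that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of the correct items that were matched."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for the pattern "the": 1 hit, 3 spurious, 1 missed.
p_naive, r_naive = precision(1, 3), recall(1, 1)

# Hypothetical counts for "[tT]he": 2 hits, 3 spurious, 0 missed.
p_caps, r_caps = precision(2, 3), recall(2, 0)

print(p_naive, r_naive, p_caps, r_caps)
```

In this toy case, allowing the capital T removes the false negative and raises recall, but the false positives remain until the pattern is also tightened.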
 
Summary
 
Regular expressions play a surprisingly large role
Sophisticated sequences of regular expressions are often the first model for any text processing task
For many hard tasks, we use machine learning classifiers
But regular expressions are used as features in the classifiers
Can be very useful in capturing generalizations
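
One way regular expressions might serve as classifier features is sketched below. The feature names and the `extract_features` helper are assumptions for illustration; the slides do not specify a feature set or a classifier.

```python
import re

# Hypothetical binary features, each defined by one regular expression.
FEATURE_PATTERNS = {
    "has_digit": re.compile(r"[0-9]"),
    "starts_upper": re.compile(r"^[A-Z]"),
    "ends_with_period": re.compile(r"\.$"),
}

def extract_features(text):
    """Map a string to a dict of 0/1 features, one per pattern."""
    return {name: int(bool(p.search(text))) for name, p in FEATURE_PATTERNS.items()}

features_a = extract_features("Chapter 1.")
features_b = extract_features("the end?")

print(features_a, features_b)
```

A dictionary like this could then be passed to whatever classifier the application uses.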
 
 
Islamic historian Maxim Romanov showing off his regular expression!
 

Learn about regular expressions, pattern matching, and text processing through examples and explanations provided in Dan Jurafsky's materials. Discover how to use disjunctions, negations, anchors, and other functionalities to search and manipulate text strings effectively.

  • Regular Expressions
  • Pattern Matching
  • Text Processing
  • Dan Jurafsky
  • Language Processing

Uploaded on Sep 29, 2024



Presentation Transcript


  1. http://xkcd.com/208/

  2. Basic Text Processing Regular Expressions

  3. Dan Jurafsky. Regular expressions
     A formal language for specifying text strings.
     How can we search for any of these?
       woodchuck
       woodchucks
       Woodchuck
       Woodchucks

  4. Regular Expressions: Disjunctions
     Letters inside square brackets []
       Pattern         Matches
       [wW]oodchuck    Woodchuck, woodchuck
       [1234567890]    Any digit
     Ranges [A-Z]
       Pattern   Matches                 Example
       [A-Z]     An upper case letter    Drenched Blossoms
       [a-z]     A lower case letter     my beans were impatient
       [0-9]     A single digit          Chapter 1: Down the Rabbit Hole

  5. Regular Expressions: Negation in Disjunction
     Negations: [^Ss]
     Caret means negation only when it is first inside []
       Pattern   Matches                     Example
       [^A-Z]    Not an upper case letter    Oyfn pripetchik
       [^Ss]     Neither S nor s             I have no exquisite reason
       [^e^]     Neither e nor ^             Look here
       a^b       The pattern a caret b       Look up a^b now

  6. Regular Expressions: More Disjunction
     Woodchuck is another name for groundhog!
     The pipe | for disjunction
       Pattern                      Matches
       groundhog|woodchuck
       yours|mine                   yours, mine
       a|b|c                        = [abc]
       [gG]roundhog|[Ww]oodchuck

  7. Regular Expressions: ? * + .
       Pattern    Matches                       Description
       colou?r    color colour                  Optional previous char
       oo*h!      oh! ooh! oooh! ooooh!         0 or more of previous char
       o+h!       oh! ooh! oooh! ooooh!         1 or more of previous char
       baa+       baa baaa baaaa baaaaa
       beg.n      begin begun begun beg3n
     Stephen C. Kleene: Kleene *, Kleene +

  8. Regular Expressions: Anchors ^ $
       Pattern       Matches
       ^[A-Z]        Palo Alto
       ^[^A-Za-z]    1 Hello
       \.$           The end.
       .$            The end? The end!

  9. Example
     Find me all instances of the word “the” in a text.
       the                         Misses capitalized examples
       [tT]he                      Incorrectly returns "other" or "theology"
       [^a-zA-Z][tT]he[^a-zA-Z]

  10. Errors
      The process we just went through was based on fixing two kinds of errors:
        Matching strings that we should not have matched (there, then, other): false positives (Type I)
        Not matching things that we should have matched (The): false negatives (Type II)

  11. Errors cont.
      In NLP we are always dealing with these kinds of errors.
      Reducing the error rate for an application often involves two antagonistic efforts:
        Increasing accuracy or precision (minimizing false positives)
        Increasing coverage or recall (minimizing false negatives)

  12. Summary
      Regular expressions play a surprisingly large role.
        Sophisticated sequences of regular expressions are often the first model for any text processing task.
      For many hard tasks, we use machine learning classifiers.
        But regular expressions are used as features in the classifiers.
        Can be very useful in capturing generalizations.

  13. Islamic historian Maxim Romanov showing off his regular expression!

  14. Basic Text Processing Regular Expressions
