Regular Expressions in Natural Language Processing

 
Natural Language
Processing
 
Lecture 4
 
Basic Text Processing
 
Regular expressions
 
A formal language for specifying text strings
How can we search for any of these?
woodchuck
woodchucks
Woodchuck
Woodchucks
 
 
Regular Expressions: Disjunctions
 
Letters inside square brackets []
 
 
 
 
Ranges [A-Z]
 
 
Regular Expressions: Negation in Disjunction
 
Negations [^Ss]
Carat means negation only when first in []
 
Regular Expressions: More Disjunction
 
Woodchucks is another name for groundhog!
The pipe | for disjunction
 
Regular Expressions: ? * + .
 
Stephen C Kleene
 
Kleene *,   Kleene +
 
Regular Expressions: Anchors  ^   $
 
Example
 
Find me all instances of the word “the” in a text.
 
the            
Misses capitalized examples
 
[tT]he
                        Incorrectly returns 
other
 or 
theology
 
 
 
[^a-zA-Z]
[tT]
he
[^a-zA-Z]
 
Errors
 
The process we just went through was based on fixing two kinds of errors
Matching strings that we should not have matched (there, then, other)
 
False positives (Type I)
 
Not matching things that we should have matched (The)
 
False negatives (Type II)
 
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two antagonistic
efforts:
 
Increasing accuracy or precision (minimizing false positives)
Increasing coverage or recall (minimizing false negatives).
 
11
 
Precision and Recall
 
Precision: How many selected
items are relevant
 
Recall: How many relevant
items are selected
 
Accuracy and Precision
 
Accuracy
 refers to the 
closeness of a measured value to a
standard or known value. 
For example, if in lab you obtain a
weight measurement of 3.2 kg for a given substance, but the
actual or known weight is 10 kg, then your measurement is not
accurate. In this case, your measurement is not close to the
known value.
 
Precision
 refers to the 
closeness of two or more
measurements 
to each other. Using the example above, if you
weigh a given substance five times, and get 3.2 kg each time,
then your measurement is very precise. Precision is
independent of accuracy. You can be very precise but
inaccurate, as described above. You can also be accurate but
imprecise.
https://labwrite.ncsu.edu/Experimental%20Design/accuracyprecision.htm
 
12
 
Accuracy refers to how close a measurement is to the true value.
Precision
 is how consistent results are when measurements are
repeated
An easy way to remember the difference between accuracy and
precision is:
 
A
C
curate is 
C
orrect (or 
C
lose to real value)
 
P
R
ecise is 
R
epeating (or 
R
epeatable)
.
 
13
 
Accuracy and Precision
 
 
14
 
Word tokenization
 
Text Normalization
 
Every NLP task needs to do text normalization:
Segmenting/tokenizing words in running text
Normalizing word formats
Segmenting sentences in running text
 
 
 
Tokenization
 
Given a character sequence and a defined document unit,
tokenization is the task of chopping it up into pieces,
called 
tokens
 , perhaps at the same time throwing away
certain characters, such as punctuation. Here is an example of
tokenization
Input:     Friends, Romans, Countrymen, lend me your ears;
Output:
 
 
17
 
How many words?
 
I do uh main- mainly business data processing
Fragments, filled pauses
 
Seuss’s cat in the hat is different from other cats!
Lemma: same stem, part of speech, rough word sense
cat and cats = same lemma
Wordform: the full inflected surface form
cat and cats = different wordforms
 
How many words?
 
they lay back on the San Francisco grass and looked at
the stars and their
 
Type
: an element of the vocabulary.
Token
: an instance of that type in running text.
How many?
15 tokens (or 14)
13 types (or 12) (or 11?)
 
How many words?
 
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
 
 
 
 
 
 
Church and Gale (1990)
: |V| > O(N
½
)
 
Linux TR Command for basic text processing
 
tr is an UNIX utility for translating, or deleting, or squeezing
repeated characters. It will read from STDIN and write to
STDOUT.
tr stands for Translation.
Syntax
 
$ tr [OPTION] SET1 [SET2]
If both the SET1 and SET2 are specified then tr command
will replace each characters in SET1 with each character in
same position in SET2.
 
 
 
21
 
complement of a pattern
 
echo abcdefghijklmnop | tr -c 'a' 'z'
 
azzzzzzzzzzzzzzzz
delete specific characters
 
echo 'clean this up' |  tr  -d  'up'
 
clean this
the –c  option is combined with the –d delete option to extract a phone
number.
 
echo 'Phone: 01234 567890'  |  tr  -cd  '[:digit:]'
 
01234567890
squeeze repeating characters –s
In the following example a string with too many spaces is squeezed to
remove them.
 
echo 'too     many      spaces      here‘  |  tr  -s  '[:space:]'
 
too many spaces here
 
22
 
Linux TR Command for basic text processing
 
To 
truncate
 the first set to the second set use the –t option. By Default
tr will repeat the last character of the second set if the first and second
sets are different lengths. Consider the following example.
 
echo 'the cat in the hat' | tr 'abcdefgh' '123'
 
t33 31t in t33 31t
The last character of the second set is repeated to match any letter from
c-h. Using the truncate option limits the matching to the length of the
second set.
 
echo 'the cat in the hat' | tr  -t  'abcdefgh'  '123'
 
the 31t in the h1t
find and replace
echo "some_url_that_I_have" | tr "_" "-"
some-url-that-I-have
 
23
 
Linux TR Command for basic text processing
 
translate between "upper" and "lower" case characters is to use the "
[:lower:]
"
and 
"[:upper:]"
 options:
$ 
echo 'welcome to Computer Science' | tr "[:lower:]" "[:upper:]" WELCOME TO
COMPUTER SCIENCE
$ 
echo 'welcome to Computer Science' | tr "a-z" "A-Z“
WELCOME TO COMPUTER SCIENCE
Delete newline:
cat shakes.txt
Line one
Line two
….
cat shakes.txt | tr –d  “\n”
White Space to Tabs
tr [:space:] '\t‘ < shakes.txt | more
White Space to NewLine
    tr [:space:] '\n‘< shakes.txt | more
 
24
 
Linux TR Command for basic text processing
 
$ tr [:upper:] [:lower:] < items.txt > output.txt
$ cat output.txt
 
25
 
Linux TR Command for basic text processing
 
Simple Tokenization in UNIX
 
 
tr -sc A-Za-z \n < shakes.txt | sort | uniq -c
 
Given a text file, output the word tokens and their frequencies
 
tr -sc 
A-Za-z \n 
< shakes.txt
 
| 
sort
 
| uniq -c
Change all non-alpha to newlines
Sort in alphabetical order
Merge and count each type
 
Please study:
https://www.thegeekstuff.com/2012/12/linux-tr-command/
 
More counting
 
Merging upper and lower case
tr A-Z a-z  < shakes.txt | tr -sc A-Za-z \n | sort | uniq -c
 
Sorting the counts
tr A-Z a-z < shakes.txt | tr -sc A-Za-z \n | sort | uniq -c | sort -nr
 
23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
8954  d
 
Issues in Tokenization
 
Finland’s capital 
   
 
  Finland Finlands Finland’s  ?
what’re, I’m, isn’t  
 
  
What are, I am, is not
Hewlett-Packard     
 
 
  Hewlett Packard ?
state-of-the-art     
 
  state of the art ?
Lowercase
   
  lower-case lowercase lower case ?
San Francisco
  
  one token or two?
m.p.h., PhD.
  
  ??
 
Tokenization: language issues
 
French
L'ensemble 
 one token or two?
L ? L’ ? Le ?
Want l’ensemble to match with un ensemble
 
German noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German information retrieval needs compound splitter
 
Tokenization: language issues
 
Chinese and Japanese no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃
  
现在
   
居住
  
    
美国
   
东南部
     
    
佛罗里达
Sharapova now     lives in       US       southeastern     Florida
Further complicated in Japanese, with multiple alphabets
intermingled
Dates/amounts in multiple formats
 
フォーチュン
500
社は情報不足のため時間あた
$500K(
6,000
万円
)
 
End-user can express query entirely in hiragana!
 
Word Tokenization in Chinese
 
Also called Word Segmentation
Chinese words are composed of characters
Characters are generally 1 syllable and 1 morpheme.
Average word is 2.4 characters long.
 
Standard baseline segmentation algorithm:
Maximum Matching  (also called Greedy)
 
Maximum Matching
Word Segmentation Algorithm
 
Given a wordlist of Chinese, and a string.
Start a pointer at the beginning of the string
Find the longest word in dictionary that matches the string starting
at pointer
Move the pointer over the word in string
Go to 2
 
 
پشاور،بنوں نمائندہ جنگ،اے ایف پی بنوں میں اقوام میراخیل اور
 
33
 
Max-match segmentation illustration
 
Thecatinthehat
Thetabledownthere
 
 
 
Doesn’t generally work in English!
 
But works astonishingly well in Chinese
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃
  
现在
   
居住
   
  
美国
   
东南部
     
  
佛罗里达
Modern probabilistic segmentation algorithms even better
 
the table down there
 
the cat in the hat
 
theta bled own there
Slide Note
Embed
Share

Dive into the world of regular expressions for text processing in NLP, learning about disjunctions, negations, patterns, anchors, and practical examples. Discover how to efficiently search for specific text strings using these powerful tools.

  • Regular Expressions
  • Text Processing
  • NLP Techniques
  • Data Science
  • Language Processing

Uploaded on Sep 24, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Natural Language Processing Lecture 4

  2. Basic Text Processing

  3. Regular expressions A formal language for specifying text strings How can we search for any of these? woodchuck woodchucks Woodchuck Woodchucks

  4. Regular Expressions: Disjunctions Letters inside square brackets [] Pattern Matches [wW]oodchuck Woodchuck,woodchuck [1234567890] Any digit Ranges [A-Z] Pattern Matches Drenched Blossoms [A-Z] An upper case letter my beans were impatient [a-z] A lower case letter Chapter 1: Down the Rabbit Hole [0-9] A single digit

  5. Regular Expressions: Negation in Disjunction Negations [^Ss] Carat means negation only when first in [] Pattern Matches Oyfn pripetchik [^A-Z] Not an upper case letter I have no exquisite reason [^Ss] Neither S nor s Look here [^e^] Neither e nor ^ Look up a^b now a^b The pattern a carat b

  6. Regular Expressions: More Disjunction Woodchucks is another name for groundhog! The pipe | for disjunction Pattern Matches groundhog|woodchuck yours mine yours|mine a|b|c = [abc] [gG]roundhog|[Ww]oodchuck

  7. Regular Expressions: ? * + . Pattern Matches color colour colou?r Optional previous char ab! aab! aaab! aaaab! aa*b! 0 or more of previous char ab! aab! aaab! aaaab! a+b! 1 or more of previous char baa baaa baaaa baaaaa baa+ begin begun begun beg3n beg.n Stephen C Kleene Kleene *, Kleene +

  8. Regular Expressions: Anchors ^ $ Pattern ^[A-Z] ^[^A-Za-z] \.$ .$ Matches Palo Alto 1 Hello The end. The end? The end!

  9. Example Find me all instances of the word the in a text. the Misses capitalized examples [tT]he Incorrectly returns other or theology [^a-zA-Z][tT]he[^a-zA-Z]

  10. Errors The process we just went through was based on fixing two kinds of errors Matching strings that we should not have matched (there, then, other) False positives (Type I) Not matching things that we should have matched (The) False negatives (Type II) In NLP we are always dealing with these kinds of errors. Reducing the error rate for an application often involves two antagonistic efforts: Increasing accuracy or precision (minimizing false positives) Increasing coverage or recall (minimizing false negatives).

  11. Precision and Recall Precision: How many selected items are relevant Recall: How many relevant items are selected 11

  12. Accuracy and Precision Accuracy refers to the closeness of a measured value to a standard or known value. For example, if in lab you obtain a weight measurement of 3.2 kg for a given substance, but the actual or known weight is 10 kg, then your measurement is not accurate. In this case, your measurement is not close to the known value. Precision refers to the closeness of two or more measurements to each other. Using the example above, if you weigh a given substance five times, and get 3.2 kg each time, then your measurement is very precise. Precision is independent of accuracy. You can be very precise but inaccurate, as described above. You can also be accurate but imprecise. https://labwrite.ncsu.edu/Experimental%20Design/accuracyprecision.htm 12

  13. Accuracy and Precision Accuracy refers to how close a measurement is to the true value. Precision is how consistent results are when measurements are repeated An easy way to remember the difference between accuracy and precision is: ACcurate is Correct (or Close to real value) PRecise is Repeating (or Repeatable) . 13

  14. 14

  15. Word tokenization

  16. Text Normalization Every NLP task needs to do text normalization: Segmenting/tokenizing words in running text Normalizing word formats Segmenting sentences in running text

  17. Tokenization Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization Input: Friends, Romans, Countrymen, lend me your ears; Output: 17

  18. How many words? I do uh main- mainly business data processing Fragments, filled pauses Seuss s cat in the hat is different from other cats! Lemma: same stem, part of speech, rough word sense cat and cats = same lemma Wordform: the full inflected surface form cat and cats = different wordforms

  19. How many words? they lay back on the San Francisco grass and looked at the stars and their Type: an element of the vocabulary. Token: an instance of that type in running text. How many? 15 tokens (or 14) 13 types (or 12) (or 11?)

  20. How many words? N = number of tokens V = vocabulary = set of types |V| is the size of the vocabulary Church and Gale (1990): |V| > O(N ) Tokens = N Types = |V| Switchboard phone conversations 2.4 million 20 thousand Shakespeare 884,000 31 thousand Google N-grams 1 trillion 13 million

  21. Linux TR Command for basic text processing tr is an UNIX utility for translating, or deleting, or squeezing repeated characters. It will read from STDIN and write to STDOUT. tr stands for Translation. Syntax $ tr [OPTION] SET1 [SET2] If both the SET1 and SET2 are specified then tr command will replace each characters in SET1 with each character in same position in SET2. 21

  22. Linux TR Command for basic text processing complement of a pattern echo abcdefghijklmnop | tr -c 'a' 'z' azzzzzzzzzzzzzzzz delete specific characters echo 'clean this up' | tr -d 'up' clean this the c option is combined with the d delete option to extract a phone number. echo 'Phone: 01234 567890' | tr -cd '[:digit:]' 01234567890 squeeze repeating characters s In the following example a string with too many spaces is squeezed to remove them. echo 'too many spaces here | tr -s '[:space:]' too many spaces here 22

  23. Linux TR Command for basic text processing To truncate the first set to the second set use the t option. By Default tr will repeat the last character of the second set if the first and second sets are different lengths. Consider the following example. echo 'the cat in the hat' | tr 'abcdefgh' '123' t33 31t in t33 31t The last character of the second set is repeated to match any letter from c-h. Using the truncate option limits the matching to the length of the second set. echo 'the cat in the hat' | tr -t 'abcdefgh' '123' the 31t in the h1t find and replace echo "some_url_that_I_have" | tr "_" "-" some-url-that-I-have 23

  24. Linux TR Command for basic text processing translate between "upper" and "lower" case characters is to use the "[:lower:]" and "[:upper:]" options: $ echo 'welcome to Computer Science' | tr "[:lower:]" "[:upper:]" WELCOME TO COMPUTER SCIENCE $ echo 'welcome to Computer Science' | tr "a-z" "A-Z WELCOME TO COMPUTER SCIENCE Delete newline: cat shakes.txt Line one Line two . cat shakes.txt | tr d \n White Space to Tabs tr [:space:] '\t < shakes.txt | more White Space to NewLine tr [:space:] '\n < shakes.txt | more 24

  25. Linux TR Command for basic text processing $ tr [:upper:] [:lower:] < items.txt > output.txt $ cat output.txt 25

  26. Simple Tokenization in UNIX tr -sc A-Za-z \n < shakes.txt | sort | uniq -c Given a text file, output the word tokens and their frequencies tr -scA-Za-z \n < shakes.txt | sort | uniq -c Change all non-alpha to newlines Sort in alphabetical order Merge and count each type Please study: https://www.thegeekstuff.com/2012/12/linux-tr-command/

  27. More counting Merging upper and lower case tr A-Z a-z < shakes.txt | tr -sc A-Za-z \n | sort | uniq -c Sorting the counts tr A-Z a-z < shakes.txt | tr -sc A-Za-z \n | sort | uniq -c | sort -nr 23243 the 22225 i 18618 and 16339 to 15687 of 12780 a 12163 you 10839 my 10005 in 8954 d

  28. Issues in Tokenization Finland Finlands Finland s ? What are, I am, is not Hewlett Packard ? state of the art ? lower-case lowercase lower case ? one token or two? ?? Finland s capital what re, I m, isn t Hewlett-Packard state-of-the-art Lowercase San Francisco m.p.h., PhD.

  29. Tokenization: language issues French L'ensemble one token or two? L ? L ? Le ? Want l ensemble to match with un ensemble German noun compounds are not segmented Lebensversicherungsgesellschaftsangestellter life insurance company employee German information retrieval needs compound splitter

  30. Tokenization: language issues Chinese and Japanese no spaces between words: Sharapova now lives in US southeastern Florida Further complicated in Japanese, with multiple alphabets intermingled Dates/amounts in multiple formats 500 $500K( 6,000 ) Katakana Hiragana Kanji Romaji End-user can express query entirely in hiragana!

  31. Word Tokenization in Chinese Also called Word Segmentation Chinese words are composed of characters Characters are generally 1 syllable and 1 morpheme. Average word is 2.4 characters long. Standard baseline segmentation algorithm: Maximum Matching (also called Greedy)

  32. Maximum Matching Word Segmentation Algorithm Given a wordlist of Chinese, and a string. Start a pointer at the beginning of the string Find the longest word in dictionary that matches the string starting at pointer Move the pointer over the word in string Go to 2

  33. 33

  34. Max-match segmentation illustration the cat in the hat Thecatinthehat Thetabledownthere the table down there theta bled own there Doesn t generally work in English! But works astonishingly well in Chinese Modern probabilistic segmentation algorithms even better

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#