Understanding Tokenization, Lemmatization, and Stemming in Natural Language Processing

Tokenization involves splitting natural language text into wordforms or tokens, with considerations for word treatments such as lowercase conversion, lemmatization, and stemming. Lemmatization determines the base form (lemma) of a word, while stemming simplifies wordforms using a sequence of rules. The choice of word treatments varies between applications, so it is important to define wordforms based on each application's requirements.


Presentation Transcript


  1. ECE467: Natural Language Processing Tokenization, Words, and Morphology

  2. Tokenization This topic is largely based on Chapter 2 of the textbook, titled "Regular Expressions, Text Normalization, Edit Distance" However, I am leaving out some content and adding additional content Toward the end of the topic, I will also discuss some material from two other books (both were referenced in our intro topic) Section 2.4.3, on byte-pair encoding, will be discussed near the end of the course, along with subword embeddings Tokenization generally involves splitting natural language text into wordforms (specific forms of words), or more generally, tokens How wordforms should be treated, and even what is meant by "word", is not trivial For some applications, words are kept in their original wordforms Other applications may convert all letters to lowercase to be case-insensitive Other applications may perform lemmatization or stemming (these topics are discussed on the next two slides)

  3. Lemmatization A lemma is the base form of a word (sometimes called the canonical form or dictionary form) Often, multiple wordforms (e.g., "sing", "sang", "sung", "singing") share the same lemma The set of words that share the same lemma must all have the same major part of speech (POS) We'll discuss POS in more detail in a later topic, but the notion also comes up several times in this topic Algorithms to perform lemmatization can involve morphological analysis of wordforms (we'll learn more about this later in the topic) However, it is probably more common to use a resource; e.g., a very popular resource for conventional NLP used to be WordNet WordNet had various other important uses as well in conventional NLP
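
A minimal sketch of resource-based lemmatization, using NLTK's interface to WordNet (NLTK itself is not mentioned in the slides; it is assumed here only as one convenient way to query WordNet, and the "wordnet" corpus must be downloaded first):

```python
import nltk
from nltk.stem import WordNetLemmatizer  # requires: pip install nltk

nltk.download("wordnet", quiet=True)  # fetch the WordNet data if not already present

lemmatizer = WordNetLemmatizer()

# The lemmatizer needs a POS hint ("v" = verb, "n" = noun) to pick the right lemma.
print(lemmatizer.lemmatize("sang", pos="v"))     # -> sing
print(lemmatizer.lemmatize("singing", pos="v"))  # -> sing
print(lemmatizer.lemmatize("geese", pos="n"))    # -> goose
```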

  4. Stemming Stemming is simpler than lemmatization Stemming involves the use of a sequence of rules to convert a wordform to a simpler form For example, the Porter stemmer has been very popular in conventional NLP; sample rules include SSES -> SS (e.g., "grasses" -> "grass"), ATIONAL -> ATE (e.g., "relational" -> "relate"), and deleting ING when the stem contains a vowel (e.g., "motoring" -> "motor") In theory, words that share the same lemma should map to the same stem, and words that do not share the same lemma should not In practice, it sometimes works but sometimes does not
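
For comparison, a quick sketch of rule-based stemming with NLTK's implementation of the Porter stemmer (again, NLTK is an assumption, not something named in the slides):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["relational", "motoring", "grasses", "sang"]:
    print(word, "->", stemmer.stem(word))
# relational -> relat   (the full cascade of rules also strips the final "e")
# motoring   -> motor
# grasses    -> grass
# sang       -> sang    (rules cannot undo irregular inflection; lemmatization can)
```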

  5. Wordforms Note that what counts as a wordform in the first place varies between apps Some applications may just use whitespace to separate wordforms Some may strip punctuation Some may count certain punctuation (e.g., periods, question marks, etc.) as separate tokens Some applications may do more complicated splitting; e.g., some split a possessive -s ('s) into a separate token Some applications may convert all letters to lowercase to achieve case insensitivity In some languages, words are not separated by spaces For example, in Chinese, words are composed of Hanzi characters Each generally represents a single morpheme and is pronounced as a single syllable
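
A rough sketch of some of these tokenization choices in Python; simple_tokenize is a hypothetical helper, not a standard function, and real tokenizers handle many more cases:

```python
import re

def simple_tokenize(text, lowercase=True):
    if lowercase:
        text = text.lower()                        # optional case folding
    text = re.sub(r"'s\b", " 's", text)            # split possessive 's into its own token
    text = re.sub(r"([.?!,])", r" \1 ", text)      # treat common punctuation as separate tokens
    return text.split()                            # finally, split on whitespace

print(simple_tokenize("The woodchuck's burrow collapsed!"))
# ['the', 'woodchuck', "'s", 'burrow', 'collapsed', '!']
```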

  6. Sentence Segmentation Another component of many NLP applications is sentence segmentation (also not trivial) It may seem intuitive to split a document into sentences first, and then to tokenize sentences More often, the opposite occurs, since the result of tokenization aids sentence segmentation One complication is that periods are also used for acronyms (e.g., "U.S.A.", "m.p.h."), abbreviations (e.g., "Mr.", "Corp."), and decimal points (e.g., "$50.25") Note that acronyms and some abbreviations ending in periods can end sentences at times The process of tokenization, optionally followed by lemmatization or stemming and/or sentence segmentation, is often referred to as text normalization
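
To illustrate why the period is problematic, here is a deliberately naive sentence splitter (a sketch, not a recommended method); it wrongly treats the period after "Mr." as a sentence boundary:

```python
import re

def naive_sentence_split(text):
    # Split after . ! or ? when followed by whitespace and a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive_sentence_split("Mr. Smith paid $50.25. He was not happy."))
# ['Mr.', 'Smith paid $50.25.', 'He was not happy.']   <- "Mr." is wrongly split off
```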

  7. The Chomsky Hierarchy The Chomsky hierarchy defines four types of formal grammars that are useful for various tasks These are unrestricted grammars (type 0), context-sensitive grammars (type 1), context-free grammars (type 2), and regular grammars (type 3) These are numbered from the most powerful / least restrictive (type 0) to the least powerful / most restrictive (type 3) It is often useful (i.e., simpler and more efficient) to use the most restrictive type of grammar that suits your purpose Regular grammars are generally powerful enough for tokenization There are various equivalent ways to define a specific instance of each type of grammar that is part of the Chomsky hierarchy For example, when we talk about context-free grammars during Part II of the course (on conventional computational linguistics), we will define them using productions, a.k.a. rewrite rules Regular grammars can also be defined using rewrite rules; or they can be defined using finite state automata; however, in this course, we will define them with regular expressions

  8. Regular Expressions A regular expression (RE) is a grammatical formalism useful for defining regular grammars Each regular expression is a formula in a special language that specifies simple classes of strings (a string is just a sequence of symbols) Regular expressions are case sensitive RE search requires a pattern that we want to search for and a corpus of texts to search through Our textbook seems to use Perl syntax (Perl used to be a very popular programming language in the field of NLP), so we will too for our examples Although the syntax might be a bit different, if you understand how to use regular expressions, you won't have a problem using them in Python or some other language Following the book, we will typically assume that an RE search returns the first line of the document containing the pattern We can contrast this to the Unix "grep" command, which returns all matching lines
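
The slides use Perl-style /pattern/ notation; in Python the same patterns are ordinary strings passed to the re module. Below is a small sketch of the textbook's convention of returning the first line that contains the pattern (first_matching_line is a hypothetical helper, not a library function):

```python
import re

def first_matching_line(pattern, document):
    for line in document.splitlines():
        if re.search(pattern, line):
            return line
    return None  # no line matched

doc = "No rodents here.\nI saw a woodchuck.\nAnother woodchuck!"
print(first_matching_line(r"woodchuck", doc))  # -> I saw a woodchuck.
```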

  9. Simple Regular Expressions The simplest type of regular expression is just a sequence of one or more characters For example, to search for "woodchuck", you would just use the following: /woodchuck/ Brackets can be used to specify a disjunction of characters to match For example, /[wW]oodchuck/ would search for the word starting with a lowercase or capital 'w' A dash can help to specify a range For example, /[A-Za-z]/ searches for any uppercase or lowercase letter If the first character within square brackets is the caret ('^'), this means that we are specifying what the characters cannot be For example, /[^A-Z]/ matches any character except a capital letter
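
The bracket examples above, checked with Python's re module (only the syntax around the pattern differs from the Perl-style slashes):

```python
import re

print(bool(re.search(r"[wW]oodchuck", "Woodchucks dig")))  # -> True  (either case of 'w')
print(re.findall(r"[A-Za-z]", "3 dogs!"))                  # -> ['d', 'o', 'g', 's']  (letters only)
print(re.findall(r"[^A-Z]", "Hi!"))                        # -> ['i', '!']  (anything but a capital)
```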

  10. Special Characters A question mark ('?') can be used to match the preceding character or RE, or nothing (i.e., zero or one instance of the preceding character or RE) For example, /woodchucks?/ matches the singular or plural of the word "woodchuck" The Kleene star ('*') indicates zero or more occurrences of the previous character or RE The Kleene + ('+') means one or more of the previous character or RE As an example of why these are useful, let's say you want to represent all strings representing sheep sounds In other words, we want to represent the language consisting of the strings "baa!", "baaa!", "baaaa!", "baaaaa!", etc. Two regular expressions defining this language are /baaa*!/ and /baa+!/ This language is an example of a formal language, which is a set of strings adhering to specific rules One very important special character is the period ('.'); this is a wildcard expression that matches any single character (except an end-of-line character) For example, to find a line in which the string "aardvark" appears twice, you can use /aardvark.*aardvark/ To match an actual period, you can use "\." within an RE
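
The sheep-language patterns and the wildcard example, tried in Python (re.fullmatch requires the entire string to match, which is convenient for testing membership in a formal language):

```python
import re

for s in ["baa!", "baaaa!", "ba!"]:
    print(s, bool(re.fullmatch(r"baaa*!", s)), bool(re.fullmatch(r"baa+!", s)))
# baa!    True  True
# baaaa!  True  True
# ba!     False False   (too few a's for either pattern)

print(bool(re.fullmatch(r"woodchucks?", "woodchuck")))  # -> True  ('?' makes the 's' optional)
print(bool(re.search(r"aardvark.*aardvark", "one aardvark met another aardvark")))  # -> True
```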

  11. Anchors Anchors are special characters that match particular places in a string For example, the caret ('^') and dollar sign ('$') can be used to match the start of a line or end of a line, respectively Example: /^The/ matches the word "The" at the start of a line Example: / $/ matches a line ending with a space (the textbook uses a special symbol to make spaces visible) Example: /^The dog\.$/ matches a line that contains only the phrase "The dog." Recall that the caret also has other meanings Two other anchors: \b matches a word boundary, and \B matches any non-boundary position A "word" in this context is a sequence of letters, digits, and/or underscores
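
The anchor examples in Python; note that ^ and $ only match at every line boundary if the re.MULTILINE flag is set:

```python
import re

text = "The dog.\nSee the dog."
print(re.findall(r"^The", text, re.MULTILINE))  # -> ['The']   (only the first line starts with "The")
print(re.findall(r"\bthe\b", text))             # -> ['the']   (word boundaries; does not match "The")
```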

  12. Disjunction, Parentheses, and Precedence The ('|') character is called the disjunction operator, a.k.a. the pipe symbol For example, the pattern /cat|dog/ matches either the string "cat" or the string "dog" You can use parentheses to help specify precedence For example, to search for the singular or plural of the word "guppy", you can use /gupp(y|ies)/ Unlike the | operator, the * and + operators apply by default to a single character By putting the expression before these operators in parentheses, you make the operator apply to the whole expression
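
The disjunction and grouping examples in Python (the (?: ) form is Python's non-capturing group, used here so that findall returns the whole match rather than just the captured "y"/"ies"):

```python
import re

print(re.findall(r"cat|dog", "raining cats and dogs"))         # -> ['cat', 'dog']
print(re.findall(r"gupp(?:y|ies)", "one guppy, two guppies"))  # -> ['guppy', 'guppies']
```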

  13. Precedence Hierarchy The operator precedence hierarchy for regular expressions, according to our textbook, is as follows: 1. Parentheses: ( ) 2. Counters: * + ? { } (the curly braces can be used to specify ranges, more on this soon) 3. Sequences and anchors (e.g., the, ^my, end$, etc.) 4. Disjunction: | The list of classes, with the specified order above, is pretty typical Some sources list brackets as having even higher precedence than parentheses, and a backslash to indicate special characters above that

  14. RE Example The textbook steps through an example of searching for the word "the" in a text without any false positives or false negatives They speculate that we want any non-letter to count as a word separator That is, to the immediate left of the 't' and to the immediate right of the 'e', any non-letter may be present We also need to consider that "the" may occur at the start or end of a line Finally, they allow the 't' to be capital or lowercase, but the 'h' and 'e' must be lowercase They end up with the following expression: /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
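
The textbook's final pattern, tried on a few lines in Python (the pattern is copied directly from the expression above):

```python
import re

pattern = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"
for line in ["The dog ran.", "other words", "at the end the"]:
    print(repr(line), "->", bool(re.search(pattern, line)))
# 'The dog ran.'   -> True
# 'other words'    -> False  ("the" inside "other" is surrounded by letters)
# 'at the end the' -> True
```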

  15. Other RE Constructs Aliases for common sets include: \d (any digit) \D (a non-digit) \w (any alphanumeric or underscore) \W (non-alphanumeric) \s (any whitespace, e.g., space or tab) \S (non-whitespace) The characters \t and \n represent tab and newline Other special characters can be specified themselves by preceding them with a backslash (e.g., /\./, /\*/, /\\/, etc.) Curly braces specify ranges of repetition: {n} means exactly n occurrences of the previous character or expression {n,m} means from n to m occurrences {n,} means at least n occurrences
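
A few of these aliases and counters in Python syntax:

```python
import re

print(re.findall(r"\d+", "35 boxes, 7 cats"))  # -> ['35', '7']          (runs of digits)
print(re.findall(r"\w+", "don't stop"))        # -> ['don', 't', 'stop'] (apostrophe is not \w)
print(bool(re.fullmatch(r"a{2,4}", "aaa")))    # -> True                 (between 2 and 4 a's)
```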

  16. Substitutions and Registers One important use of regular expressions is substitutions We indicate a desired substitution by placing an 's' before the first '/' For example, s/colour/color/ would replace the British spelling with the American spelling Parentheses can be used to capture a part of the searched-for expression Each captured match is stored in a register Sub-expressions within parentheses are numbered from left to right For example, to match expressions such as "the faster they ran, the faster we ran" (but the adjective and verb can vary), you can use: /the (.*)er they (.*), the \1er we \2/ As an example of substitution, suppose we want to surround all integers with angled brackets (e.g., change "the 35 boxes" to "the <35> boxes"), you could use: s/([0-9]+)/<\1>/ The syntax for referring to captured text outside of the RE differs across programming languages Not mentioned in the textbook: With registers, REs are more powerful than standard formalisms for expressing regular grammars
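
The substitution and register examples from this slide, written with Python's re.sub and re.search (in Python, \1 and \2 in a pattern or replacement string refer back to the captured groups):

```python
import re

print(re.sub(r"colour", "color", "my favourite colour"))
# -> my favourite color

print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes weigh 9 pounds"))
# -> the <35> boxes weigh <9> pounds

print(bool(re.search(r"the (.*)er they (.*), the \1er we \2",
                     "the faster they ran, the faster we ran")))
# -> True  (\1 = "fast", \2 = "ran")
```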

  17. Morphology Book: "Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes." The current edition of the textbook very briefly discusses morphological parsing, which is necessary for "sophisticated methods for lemmatization" Recall that in practice, a wordform can instead be looked up in an appropriate resource to retrieve the lemma WordNet is an example of such a resource that was very popular in conventional NLP Keeping such resources current (e.g., by adding new words) involves a lot of manual effort Also recall that we can avoid lemmatization by applying stemming instead, which is much simpler (but doesn't always work as well) The current edition of the textbook dropped most of its discussion of morphology We will discuss it in more detail than the book (but significantly less than I used to); some of this content comes from the previous edition of the textbook

  18. Rules of Morphology Orthographic rules are general rules that deal with spelling and tell us how to transform words; some examples are: To pluralize a noun ending in "y", change the "y" to an "i" and add "es" (e.g., "bunnies") A single consonant letter is often doubled before adding "-ing" or "-ed" suffixes (e.g., "begging", "begged") A "c" is often changed to "ck" when adding "-ing" and "-ed" (e.g., "picnicking", "picnicked") Morphological rules deal with exceptions; e.g., "fish" is its own plural, "goose" becomes "geese" Morphological parsing uses both types of rules in order to break down a word into its component morphemes A morpheme is the smallest part of the word that has a semantic meaning For example, given the wordform, "going", the parsed form can be represented as: "VERB-go + GERUND-ing" Conventionally, morphological parsing sometimes played an important role for POS tagging For morphologically complex languages (we'll discuss an example later in the topic), it can also play an important role for web search

  19. Stems and Affixes Two broad classes of morphemes are stems and affixes The stem is the main (i.e., the central, most important, or most significant) morpheme of the word Affixes add additional meanings of various kinds Affixes can be further divided into prefixes, suffixes, infixes, and circumfixes English debatably does not have circumfixes and proper English probably does not have any infixes A word can have more than one affix; for example, the word "unbelievably" has a stem ("believe"), a prefix ("un-"), and two suffixes ("-able" and "-ly") English rarely stacks more than four or five affixes on a word, but languages like Turkish have words with nine or ten We will discuss an example of a morphologically complex language later

  20. Combining Morphemes Four methods of combining morphemes to create words include inflection, derivation, compounding, and cliticization Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same basic POS as the stem, usually filling some syntactic function like agreement As mentioned earlier, we will discuss POS in more detail in a later topic Examples of inflection include pluralizing a noun or changing the tense of a verb Derivation is the combination of a word stem with a morpheme, usually resulting in a word of a different class, often with a meaning that is harder to predict Examples from the textbook (previous edition) include "appointee" from "appoint" and "clueless" from "clue"; an example from Wikipedia is "happiness" from "happy" Compounding is the combination of multiple word stems together; for example, "doghouse" Cliticization is the combination of a word stem with a clitic A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached to another word An example is the English morpheme "'ve" in words such as "I've" (substituting for "have") or the French definite article "l'" in words such as "l'opera" (substituting for "le")

  21. Inflection of English Nouns English nouns have only two kinds of inflections: an affix that makes the word plural and an affix that marks possessive Most nouns are pluralized by adding "s" The suffix "es" is added to most nouns ending in "s", "z", "sh", "ch", and "x" Nouns ending in "y" preceded by a consonant change the "y" to "i" and add "es" The possessive suffix usually just entails adding "'s" for regular singular nouns or plural nouns not ending in "s" (e.g., "children's") Usually, you just add a lone apostrophe for plural nouns ending in "s" (e.g., "llamas'") or names ending in "s" or "z" (e.g., "Euripides' comedies")
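
A toy pluralizer sketching the regular rules above (pluralize is a hypothetical helper; a real morphological analyzer would also need an exception list for irregular nouns such as "goose"/"geese"):

```python
import re

def pluralize(noun):
    if re.search(r"[^aeiou]y$", noun):      # consonant + "y" -> "ies"
        return noun[:-1] + "ies"
    if re.search(r"(s|z|sh|ch|x)$", noun):  # sibilant endings -> add "es"
        return noun + "es"
    return noun + "s"                       # default: add "s"

for n in ["bunny", "fox", "church", "llama"]:
    print(n, "->", pluralize(n))
# bunny -> bunnies, fox -> foxes, church -> churches, llama -> llamas
```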

  22. Inflection of English Regular Verbs English has three kinds of verbs Main verbs (e.g., "eat", "sleep", "impeach") Modal verbs ("can", "will", "should") Primary verbs (e.g., "be", "have", "do") Regular verbs in English have four inflected forms; for example, for "walk": the stem ("walk"), the -s form ("walks"), the -ing participle ("walking"), and the past form or -ed participle ("walked")

  23. Inflection of English Irregular Verbs Irregular verbs typically have up to five different forms but can have as many as eight (e.g., the verb "be") or as few as three (e.g., "cut") The forms of "be" are: "be", "am", "is", "are", "was", "were", "been", "being" Other examples of irregular verbs include "eat" ("eat", "eats", "eating", "ate", "eaten") and "cut" ("cut", "cuts", "cutting")

  24. Other Formalisms I used to spend time in class covering deterministic and non-deterministic finite-state automata (FSAs) Both are equivalent to regular expressions (without registers) in power I also used to cover finite-state transducers (FSTs); an FST is a finite automaton that maps between two sets of strings FSAs and FSTs are useful for implementing algorithms related to regular expressions, tokenization, and morphological parsing These topics have been dropped from the current edition of the textbook, and I also dropped them

  25. The Language Instinct I am basing the next portion of this topic on material from Chapter 5 of "The Language Instinct" by Steven Pinker The book, in general, discusses Pinker's theories related to how humans process language The title of Chapter 5 is "Words, Words, Words" Pinker states that humans remember stems and rules related to inflection and derivation; they also remember irregular words The fact that rules of inflection are remembered separately has been demonstrated in psychological studies with children Pinker describes one such experiment known as "the wug test"

  26. Pinker on Derivation When Pinker talks about derivation, he says the original word is called a root The new stem formed can often accept additional inflectional affixes For example, the stem "electricity" is based on the root "electric" (but note that the pronunciation of the "c" has changed) However, there are no general rules for creating new stems from roots This relates to meanings, pronunciations, and which roots can be converted to which stems Example: "complexity" is the state of being complex, but "electricity" is not the state of being electric (you wouldn't say that the electricity of a can opener makes it convenient) Pinker also points out that there are no such words as "academicity", "acrobaticity", or "alcoholicity" Actually, Pinker's third example is valid according to MS Word/PowerPoint and dictionary.com, with the apparent meaning, "alcoholic quality or strength"

  27. Listemes Pinker proposes that stems (including those produced through derivation) generally must be memorized separately; they are part of our mental lexicon As mentioned earlier, Pinker claims that in addition to stems, irregular forms of words also have to be memorized Pinker claims that humans also memorize names, some foreign words, acronyms, etc. Pinker refers to each unit of the memorized list as a listeme He cites one study that estimated the average high school graduate knows about 45,000 listemes First, the researchers took an unabridged dictionary, and removed all inflections and some derivations if the meanings could be deduced from the parts Then, they quizzed volunteers on random samples from the remaining list, and based on the percentage of words known, they came up with their estimate Pinker considers this an underestimate of the true number of memorized listemes, because it does not include proper names, numbers, foreign words, acronyms, etc. Pinker ultimately increases his estimate to 60,000 listemes; of course, from these listemes, many other inflected (and in some cases, derived) words can be created

  28. That's Amazing! Assuming that humans start to learn words at age 1, Pinker points out that young humans learn new words at the approximate rate of one every 90 waking minutes! To indicate how amazing this is, Pinker discusses a thought experiment from the logician and philosopher Quine He asks us to imagine a linguist studying the language of a newly discovered tribe; a rabbit scurries by, and a native shouts, "Gavagai!" The linguist will probably conclude that "gavagai" means "rabbit", but why? Pinker lists several logically plausible possibilities of what "gavagai" could mean (e.g., a rabbit's foot, a furry thing, an animal, a thing that scurries, scurrying in general, "anything that is either a rabbit or a Buick", etc.) Pinker argues that children learning words make sub-conscious assumptions about the type of things that words are likely to refer to, staying away from the very specific and the very general Ultimately, though, we do learn general words such as "animal" (as well as specific words, when appropriate) Pinker discusses experiments that show that children subconsciously assume that new words do not share meanings with common words that they already know as a partial explanation for this

  29. Ordering Suffixes Pinker also points out that there are rules that govern the order that suffixes must follow when multiple suffixes are added to a single stem Consider, for example, the stems "Darwinian" and "Darwinism", both based on the same root (in this case, a name), "Darwin" The word "Darwinian" might be defined as "pertaining to Darwin or his theories" or "relating to Darwin or his theories" The word "Darwinism" might be defined as "something related to Darwin or his theories" (you may find more specific definitions of this word, these days) From the stem "Darwinian", we can derive "Darwinianism" (it sounds good), and then using inflection, even "Darwinianisms" However, from "Darwinism", we cannot derive the word "Darwinismian" (such a word might mean "pertaining to Darwinism", but it sounds bad)

  30. Polysynthetic Languages I am basing the rest of this topic on material from chapter 4 of "The Atoms of Language", by Mark C. Baker Chapter 4 is titled "Baking a Polysynthetic Language" The chapter's title stems from an analogy; adding yeast to a recipe makes bread seem completely different from a cracker Polysynthetic languages are morphologically complex languages that include very long and complex words; the example discussed in the chapter is the language Mohawk An example of a long word in the language is: "Washakotya'tawitsherahetkvhta'se'" This would roughly translate to the English sentence: "He made the thing that one puts on one's body [e.g., a dress] ugly for her." A single word is used in Mohawk for any garment that covers the torso The apostrophes in the Mohawk represent a so-called glottal stop, a sound not found in English

  31. More about Mohawk In the chapter, we first learn what, at first, seem to be multiple, distinct differences between Mohawk and English; for example: In Mohawk, full subjects and objects can be left out of sentences (whereas in English, the subject must be expressed, and objects must be expressed if the main verb requires them) If full subjects and objects are expressed in Mohawk (which is optional), they can be placed either before or after the verb (i.e., at the start or end of the sentence) We soon learn that all verbs in Mohawk must include a reference to the subject of the verb as an attached morpheme The verb must also include references to the object and the indirect object, if they exist The reference to the object or indirect object can be either an attached pronoun or a full noun-phrase; the reference to the subject is always a pronoun Baker points out that objects must be attached to the verb before subjects are attached Baker claims that all languages follow "the verb-object constraint", which requires that verbs must combine with objects before subjects (either at a single-word or phrase level)

  32. Noun Incorporation The ability to combine nouns with verbs in Mohawk is called noun incorporation Although English does not have this, it does allow compound words (e.g., "doghouse" or "dishwasher") Related to the fact that full subjects cannot combine with verbs in Mohawk, Baker points out that English has a similar phenomenon When a noun representing an object combines with a noun that has been formed from a verb, it is clear that the first noun is an object, not a subject; for example: When someone hears "turkey-strangler" or "dinosaur-eater", they will think of someone who strangles turkeys or eats dinosaurs (not turkeys that strangle or dinosaurs that eat) The term "bread-cutting" sounds fine, but the term "knife-cutting" sounds odd I tried to think of an exception to this, and the best I came up with was "vacuum cleaner"; still, this is arguably two words, or it could just be an odd exception to the general rule

  33. Dislocation Recall that the verb in Mohawk must include a reference to the subject, and it must be a pronoun Also recall that the full subject in Mohawk (if expressed separately from the verb) can come before or after the verb Baker claims that the reason that fully expressed subjects and objects can come before or after the verb in Mohawk is analogous to dislocation in English In English, if a pronoun representing a subject or object has already been expressed, the full noun phrase can be repeated for emphasis This repetition can occur either at the start or end of the sentence Examples: "Sak, he really liked that dress." "He really liked that dress, Sak."

  34. Baker's Notion of Parameters Ultimately, Baker concludes that all the differences between Mohawk and English discussed in the chapter are the result of a single parameter that he calls "the polysynthesis parameter" Baker's definition of this parameter: "Verbs must include some expression of each of the main participants in the event described by the verb (the subject, object, and indirect object)." At the end of the chapter, Baker points out that there are other polysynthetic languages spoken around the world None are spoken among huge populations, and the reason for that is not clear (Baker believes the opposite could have been true) Comparing the peoples that do speak polysynthetic languages, there are no clear historical, cultural, or climate-based features that they share in common In general, throughout the book, Baker shows that a language's grammar often does not share much in common with grammars of historically or culturally related languages; the opposite is true of vocabulary Also, sometimes completely unrelated languages (in terms of culture and history) share very similar grammars We'll discuss Baker's theories about languages and parameters in more detail during the second part of the course; his theories are controversial, but I find them interesting
