Lexical Analysis
Regular Expressions (REs) play a crucial role in describing regular languages, with the ability to convert any RE into a Deterministic Finite Automaton (DFA). This conversion enables the automation of lexical analysis, a fundamental aspect of language processing. Explore the implementation of REs through finite automata, distinguishing between NFAs and DFAs. Dive into the world of automata with examples of simple and complex automatons, illustrating the concept of state transitions and configurations. Delve deeper into recognizing patterns using examples like recognizing r0 through r31, showcasing the power of automata in language processing.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Lexical Analysis NFA and DFA
Note: Lexical Analysis: Regular Expressions (REs) are formulae to describe a (regular) language. Every RE can be converted to a Deterministic Finite Automaton (DFA). DFAs can automate the construction of lexical analysers.
Implementing Regular Expressions Regular expressions can be implemented using finite automata. There are two main kinds of finite automata: NFAs (nondeterministic finite automata), DFAs (deterministic finite automata), Automata are best explained by example...
A Simple Automaton A,B,C,...,Z " " start
A Simple Automaton A,B,C,...,Z " " start Each circle is a state o f the auto mato n. T he auto mato n' s configuration is determined by what state(s) it is in.
A Simple Automaton A,B,C,...,Z " " start These arrows are called transitions. changes which state( s) by following transitions. The automaton it is in
A More Complex Automaton 1 0 0 start 0, 1 1 0 1
An Even More Complex Automaton a, b c a, c b start b, c a These are called -transitions. t ransit ions are f ollowed aut omatically and without consuming any input. These
An Example (recognise r0 through r31) Register r ((0|1|2) (Digit| ) | (4|5|6|7|8|9) | (3|30|31)) 17-Feb-25 COMP36512 Lecture 4 9
An Example (recognise r0 through r31) Register r ((0|1|2) (Digit| ) | (4|5|6|7|8|9) | (3|30|31)) State r 0,1 2 3 4,5, ,9 0 1 - - - - 1 - 2 2 5 4 2(final) - 3 3 3 3 3(final) - - - - - 4(final) - - - - - 5(final) - 6 - - - 6(final) - - - - - 17-Feb-25 COMP36512 Lecture 4 10
An Example (recognise r0 through r31) Register r ((0|1|2) (Digit| ) | (4|5|6|7|8|9) | (3|30|31)) digit S2 S3 0|1|2 r 3 0|1 S0 S1 S5 S6 4|5|6|7|8|9 S4 Different (bigger) transition table. Our Deterministic Finite Automaton (DFA) recognises only r0 through r31. State r 0,1 2 3 4,5, ,9 0 1 - - - - 1 - 2 2 5 4 2(final) - 3 3 3 3 3(final) - - - - - 4(final) - - - - - 5(final) - 6 - - - 6(final) - - - - - 17-Feb-25 COMP36512 Lecture 4 11
Non-deterministic Finite Automata What about a RE such as (a | b)*abb? a | b a b b S0 S1 S2 S3 S4 This is a Non-deterministic Finite Automaton (NFA): S0 has a transition on ; S1 has two transitions on a (not possible for a DFA). A DFA is a special case of an NFA: for each state and each transition there is at most one rule. A DFA can be simulated with an NFA (obvious!) A NFA can be simulated with a DFA (less obvious). Simulate sets of possible states. Why study NFAs? DFAs can lead to faster recognisers than NFAs but may be much bigger. Converting a RE into an NFA is more direct. 17-Feb-25 COMP36512 Lecture 4 12
Automatic Lexical Analyser Construction To convert a specification into code: Write down the RE for the input language. Convert the RE to a NFA (Thompson s construction) Build the DFA that simulates the NFA (subset construction) Shrink the DFA (Hopcroft s algorithm) (for the curious: there is a full cycle - DFA to RE construction is all pairs, all paths) 17-Feb-25 COMP36512 Lecture 4 13
Automatic Lexical Analyser Construction Lexical analyser generators: lex or flex work along these lines. Algorithms are well-known and understood. Key issue is the interface to parser. 17-Feb-25 COMP36512 Lecture 4 14
RE to NFA using Thompsons construction Key idea (Ken Thompson; CACM, 1968): NFA pattern for each symbol and/or operator: join them in precedence order. a a b S0 NFA for a S1 S0 S1 S2 S3 NFA for ab a S1 S2 S5 S0 S1 S2 S3 S0 a b S3 S4 NFA for a* NFA for a | b 17-Feb-25 COMP36512 Lecture 4 15
Example: Construct the NFA of a (b|c)* a b c First: NFAs for a, b, c S0 S1 S0 S1 S0 S1 17-Feb-25 COMP36512 Lecture 4 16
Example: Construct the NFA of a (b|c)* a b c First: NFAs for a, b, c S0 S1 S0 S1 S0 S1 b b S1 S2 S2 S3 S5 S6 S7 S0 S0 S1 c c S4 S5 S3 S4 Second: NFA for b|c Third: NFA for (b|c)* 17-Feb-25 COMP36512 Lecture 4 17
Example: Construct the NFA of a (b|c)* a b c First: NFAs for a, b, c S0 S1 S0 S1 S0 S1 b b S1 S2 S2 S3 S5 S6 S7 S0 S0 S1 c c S4 S5 S3 S4 Second: NFA for b|c Third: NFA for (b|c)* Fourth: NFA for a(b|c)* b S4 S5 a S8 S9 S0 S1 S2 S3 c S6 S7 17-Feb-25 COMP36512 Lecture 4 18
Example: Construct the NFA of a (b|c)* a b c First: NFAs for a, b, c S0 S1 S0 S1 S0 S1 b b S1 S2 S2 S3 S5 S6 S7 S0 S0 S1 c c S4 S5 S3 S4 Second: NFA for b|c Third: NFA for (b|c)* Of course, a human would design a simpler one But, we can automate production of the complex one... Fourth: NFA for a(b|c)* b S4 S5 a S8 S9 S0 S1 S2 S3 c b | c S6 S7 a S0 S1 17-Feb-25 COMP36512 Lecture 4 19
NFA to DFA: two key functions move(si,a): the (union of the) set of states to which there is a transition on input symbol a from state si -closure(si): the (union of the) set of states reachable by from si. Example (see the diagram below): -closure(3)={3,4,7}; -closure({3,10})={3,4,7,10}; move( -closure({3,10}),a)=8; 10 a 4 7 8 3 The Algorithm: start with the -closure of s0 from NFA. Do for each unmarked state until there are no unmarked states: for each symbol take their -closure(move(state,symbol)) 17-Feb-25 COMP36512 Lecture 4 20
NFA to DFA with subset construction Initially, -closure is the only state in Dstates and it is unmarked. while there is an unmarked state T in Dstates mark T for each input symbol a U:= -closure(move(T,a)) if U is not in Dstates then add U as unmarked to Dstates Dtable[T,a]:=U Dstates (set of states for DFA) and Dtable form the DFA. Each state of DFA corresponds to a set of NFA states that NFA could be in after reading some sequences of input symbols. This is a fixed-point computation. It sounds more complex than it actually is! 17-Feb-25 COMP36512 Lecture 4 21
Example: NFA for (a | b)*abb a 2 3 a b b 0 1 6 7 8 9 10 b 4 5 A= -closure(0)={0,1,2,4,7} for each input symbol (that is, a and b): B= -closure(move(A,a))= -closure({3,8})={1,2,3,4,6,7,8} C= -closure(move(A,b))= -closure({5})={1,2,4,5,6,7} Dtable[A,a]=B; Dtable[A,b]=C B and C are unmarked. Repeating the above we end up with: C={1,2,4,5,6,7}; D={1,2,4,5,6,7,9}; E={1,2,4,5,6,7,10}; and Dtable[B,a]=B; Dtable[B,b]=D; Dtable[C,a]=B; Dtable[C,b]=C; Dtable[D,a]=B; Dtable[D,b]=E; Dtable[E,a]=B; Dtable[E,b]=C; no more unmarked sets at this point! 17-Feb-25 COMP36512 Lecture 4 22
Result of applying subset construction Transition table: state a A B B B C B D B E(final) B b C D C E C b b C b a A D E b a b B a a a 17-Feb-25 COMP36512 Lecture 4 23
Another NFA version of the same RE a|b a b b N0 N1 N2 N3 N4 Apply the subset construction algorithm: -closure(move(s,a)) N1,N2 N1,N2 N1,N2 N1,N2 N1,N2 -closure(move(s,b)) N1 N1,N3 N1 N1,N4 N1 Iteration 0 1 State A B C D E Contains N0,N1 N1,N2 N1 N1,N3 N1,N4 2 3 Note: iteration 3 adds nothing new, so the algorithm stops. state E contains N4 (final state) 17-Feb-25 COMP36512 Lecture 4 24
Lexical Analyser from Regular Expressions We presented algorithms to construct a DFA from a RE. The DFA is not necessarily the smallest possible. Using an (automatically generated) transition table and the standard code skeleton we can build a lexical analyser from regular expressions automatically. But, the size of the table can be large...
Challenges in Scanning How do we determine which lexemes are associated with each token? When there are multiple ways we could scan the input, how do we know which one to pick? How do we address these concerns efficiently?
Lexing Ambiguities for [A-Za-z_][A-Za-z0-9_]* T_For T_Identifier
Conflict Resolution Assume all tokens are specified as regular expressions. Algorithm: Left-to-right scan. Tiebreaking rule one: Maximal munch. Always match the longest possible prefix of the remaining text.
More Tiebreaking When two regular expressions apply, choose the one with the greater priority. Simple priority system: pick the rule that was defined first.
One Last Detail... We know what to do if multiple rules match. What if nothing matches? Trick: Add a catch-all rule that matches any character and reports an error.
Challenges in Scanning How do we determine which lexemes are associated with each token? When there are multiple ways we could scan the input, how do we know which one to pick? How do we address these concerns efficiently?
DFAs The automata we've seen so far have all been NFAs. A DFA is like an NFA, but with tighter restrictions: Every state must have exactly one transition defined for every letter. -moves are not allowed.
Challenges in Scanning How do we determine which lexemes are associated with each token? When there are multiple ways we could scan the input, how do we know which one to pick? How do we address these concerns efficiently?
Performance Concerns The NFA-to-DFA construction can introduce exponentially many states. Time/memory tradeoff: Low-memory NFA has higher scan time. High-memory DFA has lower scan time. Could use a hybrid approach by simplifying NFA before generating code.