Lexical Analysis Off-the-shelf Tools: JFlex

Lexical Analysis
Off-the-shelf tools: Jflex
2
Review
 
Lexical specification: regular expressions
Scanner generation using automata theory + extra book-
keeping
3
Scanning Scheme programs
(define foo
 
(lambda (x) (+ x 14)))
 
1: L_PAREN
1: SYMBOL(define)
1: SYMBOL(foo)
2: L_PAREN
2: SYMBOL(lambda)
2: L_PAREN
2: SYMBOL(x)
2: R_PAREN
2: L_PAREN
2: PLUS
2: SYMBOL(x)
2: INT(14)
...
Scheme program text
 
tokens
LINE: ID(VALUE)
4
Scanner implementation
 
What are the outputs on the following inputs:
ifelse
if a
.75
89
89.94
5
5
Hand-written lexer...
Token nextToken()
{
char c = getchar();
switch (c){
 
case `;`:  return Token.SEMI;
 
case `+`:  c = getchar() ;
               switch (c) {
                         case `+': return INCREMENT ;
                         case '=’  return PLUSEQ;
                         default:  ungetc(c);
   
   return PLUS;
               }
 
case `<`:
 }
6
6
Why not?
A lot of work
Boundary (pathological) cases
Error prone
Hard to debug
Exhausting
Boring
Hard to reuse
….
7
7
Scanner generator: history
 
LEX
Lexical analyzer generator, written by Lesk and Schmidt at Bell Labs in 1975 for the UNIX
operating system;
It now exists for many operating systems;
LEX produces a scanner which is a C program;
LEX accepts regular expressions and allows actions (i.e., code to be executed) to be
associated with each regular expression.
JLex
Lex that generates a scanner written in Java;
Itself is also implemented in Java.
There are many similar tools, for every programming language
8
8
Overall picture
Tokens
9
Lexical analysis with JFlex
 
JFlex – fast lexical analyzer generator
Recognizes lexical patterns in text
Breaks input character stream into tokens
Input: scanner specification file
Output: a lexical analyzer (scanner)
A Java program
JFlex
javac
Scheme.lex
Lexical 
analyzer
 
text
 
tokens
Lexer.java
10
JFlex spec. file
User code
Copied directly to Java file
JFlex directives
Define macros, state names
Lexical analysis rules
Optional state, regular expression, action
How to break input to tokens
Action when token matched
%%
%%
Possible source
of javac errors
down the road
DIGIT= [0-9]
LETTER= [a-zA-Z]
YYINITIAL
{LETTER}
({LETTER}|{DIGIT})*
11
Section 1: User code
package Scheme.Parser;
import Scheme.Parser.Symbol;
any scanner-helper Java code
12
Section 2: JFlex directives
 
Directives - control JFlex internals
%line 
switches line counting on
%char 
switches character counting on
%class class-name
 changes default name
%type token-class-name
%public 
Makes generated class public (package by default)
%function read-token-method
%scanerror exception-type-name
State definitions
%state state-name
Macro definitions
macro-name = regex
13
Regular expressions
14
14
Example macros
ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
NUMBER=[0-9]+
15
Section 3: 
Lexical analysis rules
 
Rule structure
[states] regexp {action as Java code}
regexp pattern - how to break input into tokens
Action invoked when pattern matched
Priority for rule matching 
longest
 string. This can be
either good or bad, depending on context.
 
/*
*
@Javadoc
*/
Class A{…
/*end
*/
 
Int a = 1000000000000
 
More than one match for same length – priority
for rule 
appearing first
!
Example: ‘if’ matches identifiers and the reserved word
Order leads to different automata
Important: rules given in a JFlex specification
should match all possible inputs!
16
17
17
Rules Examples
<YYINITIAL> {
DIGIT}+ 
{
  return new Token(Token.NUMBER, yytext(), yyline);
}
<YYINITIAL> "-" {
  return new Token(Token.MINUS, yytext(), yyline);
}
<YYINITIAL> [a-zA-Z] ([a-zA-Z0-9]) * {
  return new Token(Token.ID, yytext(), yyline);
}
18
Action body
 
Java code
Can use special methods and vars
yytext()
– the actual token text
yyline
 
(when enabled)
Scanner state transition
yybegin(state-name)
– tells JFlex to jump to the given state
YYINITIAL
 – name given by JFlex to initial state
19
19
Rules – State
<YYINITIAL> "//" { 
yybegin
(COMMENTS); }
<COMMENTS> [^\n] { }
<COMMENTS> [\n] { 
yybegin
(YYINITIAL); }
20
import Token;
%%
%line
%char
%state COMMENTS
ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
WHITE_SPACE=([\ \n\r\t\f])+
NEWLINE=\n
%{
  private int lineCounter = 0;
%}
%eofval{
  
 
System.out.println("line number=" + lineCounter);
  
 
return new Symbol(sym.EOF);
%eofval}
Example
21
%%
<YYINITIAL> {NEWLINE} { lineCounter++; }
<YYINITIAL> {NUMBER} {
  return new Token(Token.NUMBER, yytext(), yyline));
}
<YYINITIAL> {WHITE_SPACE} { }
<YYINITIAL> "+" {
  return new Token(Token.PLUS, yytext(), yyline);
}
<YYINITIAL> "-" {
  return new Token(Token.MINUS, yytext(), yyline);
}
<YYINITIAL> "*" {
  return new Token(Token.TIMES, yytext(), yyline);
}
...
<YYINITIAL> "//" { 
yybegin
(COMMENTS); }
<COMMENTS> [^\n] { }
<COMMENTS> [\n] { 
yybegin
(YYINITIAL); }
<YYINITIAL> . { return new Symbol(sym.error, null); }
22
Running
the
scanner
import java.io.*;
import Token;
public class Main {
  public static void main(String[] args) {
    Token currToken;
    try {
        FileReader txtFile = new FileReader(args[0]);
        Yylex scanner = new Yylex(txtFile);
        do {
            currToken = scanner.next_token();
            // do something with currToken
        } while (currToken.sym != Token.EOF);
    } catch (Exception e) {
        throw new RuntimeException(
  
"IO Error (brutal exit)” + e.toString());
    }
  }
}
(Just for
testing scanner
as stand-alone
program)
23
https://jflex.de/manual.html#Example
Additional Example
Slide Note
Embed
Share

In the realm of lexical analysis, off-the-shelf tools like JFlex play a crucial role. This involves utilizing tools for lexical specification, regular expressions, scanner generation, automata theory, and more. The process includes front-end and back-end compilation, semantic analysis, and executable code synthesis. Explore scanning schemes, scanner implementations, hand-written lexers, scanner generators, and the overall picture of lexical analysis with JFlex. Uncover the benefits of using tools versus hand-crafted solutions and dive into the history of lexical analyzers. Enhance your understanding of lexical patterns, token generation, and the Java programming language.

  • Lexical Analysis
  • JFlex
  • Scanner Generation
  • Automata Theory
  • Regular Expressions

Uploaded on Mar 01, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Lexical Analysis Off-the-shelf tools: Jflex

  2. Review Lexical specification: regular expressions Scanner generation using automata theory + extra book- keeping Compiler txt exe Frontend Semantic Backend Source Executable (analysis) Representation (synthesis) text code 2

  3. Scanning Scheme programs tokens LINE: ID(VALUE) Scheme program text 1: L_PAREN 1: SYMBOL(define) 1: SYMBOL(foo) 2: L_PAREN 2: SYMBOL(lambda) 2: L_PAREN 2: SYMBOL(x) 2: R_PAREN 2: L_PAREN 2: PLUS 2: SYMBOL(x) 2: INT(14) ... (define foo (lambda (x) (+ x 14))) 3

  4. Scanner implementation What are the outputs on the following inputs: ifelse if a .75 89 89.94 4

  5. Hand-written lexer... Token nextToken() { char c = getchar(); switch (c){ case `;`: return Token.SEMI; case `+`: c = getchar() ; switch (c) { case `+': return INCREMENT ; case '= return PLUSEQ; default: ungetc(c); return PLUS; } case `<`: } 5

  6. Why not? A lot of work Boundary (pathological) cases Error prone Hard to debug Exhausting Boring Hard to reuse . 6

  7. Scanner generator: history LEX Lexical analyzer generator, written by Lesk and Schmidt at Bell Labs in 1975 for the UNIX operating system; It now exists for many operating systems; LEX produces a scanner which is a C program; LEX accepts regular expressions and allows actions (i.e., code to be executed) to be associated with each regular expression. JLex Lex that generates a scanner written in Java; Itself is also implemented in Java. There are many similar tools, for every programming language 7

  8. Overall picture String stream Scanner generator Java scanner program NFA RE DFA Minimize DFA Simulate DFA Tokens 8

  9. Lexical analysis with JFlex JFlex fast lexical analyzer generator Recognizes lexical patterns in text Breaks input character stream into tokens Input: scanner specification file Output: a lexical analyzer (scanner) A Java program text Lexical analyzer Scheme.lex JFlex Lexer.java javac tokens 9

  10. JFlex spec. file Possible source of javac errors down the road User code Copied directly to Java file %% DIGIT= [0-9] LETTER= [a-zA-Z] JFlex directives Define macros, state names %% YYINITIAL Lexical analysis rules Optional state, regular expression, action How to break input to tokens Action when token matched {LETTER} ({LETTER}|{DIGIT})* 10

  11. Section 1: User code package Scheme.Parser; import Scheme.Parser.Symbol; any scanner-helper Java code 11

  12. Section 2: JFlex directives Directives - control JFlex internals %line switches line counting on %char switches character counting on %class class-name changes default name %type token-class-name %public Makes generated class public (package by default) %function read-token-method %scanerror exception-type-name State definitions %state state-name Macro definitions macro-name = regex 12

  13. Regular expressions r$ . (dot) "..." {name} * + ? (...) a|b match reg. exp. r at end of a line any character except the newline verbatim string macro expansion zero or more repetitions one or more repetitions zero or one repetitions grouping within regular expressions match a or b class of characters - any one character enclosed in brackets range of characters negated class any one not enclosed in brackets [...] a b [^ ] 13

  14. Example macros ALPHA=[A-Za-z_] DIGIT=[0-9] ALPHA_NUMERIC={ALPHA}|{DIGIT} IDENT={ALPHA}({ALPHA_NUMERIC})* NUMBER=({DIGIT})+ NUMBER=[0-9]+ 14

  15. Section 3: Lexical analysis rules Rule structure [states] regexp {action as Java code} regexp pattern - how to break input into tokens Action invoked when pattern matched Priority for rule matching longest string. This can be either good or bad, depending on context. /** @Javadoc */ Class A{ Int a = 1000000000000 /*end*/ 15

  16. More than one match for same length priority for rule appearing first! Example: if matches identifiers and the reserved word Order leads to different automata Important: rules given in a JFlex specification should match all possible inputs! 16

  17. Rules Examples <YYINITIAL> {DIGIT}+ { return new Token(Token.NUMBER, yytext(), yyline); } <YYINITIAL> "-" { return new Token(Token.MINUS, yytext(), yyline); } <YYINITIAL> [a-zA-Z] ([a-zA-Z0-9]) * { return new Token(Token.ID, yytext(), yyline); } 17

  18. Action body Java code Can use special methods and vars yytext() the actual token text yyline(when enabled) Scanner state transition yybegin(state-name) tells JFlex to jump to the given state YYINITIAL name given by JFlex to initial state 18

  19. Rules State <YYINITIAL> "//" { yybegin(COMMENTS); } <COMMENTS> [^\n] { } <COMMENTS> [\n] { yybegin(YYINITIAL); } YYINITIAL COMMENTS // ^\n \n 19

  20. import Token; %% %line %char %state COMMENTS Example ALPHA=[A-Za-z_] DIGIT=[0-9] ALPHA_NUMERIC={ALPHA}|{DIGIT} IDENT={ALPHA}({ALPHA_NUMERIC})* NUMBER=({DIGIT})+ WHITE_SPACE=([\ \n\r\t\f])+ NEWLINE=\n %{ private int lineCounter = 0; %} %eofval{ System.out.println("line number=" + lineCounter); return new Symbol(sym.EOF); %eofval} 20

  21. %% <YYINITIAL> {NEWLINE} { lineCounter++; } <YYINITIAL> {NUMBER} { return new Token(Token.NUMBER, yytext(), yyline)); } <YYINITIAL> {WHITE_SPACE} { } <YYINITIAL> "+" { return new Token(Token.PLUS, yytext(), yyline); } <YYINITIAL> "-" { return new Token(Token.MINUS, yytext(), yyline); } <YYINITIAL> "*" { return new Token(Token.TIMES, yytext(), yyline); } ... <YYINITIAL> "//" { yybegin(COMMENTS); } <COMMENTS> [^\n] { } <COMMENTS> [\n] { yybegin(YYINITIAL); } <YYINITIAL> . { return new Symbol(sym.error, null); } 21

  22. import java.io.*; import Token; Running the scanner public class Main { public static void main(String[] args) { Token currToken; try { FileReader txtFile = new FileReader(args[0]); Yylex scanner = new Yylex(txtFile); do { currToken = scanner.next_token(); // do something with currToken } while (currToken.sym != Token.EOF); } catch (Exception e) { throw new RuntimeException( "IO Error (brutal exit) + e.toString()); } } } (Just for testing scanner as stand-alone program) 22

  23. Additional Example https://jflex.de/manual.html#Example 23

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#