New Pattern Matching Algorithms for Network Security Applications by Liu Yang

New Pattern Matching Algorithms for

Network Security Applications

Department of Computer Science

Rutgers University

April 4th, 2013

Intrusion Detection Systems (IDS)

Intrusion detection

Host-based

Network-based

Anomaly-based

Signature-based

…

“

”

…

Example signature

This is an example signature from Snort, an network-based intrusion

detection system (NIDS)

(statistics …)

patterns

Network-based Intrusion Detection Systems

Network traffic

Alerts

NIDS

Network intrusion detection systems (NIDS) employ

regular expressions to represent attack signatures.

…

…

Pattern matching: detecting malicious traffic

Ideal of Pattern Matching

•

Time efficient

–

fast to keep up with network speed, e.g., Gbps

•

Space efficient

–

compact to fit into main memory

The Reality: Time-space Tradeoff

•

Deterministic Finite Automata (DFAs)

–

Fast in operation

–

Consuming large space

•

Nondeterministic Finite Automata (NFAs)

–

Space efficient

–

Slow in operation

•

Recursive backtracking (implemented by PCRE, Java,

etc)

–

Fast in general

–

Extremely slow for certain types of patterns

The Reality: Time-space Tradeoff

Space

Time

Ideal

DFA (deterministic

finite automaton)

NFA (non-deterministic

finite automaton)

Backtracking (under algorithmic

complexity attacks)

Backtracking (with

benign patterns)

My contribution

Overview of My Thesis

…

“

”

…

…

“

”

…

Regular expressions

+submatch extraction

Regular expressions

+back references

Submatch-OBDD

[ANCS’12]

NFA-backref

[to submit]

Main Contribution

•

Algorithms for time and space efficient pattern

matching

–

NFA-OBDD

•

space efficient (60MB memory for 1500+ patterns)

•

1000x faster than NFAs

–

Submatch-OBDD:

•

space efficient

•

10x faster than PCRE and Google’s RE2

–

NFA-backref:

•

space efficient

•

resisting known algorithmic attacks (1000x faster than PCRE

for certain types of patterns)

Part I:

NFA-OBDD: A Time and Space

Efficient Data Structure for Regular

Expression Matching

Joint work with R. Karim, V.

Ganapathy, and R. Smith

RAID’10, COMNET’11

Finite Automata

•

Regular expressions and finite automata are

equally expressive

Regular

expressions

NFAs

DFAs

Why not DFA?

“

”

“

”

“

”

Picture courtesy :

[Smith et al.  Oakland’08]

Why not DFA? (cont.)

Pattern:

“.*1[0|1] {3}  ”

State

explosion



O(2^n)

State explosion may happen

The value of quantifier

 is up to 255 in Snort

Pattern Set Grows Fast

Snort rule set grows 7x in 8 years

Space-efficiency of NFAs

“

”

“

”

“

”

NFAs are Slow

•

NFA frontiers

 may contain multiple states

•

. A frontier set is a set of states where NFA can be

at any instant.

NFAs of Regular Expressions

Example: regex=“a*aa”

NFA Frontier Update: Multiple Lookups

regex=“a*aa”; input=“aaaa”

{1}

{1,2}

{1,2,3}

{1,2,3}

{1,2,3}

Frontier

Can We Make NFAs Faster?

{1}

{1,2}

{1,2,3}

{1,2,3}

{1,2,3}

Frontier

Idea: Update frontiers in ONE step

regex=“a*aa”; input=“aaaa”

NFA-OBDD: Main Idea

•

Represent and operate NFA frontiers

symbolically using Boolean functions

–

Update the frontiers in ONE step: using a single

Boolean formula

–

Use ordered binary decision diagrams (OBDDs) to

represent and operate Boolean formula

Transitions as Boolean Functions

egex=“a*aa”

Match Test using Boolean Functions

{1}

Λ

Λ

T(x,i,y)

(1

Λ

Λ

1 )

V (1

Λ

Λ

2 )

{1,2}

Λ

Λ

T(x,i,y)

(1

Λ

Λ

1)

V (1

Λ

Λ

2)

V (2

Λ

Λ

3)

{1,2,3}

Λ

Λ

T(x,i,y)

(1

Λ

Λ

1)

V (1

Λ

Λ

2)

V (2

Λ

Λ

3)

Input

symbol

Start

states

Transition

relation

Next states

Current

states

aaa

aa

…

aaa

NFA Operations using Boolean Functions

•

Frontier derivation: finding new frontiers after

processing one input symbol:

Next frontiers =

•

Checking acceptance:

Ordered Binary Decision Diagram (OBDD)

[Bryant 1986]

Experimental Toolchain

•

C++

and

CUDD package for OBDDs

Regular Expression

 Sets

•

Snort

HTTP

 signature set

–

1503 regular expressions from

 March 2007

–

regular expressions

from

 October 2009

•

Snort

FTP

 signature set

–

regular expressions

 from October 2009

•

Extracted regular expressions from

pcre

and

uricontent

 fields of signatures

Traffic Traces

•

HTTP traces

–

Rutgers datasets

•

33 traces, size ranges: 5.1MB –1.24 GB

•

One week period in Aug 2009

 from Web server of the CS

department at Rutgers

–

DARPA 1999 datasets (11.7GB)

•

FTP traces

–

2 FTP traces

–

Size: 19.4MB, 24.7 MB

–

Two weeks period in March 2010

 from FTP server of

the CS department at Rutgers

Experimental Results

•

For 1503 regexes from HTTP Signatures

1645x

9-26x

10x

Summary

•

NFA-OBDD is time and space efficient

–

Outperforms NFAs by three orders of magnitude,

retaining space efficiency of NFAs

–

Outperforms or competitive with the PCRE package

–

Competitive with variants of DFAs but drastically less

memory-intensive

Part II: Extension of NFA-OBDD to Model

Submatch Extraction

[ANCS’12]

Joint work with P. Manadhata, W. Horne, P.

Rao, and V. Ganapathy

Submatch Extraction

…

“

”

…

host address 128.6.60.45

resolved by 128.6.1.1

Submatch

extraction

Extract information of interest when finding a match

Submatch Tagging: Tagged NFAs

E = (a*)aa

Tagged NFA of “(a*)aa” with submatch tagging

Transition table T(x,i,y,t) of the tagged NFA

Match Test

RE=“(a*)aa”; Input = “aaaa”

{1}

{1,2}

{1,2,3}

{1,2,3}

{1,2,3}

{t

{t

{t

{t

Frontier

Submatch Extraction

{t

{t

{t

{t

{1}

{1,2}

{1,2,3}

{1,2,3}

{1,2,3}

Frontier

Any path from an accept state to a start state

generates a valid assignment of submatches.

$1=aa

Submatch-OBDD

•

Representing tagged NFAs using Boolean

functions

–

Updating frontiers using Boolean formula

–

Finding a submatch path using Boolean operations

•

Using OBDDs to manipulate Boolean functions

Boolean Representation of Submatch Extraction

Submatch extraction: the last consecutive sequence of

symbols that are assigned with same tags

A back traversal approach: starting from the last

input symbol.

Overview of Toolchain

Toolchain in C++, interfacing with the CUDD*

Experimental Datasets

•

Snort-2009

–

Patterns: 115 regexes with capturing groups from HTTP

rules

–

Traces: 1.2GB CS department network traffic; 1.3GB

Twitter traffic; 1MB synthetic trace

•

Snort-2012

–

Patterns: 403 regexes with capturing groups from HTTP

rules

–

Traces: 1.2GB CS department network traffic; 1.3GB

Twitter traffic; 1MB synthetic trace

•

Firewall-504

–

Patterns: 504 patterns from a commercial firewall F

–

Trace: 87MB of firewall logs (average line size 87 bytes)

Experimental Setup

•

Platform: Intel Core2 Duo E7500, Linux-2.6.3,

2GB RAM

•

Two configurations on pattern matching

–

Conf.S

•

patterns compiled individually

•

compiled pattern matched sequentially against input traces

–

Conf.C

•

patterns combined with UNION and compiled

•

combined pattern matched against input traces

Experimental Results: Snort-2009

Submatch-OBDD is one order of magnitude faster than RE2 and PCRE

10x

Summary

•

Submatch-OBDD: an extension of NFA-OBDD

to model submatch extraction

•

Feasibility study

–

Submatch-OBDD is one order of magnitude faster

than PCRE and Google’s RE2 when patterns are

combined

PART III: Efficient Matching of Patterns with

Back References

Joint work with V. Ganapathy

and P. Manadhata

Regexes Extended with Back References

•

Identifying repeated substrings within a string

•

Non-regular languages

sens

respons

)e

\1

ibility

Note:

\1

 denotes referencing the substring captured by the first capturing group

Example:

Existing Approach

•

Recursive backtracking (PCRE, etc.)

–

Fast in general

–

Can be extremely slow for certain patterns

(algorithmic complexity attacks)

PCRE fails to return

correct results when

n >= 25

My Approach: Relax + Constraint

•

Converting back-refs to conditional submatch

extraction

(a*)aa

\1

(a*)aa(

a*

), s.t. $1

=$2

$1 denotes a substring captured by the 1

st

 capturing

group, and $2 denotes a substring captured by the

nd

 capturing group

Example:

Representing Back-refs with Tagged NFAs

•

Example: (a*)aa(

a*

), s.t. $1

=$2

The tagged NFA constructed from (a*)aa(

a*

).

Labels t

 and t

 are used to tag transitions within

the 1

st

 and 2

nd

 capturing groups. The acceptance

condition is state 3 and $1 = $2.

Transitions of Tagged NFAs

•

Example (cont.):

Match Test

•

Frontier set

–

{(

state#, substr

, substr

, …)}

•

Frontier derivation

–

table lookup + action

•

Acceptance condition

–

 exist (

substr

substr

…), s.t.

 is an accept state

and

substr

=substr

Implementations

•

Two implementations

–

NFA-backref: an NFA-like C++ implementation

–

OBDD-backref: OBDD representation of NFA-backref

Experimental Datasets

•

Patho-01

–

regexes: (a?{n})a{n}\1

–

input strings: a

 (n from 5 to 30,

0% accept rate)

•

Patho-02

–

10 pathological regexes from Snort-2009

–

synthetic input strings (0% accept rate)

•

Benign-03

–

46 regexes with one back-ref from Snort-2012

–

Synthetic input strings (50% accept rate)

Experimental Results: Patho-02

Execution time (cycle/byte) of different implementations for 10

regexes revised from Snort-2009

 *Intel Core2 Duo E7500

2.93GHz

 Linux-2.6

2GB

RAM*

Experimental Results: Benign-03

Execution time (cycle/byte) of different implementations for sequentially

matching the 46 regexes from Snort 2012 with back references.

(a) benign trace

(b) pathological trace

Summary

•

NFA-backref: an efficient pattern matching

algorithm for back references

•

NFA-backref: resisting known algorithmic

complexity attacks (1000x faster than PCRE)

•

PCRE: 10x faster than NFA-backref for benign

patterns

Related Work

•

Multiple DFAs

[Yu

et al.

, ANCS’06]

•

XFAs

[Smith

et al.

, Oakland’08, SIGCOMM’08]

•

FA

[Kumar

et al.

, SIGCOMM’06]

•

Hybrid finite automata

[Becchi

et al.

, ANCS’08]

•

Multibyte speculative matching

[Luchaup

et al.

, RAID’09]

•

DFA-based Submatch extraction [

Horne

et al

., LATA’13

•

RE2 [

Cox, code.google.com/p/re2

•

TNFA [

Laurikari

et al.

, SPIRE’00

•

PCRE [

www.pcre.org

•

Many more – see my papers for details

Conclusion

•

New algorithms for time and space-efficient

pattern matching

–

NFA-OBDD: a time and space efficient data structure

for regular expressions

•

1000x faster than NFAs

–

Submatch-OBDD: an extension of NFA-OBDD to

model submatch extraction

•

10x faster than RE2 and PCRE for combined patterns

–

NFA-backref: an NFA-based algorithm for patterns

with back references

•

1000x faster than PCRE for certain patterns

•

10x slower than PCRE for benign patterns

Acknowledgment

•

•

•

•

…

•

•

Future Directions

•

Hardware Implementation

–

NFA-OBDD

–

Submatch-OBDD

–

NFA-Backref

•

Parallel pattern matching

–

Multithreading using GPUs

–

Multithreading using multi-core processors

–

Speculative NFA-based pattern matching

Other Contributions

•

Enhancing Users’ Comprehension of Android

Permissions

SPSM’12

•

Enhancing Mobile Malware Detection with Social

Collaboration

Socialcom’12

•

Quantifying Security in Preference-based

Authentication

DIM’08

•

Love and Authentication

CHI’08

•

Discount Anonymous On-demand Routing for

Mobile Ad hoc Networks

SecureComm’06

Slide Note

Good morning everybody. Welcome to attend my thesis defense. Today I am going to present “New Pattern Matching Algorithms for Network Security Applications”.

Embed Share

Download

Discusses new pattern matching algorithms for network security applications, focusing on intrusion detection systems (IDS) and the use of signatures and regular expressions to detect malicious patterns in network traffic. Explores the ideal and reality of pattern matching, time-space tradeoffs, and different types of patterns in the context of network security. Provides an overview of Liu Yang's thesis, which includes regular expressions, NFA-OBDD, submatch extraction, back references, and more.

zahr_450 Follow

Uploaded on Sep 23, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

New Pattern Matching Algorithms for Network Security Applications Liu Yang Department of Computer Science Rutgers University Liu Yang April 4th, 2013

Intrusion Detection Systems (IDS) Intrusion detection Host-based Network-based Anomaly-based (statistics ) Signature-based (using patterns to describe malicious traffic) Example signature1: alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS ; pcre: /username=[^&\x3b\r\n]{255}/si ; This is an example signature from Snort, an network-based intrusion detection system (NIDS) 2 Liu Yang

Network-based Intrusion Detection Systems Pattern matching: detecting malicious traffic = { /.*evil.*/} patterns Network traffic NIDS ..evil.. innocent Alerts Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures. 3 Liu Yang

Ideal of Pattern Matching Time efficient fast to keep up with network speed, e.g., Gbps Space efficient compact to fit into main memory 4 Liu Yang

The Reality: Time-space Tradeoff Deterministic Finite Automata (DFAs) Fast in operation Consuming large space Nondeterministic Finite Automata (NFAs) Space efficient Slow in operation Recursive backtracking (implemented by PCRE, Java, etc) Fast in general Extremely slow for certain types of patterns 5 Liu Yang

The Reality: Time-space Tradeoff Backtracking (under algorithmic complexity attacks) NFA (non-deterministic finite automaton) Time My contribution Backtracking (with benign patterns) DFA (deterministic finite automaton) Ideal Space 6 Liu Yang

Overview of My Thesis Three types of patterns Regular expressions .*<embed[^>]*javascript ^file\x3a\x2f\x2f[^\n]{400} NFA-OBDD [RAID 10, COMNET 11] Regular expressions +submatch extraction Submatch-OBDD [ANCS 12] .*? address (\d+\.\d+\.\d+\.\d+), resolved by (\d+\.\d+\.\d+\.\d+) .*(NLSessionS[^=\s]*)\s*=\s*\x3 B.*\1\s*=[^\s\x3B] Regular expressions +back references NFA-backref [to submit] 7 Liu Yang

Main Contribution Algorithms for time and space efficient pattern matching NFA-OBDD space efficient (60MB memory for 1500+ patterns) 1000x faster than NFAs Submatch-OBDD: space efficient 10x faster than PCRE and Google s RE2 NFA-backref: space efficient resisting known algorithmic attacks (1000x faster than PCRE for certain types of patterns) 8 Liu Yang

Part I: NFA-OBDD: A Time and Space Efficient Data Structure for Regular Expression Matching Joint work with R. Karim, V. Ganapathy, and R. Smith [RAID 10, COMNET 11] 9 Liu Yang

Finite Automata Regular expressions and finite automata are equally expressive Regular expressions NFAs DFAs 10 Liu Yang

Why not DFA? Combining DFAs: Multiplicative increase in number of states .*ab.*cd .*ef.*gh Picture courtesy : [Smith et al. Oakland 08] .*ab.*cd | .*ef.*gh 11 Liu Yang

Why not DFA? (cont.) State explosion may happen NFA Pattern: .*1[0|1] {3} DFA State explosion n O(2^n) The value of quantifier n is up to 255 in Snort 12 Liu Yang

Pattern Set Grows Fast 30000 25000 20000 15000 10000 5000 0 2005 2006 2007 2008 2009 2010 2011 2012 Snort rule set grows 7x in 8 years 13 Liu Yang

Space-efficiency of NFAs Combining NFAs: Additive increase in number of states M N .*ab.*cd .*ef.*gh .*ab.*cd | .*ef.*gh 14 Liu Yang

NFAs are Slow NFA frontiers1 may contain multiple states Frontier update may require multiple transition table lookups 1. A frontier set is a set of states where NFA can be at any instant. 15 Liu Yang

NFAs of Regular Expressions Example: regex= a*aa a a a 1 2 3 Current state (x) 1 1 2 Input symbol (i) a a a Next state (y) 1 2 3 Transition table T(x,i,y) 16 Liu Yang

NFA Frontier Update: Multiple Lookups regex= a*aa ; input= aaaa 1 2 3 Accept aaaa aaaa aaaa aaaa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} 17 Liu Yang

Can We Make NFAs Faster? regex= a*aa ; input= aaaa 1 2 3 Accept aaaa aaaa aaaa aaaa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} Idea: Update frontiers in ONE step 18 Liu Yang

NFA-OBDD: Main Idea Represent and operate NFA frontiers symbolically using Boolean functions Update the frontiers in ONE step: using a single Boolean formula Use ordered binary decision diagrams (OBDDs) to represent and operate Boolean formula 19 Liu Yang

Transitions as Boolean Functions regex= a*aa Current state (x) 1 1 2 Input symbol (i) a a a Next state (y) 1 2 3 (1 a 1) V (1 a 2) V (2 a 3) T(x,i,y) = 20 Liu Yang

Match Test using Boolean Functions (1 a 1 ) V (1 a 2 ) aaaa {1} a T(x,i,y) Input symbol Start states Transition relation Next states aaaa (1 a 1) V (1 a 2) V (2 a 3) {1,2} a T(x,i,y) Current states aaaa (1 a 1) V (1 a 2) V (2 a 3) {1,2,3} a T(x,i,y) Accept 21 Liu Yang

NFA Operations using Boolean Functions Frontier derivation: finding new frontiers after processing one input symbol: Next frontiers = Checking acceptance: ( ( ) Map InputSymbo l i , y x x i ( ) Frontier x ( , , i x )) TransFunct ion y ( ( ) ( )) SAT SetOfAccep tStates x Frontier x 22 Liu Yang

Ordered Binary Decision Diagram (OBDD) [Bryant 1986] OBDDs: Compact representation of Boolean functions x1 0 0 1 x2 1 1 0 x3 1 1 1 x4 0 1 1 x5 1 0 1 x6 1 0 0 F(x) 1 1 1 = ( ) ( ) F x x x x x x x 1 2 x 3 x 4 x 5 x 6 x ( ) x 1 2 3 4 5 6 ( ) x x x x x x 1 2 3 4 5 6 23 Liu Yang

Experimental Toolchain C++ and CUDD package for OBDDs 24 Liu Yang

Regular Expression Sets Snort HTTP signature set 1503 regular expressions from March 2007 2612 regular expressions from October 2009 Snort FTP signature set 98 regular expressions from October 2009 Extracted regular expressions from pcre and uricontent fields of signatures 25 Liu Yang

Traffic Traces HTTP traces Rutgers datasets 33 traces, size ranges: 5.1MB 1.24 GB One week period in Aug 2009 from Web server of the CS department at Rutgers DARPA 1999 datasets (11.7GB) FTP traces 2 FTP traces Size: 19.4MB, 24.7 MB Two weeks period in March 2010 from FTP server of the CS department at Rutgers 26 Liu Yang

Experimental Results For 1503 regexes from HTTP Signatures 10x 1645x 9-26x 27 Liu Yang *Intel Core2 Duo E7500, 2.93GHz; Linux-2.6; 2GB RAM*

Summary NFA-OBDD is time and space efficient Outperforms NFAs by three orders of magnitude, retaining space efficiency of NFAs Outperforms or competitive with the PCRE package Competitive with variants of DFAs but drastically less memory-intensive 28 Liu Yang

Part II: Extension of NFA-OBDD to Model Submatch Extraction [ANCS 12] Joint work with P. Manadhata, W. Horne, P. Rao, and V. Ganapathy 29 Liu Yang

Submatch Extraction Extract information of interest when finding a match host address 128.6.60.45 resolved by 128.6.1.1 Submatch extraction .*? address (\d+\.\d+\.\d+\.\d+), resolved by (\d+\.\d+\.\d+\.\d+) $1 = 128.6.60.45 $2 = 128.6.1.1 30 Liu Yang

Submatch Tagging: Tagged NFAs E = (a*)aa Tag(E) = (a*)t aa 1 Tagged NFA of (a*)aa with submatch tagging t1 a/t1 a a 1 2 3 Current state (x) 1 1 2 Input symbol (i) Next state (y) a a a Output tags (t) {t1} {} {} 1 2 3 Transition table T(x,i,y,t) of the tagged NFA 31 Liu Yang

Match Test RE= (a*)aa ; Input = aaaa {t1} {t1} {t1} {t1} 1 2 3 Accept aaaa aaaa aaaa aaaa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} 32 Liu Yang

Submatch Extraction {t1} {t1} {t1} {t1} 1 2 3 accept aaaa aaaa aaaa aaaa $1=aa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} Any path from an accept state to a start state generates a valid assignment of submatches. 33 Liu Yang

Submatch-OBDD Representing tagged NFAs using Boolean functions Updating frontiers using Boolean formula Finding a submatch path using Boolean operations Using OBDDs to manipulate Boolean functions 34 Liu Yang

Boolean Representation of Submatch Extraction A back traversal approach: starting from the last input symbol. = OneRrevers eTransitio n ( ( ) PickOne CurrentSta te y ( ) InputSymbo l i ( , , OneRrevers , )) IntermTran = sitions x i y t ( ( )) previousSt ate = Map eTransitio n , , x y i y t ( ) OutputTag Submatch extraction: the last consecutive sequence of symbols that are assigned with same tags OneRrevers eTransitio n , , i x y 35 Liu Yang

Overview of Toolchain Toolchain in C++, interfacing with the CUDD* input stream Tagged NFAs OBDDs regexes with capturing groups pattern matching re2tnfa tnfa2obdd rejected matched submatches $1 = 36 Liu Yang

Experimental Datasets Snort-2009 Patterns: 115 regexes with capturing groups from HTTP rules Traces: 1.2GB CS department network traffic; 1.3GB Twitter traffic; 1MB synthetic trace Snort-2012 Patterns: 403 regexes with capturing groups from HTTP rules Traces: 1.2GB CS department network traffic; 1.3GB Twitter traffic; 1MB synthetic trace Firewall-504 Patterns: 504 patterns from a commercial firewall F Trace: 87MB of firewall logs (average line size 87 bytes) 37 Liu Yang

Experimental Setup Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM Two configurations on pattern matching Conf.S patterns compiled individually compiled pattern matched sequentially against input traces Conf.C patterns combined with UNION and compiled combined pattern matched against input traces 38 Liu Yang

Experimental Results: Snort-2009 Submatch-OBDD is one order of magnitude faster than RE2 and PCRE 10000000 execution time (cycle/byte) 1000000 100000 10x 10000 1000 100 10 1 Conf.S Conf.C RE2 PCRE Submatch-OBDD Execution time (cycle/byte) of different implementations Memory consumption: RE2 (7.3MB), PCRE (1.2MB), Submatch-OBDD (9.4MB) 39 Liu Yang

Summary Submatch-OBDD: an extension of NFA-OBDD to model submatch extraction Feasibility study Submatch-OBDD is one order of magnitude faster than PCRE and Google s RE2 when patterns are combined 40 Liu Yang

PART III: Efficient Matching of Patterns with Back References Joint work with V. Ganapathy and P. Manadhata 41 Liu Yang

Regexes Extended with Back References Identifying repeated substrings within a string Non-regular languages Example: sense sensibility response responsibility sense responsibility response sensibility (sens|respons)e \1ibility Note: \1 denotes referencing the substring captured by the first capturing group An example from Snort rule set: /.*javascript.+function\s+(\w+)\s*$\w*$\s*\{.+location=[^}]+\1.+\}/sim 42 Liu Yang

Existing Approach Recursive backtracking (PCRE, etc.) Fast in general Can be extremely slow for certain patterns (algorithmic complexity attacks) 0.14 Throughput (MB/sec) 0.12 0.1 PCRE fails to return correct results when n >= 25 0.08 0.06 0.04 Nearly zero throughput 0.02 0 5 10 15 n 20 25 30 Throughput of PCRE when matching (a?{n})a{n}\1 with an 43 Liu Yang

My Approach: Relax + Constraint Converting back-refs to conditional submatch extraction constraint Example: (a*)aa\1 (a*)aa(a*), s.t. $1=$2 $1 denotes a substring captured by the 1st capturing group, and $2 denotes a substring captured by the 2nd capturing group 44 Liu Yang

Representing Back-refs with Tagged NFAs Example: (a*)aa(a*), s.t. $1=$2 a/t1 a/t2 a a 1 2 3 The tagged NFA constructed from (a*)aa(a*). Labels t1 and t2 are used to tag transitions within the 1st and 2nd capturing groups. The acceptance condition is state 3 and $1 = $2. 45 Liu Yang

Transitions of Tagged NFAs Example (cont.): Current state (x) 1 1 2 3 Input symbol (i) Next state (y) a a a a Action New(t1) or update(t1) Carry-over(t1) Carry-over(t1) New(t2) or Update(t2) 1 2 3 3 New(): create a new captured substring Update(): update a captured substring Carry-over(): copy around the substrings captured from state to state 46 Liu Yang

Match Test Frontier set {(state#, substr1, substr2, )} Frontier derivation table lookup + action Acceptance condition exist (s, substr1, substr2, ), s.t. s is an accept state and substr1=substr2 47 Liu Yang

Implementations Two implementations NFA-backref: an NFA-like C++ implementation OBDD-backref: OBDD representation of NFA-backref input stream patterns with back-refs tagged NFAs with constraint match test re2tnfa matched or not 48 Liu Yang

Experimental Datasets Patho-01 regexes: (a?{n})a{n}\1 input strings: an (n from 5 to 30, 100% accept rate) Patho-02 10 pathological regexes from Snort-2009 synthetic input strings (0% accept rate) Benign-03 46 regexes with one back-ref from Snort-2012 Synthetic input strings (50% accept rate) 49 Liu Yang

Experimental Results: Patho-02 NFA-back-ref is >= 3 orders of magnitude faster than PCRE 10000000 Exec-time (cycle/byte) 1000000 100000 10000 1000 100 10 1 1 2 3 4 5 6 7 8 9 10 regex # PCRE OBDD-backref NFA-backref Execution time (cycle/byte) of different implementations for 10 regexes revised from Snort-2009 *Intel Core2 Duo E7500, 2.93GHz; Linux-2.6; 2GB RAM* 50 Liu Yang

New Pattern Matching Algorithms for Network Security Applications by Liu Yang

Download Presentation

Presentation Transcript

Related

More Related Content