New Pattern Matching Algorithms for Network Security Applications by Liu Yang

 
New Pattern Matching Algorithms for
Network Security Applications
 
L
i
u
 
Y
a
n
g
Department of Computer Science
Rutgers University
 
April 4th, 2013
 
Intrusion Detection Systems (IDS)
 
2
 
Intrusion detection
 
Host-based
 
Network-based
 
Anomaly-based
 
Signature-based
 
(
u
s
i
n
g
 
p
a
t
t
e
r
n
s
 
t
o
 
d
e
s
c
r
i
b
e
m
a
l
i
c
i
o
u
s
 
t
r
a
f
f
i
c
)
 
a
l
e
r
t
 
t
c
p
 
$
E
X
T
E
R
N
A
L
_
N
E
T
 
a
n
y
 
-
>
 
$
H
T
T
P
_
S
E
R
V
E
R
S
 
;
p
c
r
e
:
/
u
s
e
r
n
a
m
e
=
[
^
&
\
x
3
b
\
r
\
n
]
{
2
5
5
}
/
s
i
;
 
 
Example signature
1
:
 
This is an example signature from Snort, an network-based intrusion
detection system (NIDS)
 
(statistics …)
.
.
e
v
i
l
.
.
patterns
Network-based Intrusion Detection Systems
3
Network traffic
Alerts
NIDS
Network intrusion detection systems (NIDS) employ
regular expressions to represent attack signatures.
=
 
{
 
/
.
*
e
v
i
l
.
*
/
}
i
n
n
o
c
e
n
t
Pattern matching: detecting malicious traffic
Ideal of Pattern Matching
 
Time efficient
fast to keep up with network speed, e.g., Gbps
Space efficient
compact to fit into main memory
4
The Reality: Time-space Tradeoff
5
 
Deterministic Finite Automata (DFAs)
Fast in operation
Consuming large space
Nondeterministic Finite Automata (NFAs)
Space efficient
Slow in operation
Recursive backtracking (implemented by PCRE, Java,
etc)
Fast in general
Extremely slow for certain types of patterns
The Reality: Time-space Tradeoff
6
Space
Time
 
Ideal
 
DFA (deterministic
finite automaton)
 
NFA (non-deterministic
finite automaton)
 
Backtracking (under algorithmic
complexity attacks)
 
Backtracking (with
benign patterns)
 
My contribution
Overview of My Thesis
7
.
*
?
 
a
d
d
r
e
s
s
 
(
\
d
+
\
.
\
d
+
\
.
\
d
+
\
.
\
d
+
)
,
r
e
s
o
l
v
e
d
 
b
y
 
(
\
d
+
\
.
\
d
+
\
.
\
d
+
\
.
\
d
+
)
.
*
(
N
L
S
e
s
s
i
o
n
S
[
^
=
\
s
]
*
)
\
s
*
=
\
s
*
\
x
3
B
.
*
\
1
\
s
*
=
[
^
\
s
\
x
3
B
]
Regular expressions
+submatch extraction
Regular expressions
+back references
 
Submatch-OBDD
[ANCS’12]
 
NFA-backref  
[to submit]
Main Contribution
 
Algorithms for time and space efficient pattern
matching
NFA-OBDD
space efficient (60MB memory for 1500+ patterns)
1000x faster than NFAs
Submatch-OBDD:
space efficient
10x faster than PCRE and Google’s RE2
NFA-backref:
space efficient
resisting known algorithmic attacks (1000x faster than PCRE
for certain types of patterns)
8
 
Part I: 
NFA-OBDD: A Time and Space
Efficient Data Structure for Regular
Expression Matching
 
9
 
Joint work with R. Karim, V.
Ganapathy, and R. Smith
[
RAID’10, COMNET’11
]
 
Finite Automata
 
Regular expressions and finite automata are
equally expressive
 
10
Regular
expressions
NFAs
DFAs
Why not DFA?
11
 
.
*
a
b
.
*
c
d
 
.
*
e
f
.
*
g
h
 
.
*
a
b
.
*
c
d
 
|
 
.
*
e
f
.
*
g
h
Picture courtesy : 
[Smith et al.  Oakland’08]
C
o
m
b
i
n
i
n
g
 
D
F
A
s
:
 
M
u
l
t
i
p
l
i
c
a
t
i
v
e
 
i
n
c
r
e
a
s
e
 
i
n
n
u
m
b
e
r
 
o
f
 
s
t
a
t
e
s
Why not DFA? (cont.)
12
 
Pattern:
“.*1[0|1] {3}  ”
 
N
F
A
 
D
F
A
State
explosion
n 
O(2^n)
State explosion may happen
The value of quantifier 
n
 is up to 255 in Snort
 
Pattern Set Grows Fast
 
13
Snort rule set grows 7x in 8 years
Space-efficiency of NFAs
14
 
M
 
N
.
*
a
b
.
*
c
d
.
*
e
f
.
*
g
h
.
*
a
b
.
*
c
d
 
|
 
.
*
e
f
.
*
g
h
C
o
m
b
i
n
i
n
g
 
N
F
A
s
:
 
A
d
d
i
t
i
v
e
 
i
n
c
r
e
a
s
e
 
i
n
n
u
m
b
e
r
 
o
f
 
s
t
a
t
e
s
 
NFAs are Slow
 
NFA frontiers
1
 may contain multiple states
F
r
o
n
t
i
e
r
 
u
p
d
a
t
e
 
m
a
y
 
r
e
q
u
i
r
e
 
m
u
l
t
i
p
l
e
 
t
r
a
n
s
i
t
i
o
n
t
a
b
l
e
 
l
o
o
k
u
p
s
 
15
 
1
. A frontier set is a set of states where NFA can be
at any instant.
NFAs of Regular Expressions
 
Example: regex=“a*aa”
 
T
r
a
n
s
i
t
i
o
n
 
t
a
b
l
e
 
T
(
x
,
i
,
y
)
16
NFA Frontier Update: Multiple Lookups
regex=“a*aa”; input=“aaaa”
1
2
3
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
{1}
 
{1,2}
 
{1,2,3}
 
{1,2,3}
 
{1,2,3}
 
A
c
c
e
p
t
Frontier
17
Can We Make NFAs Faster?
1
2
3
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
{1}
 
{1,2}
 
{1,2,3}
 
{1,2,3}
 
{1,2,3}
 
A
c
c
e
p
t
Frontier
Idea: Update frontiers in ONE step
regex=“a*aa”; input=“aaaa”
18
NFA-OBDD: Main Idea
 
Represent and operate NFA frontiers
symbolically using Boolean functions
Update the frontiers in ONE step: using a single
Boolean formula
Use ordered binary decision diagrams (OBDDs) to
represent and operate Boolean formula
 
19
Transitions as Boolean Functions
20
r
egex=“a*aa”
Match Test using Boolean Functions
21
 
{1} 
Λ 
a 
Λ 
T(x,i,y)
 
   (1
Λ
a
Λ
 1 )
V (1
Λ
a
Λ
 2 )
 
{1,2} 
Λ 
a
 Λ 
T(x,i,y)
 
   (1
Λ
a
Λ 
1)
V (1
Λ
a
Λ
 2)
V (2
Λ
a
Λ 
3)
 
{1,2,3}
 Λ 
a
 Λ 
T(x,i,y)
 
   (1
Λ
a
Λ 
1)
V (1
Λ
a
Λ 
2)
V (2
Λ
a
Λ 
3)
Input
symbol
Start
states
Transition
relation
Next states
Current
states
A
c
c
e
p
t
 
a
aaa
 
a
a
aa
 
 
aaa
a
 
NFA Operations using Boolean Functions
 
Frontier derivation: finding new frontiers after
processing one input symbol:
    
Next frontiers =
 
Checking acceptance:
 
22
Ordered Binary Decision Diagram (OBDD)
[Bryant 1986] 
23
O
B
D
D
s
:
 
C
o
m
p
a
c
t
 
r
e
p
r
e
s
e
n
t
a
t
i
o
n
o
f
 
B
o
o
l
e
a
n
 
f
u
n
c
t
i
o
n
s
 
Experimental Toolchain
 
24
 
 C++
 and 
CUDD package for OBDDs
 
Regular Expression
 Sets
 
Snort 
HTTP
 signature set
1503 regular expressions from
 March 2007
2612 
regular expressions
 
from
 October 2009
 
Snort 
FTP
 signature set
98 
regular expressions
 from October 2009
 
Extracted regular expressions from
 
pcre
 
and
uricontent
 fields of signatures
 
25
 
Traffic Traces
 
HTTP traces
Rutgers datasets
33 traces, size ranges: 5.1MB –1.24 GB
One week period in Aug 2009
 from Web server of the CS
department at Rutgers
DARPA 1999 datasets (11.7GB)
FTP traces
2 FTP traces
Size: 19.4MB, 24.7 MB
Two weeks period in March 2010
 from FTP server of
the CS department at Rutgers
 
26
Experimental Results
For 1503 regexes from HTTP Signatures
27
*
I
n
t
e
l
 
C
o
r
e
2
 
D
u
o
 
E
7
5
0
0
,
 
2
.
9
3
G
H
z
;
 
L
i
n
u
x
-
2
.
6
;
 
2
G
B
 
R
A
M
*
 
1645x
 
9-26x
 
10x
 
Summary
 
NFA-OBDD is time and space efficient
Outperforms NFAs by three orders of magnitude,
retaining space efficiency of NFAs
Outperforms or competitive with the PCRE package
Competitive with variants of DFAs but drastically less
memory-intensive
 
28
 
Part II: Extension of NFA-OBDD to Model
Submatch Extraction 
[ANCS’12]
 
29
 
Joint work with P. Manadhata, W. Horne, P.
Rao, and V. Ganapathy
Submatch Extraction
30
 
.
*
?
 
a
d
d
r
e
s
s
 
(
\
d
+
\
.
\
d
+
\
.
\
d
+
\
.
\
d
+
)
,
r
e
s
o
l
v
e
d
 
b
y
 
(
\
d
+
\
.
\
d
+
\
.
\
d
+
\
.
\
d
+
)
 
host address 128.6.60.45
resolved by 128.6.1.1
 
Submatch
extraction
 
$
1
 
=
 
1
2
8
.
6
.
6
0
.
4
5
$
2
 
=
 
1
2
8
.
6
.
1
.
1
Extract information of interest when finding a match
Submatch Tagging: Tagged NFAs
 
E = (a*)aa
 
Tagged NFA of “(a*)aa” with submatch tagging 
t
1
 
Transition table T(x,i,y,t) of the tagged NFA
31
Match Test
RE=“(a*)aa”; Input = “aaaa”
1
2
3
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
{1}
 
{1,2}
 
{1,2,3}
 
{1,2,3}
 
{1,2,3}
 
{t
1
}
 
{t
1
}
 
{t
1
}
 
{t
1
}
 
A
c
c
e
p
t
Frontier
32
 
Submatch Extraction
 
1
 
2
 
3
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
 
a
a
a
a
 
{t
1
}
 
{t
1
}
 
{t
1
}
 
{t
1
}
 
a
c
c
e
p
t
 
{1}
 
{1,2}
 
{1,2,3}
 
{1,2,3}
 
{1,2,3}
 
Frontier
Any path from an accept state to a start state
generates a valid assignment of submatches.
 
$1=aa
 
33
 
Submatch-OBDD
 
Representing tagged NFAs using Boolean
functions
Updating frontiers using Boolean formula
Finding a submatch path using Boolean operations
Using OBDDs to manipulate Boolean functions
 
34
 
Boolean Representation of Submatch Extraction
Submatch extraction: the last consecutive sequence of
symbols that are assigned with same tags
 
A back traversal approach: starting from the last
input symbol.
 
35
 
Overview of Toolchain
 
36
 
Toolchain in C++, interfacing with the CUDD*
Experimental Datasets
 
Snort-2009
Patterns: 115 regexes with capturing groups from HTTP
rules
Traces: 1.2GB CS department network traffic; 1.3GB
Twitter traffic; 1MB synthetic trace
Snort-2012
Patterns: 403 regexes with capturing groups from HTTP
rules
Traces: 1.2GB CS department network traffic; 1.3GB
Twitter traffic; 1MB synthetic trace
Firewall-504
Patterns: 504 patterns from a commercial firewall F
Trace: 87MB of firewall logs (average line size 87 bytes)
37
Experimental Setup
 
Platform: Intel Core2 Duo E7500, Linux-2.6.3,
2GB RAM
Two configurations on pattern matching
Conf.S
patterns compiled individually
compiled pattern matched sequentially against input traces
Conf.C
patterns combined with UNION and compiled
combined pattern matched against input traces
38
Experimental Results: Snort-2009
39
E
x
e
c
u
t
i
o
n
 
t
i
m
e
 
(
c
y
c
l
e
/
b
y
t
e
)
 
o
f
 
d
i
f
f
e
r
e
n
t
 
i
m
p
l
e
m
e
n
t
a
t
i
o
n
s
e
x
e
c
u
t
i
o
n
 
t
i
m
e
 
(
c
y
c
l
e
/
b
y
t
e
)
M
e
m
o
r
y
 
c
o
n
s
u
m
p
t
i
o
n
:
 
R
E
2
 
(
7
.
3
M
B
)
,
 
P
C
R
E
 
(
1
.
2
M
B
)
,
 
S
u
b
m
a
t
c
h
-
O
B
D
D
 
(
9
.
4
M
B
)
Submatch-OBDD is one order of magnitude faster than RE2 and PCRE
 
10x
 
Summary
 
Submatch-OBDD: an extension of NFA-OBDD
to model submatch extraction
Feasibility study
Submatch-OBDD is one order of magnitude faster
than PCRE and Google’s RE2 when patterns are
combined
 
40
 
PART III: Efficient Matching of Patterns with
Back References
 
41
 
Joint work with V. Ganapathy
and P. Manadhata
Regexes Extended with Back References
 
Identifying repeated substrings within a string
Non-regular languages
42
 
(
sens
|
respons
)e 
\1
ibility
 
Note:
 \1
 denotes referencing the substring captured by the first capturing group
 
Example:
 
A
n
 
e
x
a
m
p
l
e
 
f
r
o
m
 
S
n
o
r
t
 
r
u
l
e
 
s
e
t
:
/
.
*
j
a
v
a
s
c
r
i
p
t
.
+
f
u
n
c
t
i
o
n
\
s
+
(
\
w
+
)
\
s
*
\
(
\
w
*
\
)
\
s
*
\
{
.
+
l
o
c
a
t
i
o
n
=
[
^
}
]
+
\
1
.
+
\
}
/
s
i
m
Existing Approach
 
Recursive backtracking (PCRE, etc.)
Fast in general
Can be extremely slow for certain patterns
(algorithmic complexity attacks)
43
PCRE fails to return
correct results when
n >= 25
My Approach: Relax + Constraint
Converting back-refs to conditional submatch
extraction
44
(a*)aa
\1
(a*)aa(
a*
), s.t. $1
=$2
 
$1 denotes a substring captured by the 1
st
 capturing
group, and $2 denotes a substring captured by the
2
nd
 capturing group
 
Example:
Representing Back-refs with Tagged NFAs
 
Example: (a*)aa(
a*
), s.t. $1
=$2
45
 
The tagged NFA constructed from (a*)aa(
a*
).
Labels t
1
 and t
2
 are used to tag transitions within
the 1
st
 and 2
nd
 capturing groups. The acceptance
condition is state 3 and $1 = $2.
 
Transitions of Tagged NFAs
 
46
 
Example (cont.):
 
N
e
w
(
)
:
 
c
r
e
a
t
e
 
a
 
n
e
w
 
c
a
p
t
u
r
e
d
 
s
u
b
s
t
r
i
n
g
U
p
d
a
t
e
(
)
:
 
u
p
d
a
t
e
 
a
 
c
a
p
t
u
r
e
d
 
s
u
b
s
t
r
i
n
g
C
a
r
r
y
-
o
v
e
r
(
)
:
 
c
o
p
y
 
a
r
o
u
n
d
 
t
h
e
 
s
u
b
s
t
r
i
n
g
s
 
c
a
p
t
u
r
e
d
 
f
r
o
m
s
t
a
t
e
 
t
o
 
s
t
a
t
e
Match Test
 
Frontier set
{(
state#, substr
1
, substr
2
, …)}
Frontier derivation
table lookup + action
Acceptance condition
 exist (
s
, 
substr
1
, 
substr
2
, 
…), s.t. 
s
 is an accept state
and 
substr
1
=substr
2
47
 
Implementations
 
48
 
Two implementations
NFA-backref: an NFA-like C++ implementation
OBDD-backref: OBDD representation of NFA-backref
 
w
i
t
h
 
c
o
n
s
t
r
a
i
n
t
Experimental Datasets
 
Patho-01
regexes: (a?{n})a{n}\1
input strings: a
n
 (n from 5 to 30, 
10
0% accept rate)
Patho-02
10 pathological regexes from Snort-2009
synthetic input strings (0% accept rate)
Benign-03
46 regexes with one back-ref from Snort-2012
Synthetic input strings (50% accept rate)
49
 
Experimental Results: Patho-02
 
50
 
Execution time (cycle/byte) of different implementations for 10
regexes revised from Snort-2009
 
r
e
g
e
x
 
#
N
F
A
-
b
a
c
k
-
r
e
f
 
i
s
 
>
=
 
3
 
o
r
d
e
r
s
 
o
f
 
m
a
g
n
i
t
u
d
e
 
f
a
s
t
e
r
 
t
h
a
n
 
P
C
R
E
 
 *Intel Core2 Duo E7500
, 
2.93GHz
;
 Linux-2.6
; 
2GB 
RAM*
 
Experimental Results: Benign-03
 
51
 
Execution time (cycle/byte) of different implementations for sequentially
matching the 46 regexes from Snort 2012 with back references.
P
C
R
E
 
i
s
 
1
0
x
 
f
a
s
t
e
r
 
t
h
a
n
 
N
F
A
-
b
a
c
k
r
e
f
 
f
o
r
 
b
e
n
i
g
n
 
t
r
a
c
e
s
,
 
b
u
t
1
0
0
0
x
 
s
l
o
w
e
r
 
t
h
a
n
 
N
F
A
-
b
a
c
k
r
e
f
 
f
o
r
 
p
a
t
h
o
l
o
g
i
c
a
l
 
t
r
a
c
e
s
 
(a) benign trace
 
(b) pathological trace
 
Summary
 
NFA-backref: an efficient pattern matching
algorithm for back references
NFA-backref: resisting known algorithmic
complexity attacks (1000x faster than PCRE)
PCRE: 10x faster than NFA-backref for benign
patterns
 
52
 
Related Work
 
Multiple DFAs 
[Yu 
et al.
, ANCS’06]
XFAs 
[Smith 
et al.
, Oakland’08, SIGCOMM’08]
D
2
FA 
[Kumar 
et al.
, SIGCOMM’06]
Hybrid finite automata 
[Becchi 
et al.
, ANCS’08]
Multibyte speculative matching 
[Luchaup 
et al.
, RAID’09]
DFA-based Submatch extraction [
Horne 
et al
., LATA’13
]
RE2 [
Cox, code.google.com/p/re2
]
TNFA [
Laurikari 
et al.
, SPIRE’00
]
PCRE [
www.pcre.org
]
Many more – see my papers for details
 
53
 
Conclusion
 
New algorithms for time and space-efficient
pattern matching
NFA-OBDD: a time and space efficient data structure
for regular expressions
1000x faster than NFAs
Submatch-OBDD: an extension of NFA-OBDD to
model submatch extraction
10x faster than RE2 and PCRE for combined patterns
NFA-backref: an NFA-based algorithm for patterns
with back references
1000x faster than PCRE for certain patterns
10x slower than PCRE for benign patterns
 
54
 
Acknowledgment
 
A
d
v
i
s
o
r
:
 
P
r
o
f
.
 
V
i
n
o
d
 
G
a
n
a
p
a
t
h
y
R
e
s
e
a
r
c
h
 
d
i
r
e
c
t
o
r
s
:
 
P
r
o
f
.
 
V
i
n
o
d
 
G
a
n
a
p
a
t
h
y
,
 
P
r
o
f
.
 
L
i
v
i
u
 
I
f
t
o
d
e
T
h
e
s
i
s
 
C
o
m
m
i
t
t
e
e
:
 
P
r
o
f
.
 
V
i
n
o
d
 
G
a
n
a
p
a
t
h
y
,
 
P
r
o
f
.
 
L
i
v
i
u
 
I
f
t
o
d
e
,
 
P
r
o
f
.
B
a
d
r
i
 
N
a
t
h
,
 
a
n
d
 
D
r
.
 
A
b
h
i
n
a
v
 
S
r
i
v
a
s
t
a
v
a
C
o
-
a
u
t
h
o
r
s
:
 
V
i
n
o
d
 
G
a
n
a
p
a
t
h
y
,
 
L
i
v
i
u
 
I
f
t
o
d
e
,
 
R
a
n
d
y
 
S
m
i
t
h
,
 
R
e
z
w
a
n
a
K
a
r
i
m
,
 
P
r
a
t
y
u
s
a
 
M
a
n
a
d
h
a
t
a
,
 
W
i
l
l
i
a
m
 
H
o
r
n
e
,
 
P
r
a
s
a
d
 
R
a
o
,
 
N
a
d
e
r
B
o
u
s
h
e
h
r
i
n
e
j
a
d
m
o
r
a
d
i
,
 
P
a
l
l
a
b
 
R
o
y
,
 
M
a
r
k
u
s
 
J
a
k
o
b
s
s
o
n
,
 
C
o
l
l
e
a
g
u
e
s
:
 
M
o
h
a
n
 
D
h
a
w
a
n
,
 
S
h
a
k
e
e
l
 
B
u
t
t
,
 
L
u
 
H
a
n
,
 
A
m
r
u
t
a
G
o
k
h
a
l
e
,
 
R
e
z
w
a
n
a
 
K
a
r
i
m
,
 
a
n
d
 
N
a
d
e
r
 
B
o
u
s
h
e
h
r
i
n
e
j
a
d
m
o
r
a
d
i
M
y
 
w
i
f
e
:
 
W
e
i
w
e
i
 
T
a
n
g
 
55
 
Future Directions
 
Hardware Implementation
NFA-OBDD
Submatch-OBDD
NFA-Backref
Parallel pattern matching
Multithreading using GPUs
Multithreading using multi-core processors
Speculative NFA-based pattern matching
 
56
 
Other Contributions
 
Enhancing Users’ Comprehension of Android
Permissions 
[
SPSM’12
]
Enhancing Mobile Malware Detection with Social
Collaboration 
[
Socialcom’12
]
Quantifying Security in Preference-based
Authentication 
[
DIM’08
]
Love and Authentication
 [
CHI’08
]
Discount Anonymous On-demand Routing for
Mobile Ad hoc Networks 
[
SecureComm’06
]
 
57
Slide Note

Good morning everybody. Welcome to attend my thesis defense. Today I am going to present “New Pattern Matching Algorithms for Network Security Applications”.

Embed
Share

Discusses new pattern matching algorithms for network security applications, focusing on intrusion detection systems (IDS) and the use of signatures and regular expressions to detect malicious patterns in network traffic. Explores the ideal and reality of pattern matching, time-space tradeoffs, and different types of patterns in the context of network security. Provides an overview of Liu Yang's thesis, which includes regular expressions, NFA-OBDD, submatch extraction, back references, and more.


Uploaded on Sep 23, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. New Pattern Matching Algorithms for Network Security Applications Liu Yang Department of Computer Science Rutgers University Liu Yang April 4th, 2013

  2. Intrusion Detection Systems (IDS) Intrusion detection Host-based Network-based Anomaly-based (statistics ) Signature-based (using patterns to describe malicious traffic) Example signature1: alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS ; pcre: /username=[^&\x3b\r\n]{255}/si ; This is an example signature from Snort, an network-based intrusion detection system (NIDS) 2 Liu Yang

  3. Network-based Intrusion Detection Systems Pattern matching: detecting malicious traffic = { /.*evil.*/} patterns Network traffic NIDS ..evil.. innocent Alerts Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures. 3 Liu Yang

  4. Ideal of Pattern Matching Time efficient fast to keep up with network speed, e.g., Gbps Space efficient compact to fit into main memory 4 Liu Yang

  5. The Reality: Time-space Tradeoff Deterministic Finite Automata (DFAs) Fast in operation Consuming large space Nondeterministic Finite Automata (NFAs) Space efficient Slow in operation Recursive backtracking (implemented by PCRE, Java, etc) Fast in general Extremely slow for certain types of patterns 5 Liu Yang

  6. The Reality: Time-space Tradeoff Backtracking (under algorithmic complexity attacks) NFA (non-deterministic finite automaton) Time My contribution Backtracking (with benign patterns) DFA (deterministic finite automaton) Ideal Space 6 Liu Yang

  7. Overview of My Thesis Three types of patterns Regular expressions .*<embed[^>]*javascript ^file\x3a\x2f\x2f[^\n]{400} NFA-OBDD [RAID 10, COMNET 11] Regular expressions +submatch extraction Submatch-OBDD [ANCS 12] .*? address (\d+\.\d+\.\d+\.\d+), resolved by (\d+\.\d+\.\d+\.\d+) .*(NLSessionS[^=\s]*)\s*=\s*\x3 B.*\1\s*=[^\s\x3B] Regular expressions +back references NFA-backref [to submit] 7 Liu Yang

  8. Main Contribution Algorithms for time and space efficient pattern matching NFA-OBDD space efficient (60MB memory for 1500+ patterns) 1000x faster than NFAs Submatch-OBDD: space efficient 10x faster than PCRE and Google s RE2 NFA-backref: space efficient resisting known algorithmic attacks (1000x faster than PCRE for certain types of patterns) 8 Liu Yang

  9. Part I: NFA-OBDD: A Time and Space Efficient Data Structure for Regular Expression Matching Joint work with R. Karim, V. Ganapathy, and R. Smith [RAID 10, COMNET 11] 9 Liu Yang

  10. Finite Automata Regular expressions and finite automata are equally expressive Regular expressions NFAs DFAs 10 Liu Yang

  11. Why not DFA? Combining DFAs: Multiplicative increase in number of states .*ab.*cd .*ef.*gh Picture courtesy : [Smith et al. Oakland 08] .*ab.*cd | .*ef.*gh 11 Liu Yang

  12. Why not DFA? (cont.) State explosion may happen NFA Pattern: .*1[0|1] {3} DFA State explosion n O(2^n) The value of quantifier n is up to 255 in Snort 12 Liu Yang

  13. Pattern Set Grows Fast 30000 25000 20000 15000 10000 5000 0 2005 2006 2007 2008 2009 2010 2011 2012 Snort rule set grows 7x in 8 years 13 Liu Yang

  14. Space-efficiency of NFAs Combining NFAs: Additive increase in number of states M N .*ab.*cd .*ef.*gh .*ab.*cd | .*ef.*gh 14 Liu Yang

  15. NFAs are Slow NFA frontiers1 may contain multiple states Frontier update may require multiple transition table lookups 1. A frontier set is a set of states where NFA can be at any instant. 15 Liu Yang

  16. NFAs of Regular Expressions Example: regex= a*aa a a a 1 2 3 Current state (x) 1 1 2 Input symbol (i) a a a Next state (y) 1 2 3 Transition table T(x,i,y) 16 Liu Yang

  17. NFA Frontier Update: Multiple Lookups regex= a*aa ; input= aaaa 1 2 3 Accept aaaa aaaa aaaa aaaa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} 17 Liu Yang

  18. Can We Make NFAs Faster? regex= a*aa ; input= aaaa 1 2 3 Accept aaaa aaaa aaaa aaaa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} Idea: Update frontiers in ONE step 18 Liu Yang

  19. NFA-OBDD: Main Idea Represent and operate NFA frontiers symbolically using Boolean functions Update the frontiers in ONE step: using a single Boolean formula Use ordered binary decision diagrams (OBDDs) to represent and operate Boolean formula 19 Liu Yang

  20. Transitions as Boolean Functions regex= a*aa Current state (x) 1 1 2 Input symbol (i) a a a Next state (y) 1 2 3 (1 a 1) V (1 a 2) V (2 a 3) T(x,i,y) = 20 Liu Yang

  21. Match Test using Boolean Functions (1 a 1 ) V (1 a 2 ) aaaa {1} a T(x,i,y) Input symbol Start states Transition relation Next states aaaa (1 a 1) V (1 a 2) V (2 a 3) {1,2} a T(x,i,y) Current states aaaa (1 a 1) V (1 a 2) V (2 a 3) {1,2,3} a T(x,i,y) Accept 21 Liu Yang

  22. NFA Operations using Boolean Functions Frontier derivation: finding new frontiers after processing one input symbol: Next frontiers = Checking acceptance: ( ( ) Map InputSymbo l i , y x x i ( ) Frontier x ( , , i x )) TransFunct ion y ( ( ) ( )) SAT SetOfAccep tStates x Frontier x 22 Liu Yang

  23. Ordered Binary Decision Diagram (OBDD) [Bryant 1986] OBDDs: Compact representation of Boolean functions x1 0 0 1 x2 1 1 0 x3 1 1 1 x4 0 1 1 x5 1 0 1 x6 1 0 0 F(x) 1 1 1 = ( ) ( ) F x x x x x x x 1 2 x 3 x 4 x 5 x 6 x ( ) x 1 2 3 4 5 6 ( ) x x x x x x 1 2 3 4 5 6 23 Liu Yang

  24. Experimental Toolchain C++ and CUDD package for OBDDs 24 Liu Yang

  25. Regular Expression Sets Snort HTTP signature set 1503 regular expressions from March 2007 2612 regular expressions from October 2009 Snort FTP signature set 98 regular expressions from October 2009 Extracted regular expressions from pcre and uricontent fields of signatures 25 Liu Yang

  26. Traffic Traces HTTP traces Rutgers datasets 33 traces, size ranges: 5.1MB 1.24 GB One week period in Aug 2009 from Web server of the CS department at Rutgers DARPA 1999 datasets (11.7GB) FTP traces 2 FTP traces Size: 19.4MB, 24.7 MB Two weeks period in March 2010 from FTP server of the CS department at Rutgers 26 Liu Yang

  27. Experimental Results For 1503 regexes from HTTP Signatures 10x 1645x 9-26x 27 Liu Yang *Intel Core2 Duo E7500, 2.93GHz; Linux-2.6; 2GB RAM*

  28. Summary NFA-OBDD is time and space efficient Outperforms NFAs by three orders of magnitude, retaining space efficiency of NFAs Outperforms or competitive with the PCRE package Competitive with variants of DFAs but drastically less memory-intensive 28 Liu Yang

  29. Part II: Extension of NFA-OBDD to Model Submatch Extraction [ANCS 12] Joint work with P. Manadhata, W. Horne, P. Rao, and V. Ganapathy 29 Liu Yang

  30. Submatch Extraction Extract information of interest when finding a match host address 128.6.60.45 resolved by 128.6.1.1 Submatch extraction .*? address (\d+\.\d+\.\d+\.\d+), resolved by (\d+\.\d+\.\d+\.\d+) $1 = 128.6.60.45 $2 = 128.6.1.1 30 Liu Yang

  31. Submatch Tagging: Tagged NFAs E = (a*)aa Tag(E) = (a*)t aa 1 Tagged NFA of (a*)aa with submatch tagging t1 a/t1 a a 1 2 3 Current state (x) 1 1 2 Input symbol (i) Next state (y) a a a Output tags (t) {t1} {} {} 1 2 3 Transition table T(x,i,y,t) of the tagged NFA 31 Liu Yang

  32. Match Test RE= (a*)aa ; Input = aaaa {t1} {t1} {t1} {t1} 1 2 3 Accept aaaa aaaa aaaa aaaa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} 32 Liu Yang

  33. Submatch Extraction {t1} {t1} {t1} {t1} 1 2 3 accept aaaa aaaa aaaa aaaa $1=aa Frontier {1} {1,2} {1,2,3} {1,2,3} {1,2,3} Any path from an accept state to a start state generates a valid assignment of submatches. 33 Liu Yang

  34. Submatch-OBDD Representing tagged NFAs using Boolean functions Updating frontiers using Boolean formula Finding a submatch path using Boolean operations Using OBDDs to manipulate Boolean functions 34 Liu Yang

  35. Boolean Representation of Submatch Extraction A back traversal approach: starting from the last input symbol. = OneRrevers eTransitio n ( ( ) PickOne CurrentSta te y ( ) InputSymbo l i ( , , OneRrevers , )) IntermTran = sitions x i y t ( ( )) previousSt ate = Map eTransitio n , , x y i y t ( ) OutputTag Submatch extraction: the last consecutive sequence of symbols that are assigned with same tags OneRrevers eTransitio n , , i x y 35 Liu Yang

  36. Overview of Toolchain Toolchain in C++, interfacing with the CUDD* input stream Tagged NFAs OBDDs regexes with capturing groups pattern matching re2tnfa tnfa2obdd rejected matched submatches $1 = 36 Liu Yang

  37. Experimental Datasets Snort-2009 Patterns: 115 regexes with capturing groups from HTTP rules Traces: 1.2GB CS department network traffic; 1.3GB Twitter traffic; 1MB synthetic trace Snort-2012 Patterns: 403 regexes with capturing groups from HTTP rules Traces: 1.2GB CS department network traffic; 1.3GB Twitter traffic; 1MB synthetic trace Firewall-504 Patterns: 504 patterns from a commercial firewall F Trace: 87MB of firewall logs (average line size 87 bytes) 37 Liu Yang

  38. Experimental Setup Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM Two configurations on pattern matching Conf.S patterns compiled individually compiled pattern matched sequentially against input traces Conf.C patterns combined with UNION and compiled combined pattern matched against input traces 38 Liu Yang

  39. Experimental Results: Snort-2009 Submatch-OBDD is one order of magnitude faster than RE2 and PCRE 10000000 execution time (cycle/byte) 1000000 100000 10x 10000 1000 100 10 1 Conf.S Conf.C RE2 PCRE Submatch-OBDD Execution time (cycle/byte) of different implementations Memory consumption: RE2 (7.3MB), PCRE (1.2MB), Submatch-OBDD (9.4MB) 39 Liu Yang

  40. Summary Submatch-OBDD: an extension of NFA-OBDD to model submatch extraction Feasibility study Submatch-OBDD is one order of magnitude faster than PCRE and Google s RE2 when patterns are combined 40 Liu Yang

  41. PART III: Efficient Matching of Patterns with Back References Joint work with V. Ganapathy and P. Manadhata 41 Liu Yang

  42. Regexes Extended with Back References Identifying repeated substrings within a string Non-regular languages Example: sense sensibility response responsibility sense responsibility response sensibility (sens|respons)e \1ibility Note: \1 denotes referencing the substring captured by the first capturing group An example from Snort rule set: /.*javascript.+function\s+(\w+)\s*\(\w*\)\s*\{.+location=[^}]+\1.+\}/sim 42 Liu Yang

  43. Existing Approach Recursive backtracking (PCRE, etc.) Fast in general Can be extremely slow for certain patterns (algorithmic complexity attacks) 0.14 Throughput (MB/sec) 0.12 0.1 PCRE fails to return correct results when n >= 25 0.08 0.06 0.04 Nearly zero throughput 0.02 0 5 10 15 n 20 25 30 Throughput of PCRE when matching (a?{n})a{n}\1 with an 43 Liu Yang

  44. My Approach: Relax + Constraint Converting back-refs to conditional submatch extraction constraint Example: (a*)aa\1 (a*)aa(a*), s.t. $1=$2 $1 denotes a substring captured by the 1st capturing group, and $2 denotes a substring captured by the 2nd capturing group 44 Liu Yang

  45. Representing Back-refs with Tagged NFAs Example: (a*)aa(a*), s.t. $1=$2 a/t1 a/t2 a a 1 2 3 The tagged NFA constructed from (a*)aa(a*). Labels t1 and t2 are used to tag transitions within the 1st and 2nd capturing groups. The acceptance condition is state 3 and $1 = $2. 45 Liu Yang

  46. Transitions of Tagged NFAs Example (cont.): Current state (x) 1 1 2 3 Input symbol (i) Next state (y) a a a a Action New(t1) or update(t1) Carry-over(t1) Carry-over(t1) New(t2) or Update(t2) 1 2 3 3 New(): create a new captured substring Update(): update a captured substring Carry-over(): copy around the substrings captured from state to state 46 Liu Yang

  47. Match Test Frontier set {(state#, substr1, substr2, )} Frontier derivation table lookup + action Acceptance condition exist (s, substr1, substr2, ), s.t. s is an accept state and substr1=substr2 47 Liu Yang

  48. Implementations Two implementations NFA-backref: an NFA-like C++ implementation OBDD-backref: OBDD representation of NFA-backref input stream patterns with back-refs tagged NFAs with constraint match test re2tnfa matched or not 48 Liu Yang

  49. Experimental Datasets Patho-01 regexes: (a?{n})a{n}\1 input strings: an (n from 5 to 30, 100% accept rate) Patho-02 10 pathological regexes from Snort-2009 synthetic input strings (0% accept rate) Benign-03 46 regexes with one back-ref from Snort-2012 Synthetic input strings (50% accept rate) 49 Liu Yang

  50. Experimental Results: Patho-02 NFA-back-ref is >= 3 orders of magnitude faster than PCRE 10000000 Exec-time (cycle/byte) 1000000 100000 10000 1000 100 10 1 1 2 3 4 5 6 7 8 9 10 regex # PCRE OBDD-backref NFA-backref Execution time (cycle/byte) of different implementations for 10 regexes revised from Snort-2009 *Intel Core2 Duo E7500, 2.93GHz; Linux-2.6; 2GB RAM* 50 Liu Yang

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#