Dictionary Compression and Deep Packet Inspection (DPI) Overview

Slide Note
Embed
Share

This content discusses Decompression-Free Inspection (DPI) for shared dictionary compression over HTTP, the challenges and solutions in deep packet inspection (DPI), compressed HTTP methods, examples of intra-response and inter-response compression, and current operations of Network Intrusion Detection Systems (NIDS). It covers topics such as data compression techniques, security tools performance, and real-time pattern matching engines.


Uploaded on Sep 22, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP Anat Bremler-Barr Interdisciplinary Center Herzliya Shimrit Tzur David Interdisciplinary Center Herzliya & The Hebrew University, Jerusalem David Hay The Hebrew University, Jerusalem Yaron Koral Tel Aviv University 1

  2. Outline Motivation Background AC algorithm Our solution The offline Phase The online phase Experimental Results 2

  3. Deep Packet Inspection (DPI) Search for patterns in the packets` payload Signatures-based NIDS Intrusion Preventions Web-Application Firewalls Leakage prevention Content Filtering Challenges: Thousands of known malicious patterns Real time, link rate Security tools performance is dominated by the pattern matching engine (Fisk & Varghese 2002) 3

  4. Compressed HTTP 84.1% of the top 1,000 sites compress their traffic. Data compression is done by adding references to repeated data. There are two types of compression: 19% increase in 8 month! Intra-response compression the references point to bytes within the response (Gzip/Deflate) Inter-responses/connections compression the references point to bytes in a separate file, called dictionary (Google s SDCH). 4

  5. Example Intra-Response Compression File1.html: abcdefgabcd File2.html abcdxyzbcdtr TCP Connection Setup Encode repeated strings by pointer: {distance, length} 5

  6. Example Inter-Response Compression Dictionary: abcd File1.html: abcdefgabcd File2.html abcdxyzbcdtr TCP Connection Setup Copy repeated strings from the dictionary: (address, length) 6

  7. Current NIDS Operation (1) GET \index.html Accept-Encoding: SDCH GET \index.html Accept-Encoding: SDCH NIDS Client Server Scan for Intrusions 7

  8. Current NIDS Operation (2) GET \index.html Accept-Encoding: SDCH GET \index.html Accept-Encoding: SDCH NIDS Client Server Do Not Scan/ Decompress, Scan, Compress 8

  9. GET \index.html Accept-Encoding: SDCH GET \index.html Accept-Encoding: SDCH NIDS Client Server Scan directly with no decompression 9

  10. Our Solution: Decompression-Free Scanning Focused on inter-response compression Our algorithm works in two phases Offline phase - Scanning the dictionary Online phase - Scanning the delta files Works at the rate of the compressed traffic Gain 56% improvement compared with scanning the plain-text directly 10

  11. Outline Motivation Background Aho-Corasick (AC) algorithm Our solution The offline Phase The online phase Experimental Results 11

  12. Aho-Corasick (AC) Algorithm s0 E C Finite State Machine (FSM) Regular states, accepting states B s1 s2 s7 E C D D Goto function (black arrows) g(state,symbol) state Patterns: E BE BD BCAA BCD CDBCAB s3 s4 s5 s8 A D B Each state corresponds to a label- the sequence of characters on its goto path from the root. The length of the label is the depth of the state g(S11,A) = ? s9 g(S11,B) = S12 s13 A s6 C s14 s10 Failure function (red arrows) f(state) state Taken when there is no goto function Goes to a state that its label is the longest suffix of the current state s label A s11 The label of S14is BCAA B s12 f(S11) = S13 g(S11,A) g(S13,A)=S14

  13. Aho-Corasick Insights The automaton remembers only its current state s0 E C B s1 s2 s7 E The input text ends with the label of current state This label is the longest suffix in the text that can be a prefix of a match C D D s3 s4 s5 s8 A D B s9 s13 A s6 C s14 s10 A No future pattern can begin before this label s11 B s12

  14. Outlines Motivation Background Aho-Corasick (AC) algorithm Our solution The offline Phase The online phase Experimental Results 14

  15. Accelerator Algorithm Idea The algorithm operates in two phases: The Offline Phase: Scan the dictionary and store information about the pattern matching results The Online Phase: Scan the delta file and skip almost all referenced bytes that were already scanned for patterns. 15

  16. The Offline Phase s0 B The dictionary is scanned using AC (from its first byte and from s0). We save the state after each byte. E C s1 s2 D s7 E C D s3 s4 s5 s8 B A D s9 C s1 3 s6 A State: s1 4 s10 A 0 D S0 1 B S2 2 E S3 3 A S0 4 A S0 5 C S7 6 D S8 7 B S9 8 C S10 9 A S11 10 B S12 11 C S5 s11 B s12 We also save information of matched patterns that are found in the dictionary 16

  17. Challenges Patterns/ Signatures: E BE BD BCAA BCD CDBCAB 0 1 2 3 4 5 6 7 8 9 10 11 Dictionary: Delta file: ABDB(5,4)AAB(1,4) The uncompressed data is: A B D B C D B C A A B B E A A DB E AACDBCABC We copy from arbitrary position in the dictionary when the automaton in an arbitrary state We show that no matter in what state and which symbol we start to copy, the resulting state is reachable via failure transitions from the saved state. Types of matches: Right boundary Internal Left boundary 17

  18. The Online Phase Scan the delta file: Uncompressed bytes - scan using AC. Copy instruction (p,x) The compressed data that we already scanned in the offline phase. We will save the scan for almost all these bytes. The internal match is trivial, see paper for details. 18

  19. The Online Phase - Right Boundary When encountering copy instruction (p,x), We want to stop scanning and jump to state[p+x-1] If the label of the state is longer than the copy- value The label begins before the copy value The context of this state is not as in the online scan We take failure transitions to find state with sufficiently short label. otherwise The label of the state is contained in the copy value This is the longest suffix that can lead to a match 19

  20. Example Right Boundary s0 B E C COPY(7,4): BCAB s1 s2 s7 E C D D Uncompressed data: B Go to State[10]=s12. depth(s12) > 4. Go to f(s12)=s2 depth(s2) 4 Current state is S2 s3 s4 s5 s8 A D B s9 s13 A s6 C s14 s10 A s11 B State: s12 0 D S0 1 B S2 2 E S3 3 A S0 4 A S0 5 C S7 6 D S8 7 B S9 8 C S10 9 A S11 10 B S12 11 C S5 20

  21. The Online Phase Left Boundary When encountering copy instruction (p,x), We want to stop scanning and jump to state[p+x-1] If the number of bytes we read from the copy value is less than the depth of the current state The label of the state begins before the copied bytes We scan the copy value till we reach a state that its label is shorter than the number of read bytes. otherwise The label of the state is contained in the copy value Both offline and online scans have the same context 21

  22. Example Left Boundary s0 B E C COPY(5,4): CDBC s1 s2 s7 E C Uncompressed data: B D D s3 s4 s5 s8 A D j=2 j=1 Depth=2 Depth=3 j=0 depth=1 Continue Continue Continue B s9 s1 3 s6 j=3 Stop scanning (depth(s9) 3) C A s1 4 s10 A s11 B State: s12 0 D S0 1 B S2 2 E S3 3 A S0 4 A S0 5 C S7 6 D S8 7 B S9 8 C S10 9 A S11 10 B S12 11 C S5 22

  23. Outline Motivation Background Aho-Corasick (AC) algorithm Our solution The offline Phase The online phase Experimental Results 23

  24. Experimental Results Input: google.com dictionary Pages for 1000 most popular Google queries. Patterns Snort The synthetic case A patterns file for each input file so the input file has a different percentage of matches, from 25% to 100%. 24

  25. The Algorithm Overheads 1. Traversing the failure transitions In the right boundary 2. Scanning the copy value In the left boundary 3. Memory consumption: The additional information of the offline phase. Total: 420 KB (per dictionary) Can be further reduced by a variable-length pointer encoding. 25

  26. Failure Transitions Right Boundaries If length depth, no failure transition is taken In our experiments: The average is 2.35 failure transitions per file (average of 557 copy instructions per file) 26

  27. Scanning the Copy Value - Left Boundary Compression ratio compressed/uncompressed Scan ratio scanned/uncompressed. Snort low percentage of matches scan-ratio ~ compression ratio The synthetic case high percentage of matches Unrealistic case scan-ratio is between 1.05 to 1.2 times compression- ratio. 27

  28. Regular Expression Results Strings were extracted from the regular expression and were added to the pattern set. When needed, we use off-the-shelf perl compatible regular expression engine to scan additional parts of the text. The overhead of the regular expression is around 1% which is almost negligible 28

  29. Questions?? 29

  30. Regular Expression Very common in security purpose patterns. In Snort, 55% of the rules contain regular expression. Composed of anchors and pcre tokens. For example, in the pattern: abc[1-9]*xyza{3,7} The anchors are: abc xyz The pcre tokens are: [1-9]* a{3,7} 30

  31. Dealing with Regular Expression 1. The anchors are extracted from the regular expression offline. 2. The anchors are added to the patterns set. 3. If there is a regular expression which all its anchors were matched: run an off the-shelf regular expression engine until, either a mismatch, a full pattern match, or the whole (limited) text is searched. 31

  32. Regular Expression Limited Search In most cases, we can limit the search in at least one direction. If before the first anchor all tokens have a limited size, there is a bounded number of characters we should examine before the matched anchor. If after the last anchor all tokens have a limited size there is a bounded number of characters we should examine after the matched anchor. 32

  33. Memory Consumption 1. Doubling the size of the dictionary (for saving the offline scan results, one pointer per symbol) 2. Saving the matched list (for internal matches) Our experiments: Match list size 40,000 Dictionary size 116K symbols Pointer size 17 bits Total memory consumption is 420 KB (per dictionary) Can be further reduced by a variable-length pointer encoding. 33

Related


More Related Content