AES Encryption Algorithm and Its Implementation

 
A
E
S
 
i
n
 
P
4
u
s
i
n
g
 
s
c
r
a
m
b
l
e
d
 
l
o
o
k
u
p
 
t
a
b
l
e
s
 
Xiaoqi Chen
Secure 
A
pps 
N
eed 
S
ecure 
C
rypto
 
Never roll your own crypto
!
CRC-32 is 
likely
 
not secure 
enough…
 
Data plane: many computational constraints
Short pipeline, simple arithmetic
 
Is standardized crypto still possible?
Maybe?
AES:
 
An
 
Intro
 
NIST
 
standard
 
encryption
 
algorithm
 
“Block
 
cipher”,
 
encrypt
 
data
 
16
 
bytes
 
at
 
a
 
time
Uses
 
10,
 
12,
 
or
 
14
 
encryption
 
rounds
Each
 
round:
AddRoundKey
SubBytes
ShiftRows
MixColumns
An AES
 
Round
 
 
 
16-byte
 
round
 
key
 
Substitution
 
box
 
(S-box)
 
 
(1)
AddRoundKey
:
using
 
XOR
 
(2)
 
SubBytes
:
lookup
 
in
 
S-box
An AES
 
Round
An AES
 
Round
 
(3)
 
Shift
Rows
:
 
each
 
row
 
cyclically
shifted
 
by
 
0,
 
1,
 
2,
 
or
 
3
 
positions
 
(4)
 
MixColumns
:
Polynomial
 
Multiplication
 
under
 
GF(2
8
),
 
modulus
 
(x
4
+1)
 
 
An AES
 
Round
 
Rinse
 
and
 
repeat!
AES-128:
 
10
 
rounds
AES-192:
 
12
 
rounds
AES-256:
 
14
 
rounds
Different 
O
ptimization 
O
bjectives
Embedded system
Very small memory
Don’t use large lookup tables
Sequential computation
Run many cycles
 
P4 switch
 
Relatively larger memory
For routing tables
Parallel data flow
Challenge:
 
limited # stages
Solution:
 
Scrambled
 
Lookup
 
Table
 
16
 
tables
 
per
 
round,
one
 
for
 
each
 
k
i,j
 
Pre-computed
 
Scrambled
 
Lookup
 
Tables
 
 
Solution:
 
Scrambled
 
Lookup
 
Table
 
Use 
160
x
~224x
 
memory
 
space to save XOR
s
Input
 
block
 
->
 
Lookup
 
->
 
XOR
 
->
 
Output
 
block
Theoretical
 
minimum:
 
2
 
stage
 
per
 
round
Each
 
table
 
costs
 
1KB
Justifiable
 
in
 
switches:
 
short
 
pipeline,
 
MBs
 
of
 
memory
Evaluation
 
Does it actually 
fit
?
Theoretically
 
SLT
s
 costs 
160~224
 KB of 
memory
 
in
 
total
In
 
practice
,
 
occupied
 
<15
%
 
of
 
SRAM
 
on
 
Tofino
 
v1
Other resource utilization are negligible
 
So, what is the real-world performance?
1
0
G
bit
/s @ two rounds per pass
 
(200Gbps
 
recirculation
 
limit)
122
%
 
improvement
 
compared
 
with
 
baseline
 
(one
 
round
 
per
 
pass)
 
Evaluation
    
=1.3x
 
  
   
=0.3x
Evaluation
 
Tofino
 
Wedge100-32x,
 
$
8000
MacBook
 
Air
 
13”
 
2017,
 
$
1200
Intel
 
i7
 
(AES-NI),
 
2.2Ghz
 
2
 
cores
MacBook
 
Pro
 
16”
 
2019,
 
$
2400
Intel
 
i7
 
(AES-NI),
 
2.6Ghz
 
6
 
cores
 
 
Use
 
50%
 
ports
 
for
 
recirculation
 
    
=11x
 
   
   
=2.4x
Conclusion
 
&
 
Future
 
Directions
 
AES runs on 
P4
 
switches
!
Please,
 
avoid
 
hand-rolled
 
crypto
Scrambled Lookup Table: save XOR
s
 using 
160~224KB
 
memory
Two
 
rounds
 
per
 
pass,
 
i
mproves throughput by 
122
%
Tofino v1 is not the best device to run AES
Other
 
algorithms?
 
Is
 
Feistel Cipher
 
more
 
PISA-friendly
?
Fewer rounds?
D
edicated crypto co-processor
 
in
 
switches
?
Our
 
P4
 
code
 
is
 
open-source!
github.com/Princeton-Cabernet/p4-projects
Slide Note
Embed
Share

Learn about the Advanced Encryption Standard (AES) algorithm - a NSA-approved NIST standard encryption method. Explore how AES works, its key rounds, SubBytes, ShiftRows, MixColumns operations, and its optimization for embedded systems and small memory devices. Discover the importance of secure crypto practices and the risks of rolling your own encryption. Dive into the details of AES-128, AES-192, and AES-256 rounds and their encryption processes.

  • AES Encryption
  • NSA
  • NIST Standard
  • Crypto Security
  • Encryption Algorithm

Uploaded on Jul 14, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Shield Icon Png #432401 - Free Icons Library AES AES in P4 P4 using scrambled lookup tables Xiaoqi Chen

  2. Secure Apps Need Secure Crypto Never roll your own crypto ! CRC-32 is likely not secure enough Data plane: many computational constraints Short pipeline, simple arithmetic Is standardized crypto still possible? Maybe?

  3. AES: An Intro NSA Approved NIST standard encryption algorithm Block cipher , encrypt data 16 bytes at a time Uses 10, 12, or 14 encryption rounds Each round: AddRoundKey SubBytes ShiftRows MixColumns

  4. An AES Round 16-byte round key k0,0 k1,0 k2,0 k3,0 Substitution box (S-box) k0,1 k1,1 k2,1 k3,1 In 00 01 02 03 04 05 k0,2 k1,2 k2,2 k3,2 Out 63 7B 77 7B F2 6B k0,3 k1,3 k2,3 k3,3 a0,0 a1,0 a2,0 a3,0 b0,0 b1,0 b2,0 b3,0 c0,0 c1,0 c2,0 c3,0 (1) AddRoundKey: using XOR (2) SubBytes: lookup in S-box a0,1 a1,1 a2,1 a3,1 b0,1 01 b2,1 b3,1 c0,1 7B c2,1 c3,1 a0,2 a1,2 a2,2 a3,2 b0,2 b1,2 b2,2 b3,2 c0,2 c1,2 c2,2 c3,2 a0,3 a1,3 a2,3 a3,3 b0,3 b1,3 b2,3 04 c0,3 c1,3 c2,3 F2

  5. An AES Round c0,0 c1,0 c2,0 c3,0 c0,1 c1,1 c2,1 c3,1 c0,2 c1,2 c2,2 c3,2 c0,3 c1,3 c2,3 c3,3

  6. An AES Round (4) MixColumns: Polynomial Multiplication under GF(28), modulus (x4+1) a(x) = 3x3+x2+x+2 b(x) = c1,0x3+c2,1x2+c3,2x+c0,3 d(x) = a(x) b(x) mod (x4+1) (3) ShiftRows: each row cyclically shifted by 0, 1, 2, or 3 positions = d1,0x3+d1,1x2+d1,2x+d1,3 c0,0 c1,0 c2,0 c3,0 c0,0 c1,0 c2,0 c3,0 d0,0 d1,0 d2,0 d3,0 c0,1 c1,1 c2,1 c3,1 c1,1 c2,1 c3,1 c0,1 d0,1 d1,1 d2,1 d3,1 c0,2 c1,2 c2,2 c3,2 c2,2 c3,2 c0,2 c1,2 d0,2 d1,2 d2,2 d3,2 c0,3 c1,3 c2,3 c3,3 c3,3 c0,3 c1,3 c2,3 d0,3 d1,3 d2,3 d3,3

  7. An AES Round Rinse and repeat! AES-128: 10 rounds AES-192: 12 rounds AES-256: 14 rounds a0,0 a1,0 a2,0 a3,0 d0,0 d1,0 d2,0 d3,0 a0,1 a1,1 a2,1 a3,1 d0,1 d1,1 d2,1 d3,1 a0,2 a1,2 a2,2 a3,2 d0,2 d1,2 d2,2 d3,2 a0,3 a1,3 a2,3 a3,3 d0,3 d1,3 d2,3 d3,3

  8. Different Optimization Objectives Embedded system P4 switch Very small memory Relatively larger memory Don t use large lookup tables For routing tables Sequential computation Parallel data flow Run many cycles Challenge: limited # stages

  9. Solution: Scrambled Lookup Table 12 k1,0 k2,0 k3,0 Pre-computed Scrambled Lookup Tables 34 k1,1 k2,1 k3,1 In 00 01 02 03 04 05 In 00 01 02 03 04 05 In 00 01 02 03 04 05 71 k1,2 k2,2 k3,2 In XOR k0,0 In XOR k1,0 In XOR k0,2 12 13 10 11 16 17 34 35 36 37 30 31 55 k1,3 k2,3 k3,3 71 70 73 72 75 74 Out Out Out 63 7B 77 7B F2 6B 63 7B 77 7B F2 6B 63 7B 77 7B F2 6B a0,0 a1,0 a2,0 a3,0 c0,0 c1,0 c2,0 c3,0 16 tables per round, one for each ki,j a0,1 a1,1 a2,1 a3,1 c0,1 c1,1 c2,1 c3,1 75 a1,2 a2,2 a3,2 F2 c1,2 c2,2 c3,2 a0,3 a1,3 a2,3 a3,3 c0,3 c1,3 c2,3 c3,3

  10. Solution: Scrambled Lookup Table Use 160x~224x memory space to save XORs Input block -> Lookup -> XOR -> Output block Theoretical minimum: 2 stage per round Each table costs 1KB Justifiable in switches: short pipeline, MBs of memory

  11. Evaluation Does it actually fit? Theoretically SLTs costs 160~224 KB of memory in total In practice, occupied <15% of SRAM on Tofino v1 Other resource utilization are negligible So, what is the real-world performance? 10Gbit/s @ two rounds per pass (200Gbps recirculation limit) 122% improvement compared with baseline (one round per pass)

  12. Evaluation

  13. Evaluation Macbook Pro 15" - =1.3x =0.3x Tofino Wedge100-32x, $8000 MacBook Air 13 2017, $1200 Intel i7 (AES-NI), 2.2Ghz 2 cores MacBook Pro 16 2019, $2400 Intel i7 (AES-NI), 2.6Ghz 6 cores Macbook Pro 15" - =11x =2.4x Use 50% ports for recirculation

  14. Conclusion & Future Directions AES runs on P4 switches! Please, avoid hand-rolled crypto Scrambled Lookup Table: save XORs using 160~224KB memory Two rounds per pass, improves throughput by 122% Tofino v1 is not the best device to run AES Other algorithms? Is Feistel Cipher more PISA-friendly? Fewer rounds? Dedicated crypto co-processor in switches? Our P4 code is open-source! github.com/Princeton-Cabernet/p4-projects

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#