AES Encryption Algorithm and Its Implementation

Xiaoqi Chen

Secure

pps

eed

ecure

rypto

•

“

Never roll your own crypto

”

•

CRC-32 is

likely

not secure

enough…

•

Data plane: many computational constraints

•

Short pipeline, simple arithmetic

•

Is standardized crypto still possible?

•

Maybe?

AES:

An

Intro

•

NIST

standard

encryption

algorithm

•

“Block

cipher”,

encrypt

data

bytes

at

time

•

Uses

10,

12,

or

encryption

rounds

•

Each

round:

•

AddRoundKey

•

SubBytes

•

ShiftRows

•

MixColumns

An AES

Round

16-byte

round

key

Substitution

box

(S-box)

(1)

AddRoundKey

using

XOR

(2)

SubBytes

lookup

in

S-box

An AES

Round

An AES

Round

(3)

Shift

Rows

each

row

cyclically

shifted

by

0,

1,

2,

or

positions

(4)

MixColumns

Polynomial

Multiplication

under

GF(2

),

modulus

(x

+1)

An AES

Round

Rinse

and

repeat!

AES-128:

rounds

AES-192:

rounds

AES-256:

rounds

Different

ptimization

bjectives

Embedded system

•

Very small memory

•

Don’t use large lookup tables

•

Sequential computation

•

Run many cycles

P4 switch

•

Relatively larger memory

•

For routing tables

•

Parallel data flow

•

Challenge:

limited # stages

Solution:

Scrambled

Lookup

Table

tables

per

round,

one

for

each

i,j

Pre-computed

Scrambled

Lookup

Tables

…

Solution:

Scrambled

Lookup

Table

•

Use

~224x

memory

space to save XOR

•

Input

block

->

Lookup

->

XOR

->

Output

block

•

Theoretical

minimum:

stage

per

round

•

Each

table

costs

1KB

•

Justifiable

in

switches:

short

pipeline,

MBs

of

memory

Evaluation

•

Does it actually

fit

•

Theoretically

SLT

 costs

160~224

 KB of

memory

in

total

•

In

practice

occupied

<15

of

SRAM

on

Tofino

v1

•

Other resource utilization are negligible

•

So, what is the real-world performance?

•

bit

/s @ two rounds per pass

(200Gbps

recirculation

limit)

•

improvement

compared

with

baseline

(one

round

per

pass)

Evaluation

=1.3x

=0.3x

Evaluation

Tofino

Wedge100-32x,

MacBook

Air

13”

2017,

Intel

i7

(AES-NI),

2.2Ghz

cores

MacBook

Pro

16”

2019,

Intel

i7

(AES-NI),

2.6Ghz

cores

Use

50%

ports

for

recirculation

=11x

=2.4x

Conclusion

Future

Directions

•

AES runs on

P4

switches

•

Please,

avoid

hand-rolled

crypto

•

Scrambled Lookup Table: save XOR

 using

160~224KB

memory

•

Two

rounds

per

pass,

mproves throughput by

•

Tofino v1 is not the best device to run AES

•

Other

algorithms?

Is

Feistel Cipher

more

PISA-friendly

•

Fewer rounds?

•

edicated crypto co-processor

in

switches

Our

P4

code

is

open-source!

github.com/Princeton-Cabernet/p4-projects

Slide Note

Embed Share

Download

Learn about the Advanced Encryption Standard (AES) algorithm - a NSA-approved NIST standard encryption method. Explore how AES works, its key rounds, SubBytes, ShiftRows, MixColumns operations, and its optimization for embedded systems and small memory devices. Discover the importance of secure crypto practices and the risks of rolling your own encryption. Dive into the details of AES-128, AES-192, and AES-256 rounds and their encryption processes.

danger Follow

Uploaded on Jul 14, 2024 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Shield Icon Png #432401 - Free Icons Library AES AES in P4 P4 using scrambled lookup tables Xiaoqi Chen

Secure Apps Need Secure Crypto Never roll your own crypto ! CRC-32 is likely not secure enough Data plane: many computational constraints Short pipeline, simple arithmetic Is standardized crypto still possible? Maybe?

AES: An Intro NSA Approved NIST standard encryption algorithm Block cipher , encrypt data 16 bytes at a time Uses 10, 12, or 14 encryption rounds Each round: AddRoundKey SubBytes ShiftRows MixColumns

An AES Round 16-byte round key k0,0 k1,0 k2,0 k3,0 Substitution box (S-box) k0,1 k1,1 k2,1 k3,1 In 00 01 02 03 04 05 k0,2 k1,2 k2,2 k3,2 Out 63 7B 77 7B F2 6B k0,3 k1,3 k2,3 k3,3 a0,0 a1,0 a2,0 a3,0 b0,0 b1,0 b2,0 b3,0 c0,0 c1,0 c2,0 c3,0 (1) AddRoundKey: using XOR (2) SubBytes: lookup in S-box a0,1 a1,1 a2,1 a3,1 b0,1 01 b2,1 b3,1 c0,1 7B c2,1 c3,1 a0,2 a1,2 a2,2 a3,2 b0,2 b1,2 b2,2 b3,2 c0,2 c1,2 c2,2 c3,2 a0,3 a1,3 a2,3 a3,3 b0,3 b1,3 b2,3 04 c0,3 c1,3 c2,3 F2

An AES Round c0,0 c1,0 c2,0 c3,0 c0,1 c1,1 c2,1 c3,1 c0,2 c1,2 c2,2 c3,2 c0,3 c1,3 c2,3 c3,3

An AES Round (4) MixColumns: Polynomial Multiplication under GF(28), modulus (x4+1) a(x) = 3x3+x2+x+2 b(x) = c1,0x3+c2,1x2+c3,2x+c0,3 d(x) = a(x) b(x) mod (x4+1) (3) ShiftRows: each row cyclically shifted by 0, 1, 2, or 3 positions = d1,0x3+d1,1x2+d1,2x+d1,3 c0,0 c1,0 c2,0 c3,0 c0,0 c1,0 c2,0 c3,0 d0,0 d1,0 d2,0 d3,0 c0,1 c1,1 c2,1 c3,1 c1,1 c2,1 c3,1 c0,1 d0,1 d1,1 d2,1 d3,1 c0,2 c1,2 c2,2 c3,2 c2,2 c3,2 c0,2 c1,2 d0,2 d1,2 d2,2 d3,2 c0,3 c1,3 c2,3 c3,3 c3,3 c0,3 c1,3 c2,3 d0,3 d1,3 d2,3 d3,3

An AES Round Rinse and repeat! AES-128: 10 rounds AES-192: 12 rounds AES-256: 14 rounds a0,0 a1,0 a2,0 a3,0 d0,0 d1,0 d2,0 d3,0 a0,1 a1,1 a2,1 a3,1 d0,1 d1,1 d2,1 d3,1 a0,2 a1,2 a2,2 a3,2 d0,2 d1,2 d2,2 d3,2 a0,3 a1,3 a2,3 a3,3 d0,3 d1,3 d2,3 d3,3

Different Optimization Objectives Embedded system P4 switch Very small memory Relatively larger memory Don t use large lookup tables For routing tables Sequential computation Parallel data flow Run many cycles Challenge: limited # stages

Solution: Scrambled Lookup Table 12 k1,0 k2,0 k3,0 Pre-computed Scrambled Lookup Tables 34 k1,1 k2,1 k3,1 In 00 01 02 03 04 05 In 00 01 02 03 04 05 In 00 01 02 03 04 05 71 k1,2 k2,2 k3,2 In XOR k0,0 In XOR k1,0 In XOR k0,2 12 13 10 11 16 17 34 35 36 37 30 31 55 k1,3 k2,3 k3,3 71 70 73 72 75 74 Out Out Out 63 7B 77 7B F2 6B 63 7B 77 7B F2 6B 63 7B 77 7B F2 6B a0,0 a1,0 a2,0 a3,0 c0,0 c1,0 c2,0 c3,0 16 tables per round, one for each ki,j a0,1 a1,1 a2,1 a3,1 c0,1 c1,1 c2,1 c3,1 75 a1,2 a2,2 a3,2 F2 c1,2 c2,2 c3,2 a0,3 a1,3 a2,3 a3,3 c0,3 c1,3 c2,3 c3,3

Solution: Scrambled Lookup Table Use 160x~224x memory space to save XORs Input block -> Lookup -> XOR -> Output block Theoretical minimum: 2 stage per round Each table costs 1KB Justifiable in switches: short pipeline, MBs of memory

Evaluation Does it actually fit? Theoretically SLTs costs 160~224 KB of memory in total In practice, occupied <15% of SRAM on Tofino v1 Other resource utilization are negligible So, what is the real-world performance? 10Gbit/s @ two rounds per pass (200Gbps recirculation limit) 122% improvement compared with baseline (one round per pass)

Evaluation

Evaluation Macbook Pro 15" - =1.3x =0.3x Tofino Wedge100-32x, $8000 MacBook Air 13 2017, $1200 Intel i7 (AES-NI), 2.2Ghz 2 cores MacBook Pro 16 2019, $2400 Intel i7 (AES-NI), 2.6Ghz 6 cores Macbook Pro 15" - =11x =2.4x Use 50% ports for recirculation

Conclusion & Future Directions AES runs on P4 switches! Please, avoid hand-rolled crypto Scrambled Lookup Table: save XORs using 160~224KB memory Two rounds per pass, improves throughput by 122% Tofino v1 is not the best device to run AES Other algorithms? Is Feistel Cipher more PISA-friendly? Fewer rounds? Dedicated crypto co-processor in switches? Our P4 code is open-source! github.com/Princeton-Cabernet/p4-projects

AES Encryption Algorithm and Its Implementation

Download Presentation

Presentation Transcript

Related

More Related Content