Design of Elliptic Curve Cryptosystem with Register Bank Overview
Explore the design aspects of an Elliptic Curve Cryptosystem, focusing on the Register Bank overview featuring Processor, Arithmetic Unit, and Control Unit. Learn about the organization of register banks and the functionality of distributed RAMs in Xilinx FPGAs for efficient data storage and retrieval.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Lab Session 2 Design of Elliptic Curve Cryptosystem Debdeep Mukhopadhyay Chester Rebeiro Dept. of Computer Science and Engineering Indian Institute of Technology Kharagpur INDIA
The Processor Overview Register Bank: regbank.v (stores temporary data) ROM: stores the curve constant b and the base points. Arithmetic Unit: ec_alu.v (performs the underlying field computations) Control Unit: smul.v (sequences the field computations for performing the point operations)
Register Bank Heart of the register file is 8 registers of size 233 bits. Organized as 3 banks:RA, RB, RC Dual port Distributed RAMs of the Xilinx FPGAs Asynchronous Read Synchronous Write we: write enable signal Inputs: C0, C1, Qout Outputs: A0, A1, A2, A3 and Qin
Module regbank.v assign qin = (cwh[21] == 1'b1) ? a1 : a2; module regbank(clk, cwh, c0, c1, a0, a1, a2, a3); input wire clk; assign rb1_addr1 = {3'b0, cwh[0]}; assign rb1_addr2 = {3'b0, cwh[1]}; assign rb2_addr1 = {2'b0, cwh[4:3]}; assign rb2_addr2 = {2'b0, cwh[6:5]}; assign rb3_addr1 = {3'b0, cwh[8]}; assign rb3_addr2 = {3'b0, cwh[9]}; input wire [22:0] cwh; /* control word */ input wire [232:0] c0, c1; /* Inputs to regbank from ALU */ output wire [232:0] a0, a1, a2, a3; /* Output from regbank to ALU */ wire rb1_we; wire [3:0] rb1_addr1, rb2_addr1, rb3_addr1; wire [3:0] rb1_addr2, rb2_addr2, rb3_addr2; /* a0 to a3 are fed to the ALU */ assign a0 = (cwh[11] == 1'b0) ? rb1_dout1 : rb2_dout2; assign a2 = (cwh[12] == 1'b0) ? rb2_dout1 : rb1_dout2; assign a1 = (cwh[13] == 1'b0) ? rb3_dout1 : rb2_dout2; assign a3 = rb3_dout2; wire [232:0] rb1_din, rb2_din, rb3_din; wire [232:0] rb1_dout1, rb2_dout1, rb3_dout1; wire [232:0] rb1_dout2, rb2_dout2, rb3_dout2; wire [232:0] qin, qout; /* Select what get written into RAM */ assign rb1_we = cwh[2]; assign rb1_din = (cwh[22] == 1'b1) ? c0: ((cwh[14] == 1'b0) ? c0 : c1); /* Instances of distributed memory */ XC3S_RAM16X233 regbank1(rb1_din, rb1_addr1, rb1_addr2, rb1_we, clk, rb1_dout1, rb1_dout2); XC3S_RAM16X233_regbank2(rb2_din, rb2_addr1, rb2_addr2, rb2_we, clk, rb2_dout1, rb2_dout2); assign rb2_we = cwh[7]; assign rb2_din = (cwh[22] == 1'b1) ? c1: ((cwh[15] == 1'b0) ? c0 : c1); XC3S_RAM16X233_D regbank3(rb3_din, rb3_addr1, rb3_addr2, rb3_we, clk, rb3_dout1, rb3_dout2); assign rb3_we = cwh[10]; assign rb3_din = (cwh[22] == 1'b1) ? 233'h1 (cwh[20] == 1'b1) ? qout: c0; endmodule /* Quadblock instance */ : bquadblk bqb(cwh[20], qin, cwh[19:16], qout);
The ALU for the ECC Processor The computation of the AU has 2 phases: Point addition and doubling Inversion The AU consists of: Hybrid Karatsuba Multiplier (used at all time steps) Quad block (used in phase 2 for the final invesion) The Quadblock has 14 cascade steps (as described before) and computes the quading operation repeatedly (as per the value of c[29 26] The AU performs: point operations (doubling and addition) in Lopez Dahab Projective co- ordinates efficiently. Inversion (from projective co-ordinates) to affine co-ordinates. The AU has 5 inputs (the outputs of the regbank) and 3 outputs (the inputs of the regbank).
The verilog code for ALU module ec_alu(cw, a0, a1, a2, a3, c0, c1); /* Choose the inputs to the Multiplier */ mux8 muxA(a0, a0sq, a2, sa7, sd2, a1, a1qu, 233'd0, cw[2:0], minA); mux8 muxB(a1, a1sq, sa4, sa8, sd2_1, a3, a2qu,a1qu, cw[5:3], minB); input wire [232:0] a0, a1, a2, a3; /* the inputs to the alu */ input wire [9:0] cw; /* the control word */ output wire [232:0] c0, c1; /* the alu outputs */ /* Temporary results */ /* Choose the outputs of the ALU */ mux4 muxC(mout, sa2, a1sq, sc1, cw[7:6], c0); mux4 muxD(sa8_1, sa5, a1qu, sd2, cw[9:8], c1); wire [232:0] a0sq, a0qu; wire [232:0] a1sq, a1qu; wire [232:0] a2sq, a2qu; wire [232:0] sa2, sa4, sa5, sa7, sa8, sa8_1; assign sa2 = mout ^ a2; assign sa4 = a1sq ^ a2; assign sa5 = mout ^ a2sq ^ a0; assign sa7 = a0 ^ a2; assign sa8 = a1 ^ a3; assign sa8_1 = mout ^ a0; wire [232:0] sc1; wire [232:0] sd2, sd2_1; /* Multiplier inputs and output */ wire [232:0] minA, minB, mout; multiplier mul(minA, minB, mout); assign sc1 = mout ^ a3; squarer sq1_p0(a0, a0sq); squarer sq_p1(a1, a1sq); assign sd2 = a0qu ^ a1; assign sd2_1 = a2sq ^ a3 ^ a1; squarer sq_p2(a2, a2sq); squarer sq2_p2(a2sq, a2qu); squarer sq2_p1(a1sq, a1qu); endmodule squarer sq2_p3(a0sq, a0qu);
Control Unit The CU is hardwired. It generates 33 control signals, which determine the flow of data. c[0 9]: controls the input to the multiplier and the output C0 and C1 of the AU c[26 29]: select line of the multiplexers inside the quad block Remaining control lines are for read and write of the registers in the register file
Projective Point Arithmetic Point Doubling: Input : (X1,Y1,Z1), Output: (X4, Y4, Z4) Constraint: One multiplier Can perform one multiplication per clock cycle Hence needs four time steps
Projective Point Doubling Sequencing
Projective Point Addition Input : P=(X1,Y1,Z1), Q=(x2,y2) Output: (X3, Y3, Z3) Constraint: One multiplier Can perform one multiplication per clock cycle Hence needs eight time steps
Projective Point Addition Sequencing
The verilog code for the CU /* Output Logic */ always @(state) begin case(state) 6'd0: begin end 6'd1: begin end 6'd2: begin end /* The Doubling*/ 6'd3: begin /* Double Step 1 */ cwl <= 10'h209; cwh <= 23'h0x8490; end 6'd4: begin /* Double Step 1a */ cwl <= 10'h002; cwh <= 23'h0x20F0; end 6'd5: begin /* Double Step 2 */ cwl <= 10'h324; cwh <= 23'h0x6544; end /* The Addition States */ 6'd7: begin /* Addition Step 1 */ cwl <= 10'h048; cwh <= 23'h0x08a0; end 6'd8: begin /* Addition Step 2 */ cwl <= 10'h002; cwh <= 23'h0x5006; end 6'd9: begin /* Addition Step 3 */ cwl <= 10'h028; cwh <= 23'h0x0090; end 6'd10: begin /* Addition Step 4 */ cwl <= 10'h011; cwh <= 23'h0x0214; end 6'd11: begin /* Addition Step 5 */ cwl <= 10'h102; cwh <= 23'h0x6544; end 6'd12: begin /* Addition Step 6 */ cwl <= 10'h08A; cwh <= 23'h0xB4D2; end 6'd13: begin /* Addition Step 7 */ cwl <= 10'h00B; cwh <= 23'h0x18A2; end 6'd14: begin /* Addition Step 8 */ cwl <= 10'h058; cwh <= 23'h0x0ac0; end cwl <= 10'h000; /* Init L2R Step 1 */ cwh <= 23'h4x8484; cwl <= 10'h000; cwh <= 23'h4x808D; /* Init L2R Step 2 */ cwl <= 10'hx; /* Init L2R Step 3 */ cwh <= 23'h4xx098;
The verilog snippet /* The final Inverse : Starting the Itoh Tsujii here*/ 6'd15: begin /* Inv 1 */ cwl <= 10'hx0D; cwh <= 23'h0x04x0; /* The first a=a^3 */ end 6'd16: begin /* Inv 2 */ cwl <= 10'hx06; cwh <= 23'h0x0090; end 6'd17: begin /* Inv 3 */ cwl <= 10'hx35; cwh <= 23'h0x0090; end 6'd18: begin /* Inv 4-1 */ cwl <= 10'hx; cwh <= 23'h130510; end 6'd19: begin /* Inv 4-2 */ cwl <= 10'hx02; cwh <= 23'h0x0190; end 6'd20: begin /* Inv 5 */ cwl <= 10'hx35; cwh <= 23'h0x0090; end end 6'd21: begin /* Inv 6-1 */ cwl <= 10'hx; cwh <= 23'h170510; end 6'd22: begin /* Inv 6-2 */ cwl <= 10'hx02; cwh <= 23'h0x0190; end 6'd23: begin /* Inv 7-1 */ cwl <= 10'hx; cwh <= 23'h1E0510; 6'd24: begin /* Inv 7-2 */ end 6'd25: begin /* Inv 8 */ end 6'd26: begin /* Inv 9-1 */ end 6'd27: begin /* Inv 9-2 */ end 6'd28: begin end cwl <= 10'hx02; cwh <= 23'h0x0190; cwl <= 10'hx35; cwh <= 23'h0x0090; cwl <= 10'hx; cwh <= 23'h1E0510; cwl <= 10'hx; cwh <= 23'h3E0500; cwl <= 10'hx3A; /* Inv 9-3 */ cwh <= 23'h0x0190; endmodule
The verilog snippet /* the next state logic */ /* Addition States */ 6'd7: nextstate <= 6'd8; 6'd8: nextstate <= 6'd9; 6'd9: nextstate <= 6'd10; 6'd10: nextstate <= 6'd11; 6'd11: nextstate <= 6'd12; 6'd12: nextstate <= 6'd13; 6'd13: nextstate <= 6'd14; 6'd14: begin end /* The Itoh Tsujii States */ 6'd15: nextstate <= 6'd16; 6'd16: nextstate <= 6'd17; 6'd17: nextstate <= 6'd18; 6'd18: nextstate <= 6'd19; 6'd19: nextstate <= 6'd20; 6'd20: nextstate <= 6'd21; 6'd21: nextstate <= 6'd22; 6'd22: nextstate <= 6'd23; 6'd23: nextstate <= 6'd24; 6'd24: nextstate <= 6'd25; 6'd25: nextstate <= 6'd26; 6'd26: nextstate <= 6'd27; 6'd27: nextstate <= 6'd28; 6'd28: nextstate <= 6'd29; 6'd29: nextstate <= 6'd30; 6'd30: nextstate <= 6'd31; 6'd31: nextstate <= 6'd32; 6'd32: nextstate <= 6'd33; 6'd33: nextstate <= 6'd34; 6'd34: nextstate <= 6'd35; 6'd35: nextstate <= 6'd36; 6'd36: nextstate <= 6'd37; 6'd37: nextstate <= 6'd38; 6'd38: nextstate <= 6'd38; default: nextstate <= 6'bx; endcase end always @(state) begin case(state) /* Init states */ 6'd0: nextstate <= 6'd1; 6'd1: nextstate <= 6'd2; if(ef == 1'b1) else 6'd2: nextstate <= 6'd3; nextstate <= 6'd3; /* Double States */ nextstate <= 6'd15; 6'd3: nextstate <= 6'd4; 6'd4: nextstate <= 6'd5; 6'd5: nextstate <= 6'd6; 6'd6: begin if(k[`KEYMSB] == 1'b1) doubling if K0 is 1 */ nextstate <= 6'd7; /* Do Addition and else if (ef == 1'b0) are in the last iteration, goto end */ nextstate <= 6'd15; /* k[0]=0 and we else do next doubling */ nextstate <= 6'd3; /* Skip addition and end
IO Interface of the scalar multiplier /* The input to regbank is either constants or the results from ALU */ assign c0r = (cwh[22] == 1'b0) ? c0a : `BASEPOINT_X; assign c1r = (cwh[22] == 1'b0) ? c1a : yconstants; assign yconstants = (state != 6'd2) ? `BASEPOINT_Y : `CURVECONSTANT_B;
Output /* Store the results after converting back into affine coordinates */ assign sx = (key != 233'b1) ? a0 : `BASEPOINT_X; assign sy = (key != 233'b1) ? a2 : `BASEPOINT_Y; assign done = (state == 6'd38) ? 1 : 0; /* Set done to 1 if multiplication is complete */
The Module counter `define KEYSIZE 32 `define KEYMSB 31 module counter (clk, nrst, e); input wire clk; /* clock used for the counter */ input wire nrst; /* active low reset */ output wire e; /* set to 0 if count = 0 */ This module computes the number of key bits which are processed. Before the leading 1 is detected, it just shifts the key (at a fast clock) After, the leading 1 is detected, it updates at the start of doubling. reg [7:0] count; /* ...and the register which actually decrements */ /* activate e when count reaches 0 */ assign e = |count; always @(posedge clk or negedge nrst) begin if(nrst == 1'b0) count <= 8'd`KEYMSB; else count <= count - 1'b1; end endmodule
The clocks in the design always @(posedge clk) begin k[(`KEYMSB - 1)]; end /* Shift register for the key. The current bit is always the MSB */ always @(posedge clk) begin if (nrst == 1'b0) k <= key; else if (start == 1'b0) /* if start=0, shift every clock cycle */ k[`KEYMSB:0] <= {k[(`KEYMSB-1):0], 1'b0}; else if (state == 6'd4) /* if start=1, shift once very iteration of multiplier */ k[`KEYMSB:0] <= {k[(`KEYMSB-1):0], 1'b0}; else /* else, don't shift */ k[`KEYMSB:0] <= k[`KEYMSB:0]; end if (nrst == 1'b0) else start <= key[`KEYMSB]; start <= tstart | /* Generate clock signal for counter */ always @(posedge clk) begin end if (nrst == 1'b0) begin end else begin end cck <= 1'b0; /* Detect the first 1 */ assign tstart = start; case(state) 6'd3: cck <= ~cck; 6'd4: cck <= ~cck; default: cck <= cck; endcase /* The counter clock is either a fast clock (clk) used during * leading 1 detection, or a slow clock used during multiplication */ assign mainclk = (start == 1'b0) ? clk : cck;
Testing of the design Top level Interface of the design: module ecsmul(clk, nrst, key, sx, sy, done); Test-bench: module tb(); reg clk; reg nrst; wire done; reg [31:0] kmem [7:0]; reg [255:0] k; wire [232:0] sx, sy; reg [232:0] res[1:0]; integer i;
A testbench initial begin #100000 end initial begin $readmemh("../../scratch/tv.txt", kmem); #50 for(i=0; i<8; i = i+1) begin $display("%d %h\n", i, kmem[i]); end k = {kmem[0], kmem[1], kmem[2], kmem[3], kmem[4], kmem[5], kmem[6], kmem[7]}; // k = `KEYSIZE'hC7; $display("%h\n", k[`KEYSIZE - 1:0]); clk = 1'b0; nrst = 1'b1; #10 nrst = 1'b0; #110 nrst = 1'b1; #100000 $display("%h\n%h", sx, sy); res[0] = sx; res[1] = sy; #1 $writememh("../../scratch/res_verilog.txt", res); $finish; end $dumpfile ("dump.vcd"); $dumpvars; $dumpon; $dumpoff; ecsmul ec_mul(clk, nrst, k[`KEYMSB:0], sx, sy, done); always begin #100 clk =~clk; end endmodule
Verification of Correctness We use the elliptic curve code by M. Rosing in the book Implementation of Elliptic Curve Cryptography Go to directory ld_rosing/polymath/basics/ In file elliptic_poly.c Lines 1232 to 1239 contain the scalar t1.e[0] is most significant word and t1.e[7] is the least significant Word. Save file and exit On a linux machine run make to create the executable Execute elliptic_poly