BARE METAL EXPERIENCES

In this presentation at 040Coders, Herman Roebbers explains why copying CAN FD messages on the NXP S32K144 was unexpectedly slow and how to speed it up, and how to reduce the energy consumption of TLS operations by a factor of 9. The talk works from real measurements towards practical optimizations for bare-metal embedded systems, and draws on his work on energy-autonomous systems for IoT applications.

  • Bare Metal Experiences
  • Energy Consumption
  • TLS Operations
  • IoT Applications
  • Embedded Systems

Uploaded on Feb 22, 2025 | 0 Views


Presentation Transcript


  1. BARE METAL EXPERIENCES Herman Roebbers @ 040coders 20-2-2020

  2. AGENDA
     - About Altran
     - About me
     - CAN FD message copying on S32K144 is very slow
     - How to reduce energy consumption for TLS operations by a factor of 9
     - Reference score
     - And the winner is
     - Conclusions
     Bare Metal Experiences @ 040Coders, 2020-02-20

  3. 1 ABOUT ALTRAN

  6. 2 ABOUT ME

  7. INTRODUCTION Herman Roebbers, Advanced Expert, Thought Leader @ Altran NL

  8. INTRODUCTION Who did I work with or for

  9. INTRODUCTION
     - Lecturer / speaker
     - TTW/NWO project ZERO: Energy-autonomous systems for IoT. 3 TUs and 14 companies
     - Member of review committee of CPA conferences (www.wotug.org)
     - Guest lecturer / Student project coach
     - External advisor for 3 ULP benchmarks (eembc.org)
     - Driving energy-autonomous (meeting room) display development

  10. 3 CAN FD MESSAGE COPYING ON S32K144 IS VERY SLOW. WHAT TO DO?

  11. OBSERVATION
      Copying a 64-byte message from a CAN FD message buffer to SRAM takes far more time than expected (100 µs).
      MCU: NXP S32K144
      - 80 MHz Cortex-M4F running from flash
      - Internal 64 kB ECC SRAM
      - Internal 512 kB ECC Flash
      - 4 KB code cache
      - CAN FD controllers (FlexCAN)
      - HW crypto block (CSEc)

  12. CAN FD
      ISO CAN FD (Flexible Data rate): new CAN protocol
      - Message size up to 64 bytes (Classical CAN: 8 bytes max)
      - Data bit rate up to 8 Mbps (Classical CAN: 1 Mbps max)
      ISO CAN FD controller (FlexCAN)
      - No Rx FIFO in FD mode
      - Read message data from message buffer
      - Cannot use DMA in FD mode
      - Must read message data by hand

  13. CAN FD
      Original code:
          void CopyCanMsgDataB( const CanMsg * inMsg, const CanMsg * outMsg)
          {
              unsigned int i;
              unsigned char * pInByte = (unsigned char *) inMsg;
              unsigned char * pOutByte = (unsigned char *) outMsg;
              for ( i = 0 ; i < sizeof(CanMsg) ; i++ )
              {
                  pOutByte[i] = pInByte[i];
              }
          }
      Assembly:
          CopyCanMsgDataB:
              sub  r3, r0, #1
              sub  r1, r1, #1
              add  r0, r0, #71
          .L2:
              ldrb r2, [r3, #1]!
              cmp  r3, r0
              strb r2, [r1, #1]!
              bne  .L2
              bx   lr

  14. CAN FD
      Assembly: total cycles (P=1 for SRAM, >= 1 for flash)
          CopyCanMsgDataB:              Cycles
              sub  r3, r0, #1           1
              sub  r1, r1, #1           1
              add  r0, r0, #71          1
          .L2:
              ldrb r2, [r3, #1]!        2
              cmp  r3, r0               1
              strb r2, [r1, #1]!        3
              bne  .L2                  1+P
              bx   lr                   1+P
      Total: 3 + 72 * (2+1+3+3) + 2 = 3 + 72 * 9 + 2 = 3 + 648 + 2 = 653

  15. CAN FD
      Loop:
          for ( i = 0 ; i < sizeof(CanMsg) ; i++ )
          {
              pOutByte[i] = pInByte[i];
          }
      Translates to:
      - Load byte from CAN FD message buffer
      - Store byte to SRAM
      - Increment i; compare i
      - Back to start if i < sizeof(CanMsg)

  16. SRAM STORE Q: How many cycles to store a byte or half word to SRAM?

  17. SRAM Q: How many cycles to store a byte or half word to SRAM? A: 1

  18. ECC SRAM Q: How many cycles to store a byte or half word to ECC SRAM?

  19. ECC SRAM Q: How many cycles to store a byte or half word to ECC SRAM? A: 3 ???

  20. ECC SRAM Q: How many cycles to store a byte to ECC SRAM? A: 3 ???
      Rationale:
      - We have ECC SRAM (Error Checking and Correcting code)
        o SECDED: Single Error Correct, Double Error Detect
      1. Read word (get data, calculate and verify ECC)
      2. Update data byte and generate new ECC
      3. Write data word + ECC to memory

  21. CAN FD
      Better loop (must ensure proper alignment of outMsg):
          void CopyCanMsgDataW(const CanMsg * inMsg, const CanMsg * outMsg)
          {
              unsigned int i;
              uint32_t * pInW = (uint32_t *) inMsg;
              uint32_t * pOutW = (uint32_t *) outMsg;
              for ( i = 0 ; i < (sizeof(CanMsg) / sizeof(uint32_t)) ; i++ )
              {
                  pOutW[i] = pInW[i];
              }
          }
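
The alignment requirement of the word-wise loop can be checked at run time before copying. The following is a minimal sketch, not the author's code: the `CanMsg` layout (18 words, 72 bytes, matching the `#71` and `#68` offsets in the assembly listings) and the names `is_word_aligned` and `copy_can_msg_words` are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the CanMsg type: 18 words = 72 bytes. */
typedef struct {
    uint32_t word[18];
} CanMsg;

/* Returns 1 when p lies on a 32-bit boundary, 0 otherwise. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p % sizeof(uint32_t)) == 0;
}

/* Word-wise copy that refuses misaligned buffers.
 * Returns 0 on success, -1 if either pointer is not word aligned. */
static int copy_can_msg_words(const CanMsg *in, CanMsg *out)
{
    const uint32_t *src = (const uint32_t *)in;
    uint32_t *dst = (uint32_t *)out;
    size_t i;

    if (!is_word_aligned(in) || !is_word_aligned(out)) {
        return -1;
    }
    for (i = 0; i < sizeof(CanMsg) / sizeof(uint32_t); i++) {
        dst[i] = src[i];
    }
    return 0;
}
```

In production code the check would more likely be a compile-time guarantee (for example a suitably aligned buffer type), so that the copy routine itself stays free of extra branches.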

  22. CAN FD
      Assembly: total cycles (P=1 for SRAM, can be > 3 for flash)
          CopyCanMsgDataW:              Cycles
              sub  r3, r0, #4           1
              sub  r1, r1, #4           1
              add  r0, r0, #68          1
          .L7:
              ldr  r2, [r3, #4]!        2
              cmp  r3, r0               1
              str  r2, [r1, #4]!        2
              bne  .L7                  1+P
              bx   lr                   2
      Total: 3 + 18 * (2+1+2+1+P) + 2 = 3 + 18 * 7 + 2 = 3 + 126 + 2 = 131 (code in SRAM)
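
Both cycle totals follow the same pattern: setup instructions, plus iterations times cycles per iteration, plus the return. As a small sketch of that arithmetic (the function name `loop_total_cycles` is hypothetical; the per-instruction counts are the ones given on the slides):

```c
/* Total cycles of a simple copy loop:
 * setup + iterations * cycles_per_iteration + return. */
static unsigned loop_total_cycles(unsigned setup, unsigned iterations,
                                  unsigned cycles_per_iteration, unsigned ret)
{
    return setup + iterations * cycles_per_iteration + ret;
}
```

With the slide values, `loop_total_cycles(3, 72, 9, 2)` gives 653 for the byte-wise loop and `loop_total_cycles(3, 18, 7, 2)` gives 131 for the word-wise loop, a ratio of roughly 5.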

  23. CAN FD
      Translates to:
      - Load 32-bit word from CAN FD message buffer
      - Store 32-bit word to SRAM
      - Increment i; compare i; branch back if not done
      This time we do 4 bytes at a time, so we do 4 times fewer accesses. Also, because we write 32 bits at a time, we don't need read-modify-write: we can do the write in 1 cycle instead of 3. The result is 12 times less access time and 4 times less loop overhead.

  24. COPY
      More improvement:
      - We still have loop overhead (increment, test, branch).
      - Let's unroll to get rid of it.
          void CopyCanMsgData( const CanMsg * inMsg, const CanMsg * outMsg)
          {
              uint32_t * pInWord = (uint32_t *) inMsg;
              uint32_t * pOutWord = (uint32_t *) outMsg;
              pOutWord[0] = pInWord[0];
              /* ... words 1 to 16 are copied the same way ... */
              pOutWord[17] = pInWord[17];
          }
      No more loop overhead!
      Total cycles (P=1 for SRAM, can be > 3 for flash):
          CopyCanMsgData:               Cycles
              ldr  r3, [r0]             2 (not pipelined)
              str  r3, [r1]             1
              ....
              ldr  r3, [r0, #68]        1 (pipelined)
              str  r3, [r1, #68]        1
              bx   lr                   1+P
      = (18 * 2) + 1 + 1 = 36

  25. COPY
      Alternative:
          void CopyCanMsgData(const CanMsg * inMsg, const CanMsg * outMsg)
          {
              memcpy((char *) outMsg, (const char *) inMsg, sizeof(CanMsg));
          }
      Performance depends on the implementation of memcpy():
      - The gcc implementation will do a byte copy (slow).
      - IAR has optimized copy routines for any alignment, so it will also work very fast.
      - But both have loop overhead (it may be possible to specify a loop-unroll compiler option).

  26. COMPARE RESULTS (@ -O2)
      Name              Cycles (code in SRAM)   Size (Bytes)   Improvement factor
      CopyCanMsgDataB   653                     16             1
      CopyCanMsgDataW   131                     16             653/131 = 5
      CopyCanMsgData    36                      38             653/36 = 18

  27. HOW ABOUT RUNNING FROM FLASH?
      S32K144 flash:
      - Program: 512 KB, Data: 64 KB
      - Data path width: 128 bits, 64 bits
      - 128-bit width means 8 16-bit instructions or 4 32-bit instructions or a mix.
      - 128 bits is also the size of the prefetch buffer.
      Flash clock is 1/N of system clock:
      - N-1 wait states per IFetch
      - System clock @ 112 MHz
      - Flash < 28 MHz (N=4)

  28. COMPARE RESULTS (@ -O2)
      Name              Cycles (code in Flash)   Size (Bytes)   Improvement factor
      CopyCanMsgDataB   653                      16             1
      CopyCanMsgDataW   131                      16             653/131 = 5
      CopyCanMsgData    36                       38             653/36 = 18

  29. CONCLUSION
      - Understand your system.
      - Check if you have SRAM with ECC and understand the impact on byte/halfword writes.
      - You must initialize ECC memory using 32-bit writes prior to first read!
      - Check your copy routines, e.g. memcpy().
      - It can be very worthwhile implementing your own copy routines.
      - Don't assume the available vendor/compiler libraries are optimal/fastest.
      - Execution time went down from 100 to 8 µs!
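
The ECC initialization rule above can be sketched as a plain word-fill loop. This is an illustrative sketch only: the function name `ecc_ram_init` is hypothetical, and on a real target the region would come from linker-provided symbols or the vendor startup code, neither of which is shown in the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a word-aligned region using only 32-bit stores, so that every
 * word has a valid ECC before anything reads it.
 * On target, start would point at the ECC SRAM region (assumption). */
static void ecc_ram_init(volatile uint32_t *start, size_t word_count)
{
    size_t i;

    for (i = 0; i < word_count; i++) {
        start[i] = 0u; /* full-word store: no read-modify-write involved */
    }
}
```

Byte or halfword stores must be avoided here, because on ECC SRAM they first read the containing word, which has no valid ECC yet.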

  30. 2 MAXIMIZE EEMBC SECUREMARK-TLS SCORE

  31. EEMBC SECUREMARK-TLS
      - EEMBC = Embedded Microprocessor Benchmarking Consortium (http://eembc.org)
      - I am external advisor for EEMBC on Ultra Low Power benchmarks (http://eembc.org/#ulp).
      - SecureMark-TLS aims to measure energy consumption of typical TLS operations on an embedded platform.

  32. FACTORS INFLUENCING ENERGY CONSUMPTION
      There are many factors, and then some (diagram labels):
      - Hardware; Software; HW+SW
      - Radio protocol (Zigbee / BLE / Zwave / WiFi / LoRa, ...)
      - Radio frequency (2.4 GHz, 868 MHz, 433 MHz, ...)
      - Radio technology configuration
      - Application SW
      - Compiler & compiler settings
      - OS
      - Sensor application
      - Printed circuit board
      - HW accelerators
      - Low power modes
      - Processor
      - Battery technology
      - IP blocks
      - Process technology

  33. EEMBC SECUREMARK-TLS
      This means that the following encryption algorithms must be executed:
      - AES128 ECB encrypt with 144, 224, 320 and 2K byte data
      - AES128 CCM decrypt with 168 byte data
      - ECDH p256r1 Secret Mix
      - ECDSA p256r1 Sign
      - ECDSA p256r1 Verify
      - SHA-256 with 23, 57, 384 and 4224 byte data
      - SHA-256 + AES mix
      Each algorithm will be run for > 10 seconds, during which energy consumption will be measured.
      SecureMark-TLS score = $1000 / \sum_{i=0}^{n} w_i E_i$ (with $E_i$ the energy measured for algorithm $i$ and $w_i$ its weight)
      The higher the score, the better (less energy).

  34. EEMBC SECUREMARK-TLS
      I was asked to optimize the initial implementation, which uses the ARM mbed-TLS crypto library.
      Execution platform: STM32L4A6ZG processor (Cortex-M4F optimized for low power) on Nucleo-L4A6ZG evaluation board.
      - IDE: Atollic TrueStudio for STM32 v9.3.0
      - CubeMX V1.0, version MX.4.26.0
      - Libraries: STM32L4xx library 1.12

  35. HARDWARE
      STM32L4A6ZG
      - Cortex-M4F
      - Max 80 MHz
      - External crystal 16 MHz, 32.768 kHz
      - Internal HF oscillator
      - PLL
      - UART
      - Flash: 1 MB (ICODE/DCODE memory) with ART flash accelerator
      - SRAM 1: 256 kB (system memory)
      - SRAM 2: 64 kB (ICODE/DCODE memory)
      - Crypto HW (not used now, as not all needed crypto functionality is available in HW or supported by low-level driver SW)

  36. 5 REFERENCE SCORE

  37. BASELINE
      SecureMark-TLS: 505
      Normalized to 3.0Vdd/1.2Vcore: 505
      Normalized to 1.8Vdd/1.0Vcore < 26 MHz: 1007
      Variables:
      - VCC: 3.0 V
      - Vcore: 1.2 V
      - Freq: 80 MHz
      - Oscillator: 16 MHz HSI
      - PLL setting
      - UART/GPIO clk on
      - Optimizer setting: O0
      - Linker settings (flash)
      - Optimized memcpy: N
      - Optimized memset: N
      - Flash ws: 6
      Benchmark                   Energy    Time      Power     Iterations
      AES128 ECB Encrypt [144B]   11.6 uJ   290 us    39.9 mW   59305
      AES128 ECB Encrypt [224B]   17.1 uJ   427 us    40.1 mW   40000
      AES128 ECB Encrypt [320B]   23.8 uJ   591 us    40.2 mW   30000
      AES128 CCM Encrypt [52B]    21.6 uJ   556 us    38.9 mW   52821
      AES128 CCM Decrypt [168B]   43.6 uJ   1.11 ms   39.2 mW   20000
      ECDH p256r1 Secret Mix      45.2 mJ   1.19 s    37.8 mW   25
      ECDSA p256r1 Sign           18.6 mJ   498 ms    37.5 mW   22
      ECDSA p256r1 Verify         64.4 mJ   1.71 s    37.6 mW   20
      SHA256 [23B]                3.36 uJ   87.2 us   38.6 mW   320763
      SHA256 [57B]                8.27 uJ   213 us    38.7 mW   131855
      SHA256 [384B]               53.8 uJ   1.38 ms   38.8 mW   21381
      SHA256+AES Mix              291 uJ    7.53 ms   38.6 mW   4000
      SHA256 [4224B]              591 uJ    15.2 ms   38.8 mW   2000
      Data Tx (AES ENC) [2KB]     143 uJ    3.54 ms   40.5 mW   5000
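
The Energy, Time and Power columns of the baseline table appear to be related by energy = average power x time. The slides do not state this explicitly, but the rows check out, as this small sketch shows (the function name `energy_uj` is hypothetical):

```c
/* Energy in microjoules from average power in milliwatts and time in
 * microseconds: mW * us = nJ, so divide by 1000 to get uJ. */
static double energy_uj(double power_mw, double time_us)
{
    return power_mw * time_us / 1000.0;
}
```

For the AES128 ECB Encrypt [144B] row, 39.9 mW over 290 us gives about 11.57 uJ, which matches the reported 11.6 uJ; for SHA256 [23B], 38.6 mW over 87.2 us gives about 3.37 uJ against the reported 3.36 uJ.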

  38. BASELINE 505, GAIN 1336/505 = 2.65
      SecureMark-TLS: 1336
      Normalized to 3.0Vdd/1.2Vcore: 1336
      Normalized to 1.8Vdd/1.0Vcore < 26 MHz: 2663
      Variables:
      - VCC: 3.0 V
      - Vcore: 1.2 V
      - Freq: 80 MHz
      - Oscillator: 16 MHz HSI
      - PLL setting
      - UART/GPIO on
      - Optimizer setting: O1
      - Linker settings (flash)
      - Optimized memcpy: N
      - Optimized memset: N
      - Flash ws: 6
      Benchmark                   Energy    Time      Power     Iterations
      AES128 ECB Encrypt [144B]   6.44 uJ   172 us    37.4 mW   59305
      AES128 ECB Encrypt [224B]   9.38 uJ   250 us    37.4 mW   40000
      AES128 ECB Encrypt [320B]   12.9 uJ   345 us    37.3 mW   30000
      AES128 CCM Encrypt [52B]    11.7 uJ   313 us    37.4 mW   52821
      AES128 CCM Decrypt [168B]   23.0 uJ   613 us    37.4 mW   20000
      ECDH p256r1 Secret Mix      16.9 mJ   462 ms    36.5 mW   25
      ECDSA p256r1 Sign           7.20 mJ   197 ms    36.5 mW   57
      ECDSA p256r1 Verify         24.6 mJ   674 ms    36.5 mW   20
      SHA256 [23B]                740 nJ    20.1 us   36.6 mW   520033
      SHA256 [57B]                1.78 uJ   48.4 us   36.8 mW   216664
      SHA256 [384B]               10.2 uJ   279 us    36.6 mW   37612
      SHA256+AES Mix              62.9 uJ   1.71 ms   36.7 mW   6132
      SHA256 [4224B]              112 uJ    3.05 ms   36.6 mW   3436
      Data Tx (AES ENC) [2KB]     76.3 uJ   2.04 ms   37.3 mW   5000

  39. BASELINE 505, GAIN 1448/1336 = 1.08
      SecureMark-TLS: 1448
      Normalized to 3.0Vdd/1.2Vcore: 1448
      Normalized to 1.8Vdd/1.0Vcore < 26 MHz: 2886
      Variables:
      - VCC: 3.0 V
      - Vcore: 1.2 V
      - Freq: 80 MHz
      - Oscillator: 16 MHz HSI
      - PLL setting
      - UART/GPIO clk off*
      - Optimizer setting: O1
      - Linker settings (flash)
      - Optimized memcpy: N
      - Optimized memset: N
      - Flash ws: 6
      * Reading/writing GPIO pins still works. Generating pin interrupts won't.
      Benchmark                   Energy    Time      Power     Iterations
      AES128 ECB Encrypt [144B]   5.95 uJ   172 us    34.6 mW   59305
      AES128 ECB Encrypt [224B]   8.69 uJ   250 us    34.6 mW   40000
      AES128 ECB Encrypt [320B]   11.9 uJ   345 us    34.6 mW   30000
      AES128 CCM Encrypt [52B]    10.8 uJ   312 us    34.6 mW   52821
      AES128 CCM Decrypt [168B]   21.3 uJ   613 us    34.7 mW   20000
      ECDH p256r1 Secret Mix      15.6 mJ   462 ms    33.7 mW   25
      ECDSA p256r1 Sign           6.65 mJ   197 ms    33.6 mW   57
      ECDSA p256r1 Verify         22.7 mJ   673 ms    33.6 mW   20
      SHA256 [23B]                686 nJ    20.1 us   33.9 mW   520033
      SHA256 [57B]                1.65 uJ   48.4 us   34.1 mW   216664
      SHA256 [384B]               9.47 uJ   279 us    33.9 mW   37612
      SHA256+AES Mix              58.2 uJ   1.71 ms   34.0 mW   6132
      SHA256 [4224B]              103 uJ    3.05 ms   33.9 mW   3436
      Data Tx (AES ENC) [2KB]     70.7 uJ   2.04 ms   34.6 mW   5000

  40. BASELINE 505, GAIN 1524/1488 = 1.024
      SecureMark-TLS: 1524
      Normalized to 3.0Vdd/1.2Vcore: 1524
      Normalized to 1.8Vdd/1.0Vcore < 26 MHz: 3038
      Variables:
      - VCC: 3.0 V
      - Vcore: 1.2 V
      - Freq: 80 MHz
      - Oscillator: 16 MHz HSI
      - PLL setting
      - UART/GPIO off
      - Optimizer setting: O1
      - Linker settings (flash)
      - Optimized memcpy: Y
      - Optimized memset: N
      - Flash ws: 6
      Benchmark                   Energy    Time      Power     Iterations
      AES128 ECB Encrypt [144B]   6.07 uJ   176 us    34.3 mW   59305
      AES128 ECB Encrypt [224B]   8.82 uJ   255 us    34.4 mW   40000
      AES128 ECB Encrypt [320B]   12.1 uJ   351 us    34.5 mW   30000
      AES128 CCM Encrypt [52B]    10.8 uJ   318 us    34.1 mW   52821
      AES128 CCM Decrypt [168B]   21.0 uJ   612 us    34.4 mW   20000
      ECDH p256r1 Secret Mix      14.7 mJ   445 ms    33.1 mW   25
      ECDSA p256r1 Sign           6.33 mJ   191 ms    33.1 mW   57
      ECDSA p256r1 Verify         21.5 mJ   650 ms    33.1 mW   20
      SHA256 [23B]                638 nJ    18.8 us   33.7 mW   555931
      SHA256 [57B]                1.52 uJ   45.0 us   33.9 mW   233068
      SHA256 [384B]               9.51 uJ   279 us    34.0 mW   37612
      SHA256+AES Mix              57.2 uJ   1.68 ms   33.9 mW   6132
      SHA256 [4224B]              104 uJ    3.05 ms   34.1 mW   3436
      Data Tx (AES ENC) [2KB]     71.4 uJ   2.06 ms   34.6 mW   5000

  41. BASELINE 505, GAIN 1524/1524 = 1.00
      SecureMark-TLS: 1524
      Normalized to 3.0Vdd/1.2Vcore: 1524
      Normalized to 1.8Vdd/1.0Vcore < 26 MHz: 3038
      Variables:
      - VCC: 3.0 V
      - Vcore: 1.2 V
      - Freq: 80 MHz
      - Oscillator: 16 MHz HSI
      - PLL setting
      - UART/GPIO off
      - Optimizer setting: O1
      - Linker settings (flash)
      - Optimized memcpy: Y
      - Optimized memset: N
      - Flash ws: 6
      - Copy loop alignment: 8
      Benchmark                   Energy    Time      Power     Iterations
      AES128 ECB Encrypt [144B]   6.07 uJ   176 us    34.3 mW   59305
      AES128 ECB Encrypt [224B]   8.82 uJ   255 us    34.4 mW   40000
      AES128 ECB Encrypt [320B]   12.1 uJ   351 us    34.5 mW   30000
      AES128 CCM Encrypt [52B]    10.8 uJ   318 us    34.1 mW   52821
      AES128 CCM Decrypt [168B]   21.0 uJ   612 us    34.4 mW   20000
      ECDH p256r1 Secret Mix      14.7 mJ   445 ms    33.1 mW   25
      ECDSA p256r1 Sign           6.33 mJ   191 ms    33.1 mW   57
      ECDSA p256r1 Verify         21.5 mJ   650 ms    33.1 mW   20
      SHA256 [23B]                638 nJ    18.8 us   33.7 mW   555931
      SHA256 [57B]                1.52 uJ   45.0 us   33.9 mW   233068
      SHA256 [384B]               9.51 uJ   279 us    34.0 mW   37612
      SHA256+AES Mix              57.2 uJ   1.68 ms   33.9 mW   6132
      SHA256 [4224B]              104 uJ    3.05 ms   34.1 mW   3436
      Data Tx (AES ENC) [2KB]     71.4 uJ   2.06 ms   34.6 mW   5000

  42. BASELINE 505, GAIN 1588/1526 = 1.04
      SecureMark-TLS: 1588
      Normalized to 3.0Vdd/1.2Vcore: 1588
      Normalized to 1.8Vdd/1.0Vcore < 26 MHz: 3165
      Variables:
      - VCC: 3.0 V
      - Vcore: 1.2 V
      - Freq: 24 MHz
      - Oscillator: 24 MHz MSI
      - PLL setting: No PLL
      - UART/GPIO off
      - Optimizer setting: O1
      - Linker settings (flash)
      - Optimized memcpy: Y
      - Optimized memset: N
      - Flash ws: 1
      - Copy loop alignment: 8
      Benchmark                   Energy    Time      Power     Iterations
      AES128 ECB Encrypt [144B]   5.74 uJ   499 us    11.5 mW   30000
      AES128 ECB Encrypt [224B]   8.31 uJ   716 us    11.6 mW   20000
      AES128 ECB Encrypt [320B]   11.3 uJ   976 us    11.6 mW   10744
      AES128 CCM Encrypt [52B]    10.4 uJ   915 us    11.3 mW   16000
      AES128 CCM Decrypt [168B]   19.8 uJ   1.70 ms   11.6 mW   12000
      ECDH p256r1 Secret Mix      14.1 mJ   1.27 s    11.1 mW   12
      ECDSA p256r1 Sign           6.09 mJ   552 ms    11.0 mW   19
      ECDSA p256r1 Verify         20.6 mJ   1.86 s    11.0 mW   10
      SHA256 [23B]                596 nJ    51.5 us   11.5 mW   320763
      SHA256 [57B]                1.42 uJ   121 us    11.6 mW   131855
      SHA256 [384B]               8.89 uJ   761 us    11.6 mW   21381
      SHA256+AES Mix              53.7 uJ   4.62 ms   11.6 mW   4000
      SHA256 [4224B]              97.4 uJ   8.34 ms   11.6 mW   2000
      Data Tx (AES ENC) [2KB]     66.5 uJ   5.66 ms   11.7 mW   2000
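
Converting the measured times into clock cycles helps to see what the frequency change did. The sketch below (hypothetical function name `cycles_per_iteration`) multiplies time by clock frequency; it uses the AES128 ECB Encrypt [144B] times from the 80 MHz run with 6 flash wait states (176 us) and the 24 MHz run with 1 wait state (499 us).

```c
/* Clock cycles spent in one iteration: us * MHz = cycles. */
static double cycles_per_iteration(double time_us, double clock_mhz)
{
    return time_us * clock_mhz;
}
```

This gives 176 x 80 = 14080 cycles at 80 MHz versus 499 x 24 = 11976 cycles at 24 MHz, so the slower clock needs about 15% fewer cycles for the same work. That is consistent with the lower number of flash wait states, although the slides do not attribute the gain to a single cause.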

  43. BASELINE 505, GAIN 1596/1588 = 1.005
      SecureMark-TLS: 1596
      Normalized to 3.0Vdd/1.2Vcore: 1596
      Normalized to 1.8Vdd/1.0Vcore < 26 MHz: 3181
      Variables:
      - VCC: 3.0 V
      - Vcore: 1.2 V
      - Freq: 24 MHz
      - Oscillator: 24 MHz MSI
      - PLL setting: No PLL
      - UART/GPIO off
      - Optimizer setting: O1
      - Linker settings (flash)
      - Optimized memcpy: Y
      - Optimized memset: N
      - Flash ws: 1
      - Copy loop alignment: 8
      - K256 (rodata) in SRAM1
      Benchmark                   Energy    Time      Power     Iterations
      AES128 ECB Encrypt [144B]   5.73 uJ   499 us    11.4 mW   30000
      AES128 ECB Encrypt [224B]   8.29 uJ   716 us    11.5 mW   20000
      AES128 ECB Encrypt [320B]   11.3 uJ   977 us    11.6 mW   10744
      AES128 CCM Encrypt [52B]    10.3 uJ   917 us    11.2 mW   16000
      AES128 CCM Decrypt [168B]   19.7 uJ   1.70 ms   11.5 mW   12000
      ECDH p256r1 Secret Mix      14.1 mJ   1.27 s    11.1 mW   12
      ECDSA p256r1 Sign           6.06 mJ   551 ms    11.0 mW   19
      ECDSA p256r1 Verify         20.6 mJ   1.86 s    11.0 mW   10
      SHA256 [23B]                581 nJ    50.8 us   11.4 mW   320763
      SHA256 [57B]                1.38 uJ   119 us    11.5 mW   131855
      SHA256 [384B]               8.60 uJ   749 us    11.4 mW   21381
      SHA256+AES Mix              52.0 uJ   4.55 ms   11.4 mW   4000
      SHA256 [4224B]              94.1 uJ   8.21 ms   11.4 mW   2000
      Data Tx (AES ENC) [2KB]     66.6 uJ   5.66 ms   11.7 mW   2000

  44. Q: HOW DO I GET RODATA AND CODE IN SRAM?

  45. Q: HOW DO I GET RODATA AND CODE IN SRAM?
      A: Adapt the linker definition file to move code/data from special segments to specific memory.

  46. Q: HOW DO I GET RODATA AND CODE IN SRAM?
      A: Adapt the linker definition file to move code/data from special segments to specific memory.
      Q: How do I do that without changing my code?

  47. Q: HOW DO I GET RODATA AND CODE IN SRAM (STEP 1)?
      A: Adapt the linker definition file to move code/data from special segments to specific memory areas.
      Q: How do I do that without changing my code?
      A: You don't. By adapting the linker definition file in a different way you can move certain symbols or segments from specific .o files to specific memory areas.
      Example memory areas for STM32L4A6ZG:
          MEMORY
          {
            RAM (xrw)   : ORIGIN = 0x20000000, LENGTH = 256K
            SRAM2 (xrw) : ORIGIN = 0x10000000, LENGTH = 64K
            FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 1024K
          }

  48. Q: HOW DO I GET RODATA AND CODE IN SRAM (STEP 1)?
          /* used by the startup to initialize RamFunc */
          _siramfunc = LOADADDR(.RamFunc);
          /* Code to be executed from RAM sections goes into RAM, load LMA copy after code */
          .RamFunc :
          {
            . = ALIGN(4);
            _sramfunc = .;          /* create a global symbol at ram functions start */
            *(.RamFunc)             /* .RamFunc sections */
            *(.RamFunc*)            /* .RamFunc* sections */
            *(.text.AES_*)          /* All AES_* functions go here */
            *(.text.SHA256*)        /* All SHA256* functions go here */
            *(.text.AccHw_*)        /* All AccHw_* functions go here */
            . = ALIGN(8);
            *(.text.mpi*hlp)        /* All mpi*hlp functions go here */
            . = ALIGN(4);
            _eramfunc = .;          /* define a global symbol at ram functions end */
          } >SRAM2 AT> FLASH

  49. Q: HOW DO I GET RODATA AND CODE IN SRAM (STEP 1)?
          /* used by the startup to initialize ROData in Ram */
          _siramrodata = LOADADDR(.RamROData);
          /* RO data sections goes into RAM, load LMA copy after code */
          .RamROData :
          {
            . = ALIGN(4);
            _sramrodata = .;        /* create a global symbol at ram ROData start */
            *(.RamROData)           /* .RamROData sections */
            *(.RamROData*)          /* .RamROData* sections */
            *(.rodata.rcon)         /* rcon rodata go here */
            *(.rodata.K256)         /* K256 rodata go here */
            *(.rodata.InvSbox)      /* InvSbox rodata go here */
            *(.rodata.Sbox)         /* Sbox rodata go here */
            *(.rodata.secp256r1*)   /* secp256r1* rodata go here */
            . = ALIGN(4);
            _eramrodata = .;        /* define a global symbol at ram ROData end */
          } >RAM AT> FLASH

  50. Q: HOW DO I GET RODATA AND CODE IN SRAM (STEP 2A)? DEFINE NEW EXTERNAL SYMBOLS FOR NEW SEGMENT ADDRESS
      Declaring the symbols weak ensures that nothing breaks if the symbols are not defined in an older linker definition file.
          + /* start address for the initialization values of the .RamFunc section. Defined in linker script */
          + .weak _siramfunc
          + /* start address for the .RamFunc section. Defined in linker script */
          + .weak _sramfunc
          + /* end address for the .RamFunc section. Defined in linker script */
          + .weak _eramfunc
          + /* start address for the initialization values of the .RamROData section. Defined in linker script */
          + .weak _siramrodata
          + /* start address for the .RamROData section. Defined in linker script */
          + .weak _sramrodata
          + /* end address for the .RamROData section. Defined in linker script */
          + .weak _eramrodata
          +
            /* start address for the initialization values of the .data section. Defined in linker script */
            .word _sidata
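
The symbols declared above delimit what the startup code has to copy from flash to RAM. The startup file in the slides is assembly; the following is only a C sketch of the equivalent copy step, with a hypothetical function name `copy_section`, and is not taken from the presentation.

```c
#include <stdint.h>

/* Copy a section image word by word from its load address (flash) to
 * its run address (RAM). On target the arguments would be the linker
 * symbols, e.g. copy_section(&_siramfunc, &_sramfunc, &_eramfunc),
 * with each symbol declared as `extern uint32_t _sramfunc;` and so on
 * (these declarations are an assumption, not shown in the slides). */
static void copy_section(const uint32_t *load, uint32_t *start,
                         const uint32_t *end)
{
    while (start < end) {
        *start++ = *load++;
    }
}
```

When start and end are equal the loop copies nothing, so an empty section is harmless.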
