Unified Approach for Performance Evaluation and Debug of System on Chip in Early Design Phase

Nishit Gupta
  
Scientist,
Ministry of Electronics & Information Technology(MeitY), Government of India
Agenda
  • Introduction
  • Problem statement
  • Alternate solutions and limitations
  • Proposed solutions
  • Performance and debug approaches
  • Conclusions
Introduction: Problem Statement (Then and Now)
  • ICs during the 1990s and 2000s: bandwidth (BW) was not an issue.
  • ICs now: BW is a major issue, and there is a limit on the available BW.
  • A typical digital TV SoC requires 4 GB/s, but the available bandwidth is below 4 GB/s because multiple IPs access memory in parallel.
[Figure: IP 1 to IP 10 all accessing a single MEMORY]
Introduction: Problem Statement (Areas of Concern)
  • Design a good interconnect and tune it to give a "fair" BW share to every IP.
  • Tune the DDR accesses to maximize efficiency.
The IP traffic classes:
  1. Real time (video IPs)
  2. Latency sensitive (processors)
  3. High bandwidth (video decoder)
[Figure: IP 1 to IP 10 connected through a MIXER to MEMORY]
Introduction: Problem Statement (Alternate Solutions and Limitations)
  1. Spreadsheet analysis
     • Low accuracy when traffic from different IPs gets mixed.
     • Tuning the mixer for best DDR efficiency is not feasible.
  2. High-level C/C++ model for the whole system (including IPs, interconnect, DDR subsystem)
     • High-level C/C++ models for a complex SoC are not available, and they take a lot of effort to create and maintain for each SoC.
     • Low accuracy and poor correlation with post-silicon results.
  3. Performance simulation with SoC RTL
     • Stable SoC RTL is available very late; simulation is accurate but very slow.
     • Software drivers for all IPs need to be available.
  4. Emulation platforms
     • Faster than RTL simulation, but results come very late in the SoC cycle, and getting s/w drivers on time is difficult.
     • RTL changes are very difficult without a major impact on the schedule.
Requirements
A model of the System on Chip (SoC) at an abstraction level that is:
  • Available at an early design phase
  • Fast enough to simulate
  • Accurate
  • Able to exercise various scenarios
Proposed Solution
  • RTL: SoC available in years; high accuracy and low simulation speed. Used for performance analysis, embedded s/w, power analysis, and timing analysis.
  • TLM: VSOC available in months; low accuracy and high simulation speed. Used for embedded s/w development, golden reference models for functional verification, and rough power estimation.
  • Proposed solution: SoC available in weeks; moderate accuracy and simulation speed. Used for architecture analysis, performance analysis, and rough power estimation.
Proposed Solution (Overview)
  1. Create a platform with a BCA+TLM model of the interconnect and the actual RTL for the DDR memory subsystem.
  2. Create Bus Master (IP traffic generator) models.
     • Model the periodic peak traffic accurately.
     • For "real time" IPs, model the internal FIFOs accurately.
  3. Create applications with a set of Bus Masters for each "use case". A use case is one of the several modes in which the chip can operate.
  4. Run simulations.
     • Check bandwidth and latency values.
     • Check FIFO levels for real-time IPs.
[Figure: several Bus Masters drive the Interconnect (SystemC model), which connects to the Mixer + Memory Ctrl RTL and a Memory Model inside a SystemC wrapper]
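The FIFO check for real-time IPs can be illustrated with a small, self-contained sketch. It is not the SystemC Bus Master model from the slides; the function name, the pre-filled FIFO, and the fixed burst latency are all assumptions made for illustration. The point is only to show why the internal FIFO must be modeled: a longer memory latency eventually drains it.

```python
def simulate_fifo(depth, drain_per_cycle, burst_size, burst_latency, cycles):
    """Track the fill level of a real-time IP's read FIFO, cycle by cycle.

    Hypothetical model: the IP consumes `drain_per_cycle` words every cycle
    and requests a burst of `burst_size` words whenever the FIFO has room for
    it; each burst arrives `burst_latency` cycles after the request.
    Returns (minimum level seen, number of cycles with an underflow).
    """
    level = depth        # assumption: the FIFO is pre-filled before streaming
    in_flight = []       # arrival cycles of outstanding bursts, oldest first
    min_level, underflows = level, 0
    for cycle in range(cycles):
        # Deliver every burst whose data arrives in this cycle.
        while in_flight and in_flight[0] <= cycle:
            in_flight.pop(0)
            level += burst_size
        # Request a new burst only if it still fits once all
        # outstanding bursts have landed.
        if level + burst_size * (len(in_flight) + 1) <= depth:
            in_flight.append(cycle + burst_latency)
        # The real-time consumer drains the FIFO, or underflows.
        if level >= drain_per_cycle:
            level -= drain_per_cycle
        else:
            underflows += 1
        min_level = min(min_level, level)
    return min_level, underflows
```

Calling `simulate_fifo(32, 1, 8, 10, 1000)` and then repeating the call with `burst_latency=100` shows the effect of memory latency on the same FIFO: the second configuration cannot sustain one word per cycle and reports underflow cycles.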
 
Tune the programmable parameters, then run SysPerf on the VCD.
  • Soft tuning
     - Interconnect parameters
     - Mixer + DRAM controller parameters
     - IP parameters
  • Hard tuning (triggers RTL changes)
     - Interconnect topology
     - Interconnect data path widths
     - Interconnect frequencies
     - DRAM interface frequency
     - IP FIFO sizes
What needs to be tuned for good traffic management?
  • IP parameters
  • Arbitration scheme, bandwidth limiters
  • Mixer programming for good flow regulation and best DRAM efficiency
[Figure: IP, Freq/size converter, Mixer with arbiter, DRAM controller, DRAM]

Not able to meet bandwidth and latency requirements?
  • Look for bottlenecks in the interconnect; the performance simulation results have this information.
  • An interconnect topology change may help. Did it work? Run performance simulations to find out.
  • IP FIFO size or frequency increase unavoidable? How much increase is optimal? Run performance simulations and find out.
[Figure: IP, Freq/size converter, Mixer with arbiter, DRAM controller, DRAM]
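The slides name bandwidth limiters as a tuning knob but do not say how they work. One common way to build one is a token bucket, sketched below; this is a swapped-in textbook technique, not the interconnect's actual limiter, and the class and function names are hypothetical.

```python
class BandwidthLimiter:
    """Token-bucket limiter (assumed scheme): sending a word costs one token."""

    def __init__(self, words_per_cycle, bucket_size):
        self.rate = words_per_cycle        # long-term allowed bandwidth
        self.capacity = bucket_size        # largest burst that may pass at once
        self.tokens = float(bucket_size)   # start with a full bucket

    def tick(self):
        """Advance one clock cycle: refill tokens up to the bucket capacity."""
        self.tokens = min(self.capacity, self.tokens + self.rate)

    def try_send(self, words=1):
        """Return True and consume tokens if the transfer is allowed now."""
        if self.tokens >= words:
            self.tokens -= words
            return True
        return False


def count_greedy_sends(limiter, cycles):
    """Count words sent by an initiator that tries to send one word every cycle."""
    sent = 0
    for _ in range(cycles):
        limiter.tick()
        if limiter.try_send(1):
            sent += 1
    return sent
```

With `words_per_cycle=0.25` and `bucket_size=4`, a greedy initiator gets a short initial burst and is then held to roughly one word every four cycles, which is the kind of flow regulation a soft-tuning pass would adjust.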
 
 
Performance Evaluation Approach
Flow
  • Transaction extraction from simulation and simulation database recording
  • Display of different abstraction levels, with transactions visualized alongside signals
Features
  • Debug: transaction recording and protocol checking
  • Performance: analyzers module (latencies, BD …)
Performance Recording Module
  • Latency (#cycles): wvalid2wready, bvalid2bready
  • Bandwidth: 865 MB/s
[Figure: request packet from Initiator1 and response packet to Initiator1]
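As a rough illustration of what such an analyzer computes, the sketch below derives valid-to-ready latencies and an average bandwidth from per-cycle signal samples. It is not SysPerf code: the function names, the 0/1 sample lists, and the figures in the usage note are assumptions.

```python
def valid_to_ready_latencies(valid, ready):
    """Cycles each transfer waits between VALID going high and the handshake.

    `valid` and `ready` are per-cycle 0/1 samples of one channel (for example
    WVALID/WREADY). A transfer completes in a cycle where both are 1.
    """
    latencies = []
    wait_start = None
    for cycle, (v, r) in enumerate(zip(valid, ready)):
        if v and wait_start is None:
            wait_start = cycle                      # a new transfer is offered
        if v and r:
            latencies.append(cycle - wait_start)    # handshake completes
            wait_start = None
    return latencies


def bandwidth_mb_per_s(total_bytes, cycles, clock_hz):
    """Average bandwidth in MB/s (1 MB = 10**6 bytes) over `cycles` clock cycles."""
    return total_bytes * clock_hz / cycles / 1e6
```

For the samples `valid = [0,1,1,1,0,1,0]` and `ready = [0,0,0,1,0,1,0]` the first transfer waits two cycles and the second none; minimum, maximum, and average figures follow directly from the returned list. A hypothetical 4000 bytes moved in 1000 cycles of a 200 MHz clock corresponds to 800 MB/s.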
Modeling Approach: Extraction of Memory References
Memory references are extracted by running TxE scripts over the Value Change Dump (.vcd) files produced by SoC simulation. They are then tuned for timing parameters and synchronization and fed to Bus Masters embedded with cache simulation models.

    $MASTER_NAME            FDMA
    $MASTER_PROCESS_SEQ     {1,2,3}*10

    # Process 1: IP doing no operation
    $PROCESS_NAME           Nop_Operation
    $PROCESS_SEQ
        NOP 500
        END

    # Process 2: FDMA produces traffic noise of 50MB/s on memory area
    $PROCESS_NAME           Mem_Write_Access
    $PROCESS_BANDWIDTH      50
    $PROCESS_DATA_LENGTH    1024
    $PROCESS_OPCODE         {WRITE32;WRITE16;WRITE8}
    $PROCESS_ADDRESS        {(0x0~0x2ff)=20;(0x3ff~0x4ff)=80}

    # Process 3: IP does read accesses on control register area
    $PROCESS_NAME           Read_To_CtrReg
    $PROCESS_SEQ
        START
        READ16  $addr++  0xffff
        REPEAT 1000
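The `$PROCESS_ADDRESS` entry above appears to describe a weighted choice between address ranges. The sketch below assumes that reading: each `(lo~hi)=w` item is a range picked with a probability proportional to `w`, with a uniform address inside the range. That interpretation, and both function names, are assumptions; the slides do not define the semantics.

```python
import random
import re


def parse_address_spec(spec):
    """Parse '{(lo~hi)=weight;...}' into a list of (lo, hi, weight) tuples."""
    pattern = r"\((0x[0-9a-fA-F]+)~(0x[0-9a-fA-F]+)\)=(\d+)"
    return [(int(lo, 16), int(hi, 16), int(weight))
            for lo, hi, weight in re.findall(pattern, spec)]


def pick_address(ranges, rng):
    """Pick a range with probability proportional to its weight,
    then a uniformly distributed address inside that range."""
    weights = [weight for _, _, weight in ranges]
    lo, hi, _ = rng.choices(ranges, weights=weights, k=1)[0]
    return rng.randint(lo, hi)
```

Under this reading, `parse_address_spec("{(0x0~0x2ff)=20;(0x3ff~0x4ff)=80}")` yields two ranges, and about 80% of the addresses drawn with `pick_address(ranges, random.Random(1))` fall in `0x3ff` to `0x4ff`.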
Cache Memory Policies/Configurations
  • Replacement policy: Least Recently Used, First-In-First-Out, Random
  • Fetch policy: Demand fetch, Pre-fetch
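To make the effect of these policies concrete, here is a minimal set-associative cache sketch with LRU replacement and demand fetch. It is an illustrative stand-in, not the cache simulation model embedded in the Bus Masters; the function name and the byte-address trace format are assumptions.

```python
from collections import OrderedDict


def miss_ratio(addresses, cache_size, block_size, assoc):
    """Miss ratio of a set-associative cache with LRU replacement.

    `addresses` is a list of byte addresses; `cache_size` and `block_size`
    are in bytes; `assoc` is the number of lines per set.
    """
    num_sets = cache_size // (block_size * assoc)
    # One OrderedDict per set; key order doubles as the LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1                     # miss: fetch the block on demand
            if len(lines) >= assoc:
                lines.popitem(last=False)   # evict the least recently used
            lines[block] = True
    return misses / len(addresses)
```

A trace that alternates between addresses 0 and 2048 misses on every access in a direct-mapped 2K cache with 16-byte blocks, because both blocks map to the same set. With associativity 2 both blocks fit, and only the first two accesses miss.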
Functional Verification Approach
[Figure: IP verification setup with a C/C++ testbench, TB Master, Memory, and an abstract interconnect]

Nonintrusive Timing Randomization Probes (NTRP)
  • SystemC timing randomizer based on the SystemC Verification Library
  • Configured through an input cfg file
  • Introduces timing randomization at the communication interface: random, constrained, fixed, and timed/cycle delay
  • Selectively adds delay to a transaction based on ID, address, opcode, etc.
  • Re-orders transactions and introduces out-of-order transactions
NTRP configuration file

    <timing:set>
      <timing:id_number>default</timing:id_number>
      <timing:signal_scope>AWREADY</timing:signal_scope>
      <timing:conditions> </timing:conditions>
      <timing:cycle_delay_model>RANDOM</timing:cycle_delay_model>
      <timing:cycle_delay timing:min="0" timing:max="0" timing:percentage="50"/>
      <timing:cycle_delay timing:min="20" timing:max="40" timing:percentage="50"/>
    </timing:set>
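One plausible reading of the two `cycle_delay` entries is that each defines a bucket: the bucket is chosen with the given percentage, and the delay is then drawn uniformly between its `min` and `max`. The sketch below implements that reading only; the semantics, the `BUCKETS` constant, and the function name are assumptions, and the values are hard-coded rather than parsed from the file.

```python
import random

# (min_cycles, max_cycles, percentage), copied by hand from the two
# cycle_delay entries of the configuration; not parsed from the file.
BUCKETS = [(0, 0, 50), (20, 40, 50)]


def sample_delay(buckets, rng):
    """Pick a bucket according to its percentage, then a uniform delay in it."""
    weights = [percentage for _, _, percentage in buckets]
    lo, hi, _ = rng.choices(buckets, weights=weights, k=1)[0]
    return rng.randint(lo, hi)
```

Under this assumption, about half of the AWREADY responses are not delayed at all and the other half are delayed by 20 to 40 cycles, for example when drawing with `sample_delay(BUCKETS, random.Random(7))`.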
NTRP AXI3 Timing Randomizer (a few scenarios)
  • BVALID delay: disordering disabled, different BID value
  • BVALID delay: disordering enabled, different BID value
Nishit Gupta
  Scientist,  R&D in Electronics Group, Ministry of Electronics & Information Technology(MeitY),
Government of India

This presentation discusses the challenges related to system-on-chip design, focusing on bandwidth issues, interconnect design, and DDR efficiency tuning. It explores the evolution of performance evaluation methods and the limitations of existing solutions. The need for a unified approach for early-phase evaluation and debugging is emphasized to address the complexities of modern SOC design.

  • Performance Evaluation
  • System on Chip
  • Interconnect Design
  • Bandwidth Issues
  • Early Design Phase

Uploaded on Sep 20, 2024 | 0 Views



Presentation Transcript


  1. Unified Approach for Performance Evaluation and Debug of System on Chip at Early Design Phase Nishit Gupta Scientist, Ministry of Electronics & Information Technology(MeitY), Government of India

  2. Agenda Introduction Problem statement Alternate solutions and limitations Proposed solutions Performance and Debug Approaches Conclusions

  3. Introduction: Problem Statement (Then & Now). ICs during the 1990s and 2000s: BW was not an issue. ICs now: BW is a major issue, and there is a limit on the available BW. A typical digital TV SoC requires 4 GB/s, but the available b/w is below 4 GB/s due to parallel multiple accesses. [Figure: IP 1 to IP 10 accessing a single MEMORY]

  4. Introduction Problem Statement (Typical SoC)

  5. Introduction Problem Statement (Typical Interconnect)

  6. Introduction: Problem Statement (Areas of Concern). Design a good interconnect and tune it to give a "fair" BW share to every IP. Tune the DDR accesses to maximize efficiency. The IP traffic classes: 1. Real time (video IPs) 2. Latency sensitive (processors) 3. High bandwidth (video decoder). [Figure: IP 1 to IP 10 connected through a MIXER to MEMORY]

  7. Introduction: Problem Statement (Alternate Solutions & Limitations). 1. Spreadsheet analysis: low accuracy when traffic from different IPs gets mixed; tuning the mixer for best DDR efficiency is not feasible. 2. High-level C/C++ model for the whole system (including IPs, interconnect, DDR subsystem): such models are not available for a complex SoC and need a lot of effort to create and maintain for each SoC; low accuracy and poor correlation with post-silicon results. 3. Performance simulation with SoC RTL: stable SoC RTL is available very late; accurate but very slow; software drivers for all IPs need to be available. 4. Emulation platforms: faster than RTL simulation, but results come very late in the SoC cycle and getting s/w drivers on time is difficult; RTL changes are very difficult without a major impact on the schedule.

  8. Requirements. A model of the System on Chip (SoC) at an abstraction level that is: available at an early design phase; fast enough to simulate; accurate; able to exercise various scenarios.

  9. Proposed Solution. RTL: SoC available in years; high accuracy and low simulation speed; used for performance analysis, embedded s/w, power analysis, and timing analysis. TLM: VSOC available in months; low accuracy and high simulation speed; used for embedded s/w development, golden reference models for functional verification, and rough power estimation. Proposed solution: SoC available in weeks; moderate accuracy and simulation speed; used for architecture analysis, performance analysis, and rough power estimation. The solution is to use reconfigurable components at a mixed abstraction level: Transaction Level Model (TLM) + Bus Cycle Accurate (BCA).

  10. Proposed Solution (Overview).
     Platform: create a platform with a BCA+TLM model of the interconnect and the actual RTL for the DDR memory subsystem. Create Bus Master (IP traffic generator) models: model the periodic peak traffic accurately and, for real-time IPs, model the internal FIFOs accurately. Create applications with a set of Bus Masters for each use case; a use case is one of the several modes in which the chip can operate. Run simulations, check bandwidth and latency values, and check FIFO levels for real-time IPs.
     Soft tuning: interconnect parameters; Mixer + DRAM controller parameters; IP parameters.
     Hard tuning (triggers RTL changes): interconnect topology; interconnect data path widths; interconnect frequencies; DRAM interface frequency; IP FIFO sizes.
     Flow chart: SoC interconnect design ready. Create platform + applications for different use cases. Simulation run; simulation data (VCD) created. Run SysPerf on the VCD: B/W and latency values for all the IPs are available. Run Sysprobe on the generated VCD and analyze waveforms: check FIFO states for real-time IPs. Decisions: BW of all IPs sufficient? FIFO under/overflows observed? Tried enough tuning/simulations? Soft-tuning branch: tune the programmable parameters for interconnect components + IP ST bus plug + DRAM controller parameters. Hard-tuning branch: change interconnect topology / frequency / data path widths / IP or size/freq converter FIFO sizes / DRAM frequency. End: SoC interconnect design is good.
     [Figure: Bus Masters, Interconnect (SystemC model), Mixer + Memory Ctrl RTL and Memory Model inside a SystemC wrapper]

  11. What needs to be tuned for good traffic management? IP parameters. Arbitration scheme, bandwidth limiters. Mixer programming for good flow regulation and best DRAM efficiency. FIFO size, store & forward. [Figure: IP, Freq/size converter, Node with arbiter, Mixer with arbiter, DRAM controller, DRAM]

  12. Not able to meet bandwidth, latency requirements? Look for bottlenecks in the interconnect; the performance simulation results have this info. An interconnect topology change may help. Did it work? Run performance simulations to find out. IP FIFO size increase unavoidable? Frequency increase unavoidable? How much increase is optimal? Run performance simulations and find out. [Figure: IP, Freq/size converter, Node with arbiters, Mixer with arbiter, DRAM controller, DRAM]

  13. Performance Evaluation Approach. Flow: transaction extraction from simulation and simulation database recording; different abstraction levels displayed, with transactions visualized along with signals. Features: Debug: transaction recording and protocol checking. Performance: analyzers module (latencies, BD ). [Figure: HW simulation produces a .vcd during simulation time; SysPerf takes the .vcd and a CFG and produces result files during analysis time]

  14. Performance Evaluation Approach (Transaction Recording Module)

  15. Performance Evaluation Approach (Performance Recording Module) Latency (#cycles) - wvalid2wready - bvalid2bready Response packet to Initiator1 Bandwidth 865 MB/s Request packet from Initiator1

  16. Performance Evaluation Approach (Performance Recording Module)

     Performance figures           Min cycles   Max cycles   Avg cycles
     Latency (WVALID to WREADY)    0            197          56
     Latency (BVALID to BREADY)    21           30           22
     AXI Burst Pipeline            2            3            4
     AXI Beat Pipeline             0            8            2

     Bandwidth: 865 MB/s

     Opcode table
     READ 1 byte     (1 byte * 3)    = 3 bytes  (0.68%)
     WRITE 1 byte    (1 byte * 2)    = 2 bytes  (0.45%)
     READ 2 bytes    (2 bytes * 11)  = 22 bytes (5%)
     Number of bytes transferred: 443

     Power dissipated
     IDLE       0.145280W        (99.32%)
     NOP        0.3W             (99.67%)
     COMPUTE    0.7W             (99.86%)
     READ       0.4W             (99.75%)
     WRITE      0.009999999983W  (90.91%)

  17. Modeling Approach: Extraction of Memory References. Memory references are extracted by running TxE scripts over the Value Change Dump (.vcd) files produced by SoC simulation. They are then tuned for timing parameters and synchronization and fed to Bus Masters embedded with cache simulation models.

     $MASTER_NAME            FDMA
     $MASTER_PROCESS_SEQ     {1,2,3}*10
     # Process 1: IP doing no operation
     $PROCESS_NAME           Nop_Operation
     $PROCESS_SEQ            NOP 500 END
     # Process 2: FDMA produces traffic noise of 50MB/s on memory area
     $PROCESS_NAME           Mem_Write_Access
     $PROCESS_BANDWIDTH      50
     $PROCESS_DATA_LENGTH    1024
     $PROCESS_OPCODE         {WRITE32;WRITE16;WRITE8}
     $PROCESS_ADDRESS        {(0x0~0x2ff)=20;(0x3ff~0x4ff)=80}
     # Process 3: IP does read accesses on control register area
     $PROCESS_NAME           Read_To_CtrReg
     $PROCESS_SEQ            START REPEAT 1000 READ16 $addr++ 0xffff

  18. Cache Memory in Symmetric Multiprocessing System (SMP)

  19. Proposed Cache Memory Simulation Architecture

  20. Cache Memory Policies/Configurations
     Replacement policy: Least Recently Used, First-In-First-Out, Random
     Fetch policy: Demand fetch, Pre-fetch
     Write policies (four combinations):
       Write hit policy     Write Through    Write Through       Write Back       Write Back
       Write miss policy    Write Allocate   No Write Allocate   Write Allocate   No Write Allocate

     Parameter       Explanation                                     Default value
     -lN-Tsize       Size                                            32K
     -lN-Tbsize      Block size                                      1K
     -lN-Tassoc      Associativity (1, 2, 4, ..)                     1
     -lN-Trepl       Replacement policy (l=LRU, f=FIFO, r=random)    l
     -lN-Tfetch      Fetch policy (d=demand, prefetch)               d
     -lN-Twalloc     Write allocate policy (a=always, n=never)       a
     -lN-Twback      Write back policy (a=always, n=never)           a
     -maxcount       Stop simulation after U memory references       infinite

  21. Analysis of Simulation Results: Miss Ratio

                    Associativity = 1     Associativity = 2     Associativity = 4
     Cache block size (bytes):
     Memory size    16        32          16        32          16        32
     2K             0.0599    0.0394      0.0517    0.0346      0.0320    0.0150
     4K             0.0429    0.0289      0.0294    0.0203      0.0263    0.0177
     8K             0.0229    0.0155      0.0144    0.0100      0.0121    0.0087
     16K            0.0119    0.0079      0.0069    0.0049      0.0044    0.0032
     32K            0.0071    0.0047      0.0031    0.0021      0.0021    0.0014

  22. Functional Verification Approach. [Figure: a C/C++ testbench with TB Master, Memory, TLM IPs, and an abstract interconnect. IP verification uses a TLM DUT; in-system verification uses an RTL DUT connected through adapters]

  23. Nonintrusive Timing Randomization Probes (NTRP). Based on the SystemC Verification Library. Configured through an input cfg file. Introduces timing randomization at the communication interface: random, constrained, fixed, timed/cycle delay. Selectively adds delay to a transaction based on ID, address, opcode, etc. Re-orders transactions and introduces out-of-order transactions.

  24. NTRP configuration file: <timing:set> <timing:id_number>default</timing:id_number> <timing:signal_scope>AWREADY</timing:signal_scope> <timing:conditions> </timing:conditions> <timing:cycle_delay_model>RANDOM</timing:cycle_delay_model> <timing:cycle_delay timing:min="0" timing:max="0" timing:percentage="50"/> <timing:cycle_delay timing:min="20" timing:max="40" timing:percentage="50"/> </timing:set>

  25. NTRP- AXI3 Timing Randomizer (few scenarios) BVALID Delay: Disordering disabled, different BID value BVALID Delay: Disordering enabled, different BID value

  26. Conclusion (Comparison of available solutions) Proposed solution Alternate solution(s) Performance Parameter Spreadsheet High level C/C++ RTL Emulation platform Configurable TLM+BCA Models Schedule impact Very Low Low Very High Low High Low medium medium Overall Effort High High Low Low Low Low Very High medium Low medium Cost of tools medium Simulation time High Very high High medium medium High No of use cases High High High Results Accuracy Very Low Very low

  27. Nishit Gupta Scientist, R&D in Electronics Group, Ministry of Electronics & Information Technology(MeitY), Government of India
