Advanced Processors

Advanced Processors
(Books shared in separate link)
Overview of DSP
Unit-5
Unit-6
Agenda
DSP Processors
DSP processors are specialized microprocessors with an architecture optimized for the fast operational needs of digital signal processing.
Need for DSP Architecture
 
Filtering, correlation,
FFT
Heavy data flow
through CPU
Real time operations
Parallelism
 
Harvard Architecture
Pipelining
Fast dedicated
hardware     MAC
Special Instruction
Replication
On-chip memory and
cache
Extended parallelism: SIMD, VLIW, superscalar
Simplified Architecture of Standard Microprocessor
Von Neumann Architecture
Independency between the operations
Limitations on the increase in speed
Hardware Architecture for Signal Processing
Multiple Bus Structure
Separate data  and program memory
Data memory
Coefficients, input data, output samples, intermediate data
Non-Pipelining Architecture
Pipeline Architecture
Pipelining Concept
Pipeline MAC Operation
MAC Configuration
Special Instructions
Special Instruction: MAC
Repeat: RPT
Single Instruction Multiple Data (SIMD)
Processing
Data bus-A
Data bus-B
Execution Unit A
Execution Unit B
SIMD Processing
Four 16x16 MAC units operate in parallel on 16-bit operands, each producing a 32-bit result
Very Long Instruction Word (VLIW)
Internal program memory feeds instruction fetch, decode and dispatch (8x32 bits)
Instruction fetch packet: eight 32-bit instructions, always 256 bits wide
Execution packet: dispatches instructions into appropriate execution units; varies from one to eight instructions (32 bits to 256 bits, n x 32 bits)
Two data paths, each connected to internal data RAM (32 bits)
Superscalar Processors
Uses instruction-level parallelism
Developed to execute multiple instructions in one cycle
Achieved through multiple execution units
Extensive use of pipelining
Instruction width is not fixed
An instruction can be issued to execute in parallel, as in SIMD
Uses a load/store architecture suited to taking two inputs and computing an output
Fixed point and Floating point representation
16-bit signed fractional fixed point, often denoted Q1.15
IEEE 754 normalized representation of a single-precision floating-point number
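As a rough illustration of the Q1.15 format named above, the following Python sketch converts between real values and 16-bit signed fractional words and multiplies two of them. The helper names are hypothetical, and saturating on overflow is an assumed policy, not something the slides specify.

```python
Q15_SCALE = 1 << 15  # 2**15: one sign bit followed by 15 fractional bits


def saturate_q15(n: int) -> int:
    """Clamp an integer to the 16-bit signed range (assumed overflow policy)."""
    return max(-Q15_SCALE, min(Q15_SCALE - 1, n))


def float_to_q15(x: float) -> int:
    """Quantize a real value in roughly [-1, 1) to a Q1.15 integer."""
    return saturate_q15(int(round(x * Q15_SCALE)))


def q15_to_float(n: int) -> float:
    """Interpret a Q1.15 integer as a real value."""
    return n / Q15_SCALE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q1.15 values; the raw product has 30 fractional bits,
    so shifting right by 15 returns it to Q1.15."""
    return saturate_q15((a * b) >> 15)
```

For instance, 0.5 maps to the integer 16384, and multiplying 0.5 by 0.5 in this format gives the integer 8192, which reads back as 0.25.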
General purpose DSP architecture
Represent each number with a minimum of 16 bits
2^16 = 65,536 possible bit patterns can represent a number
Unsigned integer: 0 to 65,535
Signed integer: -32,768 to 32,767
Unsigned fraction: spread uniformly between 0 and 1
Signed fraction: spread uniformly between -1 and 1
DSP Processors
Fixed point processors
Floating point processors
Represent each number with a minimum of 32 bits
2^32 = 4,294,967,296 possible bit patterns can represent a number
Represented numbers are not uniformly spaced
ANSI/IEEE Std. 754-1985: the largest and smallest numbers are ±3.4×10^38 and ±1.2×10^-38, respectively
The represented values are unequally spaced between these two extremes, such that the gap between any two adjacent numbers is about ten million times smaller than the value of the numbers
This is important because it places large gaps between large numbers, but small gaps between small numbers
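The spacing claim can be checked numerically. The sketch below is a minimal illustration, assuming positive, finite, normalized inputs; it uses Python's standard `struct` module to step to the next single-precision bit pattern and measure the gap. The function names are hypothetical.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_float32(x: float) -> float:
    """Return the next larger single-precision value after a positive x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def relative_gap(x: float) -> float:
    """Gap to the next representable value, divided by the value itself."""
    x32 = to_float32(x)
    return (next_float32(x32) - x32) / x32
```

The absolute gap grows with the magnitude of the number, while the relative gap stays between 2^-24 and 2^-23, which is the "about ten million times smaller" figure quoted above.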
Fixed point digital signal processors
First Generation
TMS320C1X by TI in 1982
Dedicated AU with multiplier and accumulator
Harvard architecture with separate program and data memory
On-chip memory and special instructions for execution of basic DSP algorithms
Second Generation
TMS320C5X from TI, DSP5600X from Motorola, ADSP21XX from Analog Devices, DSP16XX from Lucent Technologies
Enhanced features compared with the first generation
Larger on-chip memory and more special instructions to execute DSP algorithms
MAC with Repeat
Third Generation
TMS320C54xx, DSP563X and DSP16000
Aimed at digital communication and digital audio
Special instructions for adaptive filtering (including echo cancellation and adaptive equalization) and Viterbi decoding
Low power, with a power management facility
Fourth Generation
TMS320C62XX
VLIW
Extensive parallelism while maintaining the features of earlier versions
Wider instructions, wider data paths, more registers, larger instruction cache and multiple AUs
Floating point DSP processors
First Generation
TMS320C3X from TI
Larger memory and many on-chip peripheral facilities
Program cache and on-chip dual-access memories
Graphics and image processing
Supported three floating point formats
Second Generation
TMS320C4X, ADSP-2106x SHARC
Emphasis on multiprocessing and multiprocessor support
Third Generation
TMS320C67xx, ADSP-TS001
VLIW
Special purpose Digital Signal Processor
Hardware digital filters: FIR
FIR structure
Hardware architecture for FIR filter
Special Purpose Digital Signal Processor
IIR Structure
Hardware architecture for IIR filter
Hardware digital filters  :  IIR
Special purpose Digital Signal Processor
Hardware FFT Processors
Concept of hardware butterfly processor
Simplified architecture of hardware FFT processor
Special purpose Digital Signal Processor
Hardware FFT Processors
FFT performed on N point data in buffer A while buffer B is being filled
Double buffering in real-time FFT
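The double-buffering idea can be modeled in a few lines. This is a sequential Python sketch under stated assumptions: on real hardware the fill would typically run concurrently with the FFT, whereas here a generator simply hands over each completed block and switches to the other buffer. The function name and the 0/1 indices standing for buffers A and B are hypothetical.

```python
def double_buffered_blocks(samples, n):
    """Yield (buffer_index, block) each time one of two n-sample buffers fills.

    While the consumer processes the yielded block, subsequent samples
    are written to the other buffer.
    """
    buffers = [[], []]
    fill = 0  # index of the buffer currently being filled
    for s in samples:
        buffers[fill].append(s)
        if len(buffers[fill]) == n:
            ready = fill
            fill ^= 1             # switch to the other buffer
            buffers[fill] = []    # its previous block has been handed over already
            yield ready, buffers[ready]
```

A consumer would run its N-point FFT on each yielded block; real-time operation requires that the FFT finish before the other buffer fills.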
Architecture of TMS320C67XX
Valid Register Pairs
TMS320C67XX CPU data paths
Data lines: src1 and src2
32 bits (all units)
40 bits (.L, .S)
Register File Cross Paths:
Functional units can read and write operands from their own register files
.L1, .S1, .M1, .L2, .S2, .M2 have access to opposite-side registers through cross paths
Memory Load and Store Paths:
LD1 and LD2 (LDDW)
ST1 and ST2
Data Address Paths:
DA1 and DA2 allow a data address generated by either path to access data to or from any register
Control Registers (accessed by .S2 alone using MVC)
Address Mode Register (AMR)
Functional Units
Addressing Modes
Register Addressing mode:
mnemonic .unit src1, src2, dst
Mnemonic used could be ADD, SUB, MPY etc.
ADD .L1 A1, A2, A3
ADD .S2 B1, B2, B2
ADD .L1X A1, B2, A2
Linear addressing mode and circular addressing mode use .D1 and .D2
Addressing Modes
Linear Addressing mode: Uses .D1 and .D2
mnemonic .unit mode field, dst
Load,  store
*+baseR[offsetR/ucst5]   Positive offset from baseR specified by offsetR/ucst5
*-baseR[offsetR/ucst5]   Negative offset from baseR specified by offsetR/ucst5
*++baseR[offsetR/ucst5]  Pre-increment of baseR by offsetR/ucst5
*--baseR[offsetR/ucst5]  Pre-decrement of baseR by offsetR/ucst5
*baseR++[offsetR/ucst5]  Post-increment of baseR by offsetR/ucst5
*baseR--[offsetR/ucst5]  Post-decrement of baseR by offsetR/ucst5
Addressing Modes
Linear Addressing mode: Uses .D1 and .D2
mnemonic .unit mode field, dst
LDW .D1 *A0[1], A1
Loads into register A1 the contents of the memory location addressed by A0 plus the offset (1 left-shifted twice, i.e. 4 bytes)
(the offset is left-shifted by 3, 2, 1, 0 for double word, word, half word and byte accesses, respectively)
LDW .D1 *++A0[A4], A1
LDW .D1 *A0++[2], A1
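The offset scaling and the pre/post update rules for linear addressing can be sketched as follows. This is a Python model, not C6x code; the mode labels and the function name are hypothetical, and the shift amounts follow the 3, 2, 1, 0 rule for double word, word, half word and byte stated above.

```python
# log2 of the access size in bytes for each load mnemonic
SHIFT = {"LDB": 0, "LDH": 1, "LDW": 2, "LDDW": 3}


def effective_address(mnemonic, mode, base, offset):
    """Return (address used for the access, updated base register value).

    mode is one of: '+', '-', '++pre', '--pre', 'post++', 'post--'.
    offset is the register or 5-bit constant value before scaling.
    """
    scaled = offset << SHIFT[mnemonic]
    if mode == "+":
        return base + scaled, base            # base register unchanged
    if mode == "-":
        return base - scaled, base
    if mode == "++pre":
        return base + scaled, base + scaled   # update first, then access
    if mode == "--pre":
        return base - scaled, base - scaled
    if mode == "post++":
        return base, base + scaled            # access first, then update
    if mode == "post--":
        return base, base - scaled
    raise ValueError(f"unknown addressing mode: {mode}")
```

With A0 = 0x100, a word load with offset 1 in '+' mode reads address 0x104 and leaves A0 unchanged, while a post-increment with offset 2 reads 0x100 and leaves A0 = 0x108.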
Circular Addressing mode: Uses .D1 and .D2
A4-A7 and B4-B7 are used
The Address Mode Register (AMR) selects the mode for each of A4-A7 and B4-B7
mnemonic .unit mode field, dst
Addressing Modes
Circular Buffering
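A circular buffer wraps an address inside a fixed-size block instead of letting it run past the end. The Python sketch below models that wrap, assuming a block of 2^(N+1) bytes aligned to its own size, where N is the value of the selected block-size field. That sizing rule is an assumption taken from memory of TI's documentation and should be checked against the reference guide; the function name is hypothetical.

```python
def circular_update(addr: int, step_bytes: int, bk: int) -> int:
    """Advance addr by step_bytes, wrapping inside an aligned block.

    bk is the block-size field value N; the block is assumed to be
    2**(N + 1) bytes. Only the low N + 1 address bits change.
    """
    block = 1 << (bk + 1)
    mask = block - 1
    return (addr & ~mask) | ((addr + step_bytes) & mask)
```

With bk = 3 (a 16-byte block starting at 0x100), stepping forward by one word from 0x10C wraps back to 0x100, and stepping backward from 0x100 wraps to 0x10C.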
Fixed Point Instructions
 
Conditional Operations:
All instructions can be conditional
A1, A2, B0, B1, B2 are tested for conditional operation (the value can be tested for zero or nonzero)
The condition in the specified register is tested at the beginning of the E1 execute phase
Parallel Operation:
8 instructions are fetched to form a fetch packet
Execution of these instructions is controlled by scanning the p-bits from left to right
p = 1 for instruction i: instruction i+1 executes in parallel with instruction i
p = 0 for instruction i: instruction i+1 executes in the next machine cycle after instruction i
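The p-bit rule above amounts to a simple grouping procedure. The following Python sketch splits the eight instruction slots of one fetch packet into groups that execute together. It assumes a group never extends past the end of the fetch packet, so the last slot always closes the current group; the function name is hypothetical.

```python
def split_execute_packets(p_bits):
    """Group instruction indices of one fetch packet by their p-bits.

    p_bits[i] == 1 means instruction i+1 runs in parallel with instruction i;
    p_bits[i] == 0 means instruction i+1 starts in the next cycle.
    """
    packets, current = [], []
    for i, p in enumerate(p_bits):
        current.append(i)
        is_last = i == len(p_bits) - 1
        if p == 0 or is_last:   # close the group at a 0 bit or at the packet end
            packets.append(current)
            current = []
    return packets
```

All-zero p-bits give eight single-instruction groups (fully serial), seven ones followed by a zero give a single group of eight (fully parallel), and mixed patterns give the partially serial case.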
Flow  of Execution
 
Flow  of Execution
Fully serial: all p-bits are 0; 8 machine cycles are needed to execute
Fully parallel: all p-bits are 1; 1 machine cycle is needed
Partially serial:
Flow  of Execution
Flow  of Execution
In summary
Pipelining
Fetch Operation
PG (program address generate): the memory address of the 8 instructions of the fetch packet is generated
PS (program address send): the address is sent to memory
PW (program access ready wait): the memory read operation is performed
PR (program fetch packet receive): the 8 instructions are received in the CPU
Execution depends on whether the packet is fully serial, fully parallel or partially serial
Pipelining
Decode Operation
DP: instruction dispatch
Fetch packets are split into execute packets
An execute packet consists of one instruction or two to eight parallel instructions
Instructions are assigned to the appropriate functional units
DC: instruction decode
Source registers, destination registers and associated paths are decoded
Pipelining
Execute Operation
Fixed point processor
Floating point processor
Internal Memory
Internal Memory
Cache-based internal memory architecture
Two-level memory architecture
L1P, L1D: 4K size, not included in the memory map, always enabled
L2: 64K size, shared between program and data memory
L1P and L1D are accessed first; if a miss occurs, L2 is accessed
The L2 controller facilitates CPU access to the EMIF and CPU access to peripherals
If an L2 miss occurs, external memory is accessed
The Memory Attribute Register is used to enable the external memory
External Memory
On-chip Peripherals
Features:
Provides full-duplex communication
Selectable data sizes of 8, 12, 16, 20, 24 and 32 bits
Independent framing and clocking for receive and transmit
External shift clock or internal programmable clock for data transfer
8-bit data transfer with an option of LSB or MSB first
Programmable polarity for both frame synchronization and data clocks
Double-buffered registers, which allow continuous data transmission
Auto-buffering capability through the 5-channel DMA controller
µ-law and A-law companding
Direct interface to industry-standard codecs, A/D and D/A converters, etc.
Multichannel Buffered Serial Port
Multichannel Buffered Serial Port
Features of TMS320C6X Processor
Advanced VLIW CPU with eight functional units, including two multipliers and six ALUs
Executes up to eight instructions per cycle and allows development of RISC-like code
Instruction packing reduces code size, program fetches and power consumption
Efficient code execution on independent functional units
Supports 8/16/32-bit formats
Bit-field manipulation: extract, set, clear and bit-counting instructions
Supports single-precision (32-bit) and double-precision (64-bit) IEEE floating point operations, and 32x32-bit integer multiplication with 32- or 64-bit results
 
Thank you!
Functional Units and Operations Performed
On-chip Peripherals

Delve into the world of DSP processors and hardware architectures for signal processing, exploring topics such as Harvard Architecture, pipelining, and specialized instructions like MAC. Discover the simplified architecture of standard microprocessors and the need for DSP architecture with elements like SIMD, VLIW, and parallelism. Gain an understanding of the specialized nature of DSP processors and their crucial role in real-time operations and heavy data flow. Explore the concepts behind pipeline MAC operations, hardware structures, and the importance of parallelism in signal processing.

  • Advanced Processors
  • Signal Processing
  • DSP Architecture
  • Hardware Architecture
  • Parallelism

Uploaded on Feb 26, 2025 | 0 Views



Presentation Transcript


  1. Advanced Processors (Books shared in separate link)

  2. Agenda Overview of DSP Unit-5 Unit-6

  3. DSP Processors DSP processors are specialized microprocessors with an architecture optimized for the fast operational needs of digital signal processing.

  4. Need for DSP Architecture Filtering, correlation, FFT; heavy data flow through CPU; real-time operations; parallelism. Harvard architecture; pipelining; fast dedicated hardware MAC; special instructions; replication; on-chip memory and cache; extended parallelism: SIMD, VLIW, superscalar

  5. Simplified Architecture of Standard Microprocessor Von Neumann Architecture Independency between the operations Limitations on the increase in speed

  6. Hardware Architecture for Signal Processing Multiple bus structure; separate data and program memory. Data memory: coefficients, input data, output samples, intermediate data

  7. Non-Pipelining Architecture

  8. Pipeline Architecture

  9. Pipelining Concept

  10. Pipeline MAC Operation

  11. MAC Configuration

  12. Special Instructions Special Instruction: MAC Repeat: RPT

  13. Single Instruction Multiple Data (SIMD) Processing Data bus-A Data bus-B ALU MAC Shifter ALU MAC Shifter Execution Unit A Execution Unit B

  14. SIMD Processing Eight 16-bit operands feed four 16x16 MAC units, each producing a 32-bit result

  15. Very Long Instruction Word (VLIW) Instruction fetch packet Eight 32 bit instructions Always 256 bits wide Internal Program Memory 8x32-bits Instruction fetch decode and dispatch Execution packet Dispatches instructions into appropriate execution units Varies from one to eight instructions (32 bits to 256 bits) nx32-bits L1 S1 M1 D1 L1 S1 M1 D1 Register file A Register file B Two data paths 32-bits 32-bits Internal data RAM

  16. Superscalar Processors Uses instruction level parallelism Developed to execute multiple instructions in one cycle Achieved through multiple execution units Extensive use of pipelining Instruction width is not fixed An instruction can be issued to execute in parallel like SIMD Uses load/store architecture suitable to take two inputs and compute an output

  17. Fixed point and Floating point representation 16-bit signed fractional point, often indicated as Q1.15 IEEE 754 normalized representation of a single precision floating point number.

  18. General purpose DSP architecture DSP processors: fixed point processors and floating point processors. Fixed point: represent each number with a minimum of 16 bits; 2^16 = 65,536 possible bit patterns can represent a number. Unsigned integer: 0 to 65,535. Signed integer: -32,768 to 32,767. Unsigned fraction: spread uniformly between 0 and 1. Signed fraction: spread uniformly between -1 and 1. Floating point: represent each number with a minimum of 32 bits; 2^32 = 4,294,967,296 possible bit patterns can represent a number. Represented numbers are not uniformly spaced. ANSI/IEEE Std. 754-1985: the largest and smallest numbers are ±3.4×10^38 and ±1.2×10^-38, respectively. The represented values are unequally spaced between these two extremes, such that the gap between any two adjacent numbers is about ten million times smaller than the value of the numbers. This is important because it places large gaps between large numbers, but small gaps between small numbers

  19. Fixed point digital signal processors First Generation: TMS320C1X by TI in 1982; dedicated AU with multiplier and accumulator; Harvard architecture with separate program and data memory; on-chip memory and special instructions for execution of basic DSP algorithms. Second Generation: TMS320C5X from TI, DSP5600X from Motorola, ADSP21XX from Analog Devices, DSP16XX from Lucent Technologies; enhanced features compared with the first generation; larger on-chip memory and more special instructions to execute DSP algorithms; MAC with Repeat. Third Generation: TMS320C54xx, DSP563X and DSP16000; aimed at digital communication and digital audio; special instructions for adaptive filtering (including echo cancellation and adaptive equalization) and Viterbi decoding; low power, with a power management facility. Fourth Generation: TMS320C62XX; VLIW; extensive parallelism while maintaining the features of earlier versions; wider instructions, wider data paths, more registers, larger instruction cache and multiple AUs

  20. Floating point DSP processors First Generation: TMS320C3X from TI; larger memory and many on-chip peripheral facilities; program cache and on-chip dual-access memories; graphics and image processing; supported three floating point formats. Second Generation: TMS320C4X, ADSP-2106x SHARC; emphasis on multiprocessing and multiprocessor support. Third Generation: TMS320C67xx, ADSP-TS001; VLIW

  21. Special purpose Digital Signal Processor Hardware digital filters : FIR FIR structure Hardware Architecture for FIR filter

  22. Special Purpose Digital Signal Processor Hardware digital filters : IIR IIR Structure Hardware architecture for IIR filter

  23. Special purpose Digital Signal Processor Hardware FFT Processors Simplified architecture of hardware FFT processor Concept of hardware butterfly processor

  24. Special purpose Digital Signal Processor Hardware FFT Processors Double buffering in real-time FFT FFT performed on N point data in buffer A while buffer B is being filled

  25. Architecture of TMS320C67XX

  26. Valid Register Pairs

  27. Type of operation by functional unit (.L unit | .S unit | .M unit | .D unit)
      Arithmetic operations: 32/40-bit operations | 32-bit operations | - | 32-bit add and subtract operations only
      Logical operations: 32-bit operations | 32-bit operations | - | 32-bit logical operations*
      Multiply operations: - | - | 16x16 multiply operations | -
      Shift operations: - | 32/40-bit shift operations | - | -
      Compare operations: 32/40-bit operations | - | - | -
      Branch operations: - | Yes | - | -
      Load and store operations: - | - | - | loads and stores with 5-bit constant offset (15-bit constant offset in .D2 only)
      Linear and circular address calculation: - | - | - | Yes
      Constant generation: - | Yes | - | -
      Count operations: 32/40-bit count operations | - | - | -
      Move operations: register to register only | 16-bit move operations | - | register to register only

  28. TMS320C67XX CPU data paths Data lines: src1 and src2; 32 bits (all units), 40 bits (.L, .S). Register File Cross Paths: functional units can read and write operands from their own register files; .L1, .S1, .M1, .L2, .S2, .M2 have access to opposite-side registers through cross paths. Memory Load and Store Paths: LD1 and LD2 (LDDW); ST1 and ST2. Data Address Paths: DA1 and DA2 allow a data address generated by either path to access data to or from any register

  29. Control Registers (accessed by .S2 alone using MVC)
      AMR (Addressing Mode Register): specifies linear or circular addressing of A4-A7 and B4-B7
      CSR (Control Status Register): contains important control and status bits of the processor
      PCE1 (Program Counter, E1 Phase): contains the address of the fetch packet that is in the E1 phase of the pipeline
      IFR (Interrupt Flag Register): contains the status of the INT4-INT15 and NMI interrupts
      ISR (Interrupt Set Register): used to manually set maskable pending interrupts
      ICR (Interrupt Clear Register): used to manually clear maskable pending interrupts
      IER (Interrupt Enable Register): used to enable/disable the individual maskable interrupts
      ISTP (Interrupt Service Table Pointer): points to the beginning of the interrupt service table
      IRP (Interrupt Return Pointer): contains the address to be used to return from a maskable interrupt
      NRP (Non-maskable Interrupt Return Pointer): contains the address to be used to return from a non-maskable interrupt

  30. Address Mode Register AMR
      Bits 31-26: Reserved; bits 25-21: BK1; bits 20-16: BK0
      Bits 15-0: two-bit mode fields, from high to low: B7 mode, B6 mode, B5 mode, B4 mode, A7 mode, A6 mode, A5 mode, A4 mode
      Mode select 00: linear modification of address
      Mode select 01: circular addressing using BK0
      Mode select 10: circular addressing using BK1
      Mode select 11: reserved
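To make the register layout concrete, here is a small Python sketch that decodes a 32-bit AMR value. It assumes the mode fields occupy bits 15-0 with A4 in bits 1-0 up to B7 in bits 15-14, BK0 in bits 20-16 and BK1 in bits 25-21; the function and constant names are hypothetical.

```python
# Two-bit mode fields, listed from bit 0 upward (assumed ordering)
AMR_MODE_FIELDS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]
MODES = {
    0b00: "linear",
    0b01: "circular BK0",
    0b10: "circular BK1",
    0b11: "reserved",
}


def decode_amr(amr: int) -> dict:
    """Split an AMR value into per-register modes and block-size fields."""
    fields = {
        reg: MODES[(amr >> (2 * i)) & 0b11]
        for i, reg in enumerate(AMR_MODE_FIELDS)
    }
    fields["BK0"] = (amr >> 16) & 0x1F  # bits 20-16
    fields["BK1"] = (amr >> 21) & 0x1F  # bits 25-21
    return fields
```

For example, the value (3 << 16) | 0b01 selects circular addressing with BK0 for A4, leaves the other registers linear, and sets BK0 to 3.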

  31. Unit 5 Introduction to Computer Architecture R5-12.1, R5-12.2 General purpose Digital Signal Processors R5-12.3 Selecting digital signal processors R5-12.4 Special purpose DSP Hardware R5-12.6 Architecture of TMS320C67X Reference GuideTMS320C67XX/T2-13.2 Features of C67X processors Reference GuideTMS320C67XX/T2-13.2 TMS320C67x/C67x+ DSPCPU and Instruction Set Reference Guide/T2-13.4 CPU General purpose register files TMS320C67x/C67x+ DSPCPU and Instruction Set Reference Guide/T2-13.5 Functional units and operation TMS320C67x/C67x+ DSPCPU and Instruction Set Reference Guide/T2-13.6 Data paths TMS320C67x/C67x+ DSPCPU and Instruction Set Reference Guide/T2-13.7 Control register file TMS320C67x/C67x+ DSPCPU and Instruction Set Reference Guide/T2-13.8

  32. Functional Units (name of unit: type of operations) .L: Arithmetic, Logical, Compare, Other. .S: Arithmetic, Logical, Shift, Branch, Move, Other. .M: Multiply. .D: Arithmetic, Load store, Other

  33. Addressing Modes Register Addressing mode: mnemonic .unit src1, src2, dst Mnemonic used could be ADD, SUB, MPY etc. ADD .L1 A1, A2, A3 ADD .S2 B1, B2, B2 ADD .L1X A1, B2, A2 Linear addressing mode and circular addressing mode use .D1 and .D2

  34. Addressing Modes Linear Addressing mode: Uses .D1 and .D2 mnemonic .unit mode field, dst Load, store
      *+baseR[offsetR/ucst5]   Positive offset from baseR specified by offsetR/ucst5
      *-baseR[offsetR/ucst5]   Negative offset from baseR specified by offsetR/ucst5
      *++baseR[offsetR/ucst5]  Pre-increment of baseR by offsetR/ucst5
      *--baseR[offsetR/ucst5]  Pre-decrement of baseR by offsetR/ucst5
      *baseR++[offsetR/ucst5]  Post-increment of baseR by offsetR/ucst5
      *baseR--[offsetR/ucst5]  Post-decrement of baseR by offsetR/ucst5

  35. Addressing Modes Linear Addressing mode: Uses .D1 and .D2 mnemonic .unit mode field, dst LDW .D1 *A0[1], A1 Loads into register A1 the contents of the memory location addressed by A0 plus the offset (1 left-shifted twice, i.e. 4 bytes); the offset is left-shifted by 3, 2, 1, 0 for double word, word, half word and byte accesses, respectively. LDW .D1 *++A0[A4], A1 LDW .D1 *A0++[2], A1

  36. Addressing Modes Circular Addressing mode: Uses .D1 and .D2 A4-A7 and B4-B7 are used The Address Mode Register (AMR) selects the mode for each of A4-A7 and B4-B7 mnemonic .unit mode field, dst

  37. Circular Buffering

  38. Fixed Point Instructions (instruction, functional unit: description)
      MV, .L1 or .L2, .S1 or .S2, .D1 or .D2: move value from one register to another
      MVC, .S2 only: move value between the control register file and the register file
      MVK, .S1 or .S2: move a 16-bit constant into the lower 16 bits of a register, sign-extended
      MVKLH, .S1 or .S2: move a 16-bit constant into the upper 16 bits of a register
      MVKH, .S1 or .S2: move the upper 16 bits of a 32-bit constant into the upper 16 bits of a register

  39. Flow of Execution Conditional Operations: All instructions can be conditional. A1, A2, B0, B1, B2 are tested for conditional operation (the value can be tested for zero or nonzero). The condition in the specified register is tested at the beginning of the E1 execute phase. Parallel Operation: 8 instructions are fetched to form a fetch packet. Execution of these instructions is controlled by scanning the p-bits from left to right. p = 1 for instruction i: instruction i+1 executes in parallel with instruction i. p = 0 for instruction i: instruction i+1 executes in the next machine cycle after instruction i

  40. Flow of Execution

  41. Flow of Execution Fully serial: all p-bits are 0; 8 machine cycles are needed to execute. Fully parallel: all p-bits are 1; 1 machine cycle is needed. Partially serial:

  42. Flow of Execution In summary

  43. Pipelining Fetch Operation PG (program address generate): the memory address of the 8 instructions of the fetch packet is generated. PS (program address send): the address is sent to memory. PW (program access ready wait): the memory read operation is performed. PR (program fetch packet receive): the 8 instructions are received in the CPU. Execution depends on whether the packet is fully serial, fully parallel or partially serial

  44. Pipelining Decode Operation DP (instruction dispatch): fetch packets are split into execute packets; an execute packet consists of one instruction or two to eight parallel instructions; instructions are assigned to the appropriate functional units. DC (instruction decode): source registers, destination registers and associated paths are decoded

  45. Pipelining Execute Operation Fixed point processor: phases E1 to E5. Floating point processor: phases E1 to E10

  46. Internal Memory

  47. Internal Memory Cache-based internal memory architecture; two-level memory architecture. L1P, L1D: 4K size, not included in the memory map, always enabled. L2: 64K size, shared between program and data memory. L1P and L1D are accessed first; if a miss occurs, L2 is accessed. The L2 controller facilitates CPU access to the EMIF and CPU access to peripherals
