Google's Tensor Processing Unit (TPU) and Deep Neural Networks in Parallel Computing

 
EE 193: Parallel Computing
 
Fall 2017
Tufts University
 
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)
 
Lecture 9: Google TPU
 
Resources
 
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman Jouppi et al., Google, ISCA 2017
A talk by Dave Patterson: https://eecs.berkeley.edu/research/colloquium/170315
he mostly just presented the above paper
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Vivienne Sze et al., MIT, 2017
Available at Cornell Libraries
Design and Analysis of a Hardware CNN Accelerator
Kevin Kiningham et al., 2017 Stanford Tech Report
Describes a systolic array
Is it similar to the TPU? Don’t know…
 
Summary of this course
 
Building a one-core computer that’s really fast is impossible
Instead, we build lots of cores, and hope to fill them with lots of threads
We’ve succeeded
But we had to do lots of work (especially for a GPU)
Is there a better way?
Domain-specific architectures
Build custom hardware to match your problem
Only cost-effective if your problem is really important
The problem
 
Three problems that Google faces:
Speech recognition and translation on phones
Search ranking
Google fixes your searches for you
Common solution: deep neural networks
 
Neural net
 
Each node: a weighted sum
y = w1·in1 + w2·in2 + …
Followed by an excitation function
e.g., out = (y < 0 ? 0 : 1)
Each vertical slice is a layer
A deep neural net has lots of layers
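A minimal sketch of one node in Python (my illustration; the step excitation is just the example above, and the inputs and weights are made up):

```python
import numpy as np

def node(inputs, weights):
    # Weighted sum: y = w1*in1 + w2*in2 + ...
    y = np.dot(weights, inputs)
    # Excitation function: out = (y < 0 ? 0 : 1)
    return 0.0 if y < 0 else 1.0

# One node with three inputs
print(node(np.array([0.5, -1.0, 2.0]),
           np.array([0.2, 0.4, 0.1])))  # y = -0.1, so prints 0.0
```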
 
 
Types of neural nets
 
Multi-layer perceptrons are fully connected (60%)
Convolutional neural nets have connections to nearby inputs only (30%)
Recurrent NNs have internal state, like an IIR filter (5%)
 
 
What does this remind you of?
 
Matrix multiplication – yet again!
y = w1·in1 + w2·in2 + … is a vector dot product
One layer: output vector = weight matrix × input vector
So, yes, deep neural networks are mostly just lots of matrix multiplies (plus the excitation function)
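In NumPy terms, one layer is a single matrix-vector product followed by the element-wise excitation (a minimal sketch with made-up sizes; a deep net just chains these layers):

```python
import numpy as np

def layer(W, x):
    # output vector = weight matrix * input vector
    y = W @ x
    # element-wise excitation (same step function as before)
    return (y >= 0).astype(np.float32)

# A 4-node layer reading 3 inputs
W = np.random.randn(4, 3).astype(np.float32)
x = np.array([1.0, 0.5, -0.25], dtype=np.float32)
print(layer(W, x))
```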
 
How do we make it run fast?
 
We’ve spent an entire course learning that.
For Google, it wasn’t good enough:
One network can have 5-100M weights
Lots and lots of customers
Goal: 10x the performance of a GPU
Strategy: figure out which parts of a CPU/GPU we don’t need, remove them, and fill the space with more compute
What did they add?
 
Matrix-multiply unit
65K multiply-accumulate units! (Pascal has 3800 cores; see the quick arithmetic after this list)
Lots of on-chip memory to hold the weights and temporary storage
3x more on-chip memory than a GPU
Still can’t hold all the weights, but a lot of them
What did they remove to make room for all this?
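The 65K figure is a 256×256 grid of MACs; per the Jouppi et al. paper the TPU clocks at 700 MHz, so peak throughput works out as follows (counting each multiply-accumulate as 2 ops):

```python
macs = 256 * 256        # 65,536 multiply-accumulate units
clock_hz = 700e6        # 700 MHz clock, per the ISCA 2017 paper
ops_per_mac = 2         # one multiply + one add per cycle
peak = macs * ops_per_mac * clock_hz
print(f"{peak / 1e12:.0f} TOPS")  # ~92 TOPS peak
```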
What did they remove?
 
No out-of-order (OOO) execution (like a GPU)
No speculative execution or branch prediction (like a GPU)
Actually, no branches at all!
Far fewer registers
Pascal has 64K registers/SM; the TPU has very few
How can they get away with that? Keep just the arithmetic.
 
Very few registers
 
Outputs of one stage directly feed the next stage
The GPU or CPU approach:
stuff outputs into registers
read them again as next-stage inputs
this costs silicon area, and also power
So just get rid of them: feed outputs directly to the next stage
This is part of what’s called a systolic array
also remove intermediate registers for the sums
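Here is a toy, cycle-by-cycle sketch of the idea (my own illustration, not the TPU’s actual design): one column of a weight-stationary systolic array, where each cell’s partial sum feeds the next cell directly instead of bouncing through a register file.

```python
import numpy as np

def systolic_dot(w, x):
    # Cell i permanently holds weight w[i]. Each cycle, a cell adds
    # w[i] * (its input) to the partial sum arriving from the cell
    # above and hands the result straight to the cell below --
    # no register file, no loads/stores in between.
    n = len(w)
    psum = np.zeros(n + 1)            # psum[i]: value entering cell i
    for cycle in range(n):
        for i in reversed(range(n)):  # bottom-up = registered stages
            xin = x[i] if cycle == i else 0.0  # inputs arrive skewed
            psum[i + 1] = psum[i] + w[i] * xin
    return psum[n]                    # dot(w, x) after n cycles

w = np.array([0.2, 0.4, 0.1])
x = np.array([0.5, -1.0, 2.0])
print(systolic_dot(w, x), np.dot(w, x))   # both -0.1
```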
What did they remove?
 
Far fewer threads
Haswell = 2 threads/core
Pascal ≈ 1000 threads/SM
TPU = 1 thread!
It runs only one application, and it has no branches
No SMT
But isn’t lots of threads plus SMT what makes things go fast?
SMT: when one thread is stalled, swap in another
Stalls are hard to predict (they depend on memory patterns and branch patterns)
So we keep lots of threads waiting
Downsides of SMT
 
Lots of threads require lots of registers
The TPU doesn’t need nearly as many registers
Moving data back and forth between registers and ALUs costs power
SMT is a way to keep the machine busy during an unpredictable stall
It doesn’t change the fact that a thread stalled unpredictably
It doesn’t make the stalled thread faster
It improves average latency, not worst-case latency
Google’s requirement:
Customers want 99% reliability on worst-case time
They don’t want unpredictable waits more than 1% of the time
SMT does not fit this customer requirement
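A quick numeric illustration of why the average hides the problem (made-up latencies): if 2% of requests hit a 60 ms stall, the mean barely moves, but the 99th percentile is the stall.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical latencies: 98% of requests take 10 ms,
# 2% hit an unpredictable 60 ms stall.
lat = np.where(rng.random(100_000) < 0.02, 60.0, 10.0)
print(f"mean = {lat.mean():.1f} ms")               # ~11 ms: looks fine
print(f"p99  = {np.percentile(lat, 99):.1f} ms")   # 60 ms: fails the 99% target
```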
 
What did they remove?
 
No OOO
No branches (and so no branch prediction or speculative execution)
Far fewer registers
1 thread, no SMT
8-bit integer arithmetic
training vs. inference: training is done in floating point; it happens infrequently and can be slow
inference can be done with 8-bit integers
the TPU uses 8-bit MACs; this saves substantial area, but means the TPU cannot do training
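A minimal sketch of the inference side, assuming a simple symmetric quantization scheme (not necessarily Google’s): quantize float weights and inputs to int8, do the MACs on integers with a wide (int32) accumulator so the sums can’t overflow, and rescale once at the end.

```python
import numpy as np

def quantize(a, scale):
    # Map floats to int8 with a symmetric scale factor
    return np.clip(np.round(a / scale), -128, 127).astype(np.int8)

W = np.random.randn(4, 3).astype(np.float32)   # made-up trained weights
x = np.random.randn(3).astype(np.float32)      # made-up inputs
w_scale = np.abs(W).max() / 127
x_scale = np.abs(x).max() / 127
Wq, xq = quantize(W, w_scale), quantize(x, x_scale)

# 8-bit multiplies, 32-bit accumulation, one rescale back to float
acc = Wq.astype(np.int32) @ xq.astype(np.int32)
print(acc * (w_scale * x_scale))   # close to the float result:
print(W @ x)
```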
 
 
Results
 
15-30X faster at inference than a GPU or CPU
30X-80X better performance/watt
“Performance” is 99th percentile
 
 
Block diagram
 
 
Chip floorplan
 