Efficient Video Encoder on CPU+FPGA Platform

 
Highly Efficient and Flexible Video Encoder
on CPU+FPGA Platform
 
Di Wu, Liang Zhang, Peng Liu, Yao Song
 
Motivation
 
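For Video Encoding
ASIC
high throughput
low power consumption
difficult to upgrade
some tasks not suitable for hardware
Pure software
easy upgrade to new algorithms and standards
highest quality
poor encoding performance

Is it possible to combine the advantages of both
ASIC and software solutions?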
 
CPU+FPGA
 
Industry has provided CPU+FPGA SoCs, and it is a
trend to integrate FPGA for GPUs in datacenter!
 
Xilinx Zynq - Dual-core ARM Cortex A9 + FPGA
 
For video encoding
Video Frame Encoding (FPGA): High performance! Easy upgrading!
Video Packet Wrapping (CPU) and Video Packet Transmission (CPU): Easy programming!
 
 
Discussions in x264 Community
 
[x264-devel] FPGAs and x264
https://mailman.videolan.org/pipermail/x264-devel/2009-July/006009.html
Just offloading simple DSP functions to an fpga is a bad idea when the host is a modern cpu.
 
Xilinx Zynq Architecture
 
processing system (PS) and programmable logic (PL)
AXI interface connecting PS and PL
 
http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf
 
Design Process
 
Hardware
Developing custom IPs
Instantiating user-defined IPs in Vivado
Instantiating the processing system (PS), predefined IPs and user IPs in Vivado
Synthesizing the system and implementing it on Zynq
Software
Export hardware to SDK
Create BSP
Application development and debugging
Most of the effort in our project
 
H.264 Baseline Profile
 
one of the three profiles defined in H.264 standard
widely used in mobile devices
supports predictive encoding, discrete cosine transform and
quantisation, entropy encoding
 
Video Encoder’s Tasks
 
Data Access
Input video frames
Reference frames
Intermediate data
Encoded video packets
Encoding
Motion Estimation
Prediction (Intra + Inter)
Filtering
Entropy coding
 
Data Access Challenges
 
We want
Low latency
Low cost
Zynq provides
AXI-based interconnection -> high throughput, but high
latency
Only PS has memory controller
 
Solution
Minimum data exchange between PS and PL
 
AXI Interfaces
 
Three kinds of interfaces
AXI-Lite (Memory Map)
AXI-Full (Memory Map)
AXI-Stream
We use the AXI-Lite interface for control signals, and AXI-Full for YUV frames and encoded packets
 
PS-PL Interconnection In Zynq
 
http://www.googoolia.com/wp/2014/06/20/lesson-8-an-overview-on-zynq-architecture/
 
Interfaces of Our H264 Encoder
 
Interrupt port
AXI-Lite slave interface for configuration
AXI-Full master interface for reading YUV frames and writing encoded packets
[Figure: Encoder IP with an AXI-Lite Slave port (configuration), an AXI-Full Master port (YUV frames, encoded packets) and an Interrupt output]
 
The Encoder Engine
[Figure: Encoder Engine ports]
control signals: CLK, CLK2, NEWSLICE, NEWLINE, QP, xbuffer_DONE
Y values of pixels: intra4x4_READYI, intra4x4_STROBEI, intra4x4_DATAI
U,V values of pixels: intra8x8cc_readyi, intra4x4_READYI, intra4x4_STROBEI
Encoded packets: tobytes_BYTE, tobytes_STROBE, tobytes_DONE

We use the open source implementation from
http://hardh264.sourceforge.net/
 
Internal of Our H264 Encoder
 
[Figure: internal block diagram of the encoder IP]
Blocks: AXI Lite, AXI Master Burst, Y buffer (4-way), U buffer (4-way), V buffer (4-way), Encoder Engine, Encoder Controller
External signals: Configuration, YUV frames, Interrupt, Encoded packets
Legend: Data Path, Control Path; Open Source, Xilinx IP, Our Verilog code
 
AXI Interfaces Implementation
 
Implement from scratch
RTL implementation based on templates generated by Vivado
Design with HLS
Use Xilinx’s IP
AXI Lite Slave interface
AXI Master interface
 
Block Design
 
Let’s integrate our IP to Zynq SoC
[Figure: block design]
PL: Encoder IP (configuration, YUV frames, Interrupt, Encoded packets)
PS: ARM Processor, Interrupt Controller, AXI4 Interconnect, DDR Memory Controller
PS-PL ports: GP0 (configuration), HP0 (YUV frames and encoded packets)
 
Software Implementation
 
How to control the video encoder IP?
The CPU accesses the IP's registers through the memory-mapped AXI-Lite interface
Control information that must be passed to the video encoder
Start address of a YUV frame
YUV format
Video resolution
QP value of the encoder
Frame number
Output address of the H.264 stream
Information provided by the video encoder
Video packet size
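
A minimal sketch of how the fields listed above could be laid out as 32-bit registers behind the AXI-Lite interface. The slides do not give the actual register map, so every register name, offset and format code below is hypothetical. The register window is simulated with a `bytearray` so the sketch runs anywhere; on the board the same offsets would be applied to the IP's base address.

```python
import struct

# Hypothetical byte offsets inside the encoder IP's AXI-Lite window.
# The slides list these fields but not their layout; the order is an assumption.
REGS = {
    "SRC_ADDR": 0x00,     # start address of the YUV frame in DDR
    "YUV_FORMAT": 0x04,   # format code (0 = YUV420 is an assumed encoding)
    "WIDTH": 0x08,        # video resolution, width in pixels
    "HEIGHT": 0x0C,       # video resolution, height in pixels
    "QP": 0x10,           # quantisation parameter
    "FRAME_NUM": 0x14,    # frame number
    "DST_ADDR": 0x18,     # output address of the H.264 stream
    "PACKET_SIZE": 0x1C,  # written by the encoder: size of the encoded packet
}


class RegisterFile:
    """Stand-in for the memory-mapped register window (little-endian, 32-bit)."""

    def __init__(self, size=0x20):
        self.mem = bytearray(size)

    def write(self, name, value):
        struct.pack_into("<I", self.mem, REGS[name], value)

    def read(self, name):
        return struct.unpack_from("<I", self.mem, REGS[name])[0]


def configure(regs, src, dst, width, height, qp, frame_num, yuv_format=0):
    """Write every control field the encoder needs before it is started."""
    regs.write("SRC_ADDR", src)
    regs.write("YUV_FORMAT", yuv_format)
    regs.write("WIDTH", width)
    regs.write("HEIGHT", height)
    regs.write("QP", qp)
    regs.write("FRAME_NUM", frame_num)
    regs.write("DST_ADDR", dst)
```

Keeping the whole configuration in a handful of word-sized registers matches the goal of minimum data exchange between PS and PL: only addresses and parameters cross the control interface, never pixel data.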
 
Encoding Process in PS
 
1. Put a video frame (YUV) in DRAM (from a camera, a decoder’s output, etc.)
2. Configure the parameters
3. Start encoding
4. On interrupt, save the encoded result
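
The four steps above can be sketched as a small control loop. This is only an illustration of the control flow: `FakeEncoderIP` is a hypothetical software stand-in that does not encode H.264, and the parameter names are assumptions rather than the project's actual driver API.

```python
class FakeEncoderIP:
    """Hypothetical stand-in for the encoder IP; it does NOT encode H.264.

    It writes a few placeholder bytes to the output address and then calls the
    registered interrupt handler, which is enough to exercise the control flow.
    """

    def __init__(self, dram):
        self.dram = dram
        self.params = {}
        self.packet_size = 0
        self.on_interrupt = None

    def configure(self, **params):
        self.params.update(params)

    def start(self):
        # Placeholder "packet": a start-code-like prefix plus the QP value.
        payload = b"\x00\x00\x00\x01" + bytes([self.params["qp"]])
        dst = self.params["dst"]
        self.dram[dst:dst + len(payload)] = payload
        self.packet_size = len(payload)
        if self.on_interrupt:
            self.on_interrupt()


def encode_frame(ip, dram, frame, src, dst, width, height, qp):
    """Run the four PS-side steps for one frame and return the encoded bytes."""
    result = []
    # 1. Put the YUV frame in DRAM.
    dram[src:src + len(frame)] = frame
    # 2. Configure the parameters.
    ip.configure(src=src, dst=dst, width=width, height=height, qp=qp)
    # 4. Register the interrupt handler before starting; it saves the result.
    ip.on_interrupt = lambda: result.append(bytes(dram[dst:dst + ip.packet_size]))
    # 3. Start encoding.
    ip.start()
    return result[0]
```

The handler is registered before the start command so that a fast encoder cannot raise its interrupt before the software is ready to handle it.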
 
Encoding Process in PL
 
1. Move the data of one YUV frame from DDR RAM to the Y, U and V buffers
2. Feed the pixels to the encoder engine
3. Wait for the encoder to generate video packets and move the packets to DDR RAM
4. After finishing the encoding of one frame, generate an interrupt
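
Step 1 requires knowing where each plane of the frame sits in memory. The sketch below computes the offsets, assuming a planar YUV420 layout (all Y samples, then U, then V, 8 bits per sample). The slides state YUV420 but not the exact memory layout, so the plane order is an assumption, and the function names are illustrative.

```python
def yuv420_planes(base, width, height):
    """Return (offset, size) in bytes for the Y, U and V planes of one frame."""
    y_size = width * height
    # Chroma is subsampled by 2 horizontally and vertically in YUV420.
    c_size = (width // 2) * (height // 2)
    return {
        "Y": (base, y_size),
        "U": (base + y_size, c_size),
        "V": (base + y_size + c_size, c_size),
    }


def split_frame(frame, width, height):
    """Slice one planar YUV420 frame into its three planes."""
    planes = yuv420_planes(0, width, height)
    return {name: frame[off:off + size] for name, (off, size) in planes.items()}
```

For a 352x288 frame this gives 101,376 bytes of Y and 25,344 bytes each of U and V, 152,064 bytes in total.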
 
Implementation
 
Our implementation is still ongoing
We hope to finish the design and evaluation before the report deadline
Development Environment
Vivado 2015.3
Zedboard
 
Workload Characteristics
 
Data intensive
Frame by frame, not stream
Computation intensive
Real-time requirement for some applications
 
The Encoder Engine Verification
 
We simulated the encoder engine with a test bench; it generates the correct NAL unit stream.
Source YUV: coastguard (30 frames), 352x288, YUV420
QP: 28
PSNR: 36.3 dB
Compression ratio: 10.6:1
Reference: x264 with the same QP, medium preset
I frames only: PSNR 40.31 dB, compression ratio 8.3:1
With P and B frames: PSNR 37.0 dB, compression ratio 42.6:1
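
For reference, the two metrics above can be computed as in the sketch below. It uses the standard PSNR definition for 8-bit samples and takes the compression ratio as raw YUV420 size over encoded size. The slides do not say whether PSNR was measured on luma only or on all planes, nor how the raw size was counted, so both choices are assumptions.

```python
import math


def psnr(original, decoded, peak=255):
    """PSNR in dB between two equal-length sequences of 8-bit samples."""
    if len(original) != len(decoded):
        raise ValueError("sample sequences must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def compression_ratio(width, height, frames, encoded_bytes):
    """Raw YUV420 size (1.5 bytes per pixel) divided by the encoded size."""
    raw_bytes = width * height * 3 // 2 * frames
    return raw_bytes / encoded_bytes
```

Under these assumptions, the 30-frame 352x288 clip is 4,561,920 bytes raw, so a 10.6:1 ratio would correspond to an encoded stream of roughly 430 kB.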
 
Design Review
 
An H.264 video encoder IP for Zynq SoC
Our design needs to follow the design paradigm of AXI4-based IP
Minimum communication is the key to power/performance gain
Use high-throughput mode (burst) or other optimizations (future work: pipelining) to optimize the communication between the CPU and the FPGA logic
Vivado is a powerful tool, but the learning curve is very steep
(especially for software guys)
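
To make the burst-mode point concrete, the sketch below counts how many burst transactions are needed to move one frame. The data width and burst length are illustrative parameters, not values taken from the slides; the defaults (64-bit data, 16 beats per burst) are assumptions.

```python
def bursts_needed(num_bytes, data_width_bits=64, beats_per_burst=16):
    """Number of burst transactions needed to transfer num_bytes."""
    bytes_per_burst = (data_width_bits // 8) * beats_per_burst
    return -(-num_bytes // bytes_per_burst)  # ceiling division
```

With these defaults, one 352x288 YUV420 frame (152,064 bytes) takes 1,188 bursts of 128 bytes; raising the burst length to 256 beats cuts that to 75. Since each transaction pays the interconnect latency once, fewer and longer bursts are what amortize it.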
 
Discussion
 
Can we use some general framework to simplify our work?
No
Example: RSoC framework: http://rsoc-framework.com/
Suitable for “stream” applications
 
Conclusion
 
Task offloading is not always helpful; communication latency matters
Video encoding on a CPU+FPGA SoC can be both efficient and flexible
Our system can be used to integrate other encoder engines
CPU+FPGA SoCs are powerful tools for designers: “do simple things fast on HW, intelligence on SW”.
 
Future Work

Hardware
Optimization, e.g. pipelining
Video encoder engine improvement, e.g. support for P and B frames
Software
Bitrate control logic
OS support
Integration with multimedia software frameworks (ffmpeg, gstreamer, etc.)
 
Thanks!
Q&A
Uploaded on Oct 10, 2024