OBPMark and OBPMark-ML: Computational Benchmarks for On-Board Data Processing and Machine Learning in Space Applications
David Steenari (1), Ivan Rodriguez-Ferrandez (1,2,3), Sanath Muret (1), Marc Solé Bonet (2,3), Jannis Wolf (2,3), Luis Mansilla (1), Leonidas Kosmidis (2,3)
(1) ESA/ESTEC
(2) Barcelona Supercomputing Center (BSC)
(3) Universitat Politècnica de Catalunya (UPC)
03/10/2023
ESA ESTEC
 
 
 
Background and Motivation
Motivation:
FPGAs and multicore CPUs have been the most common on-board processing devices in recent years.
More recently, several new device types with hardware acceleration have been introduced in on-board
systems: multicore FPGA SoCs, GPUs, VPUs, AI accelerators, array and many-core processors.
Different device types use different programming paradigms.
The variety of device and HW accelerator types makes it harder to compare computational performance
in key processing applications.
Lack of openly available space application benchmarks led to the definition and release of OBPMark (in 2021)
…as operations per second (FLOPS, GMACs, MIPS, etc) do not give the full picture. Neither do synthetic
benchmarks.
…and commercially available processing benchmarks do not:
…target space applications (hyperspectral, radar, etc.)  – and do not lend better understanding
for OBP applications.
…performance per power metrics not always included.
…FPGA implementation not considered.
…not easily portable to non-standard/parallel programming paradigms
Figure: GR740 (Gaisler), TX2 (NVIDIA), ZUS+ (Xilinx), Myriad 2 (Intel), HPDP (ISD), Kalray (MPPA).
 
 
 
OBPMark Suite Overview and Objectives
OBPMark (“On-Board Processing Benchmarks”) is a suite of computational performance benchmarks for space
applications.
Developed by ESA and BSC/UPC. Funded through ESA GSP contract “GPU4S” – contract now closed.
Source code implementations in public release.
Early usage in a number of ESA activities (HW/SW developments, etc. as reference application cases).
Objectives:
To promote a standard set of application-level benchmarks, so as to enable a method of comparing the end-user performance of different devices and systems, such as RHBD and COTS processors, FPGAs and ASICs.
To better understand the limitations of different types of devices and systems, and to quickly decide on the division of tasks between hardware and software for implementations in heterogeneous systems.
To allow ESA to quickly provide recommendations for processing systems in future missions, by identifying key parameters together with the project teams.
To benchmark standard on-board processing functions, so that implementers can reuse the invested work in real-world use cases.
 
 
 
OBPMark Requirements
Coverage:
 …shall cover common OBP applications: image processing, compression, radar processing, encryption, common building blocks (for
radar, radiometry, SDR, etc.) and machine learning.
 …shall allow future benchmarks to be added (through version updates).
Comparable:
 …shall ensure identical output among different implementations.
 …shall provide comparable results for: overall performance, performance per unit power, absolute power.
 …shall provide all necessary configuration parameters and test data, including golden reference output.
Portable:
 …shall provide a reference implementation in standard C.
 …shall support reference implementations using standard parallelization schemes: OpenMP, OpenCL and CUDA.
 …shall be possible to port to FPGA implementations*.
Openness:
 …shall be openly available (open-source license, open repository).
 …shall be open for community response/feedback.
 …shall be open for community contributions (porting etc.).
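As an illustration of the "Comparable" requirement, the sketch below checks an implementation's output against a golden reference and reports where the two first differ. It is a minimal sketch under stated assumptions: the function names and the in-memory byte strings standing in for output files are hypothetical and are not part of the OBPMark tooling.

```python
def matches_golden(output: bytes, golden: bytes) -> bool:
    """Return True only if the output is bit-identical to the golden reference."""
    return output == golden


def first_mismatch(output: bytes, golden: bytes) -> int:
    """Return the index of the first differing byte, or -1 if the buffers are identical."""
    for i, (a, b) in enumerate(zip(output, golden)):
        if a != b:
            return i
    # No difference in the common part: one buffer may still be a prefix of the other.
    if len(output) == len(golden):
        return -1
    return min(len(output), len(golden))


# Hypothetical data standing in for a benchmark output file and its golden reference.
golden = bytes(range(16))
good = bytes(range(16))
bad = bytes(range(8)) + b"\xff" + bytes(range(9, 16))

print(matches_golden(good, golden), first_mismatch(bad, golden))
```

Reporting the first mismatching offset, rather than only pass or fail, can make it easier to locate which processing stage of a ported implementation diverges.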
 
 
 
OBPMark Suite: Benchmarks
The OBPMark suite provides three groups of benchmarks:
OBPMark: classical application benchmarks
OBPMark-ML: machine learning application benchmarks
OBPMark-Kernels (GPU4S_Bench): processing building block kernels
Each benchmark group has the following components:
1. A Technical Note (TN) defining the benchmark algorithms, parameters and result reporting templates
2. Reference input and output data, to be used for benchmarking and verification
3. Reference implementations, for execution on common hardware/software platforms
4. A database of reported test results
 
 
 
OBPMark: Application Benchmarks (1 of 5)
Overview
OBPMark “apps” contains classical on-board processing benchmarks for:
Optical and RF instruments: pre-processing, image compression
Generic: data compression and encryption
Signal processing: on-board modulation and demodulation for satcom applications
OBPMark TN defines the algorithms and benchmark parameters.
Input benchmark data and output verification data are provided.
Source code for each benchmark is available, as: standard C (sequential), OpenMP, OpenCL and CUDA.
 
 
 
OBPMark: Application Benchmarks (2 of 5)
Benchmark #1: Image Processing
On-board image processing is typically required for
science and EO payloads.
Two sub-benchmarks defined and implemented:
#1.1 Image Calibration and Correction
#1.2 Radar Image Processing
#1.1 is based on typical processing performed in ESA
science optical payloads with long exposure times.
#1.2 is based on the range-Doppler algorithm
Reconstruction of radar images on-board can
enable additional on-board processing applications
(such as image segmentation)
Figure: Benchmark #1.1 “Image Calibration and Correction” processing chain.
Figure: Benchmark #1.2 “Radar Image Processing” processing chain.
 
 
 
OBPMark: Application Benchmarks (3 of 5)
Benchmark #2: Standard Compression
CCSDS compression algorithms are widely used on ESA (and NASA) missions. We selected:
CCSDS 121.0-B-3 Lossless Data Compression;
CCSDS 122.0-B-2 Image Data Compression;
CCSDS 123.0-B-2 Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image
Compression
The throughput of each of the algorithms depends on multiple compression settings
In particular the CCSDS 123 has many dozens of parameters
Performance and throughput of two implementations cannot be compared fairly unless the same parameters
and data are used
For OBPMark, three compression-related sub-benchmarks defined:
#2.1 CCSDS 121.0 Data Compression
#2.2 CCSDS 122.0 Image Compression
#2.3 CCSDS 123.0 Hyperspectral Compression
Existing compressor implementations can be benchmarked by using the provided input and output verification data and
configuring the specified compressor parameter settings.
New implementations can be reused for future payload applications.
Planned future additions: 124.0 HKTM compression, 125.0 raw SAR data compression, video encoding (H.264/H.265).
 
 
 
OBPMark: Application Benchmarks (4 of 5)
Benchmark #3: Standard Encryption
On-board encryption is used for sensitive data, in particular in commercial
applications.
Currently one encryption benchmark defined, choice based on existing space
standard:
CCSDS 352.0-B-2, CCSDS Cryptographic Algorithms
… which uses the well-known AES encryption standard
Benchmark #3.1 defines guidelines for benchmarking of AES and provides input
and output data for verification
Several key-lengths defined: 128-bit, 192-bit, 256-bit
Performance parameters include: samples/s and samples/s/W
Implementations can be reused in on-board applications
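The two performance parameters named above can be derived from a timed run and an average power reading. The sketch below is a minimal illustration: the function name is hypothetical, and the timing and power figures are made-up values, not measured results.

```python
def throughput_metrics(num_samples: int, elapsed_s: float, avg_power_w: float) -> dict:
    """Compute samples/s and samples/s/W from a timed run and an average power reading."""
    if elapsed_s <= 0 or avg_power_w <= 0:
        raise ValueError("elapsed time and average power must be positive")
    samples_per_s = num_samples / elapsed_s
    return {
        "samples_per_s": samples_per_s,
        "samples_per_s_per_w": samples_per_s / avg_power_w,
    }


# Made-up numbers for illustration only; these are not measured results.
metrics = throughput_metrics(num_samples=16_777_216, elapsed_s=2.0, avg_power_w=8.0)
print(metrics)
```

Normalising by power allows devices with very different power envelopes to be placed on the same scale, which is the purpose of the samples/s/W parameter.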
 
 
 
OBPMark: Application Benchmarks (5 of 5)
Benchmark #4: Signal Processing
Signal processing benchmarks targeting application level algorithms, mainly for
satcom applications.
DVB-S2 is used on many satellite communications missions for
modulation/demodulation.
It is useful to know how many channels can be implemented on a given set of
hardware within a given power budget.
Two sub-benchmarks have been defined:
#4.1 DVB-S2(X) Demodulation
#4.2 DVB-S2(X) Modulation
Definition of the benchmarks and implementation still to be done.
 
 
 
OBPMark-ML: Machine Learning Benchmarks (1 of 2)
Overview and approach
Survey performed of openly available annotated / labelled training data sets for machine learning applications.
Survey also included possible standard CNN architectures that could be applied.
OBPMark-ML contains on-board processing benchmarks for machine learning applications, covering:
Data reduction and data selection algorithms in EO imaging data
Image classification for scientific data selection, and event generation
All applications are implemented as DNN (Deep Neural Network).
The OBPMark-ML Technical Note (TN) defines the DNN algorithm architectures and explains the application cases.
Uses standard models (e.g. SSD MobileNetv2 for object detection)
Provide pre-quantized models in standard formats (TF, TFLite, etc.) for different data types (INT8, INT16, FP16, FP32)
Training data openly available (if re-training is required to support specific platforms)
Reference inference script implementations provided in common frameworks. 
Planned future additions: AOCS / pose estimation (algorithm exists), RNN (recurrent neural network) and SNN (spiking neural network) / event-based processing.
 
 
 
OBPMark-ML: Machine Learning Benchmarks (2 of 2)
Defined Benchmarks
Three ML benchmarks have been defined:
#ML-1 Cloud Screening: a common application, lately using machine learning (see e.g. Φsat-1 and CHIME missions)
Allows a reduction of up to 40% of the EO optical data downlink, by selecting pixels according to
their cloud content.
Benchmark implemented based on a standard U-Net network.
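The data-selection step that follows the segmentation network can be sketched as below. This is an illustration under stated assumptions: the per-pixel cloud mask is taken as given (1 = cloud), and the tiling scheme, threshold and function names are hypothetical choices, not taken from the OBPMark-ML Technical Note.

```python
def cloud_fraction(mask_tile):
    """Fraction of pixels in the tile that are flagged as cloud (mask value 1)."""
    total = sum(len(row) for row in mask_tile)
    cloudy = sum(sum(row) for row in mask_tile)
    return cloudy / total


def select_tiles(mask, tile, max_cloud_fraction):
    """Return the (row, col) origins of tiles whose cloud fraction is at most the threshold."""
    kept = []
    for r in range(0, len(mask), tile):
        for c in range(0, len(mask[0]), tile):
            block = [row[c:c + tile] for row in mask[r:r + tile]]
            if cloud_fraction(block) <= max_cloud_fraction:
                kept.append((r, c))
    return kept


# Hypothetical 4x4 cloud mask, as a segmentation network might produce it.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
kept = select_tiles(mask, tile=2, max_cloud_fraction=0.25)
reduction = 1 - len(kept) / 4  # share of tiles that would not be downlinked
print(kept, reduction)
```

In this toy case one of the four tiles is fully clouded and would be dropped, so a quarter of the data would not need to be downlinked.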
#ML-2 Ship Detection: an on-board data selection application that has been studied in several studies, including ESA-funded activities.
Application allows optimization of the data to be downlinked, and saves on
bandwidth.
Benchmark implemented based on YOLO architecture.
#ML-3 CME Classification: a solar monitoring application to detect CMEs (Coronal Mass Ejections) by detecting the difference between two images.
Allows alerts to be created when an event occurs. These can be downlinked with
very low bandwidth usage.
 
 
 
Select Benchmark Results (1 of 6)
Overview
Benchmarks were performed on select hardware targets available in the ESA TEC-ED lab OBP & AI Facility
For this work we focused on the OBPMark classical application benchmarks.
More details are provided in the paper.
A few disclaimers on HW utilisation:
On AMD Xilinx Zynq US+ and Versal AI Core, only the ARM Cortex-A cores were
benchmarked. Because of the implementation effort involved, no results using the FPGA fabric or AI
Engines are presented here.
On the Unibap ix5, only the x86 cores were used; the AMD GPU and Myriad X
were not used.
No intrinsic/assembly instruction optimizations were made for SIMD/vector extensions on
CPUs, i.e. ARM NEON and x86 AVX instructions.
The next slides present results per application benchmark – all results are
preliminary, see notes above.
 
 
 
Select Benchmark Results (2 of 6)
Benchmark #1.1: Image Calibration and Correction
Recap: the #1.1 benchmark targets classical on-board
image pre-processing.
Processing is performed on a series of multiple image frames.
Images are corrected, binned and stacked.
Radiation scrubbing uses a multi-frame method.
Image size: 2048x2048 pixels, 16 bits per pixel (input)
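The "corrected, binned and stacked" sequence can be sketched on tiny frames as follows. This is a simplified illustration only: the exact formulas, the bad pixel handling and the multi-frame radiation scrubbing are defined in the OBPMark Technical Note and are not reproduced here, and all function names and values are hypothetical.

```python
def correct(frame, offset, gain):
    """Subtract a per-pixel offset, then apply a per-pixel gain."""
    return [
        [(p - o) * g for p, o, g in zip(frow, orow, grow)]
        for frow, orow, grow in zip(frame, offset, gain)
    ]


def bin2x2(frame):
    """Spatial binning: sum each non-overlapping 2x2 block into one output pixel."""
    return [
        [frame[r][c] + frame[r][c + 1] + frame[r + 1][c] + frame[r + 1][c + 1]
         for c in range(0, len(frame[0]), 2)]
        for r in range(0, len(frame), 2)
    ]


def stack(frames):
    """Temporal binning: sum a series of frames pixel by pixel."""
    return [[sum(px) for px in zip(*rows)] for rows in zip(*frames)]


# Hypothetical 2x2 frames with per-pixel calibration data (integers keep the result exact).
frames = [[[10, 12], [14, 16]], [[11, 13], [15, 17]]]
offset = [[10, 10], [10, 10]]
gain = [[1, 1], [2, 2]]

result = stack([bin2x2(correct(f, offset, gain)) for f in frames])
print(result)
```

Each step touches every pixel independently of its neighbours (apart from the small binning windows), which is consistent with the take-away that these steps parallelise well.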
Take-aways:
Multi-core CPU performance is more or less as
expected for the technology
node, frequency and number of cores used.
The GPU achieves high performance, benefiting from highly parallelisable processing steps.
Results are preliminary. Recap of HW disclaimers:
Only ARM cores used on AMD Xilinx Zynq US+ and
Versal AI.
No GPU used on Unibap ix5.
No SIMD/vector instruction optimizations done on
ARM (NEON) or x86 (AVX).
 
 
 
Select Benchmark Results (3 of 6)
Benchmark #1.2: Radar Image Processing
Recap: the #1.2 benchmark targets on-board SAR
image formation using the range-Doppler algorithm.
Processing is done by performing compression in
the range and azimuth directions.
Each compression requires an FFT and an inverse FFT
of the input data.
Image data size: 2752x14357
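The FFT, matched filter and inverse FFT sequence used for one compression step can be sketched on a single range line as follows. This is a minimal sketch under stated assumptions: a short recursive radix-2 FFT stands in for an optimised FFT library, and the chirp parameters, signal length and delay are hypothetical values chosen for illustration, not benchmark parameters.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. Unscaled in both directions."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def range_compress(echo, reference):
    """FFT both signals, multiply by the conjugate reference spectrum, then inverse FFT."""
    n = len(echo)
    spectrum = [e * r.conjugate() for e, r in zip(fft(echo), fft(reference))]
    return [v / n for v in fft(spectrum, inverse=True)]


# Hypothetical test signal: a 16-sample linear FM chirp, zero-padded to 64 samples.
N, PULSE, DELAY = 64, 16, 20
reference = [cmath.exp(1j * cmath.pi * 0.03 * n * n) if n < PULSE else 0j for n in range(N)]
# The echo is the reference delayed (circularly) by DELAY samples.
echo = [reference[(n - DELAY) % N] for n in range(N)]

compressed = range_compress(echo, reference)
peak = max(range(N), key=lambda n: abs(compressed[n]))
print(peak, abs(compressed[peak]))
```

The compressed output peaks at the sample index equal to the echo delay, which is how the spread-out chirp energy is focused into a single range bin. In the full algorithm the same pattern is applied along the azimuth direction after a corner turn.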
Take-aways:
Multi-core CPU performance is more or less as expected for the technology node, frequency and number of cores used. The impact of using more cores is clearly visible (see e.g. the 16-core LX2160A).
The GPU has similar performance to the multi-core CPUs. Parallelisation is limited by an issue in the GPU implementation, to be fixed in future.
Performance depends on scaling (input size). This is not shown in the data here, but has been previously published.
Results are preliminary. Recap of HW disclaimers:
Only ARM cores used on AMD Xilinx Zynq US+ and
Versal AI.
No GPU used on Unibap ix5.
No SIMD/vector instruction optimizations done on
ARM (NEON) or x86 (AVX).
 
 
 
Select Benchmark Results (4 of 6)
Benchmark #2.1: CCSDS 121.0 Data Compression
Recap: the #2.1 benchmark targets the CCSDS 121.0
data compression algorithm.
Processing is done in two stages: pre-processing
(unit delay predictor) and encoding (adaptive
entropy / Rice encoding).
Image size: 2048x2048 pixels, 16 bits per pixel
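The two stages can be sketched in simplified form as below. This is not a bit-exact CCSDS 121.0 encoder: the error mapping is a plain zigzag mapping without the standard's range-limiting step, the Rice parameter k is fixed by hand instead of being selected adaptively per block, and the function names are hypothetical.

```python
def unit_delay_residuals(samples):
    """Predict each sample as the previous one; return the first sample and the prediction errors."""
    return samples[0], [cur - prev for prev, cur in zip(samples, samples[1:])]


def zigzag(error):
    """Map a signed error to a non-negative integer (0, -1, 1, -2, ... -> 0, 1, 2, 3, ...)."""
    return 2 * error if error >= 0 else -2 * error - 1


def rice_encode(value, k):
    """Rice code as a bit string: the quotient in unary (zeros ended by a one), then the k low bits."""
    quotient = value >> k
    low_bits = format(value & ((1 << k) - 1), f"0{k}b") if k else ""
    return "0" * quotient + "1" + low_bits


# Hypothetical input samples; neighbouring values are close, so the residuals are small.
samples = [100, 102, 101, 101, 105]
first, residuals = unit_delay_residuals(samples)
mapped = [zigzag(e) for e in residuals]
codewords = [rice_encode(v, k=2) for v in mapped]
print(first, residuals, mapped, codewords)
```

Because each prediction depends on the preceding sample, the chain can only be restarted where a reference sample is stored, which matches the observation that parallelism is available only between reference intervals.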
Take-aways:
The algorithm is not highly parallelisable (only between reference intervals).
Single-core CPU performance is more or less as expected for the technology node, frequency and number of cores used.
Multi-core CPU performance is poor: more effort is spent for a fairly low amount of data. Improvements are to be made in future (OpenMP tasking system).
GPU performance depends on the device and on the input size (larger size, higher performance).
Results are preliminary. Recap of HW disclaimers:
Only ARM cores used on AMD Xilinx Zynq US+ and
Versal AI.
No GPU used on Unibap ix5.
No SIMD/vector instruction optimizations done on
ARM (NEON) or x86 (AVX).
 
 
 
Select Benchmark Results (5 of 6)
Benchmark #2.2: CCSDS 122.0 Image Compression
Recap: the #2.2 benchmark targets CCSDS 122.0
image compression.
Processing is done in two steps: a DWT (discrete
wavelet transform) and an encoding stage (bit-plane encoding).
Image size: 2048x2048 pixels, 16 bits per pixel
Take-aways:
The DWT stage is parallelisable, but the encoding stage is not easily parallelisable, leading to poor performance in the non-sequential implementations.
Multi-core and GPU implementations suffer from the poor parallelisation: very large input sizes are needed to offset the initialization cost of running the application.
Tested with the DWT on GPU/multi-core, and the encoding on a single-core CPU.
…or use the hardware accelerators for compression (JPEG2000) in current SoCs (e.g. NVIDIA).
Results are preliminary. Recap of HW disclaimers:
Only ARM cores used on AMD Xilinx Zynq US+ and
Versal AI.
No GPU used on Unibap ix5.
No SIMD/vector instruction optimizations done on
ARM (NEON) or x86 (AVX).
 
 
 
Select Benchmark Results (6 of 6)
Benchmark #3.1: AES Encryption
Recap: the #3.1 benchmark targets AES encryption.
Uses a custom implementation of the standard.
Input size: 16777216 words (similar to a 4096 x
4096 image)
Key length: 256-bit
Take-aways:
Multi-core CPU performance is more or less as expected for the technology node, frequency and number of cores used.
GPU performance is similar to multi-core performance, and scales with input size because of the initialization cost.
Vendor-specific improvements (e.g. dedicated instructions on ARM and x86) can give performance benefits when using optimized implementations (not done here).
Results are preliminary. Recap of HW disclaimers:
Only ARM cores used on AMD Xilinx Zynq US+ and
Versal AI.
No GPU used on Unibap ix5.
No SIMD/vector instruction optimizations done on
ARM (NEON) or x86 (AVX).
 
 
 
OBPMark & OBPMark-ML: Summary
A lack of openly available, general and reusable performance benchmarks for space applications has been identified.
Currently, benchmarks are application-specific and (often) closed source.
The number of different devices used for processing on-board spacecraft is increasing, making accurate comparison of computational performance difficult.
A new suite of Open Source Computational Performance Benchmarks for Space Applications has been developed: OBPMark (On-Board Processing Benchmarks)
Both OBPMark and OBPMark-Kernels / GPU4S Bench are available on the public git repositories.
See OBPMark.org for more information
 Currently in beta mode, seeking community feedback from early adopters
OBPMark-ML is currently available in a private ESA repository (access for European entities).
A public release is planned in the near future.
Want to start benchmarking? Get in contact: OBPMark@esa.int
Community feedback is welcome
Porting and optimization of benchmark algorithms to new HW targets is supported by ESA.
Next phase: GSTP element 1 Compendium (2022) activity:
GT17-620ED - OBPMark - On-Board Processing Benchmarks, 800kEUR (to be initiated based on industry interest)
Interested? Get in contact: David.Steenari@esa.int
 
 
 
 
 
 
 
 
Thank you for your attention!
Contact: OBPMark@esa.int, David.Steenari@esa.int


Table (slide 6): OBPMark application benchmarks
ID | Benchmark Name | Sub ID | Sub-Benchmark Name | Status
#1 | Image Processing | #1.1 | Image Calibration and Correction | Available
#1 | Image Processing | #1.2 | Radar Image Processing | Available
#2 | Standard Compression | #2.1 | CCSDS 121.0 Data Compression | Available
#2 | Standard Compression | #2.2 | CCSDS 122.0 Image Compression | Available
#2 | Standard Compression | #2.3 | CCSDS 123.0 Hyperspectral Image Compression | Defined
#3 | Standard Encryption | #3.1 | AES Encryption | Available
#4 | Signal Processing | #4.1 | DVB-S2X Demodulation | TBA
#4 | Signal Processing | #4.2 | DVB-S2X Modulation | TBA

Figure (slide 7): Benchmark #1.1 processing chain steps: Offset Correction, Bad Pixel Correction, Radiation Scrubbing, Gain Correction, Spatial Binning, Temporal Binning.
Figure (slide 7): Benchmark #1.2 processing chain steps: Range Compression (Range FFT, Range Matched Filter, Range iFFT), Corner Turn, Azimuth Compression, Multilook.


Table (slide 11): OBPMark-ML benchmarks
ID | Benchmark Name | Type | Status
#ML-1 | Cloud Screening | Semantic Segmentation | Available
#ML-2 | Ship Detection | Object Detection | Available
#ML-3 | CME Classification | Image Classification | Available


  13. Select Benchmark Results 1 of 6: Overview. Benchmarks were performed on select hardware targets available in the ESA TEC-ED lab OBP & AI Facility. This work focuses on the OBPMark classical application benchmarks; more details are provided in the paper. Disclaimers on HW utilisation: on the AMD Xilinx Zynq US+ and Versal AI Core, only the ARM Cortex-A cores were benchmarked; due to the implementation effort, no results using the FPGA fabric or the AI Engines are presented here. On the Unibap ix5, only the x86 cores were used; the AMD GPU and Myriad X were not used. No intrinsic/assembly optimizations for SIMD/vector extensions were applied on the CPUs, i.e. ARM NEON and x86 AVX instructions. Benchmarked devices (provider, device, type, class): NXP / Teledyne e2v LS1046A, quad-core ARM A72, RT; NXP / Teledyne e2v LX2160A, 16-core ARM A72, RT; AMD Xilinx Zynq US+, CPU/FPGA SoC, COTS; AMD Xilinx Versal AI Core, CPU/FPGA SoC, RT; NVIDIA Xavier NX, CPU/GPU SoC, COTS; NVIDIA Xavier IND, CPU/GPU SoC, COTS; NVIDIA Orin AGX, CPU/GPU SoC, COTS. The next slides present results per application benchmark; all results are preliminary, see the disclaimers above.

  14. Select Benchmark Results 2 of 6: Benchmark #1.1 Image Calibration and Correction. Recap: the #1.1 benchmark targets classical on-board image pre-processing. Multiple image frames are processed in series: images are corrected, binned and stacked. Radiation scrubbing uses a multi-frame method. Image size: 2048x2048 pixels, 16 bits per pixel (input). Take-aways: multi-core CPU performance is more or less as expected for the technology node, frequency and number of cores used. The GPU has high performance, benefiting from the highly parallelisable processing steps. Results are preliminary; recap of HW disclaimers: only ARM cores used on AMD Xilinx Zynq US+ and Versal AI; no GPU used on Unibap ix5; no SIMD/vector instruction optimizations done on ARM (NEON) or x86 (AVX). Figure: Benchmark #1.1 Image Calibration and Correction processing chain (Offset Correction, Bad Pixel Correction, Radiation Scrubbing, Gain Correction, Spatial Binning, Temporal Binning).
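
The processing chain named in the figure can be sketched on tiny frames as below. This is a hedged illustration, not the OBPMark reference code: the neighbour-mean bad pixel rule, the temporal-median scrubbing rule, the 2x2 binning factor and every function name are assumptions chosen to keep the example short.

```python
from statistics import median_low


def offset_correction(frame, offset):
    return [[p - o for p, o in zip(row, off_row)] for row, off_row in zip(frame, offset)]


def bad_pixel_correction(frame, bad_mask):
    """Replace each flagged pixel by the mean of its valid 4-neighbours."""
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for y in range(h):
        for x in range(w):
            if bad_mask[y][x]:
                near = [frame[j][i]
                        for j, i in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= j < h and 0 <= i < w and not bad_mask[j][i]]
                out[y][x] = sum(near) // len(near) if near else 0
    return out


def radiation_scrubbing(frames, threshold):
    """Multi-frame outlier rejection: clamp hits to the per-pixel temporal median."""
    out = [[row[:] for row in f] for f in frames]
    for y in range(len(frames[0])):
        for x in range(len(frames[0][0])):
            ref = median_low([f[y][x] for f in frames])
            for f in out:
                if f[y][x] - ref > threshold:
                    f[y][x] = ref
    return out


def gain_correction(frame, gain):
    return [[p * g for p, g in zip(row, g_row)] for row, g_row in zip(frame, gain)]


def spatial_binning(frame):
    """Sum non-overlapping 2x2 pixel blocks."""
    return [[frame[y][x] + frame[y][x + 1] + frame[y + 1][x] + frame[y + 1][x + 1]
             for x in range(0, len(frame[0]), 2)]
            for y in range(0, len(frame), 2)]


def temporal_binning(frames):
    """Stack frames by summing them pixel-wise."""
    return [[sum(px) for px in zip(*rows)] for rows in zip(*frames)]


def calibrate(frames, offset, bad_mask, gain, threshold):
    frames = [bad_pixel_correction(offset_correction(f, offset), bad_mask) for f in frames]
    frames = radiation_scrubbing(frames, threshold)
    frames = [spatial_binning(gain_correction(f, gain)) for f in frames]
    return temporal_binning(frames)


# Three hypothetical 2x2 frames: pixel (1,1) is a known bad pixel,
# and the middle frame has a simulated radiation hit at pixel (0,1).
raw = [
    [[110, 110], [110, 999]],
    [[110, 510], [110, 999]],
    [[110, 110], [110, 999]],
]
result = calibrate(
    raw,
    offset=[[10, 10], [10, 10]],
    bad_mask=[[0, 0], [0, 1]],
    gain=[[2, 2], [2, 2]],
    threshold=50,
)
print(result)
```

Every stage except radiation scrubbing and temporal binning touches each frame independently, and all stages are independent per pixel or per block, which is consistent with the slide's remark that the steps parallelise well.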

  15. Select Benchmark Results 3 of 6: Benchmark #1.2 Radar Image Processing. Recap: the #1.2 benchmark targets on-board SAR image formation using the range-Doppler algorithm. Processing is done by performing compression in the range and azimuth directions. Each compression requires an FFT and an inverse FFT of the input data. Image data size: 2752x14357. Take-aways: multi-core CPU performance is more or less as expected for the technology node, frequency and number of cores used; the impact of using multiple cores is clearly visible (see e.g. the 16-core LX2160A). The GPU has similar performance to the multi-core CPUs; parallelisation is limited by an issue in the GPU implementation, to be fixed in the future. Performance depends on scaling (input size); this is not shown in the data here, but has been previously published. Results are preliminary; recap of HW disclaimers: only ARM cores used on AMD Xilinx Zynq US+ and Versal AI; no GPU used on Unibap ix5; no SIMD/vector instruction optimizations done on ARM (NEON) or x86 (AVX). Figure: Benchmark #1.2 Radar Image Processing processing chain (Range Compression, Corner Turn, Azimuth Compression, Multilook).
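
The FFT, multiply, inverse-FFT pattern behind each compression step can be sketched for a single range line as follows. This is a generic frequency-domain matched filter written for illustration; the chirp parameters, the radix-2 FFT and the function names are assumptions and are not taken from the OBPMark implementation.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def ifft(x):
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def compress(signal, reference):
    """Matched filter: IFFT(FFT(signal) * conj(FFT(zero-padded reference)))."""
    padded = list(reference) + [0j] * (len(signal) - len(reference))
    spectrum = fft(signal)
    ref_spectrum = fft(padded)
    return ifft([s * r.conjugate() for s, r in zip(spectrum, ref_spectrum)])


# Hypothetical 8-sample chirp, received with a delay of 5 samples in a 32-sample line.
chirp = [cmath.exp(1j * cmath.pi * 0.1 * n * n) for n in range(8)]
delay, length = 5, 32
echo = [0j] * delay + chirp + [0j] * (length - delay - len(chirp))

compressed = compress(echo, chirp)
peak = max(range(length), key=lambda m: abs(compressed[m]))
print(peak, abs(compressed[peak]))
```

The energy of the chirp collapses into a single peak at the delay of the echo. In a full range-Doppler chain the same operation is repeated for every range line, followed by the corner turn and the equivalent step in azimuth.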

  16. Select Benchmark Results 4 of 6: Benchmark #2.1 CCSDS 121.0 Data Compression. Recap: the #2.1 benchmark targets the CCSDS 121.0 data compression algorithm. Processing is done in two stages: pre-processing (unit delay predictor) and encoding (adaptive entropy / Rice encoding). Image size: 2048x2048 pixels, 16 bits per pixel. Take-aways: the algorithm is not highly parallelisable (only between reference intervals). Single-core CPU performance is more or less as expected for the technology node, frequency and number of cores used. Multi-core CPU performance is poor: more effort is spent for a fairly low amount of data; improvements are to be made in the future (OpenMP tasking system). GPU performance depends on the device and the input size (larger size, higher performance). Results are preliminary; recap of HW disclaimers: only ARM cores used on AMD Xilinx Zynq US+ and Versal AI; no GPU used on Unibap ix5; no SIMD/vector instruction optimizations done on ARM (NEON) or x86 (AVX).
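
The two stages named on this slide can be sketched for one block as below. This is a simplified illustration and not a conformant CCSDS 121.0 encoder: the zigzag residual mapping, the interleaved bit layout, the exhaustive search for the Rice parameter `k` and all function names are assumptions; the standard defines its own mapper, option set and bitstream format.

```python
def zigzag(r):
    """Map a signed residual to a non-negative integer (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return 2 * r if r >= 0 else -2 * r - 1


def unzigzag(m):
    return m // 2 if m % 2 == 0 else -(m + 1) // 2


def rice_encode(values, k):
    """Unary-coded quotient (zeros closed by a one) followed by the k low bits."""
    bits = []
    for m in values:
        bits.append("0" * (m >> k) + "1")
        if k:
            bits.append(format(m & ((1 << k) - 1), f"0{k}b"))
    return "".join(bits)


def rice_decode(bits, k, count):
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "0":
            q += 1
            pos += 1
        pos += 1                                  # skip the terminating one
        low = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | low)
    return values


def encode_block(samples, max_k=13):
    """Unit-delay prediction from a raw reference sample, then adaptive Rice coding."""
    reference = samples[0]
    residuals = [cur - prev for prev, cur in zip(samples, samples[1:])]
    mapped = [zigzag(r) for r in residuals]
    k = min(range(max_k + 1), key=lambda kk: len(rice_encode(mapped, kk)))
    return reference, k, rice_encode(mapped, k)


def decode_block(reference, k, bits, count):
    samples = [reference]
    for m in rice_decode(bits, k, count - 1):
        samples.append(samples[-1] + unzigzag(m))
    return samples


# Hypothetical block of eight 16-bit samples.
block = [1000, 1002, 1001, 1001, 1005, 1004, 1004, 1003]
reference, k, bits = encode_block(block)
decoded = decode_block(reference, k, bits, len(block))
print(k, len(bits), decoded == block)
```

Each sample is predicted from its predecessor, so decoding within a block is inherently sequential; only blocks that start from their own reference sample can be processed independently, which matches the slide's remark about limited parallelism.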

  17. Select Benchmark Results 5 of 6: Benchmark #2.2 CCSDS 122.0 Image Compression. Recap: the #2.2 benchmark targets CCSDS 122.0 image compression. Processing is done in two steps: a DWT (discrete wavelet transform) and an encoding stage (bit-plane encoding). Image size: 2048x2048 pixels, 16 bits per pixel. Take-aways: the DWT stage is parallelisable, but the encoding stage is not easily parallelisable, leading to poor performance in non-sequential implementations. Multi-core and GPU implementations suffer from the poor parallelisation: huge input sizes are needed to offset the penalty of the initialization cost of running the application. Tested with the DWT on GPU/multi-core and the encoding on a single-core CPU. An alternative is to use the hardware accelerators for compression (JPEG2000) in current SoCs (e.g. NVIDIA). Results are preliminary; recap of HW disclaimers: only ARM cores used on AMD Xilinx Zynq US+ and Versal AI; no GPU used on Unibap ix5; no SIMD/vector instruction optimizations done on ARM (NEON) or x86 (AVX).
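
Why the DWT stage parallelises well can be seen in a small sketch of a one-level, two-dimensional integer wavelet transform. Note the substitution: an integer Haar lifting step is used here as a simple stand-in, not the wavelet filters specified in CCSDS 122.0, and the function names are assumptions.

```python
def haar_forward_1d(row):
    """One lifting level: floor averages first, differences second."""
    approx, detail = [], []
    for a, b in zip(row[0::2], row[1::2]):
        detail.append(a - b)
        approx.append((a + b) >> 1)
    return approx + detail


def haar_inverse_1d(coeffs):
    half = len(coeffs) // 2
    row = []
    for s, d in zip(coeffs[:half], coeffs[half:]):
        a = s + ((d + 1) >> 1)
        row += [a, a - d]
    return row


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def dwt2d(image):
    rows = [haar_forward_1d(r) for r in image]                        # rows are independent
    return transpose([haar_forward_1d(c) for c in transpose(rows)])   # so are columns


def idwt2d(coeffs):
    rows = transpose([haar_inverse_1d(c) for c in transpose(coeffs)])
    return [haar_inverse_1d(r) for r in rows]


# Hypothetical 4x4 image with even dimensions.
image = [
    [10, 12, 200, 202],
    [11, 13, 201, 199],
    [50, 50, 50, 50],
    [52, 48, 51, 49],
]
coeffs = dwt2d(image)
print(coeffs)
print(idwt2d(coeffs) == image)
```

Each row pass and each column pass consists of independent one-dimensional transforms, so they map naturally onto threads or GPU work items. The bit-plane encoder that follows consumes the coefficients in a dependent order, which is the bottleneck the slide describes.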

  18. Select Benchmark Results 6 of 6: Benchmark #3.1 AES Encryption. Recap: the #3.1 benchmark targets AES encryption. It uses a custom implementation of the standard. Input size: 16777216 words (similar to a 4096 x 4096 image). Key length: 256 bits. Take-aways: multi-core CPU performance is more or less as expected for the technology node, frequency and number of cores used. GPU performance is similar to multi-core performance and scales with input size, due to the initialization cost. Vendor-specific improvements (e.g. dedicated instructions on ARM and x86) can give performance benefits when using optimized implementations (not done here). Results are preliminary; recap of HW disclaimers: only ARM cores used on AMD Xilinx Zynq US+ and Versal AI; no GPU used on Unibap ix5; no SIMD/vector instruction optimizations done on ARM (NEON) or x86 (AVX).
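
As an illustration of what a software-only AES-256 implementation involves, the sketch below encrypts 16-byte blocks following the public FIPS 197 description. It is not the OBPMark code: the function names are assumptions, the slide does not state the mode of operation, and processing blocks independently (as in `encrypt_blocks`) is an assumption made only to show where the parallelism comes from. It is unoptimized and not intended for production use.

```python
def xtime(a):
    """Multiply by x in GF(2^8) modulo the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def _sbox_entry(x):
    inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)   # inverse, 0 -> 0
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def expand_key(key):
    """AES-256 key schedule: 32 key bytes -> 15 round keys of 16 bytes."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(8)]
    rcon = 1
    for i in range(8, 60):
        t = list(w[i - 1])
        if i % 8 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]      # RotWord, then SubWord
            t[0] ^= rcon
            rcon = xtime(rcon)
        elif i % 8 == 4:
            t = [SBOX[b] for b in t]
        w.append([a ^ b for a, b in zip(w[i - 8], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(15)]


def shift_rows(s):
    # The state is stored column by column: index = 4 * column + row.
    return [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += [gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3) ^ a[(r + 2) % 4] ^ a[(r + 3) % 4]
                for r in range(4)]
    return out


def encrypt_block(block, round_keys):
    s = [b ^ k for b, k in zip(block, round_keys[0])]
    for rnd in range(1, 14):
        s = mix_columns(shift_rows([SBOX[b] for b in s]))
        s = [b ^ k for b, k in zip(s, round_keys[rnd])]
    s = shift_rows([SBOX[b] for b in s])              # final round has no MixColumns
    return bytes(b ^ k for b, k in zip(s, round_keys[14]))


def encrypt_blocks(data, round_keys):
    """Encrypt each 16-byte block independently (assumed, ECB-style)."""
    return b"".join(encrypt_block(data[i:i + 16], round_keys)
                    for i in range(0, len(data), 16))


# Example vector for AES-256 from FIPS 197.
key = bytes(range(32))
plaintext = bytes.fromhex("00112233445566778899aabbccddeeff")
round_keys = expand_key(key)
ciphertext = encrypt_block(plaintext, round_keys)
print(ciphertext.hex())
```

Because the round keys are computed once and each block is then transformed without reference to its neighbours, the work divides evenly across cores or GPU threads. The table lookups and GF(2^8) multiplications are the operations that dedicated AES instructions on ARM and x86 replace.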

  19. OBPMark & OBPMark-ML Summary. A lack of openly available, general and reusable performance benchmarks for space applications has been identified: current benchmarks are application-specific and (often) closed source. The number of different devices used for processing on board spacecraft is increasing, making accurate comparison of computational performance difficult. A new suite of open-source computational performance benchmarks for space applications has been developed: OBPMark (On-Board Processing Benchmarks). Both OBPMark and OBPMark-Kernels / GPU4S Bench are available in public git repositories; see OBPMark.org for more information. They are currently in beta, seeking community feedback from early adopters. OBPMark-ML is currently available in a private ESA repository, with access for European entities; a public release is planned for the near future. Want to start benchmarking? Get in contact: OBPMark@esa.int. Community feedback is welcome. Porting and optimization of the benchmark algorithms to new HW targets is supported by ESA. Next phase: GSTP Element 1 Compendium (2022) activity GT17-620ED, "OBPMark - On-Board Processing Benchmarks", 800 kEUR, to be initiated based on industry interest. Interested? Get in contact: David.Steenari@esa.int

  20. OBPMark and OBPMark-ML: Computational Benchmarks for On-Board Data Processing and Machine Learning in Space Applications. David Steenari (1), Ivan Rodriguez-Ferrandez (1,2,3), Sanath Muret (1), Marc Solé Bonet (2,3), Jannis Wolf (2,3), Luis Mansilla (1), Leonidas Kosmidis (2,3). (1) ESA/ESTEC, (2) Barcelona Supercomputing Center (BSC), (3) Universitat Politècnica de Catalunya (UPC). Thank you for your attention! Contact: OBPMark@esa.int, David.Steenari@esa.int. ESA ESTEC, 03/10/2023. ESA UNCLASSIFIED - For ESA Official Use Only
