Bayesian Optimization at LCLS Using Gaussian Processes
Optimization of FEL pulse energy

Joseph Duris, Mitch McIntire, Daniel Ratner

ICFA mini-workshop on machine learning for accelerators
Feb 28, 2018
 
 
2

Bayesian Optimization
Current approach to tuning:
One objective: FEL pulse energy
Mostly operator controlled
Optimization is slow and costly

Ocelot optimizer
Collaboration with DESY
Local simplex optimizer
Small batches of devices

Example opportunities for time savings:

Action                Time (mins)   Controller   Search space
Config change         10            Operators    small
Tune to find FEL      5-10          Operators    large
Tune quads            15            Simplex      24
Undulator tuning      5-10          Operators    30
Pointing / focusing   5             Operators    small
 
3

Bayesian Optimization
Tuning strategy tradeoffs

Human optimization
- mental models
- experience
- (relatively) slow execution

Numerical optimization
- fast execution
- blind, local search; limited search space

Bayesian optimization
- Gaussian process provides probabilistic model
- GP probabilities + Bayes rule enables use of prior knowledge
- Acquisition function uses resulting probabilities to guide search
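To make these three ingredients concrete, here is a minimal sketch of one way to run such a loop in Python (illustrative only, not the LCLS/Ocelot implementation): a GP surrogate is fit to the observed settings and pulse energies, and an expected-improvement acquisition function picks the next point. The toy objective measure_pulse_energy, the two-device search space, and the random candidate search are assumptions.

```python
# Minimal Bayesian-optimization sketch (illustrative only, not the LCLS/Ocelot code).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def expected_improvement(mu, sigma, best_y, xi=0.01):
    """EI acquisition: expected gain over the best observed value at each candidate."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def measure_pulse_energy(x):
    # Stand-in for the real FEL pulse-energy measurement (toy objective plus noise).
    return np.exp(-np.sum((x - 0.3) ** 2)) + 0.05 * np.random.randn()

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 2))                  # initial settings for 2 devices
y = np.array([measure_pulse_energy(x) for x in X])   # measured pulse energies

gp = GaussianProcessRegressor(kernel=RBF(0.5) + WhiteKernel(0.01), normalize_y=True)
for _ in range(20):
    gp.fit(X, y)                                     # probabilistic model of the data
    candidates = rng.uniform(-1, 1, size=(500, 2))   # crude random candidate search
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, x_next])                       # acquire the suggested point
    y = np.append(y, measure_pulse_energy(x_next))

print("best setting:", X[np.argmax(y)], "best energy:", y.max())
```

In practice the candidate search, noise handling, and device limits would of course follow the conventions of the real optimizer.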
 
4

Prior mean

The GP needs two key ingredients: a kernel and a prior mean. Ample training data are
available from the wide range of tuned configurations in historical tuning.

Widths of scans          => GP kernel parameters
Trends in peaks of scans => GP prior mean

The prior mean biases and constrains the search.
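As a hedged illustration of how a prior mean can bias and constrain the search, the sketch below fits a simple quadratic prior mean to hypothetical historical scan data and lets the GP model only the residuals, so predictions revert to the prior rather than to zero away from the sampled points. The quadratic form and the toy arrays are assumptions, not the actual LCLS prior.

```python
# Sketch: fit a prior mean to historical scans, then regress the GP on residuals so that
# predictions revert to the prior (not to zero) far from the new data.
# The quadratic prior form and the toy arrays are assumptions for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
hist_X = rng.uniform(-1, 1, size=(200, 1))                                  # historical quad settings
hist_y = np.exp(-4 * hist_X[:, 0] ** 2) + 0.05 * rng.standard_normal(200)  # historical energies

coeffs = np.polyfit(hist_X[:, 0], hist_y, deg=2)   # simple quadratic prior mean m(x)

def prior_mean(X):
    return np.polyval(coeffs, X[:, 0])

# New tuning session with only two fresh measurements; the GP models y - m(x).
X_new = np.array([[-0.8], [0.9]])
y_new = np.exp(-4 * X_new[:, 0] ** 2) + 0.05 * rng.standard_normal(2)
gp = GaussianProcessRegressor(kernel=RBF(0.3) + WhiteKernel(0.01))
gp.fit(X_new, y_new - prior_mean(X_new))

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
print(gp.predict(X_test) + prior_mean(X_test))     # prediction = prior mean + GP correction
```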
 
5

Using the prior mean to tune up from noise

We used a Bayes prior built from summer 2017 data to tune up a brand new config from
noise. Simplex could not do this, as it needs a signal to tune on.

Plots show beam power and L3 energy during the change from 14 to 6.5 GeV, GDET near the
noise level (~50 uJ), and GP runs on the LI26 and LTU quads.
 
6

Sampling statistics

Measure FEL pulse energy with the gas detector.
Collection efficiency is small => shot noise dominates.
Variance is proportional to amplitude => the standard deviation is wider at high signal.
Moral of the story: sample near the high end of the distribution.
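The sketch below illustrates the point with a toy shot-noise model: summarizing 120 simulated detector readings by an upper percentile rather than the mean emphasizes the high end of the distribution. The noise model and its parameters are assumptions for illustration only.

```python
# Toy shot-noise model: variance grows with amplitude, so an upper percentile of the
# shot distribution is a useful objective. The model parameters are assumptions.
import numpy as np

rng = np.random.default_rng(2)

def acquire_shots(true_energy_mJ, n_shots=120):
    # Noise std grows with the signal, mimicking shot-noise-dominated detection.
    return true_energy_mJ + 0.1 * np.sqrt(true_energy_mJ) * rng.standard_normal(n_shots)

shots = acquire_shots(1.0)
print(f"mean of 120 shots            = {shots.mean():.3f} mJ")
print(f"80th percentile of 120 shots = {np.percentile(shots, 80):.3f} mJ")
```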
 
7

Tuning 12 quads starting with 30% of peak FEL

Plots compare Simplex against GP (expected improvement, Jan 2018 prior), showing both the
mean and the 80th percentile of 120 shots.
 
8

Tuning 12 quads starting with 10% of peak FEL

Plots compare Simplex, GP with expected improvement (Jan 2018 prior), and GP with UCB
(Jan 2018 prior), showing both the mean and the 80th percentile of 120 shots.
 
9

Accommodating correlations between devices

Plot: FEL vs quads with an RBF kernel (n x n kernel matrix for n devices).

The current implementation uses a diagonal kernel matrix => it ignores correlations
between quads.
One approach: vary the kernel matrix elements to maximize the marginal likelihood for a
set of prior scans.
Another approach: map x to a linearly independent basis y with a diagonal kernel matrix.
A sketch contrasting diagonal and full kernel matrices follows.
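A minimal sketch of the difference, assuming a squared-exponential form: with a full precision matrix M in the exponent the kernel couples devices, whereas a diagonal M treats them independently. The numerical values are illustrative.

```python
# Squared-exponential kernel with a full vs. diagonal precision matrix M (values illustrative).
import numpy as np

def rbf_kernel(X1, X2, M):
    """k(x, x') = exp(-0.5 * (x - x')^T M (x - x')) evaluated for all pairs of rows."""
    diffs = X1[:, None, :] - X2[None, :, :]
    quad = np.einsum("ijk,kl,ijl->ij", diffs, M, diffs)
    return np.exp(-0.5 * quad)

M_diag = np.diag([1.0 / 0.3**2, 1.0 / 0.5**2])        # independent per-quad lengthscales
M_full = M_diag + np.array([[0.0, 4.0], [4.0, 0.0]])  # off-diagonal terms couple the quads

X = np.array([[0.0, 0.0], [0.2, -0.2]])               # two settings of two quads
print(rbf_kernel(X, X, M_diag))
print(rbf_kernel(X, X, M_full))  # correlations change how similar two settings look
```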
 
10

Tuning the kernel parameters

Gaussian process kernel fitting is done via maximum likelihood estimation. The maximum
likelihood estimator (MLE) is useful since it is unbiased, consistent, and asymptotically
normal.

The GP gives the likelihood of the observations y given the samples X and the kernel
parameters γ.
Each scan is potentially a unique set of points, yet the characteristic shape of the
function remains similar.
Determine the kernel parameters (and their uncertainties) which maximize the net
likelihood for a group of scans (work in progress); a sketch of this idea follows.
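A minimal sketch of the net-likelihood idea, assuming a 1-D RBF kernel and toy scan data: the log marginal likelihoods of several scans are summed and maximized over a shared set of hyperparameters.

```python
# Sketch: choose shared kernel hyperparameters by maximizing the summed GP log marginal
# likelihood over several scans. The scans and the 1-D RBF kernel are toy stand-ins.
import numpy as np
from scipy.optimize import minimize

def log_marginal_likelihood(X, y, lengthscale, sigma_f, sigma_n):
    d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
    K = sigma_f**2 * np.exp(-0.5 * d2 / lengthscale**2) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # K^{-1} y via Cholesky
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

# Several scans: different sample points, similar characteristic shape.
rng = np.random.default_rng(3)
scans = []
for _ in range(5):
    X = rng.uniform(-1, 1, size=(15, 1))
    y = np.exp(-4 * X[:, 0] ** 2) + 0.05 * rng.standard_normal(15)
    scans.append((X, y))

def neg_total_lml(log_params):
    lengthscale, sigma_f, sigma_n = np.exp(log_params)
    return -sum(log_marginal_likelihood(X, y, lengthscale, sigma_f, sigma_n)
                for X, y in scans)

res = minimize(neg_total_lml, x0=np.log([0.5, 1.0, 0.1]), method="Nelder-Mead")
print("fitted lengthscale, sigma_f, sigma_n:", np.exp(res.x))
```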
 
11

Testing acquisition strategies offline

A Monte Carlo platform lets us investigate optimization strategies offline (e.g.,
acquisition function parameters).

Plots: 50 Gaussian processes on simulated data (1 mJ peak) with 8 quads; Simplex
optimizing 10 quads; GP with a prior mean optimizing 10 quads.

Optimize trends in offline MC runs to determine the best acquisition function parameters;
a toy harness is sketched below.
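A toy version of such a harness might look like the following: simulated objectives are drawn at random, a GP with an upper-confidence-bound (UCB) acquisition is run on each, and the average best value found is compared across acquisition parameters (here the UCB weight kappa). Everything in this sketch is a stand-in for the real simulation platform.

```python
# Toy Monte Carlo harness for comparing acquisition-function parameters offline.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def make_objective(rng, n_dim=4):
    center = rng.uniform(-0.5, 0.5, n_dim)          # random optimum for each MC trial
    return lambda x: np.exp(-np.sum((x - center) ** 2)) + 0.03 * rng.standard_normal()

def run_gp_ucb(objective, rng, kappa, n_iter=25, n_dim=4):
    X = rng.uniform(-1, 1, size=(3, n_dim))
    y = np.array([objective(x) for x in X])
    gp = GaussianProcessRegressor(kernel=RBF(0.5) + WhiteKernel(0.01), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(-1, 1, size=(300, n_dim))
        mu, sd = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + kappa * sd)]   # UCB acquisition
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))
    return y.max()

rng = np.random.default_rng(4)
for kappa in (0.5, 2.0, 5.0):
    best = [run_gp_ucb(make_objective(rng), rng, kappa) for _ in range(10)]
    print(f"kappa={kappa}: mean best over 10 MC runs = {np.mean(best):.3f}")
```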
 
12

GP progress

Prior mean: fits to aggregate historical data

Kernel parameters:
Simultaneous GP likelihood maximization for a group of scans determines the kernel parameters
Monte Carlo sims => tune acquisition function parameters

Expand use-cases:
Tune quads to minimize beam losses
Tune quads and the undulator taper to maximize FEL pulse energy
Self-seeding optics vs. FEL peak brightness
Control x-ray optics to maximize experimental signal
 
Extra slides
 
14

Improving GP flexibility

Arbitrary devices may not be related linearly. Example: FEL ~ f(quads / energy).
A neural network kernel is a non-linear mapping which can help capture these more
complicated relations between input parameters.

Rich adaptive basis functions from the DNN map on the inputs structure the data.
A GP is equivalent to a single network layer with infinitely many nodes.

Deep Kernel Learning. Wilson, Hu, Salakhutdinov, Xing. arXiv:1511.02222

The LDRD is now supporting a Stanford CS masters student (Mitch McIntire) working on
implementing this kernel to improve the generality of the GP to arbitrary inputs.
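A minimal sketch of the deep-kernel idea: inputs are passed through a small non-linear map and an RBF kernel is applied in the mapped space. The tiny fixed-weight network below is a placeholder; in deep kernel learning (Wilson et al., arXiv:1511.02222) the map is trained jointly with the GP.

```python
# Sketch of a neural-network (deep) kernel: k(x, x') = RBF(g(x), g(x')) for a non-linear map g.
# The fixed random weights are stand-ins for a trained network.
import numpy as np

rng = np.random.default_rng(5)
W1, b1 = rng.standard_normal((2, 8)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((8, 2)), rng.standard_normal(2)

def g(X):
    """Non-linear feature map: 2 inputs -> 2 features (weights fixed here, trained in DKL)."""
    h = np.tanh(X @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def deep_rbf_kernel(X1, X2, lengthscale=1.0):
    Z1, Z2 = g(X1), g(X2)
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

X = rng.uniform(-1, 1, size=(4, 2))   # e.g., (quad strength, beam energy) pairs
print(deep_rbf_kernel(X, X))          # covariance evaluated in the warped feature space
```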
 
15

Model selection

Gaussian process kernel fitting is done via maximum likelihood estimation. The maximum
likelihood estimator (MLE) is useful since it
is efficient,
is asymptotically normal,
is unbiased,
is consistent.

Furthermore, since the likelihood is normalized, it automatically regularizes.

This gives us a statistically sound way to choose our model and to give meaningful
parameter confidence interval estimates.

Example:
 
 
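As one hedged illustration of this kind of likelihood-based model selection (a toy sketch, not the example shown on the slide), two candidate kernels are fit to the same scan and compared by their fitted log marginal likelihoods; the data and kernel choices are assumptions.

```python
# Toy model selection: compare candidate kernels by their log marginal likelihood.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.exp(-4 * X[:, 0] ** 2) + 0.05 * rng.standard_normal(30)

for name, kernel in [("RBF", RBF(0.5) + WhiteKernel(0.01)),
                     ("Matern 3/2", Matern(0.5, nu=1.5) + WhiteKernel(0.01))]:
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    print(f"{name}: log marginal likelihood = {gp.log_marginal_likelihood_value_:.2f}")
```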
16

Gaussian Process components

Bayesian optimization with instance-based learning:
Gaussian process provides a probabilistic model based on kernel learning
GP probabilities + Bayes rule enables use of prior knowledge
Acquisition function uses resulting probabilities to guide search

Plot labels: ground truth, posterior, acquisition function, acquisition points.