Understanding Gaussian Processes: A Comprehensive Overview
Gaussian Processes (GPs) have wide applications in statistics and machine learning, including regression, spatial interpolation and uncertainty quantification. This transcript covers the nature of GPs, how the statistics and machine learning communities use and interpret them, the modelling of mean and covariance functions, and their role in emulation, model discrepancy and expert knowledge elicitation.
Presentation Transcript
Gaussian Processes I have known
Tony O'Hagan
Cambridge Ellis Unit Summer School, 17 July 2023
Outline
- GPs in statistics and machine learning
- Regression
- Radiocarbon dating
- Spatial interpolation
- Uncertainty quantification for computer models
  - Emulators
  - Model discrepancy
- Expert knowledge elicitation
- Doing science
  - Model discrepancy revisited
  - Bayes and subjectivity
GPs in statistics and machine learning
GPs
- A Gaussian process is a probability distribution for an unknown function f(.)
- At any input x, f(x) is normal with mean m(x) and variance v(x, x)
- At any finite set of points {x1, ..., xn}, {f(x1), ..., f(xn)} is multivariate normal, with covariances v(xi, xj)
- So it's an infinite-dimensional multivariate normal distribution
- Defined by the mean function m(.) and the covariance function v(., .)
- The main difficulty, and the skill, lies in modelling these functions
  - Usually in terms of hyperparameters which are estimated from the data
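A minimal sketch in Python of this finite-dimensional view: choose any inputs, evaluate m(.) and v(., .), and draw from the resulting multivariate normal. The particular mean and covariance functions used here (zero mean, squared-exponential covariance) are illustrative assumptions, not anything prescribed by the slide.

```python
import numpy as np

def m(x):
    return np.zeros_like(x)                          # mean function m(x)

def v(x, x2, sigma2=1.0, ell=0.3):
    diff = x[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * (diff / ell) ** 2)  # covariance function v(x, x')

x = np.linspace(0.0, 1.0, 50)                        # any finite set of inputs
mean = m(x)
cov = v(x, x) + 1e-9 * np.eye(len(x))                # small jitter for numerical stability

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean, cov, size=3)  # three draws of f at these inputs
print(samples.shape)                                  # (3, 50)
```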
Two communities
- I have used GPs for various problems in statistics
- GPs are also widely used in machine learning, but there are important differences between the two communities:
  - Modelling the mean and covariance
  - Nature of the data
  - Nature of the task
Modelling mean and covariance
- I generally use the following formulations:
  - m(x) = h(x)ᵀβ (a regression mean)
  - v(x, x′) = σ² exp{−(x − x′)ᵀ D (x − x′)} (the Gaussian covariance, which implies f(.) is very smooth)
- In machine learning:
  - m(x) = 0 (always a zero-mean process)
  - v(x, x′) has additional terms like h(x)ᵀ V h(x′), putting the mean structure into the covariance
  - More general covariance kernels
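A sketch of the two formulations in Python. The basis h(.), the values of β, V and the lengthscale matrix D are invented for illustration; the point is only how the regression mean appears either in m(.) or folded into the covariance.

```python
import numpy as np

def h(X):
    """Regression basis: a constant and a linear term (an assumed choice)."""
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def gaussian_cov(X, X2, sigma2=1.0, D=None):
    """v(x, x') = sigma^2 * exp{-(x - x')' D (x - x')}."""
    X, X2 = np.atleast_2d(X), np.atleast_2d(X2)
    if D is None:
        D = np.eye(X.shape[1])
    diff = X[:, None, :] - X2[None, :, :]
    quad = np.einsum('ijk,kl,ijl->ij', diff, D, diff)
    return sigma2 * np.exp(-quad)

# "Statistics" style: regression mean h(x)'beta plus Gaussian covariance
beta = np.array([0.5, 2.0])
def mean_stats(X):
    return h(X) @ beta

# "ML" style: zero mean, with the regression structure absorbed into the covariance
def cov_ml(X, X2, V=np.eye(2) * 10.0, sigma2=1.0):
    return gaussian_cov(X, X2, sigma2) + h(X) @ V @ h(X2).T

X = np.linspace(0, 1, 5)[:, None]
print(mean_stats(X))
print(cov_ml(X, X).shape)    # (5, 5)
```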
Nature of the data and the task
- In statistics, we may have rather little data, and it can be of a very wide range of types
  - See the examples later in this talk
- We tend to be concerned to get the best analysis for the case at hand
  - Particularly in Bayesian applications
- As I understand it, machine learning is typically concerned with large quantities of data
  - Probably collected by some automated process
  - Concern is for automated analysis: default rather than customised
  - Observations always have error
Regression
Early days
- I started using GPs in 1977
  - I was introduced to them by Jeff Harrison when I was at Warwick
- The problem I was trying to solve was design of experiments for fitting regression models
- Observations: y = h(x)ᵀ b(x) + ε
  - The usual regression model, except that the coefficients vary over the x space
- I used a GP prior distribution for b(.)
  - So the regression model deforms slowly and smoothly
A more general case
- I generalised to nonparametric regression: y = η(x) + ε
  - The regression function η(.) has a GP prior distribution
  - The GP is observed with error
  - The posterior mean smooths through the data points
- This is one of the classical uses of GPs in machine learning
- The paper I wrote was intended to solve a problem of experimental design using the special varying-coefficient GP
  - But it is only cited for the general theory
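A minimal Python sketch of this kind of nonparametric GP regression: a zero-mean GP prior on η(.), observations with Gaussian error, and the standard posterior mean and variance. The toy data, kernel and hyperparameter values are assumptions made only for illustration.

```python
import numpy as np

def kern(a, b, sigma2=1.0, ell=0.2):
    return sigma2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 1, 15)
y_obs = np.sin(2 * np.pi * x_obs) + rng.normal(0, 0.1, x_obs.size)   # toy noisy data

noise_var = 0.1 ** 2
x_new = np.linspace(0, 1, 200)

K = kern(x_obs, x_obs) + noise_var * np.eye(x_obs.size)
K_star = kern(x_new, x_obs)

# Posterior mean smooths through the data; posterior variance quantifies uncertainty.
alpha = np.linalg.solve(K, y_obs)
post_mean = K_star @ alpha
post_cov = kern(x_new, x_new) - K_star @ np.linalg.solve(K, K_star.T)
post_sd = np.sqrt(np.clip(np.diag(post_cov), 0, None))

print(post_mean[:5], post_sd[:5])
```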
GPs take off
- Since then I have used GPs extensively to represent (prior beliefs about) unknown functions
  - And they have become almost ubiquitous tools in many fields
- I'll concentrate on applications I have been involved with
- First, two brief examples to show the diversity:
  - Radiocarbon dating
  - Interpolating pollution monitoring stations
- Then an area that has become a big research field:
  - Uncertainty in the predictions of simulation models
  - With an aside on expert judgement
Radiocarbon dating
- Archaeologists date objects by using the radioactive decay of carbon-14
- The technique yields a radiocarbon age x, when the true age of the object is y
- If the level of carbon-14 in the biosphere were constant, then y = x
- Unfortunately it isn't, and there is an unknown calibration curve y = f(x)
  - It's not even monotone
- The data comprise points where y is known and x is measured by fairly accurate radiocarbon dating
Bayesian approach
- Treat the radiocarbon calibration curve f(.) as a GP
- Like nonparametric regression, except with different prior beliefs about the curve
  - Related to knowledge about the processes generating carbon-14 in the atmosphere
A portion of the calibration curve [figure not reproduced]
Spatial interpolation
- Monitoring stations measure atmospheric pollutants at various sites
- We wish to estimate pollution at other sites by interpolating from the gauged sites
  - So we observe f(xi) at gauged sites xi and want to interpolate to f(x)
- Standard geostatistical methods employ kriging, which is equivalent to a GP prior on f(.)
- But these methods typically rely on the process f(.) being stationary and isotropic
  - We know this is not true for this f(.)
Latent space methods
- Sampson and Guttorp developed an approach in which the geographical locations map into locations in a latent space, called D-space
  - Corr(f(x), f(x′)) is a function not of x − x′ but of d(x) − d(x′), their distance apart in D-space
- S&G estimate the d(xi) by multidimensional scaling, then interpolate by thin-plate splines
- A Bayesian approach assigns a GP prior to the mapping d(.), avoiding the arbitrariness of MDS and splines
  - Fitting is hard because the GP is deep in the model
  - Possibly the first use anywhere of deep GPs
Uncertainty Quantification for computer models
Bayesian quadrature
- The second time I used GPs was for numerical integration
- Problem: estimate the integral of a function f(.) over some range
- Data: values f(xi) at some points xi
- Treat f(.) as an unknown function
  - GP prior
  - Observed without error
  - Posterior mean interpolates the data
- Derive the posterior distribution of the integral
- The field of probabilistic numerics has grown up around this kind of modelling
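A sketch of this idea in Python, assuming a 1-D integrand, a squared-exponential covariance and a Gaussian weight function so that the kernel integrals have closed form (in the spirit of Bayes-Hermite quadrature, not a reproduction of the original paper's derivation). The integrand, design points and hyperparameters are illustrative.

```python
import numpy as np

sigma2, ell = 1.0, 0.5          # kernel variance and lengthscale (assumed)
mu, tau = 0.0, 1.0              # Gaussian weight w(x) = N(mu, tau^2)

f = lambda x: np.sin(x) + 0.5 * x          # toy integrand
x = np.linspace(-3, 3, 12)                 # design points
y = f(x)                                   # observed without error

def k(a, b):
    return sigma2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

K = k(x, x) + 1e-10 * np.eye(x.size)

# z_i = integral of k(x, x_i) w(x) dx  (closed form for this kernel and weight)
z = sigma2 * ell / np.sqrt(ell**2 + tau**2) * np.exp(-0.5 * (x - mu)**2 / (ell**2 + tau**2))
# double integral of k(x, x') w(x) w(x') dx dx'
zz = sigma2 * ell / np.sqrt(ell**2 + 2 * tau**2)

Kinv_y = np.linalg.solve(K, y)
post_mean_I = z @ Kinv_y                    # posterior mean of the integral
post_var_I = zz - z @ np.linalg.solve(K, z)  # posterior variance of the integral

print(post_mean_I, post_var_I)
```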
Uncertainty analysis
- That theory was a natural answer to another problem that arose
- We have a computer model (called a simulator) that produces output y = f(x) when given input x
- But for a particular application we do not know x precisely
  - X is a random variable, and so therefore is Y = f(X)
- We are interested in the distribution of Y
  - Called the uncertainty distribution
- In particular, E(Y) is an integral
Monte Carlo
- A simple solution in principle:
  - Sample values of x from its distribution
  - Run the model for all these values to produce sample values yi = f(xi)
  - These are a sample from the uncertainty distribution of Y
- Neat, but impractical if each run of the model takes minutes, hours or even days
  - We can then only make a small number of runs
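A short Python sketch of this brute-force approach. The "simulator" and the input distribution are stand-ins; in practice each run may take hours, which is exactly the problem.

```python
import numpy as np

def simulator(x):
    return np.sin(3 * x) + x**2              # placeholder for an expensive model

rng = np.random.default_rng(0)
x_samples = rng.normal(loc=0.5, scale=0.2, size=10_000)   # assumed input distribution
y_samples = simulator(x_samples)                          # one model run per sample

print("E(Y) approx:", y_samples.mean())
print("95% interval:", np.quantile(y_samples, [0.025, 0.975]))
```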
GP solution
- Treat f(.) as an unknown function with a GP prior distribution
- Training set of runs
  - Observations without error
- Derive the posterior GP distribution for f(.)
- Make inference about the uncertainty distribution
  - E.g. the mean of Y is the integral of f(x) with respect to the distribution of X
  - Use Bayesian quadrature theory
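A sketch of the emulator idea in Python: fit a GP to a small training set of runs, then estimate E(Y) by averaging the posterior mean over samples of X, which is cheap because only the emulator is evaluated. Kernel and hyperparameters are illustrative; in practice they would be estimated from the training runs, and the full Bayesian quadrature treatment would also give the code uncertainty.

```python
import numpy as np

def simulator(x):
    return np.sin(3 * x) + x**2              # stand-in for a slow simulator

def kern(a, b, sigma2=1.0, ell=0.3):
    return sigma2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

x_train = np.linspace(-0.2, 1.2, 8)          # only a few expensive runs
y_train = simulator(x_train)

K = kern(x_train, x_train) + 1e-10 * np.eye(x_train.size)   # runs observed without error
alpha = np.linalg.solve(K, y_train)

rng = np.random.default_rng(1)
x_samples = rng.normal(0.5, 0.2, 100_000)                    # distribution of X
post_mean_at_x = kern(x_samples, x_train) @ alpha            # cheap emulator evaluations

print("E(Y) via emulator:", post_mean_at_x.mean())
print("E(Y) via direct Monte Carlo:", simulator(x_samples).mean())
```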
UQ
- The engineering and applied maths community named this Uncertainty Quantification
  - A terrible term, but it has stuck!
- It has grown into a big research field
- Basic idea:
  - Use a training set of runs to build a fast approximation of the simulator, a surrogate model
  - Propagate the uncertainty in X through the surrogate to approximate the uncertainty distribution of Y
- The GP mean function is a popular surrogate
  - It can approximate the simulator output accurately from a relatively small training set
But there's more uncertainty
- Code uncertainty
  - Uncertainty of approximation due to the use of a surrogate
- Model uncertainty
  - Uncertainty about the real-world phenomenon being simulated, because the simulator itself is an approximation
- Input distribution uncertainty
  - Uncertainty because the distribution of X is treated in UQ as given, whereas in practice it's an imperfect estimate
Code uncertainty
- The GP has this covered automatically!
  - It's more than a surrogate: a full probability distribution for f(.)
- I call the GP an emulator
  - A surrogate gives an estimate/approximation for E(Y)
  - An emulator expresses uncertainty around that estimate
- Any surrogate that also has a credible expression of uncertainty can be an emulator
  - But I'm not aware of any others
- But the word emulator has been misappropriated
  - And is widely used for more or less any surrogate
Model uncertainty
- All models are wrong
- Even when given the true/optimal inputs x, f(x) will not equal the reality, z, that is being simulated
  - The difference is called model discrepancy or model error
- We can hope to learn about model discrepancy if we can observe reality
  - Although there will inevitably also be observation error
- How do we model this?
Control inputs and parameters
- First, distinguish between two kinds of inputs:
  - Parameters are inputs that have fixed but uncertain values
  - Control inputs define the specific real instance being simulated
- For example, a simulator predicts deposition of radionuclides at specified locations after a reactor accident
  - Control inputs will specify the locations
  - Parameters will include the amount released, the wind direction and the wind speed
Modelling model discrepancy
- We have training data fj = f(xj, tj)
  - f is the simulator (GP)
  - xj is the control input vector for the j-th training run
  - tj is the vector of values for the parameters in the j-th run
- And we have real-world observations zi = ζ(xi) + εi = f(xi, θ) + δ(xi) + εi
  - xi is the control input vector for the i-th observation
  - ζ is the real process
  - εi is the observation error in the i-th observation
  - θ is the vector of true values of the parameters
  - δ is the model discrepancy function (another GP)
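A Python sketch of the data structures in this framework: simulator training runs fj = f(xj, tj) and field observations zi = f(xi, θ) + δ(xi) + εi. The simulator, the true θ, the discrepancy δ and the noise level are all made-up stand-ins, intended only to show how the pieces fit together before any GP fitting is done.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(x, t):
    return t * np.sin(2 * x)                 # stand-in simulator f(x, t)

def delta(x):
    return 0.3 * x                           # stand-in model discrepancy

theta_true = 1.5                             # true (unknown) parameter value

# Training data: runs at chosen control inputs x_j and parameter values t_j
x_train = np.linspace(0, 2, 10)
t_train = rng.uniform(0.5, 2.5, 10)
f_train = simulator(x_train, t_train)

# Real-world observations at control inputs x_i, with observation error
x_obs = rng.uniform(0, 2, 6)
z_obs = simulator(x_obs, theta_true) + delta(x_obs) + rng.normal(0, 0.05, 6)

print(f_train)
print(z_obs)
```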
Calibration
- Using the training data and the real-world observations, we make inference about:
  - the simulator (emulation)
  - the model discrepancy
  - the parameters of the simulator (calibration)
  - the hyperparameters of the GPs
  - reality (calibrated and discrepancy-corrected prediction)
- A substantial resource requirement
  - But not in comparison to getting more training runs
  - Or real-world observations!
Validation
- This is complex statistical modelling
  - It is important to criticise such models wherever possible
- Validating the emulator
  - Make additional simulator runs
  - Compare with emulator predictions
  - Various tools and tests have been developed
  - Expect failure
- Validating parameters/discrepancy
  - Make additional real-world observations (in principle!)
  - Compare with predictions of reality
Input uncertainty
- UQ propagates input uncertainty through the simulator, but where do the input distributions come from?
  - Informed judgement, usually
- This is another active area of my research: elicitation of expert knowledge
  - Typically using a group of experts
  - Talking through the evidence
  - Reaching a kind of consensus distribution
- I use the Sheffield Elicitation Framework (SHELF)
Input uncertainty uncertainty
- But this can never be an exact science
- So we should allow for uncertainty/imprecision in the elicited input distributions
- In SHELF, the experts provide a small number of probability judgements
  - P(X < c) for 3 or 4 values of c
- Then we fit a suitable probability distribution
  - E.g. normal, beta, ...
- There is arbitrariness in the fitted distribution
  - And imprecision in the original judgements
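A sketch in Python of fitting a parametric distribution to a few elicited probability judgements, in the spirit of this step (not the SHELF software itself). The thresholds, judged probabilities and the choice of a normal distribution are invented for illustration.

```python
import numpy as np
from scipy import stats, optimize

c = np.array([10.0, 20.0, 30.0])          # thresholds put to the expert
p = np.array([0.10, 0.50, 0.85])          # the expert's judged probabilities P(X < c)

def loss(params):
    mu, log_sigma = params
    return np.sum((stats.norm.cdf(c, mu, np.exp(log_sigma)) - p) ** 2)

res = optimize.minimize(loss, x0=[20.0, np.log(10.0)])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print("fitted normal:", mu_hat, sigma_hat)

# The fit reproduces the judgements only approximately, which is one source of
# the arbitrariness and imprecision mentioned above.
print(stats.norm.cdf(c, mu_hat, sigma_hat))
```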
Yet another GP
- Inference about the expert's underlying distribution
- The expert's density is an unknown function
  - Specify a GP prior
  - Generally uninformative, but including beliefs about smoothness: probably unimodal, reasonably symmetric
- The expert's judgements are data
- The posterior GP provides an estimate of the expert's density and a specification of uncertainty
- We are observing integrals of the GP
  - Possibly with error
Example of an elicited distribution, without and with error in the expert's judgements [figure not reproduced]
Multi-fidelity models
- This is another topic that seems to be trending
- We have a model f1 that incorporates all the science
  - But it takes forever to run
- We have a lower-fidelity model f2 that runs quickly
- We can make a few runs of f1 and many more runs of f2
- Can we combine the information from these two sources?
- Model f2 = f1 + g, where the difference function g is again modelled as a GP
  - Equivalently, the model discrepancy δ2 for f2 is δ1 + g
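A Python sketch of the multi-fidelity idea on this slide: model the difference g (with f2 = f1 + g) as a GP trained at the few points where the expensive model f1 has been run, then correct the cheap model everywhere else. Both "models", the kernel and the hyperparameters are invented stand-ins.

```python
import numpy as np

def f1(x):                       # high-fidelity model (slow in reality)
    return np.sin(4 * x) + 0.2 * x

def f2(x):                       # low-fidelity model (fast), so f2 = f1 + g
    return np.sin(4 * x)

def kern(a, b, sigma2=0.1, ell=0.4):
    return sigma2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

x_hi = np.linspace(0, 2, 5)                  # only a few expensive runs of f1
g_obs = f2(x_hi) - f1(x_hi)                  # observed difference g at those points

K = kern(x_hi, x_hi) + 1e-10 * np.eye(x_hi.size)
alpha = np.linalg.solve(K, g_obs)

x_new = np.linspace(0, 2, 9)
g_hat = kern(x_new, x_hi) @ alpha            # GP posterior mean of g
f1_hat = f2(x_new) - g_hat                   # corrected prediction of the expensive model

print(np.max(np.abs(f1_hat - f1(x_new))))    # accuracy of the combined estimate
```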
Doing science
Model discrepancy revisited
- Remember the model discrepancy framework says ζ(x) = f(x, θ) + δ(x), where
  - ζ is the real-world process
  - x is the control vector determining a particular instance
  - f is the simulator
  - θ is the parameter vector
  - δ is the model discrepancy function
- And I suggested that, given some runs of the simulator and some real-world observations, we could learn about the model discrepancy and the true parameter values
- But that is not quite true!
Nonidentifiability
- The model discrepancy framework is not identifiable
  - For any θ, there is a δ to match reality perfectly
- Reality is ζ(x) = f(x, θ) + δ(x)
- Suppose we had an unlimited number of observations
  - We would learn reality's true function ζ exactly
  - Within the range of the data
- But we would still not learn θ
  - It could in principle be anything
- Given ζ and θ, the model discrepancy is δ(x) = ζ(x) − f(x, θ)
  - So we would still not learn δ either
- And we would not learn ζ outside the range of the data
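A tiny numerical illustration of the argument in Python: for any value of θ, setting δ(x) = ζ(x) − f(x, θ) reproduces reality exactly, so observations of ζ alone cannot separate θ from δ. The simulator and the "real" process are invented stand-ins.

```python
import numpy as np

def f(x, theta):
    return theta * np.sin(2 * x)              # stand-in simulator

def zeta(x):
    return 1.3 * np.sin(2 * x) + 0.2 * x      # stand-in reality

x = np.linspace(0, 3, 200)                    # "unlimited" observations of reality

for theta in (0.5, 1.3, 5.0):                 # three very different parameter values
    delta = zeta(x) - f(x, theta)             # the discrepancy that matches reality
    recon = f(x, theta) + delta
    print(theta, np.max(np.abs(recon - zeta(x))))   # 0.0 for every theta
```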
It gets worse
- We are used to statistical inference being consistent, i.e.:
  - As we get more data, estimates converge on the true values
  - And posterior uncertainty shrinks to zero
- This does not happen when we have nonidentifiability
  - As we obtain more and more real-world observations, the posterior uncertainty about the parameters, the model discrepancy, or reality outside the range of the data does not go to zero
  - And the true values may be out in the tails of their posterior distributions
- This is an intrinsically insoluble problem
  - But there is something we can do
Prior information
- As in any Bayesian analysis, prior information influences the answers
  - And because of non-identifiability, that influence is there no matter how much data we have
- Although it cannot solve the problem, it can reduce it
  - Realistic prior information will typically mean that posterior distributions are narrower and centred closer to the true values
- In particular, we need to think hard about how and where the simulator might be wrong
  - Because this provides us with meaningful prior information about the discrepancy
Bayes and subjectivity
- Bayesian methods have always been criticised for the fact that prior information is subjective
  - Unfair on several counts, but persistent
- In order to define a prior distribution it is necessary to make judgements about probability
  - And these probability judgements are invariably subjective
- Surely this is totally unscientific?
  - A common reaction, because scientists are taught that science is objective
  - They need education
- Subjective!!
Subjective, but scientific (1)
- "You want to use subjective probability judgements? Isn't that totally unscientific? Science is supposed to be objective."
- Yes, objectivity is the goal of science, but scientists still have to make judgements
  - These judgements include theories, insights, interpretations of data
  - Science progresses by other scientists debating and testing those judgements
  - Making good judgements of this kind is what distinguishes a top scientist
Subjective, but scientific (2)
- "But subjective judgements are open to bias, prejudice, sloppy thinking."
- Subjective probabilities are judgements, but they should be careful, honest, informed judgements
  - In science we must always be as objective as possible
  - Probability judgements are like all the other judgements that a scientist necessarily makes, and should be argued for in the same careful, honest, informed way
Do we need to worry?
- It is rare to see proper, informative priors on model parameters in Bayesian analyses
- Instead, it is extremely common to see vague priors used
  - On the grounds that any proper prior distribution will be overwhelmed by the data
- This is OK, but it is important to remember:
  - The analysis began by specifying a model for the data, and this is a subjective choice
  - And if we want to compare models, we can no longer use improper priors on the various model parameters
  - Or if we do, we must use an approximation like the Fractional Bayes Factor
On the other hand
- There are many situations where prior information does matter:
  - When there is not much data (or even no data)
  - When there is potentially strong prior information
  - And in some technical situations, like non-identifiability or model comparison
- In my career, I have sought out such problems
  - Because then Bayesian methods deliver better answers than can be obtained from classical analyses
- And this is the underlying motivation for much of my research being on the elicitation of expert probability judgements
Going back to Gaussian processes
- Remember that a GP requires a mean function and a covariance function
- These are often modelled parametrically
  - Assumed to have a particular form, but with unknown parameters
- Models for covariance functions typically have one or more correlation length parameters
  - Specifying how rapidly the GP varies in response to changes in its inputs
- These are often ill-determined by the data, and proper, informative prior distributions on these parameters can really improve the analysis
  - Particularly in a fully Bayesian approach
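A Python sketch of why an informative prior on a correlation length can help: with few data points the marginal likelihood is flat in the lengthscale, and a log-normal prior pulls the estimate toward a plausible value. The data, kernel and prior here are invented for illustration, and a full Bayesian analysis would integrate over the hyperparameter rather than maximise.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 8)                      # few data points: lengthscale ill-determined
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.05, x.size)

def log_marginal(ell, sigma2=1.0, noise=0.05**2):
    K = sigma2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2)
    K += noise * np.eye(x.size)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + x.size * np.log(2 * np.pi))

ells = np.linspace(0.02, 2.0, 200)
loglik = np.array([log_marginal(l) for l in ells])
# Log-normal prior on the lengthscale (median 0.3), up to an additive constant
logprior = -0.5 * ((np.log(ells) - np.log(0.3)) / 0.5) ** 2 - np.log(ells)

print("Maximum-likelihood lengthscale:", ells[np.argmax(loglik)])
print("MAP lengthscale with prior:   ", ells[np.argmax(loglik + logprior)])
```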
In conclusion
Take-home messages
- GPs are very versatile
  - I have used them in many different applications
- They have made an enormous contribution in UQ
  - Particularly in respect of model discrepancy
  - But it is important to be aware of the fundamental non-identifiability
- Genuine prior information can also be important
  - Distributions for model inputs in UQ
  - Priors for correlation length parameters
  - Prior information about the model discrepancy
- Not to use available information is unscientific!
Bibliography
Nonparametric regression
- O'Hagan, A. (1978). Curve fitting and optimal design for prediction (with discussion). Journal of the Royal Statistical Society B 40, 1-42.
Radiocarbon dating
- Gomez Portugal Aguilar, D., Litton, C. D. and O'Hagan, A. (2002). A new piece-wise linear radiocarbon calibration curve with more realistic variance. Radiocarbon 44, 195-212.
- Buck, C. E., Gomez Portugal Aguilar, D., Litton, C. D. and O'Hagan, A. (2006). Bayesian nonparametric estimation of the radiocarbon calibration curve. Bayesian Analysis 1, 265-288.
Spatial interpolation
- Schmidt, A. M. and O'Hagan, A. (2003). Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society B 65, 745-758.
Numerical integration
- O'Hagan, A. (1991). Bayes-Hermite quadrature. Journal of Statistical Planning and Inference 29, 245-260.
Uncertainty analysis
- Haylock, R. G. and O'Hagan, A. (1996). On inference for outputs of computationally expensive algorithms with uncertainty on the inputs. In Bayesian Statistics 5, J. M. Bernardo et al. (eds.). Oxford University Press, 629-637.
Emulators
- O'Hagan, A., Kennedy, M. C. and Oakley, J. E. (1999). Uncertainty analysis and other inference tools for complex computer codes (with discussion). In Bayesian Statistics 6, J. M. Bernardo et al. (eds.). Oxford University Press, 503-524.
- Oakley, J. E. and O'Hagan, A. (2004). Probabilistic sensitivity analysis of complex models: a Bayesian approach. Journal of the Royal Statistical Society B 66, 751-769.
- O'Hagan, A. (2006). Bayesian analysis of computer code outputs: a tutorial. Reliability Engineering and System Safety 91, 1290-1300.
Model discrepancy
- Kennedy, M. C. and O'Hagan, A. (2001). Bayesian calibration of computer models (with discussion). Journal of the Royal Statistical Society B 63, 425-464.
Non-identifiability
- Brynjarsdottir, J. and O'Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems 30, 114007.
Validating emulators
- Bastos, L. S. and O'Hagan, A. (2009). Diagnostics for Gaussian process emulators. Technometrics 51, 425-438.
Multi-fidelity
- Kennedy, M. and O'Hagan, A. (2000). Predicting the output from a complex computer code when fast approximations are available. Biometrika 87, 1-13.
Expert knowledge elicitation
- Oakley, J. E. and O'Hagan, A. (2019). SHELF: the Sheffield Elicitation Framework (version 4). School of Mathematics and Statistics, University of Sheffield, UK. (http://tonyohagan.co.uk/shelf)
- O'Hagan, A. (2019). Expert knowledge elicitation: subjective but scientific. The American Statistician 73:sup1, 69-81. doi: 10.1080/00031305.2018.1518265.
Imprecision in elicitation
- Oakley, J. E. and O'Hagan, A. (2007). Uncertainty in prior elicitations: a nonparametric approach. Biometrika 94, 427-441.
- Gosling, J. P., Oakley, J. E. and O'Hagan, A. (2007). Nonparametric elicitation for heavy-tailed prior distributions. Bayesian Analysis 2, 693-718.
Model comparison
- O'Hagan, A. (1995). Fractional Bayes factors for model comparison (with discussion). Journal of the Royal Statistical Society B 57, 99-138.