Sequential Monte Carlo Methods for Dynamic Systems
This presentation discusses Sequential Monte Carlo methods for estimating expectations when direct sampling is difficult. It explains the basic importance sampling idea, conditions on the importance distribution, handling unknown normalizing constants, weight diagnostics for the importance distribution, and effective sample size considerations.
Uploaded on Nov 16, 2024
Sequential Monte Carlo Methods for Dynamic Systems

Nir Keret

Based on an identically named article by Jun S. Liu, and on his book Monte Carlo Strategies in Scientific Computing.
The basic idea: Suppose that $X \sim F$, and we wish to estimate $E[h(X)]$. A straightforward way to go about it is to sample $X_1, \dots, X_n$ from $F$ and calculate $\frac{1}{n}\sum_{i=1}^n h(X_i)$. From the WLLN we know that $\frac{1}{n}\sum_{i=1}^n h(X_i) \to E[h(X)]$.

However, sometimes sampling directly from $F$ is hard or costly. Instead, we can sample $X_1, \dots, X_n \sim G$ and calculate $\frac{1}{n}\sum_{i=1}^n h(X_i)\,\frac{f(X_i)}{g(X_i)}$. The density ratio $w(X_i) = f(X_i)/g(X_i)$ is called the importance weight, and we will refer to it as $w_i$. This works because

$E_G\!\left[h(X)\frac{f(X)}{g(X)}\right] = \int h(x)\frac{f(x)}{g(x)}\,g(x)\,dx = \int h(x)f(x)\,dx = E_F[h(X)].$
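As a minimal numerical sketch of this identity (the target, proposal, and $h$ below are arbitrary choices for illustration, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):   # target density F: standard normal
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(x):   # importance density G: normal with std 2 (heavier tails than f)
    return np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)

def h(x):   # function whose expectation we want: E_F[X^2] = 1
    return x**2

n = 200_000
x = rng.normal(0.0, 2.0, size=n)   # sample from G, not F
w = f(x) / g(x)                    # importance weights w(x) = f(x)/g(x)
estimate = np.mean(h(x) * w)       # (1/n) * sum h(X_i) w(X_i), close to 1
```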
Conditions on the distribution G:
- easy to sample from
- $\mathrm{supp}(f) \subseteq \mathrm{supp}(g)$
- $g(x)$ should be as proportional as possible to $h(x)f(x)$ (in order to minimize the variance):

$\mathrm{Var}_G\!\left(h(X)\frac{f(X)}{g(X)}\right) = \int \frac{h(x)^2 f(x)^2}{g(x)}\,dx - \left(E_F[h(X)]\right)^2.$

So, when $g(x) = \dfrac{h(x)f(x)}{\int h(y)f(y)\,dy}$ (possible when $h \ge 0$), the variance is 0.
When $f$ is known only up to a normalizing constant (as is often the case in Bayesian statistics), we can standardize the weights so that they sum to 1:

$\tilde{w}_i = \frac{f(X_i)/g(X_i)}{\sum_{j=1}^n f(X_j)/g(X_j)},$

and the normalizing constant cancels out in the numerator and the denominator. This introduces a bias of $O(1/n)$, but when the sample size is large it can be overlooked.
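A sketch of this self-normalized estimator; the unnormalized $N(3,1)$-shaped target and $N(0,9)$ proposal are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(1)

def q(x):      # unnormalized target: N(3, 1) shape with its constant "unknown"
    return np.exp(-(x - 3.0)**2 / 2)

def g(x):      # importance density: N(0, 3^2), wide enough to cover the target
    return np.exp(-x**2 / 18) / np.sqrt(18 * np.pi)

n = 100_000
x = rng.normal(0.0, 3.0, size=n)
w = q(x) / g(x)                 # weights using the unnormalized target
w_std = w / w.sum()             # standardized weights sum to 1
mean_est = np.sum(w_std * x)    # self-normalized estimate of E[X] = 3
```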
Weight diagnostics: how to check that we chose a good importance distribution G.

Since $E_G[w(X)] = \int \frac{f(x)}{g(x)}\,g(x)\,dx = 1$, we know that some weights will be larger than 1, or even $w(x) \gg 1$. When that happens, one or a few observations might dominate our estimate, as we are calculating $\frac{1}{n}\sum_{i=1}^n w(X_i)\,h(X_i)$. We should balance the $w(x)$ and $h(x)$ values: we would like to assign low weights to high values of $h(X)$ and vice versa, so we can plot $w(x)$ against $h(x)$ and check whether that seems to be the case (we should expect a monotone-decreasing curve).

A badly chosen G might result in a large or even unbounded variance. (This may happen when we choose a light-tailed distribution, such as the normal, as the importance distribution for a heavy-tailed one, such as the logistic distribution.)
Weight diagnostics: In general, having a few weights with very high values is unfavorable, as it drives the other weights towards zero when we standardize them, and our estimate will be concentrated on a few samples.

Effective sample size: how many unweighted samples are our weighted samples worth? Consider a hypothetical linear combination $\bar{Z} = \sum_{i=1}^n w_i Z_i / \sum_{i=1}^n w_i$, where the $Z_i$ are iid with variance $\sigma^2$. The unweighted mean of $n_{\text{eff}}$ samples has variance $\sigma^2/n_{\text{eff}}$. Setting $\mathrm{Var}(\bar{Z}) = \sigma^2/n_{\text{eff}}$ and solving for $n_{\text{eff}}$ yields the effective sample size:

$n_{\text{eff}} = \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2}.$
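The formula is a one-liner; a tiny sketch showing its two extremes:

```python
import numpy as np

def effective_sample_size(w):
    """ESS = (sum w_i)^2 / sum w_i^2: equals n for equal weights,
    and approaches 1 when a single weight dominates."""
    w = np.asarray(w, dtype=float)
    return w.sum()**2 / np.sum(w**2)

print(effective_sample_size([1, 1, 1, 1]))             # 4.0 (equal weights)
print(effective_sample_size([100, 1e-3, 1e-3, 1e-3]))  # ~1.0 (one dominates)
```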
When we use the standardized weights, we can essentially think of our estimate as the expectation of the discrete distribution which puts mass $\tilde{w}_i$ on each observed $X_i$ coming from G. Therefore, in a way, we are approximating $f$ by this distribution. Perhaps we can sample from this discrete distribution in order to get an approximate sample from $f$? We can sample with replacement $Y_1, \dots, Y_m$ from the observed sample $X_1, \dots, X_n$ with probabilities $\tilde{w}_1, \dots, \tilde{w}_n$.

*If we want to sample $m$ points, then we need $m \ll n$, i.e. $m/n \to 0$.

Theorem: A random variable Y drawn in this manner has a distribution that converges to $f$ as $n \to \infty$.
Proof: Consider a set A, and let $w(\cdot)$ denote the original (unstandardized) weight. Then

$P(Y \in A \mid X_1, \dots, X_n) = \sum_{i=1}^n \tilde{w}_i \mathbf{1}\{X_i \in A\} = \frac{\frac{1}{n}\sum_{i=1}^n w(X_i)\,\mathbf{1}\{X_i \in A\}}{\frac{1}{n}\sum_{i=1}^n w(X_i)} \xrightarrow{n \to \infty} \frac{E_G\!\left[w(X)\,\mathbf{1}\{X \in A\}\right]}{E_G[w(X)]} = \frac{\int_A f(x)\,dx}{1},$

and by Lebesgue's dominated convergence theorem:

$P(Y \in A) = E\!\left[P(Y \in A \mid X_1, \dots, X_n)\right] \xrightarrow{n \to \infty} \int_A f(x)\,dx.$
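A small sampling-importance-resampling sketch of the theorem's recipe; the $N(5,1)$ target and $N(0,4^2)$ proposal are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Target f: N(5, 1); importance distribution G: N(0, 4^2)
n = 200_000
x = rng.normal(0.0, 4.0, size=n)
w = np.exp(-(x - 5.0)**2 / 2) / (np.exp(-x**2 / 32) / 4.0)  # f/g up to a constant
p = w / w.sum()                                             # standardized weights

# Resample with replacement according to the standardized weights, with m << n
m = 5_000
y = rng.choice(x, size=m, replace=True, p=p)
# y is now an approximate sample from f: mean near 5, std near 1
```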
Definition: A sequence of evolving probability distributions $\pi_t(x_t)$, indexed by discrete time $t = 0, 1, 2, \dots$, is called a probabilistic dynamic system. The state variable $x_t$ has an increasing dimension as a function of $t$: that is, $x_{t+1} = (x_t, x_{t+1}^{\,*})$, where the new component $x_{t+1}^{\,*}$ can be multidimensional. $x_t$ represents the entire history of the stochastic process up to time $t$. The difference between $\pi_{t+1}$ and $\pi_t$ is caused by the incorporation of new information in the analysis. Suppose that we wish to estimate $E[h(x_t)]$ with respect to $\pi_t$.
One way to estimate $E[h(x_t)]$ is to draw a sample of $x_t$ sequences and calculate the average. We can utilize advanced sampling algorithms such as the Gibbs sampler in order to draw such sequences. However, this solution requires us to start the process from scratch at every time index $t$. It is especially problematic when we want to do on-line estimation.

At time $t$, it would be better to update previous inferences than to act as if we had no previous information. In our approach, we will use sequential importance sampling: we will append the simulated $x_t^{\,*}$ to the previously simulated $x_{t-1}$ and adjust the importance weights accordingly.
We can express our target density in the following manner:

$\pi_t(x_t) = \pi_1(x_1)\,\frac{\pi_2(x_2)}{\pi_1(x_1)}\,\frac{\pi_3(x_3)}{\pi_2(x_2)} \cdots \frac{\pi_t(x_t)}{\pi_{t-1}(x_{t-1})} = f(x_1)\,f(x_2 \mid x_1)\,f(x_3 \mid x_2) \cdots f(x_t \mid x_{t-1}).$

Suppose we adopt the same form for the importance distribution:

$g_t(x_t) = g(x_1)\,g(x_2 \mid x_1)\,g(x_3 \mid x_2) \cdots g(x_t \mid x_{t-1}).$

Then the importance weights take the form:

$w_t(x_t) = \frac{f(x_1)\,f(x_2 \mid x_1)\,f(x_3 \mid x_2) \cdots f(x_t \mid x_{t-1})}{g(x_1)\,g(x_2 \mid x_1)\,g(x_3 \mid x_2) \cdots g(x_t \mid x_{t-1})}.$
And we can notice the recursive relation:

$w_t(x_t) = w_{t-1}(x_{t-1})\,\frac{f(x_t \mid x_{t-1})}{g(x_t \mid x_{t-1})}.$

Therefore, the algorithm for obtaining a sequence $x_t$ and a corresponding weight $w_t$ is given by the following steps:
1. Sample $x_1 \sim g_1$. Let $w_1 = u_1 = f_1(x_1)/g_1(x_1)$. Set $t = 2$.
2. Sample $x_t^{\,*} \mid x_{t-1} \sim g(x_t \mid x_{t-1})$.
3. Append $x_t^{\,*}$ to $x_{t-1}$, obtaining $x_t$.
4. Let $u_t = f(x_t \mid x_{t-1})/g(x_t \mid x_{t-1})$.
5. Let $w_t = w_{t-1} u_t$. At the current time, $w_t$ is the importance weight for $x_t$.
6. Increment $t$ and return to step 2.
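The steps above can be sketched as a generic routine; the callback signatures are my own framing, not from the source:

```python
import numpy as np

def sis(sample_g, weight_increment, T, n, rng):
    """Generic sequential importance sampling for n sequences in parallel.

    sample_g(t, hist, rng) draws the t-th component given each history
    (hist has shape (n, t)); weight_increment(t, hist, new) returns
    u_t = f(x_t | x_{t-1}) / g(x_t | x_{t-1}) for each sequence.
    Returns the n sequences (shape (n, T)) and their final weights.
    """
    X = np.empty((n, T))
    w = np.ones(n)
    for t in range(T):
        X[:, t] = sample_g(t, X[:, :t], rng)          # step 2: propose component
        w *= weight_increment(t, X[:, :t], X[:, t])   # steps 4-5: update weights
    return X, w

rng = np.random.default_rng(0)
# Sanity check: with g = f every increment is 1 and plain iid sampling is recovered
X, w = sis(sample_g=lambda t, hist, rng: rng.standard_normal(hist.shape[0]),
           weight_increment=lambda t, hist, new: np.ones(hist.shape[0]),
           T=5, n=100, rng=rng)
```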
In this way we can break a difficult task, such as directly sampling a sequence from the distribution $\pi_t$, into smaller, easier-to-handle bits.

A simple example: Consider the following sequence of probability densities:

$\pi_t(x_t) = c_t \exp\{-\lVert x_t \rVert^3/3\},$

where $c_t$ is a normalizing constant and $\lVert \cdot \rVert$ is the Euclidean norm. There is no easy way to sample from this density directly. We can employ a standard normal distribution as the importance distribution. The conditional density can be expressed as:

$f(x_t^{\,*} \mid x_{t-1}) = \frac{\pi_t(x_t)}{\pi_{t-1}(x_{t-1})}.$
1. Let $t = 1$. Sample $n$ points $x_1^{(1)}, \dots, x_1^{(n)}$ from a standard normal distribution.
2. Calculate the initial weights as $w_1^{(i)} = \exp\{-|x_1^{(i)}|^3/3\}/\varphi(x_1^{(i)})$, where $\varphi$ is the standard normal density. The constants $c_t$ cancel out when the weights are standardized.
3. For $t > 1$, sample $n$ points $x_t^{*(1)}, \dots, x_t^{*(n)}$ from a standard normal distribution and append them to $x_{t-1}^{(1)}, \dots, x_{t-1}^{(n)}$, yielding $x_t^{(1)}, \dots, x_t^{(n)}$.
4. Calculate the weight adjustment factors $u_t^{(i)} = \dfrac{\pi_t(x_t^{(i)})}{\pi_{t-1}(x_{t-1}^{(i)})\,\varphi(x_t^{*(i)})}$, set $w_t^{(i)} = w_{t-1}^{(i)} u_t^{(i)}$, and standardize.
5. Increment $t$ and return to step 3.

At each time $t$, we will use our sample of $n$ sequences in order to estimate the desired expectation, using IS. What if instead we sampled a multivariate normal directly at each time $t$?
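A sketch of these steps for the $\exp\{-\lVert x_t\rVert^3/3\}$ example, working with log-weights for numerical stability (an implementation detail not in the slides; the constants $c_t$ are dropped since standardization cancels them):

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 50_000, 5

phi = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density

X = np.empty((n, T))
logw = np.zeros(n)
r3_prev = np.zeros(n)                        # ||x_{t-1}||^3 (zero before t = 1)
for t in range(T):
    X[:, t] = rng.standard_normal(n)         # propose new component from N(0, 1)
    r3 = np.sum(X[:, :t + 1]**2, axis=1)**1.5   # ||x_t||^3
    # log u_t = log[pi_t / pi_{t-1}] - log phi, dropping the constants c_t
    logw += -(r3 - r3_prev) / 3 - np.log(phi(X[:, t]))
    r3_prev = r3

w = np.exp(logw - logw.max())
w /= w.sum()                                 # standardized weights

# IS estimate of E[||X_T||^2] under pi_T (smaller than the T under a normal)
est = np.sum(w * np.sum(X**2, axis=1))
```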
One major problem with sequential IS is that the weights become increasingly degenerate as $t$ grows. Each time an unlikely new component is appended to a sequence, the weight for that entire sequence decreases proportionally. Eventually, since we are dealing with long time-sequences, only a few sequences will have avoided this phenomenon, and the weight will become more and more concentrated on a few sequences. Degenerate weights reduce the effective sample size and degrade estimation performance.
Therefore, in practice, we have to perform rejuvenation steps every once in a while in order to keep the effective sample size large enough. We can perform those rejuvenation steps at pre-defined checkpoints (e.g., at times $t = 5, 10, 15, \dots$), or instead we can constantly monitor our effective sample size and perform rejuvenation whenever it falls below a pre-defined threshold. This is a stochastic rejuvenation schedule. But how do we rejuvenate?
First option: Discard any sequence whose weight falls beneath a threshold and sample a new sequence instead. Problematic?

Second option: Resampling. Out of the current sequences we have, we can resample (with replacement) $n$ new sequences. After resampling, we reset the weights of all sequences to $1/n$. It should be noted that resampling does not facilitate inference at the current time, but rather improves the chances of the good sequences surviving to later times. Hence, inference at the current time should be done before resampling has taken place.
Resampling schemes:
1. Simple random sampling: Resample from the current sequences according to their standardized weights (reminiscent of importance resampling).
2. Residual resampling:
a. Retain $k_i = \lfloor n\tilde{w}_i \rfloor$ copies of $x_t^{(i)}$, $i = 1, \dots, n$. Let $n_r = n - \sum_{i=1}^n k_i$.
b. Obtain $n_r$ iid draws from $\{x_t^{(i)}\}$ (the sample of sequences at time $t$) with probabilities proportional to the residuals $n\tilde{w}_i - k_i$, $i = 1, \dots, n$.
c. Reset the weights to $1/n$.
Residual resampling (example), with $n = 5$:

i    w_i     n*w_i   k_i = floor(n*w_i)   residual n*w_i - k_i   resampling prob.
1    0.10    0.5     0                    0.5                    0.25
2    0.25    1.25    1                    0.25                   0.125
3    0.05    0.25    0                    0.25                   0.125
4    0.42    2.1     2                    0.1                    0.05
5    0.18    0.9     0                    0.9                    0.45

$n_r = 5 - 3 = 2$: sample 2 additional sequences according to the residual probabilities in the last column.
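The residual resampling scheme can be sketched directly, using the example's weights (the helper name is my own):

```python
import numpy as np

def residual_resample(w_std, rng):
    """Residual resampling: deterministically keep floor(n*w_i) copies of
    particle i, then draw the remaining n_r indices with probabilities
    proportional to the residuals n*w_i - floor(n*w_i)."""
    n = len(w_std)
    counts = np.floor(n * np.asarray(w_std)).astype(int)   # k_i = floor(n * w_i)
    n_r = n - counts.sum()
    residual = n * np.asarray(w_std) - counts
    idx = np.repeat(np.arange(n), counts)                  # deterministic copies
    if n_r > 0:
        extra = rng.choice(n, size=n_r, p=residual / residual.sum())
        idx = np.concatenate([idx, extra])
    return idx   # indices of the resampled particles; reset weights to 1/n after

# The table's example: standardized weights for n = 5 particles
w = np.array([0.10, 0.25, 0.05, 0.42, 0.18])
idx = residual_resample(w, np.random.default_rng(4))
```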
Particle filters: Resampling deals with the weight degeneracy problem but creates a new one: the different sequences are no longer statistically independent, and high-weight sequences are merely replicated rather than diversified. Particle filters employ additional measures in order to diversify the sequences at each resampling step (such as resampling from a mixture of smooth densities centered at some or all of the current particles). The simplest filter is the bootstrap filter, which simply performs simple random sampling at each time $t$. It is proper only for Markovian models.
The self-avoiding random walk (SAW) on a two- or three-dimensional lattice is often used as a simple model for chain polymers such as polyester. A realization of a chain polymer of length $N$ is fully characterized by the positions of all of its molecules $(x_1, \dots, x_N)$, where $x_i$ is a point on the 2-D lattice. The distance between two adjacent points has to be 1, and $x_i \ne x_j$ for all $i < j$. The line connecting two adjacent points is called a covalent bond.
One possible distribution for the chain polymer is the uniform distribution. More formally, $\pi_t(x_t) = \frac{1}{Z_t}$, where $Z_t$ is the total number of possible SAW chains of length $t$.

A basic method for sampling a SAW chain: at each step, the walker chooses with equal probabilities one of the 3 allowed neighbouring points (it cannot go backwards). If the chosen position has already been visited before, the walker goes back to $(0,0)$ and starts again. Problem?

A more efficient method is the one-step-look-ahead method. In this method, at each step the walker checks which neighbours are unoccupied and chooses one of them with equal probabilities. However, this introduces a bias, as can be demonstrated by the following example:
Look at the following two instances of a 5-atom chain. Using the one-step-look-ahead method, what are the probabilities of generating each of them?

$p(a) = \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot \frac{1}{2}, \qquad p(b) = \frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{3} \cdot \frac{1}{3}.$

Hence, chains generated by this method do not comply with the uniform distribution assumption.
We can correct for this bias using the IS framework. Essentially, we are using the importance distribution

$g_t(x_t \mid x_{t-1}) = \frac{1}{n_{t-1}},$

where $n_{t-1}$ denotes the number of unoccupied neighbours of the point $x_{t-1}$. Since the target ratio $\pi_t(x_t)/\pi_{t-1}(x_{t-1}) = Z_{t-1}/Z_t$ is a constant, the weight increment is, up to normalizing constants,

$u_t = \frac{1}{g_t(x_t \mid x_{t-1})} = n_{t-1}.$

When $t$ is large, two problems occur: 1. weight degeneracy; 2. the one-step-look-ahead method can become ineffective. How would we implement a two-step-look-ahead method, where $g_t(x_t \mid x_{t-1})$ accounts for the unoccupied neighbourhood two steps ahead? Advantages and disadvantages of bigger $k$-step-look-ahead methods?
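A sketch of the one-step-look-ahead SAW sampler with its bias-correcting weights; the helper name and the zero-weight handling of trapped walks are my additions:

```python
import numpy as np

def grow_saw(T, rng):
    """Grow one self-avoiding walk of T steps with the one-step-look-ahead
    method. Returns (walk, weight), where the weight is the product of the
    unoccupied-neighbour counts n_{t-1} (the IS correction for the selection
    bias), or (None, 0.0) if the walk gets trapped."""
    pos = (0, 0)
    visited = {pos}
    walk = [pos]
    w = 1.0
    for _ in range(T):
        nbrs = [(pos[0] + dx, pos[1] + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        free = [p for p in nbrs if p not in visited]
        if not free:                        # trapped: no unoccupied neighbour
            return None, 0.0
        w *= len(free)                      # weight increment u_t = n_{t-1}
        pos = free[rng.integers(len(free))] # choose uniformly among free sites
        visited.add(pos)
        walk.append(pos)
    return walk, w

rng = np.random.default_rng(5)
walks = [grow_saw(10, rng) for _ in range(1000)]
```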
The following example is taken from the book Computational Statistics by Givens and Hoeting.

Consider a Markov sequence of unobservable random variables $X_0, X_1, X_2, \dots$ indexed by time $t$, such that the distribution of $X_t \mid X_{t-1}$ depends only on $X_{t-1}$. Consider another sequence of observable random variables $Y_0, Y_1, \dots$ such that $Y_t$ depends on the current state $X_t$, and the observations $Y_t$ are conditionally independent given the first process. Thus, we have the model:

$X_t \sim p_x(\cdot \mid x_{t-1}), \qquad Y_t \sim p_y(\cdot \mid x_t),$

where $p_x$ and $p_y$ are density functions. We wish to use the observed $y_t$ as data to estimate the states $x_t$ of the hidden Markov process. In the importance sampling framework, the target distribution is $\pi_t(x_t \mid y_t)$, where $x_t$ and $y_t$ denote the state and observation histories up to time $t$.
We can decompose $\pi_t(x_t \mid y_t)$ like before:

$\pi_t(x_t \mid y_t) = f(x_1 \mid y_t)\,f(x_2 \mid x_1, y_t)\,f(x_3 \mid x_2, y_t) \cdots f(x_t \mid x_{t-1}, y_t),$

but it isn't very helpful. However, note that there is a recursive relationship between $\pi_t$ and $\pi_{t-1}$:

$\pi_t(x_t \mid y_t) = \pi_{t-1}(x_{t-1} \mid y_{t-1})\,\frac{p_x(x_t \mid x_{t-1})\,p_y(y_t \mid x_t)}{p(y_t \mid y_{t-1})}.$
$g_t(x_t \mid y_t)$ can be decomposed similarly:

$g_t(x_t \mid y_t) = g(x_1 \mid y_t)\,g(x_2 \mid x_1, y_t) \cdots g(x_t \mid x_{t-1}, y_t).$

However, we can choose $g$ to be independent of $y_t$; specifically, it is convenient to choose

$g_t(x_t) = p_x(x_t) = p_x(x_1)\,p_x(x_2 \mid x_1) \cdots p_x(x_t \mid x_{t-1}),$

i.e. $g(x_t \mid x_{t-1}, y_t) = p_x(x_t \mid x_{t-1})$ (Markovian).
Then the weight increment at time $t$ is:

$u_t = \frac{p_x(x_t \mid x_{t-1})\,p_y(y_t \mid x_t)}{g(x_t \mid x_{t-1}, y_t)} = \frac{p_x(x_t \mid x_{t-1})\,p_y(y_t \mid x_t)}{p_x(x_t \mid x_{t-1})} = p_y(y_t \mid x_t).$

And now for the concrete example.
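To see the increment $u_t = p_y(y_t \mid x_t)$ in action, here is a toy SIS run on an assumed linear-Gaussian state-space model (the model and its parameters are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed toy model:  X_t | X_{t-1} ~ N(0.9 X_{t-1}, 1),  Y_t | X_t ~ N(X_t, 0.5^2)
T, n = 20, 5_000
x_true = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.standard_normal()
    y[t] = x_true[t] + 0.5 * rng.standard_normal()

# SIS with g = p_x: propagate via the transition, weight by the likelihood only
particles = np.zeros(n)
logw = np.zeros(n)
for t in range(1, T):
    particles = 0.9 * particles + rng.standard_normal(n)   # sample from p_x
    logw += -(y[t] - particles)**2 / (2 * 0.5**2)          # log p_y up to constants
w = np.exp(logw - logw.max())
w /= w.sum()
x_last_est = np.sum(w * particles)    # filtered estimate of the final state
```

Without rejuvenation the weights degenerate as $T$ grows, which is exactly the problem the resampling steps address.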
An airplane flying over uneven terrain can use information about the ground elevation beneath it to estimate its current location. Simultaneously, an inertial navigation system provides an estimated travel direction and distance. At each time point the previously estimated location of the plane is updated using both types of new information. Interest in such problems arises in, for example, military applications, where the approach could serve as an alternative or backup to global satellite systems.

Let the two-dimensional variable $x_t = (x_{1t}, x_{2t})$ denote the true location of the plane at time $t$, and let $d_t$ denote the drift in the plane's location as measured by the inertial navigation system. The key data for terrain navigation come from a map database that contains the true elevation $m(x_t)$ at any location $x_t$.
The hidden Markov model for terrain navigation is:

$y_t = m(x_t) + \varepsilon_t, \qquad x_t = x_{t-1} + d_t + e_t,$

where $y_t$ is the observed elevation and $\varepsilon_t \sim N(0, \sigma^2)$.

Let us suppose that the plane is following a circular arc specified by 101 angles $\theta_t$ (for $t = 0, \dots, 100$) equally spaced between $\pi/2$ and 0, with the true location at time $t$ being $x_t = (\cos\theta_t, \sin\theta_t)$. Suppose that the random error is $e_{t+1} = R_t z_t$, where

$R_t = \begin{pmatrix} \dot{x}_{1t} & -\dot{x}_{2t} \\ \dot{x}_{2t} & \dot{x}_{1t} \end{pmatrix}, \qquad z_t \sim N\!\left(0,\ q^2\begin{pmatrix} 1 & 0 \\ 0 & k^2 \end{pmatrix}\right),$

with $k < 1$, so that the rotation $R_t$ is built from the tangent direction of the flight path.
This distribution for $e_t$ effectively constitutes the importance distribution $g(x_t \mid x_{t-1})$. This complicated specification is more simply described by saying that $e_t$ has a bivariate normal distribution with standard deviations $q$ and $kq$, rotated so that the major axis of the density contours is parallel to the tangent of the flight path at the current location. A standard bivariate normal distribution would be an alternative choice, but ours simulates the situation where uncertainty about the distance flown during the time increment is greater than uncertainty about deviations orthogonal to the direction of flight.
The sequential importance sampling algorithm for this problem proceeds as follows:
1. Initialize at $t = 0$. Draw $n$ starting points $x_0^{(i)}$ for $i = 1, \dots, n$. In real life, one could imagine that the initialization point corresponds to the departure airport or to some position update provided by occasional detection stations along the flight path, which provide highly accurate location data allowing the current location to be reinitialized.
2. Receive observed elevation data $y_t$.
3. Calculate the weight update factor $u_t^{(i)} = \varphi\!\left(y_t;\, m(x_t^{(i)}),\, \sigma^2\right)$, the normal density with mean $m(x_t^{(i)})$ and variance $\sigma^2$ evaluated at $y_t$.
4. Update the weights according to $w_t^{(i)} = w_{t-1}^{(i)} u_t^{(i)}$. If $t = 0$, then $w_{-1}^{(i)} = 1$.
5. Standardize the weights so that they sum to 1.
6. Estimate the current location as $\hat{x}_t = \sum_{i=1}^n \tilde{w}_t^{(i)} x_t^{(i)}$.
7. Check for weight degeneracy and rejuvenate if needed.
8. Sample a set of location errors $e_{t+1}^{(i)} \sim f_{t+1}(e)$ and advance the set of locations according to $x_{t+1}^{(i)} = x_t^{(i)} + d_{t+1} + e_{t+1}^{(i)}$.
9. Increment $t$ and return to step 2.
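The whole filter can be sketched on a synthetic stand-in problem; the elevation map, the isotropic $N(0, q^2 I)$ location errors (used in place of the rotated errors above), and all parameter values are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def m(x):                                    # assumed synthetic elevation map
    return 50 * np.sin(3 * x[:, 0]) + 30 * np.cos(5 * x[:, 1])

T, n, sigma, q = 100, 2_000, 1.0, 0.02
theta = np.linspace(np.pi / 2, 0, T + 1)
true_path = np.column_stack([np.cos(theta), np.sin(theta)])  # the circular arc
d = np.diff(true_path, axis=0)               # drift readings (noise-free here)
y = m(true_path[1:]) + sigma * rng.standard_normal(T)        # observed elevations

particles = np.tile(true_path[0], (n, 1)) + 0.05 * rng.standard_normal((n, 2))
w = np.full(n, 1.0 / n)
for t in range(T):
    # step 8: advance locations with drift plus isotropic error
    particles = particles + d[t] + q * rng.standard_normal((n, 2))
    # steps 3-5: likelihood weight update and standardization
    u = np.exp(-(y[t] - m(particles))**2 / (2 * sigma**2))
    w = w * u
    w /= w.sum()
    # step 7: resample when the effective sample size falls below n/2
    if 1.0 / np.sum(w**2) < n / 2:
        idx = rng.choice(n, size=n, p=w)
        particles = particles[idx]
        w = np.full(n, 1.0 / n)
est = w @ particles                          # step 6: estimated final location
```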