Temporal Correlation in Statistical Modeling

Generalised and mixed GAM models
Claudia von Brömssen
Dept of Energy and Technology
Mixed models - dependent observations
Standard statistical models assume independent observations.
If observations are not independent, then typically
- p-values are too low
- confidence intervals are too narrow
In many studies where GAM models are interesting to use we
have a time or space component, which also naturally leads to
temporal or spatial autocorrelation between observations.
Mixed models - dependent observations
The presence of temporal or spatial correlations when
estimating smooth functions can also lead to
- too complex functions = too little smoothing
If correlations between observations can be estimated, they
can also be incorporated in the model => such models are called
mixed models.
Correlation in time
To analyze if there is autocorrelation in a time series we can
use the autocorrelation function.
acf(river$logTot.P, na.action=na.exclude)
Autocorrelations for the total phosphorus transport series:
there is high autocorrelation, especially a clear seasonal
structure.
Correlation in time
Autocorrelations in time series can have several causes, e.g.:
- Seasonal variation
- Effects of autocorrelated explanatory variables
- Trend
If the model we define accounts for some of these sources, the
remaining autocorrelation in the residuals of the model is
typically much smaller than in the raw data.
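The effect can be sketched on simulated data (a minimal sketch; the names below are illustrative, not the river data): once the seasonal source is included in the model, the residual autocorrelation shrinks.

```r
# A sketch on simulated data (not the river series): a seasonal signal
# plus AR(1) noise; a cyclic smooth removes the seasonal source.
library(mgcv)
set.seed(1)
n     <- 240                                       # 20 years of monthly values
month <- rep(1:12, length.out = n)
y     <- 2 * sin(2 * pi * month / 12) +            # seasonal component
         as.numeric(arima.sim(list(ar = 0.3), n))  # AR(1) noise

fit <- gam(y ~ s(month, bs = 'cc'))
acf(y)               # clear seasonal structure in the raw series
acf(residuals(fit))  # much weaker autocorrelation in the residuals
```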
Correlation in time
ACF plots: autocorrelation in the original series (left);
autocorrelation when seasonal variation is removed (right).
Correlation in time
ACF plots: residuals without seasonal variation (left);
without seasonal variation, influence of flow and trend (right).
Correlation in time
model4g <- gamm(logTot.P ~ s(datenr) + s(Month, bs='cc') +
                s(logRunoff), data=river,
                correlation = corAR1())
summary(model4g$lme)
Correlation Structure: AR(1)
 Formula: ~1 | g/g.0/g.1
 Parameter estimate(s):
      Phi
0.2774044
Fixed effects: y ~ X - 1
                     Value  Std.Error  DF   t-value p-value
X(Intercept)      5.575500 0.01947148 357 286.34182       0
Xs(datenr)Fx1    -0.154891 0.01946324 357  -7.95812       0
Xs(logRunoff)Fx1  1.271152 0.12136436 357  10.47385       0
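Where the AR(1) estimate lives in a gamm fit can be sketched on simulated data (all names illustrative, standing in for the river series):

```r
# Sketch: reading the estimated AR(1) parameter from the lme part of a gamm fit.
library(mgcv)  # attaches nlme, which provides corAR1()
set.seed(2)
n <- 240
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + as.numeric(arima.sim(list(ar = 0.3), n, sd = 0.5))

m <- gamm(y ~ s(x), correlation = corAR1())
summary(m$lme)  # 'Phi' under 'Correlation Structure' is the lag-1 correlation
```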
Correlation in time
From lme for the model with correlation estimated
Fixed effects: y ~ X - 1
                     Value  Std.Error  DF   t-value p-value
X(Intercept)      5.575500 0.01947148 357 286.34182       0
Xs(datenr)Fx1    -0.154891 0.01946324 357  -7.95812       0
Xs(logRunoff)Fx1  1.271152 0.12136436 357  10.47385       0
From lme for the model with no correlation estimated
Fixed effects: y ~ X - 1
                     Value  Std.Error  DF  t-value p-value
X(Intercept)      5.575513 0.01463065 357 381.0845       0
Xs(datenr)Fx1    -0.154499 0.01467732 357 -10.5264       0
Xs(logRunoff)Fx1  1.320282 0.13299947 357   9.9270       0
Note that the standard errors are larger when the correlation is estimated.
Correlation in time
It is generally difficult to separate high temporal
autocorrelation from time trends.
Ignoring autocorrelation can result in exaggerated magnitudes
or too low p-values for trend estimates.
In some applications we have, however, also observed
over-smoothing when autocorrelation is estimated simultaneously.
Correlation in time
How do we know that the model we used for the autocorrelation
is sufficient?
Here we used an AR(1) model, indicating that observations
one time step apart are correlated.
What if the correlation structure is more complex?
We can check the residuals to see:
Residuals of mixed models in R
model4g <- gamm(logTot.P ~ s(datenr) + s(Month, bs='cc') + s(logRunoff),
                data=river, correlation = corAR1())
acf(residuals(model4g$gam))
ACF plots: not modelling autocorrelation (left) vs modelling autocorrelation (right).
What is the difference?
Residuals of mixed models in R
The residuals function applied to the gam part in R returns the
residuals without adjustment for the correlation structure.
We can use instead:
acf(residuals(model4g$lme, type="normalized"))
Correlation in data
More about other types of dependencies in data later.
Generalized additive models
Just as for GLM the response variable in GAM can have other
distributions than normal:
- Binomial for 0/1 data, proportions
- Poisson for count data
Generalized additive models
In GLM
- for a binomial model the expected value after logit
  transformation is modelled to be linearly dependent on the
  explanatory variables
- for a Poisson model the expected value after log
  transformation is modelled to be linearly dependent on the
  explanatory variables
Again using GAM we can remove the assumption of linearity
and allow a free form for the relationship.
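As a sketch (simulated data; the variable names are illustrative), the family argument selects the distribution and its link, and the smooth is then estimated on the link scale:

```r
# Sketch: binomial and Poisson GAMs; the link is handled automatically.
library(mgcv)
set.seed(3)
x  <- runif(200)
y1 <- rbinom(200, 1, plogis(2 * sin(2 * pi * x)))  # 0/1 data, logit link
y2 <- rpois(200, exp(1 + sin(2 * pi * x)))         # counts, log link

b1 <- gam(y1 ~ s(x), family = binomial)  # smooth modelled on the logit scale
b2 <- gam(y2 ~ s(x), family = poisson)   # smooth modelled on the log scale
```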
Generalized additive models
As before, we have the choice between several functions/
packages in R:
- gam
- gamm
- gamm4
Estimation in generalized models is more complicated than
when assuming normality for the response, and therefore
different functions can give different results.
Generalized additive models
General recommendations for the choice of function:
- No random terms or correlations: gam
- Random terms or correlations and normally distributed
  errors or Poisson data: gamm
- Random terms or correlations and binomial data: gamm4
GAM model for Count data
Data from the 'Svensk fågeltaxering' (the Swedish Bird Survey).
Robins are counted each year on fixed routes. Here we use
2009 data from southern Sweden.
Map: larger dots where more robins are observed, smaller ones for few.
GAM model for Count data
The count of robins during breeding season can be explained
by e.g.
- temperature during spring time
- location
- kind of habitat: proportion of arable land, forest...
- other types of geographical structures
GAM model for Count data
A Poisson model for counts of robins using temperature in April
(temp4), proportion of coniferous forest (CF) and a spatial
component (coordinates o1 and n1); lind is the count of robins:

modelC <- gam(lind ~ te(o1,n1) + s(temp4) + s(CF),
              family="poisson", data=bird_sub)

Thin plate splines are used for temperature and proportion of forest.
A tensor product is used for the spatial component to allow different
smoothing parameters in the north-south and the east-west
directions.
A Poisson distribution is used.
GAM model for Count data
Family: poisson
Link function: log
Formula:
lind ~ te(o1, n1) + s(temp4) + s(CF)
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.32783    0.02576   90.37   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
             edf Ref.df Chi.sq  p-value
te(o1,n1) 22.444 22.911 147.49  < 2e-16 ***
s(temp4)   8.796  8.984  60.29 7.59e-10 ***
s(CF)      6.259  7.407 106.01  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =   0.29   Deviance explained = 46.3%
UBRE = 2.5024  Scale est. = 1         n = 171
GAM model for Count data
High curvature in all smooths. What have we missed?
GAM model for Count data
Family: poisson
Link function: log
Formula:
lind ~ te(o1, n1) + s(temp4) + s(CF)
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.32783    0.02576   90.37   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
             edf Ref.df Chi.sq  p-value
te(o1,n1) 22.444 22.911 147.49  < 2e-16 ***
s(temp4)   8.796  8.984  60.29 7.59e-10 ***
s(CF)      6.259  7.407 106.01  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =   0.29   Deviance explained = 46.3%
UBRE = 2.5024  Scale est. = 1         n = 171
GAM model for Count data
In Poisson models the basic assumption is that data is Poisson
distributed, i.e. the mean is the same as the variance.
We can see that in the output as the scale estimate is set to 1,
no specific variance is estimated.
However, in many practical applications this assumption does
not hold: The data is overdispersed = we have more variation
than what the Poisson model allows.
GAM model for Count data
Overdispersion is often quantified by computing the residual
deviance divided by the residual degrees of freedom.
For gam there is no easy way to see how high this value is in the
summary output, and therefore we check how the fit changes if we
use quasipoisson instead, i.e. we allow the estimation of a
separate variance.

modelCa <- gam(lind ~ te(o1,n1) + s(temp4) + s(CF),
               family="quasipoisson", data=bird_sub)
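A rough by-hand check is also possible (a sketch, assuming the fitted Poisson model modelC above): the Pearson statistic divided by the residual degrees of freedom should be near 1 for genuinely Poisson data.

```r
# Sketch: approximate dispersion ratio for the fitted Poisson gam 'modelC'.
pearson <- residuals(modelC, type = "pearson")
sum(pearson^2) / modelC$df.residual  # values clearly above 1 suggest overdispersion
```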
GAM model for Count data
Family: quasipoisson
Link function: log
Formula:
lind ~ te(o1, n1) + s(temp4) + s(CF)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.37411    0.04997   47.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
             edf Ref.df     F  p-value
te(o1,n1) 10.059 12.636 1.654   0.0710 .
s(temp4)   1.000  1.000 3.660   0.0575 .
s(CF)      2.286  2.835 7.903 8.87e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.248   Deviance explained = 34.2%
GCV = 4.4589  Scale est. = 4.2207    n = 171
GAM model for Count data
The scale estimate is larger than 1 and the resulting smooths are
less wiggly after adjustment for overdispersion.
Contour plots: not accounting for overdispersion (left); accounting for overdispersion (right).
GAM model for Count data
If we use the gamm function instead, a scale parameter is
automatically estimated:
Approximate significance of smooth terms:
            edf Ref.df      F  p-value
te(o1,n1) 3.001  3.001  2.018    0.113
s(temp4)  1.000  1.000  0.952    0.331
s(CF)     1.608  1.608 20.931 4.55e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.173   Scale est. = 4.4881    n = 171
GAM model for Count data
A contour plot for the spatial component of
the fit can be plotted:
vis.gam(modelD$gam,plot.type='contour')
This represents only the spatial component
(not temp and CF) and shows the linear
predictor function, i.e. the relationship to
the log-count.
GAM model for Count data
More interesting is usually to plot the
predicted response = the modelled
number of robins:
vis.gam(modelD$gam, plot.type='contour', type="response")
The variables that are not plotted (temp4
and CF) are set to their mean values.
But why is the predicted area rectangular?
Do we have observations everywhere?
GAM model for Count data
We can include points to indicate
where the observations lie:
points(bird_sub$o1, bird_sub$n1)
In areas with large changes we
actually have no observations.
Maybe predicted values should only
be presented close to observations.
GAM model for Count data
We can use the too.far argument
to avoid predictions in areas where
no observations are made:
vis.gam(modelD$gam, plot.type='contour',
        type="response", too.far=0.1)
points(bird_sub$o1, bird_sub$n1)
Correlation in space
The same considerations as for correlation in time are also true
for correlations in space.
We can estimate correlation between residuals in all types of
generalized additive models.
Generally, when data is Poisson or binomially distributed, the
models including correlations are quite complex and
convergence is not always possible.
Correlation in space
To illustrate correlations in time we
used autocorrelation functions.
The corresponding function for
correlations in space is called a
variogram.
The lower the value in the left corner
of the plot, the higher the spatial
correlation.
Variogram of the original data.
Correlation in space
Again we can explain a part of the
autocorrelation by the components
in the model, mainly the spatial
component.
As before we can use the
correlation= argument to
include an estimate of the
correlation in the residuals.
Variogram of the residuals from the model.
Correlation in space
There are several spatial correlation structures that can be used,
e.g. corExp or corGaus. We can include longitude and latitude,
and the correlation is then modelled as a function of the distance
between observations:

modelD <- gamm(lind ~ te(o1,n1) + s(temp4) + s(CF),
               family="poisson", data=bird_sub,
               correlation = corExp(form = ~ o1 + n1))
It is assumed that correlations are isotropic, i.e. the correlations
are equally strong in all directions.
Correlation in space
from $lme
Correlation Structure: Exponential spatial correlation
 Formula: ~o1 + n1 | g/g.0/g.1
 Parameter estimate(s):
     range
0.09702329
from $gam
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.38310    0.05869   40.61   <2e-16 ***
Approximate significance of smooth terms:
            edf Ref.df      F  p-value
te(o1,n1) 3.001  3.001  2.018    0.113
s(temp4)  1.000  1.000  0.952    0.331
s(CF)     1.608  1.608 20.931 4.55e-06 ***
Correlation in space
In this case there are basically no differences between the fit
with and without the spatial correlation, since correlations in
the residuals are small.
Here we model the spatial mean explicitly, but we could also
look only for relations between explanatory variables and the
number of birds, without the spatial component.
Then the spatial correlations would be more important and
should be included.
(The model we use here has very low predictive power = not a
good model)
Correlation in space
Another interesting type of spline can be used for spatial areas
with clear borders, e.g. a lake.
Taken from Gavin Simpson's blog:
http://www.fromthebottomoftheheap.net/2016/03/27/soap-film-smoothers/
 
Correlation in space
Soap film smooths can adjust to the shape of the object that is
modelled.
Taken from Gavin Simpson's blog:
http://www.fromthebottomoftheheap.net/2016/03/27/soap-film-smoothers/
 
Other types of dependencies in data
Hierarchical sampling leads to data on different levels, e.g.
- several forests are selected,
- within each forest a few plots are selected, and
- within each plot 5 trees are selected.
Observations on the individual trees cannot be considered
independent, since their values can depend on which forest and
which plot they stand in.
The dependency structure can be accounted for by
nested/hierarchical factors.
Other types of dependencies in data
gamm(response ~ s(expl.var),
     random = list(Forest = ~1, Plot = ~1),
     data = urval)
You might recognize other coding of hierarchical factors:
+ (1|Forest/Plot)
random = ~ 1|Forest/Plot
In gamm only the named-list structure works.
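A minimal sketch with simulated nested data (all names illustrative, mirroring the Forest/Plot structure above) shows the named-list syntax in use:

```r
# Sketch: nested random intercepts in gamm
# (simulated data; 5 forests, 5 plots per forest, 4 trees per plot).
library(mgcv)
set.seed(4)
d <- data.frame(Forest = factor(rep(1:5,  each = 20)),
                Plot   = factor(rep(1:25, each = 4)),
                x      = runif(100))
d$y <- sin(2 * pi * d$x) +
       rnorm(5,  sd = 1.0)[d$Forest] +   # forest-level deviation
       rnorm(25, sd = 0.5)[d$Plot] +     # plot-level deviation
       rnorm(100, sd = 0.3)              # tree-level noise

m <- gamm(y ~ s(x), random = list(Forest = ~1, Plot = ~1), data = d)
summary(m$lme)  # shows variance components for Forest and Plot
```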
An example of splines in SAS
Splines can be used in all types of GLM models in the GLIMMIX
procedure:
proc glimmix data=in2 outdesign=x plots=all;
  class cow;
  effect myspline = spline(lactation_week);
  model log_methane = myspline / s noint dist=normal;
  random _residual_ / subject=cow type=un;
  output out=gmxout pred(noblup)=p lcl(noblup)=lower
                    ucl(noblup)=upper;
run;
An example of splines in SAS
proc sort data=gmxout;
  by lactation_week;
run;

proc sgplot data=gmxout;
  series y=p x=lactation_week;
  keylegend / position=bottom border across=5;
  band x=lactation_week lower=lower upper=upper;
  series y=p x=lactation_week;
run;
An example of splines in SAS
In GLIMMIX we can model data with normally
distributed residuals, as well as binomial and Poisson
data.
We can also model temporal or spatial
autocorrelation, or other random factors.
Two-dimensional splines cannot be fitted, but
interactions with categorical variables are possible.