Random model


Basic statistical model


At the INTAMAP project meeting in Utrecht, it was agreed to start a discussion on two issues to make sure that all partners develop modules that are consistent with each other. The first of these issues is the definition of a basic statistical model to be used in the computations. The second issue is a description of which modules we will develop in this project, and how they are supposed to interact with each other.


The present document is a draft to get this discussion started. You (the project partners) are expected to contribute to it, either by editing it directly or by suggesting changes.


Statistical model


1) Basic definitions

This part is based on the model proposed by Olivier Baume and Gerard Heuvelink from Wageningen University. The first principle is that we want to separate the natural drifts in the target state variable from the country- and network-specific heterogeneities that affect the measurements. To do so, we distinguish between a state variable z and a measured variable y:

y(s) = z(s) + e(s)

where s is a position vector and where e represents measurement error. We assume that z and y are realisations of random variables that satisfy the following model:

Y = Z + ε

where Z is the state variable and ε is measurement error (including network dependent bias). Our assumptions are that the state variable depends on

1. k drift components {Fi} through the linear model Fa, where F is the design matrix and a is the vector of k coefficients.

2. a spatially correlated random component δ, resulting in

Z = Fa + δ


The measurement error can be decomposed into

1. a bias: drift components that depend on the network or device type (i.e., artefacts that cause systematic differences between the state variable and the measured variable)

2. a zero-mean random measurement error: in many cases it will be spatially uncorrelated but we may allow spatial dependencies.

ε = Gb + ζ

where Gb is the weighted sum of l artefact factors {Gj}, with weights given by the coefficient vector b, and ζ is the zero-mean random measurement error.
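To make the notation concrete, the following minimal Python sketch draws one realisation of the model on hypothetical data. The station layout, the choice of drift (intercept plus x-coordinate), the single network indicator in G, the exponential covariance for δ, the uncorrelated ζ and all numerical values are assumptions made for illustration only; none of them is prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: n stations at random positions s in a 100 x 100 domain.
n = 50
s = rng.uniform(0.0, 100.0, size=(n, 2))

# Natural drift Fa (assumed): intercept plus a linear trend in x, so k = 2.
F = np.column_stack([np.ones(n), s[:, 0]])
a = np.array([10.0, 0.05])

# Artefact drift Gb (assumed): one indicator column marking stations of a
# second network, the first network being the reference, so l = 1.
network = rng.integers(0, 2, size=n)
G = (network == 1).astype(float).reshape(-1, 1)
b = np.array([2.0])

# Spatially correlated component delta with an assumed exponential covariance.
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
C_delta = 1.0 * np.exp(-d / 20.0)
L = np.linalg.cholesky(C_delta + 1e-10 * np.eye(n))  # small jitter for stability
delta = L @ rng.standard_normal(n)

# Zero-mean measurement error zeta, assumed spatially uncorrelated here.
zeta = np.sqrt(0.1) * rng.standard_normal(n)

Z = F @ a + delta      # state variable:    Z = Fa + delta
eps = G @ b + zeta     # measurement error: eps = Gb + zeta
Y = Z + eps            # measured variable: Y = Z + eps
```

Substituting the two decompositions gives Y = Fa + Gb + δ + ζ, which is the form used when the coefficients are estimated.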

Each coefficient in a and b may be either known or unknown. Known coefficients are supplied by the user; unknown coefficients need to be estimated, e.g. by best linear unbiased estimation (BLUE). The generalised least squares (GLS) formulation of the estimation procedure includes the variance-covariance matrices of the measurement errors and state variables.
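As an illustration of the GLS step, the sketch below estimates a and b jointly from the stacked design matrix X = [F G]. It assumes that all coefficients are unknown, that δ and ζ are uncorrelated so that the covariance of Y is C = C_δ + C_ζ, and that X has full column rank. The function name gls_estimate and the small data set are hypothetical.

```python
import numpy as np

def gls_estimate(y, F, G, C):
    """GLS (BLUE) estimate of a and b, assuming Cov(Y) = C and full-rank [F G]."""
    X = np.hstack([F, G])
    Ci_X = np.linalg.solve(C, X)          # C^{-1} X
    Ci_y = np.linalg.solve(C, y)          # C^{-1} y
    # theta = (X' C^{-1} X)^{-1} X' C^{-1} y
    theta = np.linalg.solve(X.T @ Ci_X, X.T @ Ci_y)
    k = F.shape[1]
    return theta[:k], theta[k:]           # a_hat, b_hat

# Hypothetical one-dimensional example with six stations and two networks.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
network = np.array([0, 1, 0, 1, 1, 0])
F = np.column_stack([np.ones(6), x])              # intercept + trend
G = (network == 1).astype(float).reshape(-1, 1)   # indicator of network 1

# Assumed covariance: exponential for delta plus an uncorrelated error variance.
C = np.exp(-np.abs(x[:, None] - x[None, :])) + 0.1 * np.eye(6)

# Noise-free data, so the estimates should reproduce the coefficients used.
y = F @ np.array([1.0, 2.0]) + G @ np.array([0.5])
a_hat, b_hat = gls_estimate(y, F, G, C)
```

Coefficients that are known in advance could be handled by subtracting their contribution from y and dropping the corresponding columns from X before calling the function.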

Prediction of the state variable from the measurements can be done using the best linear unbiased predictor (BLUP), i.e. universal kriging. In this way, the influence of the artefact drift components G is removed while the effect of the natural drift components F remains. In order to apply universal kriging, the natural drift components must be known for the entire geographic domain (spatially exhaustive).
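The prediction step can be sketched as follows. The predictor used here is f0'â + c0'C⁻¹(y − Fâ − Gb̂), where f0 holds the natural drift components at the prediction location and c0 the covariances between δ at that location and δ at the stations. This form assumes that δ and ζ are uncorrelated and that all coefficients are estimated by GLS; the function name blup_predict and the data are hypothetical.

```python
import numpy as np

def blup_predict(y, F, G, C_delta, C_zeta, f0, c0):
    """Predict the state variable at one location (universal kriging form).

    Assumes delta and zeta are uncorrelated, so Cov(Y) = C_delta + C_zeta.
    """
    C = C_delta + C_zeta
    X = np.hstack([F, G])
    theta = np.linalg.solve(X.T @ np.linalg.solve(C, X),
                            X.T @ np.linalg.solve(C, y))
    k = F.shape[1]
    a_hat, b_hat = theta[:k], theta[k:]
    resid = y - F @ a_hat - G @ b_hat
    # Natural drift at the target plus the kriged residual; Gb is not added back.
    z_hat = f0 @ a_hat + c0 @ np.linalg.solve(C, resid)
    return z_hat, a_hat, b_hat

# Hypothetical one-dimensional example: six stations, two networks.
x = np.array([0.0, 1.0, 2.5, 4.0, 5.5, 7.0])
network = np.array([0, 1, 0, 1, 1, 0])
y = np.array([1.0, 2.3, 1.8, 3.1, 3.6, 2.9])
F = np.column_stack([np.ones(6), x])
G = (network == 1).astype(float).reshape(-1, 1)

# Assumed exponential covariance for delta; no random measurement error here.
C_delta = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0)
C_zeta = np.zeros((6, 6))

# Predict at the location of station 1, which belongs to network 1.
i = 1
z_hat, a_hat, b_hat = blup_predict(y, F, G, C_delta, C_zeta, F[i], C_delta[:, i])
```

With no random measurement error, the prediction at a station location equals the measurement minus the estimated artefact bias, which shows how the influence of G is removed while that of F is kept.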


In summary, the model parameters are:

1. design matrices F and G that are derived from environmental covariates and artefact information (e.g., country code, instrument type);

2. the values of the coefficients a and b of all natural and artefact drifts. These coefficients may either be known or need to be estimated from the measurements;

3. the variance-covariance matrix of the random measurement errors and the covariance function of the spatially correlated random component δ.

Once these parameters are known, spatial prediction and stochastic simulation of the target variable are done using universal kriging and spatial stochastic sequential simulation, respectively.


2) Discussion

Since harmonization is the process of bringing something into harmony, we must agree on what the state variable Z is and which artefacts must be taken out so that harmonization of the data actually takes place.

Our target is to provide a model and procedure:

  • If the model is supplied with useful information – apply it
  • Remaining differences will be handled by universal kriging
  • The model will default to no harmonisation

Some other issues:

  • How do we estimate the coefficients a and b? We may need specific approaches that can separate the true drift from the artefacts (multicollinearity).
  • How do we estimate the variance-covariance structures of δ and ζ?
  • BLUE and BLUP (UK) are optimal for Gaussian distributions; how do we extend the procedure to non-Gaussian situations?
  • How do we do all of the above in automatic mode?
  • The basic model presented needs to be elaborated and extended. In what directions?
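
On the multicollinearity issue raised in the list above, one simple diagnostic is to check the column rank of the stacked design matrix [F G] before estimating a and b. The sketch below uses hypothetical data in which F contains an intercept and G contains one indicator column per network; the indicator columns then sum to the intercept, so the coefficients cannot be separated. Treating one network as the reference is one possible remedy, shown here as an assumption and not as an agreed procedure.

```python
import numpy as np

# Hypothetical data: six stations in three networks, with one covariate x.
network = np.array([0, 0, 1, 1, 2, 2])
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

F = np.column_stack([np.ones(6), x])                        # intercept + trend
G_full = (network[:, None] == np.arange(3)).astype(float)   # one column per network
G_ref = G_full[:, 1:]                                       # network 0 as reference

X_full = np.hstack([F, G_full])   # 5 columns
X_ref = np.hstack([F, G_ref])     # 4 columns

rank_full = np.linalg.matrix_rank(X_full)   # lower than the number of columns
rank_ref = np.linalg.matrix_rank(X_ref)     # equal to the number of columns

identifiable_full = rank_full == X_full.shape[1]
identifiable_ref = rank_ref == X_ref.shape[1]
```

A full-rank matrix only rules out exact collinearity. If network membership closely follows an environmental covariate, the matrix can be full rank and still give poorly determined estimates, which is the harder case the question refers to.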