Some more requirements for UncertML 10/5/07

From Intamap

Usage scenarios

Here are some scenarios for how UncertML might be used:

  • To encode our uncertainty in observations from sensors, using the <quality> tag - this is really extending the concepts in SWE common, to provide a more rigorous definition of uncertainty.
  • To represent the results of our interpolation methods; for example to return the (marginal) pdf from the INTAMAP service over a grid.
  • (possibly) to help encode the algorithm we used to obtain the interpolation (meta-data how???)

Requirements

The requirement will be to:

  • represent the uncertainty in a vector (possibly of size 1) of values of continuous variables.

Note that for discrete values things are somewhat different - I will address this somewhere else.

So what representations might we want to encode in UncertML?

Moments of the distribution

  • centeredMoment(momentOrder) - note this could return a vector or matrix (array), or even potentially higher order data structures

Special named components

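  • mean - nominally centeredMoment(momentOrder=1), although strictly the mean is the first raw moment (the first centred moment is always zero) - this would be a vector. (also average, expectation, ...)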
  • covariance - centeredMoment(momentOrder=2) joint covariance matrix for all variables (note this could be huge!) - this would be a matrix (array)

We should also support the following for marginal specification

  • variance - (marginal) variances for each variable (vector) - diagonal of covariance
  • standard_deviation - square root of the variance (vector)
  • skewness - centeredMoment(momentOrder=3), usually standardised by the cube of the standard deviation - note these are only ever used for univariate marginal distributions in my experience, thus would be vectors
  • kurtosis - centeredMoment(momentOrder=4), usually standardised by the square of the variance - note these are only ever used for univariate marginal distributions in my experience, thus would be vectors
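
As a rough illustration of how the moment-based summaries above relate to one another, the sketch below computes them from a set of joint samples using only the Python standard library. The function names mirror the list above but are otherwise hypothetical and not part of any UncertML definition. Population (divide-by-n) estimators and standardised skewness and kurtosis are assumed, and the sample values are made up. Because the first centred moment is always zero, the mean is computed as the first raw moment.

```python
import math
from typing import List, Sequence

Rows = Sequence[Sequence[float]]  # one row per sample, one column per variable


def mean(samples: Rows) -> List[float]:
    """First raw moment of each variable (a vector)."""
    n = len(samples)
    return [sum(row[j] for row in samples) / n for j in range(len(samples[0]))]


def centered_moment(samples: Rows, moment_order: int) -> List[float]:
    """Marginal centred moment of each variable (a vector)."""
    n = len(samples)
    mu = mean(samples)
    return [
        sum((row[j] - mu[j]) ** moment_order for row in samples) / n
        for j in range(len(mu))
    ]


def covariance(samples: Rows) -> List[List[float]]:
    """Joint second centred moment: a d x d matrix for d variables."""
    n = len(samples)
    mu = mean(samples)
    d = len(mu)
    return [
        [sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in samples) / n
         for j in range(d)]
        for i in range(d)
    ]


def variance(samples: Rows) -> List[float]:
    """Marginal variances: the diagonal of the covariance matrix."""
    return centered_moment(samples, 2)


def standard_deviation(samples: Rows) -> List[float]:
    return [math.sqrt(v) for v in variance(samples)]


def skewness(samples: Rows) -> List[float]:
    """Third centred moment divided by the standard deviation cubed."""
    m2, m3 = centered_moment(samples, 2), centered_moment(samples, 3)
    return [m3[j] / m2[j] ** 1.5 for j in range(len(m2))]


def kurtosis(samples: Rows) -> List[float]:
    """Fourth centred moment divided by the variance squared."""
    m2, m4 = centered_moment(samples, 2), centered_moment(samples, 4)
    return [m4[j] / m2[j] ** 2 for j in range(len(m2))]


# Illustrative data only: three joint samples of two variables.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
```

The covariance grows with the square of the number of variables, which is why the marginal (vector-valued) summaries are kept as separate functions here.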

Other common descriptors of the data

  • mode - The most frequently occurring value / most probable value - vector
  • median - The number that occurs midway through the ordered list - vector
  • range - The maximum and minimum values - two vectors or a vector of two values (array)?
  • confidence interval - return a specified confidence interval for the (posterior) predictive distribution - request has the CI, e.g. [.05,.95] - return value as for range.
  • probability of a value lying in a defined range (including -infinity to +infinity) - thus the minimum and maximum values are specified e.g. [-5, -1] - return value is a vector (of probabilities).
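
To make the request and return shapes of these descriptors concrete, here is a minimal sketch that estimates them from samples of a single variable. Applying it to each variable in turn gives the vectors described above. The function names and the linear-interpolation quantile rule are assumptions made for illustration, and the sample values are invented.

```python
import math
import statistics
from typing import Sequence, Tuple


def value_range(values: Sequence[float]) -> Tuple[float, float]:
    """Minimum and maximum of the samples."""
    return min(values), max(values)


def empirical_quantile(values: Sequence[float], p: float) -> float:
    """Quantile at probability p (0 <= p <= 1) by linear interpolation."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def interval(values: Sequence[float], lower_p: float, upper_p: float) -> Tuple[float, float]:
    """Interval between two requested probabilities, e.g. (0.05, 0.95)."""
    return empirical_quantile(values, lower_p), empirical_quantile(values, upper_p)


def probability_in_range(values: Sequence[float], minimum: float, maximum: float) -> float:
    """Fraction of samples lying in [minimum, maximum]; bounds may be infinite."""
    inside = sum(1 for v in values if minimum <= v <= maximum)
    return inside / len(values)


# Illustrative samples only: 0, 1, ..., 100.
values = [float(v) for v in range(101)]
summary = {
    "median": statistics.median(values),
    "range": value_range(values),
    "interval": interval(values, 0.05, 0.95),
    "probability": probability_in_range(values, -math.inf, math.inf),
}
```

Note that the interval is returned in the same two-value form as the range, matching the "return value as for range" remark above.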

Other non-parametric representations

  • quantiles - this would be more general than confidence intervals since a range of quantiles of the distribution might be requested - a series of vectors, or a vector of series (array).
  • histogram - return a histogram of the distribution - request specifies the number / location of bins. This would be a vector of histograms (array).
  • samples - return samples from the distribution - request passes the number of samples - return type is a vector of samples or samples of a vector (array).
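
The three non-parametric requests above could look something like the following for a single variable. This is only a sketch: the function names are hypothetical, the histogram request is assumed to pass explicit bin edges, and samples are drawn from a Gaussian purely as a stand-in for whatever distribution the service actually holds.

```python
import random
import statistics
from typing import List, Sequence


def quantile_cut_points(values: Sequence[float], n_intervals: int) -> List[float]:
    """Cut points dividing the samples into n_intervals equal-probability groups."""
    return statistics.quantiles(values, n=n_intervals, method="inclusive")


def histogram(values: Sequence[float], bin_edges: Sequence[float]) -> List[int]:
    """Counts per bin; bins are [low, high) except the last, which is closed."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


def draw_samples(mean: float, standard_deviation: float, count: int, seed: int) -> List[float]:
    """Draw `count` samples from a Gaussian; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return [rng.gauss(mean, standard_deviation) for _ in range(count)]


# Illustrative requests.
quartiles = quantile_cut_points([float(v) for v in range(101)], 4)
counts = histogram([0.5, 1.5, 1.7, 3.0], [0.0, 1.0, 2.0, 3.0])
draws = draw_samples(mean=0.0, standard_deviation=1.0, count=5, seed=42)
```

For a vector of variables, each call would be repeated per variable, giving the "vector of histograms" or "vector of samples" shapes mentioned above.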

Parametric and semi-parametric representations

  • probability distribution - return the probability density function of the variables. This should include the functional form of the pdf, e.g. in MathML? and the parameters of the distribution; these could be scalars, vectors or matrices (arrays). The encoding should be flexible enough so that new distribution functions can be easily added. Mixture models should be included.
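
One possible way to keep the functional form separate from the parameters, and to allow mixtures, is sketched below for the univariate case. The class names, the choice of a Gaussian as the example family, and the parameter values are all assumptions for illustration. Nothing here is a proposed UncertML encoding.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    """Univariate Gaussian: the functional form is fixed, the parameters are data."""
    mean: float
    variance: float

    def pdf(self, x: float) -> float:
        norm = math.sqrt(2.0 * math.pi * self.variance)
        return math.exp(-((x - self.mean) ** 2) / (2.0 * self.variance)) / norm


@dataclass
class Mixture:
    """Weighted sum of component densities; weights are assumed to sum to 1."""
    weights: List[float]
    components: List[Gaussian]

    def pdf(self, x: float) -> float:
        return sum(w * c.pdf(x) for w, c in zip(self.weights, self.components))


# Illustrative parameter values only.
standard_normal = Gaussian(mean=0.0, variance=1.0)
bimodal = Mixture(
    weights=[0.3, 0.7],
    components=[Gaussian(mean=-2.0, variance=1.0), Gaussian(mean=2.0, variance=0.5)],
)
```

Adding a new distribution would then mean adding one more class with its own parameters and pdf, which is the kind of extensibility asked for above.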

Role in INTAMAP

The main roles for this schema in INTAMAP would be:

  • Encoding information about the distribution of errors on the observations passed to the interpolation process. It is interesting to contemplate how this might work. One might be inclined to treat the observation as the value encoded in a SWE common quantity element, with the uncertainty encoded in the quality flag, but this might not seem very logical to some. What we are describing in the quality part is the error on the observation, typically in terms of a particular distribution or a measure of spread (variance, standard deviation, confidence interval, range). This seems to work fine.
  • Encoding information about the results of the interpolation process. This might typically be a distribution, a set of quantiles, or some selected (centred) moments (normally orders 1 and 2). Here only the quality information is used; there is no value in the quantity.
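
The difference between the two roles can be sketched as a single structure in which the value is optional. The class below is a hypothetical stand-in, not the actual SWE common quantity element, and the field names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Quantity:
    """Hypothetical stand-in for a quantity with a quality (uncertainty) part."""
    value: Optional[float]      # the observed value, if there is one
    quality: Dict[str, float]   # uncertainty description, e.g. a variance

    def is_observation(self) -> bool:
        return self.value is not None


# An observation: a measured value plus the error on that measurement.
observation = Quantity(value=12.3, quality={"standard_deviation": 0.5})

# An interpolation result: only the predictive distribution is returned.
prediction = Quantity(value=None, quality={"mean": 11.8, "variance": 0.9})
```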

Discussion re: SWE common

This is what worries me about fitting in with SWE: the structure does not seem right to me. It could be the use of the name "quality", where I think "uncertainty" would be better, but it also relates to my intuition that all things (that are interesting / useful) are uncertain. The "Quantity" element is meant to represent the result of an observation, I guess - or are the SWE people thinking of something more general, e.g. that things in GML, such as coordinates, could be represented using "Quantity"?

Other things to think about (to do list)

Still to address are:

  • discrete variables
  • fuzzy representations [appropriate when a given real variable might really take a range of values, for example when defining whether a proportion of land in a given area is woodland; this might make sense as a fuzzy number]
  • how to implement this
  • how it fits with other schema
  • what semantics should we use - by which I simply mean what should we call things