InterpML


InterpML discussion page

The aim of this page is to evolve into a clear rationale for the specification of InterpML. Note that InterpML might not exist!

InterpML meeting: Aston 9/7/08 (DC, MW, LB, JdJ, JS)

The initial requirements for InterpML can be found here: InterpML schema requirements. These remain valid, and inform much of the discussion that follows.

The following issues were discussed:

  • How should the WPS be called? DC's original idea was to have one big InterpML document that stored all the information needed by, and probably also returned by, the WPS Interpolation method. Both JdJ and MW pointed out the limitations of this, and it was agreed that, rather than one big InterpML document, the arguments for the WPS should be broken down as much as possible. The rationale is that interoperability is much better served if a user can see the arguments required; however, the complexity of the service means that almost all the arguments are complex types.
  • We discussed how to use existing schemas wherever possible in the process to assist interoperability. MW will write and circulate a summary of a minimal schema that will be implemented for September 2008.

This is a (not very good) photo of the whiteboard after 6 hours, focussing on the WPS side:

Image:Interpml_wps.png

  • After 6 hours of discussion, one of the key realisations, as EP had stated prior to the meeting, was that we might not have been paying enough attention to the *Automatic* part of INTAMAP. In terms of inputs to the WPS we should ask for the minimum. These inputs should still be properly structured, but asking for so little will limit what we can do without a rather clever expert system, which INTAMAP never envisaged producing.

The inputs we decided were important:

  • Observations [mandatory] : a collection of om:ObservationCollections, with identifiers.
    • the reason for the collection is that this would be a mechanism for grouping networks, with each network in a separate ObservationCollection, for bias removal or to return the results of clustering later on. The identifiers allow us to stipulate the reference network in the bias removal case.
    • note that the ObservationCollection, from the Observations and Measurements schema, can encode the target of the observation (might have non-point support, GML), the observation result (SWE Common?), the observation quality (UncertML), the sensor function (SensorML), and standard things like time, ID etc.
  • PredictionDomain [optional] : a restriction of gml:Geometry to points, polygons and grids (coverages).
  • InterpolationMethod [optional] : this caused us a lot of problems because many, if not all, methods would benefit from the ability to pass a range of optional parameters:
    • InterpolationParameters: these would control issues such as the number of iterations, samples, active points, points in the kriging neighbourhood, lag class in the variogram estimation, ... there are so many options here we got a bit lost. Our initial solution was to use a dictionary of parameters, with a collection of these - this is very flexible, but not very elegant or easy to use from the client side. It also raises issues over, for example, the naming of parameters, since the dictionaries are XML instances which we might have no control over.
    • RandomFunction: this probably caused us most problems. By random function DC means a stochastic process, random field, random process. In fact DC really means a Gaussian version of the above. This is of course typically specified through a mean function and covariance (variogram) function (DC has started to think about copulas too). This is different from a multivariate Gaussian distribution, since this is an infinite dimensional object, defined at all points in a given domain, typically the reals.
      • This quickly got rather complicated and led to the view that in the version for September we should not have any information about this in the request; after all, that is not automatic!
      • However, it also highlighted the need for an additional schema, which uses UncertML. UncertML is designed to describe unconditional distributions (or conditional distributions where the conditioning no longer explicitly matters, e.g. the posterior (predictive) distribution from a Bayesian random function model, where only the predictive distribution is of interest). I propose to call this schema ProbabilisticModelML. In essence it would encode explicit conditional distributions, and thus should be able to cope with a range of probabilistic models (not just random variables as in UncertML). The most important class of ProbabilisticModel for INTAMAP would be the GaussianRandomFunction model. See image:

Image:PMML.png
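
To make the RandomFunction point concrete, here is a minimal sketch in Python, not part of any agreed schema. It assumes a Gaussian random function is specified only by a mean function and a covariance function, and shows that a multivariate Gaussian (mean vector and covariance matrix) only appears once a finite set of points is chosen. The class name, the `finite_marginal` method and the exponential covariance with its `sill` and `range_` parameters are all illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class GaussianRandomFunction:
    """Hypothetical container: a mean function plus a covariance function.

    The object is defined at every point of the domain; nothing here is
    tied to a particular finite set of locations.
    """
    mean: Callable[[Point], float]
    covariance: Callable[[Point, Point], float]

    def finite_marginal(
        self, points: Sequence[Point]
    ) -> Tuple[List[float], List[List[float]]]:
        """Mean vector and covariance matrix at a finite set of points.

        This is the multivariate Gaussian implied by the random function
        at those points.
        """
        mu = [self.mean(p) for p in points]
        sigma = [[self.covariance(p, q) for q in points] for p in points]
        return mu, sigma


def exponential_covariance(sill: float, range_: float) -> Callable[[Point, Point], float]:
    """Illustrative covariance: sill * exp(-distance / range_)."""
    def cov(p: Point, q: Point) -> float:
        return sill * math.exp(-math.dist(p, q) / range_)
    return cov


# Assumed example values: zero mean, sill 2.0, range 10.0.
grf = GaussianRandomFunction(
    mean=lambda p: 0.0,
    covariance=exponential_covariance(sill=2.0, range_=10.0),
)

# Two points a distance 5 apart give a 2-vector and a 2x2 matrix.
mu, sigma = grf.finite_marginal([(0.0, 0.0), (3.0, 4.0)])
```

The same `grf` object can be queried at any other set of points, which is the sense in which it is a different kind of object from a single multivariate Gaussian distribution.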

  • InterpolationResult [optional] : a document to encode random variables using UncertML.
  • A number of optional parameters to define the bias correction, anisotropy, parameter estimation, other things??? We realised that these are not easy to pin down and used the same dictionary idea to allow these to be extensible, although we did not work this up fully. This is now seen as something very much for the future, if resources allow - for now we suggest simply having booleans to define whether these activities are carried out.
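
The inputs listed above could be sketched as follows. This is only an illustration in Python of the shape of the request, not a WPS binding or an InterpML schema: all class and field names are hypothetical, the observation and geometry payloads are left as opaque values, and InterpolationResult is omitted. It assumes the boolean-flag suggestion for bias correction, anisotropy and parameter estimation, and the use of collection identifiers to name the reference network.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ObservationCollection:
    """Stand-in for one om:ObservationCollection (e.g. one network)."""
    identifier: str
    observations: List[Any] = field(default_factory=list)


@dataclass
class InterpolationRequest:
    """Hypothetical request: one mandatory input, everything else optional."""
    observations: List[ObservationCollection]          # mandatory
    prediction_domain: Optional[Any] = None            # points, polygons or grid
    interpolation_method: Optional[str] = None
    # The flexible "dictionary of parameters" idea: names are not controlled.
    interpolation_parameters: Dict[str, Any] = field(default_factory=dict)
    # Simple on/off switches suggested for now.
    bias_correction: bool = False
    anisotropy: bool = False
    parameter_estimation: bool = False
    # Identifier of the reference network, used only for bias removal.
    reference_network: Optional[str] = None

    def validate(self) -> None:
        """Raise ValueError if the request is not usable."""
        if not self.observations:
            raise ValueError("at least one observation collection is required")
        ids = [c.identifier for c in self.observations]
        if len(set(ids)) != len(ids):
            raise ValueError("observation collection identifiers must be unique")
        if self.bias_correction and self.reference_network not in ids:
            raise ValueError("bias correction needs a known reference network")


# Example: two networks, bias removal relative to the first one.
request = InterpolationRequest(
    observations=[ObservationCollection("netA"), ObservationCollection("netB")],
    interpolation_parameters={"kriging_neighbourhood_points": 20},
    bias_correction=True,
    reference_network="netA",
)
request.validate()
```

The sketch also shows the weakness of the dictionary approach noted above: nothing checks that `kriging_neighbourhood_points` is a name the service understands.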

Unresolved problems

Several issues came up that are not yet resolved.

  • The use of external variables, e.g. elevation. To DC this seems a very serious issue, since it raises several questions:
    • how they enter into the random function model: through the mean function (universal kriging), locally (external drift kriging), or jointly (cokriging)?
    • how they are communicated to the WPS in the 'training set' (observations) - e.g. as a separate observation collection? Or together with the primary variable in the initial observation collection, where they are collocated?
    • how they are communicated to the WPS in the 'test set' (prediction locations) - e.g. as another observation collection? Somehow with the geometry - how??? The observation collection route seems best, as this is really what they are in the SWE world (or derived observations in the DC world).
    • how to link the variables in the mean function with the corresponding structure: ID is not so easy to use here???
  • How complete a level of control do we actually want to offer to our users? INTAMAP was sold as a single interpolation solution to fit all problems, but, as a Bayesian, I believe that automatic can include judgements from an expert where available.
  • Even if we allow very little control, we ought to be communicating the meta-data about the interpolation method used and its internal parameters to any user, for lineage issues in a processing chain. How structured should this be? This is where a good ProbabilisticModel schema would be very handy, since this could also contain posteriors, or simply estimated parameters from the application of our process. But how much effort should we be putting into this?