What can you send to the xval
The CrossValidation process splits the sample data into subsets so that the analysis is first performed on one subset, while the remaining subset(s) are held back to confirm and validate that initial analysis. In K-fold crossvalidation the original data is partitioned into K subsamples and the crossvalidation is then repeated K times. If K equals the number of observations we have Leave-One-Out CrossValidation (LOOCV), where a single observation is removed from the original data and its value is estimated using the remaining set. The crossvalidation process has a default K-fold of 10. The use of LOOCV is not advised since it is extremely time consuming.
The crossvalidation process (xval) is designed to use the same inputs/outputs as the INTAMAP system. Therefore it uses UncertML and Obs&Meas as its main XML structures.
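The partitioning described above can be sketched in a few lines of Python. This is only an illustration of K-fold splitting in general, not the code used by the xval service; the function names and the fixed random seed are assumptions made for the example.

```python
import random


def kfold_indices(n_obs, k=10, seed=0):
    """Split the indices 0..n_obs-1 into k shuffled folds of near-equal size."""
    if not 2 <= k <= n_obs:
        raise ValueError("k must be between 2 and the number of observations")
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_validation_pairs(n_obs, k=10, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = kfold_indices(n_obs, k, seed)
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 20 observations and K = 5 give 5 rounds with 16 training and 4 validation points.
pairs = list(train_validation_pairs(20, k=5))
```

Setting k equal to n_obs gives the leave-one-out case, with one validation point per round.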
List of inputs
Inputs that can be sent to the crossvalidation service. Only ObservationCollection is mandatory.
- ObservationCollection
- Kfold
- InterpolationServer
- ProcessName
- MethodName
- MethodParameters
- MaxTime
ObservationCollection
The ObservationCollection should contain only one of the Observation or Measurement types. INTAMAP requires a minimum of 20 observations; with fewer, INTAMAP raises an exception, which is passed on to the xval service. The service parses the gml:Point, the om:result and the SRS information of the gml:Point.
<om:ObservationCollection xmlns:gml="http://www.opengis.net/gml" xmlns:om="http://www.opengis.net/om/1.0"
xmlns:sa="http://www.opengis.net/sampling/1.0"
xmlns:xlink="http://www.w3.org/1999/xlink">
<om:member>
<om:Observation gml:id="test1">
<om:samplingTime/>
<om:procedure />
<om:observedProperty/>
<om:featureOfInterest>
<sa:SamplingPoint>
<sa:sampledFeature/>
<sa:position>
<gml:Point><gml:pos>330382947 5844444</gml:pos></gml:Point>
</sa:position>
</sa:SamplingPoint>
</om:featureOfInterest>
<om:result>688</om:result>
</om:Observation>
</om:member>
</om:ObservationCollection>
The parsed information is used by the process and passed to the INTAMAP system. The INTAMAP system checks the SRS and raises an exception if it is lat/long.
Note: This is the only mandatory input to the xval service; all the other inputs have default values.
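As an illustration of the parsing step, the following Python sketch reads the position and result of each om:Observation from a collection like the one above. It is not the service's own parser: the function name and the trimmed sample document are assumptions, and only om:Observation members are handled.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the example collection.
NS = {
    "gml": "http://www.opengis.net/gml",
    "om": "http://www.opengis.net/om/1.0",
    "sa": "http://www.opengis.net/sampling/1.0",
}

# Trimmed sample with a single member (a real request needs at least 20).
SAMPLE = """
<om:ObservationCollection xmlns:gml="http://www.opengis.net/gml"
    xmlns:om="http://www.opengis.net/om/1.0"
    xmlns:sa="http://www.opengis.net/sampling/1.0">
  <om:member>
    <om:Observation gml:id="test1">
      <om:featureOfInterest>
        <sa:SamplingPoint>
          <sa:position>
            <gml:Point><gml:pos>330382947 5844444</gml:pos></gml:Point>
          </sa:position>
        </sa:SamplingPoint>
      </om:featureOfInterest>
      <om:result>688</om:result>
    </om:Observation>
  </om:member>
</om:ObservationCollection>
"""


def read_observations(xml_text):
    """Return a list of (x, y, value) tuples, one per om:Observation."""
    root = ET.fromstring(xml_text)
    rows = []
    for obs in root.findall("om:member/om:Observation", NS):
        pos = obs.find(".//gml:Point/gml:pos", NS)
        result = obs.find("om:result", NS)
        x, y = (float(v) for v in pos.text.split())
        rows.append((x, y, float(result.text)))
    return rows


observations = read_observations(SAMPLE)
```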
Kfold
The Kfold determines how many training/validation subsets will be generated from the data and sent to the INTAMAP system. For a more extensive explanation see: Kfold
Normally Kfold is equal to 10; this is the default value used by the crossvalidation service if no value is given in the request.
To help with validation, the Kfold value has to be given as a percentage of the total number of observations, which limits its value to between 0 and 100%. The WPS service checks that the value is within this range and raises a WPS input exception if it is not.
For example, if a set contains 200 observations and a Kfold of 5 is needed, then the WPS Kfold input should be 2.5 (%), since Kfold% = (Kfold*100)/Size.
In the majority of cases the WPS default (an absolute Kfold of 10) or an input value of 100 (leave-one-out crossvalidation) will be used. The final crossvalidation report contains the absolute number of subsets generated.
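The conversion between an absolute Kfold and the percentage expected by the WPS input can be sketched as follows. The function names are hypothetical, and the rounding in kfold_absolute is an assumption; the service may resolve fractional values differently.

```python
def kfold_percent(absolute_k, n_obs):
    """Percentage to send as the WPS Kfold input for a wanted number of subsets."""
    if not 0 < absolute_k <= n_obs:
        raise ValueError("absolute_k must be between 1 and the number of observations")
    return absolute_k * 100.0 / n_obs


def kfold_absolute(percent, n_obs):
    """Number of subsets implied by a percentage (rounding is an assumption)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    return max(1, round(percent * n_obs / 100.0))


# The example from the text: 200 observations and 5 subsets.
example_percent = kfold_percent(5, 200)
# An input of 100 corresponds to leave-one-out crossvalidation.
loocv_subsets = kfold_absolute(100, 200)
```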
InterpolationServer
This input gives the location of the interpolation server and the path to the WPS service; the indicated server will be used to cross validate the submitted data set. The input is a literal string with the server location, the port (if it is not the standard HTTP port 80) and the path, for example:
http://intamap.geo.uu.nl:8180/intamap/WebProcessingService
http://intamap.aston.ac.uk:8080/intamap/WebProcessingService
http://gis-obama.uni-muenster.de:8180/intamap/WebProcessingService
http://remwps2.jrc.ec.europa.eu/intamap/WebProcessingService
If not specified, the Aston University server will be used as the default option.
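A client may want to check an InterpolationServer string before sending it. The sketch below uses Python's standard urllib.parse to split such a string into server location, port and path; the function name is hypothetical and the check is not part of the xval service.

```python
from urllib.parse import urlsplit


def describe_server(url):
    """Return (host, port, path) for an InterpolationServer string."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("expected an http(s) URL with a host name")
    # Fall back to the standard port when none is given in the string.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port, parts.path


aston = describe_server("http://intamap.aston.ac.uk:8080/intamap/WebProcessingService")
jrc = describe_server("http://remwps2.jrc.ec.europa.eu/intamap/WebProcessingService")
```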
ProcessName
Currently there is only one process name, but this could change in future releases.
org.intamap.wps.Interpolate
Therefore there is no need to change or even set this parameter
MethodName
INTAMAP currently supports the following methods:
- automatic
- automap
- psgp
- copula
- idw
The crossvalidation service needs to know which method to use; the default value is automatic.
MethodParameters
The INTAMAP system reports the model and parameters used in an interpolation request as a string output. This information can be used to run a crossvalidation request, where the service will use the supplied MethodParameters as the spatial model for the process.
In a normal situation the crossvalidation service starts by requesting a spatial model for the whole data set and then continues with the training/validation procedure.
<wps:Input>
<ows:Identifier>MethodParameters</ows:Identifier>
<wps:Data>
<wps:ComplexData>
\n vvmod = vgm(16.2655700411874,\"Ste\",1265.48940344574,anis =c(25.1113527463443,0.531507461849014)
,add.to = vgm(0.705683839910609,\"Nug\",0,anis =c(0,1) ) ) \n vvmod$kappa = c( 0,0.3 ) \n \n object$
variogramModel = vvmod \n QQ = matrix(c( 9.50897044633664e-05,4.49600704975795e-05,-3.01079810092796e-05 )
,ncol=3) \n colnames(QQ) = list(\"Q11\",\"Q22\",\"Q12\") \n anisPar = list(ratio = 1.88144113070622 ,
direction = 64.8886472536557 , Q= QQ,doRotation = TRUE )\n object$anisPar = anisPar \nclass(object) =
\"automap\"\n
</wps:ComplexData>
</wps:Data>
</wps:Input>
Please note that this example has been formatted for the wiki and will not work as shown.
The INTAMAP method report result should be copied EXACTLY; ideally a script should parse the INTAMAP response and copy the contents to the CV request.
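A script of that kind could build the MethodParameters input along the following lines. This is a sketch under stated assumptions: the WPS 1.0.0 and OWS 1.1 namespace URIs are assumed to be the ones behind the wps: and ows: prefixes, the function name is hypothetical, and the short report string stands in for a real INTAMAP report. The report is written as element text without modification, so XML escaping is left to the library.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the wps: and ows: prefixes used in the example.
WPS = "http://www.opengis.net/wps/1.0.0"
OWS = "http://www.opengis.net/ows/1.1"
ET.register_namespace("wps", WPS)
ET.register_namespace("ows", OWS)


def method_parameters_input(report):
    """Wrap an INTAMAP method report, unchanged, in a wps:Input element."""
    wps_input = ET.Element(f"{{{WPS}}}Input")
    ET.SubElement(wps_input, f"{{{OWS}}}Identifier").text = "MethodParameters"
    data = ET.SubElement(wps_input, f"{{{WPS}}}Data")
    # The report is stored as-is; ElementTree escapes &, < and > on output.
    ET.SubElement(data, f"{{{WPS}}}ComplexData").text = report
    return ET.tostring(wps_input, encoding="unicode")


# Placeholder report; a real one would be copied from the INTAMAP response.
fragment = method_parameters_input(r'\n vvmod = vgm(16.2,\"Ste\",1265.4) \n')
```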
MaxTime
The time needed to calculate the spatial model is a big bottleneck, especially in the case of copula. The copula method is far superior to automap or PSGP when the data contains a trend or hot spots, but in those cases it may take several hours. MaxTime is the maximum time that the crossvalidation service will wait for the INTAMAP system to calculate the copula before it raises an exception. This parameter is only used if the crossvalidation request indicates copula or automatic as the MethodName.
Unfortunately, the 52North WPS implementation has a bug: it doesn't update its status response when a WPS exception occurs. Therefore, if the copula crashes, the crossvalidation service will not be informed of the problem and will continue to check the status response for a result.
The default value is 10000 seconds, roughly 2 hours and 47 minutes (the input is in seconds). This value may be insufficient if the dataset is big or the data has abnormal hotspots.
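The role of MaxTime can be illustrated with a generic polling loop that gives up after a deadline. This is a sketch, not the service's implementation: wait_for_result and its poll callable are hypothetical, and the 30 second polling interval is an assumption.

```python
import time


def wait_for_result(poll, max_time=10000, interval=30,
                    clock=time.monotonic, sleep=time.sleep):
    """Call poll() until it returns a result or max_time seconds have passed.

    poll is a hypothetical callable that returns None while the status
    response contains no result yet.
    """
    deadline = clock() + max_time
    while True:
        result = poll()
        if result is not None:
            return result
        if clock() >= deadline:
            # A crashed process never reports a result, so the deadline is
            # the only way out of the loop.
            raise TimeoutError(f"no result after {max_time} seconds")
        sleep(interval)
```

The clock and sleep parameters exist only so that the loop can be exercised without real waiting.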
List of Outputs
There are 3 types of CV outputs:
- Statistical values
- CVResult / PosCV
- SVG graphics
Statistical values
The crossvalidation service calculates the following parameters (from the residuals):
- Mean Error (ME)
- Mean Absolute Error (MAE)
- Mean Square Error (MSE)
- Root Mean Square Error (RMSE)
These parameters are measures of error between the observed and estimated values.
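For reference, the four statistics can be computed from the residuals as in the Python sketch below. The function name is hypothetical, and the residual is assumed to be the observed value minus the estimated value.

```python
import math


def error_statistics(observed, estimated):
    """Return ME, MAE, MSE and RMSE of the residuals (observed - estimated)."""
    if len(observed) != len(estimated) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    residuals = [o - e for o, e in zip(observed, estimated)]
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    return {
        "ME": sum(residuals) / n,                    # sign shows over/underestimation
        "MAE": sum(abs(r) for r in residuals) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


stats = error_statistics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```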
CVResult / PosCV
CV Result
The CVResult is a WPS ComplexData output containing an UncertML structure that reports the observation, estimation, residual and zscore. The following is an UncertML example with 2 results.
<un:StatisticsArray xmlns:swe="http://www.opengis.net/swe/1.0" xmlns:un="http://www.uncertml.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.uncertml.org
http://schemas.uncertml.org/1.0.0/UncertML.xsd">
<un:elementType>
<un:StatisticsRecord>
<un:field>
<un:Statistic definition="http://dictionary.uncertml.org/statistics/observation"/>
</un:field>
<un:field>
<un:Statistic definition="http://dictionary.uncertml.org/statistics/mean"/>
</un:field>
<un:field>
<un:Statistic definition="http://dictionary.uncertml.org/statistics/residual"/>
</un:field>
<un:field>
<un:Statistic definition="http://dictionary.uncertml.org/statistics/zscore"/>
</un:field>
</un:StatisticsRecord>
</un:elementType>
<un:elementCount>2</un:elementCount>
<swe:encoding>
<swe:TextBlock decimalSeparator="." blockSeparator=" " tokenSeparator=","/>
</swe:encoding>
<swe:values>10340,843.774,496.226,17.0830906774 8.26,843.774,-17.774,-0.611888239833</swe:values>
</un:StatisticsArray>
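The swe:values text can be split into records using the separators declared in swe:TextBlock. The following Python sketch shows one way to do this; the function name and the field labels are assumptions taken from the order of the un:field elements above.

```python
def parse_text_block(values, block_separator=" ", token_separator=","):
    """Split a swe:values string into one dict per record."""
    fields = ("observation", "estimation", "residual", "zscore")
    records = []
    for block in values.split(block_separator):
        if not block:
            continue  # tolerate repeated separators
        tokens = [float(token) for token in block.split(token_separator)]
        if len(tokens) != len(fields):
            raise ValueError(f"expected {len(fields)} tokens, got {len(tokens)}")
        records.append(dict(zip(fields, tokens)))
    return records


# The values string from the example above.
records = parse_text_block(
    "10340,843.774,496.226,17.0830906774 8.26,843.774,-17.774,-0.611888239833"
)
```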
PosCV
The crossvalidation service generates the training/validation subsets by picking random observations from the dataset. Therefore the order of the data in the UncertML differs from that of the data set initially submitted. The PosCV XML result contains the spatial location of each block of the UncertML, as a GML MultiPoint.
<gml:MultiPoint>
<gml:pointMember>
<gml:Point>
<gml:pos>30381709 5845601</gml:pos>
</gml:Point>
</gml:pointMember>
<gml:pointMember>
<gml:Point>
<gml:pos>30381712 5845622</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
SVG graphics
The crossvalidation service uses R/Rpy to generate SVG graphics. The service generates:
- Spatial map of observations (SpatialObs)
- Spatial map of estimation (SpatialEst)
- Spatial map of residuals (SpatialRes)
- Bubble map of residuals (BubbleRes)
- Histogram of Z scores (HistZScores)
- Histogram of residual values (HistResiduals)
- SemiVariogram of Residuals (SVResiduals)
- Correlation between estimated and observed values (Correlation)
