
Data interpolation

The usefulness of the MLR method is demonstrated using the bottle data from the World Ocean Circulation Experiment (WOCE) A9 section (Meteor cruise M15/3 at 19$^\circ$S) in the South Atlantic (see Figure 1). At each station at least one CTD profile was taken with a 24-bottle sampler. A second profile was taken at many of the deep stations, giving a maximum of 36 bottle depths per station. Most of the water samples were analyzed for oxygen and nutrients; other parameters, such as total dissolved inorganic carbon or freons, were analyzed only at every second or third station (see Figure 2 for the data distribution). We will use MLR to obtain all parameters at all bottle depths.

Figure 1: The geographical position of the hydrographic stations. Crosses are Meteor 15/3 data (WHP-A09), circles are SAVE data.

Figure 2: Spatial distribution of bottle samples. Each cross represents a closed bottle; bottles with total carbon measurements are additionally marked with a circle. Topography is along the 19$^\circ$S section.

To estimate the accuracy of the interpolation, we interpolate to bottles for which measurements are available. In this case the actual measurement is not used for the interpolation (i.e. it is assumed missing), and the interpolated value is then compared with the measured value. The interpolation errors are given relative to an assumed measurement error, e.g. for oxygen the greater of 1 $\mu$mol/kg and 1% of the measured value (see Table 1). Several parameter combinations and proximities were used for the MLR. As a reference for the error estimate, we take a linear pressure interpolation: at each station the data just above and just below the point to be interpolated are used for a linear interpolation. Above the uppermost or below the lowermost data point, the interpolated value is taken from the nearest pressure.
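The reference procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names are hypothetical, a station is assumed to be given as two plain lists sorted by increasing pressure with distinct pressures, and the error is expressed in multiples of the assumed measurement error.

```python
from bisect import bisect_left


def assumed_error(value, minimum, percent):
    """Assumed measurement error: the greater of an absolute minimum
    and a percentage of the measured value (cf. Table 1)."""
    return max(minimum, abs(value) * percent / 100.0)


def linear_pressure_interp(pressures, values, p):
    """Linear interpolation in pressure. Outside the data range the value
    at the nearest pressure is used. `pressures` must be sorted ascending."""
    if p <= pressures[0]:
        return values[0]
    if p >= pressures[-1]:
        return values[-1]
    i = bisect_left(pressures, p)      # pressures[i-1] < p <= pressures[i]
    p0, p1 = pressures[i - 1], pressures[i]
    w = (p - p0) / (p1 - p0)
    return values[i - 1] + w * (values[i] - values[i - 1])


def leave_one_out_errors(pressures, values, minimum, percent):
    """Withhold each bottle in turn, interpolate it from the remaining
    bottles of the station, and return the errors scaled by the assumed
    measurement error."""
    scaled = []
    for k in range(len(values)):
        ps = pressures[:k] + pressures[k + 1:]
        vs = values[:k] + values[k + 1:]
        estimate = linear_pressure_interp(ps, vs, pressures[k])
        err = assumed_error(values[k], minimum, percent)
        scaled.append((estimate - values[k]) / err)
    return scaled
```

For a profile that is exactly linear in pressure, the interior bottles are recovered with zero error, while the withheld top and bottom bottles show the cost of taking the value from the nearest pressure.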

Table 1: Parameters used in the MLR with the minimum error (absolute value) and the relative error (as % of the measured value). Given also are the number of measured data points for each parameter in the data set.
Parameter   Minimum error   Percentage error   No. of measurements
O$_2$       1.0             1                  3603
NO$_3$      0.5             1                  3497
PO$_4$      0.1             1                  3175
SiO$_4$     1.0             1                  3566
T-C         2.0             0                  754
F-11        0.02            1                  88
F-12        0.01            1                  1384

The proximity, i.e. the region surrounding the value to be interpolated from which data for the interpolation are taken, can be defined as a geographical region, e.g. a box extending 2$^\circ$ of latitude by 2$^\circ$ of longitude in the horizontal and 150 dbar in the vertical at the surface. (For a zonal section as used here, the latitude restriction has no influence.) Because water mass characteristics are in general more homogeneous at greater depths and the bottle density decreases with depth, the vertical extent can be increased with increasing pressure. With an additional 20% of the pressure, an interpolation for a bottle at 1000 dbar would therefore have a vertical extent of 350 dbar (150 + 0.2*1000). For individual hydrographic sections it is also possible to define the horizontal extent using the station number instead of the geographic location. Because stations are normally spaced more closely in regions of strong gradients (for example the western boundary region compared to the open ocean), such an approach takes into account previous oceanographic knowledge of the parameter distribution. The vertical extent also does not have to be defined using pressure; a definition using density incorporates the assumption that mixing occurs predominantly along density surfaces. Other definitions of proximity, even with more dimensions, are possible. The definitions used here are given in Table 2. The only requirement is that the proximity be large enough to include sufficient data to carry out the MLR.
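As an illustration of the pressure-dependent proximity, the sketch below computes the vertical extent and tests whether a sample falls inside the box around a target bottle. It assumes that the extents are half-widths (the $\pm$ value of Table 2) and that each bottle is a dictionary with the hypothetical keys `lon` and `pres`; neither the names nor the data layout come from the original work.

```python
def vertical_extent(pressure, base=150.0, fraction=0.20):
    """Vertical extent of the proximity in dbar: a fixed base value plus a
    fraction of the bottle pressure (150 dbar + 20% in the text's example)."""
    return base + fraction * pressure


def in_proximity(target, sample, dlon=2.0, base=150.0, fraction=0.20):
    """True if `sample` lies within +/- dlon degrees of longitude and within
    +/- the pressure-dependent vertical extent of `target`.
    Assumption: both arguments are dicts with keys 'lon' and 'pres'."""
    dp = vertical_extent(target["pres"], base, fraction)
    return (abs(sample["lon"] - target["lon"]) <= dlon
            and abs(sample["pres"] - target["pres"]) <= dp)
```

With the defaults, `vertical_extent(1000.0)` reproduces the 350 dbar of the example above. A proximity based on station number or density would replace the corresponding comparison.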

Table 2: Letters denoting the local region ($\pm$ value) for the MLR. Dots denote that no restriction was placed on this parameter. The last column gives the multiplicative factor used to enlarge the region if not enough data points were found, and the maximum factor for this enlargement.
Case   Longitude ($^\circ$)   Station   Pressure (dbar)   Factor/max
a      ...                    8         ...               0/1
b      ...                    8         250+20%           0/1
c      ...                    8         150+15%           0/1
d      ...                    ...       150+15%           0/1
e      ...                    3         150+15%           1.41/8
f      2                      ...       100+20%           1.41/8

As the proximity goes to zero, so does the error of the interpolation (assuming a smooth field). For a given data density the linear interpolation has the smallest possible proximity (2 data points, or 1 in the case of extrapolation). The MLR uses more parameters and therefore needs more data points, and hence a larger proximity in which to find them. But because the MLR makes use of more information, the larger proximity does not imply a larger error.

A larger proximity also has advantages: measurement uncertainties or even plainly wrong input values have a smaller influence in a larger ensemble of data points than in a smaller one. A drawback is that, with a larger proximity, a wrong data point influences more interpolated values. Another advantage of larger proximities is that larger data gaps can be interpolated. Using linear interpolation, the 754 measured total dissolved inorganic carbon (T-C) values can be used to estimate a total of 1065 values, while with a larger proximity and MLR the values for all bottles (N=3743) can be calculated. To be able to interpolate over large data gaps while in general still using small proximities, the proximity can be enlarged if there are not enough data points in the original proximity.
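The enlargement of the proximity can be sketched as below for a purely vertical window. The interpretation of the "factor/max" column of Table 2 is an assumption here: the window is multiplied repeatedly by the factor (1.41) as long as the cumulative scale does not exceed the maximum (8). The function names and the one-dimensional setting are illustrative only.

```python
def neighbours(pressures, p, half_width):
    """Indices of all samples within +/- half_width dbar of pressure p."""
    return [i for i, q in enumerate(pressures) if abs(q - p) <= half_width]


def enlarge_until_enough(pressures, p, half_width, min_points,
                         factor=1.41, max_scale=8.0):
    """Grow the window by `factor` until it contains `min_points` samples
    or the cumulative scale would exceed `max_scale`.
    Returns (indices, scale); indices is None if too few data were found."""
    scale = 1.0
    while True:
        idx = neighbours(pressures, p, half_width * scale)
        if len(idx) >= min_points:
            return idx, scale
        if scale * factor > max_scale:
            return None, scale
        scale *= factor
```

Because the window only grows when necessary, densely sampled regions keep the small original proximity and only data gaps are bridged with a larger one.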

The interpolation was carried out using different parameter combinations (see Table 3). Each parameter list gives the maximum set of parameters to be used. If a parameter is missing for a certain bottle, it cannot be used even though it appears in the list, and the MLR is carried out without it.
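A minimal sketch of this fallback, using NumPy's least-squares solver, is given below. It assumes a regression with an intercept and marks missing values with NaN; the function name `mlr_interpolate` and the array layout are hypothetical and are not taken from the original implementation.

```python
import numpy as np


def mlr_interpolate(neigh_X, neigh_y, target_x):
    """Estimate a parameter at the target bottle by multiple linear regression.

    neigh_X  : (n, k) predictor values of the bottles in the proximity
    neigh_y  : (n,)   measured values of the parameter to be interpolated
    target_x : (k,)   predictor values at the target bottle
    NaN marks a missing value. Returns NaN if too few complete bottles remain.
    """
    neigh_X = np.asarray(neigh_X, dtype=float)
    neigh_y = np.asarray(neigh_y, dtype=float)
    target_x = np.asarray(target_x, dtype=float)

    # Drop predictors that are missing at the target bottle.
    use = ~np.isnan(target_x)
    X = neigh_X[:, use]

    # Keep only neighbours with all remaining predictors and the predictand.
    ok = ~np.isnan(X).any(axis=1) & ~np.isnan(neigh_y)
    X, y = X[ok], neigh_y[ok]

    n_coef = int(use.sum()) + 1          # slopes plus intercept
    if len(y) < n_coef:
        return np.nan

    A = np.column_stack([np.ones(len(y)), X])      # design matrix
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0] + coef[1:] @ target_x[use]
```

Calling the function with a NaN in `target_x` simply fits the smaller model, which mirrors the statement that the MLR is made without the missing parameter.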

Table 3: Different combinations of parameters which are used (if possible) in the MLR. (P = pressure, T = temperature, NDEN = neutral density)
CTD-data Bottle-data
Case longitude P T S $\theta$ NDEN O$_2$ AOU NO$_3$ PO$_4$ SiO$_4$ T-C
lin X
a X X X
b X X X X
c X X X X
d X X X X
e X X X X X
f X X X
g X X X X
h X X X X X
1 X X X X X X X X
4 X X X X X X X
5 X X X X X X X
6 X X X X X X X X

Using more parameters can increase the accuracy of the MLR interpolation, but several things have to be considered:

Table 4: Standard deviation of the errors in various parameters for fits with different proximities (see Table 2). Indices denote the parameters used as given in Table 3.
Parameter   lin    $a_a$   $f_a$   $a_e$   $e_e$   $a_1$   $b_1$
O$_2$       5.13   17.73   6.56    0.33    0.06    0.66    0.40
NO$_3$      2.03   6.73    2.66    2.29    1.53    2.00    1.51
PO$_4$      0.65   2.12    0.92    0.97    0.67    0.75    0.64
SiO$_4$     1.71   8.02    1.21    4.62    0.97    4.52    1.03
T-C         5.77   9.92    3.94    2.90    2.02    1.71    1.33
F-11        2.56   2.25    1.80    2.41    1.97    1.37    1.05
F-12        6.26   12.03   4.06    5.01    3.20    4.29    3.12

Figure 3: Standard deviation of the errors in the fit of nitrate, total inorganic carbon, silica and Freon-12 (in multiples of the assumed error from Table 1). Different symbols represent different proximities (Table 2); the x-axis represents the parameters used (Table 3). A dotted line shows the linear interpolation value.

The results (Table 4 and Figure 3) show that placing no restriction on the vertical proximity (case a) or on the horizontal proximity (case d) generally gives larger errors than the linear interpolation for the high-density data (oxygen and nutrients). We could not determine whether the better results for parameters with lower data density, such as T-C, arise because the linear interpolation performs worse there or because T-C is better described by the MLR than the nutrients are.

As expected, the interpolation generally gives smaller errors for more local (smaller) proximities. Using more parameters also improves the interpolation. The importance of a parameter varies with the parameter to be interpolated: oxygen is quite important for the interpolation of nutrients and T-C, but has less influence on silica.

Figure 4: Profiles of the errors in the fit of nitrate and total inorganic carbon (in multiples of the assumed error).

Figure 4 shows the errors of some interpolations as a function of pressure. For all interpolations the errors in the upper several hundred meters are much larger than at greater depths. This is not unexpected, given the generally higher temporal and spatial variability near the surface. There is a possible positive side effect associated with these higher interpolation errors in surface waters. Some part of this higher natural variability is, in the context of a temporal and spatial mean field, random noise. For example, two data points with the same temperature may have different T-C content due to local over- or undersaturation: one water parcel has experienced fast warming, the other cooling. After some time the two water parcels will again reach equilibrium with the atmosphere. Alternatively, the two data points may effectively be the same water parcel, sampled at different times. Assuming that the other parameters stayed unchanged, the MLR effectively takes the mean of the measurements as the interpolated value. That is, there is a certain amount of smoothing associated with the MLR, and the interpolated values correspond more to the mean field than to the actual single measurements.

Greater care must be taken in the data quality evaluation (DQE) when using MLR. With a pressure interpolation, only the parameter to be interpolated and pressure have to be consistent, but in the MLR an error in just one of the parameters used leads to an error in the interpolated value. Good data quality control is therefore essential, but MLR can also help in the DQE. The parameters measured with the CTD (pressure, temperature, salinity, oxygen and inferred parameters such as density) are in essence self-consistent. An MLR using just these parameters to interpolate another parameter, for example silica, gives high deviations for silica data that are inconsistent with the other silica data (for example due to a measurement error) or with the CTD data (for example due to a wrong bottle closing depth). In essence this corresponds to the evaluation of property-property plots. If we use only high-quality historical data to infer the coefficients of the MLR and then apply them to the actual data, we can also infer measurement offsets or even temporal changes, for example in anthropogenic tracers (see below).
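The use of MLR as a DQE aid could look roughly like the following sketch. It fits a bottle parameter against CTD parameters and flags samples whose residual exceeds a multiple of the assumed measurement error. The threshold of three assumed errors, the single global fit, and the function name are illustrative assumptions, not the procedure of the paper.

```python
import numpy as np


def flag_inconsistent(ctd, bottle, assumed_err, threshold=3.0):
    """Fit bottle = intercept + linear combination of the CTD columns and
    flag samples whose residual exceeds threshold * assumed_err.

    ctd         : (n,) or (n, k) CTD-derived predictors
    bottle      : (n,) bottle parameter to be checked (e.g. silica)
    assumed_err : scalar or (n,) assumed measurement error
    Returns (flags, residuals).
    """
    ctd = np.asarray(ctd, dtype=float)
    bottle = np.asarray(bottle, dtype=float)
    A = np.column_stack([np.ones(len(bottle)), ctd])
    coef, *_ = np.linalg.lstsq(A, bottle, rcond=None)
    resid = bottle - A @ coef
    return np.abs(resid) > threshold * assumed_err, resid
```

Note that a suspect sample also takes part in the fit and pulls the regression toward itself, so its residual is somewhat underestimated. Withholding each sample in turn, as in the error estimate described earlier, would avoid this at higher cost.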

Using just the parameters available from the CTD, it is also possible to interpolate bottle parameters such as nutrients and T-C onto the high-resolution CTD data. This can be important when calculating transports, as the strongest currents are generally in the core of a water mass, which is characterized by extrema in water mass characteristics.

Juergen Holfort 2004-08-12