Multivariate

Collection of tools for multivariate geostatistics.

Calculate Likelihood

pygeostat.multivariate.supersec.likelihood(x, y, pvars=None)

Reminiscent of the table generated by the CCG program corrmat when set to ‘prediction’ mode.

Calculates the likelihood weights (i.e., approximately the collocated cokriging weights) of the dependent variables. These weights are used by the CCG program supersec to generate super-secondary variables and are also displayed by the CCG program corrmat when it is set to ‘prediction’ mode.

The weight vector has an error variance \(\sigma^2_E\), based on the weights \(\lambda_i\) and correlation coefficient \(\rho_{i,j}\). All of these values are also returned.
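As a rough illustration of how the weights, error variance, and correlation coefficient relate, the sketch below solves the usual normal equations for standardized variables with NumPy. It is a minimal sketch under the assumption of a multivariate Gaussian model, not the pygeostat implementation; the helper name likelihood_weights and the example correlations are hypothetical.

```python
import numpy as np


def likelihood_weights(corr_sec, corr_prim_sec):
    """Hypothetical helper, not part of pygeostat.

    corr_sec      : (n, n) correlations between the secondary variables
    corr_prim_sec : (n,) correlations between the primary and each secondary
    """
    corr_sec = np.asarray(corr_sec, dtype=float)
    corr_prim_sec = np.asarray(corr_prim_sec, dtype=float)
    # Solve corr_sec @ weights = corr_prim_sec for the weights lambda_i
    weights = np.linalg.solve(corr_sec, corr_prim_sec)
    # Error variance of the weighted estimate of a standardized primary variable
    error_var = 1.0 - float(weights @ corr_prim_sec)
    # Correlation between the primary variable and the weighted combination
    rho = np.sqrt(1.0 - error_var)
    return weights, error_var, rho


# Made-up correlations for two secondary variables (illustration only)
corr_sec = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
corr_prim_sec = np.array([0.6, 0.4])
weights, error_var, rho = likelihood_weights(corr_sec, corr_prim_sec)
```

A larger rho (smaller error variance) indicates that the secondary variables jointly carry more information about the primary variable.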

See also

  1. Babak, O., & Deutsch, C. V. (2009). Collocated Cokriging Based on Merged Secondary Attributes. Mathematical Geosciences, 41(8), 921–926.
  2. Deutsch, M. V., & Deutsch, C. V. (2013). A Program to Calculate and Display Correlation Matrices with Relevant Analysis. CCG Annual Report 15 (Vol. 2013). Edmonton AB.
  3. Zanon, S., & Deutsch, C. V. (2003). Predicting Reservoir Performance with Multiple Geological, Geophysical, and Engineering Variables: Bayesian Updating Under a Multivariate Gaussian Model. CCG Annual Report. Edmonton AB.
Parameters:
  • x (pd.DataFrame or pd.Series) – Dataframe containing one or more primary variables (i.e., the dependent variables to predict)
  • y (pd.DataFrame) – Dataframe containing multiple secondary variables collocated with x (i.e., independent) used to predict x
  • pvars (list) – The variables to calculate likelihood weights for. If None (the default), weights are calculated for all of them
Returns:

Dataframe containing the weights, error variance, and calculated correlation coefficient for each dependent variable.

Return type:

data (pd.DataFrame)

Example

A simple call for one primary variable:

>>> import pygeostat as gs
>>> # Load a griddef and pass it to the gs.DataFile class that is used to contain the data
>>> gridstr = "2000 504925 50\n2000 936025 50\n1 0.0 1.0"
>>> griddef = gs.GridDef(gridstr=gridstr)
>>> data = gs.DataFile('primarydata.dat', griddef=griddef, dftype='point', x='X_N83',
...                    y='Y_N83')
>>> # Point to the exhaustive secondary data file and append the collocated data to the
>>> # primary DataFrame
>>> secdatfl = './secdata.dat'
>>> gs.getcollocated(data, secdatfl)
>>> # Load the secondary data into python
>>> secdat = gs.DataFile(secdatfl)
>>> # Calculate the likelihood weights, error, and correlation coefficient
>>> likelihood = gs.likelihood(data.data, secdat.data, 'NS_AG')

Code author: Warren E. Black - 2016-03-16

Calculate Super-Secondary Variable

pygeostat.multivariate.supersec.supersec(corrmat, pvar, secvars, secdata, tmin=-10, tmax=10, read_kws=None, return_data=False, outfl=None, **kwargs)

Generate a super-secondary variable for a single primary variable using many exhaustive secondary variables.

A correlation matrix calculated from the collocated primary and secondary data is required. It does not need to be sliced beforehand to contain only the single primary variable and the secondary variables being merged; it is sliced as necessary.
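The kind of slicing described here can be sketched with pandas. This is only an illustration of label-based selection from a full correlation matrix, not the library's internal code; the column names and random data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical collocated data: one primary and three secondary variables
rng = np.random.default_rng(0)
collocated = pd.DataFrame(rng.normal(size=(200, 4)),
                          columns=["Au", "Sec1", "Sec2", "Sec3"])

# Full correlation matrix over every variable in the collocated data
corrmat = collocated.corr()

pvar = "Au"
secvars = ["Sec1", "Sec3"]

# Secondary-to-secondary correlations, in the order given by secvars
rho_sec = corrmat.loc[secvars, secvars]
# Primary-to-secondary correlations, in the same order
rho_prim = corrmat.loc[secvars, pvar]
```

Because the selection is by label, the full matrix can hold additional variables (such as Sec2 above) without affecting the result.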

The parameter secdata can be one of the following types of input:

  1. A tidy long-form pd.DataFrame or np.array already with only the secondary variables as columns in the order specified by the parameter secvars.
  2. A file of any format that can be read by gs.readfile().
  3. A list of files of any format that can be read by gs.readfile(). In this case, each file is loaded and the required secondary variable is extracted.

The function gs.readfile() is used to read any files that are passed. All of its arguments other than flname can be passed using the parameter read_kws. For example:

>>> read_kws={'fltype': 'hdf5', 'hdf_key': 'Group1/Group2'}

To avoid memory leaks, the calculation of the super-secondary values is done in a subprocess. Unless the values are going to be used within Python later in the workflow, it is recommended that the array of values not be returned by the subprocess. In this case, the values are exported to a file and then released from memory.

To export the calculated super-secondary variable, the parameter outfl can be used. To allow more flexibility, the method gs.DataFile.writefile() is used for exporting. All of its arguments other than flname can be passed as additional keyword arguments. For example:

>>> gs.supersec(corrmat=corrmat, pvar=pvar, secvars=secvars, secdata=secdata,
...             outfl='output.h5', h5path='Group1/Group2')

The correlation coefficient rho can be returned without calculating the super-secondary values by leaving outfl and return_data at their default values.
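To make the relationship between rho and the merged values concrete, the following NumPy sketch follows the merged-secondary formulation of Babak and Deutsch (2009): the weights come from the correlation system, rho is the square root of the weighted sum of primary-secondary correlations, and the merged value is the weighted sum of the secondary values divided by rho. It assumes the secondary values are already standardized (e.g., normal scores), ignores tmin and tmax, and uses a hypothetical helper name; how pygeostat performs these steps internally is not shown in this documentation.

```python
import numpy as np


def merge_secondary(corr_sec, corr_prim_sec, sec_values):
    """Hypothetical helper, not part of pygeostat.

    corr_sec      : (n, n) correlations between the secondary variables
    corr_prim_sec : (n,) correlations between the primary and each secondary
    sec_values    : (m, n) standardized secondary values, one row per location
    """
    weights = np.linalg.solve(corr_sec, corr_prim_sec)
    # Correlation between the primary and the merged (super-secondary) variable
    rho_ss = np.sqrt(weights @ corr_prim_sec)
    # Dividing by rho_ss rescales the weighted sum back to unit variance
    supvals = (sec_values @ weights) / rho_ss
    return rho_ss, supvals


# Made-up inputs for illustration only
corr_sec = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
corr_prim_sec = np.array([0.6, 0.4])
sec_values = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [-1.0, 0.5]])
rho_ss, supvals = merge_secondary(corr_sec, corr_prim_sec, sec_values)
```

Only the correlations are needed to obtain rho_ss, which is why it can be reported without touching the exhaustive secondary data.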

Parameters:
  • corrmat (pd.DataFrame) – Correlation matrix that contains the collocated correlation of the variables specified by pvar and secvars
  • pvar (str) – The name of the primary variable to calculate a super-secondary variable for
  • secvars (list) – List of the secondary variables to merge into a single super-secondary variable
  • secdata – Please see above for a list of permissible input
Keyword Arguments:
 
  • tmin (float) – Minimum allowable super-secondary value
  • tmax (float) – Maximum allowable super-secondary value
  • read_kws (dict) – Optional keyword arguments to pass to gs.readfile()
  • return_data (bool) – Indicate if the calculated super-secondary values should be returned
  • outfl (str) – Output file name and location
  • kwargs – Optional permissible keyword arguments to pass to gs.DataFile.writefile()
Returns:

Correlation coefficient between the primary variable and the calculated super-secondary variable

Return type:

rho (float)

Returns:

Optional. Array of the calculated super-secondary values

Return type:

supvals (np.array)

Code author: Warren E. Black - 2016-05-27

Multidimensional Scaling (MDS)

pygeostat.multivariate.utils.mds(data, variables=None)

Python implementation of the MDS coordinates calculated by the CCG program corrmat when set to ‘ordination’ mode.

MDS coordinates are calculated from the correlation matrix.
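For readers unfamiliar with the technique, the sketch below shows classical (Torgerson) MDS applied to a correlation matrix with NumPy. The dissimilarity d_ij = sqrt(2(1 - rho_ij)) is an assumption chosen for illustration, and the helper name classical_mds is hypothetical; the exact distance definition and algorithm used by corrmat and by this function may differ.

```python
import numpy as np


def classical_mds(corr, ndim=3):
    """Hypothetical helper: classical MDS coordinates from a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    # Assumed squared dissimilarity between variables: d_ij^2 = 2 * (1 - rho_ij)
    dist_sq = 2.0 * (1.0 - corr)
    # Double-centre the squared distances to obtain a Gram matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ dist_sq @ centering
    # Keep the ndim largest eigenpairs (eigh returns them in ascending order)
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:ndim]
    # Guard against tiny negative eigenvalues caused by round-off
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Made-up data: five variables, the second partly driven by the first
rng = np.random.default_rng(42)
samples = rng.normal(size=(500, 5))
samples[:, 1] += samples[:, 0]
corr = np.corrcoef(samples, rowvar=False)
coords = classical_mds(corr)  # one row of 3-D coordinates per variable
```

Variables that are strongly positively correlated end up close together in the resulting coordinate space.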

See also

  1. Deutsch, M. V., & Deutsch, C. V. (2013). A Program to Calculate and Display Correlation Matrices with Relevant Analysis. Edmonton AB. Retrieved from http://www.ccgalberta.com
Parameters:
  • data – Tidy (long-form) 2-D data where each column is a variable and each row is an observation. A pandas dataframe or numpy array may be passed.
  • variables (list) – Variables from the pd.DataFrame passed with data to calculate coordinates for
Returns:

The 3-D MDS coordinates calculated

Return type:

coords (pd.DataFrame)

Code author: Warren E. Black - 2016-05-30