Data Files

The core class in pygeostat is the DataFile class, which contains a pandas DataFrame with the data values and column names, in addition to metadata such as the names of the x, y and z coordinate columns or the grid definition.

DataFile Class

class pygeostat.data.data.DataFile(flname=None, readfl=None, fltype=None, dftype=None, data=None, columns=None, null=None, title='data', griddef=None, dh=None, x=None, y=None, z=None, ifrom=None, ito=None, weights=None, cat=None, catdict=None, variables=None, notvariables=None, delimiter='\s+', headeronly=False, h5path=None, h5datasets=None, nreals=-1, tmin=None)

This class stores geostatistical data values and metadata.

DataFile objects may be created directly on initialization, or generated by pygeostat functions. This is the primary class for pygeostat and is used for reading and writing GSLIB, CSV, VTK, and HDF5 file formats.

Parameters
  • flname (str) – Path (or name) of file to read

  • readfl (bool) – True if the data file should be read on class initialization

  • fltype (str) – Type of data file: either csv, gslib, hdf5 or gsb

  • dftype (str) – Data file type as either ‘point’ or ‘grid’ used for writing out VTK files for visualization

  • data (pandas.DataFrame) – Pandas dataframe containing array of data values

  • dicts (List[dict] or dict) – List of dictionaries or dictionary for converting alphanumeric to numeric data

  • null (float) – Null value for missing values in the data file

  • title (str) – Title, or name, of the data file

  • griddef (pygeostat.GridDef) – Grid definition for a gridded data file

  • dh (str) – Name of drill hole variable

  • x (str) – Name of X coordinate column

  • y (str) – Name of Y coordinate column

  • z (str) – Name of Z coordinate column

  • ifrom (str) – Name of ‘from’ columns

  • ito (str) – Name of ‘to’ columns

  • weights (str or list) – Name of declustering weight column(s)

  • cat (str) – Name of categorical (e.g., rock type or facies) column

  • catdict (dict) – Set a dictionary for the categories, which should be formatted as: catdict = {catcode:catname}

  • variables (str or list) – Name of continuous variable(s), which if unspecified, are the columns not assigned to the above attributes (via kwargs or inference)

  • notvariables (str or list) – Name of column(s) to exclude from variables

  • delimiter (str) – Delimiter used in data file (ie: comma or space)

  • headeronly (bool) – If True, only the header and the first line of the data file are read, which is useful for getting the column numbers of large files. For HDF5 files, only the HDF5 store information is read.

  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to read the dataset(s) specified by the argument datasets from. The dataset name cannot be passed using this argument, it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().

  • h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.

  • columns (list) – List of column labels to use for the resulting data pd.DataFrame

  • nreals (int) – number of realizations to read in. -1 will read all

  • tmin (float) – If a number is provided, values less than this number (e.g., trimmed or null values) are converted to NaN. May be useful since NaNs are more easily handled within Python, matplotlib and pandas. Set to None to disable.
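
The tmin behaviour can be reproduced with plain pandas; the following is a minimal sketch (the `apply_tmin` helper and the example column are hypothetical, not part of pygeostat):

```python
import numpy as np
import pandas as pd

def apply_tmin(df, tmin):
    """Convert values below tmin (e.g., GSLIB trimming values such as
    -999) to NaN, which pandas and matplotlib handle natively."""
    return df.mask(df < tmin)

data = pd.DataFrame({'Bitumen': [12.5, -999.0, 8.3]})
trimmed = apply_tmin(data, tmin=-998.0)
```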

Examples

Quickly reading in a GeoEAS data file:

data_file = gs.DataFile(flname='../data/oilsands.dat')

To read in a GeoEAS data file and assign attributes:

# Point Data Example
data_file = gs.DataFile(flname='../data/oilsands.dat', readfl=True, dh='Drillhole Number', x='East', y='North', z='Elevation')
# Gridded Data Example
griddef = gs.GridDef('''10 0.5 1
10 0.5 1
10 0.5 1''')
data_file = gs.DataFile(flname='../data/3DDecor.dat', griddef=griddef)

# To view grid definition string
print(data_file.griddef)
# Access some grid definition attributes
data_file.griddef.count() # returns number of blocks in grid
data_file.griddef.extents() # returns an array of the extents for all directions
data_file.griddef.nx # returns number of blocks in x direction

HDF5

The HDF5 file format has several advantages. It reads and writes much faster than the ASCII formats, attributes (such as the grid definition) can be saved within the file, and all data for a single project can be stored in one file. Please refer to the introduction on HDF5 files for more information.

This class currently only searches for and loads a grid definition.

Examples

HDF5 file simple read example:

data_file = gs.DataFile(flname='../data/oilsands_out.hdf5')

To view the HDF5 header information (tables stored in the file):

data_file.store

If you have an HDF5 file with multiple tables and only want to read the file information, to see what tables and attributes are saved in the file, you can do a header-only read:

data_file = gs.DataFile(flname='../data/oilsands_out.hdf5', dftype='hdf5', headeronly=True)

Then to see what tables are written in the hdf5 file:

data_file.store

DataFile Attributes

Attributes of a datafile object are accessed with datafile.<attribute>.

Columns

Access the columns of the datafile. Wrapper for datafile.data.columns.

Num Variables

Access the number of variables (nvar) of the datafile, i.e., len(datafile.variables).

Locations

Access the locations stored in the datafile. Wrapper for datafile[datafile.xyz].

Example:

>>> datafile = gs.DataFile("somefile.out")  # this file has an x, y[, z] attribute that is found
>>> datafile.locations
... dataframe of x, y, z locations

Shape

Access the shape of the data stored in the datafile. Wrapper for datafile.data.shape

Example:

>>> datafile = gs.DataFile("somefile.out")
>>> datafile.shape
... shape of datafile.data

Rename Columns

DataFile.rename(columns)

Applies a dictionary to alter the column names of self.data. This applies the DataFrame.rename function, but also updates any special attributes (dh, x, y, etc.) with the new name if they were previously set to the old name. Users should consider using the self.columns property if changing all column names.

Parameters

columns (dict) – formatted as {oldname1: newname1, oldname2:newname2}, etc, where the old and new names are strings. The old names must be present in data.columns.
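
The attribute-aware renaming can be sketched with plain pandas; `rename_with_attrs` is a hypothetical helper standing in for the DataFile method, not pygeostat's implementation:

```python
import pandas as pd

def rename_with_attrs(df, attrs, columns):
    """Rename DataFrame columns and update any special-attribute
    mapping (dh, x, y, ...) that pointed at an old name."""
    df = df.rename(columns=columns)
    attrs = {attr: columns.get(name, name) for attr, name in attrs.items()}
    return df, attrs

df = pd.DataFrame({'East': [0.0], 'North': [1.0]})
attrs = {'x': 'East', 'y': 'North'}
df, attrs = rename_with_attrs(df, attrs, {'East': 'Easting'})
```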

Drop Columns

DataFile.drop(columns)

This applies the DataFrame.drop function, where axis=1, inplace=True and columns is used in place of the labels. It also updates any special attributes (dh, x, y, etc.), setting them to None if dropped. Similarly, if any variables are dropped, they are removed from self.variables.

Parameters

columns (str or list) – column names to drop

Check for Duplicate Columns

DataFile.check_for_duplicate_cols()

Run a quick check on the column names to see if any of them are duplicated. If any are, print a warning and rename the duplicated columns.
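
The check-and-rename logic can be illustrated with the standard library; both helper names are hypothetical, and the exact suffix scheme pygeostat uses may differ:

```python
from collections import Counter

def find_duplicate_cols(columns):
    """Return column names that appear more than once."""
    counts = Counter(columns)
    return [name for name, n in counts.items() if n > 1]

def dedupe_cols(columns):
    """Rename duplicates by appending a numeric suffix."""
    seen = Counter()
    renamed = []
    for name in columns:
        renamed.append(f"{name}_{seen[name]}" if seen[name] else name)
        seen[name] += 1
    return renamed
```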

Set Columns

DataFile.setcol(colattr, colname=None)

Set a specialized column attribute (dh, ifrom, ito, x, y, z, cat or weights) for the DataFile, where DataFile.data must be initialized. If colname is None, the attribute is set if a common name for it is detected in DataFile.data.columns (e.g., if colattr='dh', colname=None, and 'DHID' is found in DataFile.data, then DataFile.dh='DHID'). The attribute will be None if none of the common names are detected. If colname is not None, the provided string is assigned to the attribute, e.g., DataFile.colattr=colname. Note, however, that an error is raised if colname is not None and colname is not in DataFile.data.columns. This is used on DataFile initialization, but may also be useful to call after specialized columns are altered.

Parameters
  • colattr (str) – must match one of: 'dh', 'ifrom', 'ito', 'x', 'y', 'z', 'cat' or 'weights'

  • colname (str or list) – if not None, must be the name(s) of a column in DataFile.data. List is only valid if colattr=weights

Examples

Set the x attribute (dat.x) based on a specified value:

>>> dat.setcol('x', 'Easting')

Set the x attribute (dat.x), where the function checks common names for x:

>>> dat.setcol('x')

Set Variable Columns

DataFile.setvarcols(variables=None, notvariables=None)

Set the variables for the DataFile. If provided, the function checks that the variables are present in the DataFrame. If not provided, the function assigns as variables the columns that are not specialized columns (dh, x, y, z, rt, weights) and not in the user-specified list of notvariables.

This is used on DataFile initialization, but may also be useful for calling after variables are added or removed.

Parameters
  • variables (list or str) – list of strings

  • notvariables (list or str) – list of strings

Examples

Set the variables based on a specified list:

>>> dat.setvarcols(variables=['Au', 'Carbon'])

Set the variables based on the function excluding specialized columns (dh, x, y, etc.):

>>> dat.setvarcols()

Set the variables based on the function excluding specialized columns (dh, x, y, etc.), as well as a user specified list of what is not a variable:

>>> dat.setvarcols(notvariables=['Data Spacing', 'Keyout'])

Set Categorical Dictionary

DataFile.setcatdict(catdict)

Set a dictionary for the categories, which should be formatted as:

>>> catdict = {catcode:catname}

Example

>>> catdict = {0: "Mudstone", 1: "Sandstone"}
>>> self.setcatdict(catdict)

Check DataFile

DataFile.check_datafile(flname, variables, sep, fltype)

Run some quick checks on the DataFile before writing and grab info if not provided

Add Coord

DataFile.addcoord()

Only use on DataFile classes containing GSLIB style gridded data.

If x, y, or z coordinate column(s) do not exist, they are created. If the created or current columns contain only null values, they are populated based on the GridDef class passed to the DataFile class.

Note

A griddef must be assigned to the DataFile class, either on initialization:

>>> data_file = gs.DataFile(flname='test.out', griddef=grid)

Or manually assigned later:

>>> data_file.griddef = gs.GridDef(gridstr=my_grid_str)

Apply Dictionary

DataFile.applydict(origvar, outvar, mydict)

Applies a dictionary to the original variable to get a new variable.

This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.

Parameters
  • origvar (str) – Name of original variable.

  • outvar (str) – Name of output variable.

  • mydict (dict) – Dictionary of values to apply.

Examples

>>> data_file.applydict('Drillhole', 'Drillhole-mod', mydict)
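
The underlying operation is equivalent to a pandas `map`; a runnable sketch with hypothetical IDs and dictionary:

```python
import pandas as pd

# A hypothetical drill hole column with alphanumeric IDs
df = pd.DataFrame({'Drillhole': ['DH-A', 'DH-B', 'DH-A']})
mydict = {'DH-A': 1, 'DH-B': 2}
# applydict conceptually maps the original column through the dictionary
df['Drillhole-mod'] = df['Drillhole'].map(mydict)
```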

Describe DataFile

DataFile.describe(variables=None)

Describe a data set using pandas describe(), but exclude special variables.

Keyword Arguments

variables (List(str)) – List of variables to describe.

Returns

Pandas description of variables.

Return type

self.data[variables].describe()

Examples

Describe all non-special variables in the DataFrame (columns set as the drill hole ID, coordinate columns, etc. are excluded):

>>> data_file.describe()

Or describe specific variables

>>> data_file.describe(['Bitumen', 'Fines'])

Infer Grid Definition

DataFile.infergriddef(blksize=None, databuffer=5, nblk=None)

Infer a grid definition with the specified dimensions to cover the set of data values. The function operates with two primary options:

  1. Provide a block size (node spacing), the function infers the required number of blocks (grid nodes) to cover the data

  2. Provide the number of blocks, the function infers the required block size

A data buffer may be used to expand the grid beyond the data extents. Basic integer rounding is also used to attempt a 'nice' grid in terms of origin alignment.

Parameters
  • blksize (float or 3-tuple) – provides (xsiz, ysiz, zsiz). If blksize is not None, nblk must be None. Set zsiz None if the grid is 2-D. A float may also be provided, where xsiz = ysiz = zsiz = float is assumed.

  • databuffer (float or 3-tuple) – buffer between the data and the edge of the model, optionally for each direction

  • nblk (int or 3-tuple) – provides (nx, ny, nz). If blksize is not None, nblk must be None. Set nz to None or 1 if the grid is 2-D. An int may also be provided, where nx = ny = nz = int is assumed.

Returns

this function returns the grid definition object and also assigns it to the griddef of the current gs.DataFile

Return type

griddef (GridDef)

Note

this function assumes the grid is either 3-D, or 2-D in the xy plane. If nx == 1 or ny == 1, nonsense will result!
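
The arithmetic of option 1 can be sketched for a single axis; `infer_grid_1d` is a hypothetical, simplified helper (pygeostat's actual 'nice' origin rounding may differ):

```python
import math

def infer_grid_1d(dmin, dmax, size, buffer=5.0):
    """Infer the number of blocks and the block-centre origin along one
    axis, given the data extents, a block size and a buffer."""
    nblk = math.ceil((dmax - dmin + 2.0 * buffer) / size)
    origin = dmin - buffer + size / 2.0
    return nblk, origin

nblk, origin = infer_grid_1d(0.0, 95.0, size=10.0, buffer=5.0)
```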

Usage:

First, import a datafile using gs.DataFile(), making sure to assign the correct columns to x, y and z:

>>> datfl = gs.DataFile('test.dat',x='x',y='y',z='z')

Now create the griddef from the data contained within the dataframe:

>>> blksize = (100, 50, 1)
>>> databuffer = (10, 25, 0) # buffer in the x, y and z directions
>>> griddef = datfl.infergriddef(blksize, databuffer)

Check by printing out the resulting griddef:

>>> print(griddef)

Examples

For 3D data, infergriddef() returns a 3D grid definition even if zsiz is given as None, 0 or 1:

df3d = gs.ExampleData("point3d_ind_mv")
a = df3d.infergriddef(blksize=[50, 60, 1])
b = df3d.infergriddef(blksize=[50, 60, None])
c = df3d.infergriddef(blksize=[50, 60, 0])
# a, b, c are returned as pygeostat GridDef:
#     20 135.0 50.0
#     19 1230.0 60.0
#     82 310.5 1.0

For 3D data, nz given as None, 0 or 1 returns a 2D grid that covers the vertical extent of the 3D data:

d = df3d.infergriddef(nblk=[50, 60, 1])
e = df3d.infergriddef(nblk=[50, 60, None])
f = df3d.infergriddef(nblk=[50, 60, 0])
# d, e, f are returned as pygeostat GridDef:
#     50 119.8 19.6
#     60 1209.1 18.2
#     1 350.85 81.7

Where xsiz = ysiz = zsiz, a float may be provided instead of a 3-tuple; likewise where nx = ny = nz, an int may be provided:

df3d.infergriddef(blksize=75)
df3d.infergriddef(blksize=[75, 75, 75])  # returns the same as the line above

df3d.infergriddef(nblk=60)
df3d.infergriddef(nblk=[60, 60, 60])  # returns the same as the line above

If the data is 2-D, zsiz or nz must be provided as None; otherwise an exception is raised:

df2d = gs.ExampleData("point2d_ind")
df2d.infergriddef(nblk=[60, 60, None])
df2d.infergriddef(blksize=[50, 60, None])

File Name String

DataFile.__str__()

Return the name of the data file when the DataFile is printed or used in a string.

Generate Dictionary

DataFile.gendict(var, outvar=None)

Generates a dictionary with unique IDs from alphanumeric IDs. This is particularly useful for alphanumeric drill hole IDs which cannot be used in GSLIB software.

Parameters

var (str) – Variable to generate a dictionary for

Keyword Arguments

outvar (str) – Variable to generate using generated dictionary.

Returns

Dictionary of alphanumerics to numeric ids.

Return type

newdict (dict)

Examples

A simple call

>>> data_file.gendict('Drillhole')

OR

>>> dh_dict = data_file.gendict('Drillhole')
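
The mapping gendict produces can be sketched with the standard library; the helper below and its ordering (1-indexed, by first appearance) are illustrative assumptions:

```python
def gendict(values):
    """Build {alphanumeric id: numeric id} in order of first appearance,
    a sketch of the mapping gendict produces."""
    newdict = {}
    for value in values:
        if value not in newdict:
            newdict[value] = len(newdict) + 1
    return newdict

dh_dict = gendict(['DH-A', 'DH-B', 'DH-A', 'DH-C'])
```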

GSLIB Column

DataFile.gscol(variables, string=True)

Returns the GSLIB (1-ordered) column given a (list of) variable(s).

Parameters

variables (str or List(str)) – Path, or name, of the data file.

Keyword Arguments

string (bool) – If True returns the columns as a string.

Returns

GSLIB 1-ordered column(s).

Return type

cols (int or List(int) or string)

Note

None input returns a 0, which may be necessary, for example, with 2-D data:

>>> data.xyz
... ['East', 'North', None]
>>> data.gscol(data.xyz)
... '2 3 0'

Examples

Some simple calls

>>> data_file.gscol('Bitumen')
... 5
>>> data_file.gscol(['Bitumen', 'Fines'])
... [5, 6]
>>> data_file.gscol(['Bitumen', 'Fines'], string=True)
... '5 6'
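
The 1-ordered column lookup is simple to sketch in plain Python; this standalone `gscol` and the column list are hypothetical stand-ins for the DataFile method:

```python
def gscol(columns, variables, string=False):
    """Map variable name(s) to 1-ordered GSLIB column numbers;
    None maps to 0 (e.g., the missing z column of 2-D data)."""
    single = isinstance(variables, str) or variables is None
    items = [variables] if single else variables
    cols = [0 if v is None else list(columns).index(v) + 1 for v in items]
    if string:
        return ' '.join(str(c) for c in cols)
    return cols[0] if single else cols

columns = ['East', 'North', 'Bitumen', 'Fines']
```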

Truncate NaN’s

DataFile.truncatenans(variable)

Returns the values of a variable with NaNs removed.

Parameters

variable (str) – Name of original variable.

Returns

Truncated values.

Return type

truncated (values)

Examples

A simple call that will return the list

>>> data_file.truncatenans('Bitumen')
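
The operation amounts to a pandas `dropna` on one column; a runnable sketch with a hypothetical column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Bitumen': [12.5, np.nan, 8.3, np.nan]})
# truncatenans conceptually returns the variable's values with NaNs dropped
truncated = df['Bitumen'].dropna().tolist()
```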

Unique Categories

DataFile.unique_cats(variable, truncatenans=False)

Returns a sorted list of the unique categories given a variable.

Parameters

variable (str) – Name of original variable.

Keyword Arguments

truncatenans (bool) – Truncates missing values if True.

Returns

Sorted, list of set(object).

Return type

unique_cats (List(object))

Examples

A simple call:

>>> data_file.unique_cats('Drillhole')

Or to save the list

>>> unique_dh_list = data_file.unique_cats('Drillhole')
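
Conceptually this is a sorted set of the column's values; the standalone helper below is an illustrative sketch, not pygeostat's implementation:

```python
import pandas as pd

def unique_cats(series, truncatenans=False):
    """Sorted unique values of a column, optionally dropping NaNs first."""
    if truncatenans:
        series = series.dropna()
    return sorted(set(series))

facies = pd.Series([1, 0, 2, 1, 0])
cats = unique_cats(facies)
```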

Write file

DataFile.write_file(flname, title=None, variables=None, fmt=None, sep=None, fltype=None, data=None, h5path=None, griddef=None, null=None, tvar=None, nreals=1)

Writes out a GSLIB-style, VTK, CSV, Excel (XLSX), or HDF5 data file.

Parameters

flname (str) – Path (or name) of file to write out.

Keyword Arguments
  • title (str) – Title for output file.

  • variables (List(str)) – List of variables to write out if only a subset is desired.

  • fmt (str) – Format to use for floating point numbers.

  • sep (str) – Delimiter to use for file output; generally does not need to change.

  • fltype (str) – Type of file to write: either gslib, vtk, csv, xlsx, or hdf5.

  • data (str) – Subset of data to write out; cannot be used with the variables option!

  • h5path (str) – The h5 group path to write data to (H5 filetype).

  • griddef (obj) – a gslib griddef object.

  • tvar (str) – Name of variable to use for compression when NaNs exist within it.

  • nreals (int) – number of realizations being written out (needed for GSB).

  • null (float) – If a number is provided, NaN values are converted to this value prior to writing. May be useful since NaNs are more easily handled within Python and pandas than null values, but are not valid in GSLIB. Set to None to disable (but NaNs must be handled prior to this function call if so).

Note:

pygeostat.write_file is retained for backwards compatibility or as an overloaded class method. The current write functions can be called separately with the functions listed below:

>>> import pygeostat as gs
>>> import pandas as pd
>>> gs.write_gslib(gs.DataFile or pd.DataFrame)
>>> gs.write_csv(gs.DataFile or pd.DataFrame)
>>> gs.write_hdf5(gs.DataFile or pd.DataFrame)
>>> gs.write_vtk(gs.DataFile or pd.DataFrame)
>>> gs.write_gsb(gs.DataFile or pd.DataFrame)

Note: The GSB format is not intended for general users of pygeostat. Some CCG programs use GSB, a compressed GSLIB-like binary data format that greatly reduces computational expense.

The following calls are equivalent:

>>> data_file.write_file('testgslib.out')
>>> data_file.write_file('testgsb.gsb')

are equivalent to:

>>> gs.write_gslib(data_file, 'testgslib.out')
>>> gs.write_gsb(data_file, 'testgsb.gsb')

and similar to:

>>> gs.write_gslib(data_file.data, 'testgslib.out')
>>> gs.write_gsb(data_file.data, 'testgsb.gsb')

Data Spacing

DataFile.spacing(n_nearest, var=None, inplace=True, dh=None, x=None, y=None)

Calculates data spacing in the xy plane, based on the average distance to the nearest n_nearest neighbours. The x, y coordinates of 3-D data may be provided in combination with a dh (drill hole or well), in which case the mean x, y of each dh is calculated before performing the calculation. If a dh is not provided in combination with 3-D x, y coordinates, the calculation is applied to all data, which may create memory issues if more than ~5000-10000 records are provided. A var specifier allows the calculation to be applied only where var is not NaN.

If inplace==True:

The output is concatenated as a ‘Data Spacing ({Parameters[‘plotting.unit’]})’ column (or ‘Data Spacing’ if Parameters[‘plotting.unit’] is None). If var is used, then the calculation is only performed where DataFile[var] is not NaN, and the output is concatenated as ‘{var} Data Spacing ({Parameters[‘plotting.unit’]})’.

If inplace==False:

The function returns dspace as a numpy array if dspace.shape[0] is equal to DataFile.shape[0], meaning that the dh and var functionality was not used, or did not lead to differences in the length of dspace and DataFile (so that the x and y in DataFile can be used for plotting dspace in map view). The function returns a tuple of the form (dspace, dh, x, y) if dh is not None and dspace.shape[0] is not equal to DataFile.shape[0]. The function returns a tuple of the form (dspace, x, y) if dh is None, var is not None and dspace.shape[0] is not equal to DataFile.shape[0].

Parameters
  • n_nearest (int) – number of nearest neighbours to consider in data spacing calculation

  • var (str) – variable for calculating data spacing, where the calculation is only applied to locations where var is not NaN. If None, the calculation is to all locations.

  • inplace (bool) – if True, the output data spacing is concatenated

  • dh (str) – dh name, which can override self.dh

  • x (str) – x coordinate name, which can override self.x

  • y (str) – y coordinate name, which can override self.y

Examples

Calculate data spacing without consideration of underlying variables, based on the nearest 8 neighbours.

>>> dat.spacing(8)

Output as a numpy array rather than concatenating a column:

>>> dspace = dat.spacing(8, inplace=False)

Only consider values where Au is not NaN for the calculation:

>>> (dspace, x, y) = dat.spacing(8, inplace=False, var='Au')
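
The core calculation can be sketched with a brute-force numpy distance matrix; `data_spacing` is a hypothetical helper illustrating the average-distance-to-n-nearest-neighbours idea, not pygeostat's (likely more efficient) implementation:

```python
import numpy as np

def data_spacing(x, y, n_nearest):
    """Average distance from each point to its n_nearest neighbours,
    computed brute force (fine for a few thousand records)."""
    pts = np.column_stack([x, y])
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    dist.sort(axis=1)  # column 0 is the zero self-distance
    return dist[:, 1:n_nearest + 1].mean(axis=1)

x = np.array([0.0, 1.0, 2.0, 10.0])
y = np.zeros(4)
dspace = data_spacing(x, y, n_nearest=1)
```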

Example Data

pygeostat.data.data.ExampleData(testfile, griddef=None, **kwargs)

Get an example pygeostat DataFile

Parameters

testfile (str) – one of the available pygeostat test files, listed below

Test files available in pygeostat include:

  • “point2d_ind”: 2d indicator dataset

  • “point2d_surf”: 2d point dataset sampling a surface

  • “grid2d_surf”: ‘Thickness’ from ‘point2d_surf’ interpolated on the grid

  • “point3d_ind_mv”: 3d multivariate and indicator dataset

  • “oilsands”: 3D Oil sands data set

  • “accuracy_plot”: Simulated realizations to test accuracy plot

  • “location_plot”: 2D data set to test location plot

  • “3d_grid”: 3D gridded data set

  • “point2d_mv” : 2D multivariate data set

  • “cluster”: GSLIB datafile (data with declustering weights)

  • “97data”: GSLIB datafile (the first 97 rows of cluster datafile)

  • “data”: GSLIB datafile (2D data set of primary and secondary variable)

  • “parta”: GSLIB datafile (small 2D dataset part A)

  • “partb”: GSLIB datafile (small 2D dataset part B)

  • “partc”: GSLIB datafile (small 2D dataset part C)

  • “true”: GSLIB datafile (Primary secondary data pairs)

  • “ydata”: GSLIB datafile (2D spatial secondary data with some primary data)

Input/Output Tools

iotools.py: Contains input/output utilities/functions for pygeostat, many of which are based on pandas built-in functions.

Read File

pygeostat.data.iotools.read_file(flname, fltype=None, headeronly=False, delimiter='\\s*', h5path=None, h5datasets=None, columns=None, ireal=1, griddef=None, tmin=None)

Reads in a GSLIB-style Geo-EAS data file, CSV, GSB or HDF5 data files.

Parameters

flname (str) – Path (or name) of file to read.

Keyword Arguments
  • fltype (str) – Type of file to read: either csv, gslib, or hdf5.

  • headeronly (bool) – If True, only reads in the first line from the data file, which is useful for getting column numbers or testing. For HDF5 files, it opens the hdf5 object with pandas HDFStore functionality.

  • delimiter (str) – Delimiter specified instead of sniffing

  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to read the dataset(s) specified by the argument datasets from. The dataset name cannot be passed using this argument, it is interpreted as a group name only. A value of None places the dataset into the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().

  • h5datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.

  • columns (list) – List of column labels to use for the resulting frame

  • ireal (int) – Number of realizations in the file

  • griddef (GridDef) – griddef for the realization

  • tmin (float) – values less than this number are converted to NaN, since NaNs are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters[‘data.tmin’].

Returns

Pandas DataFrame object with input data.

Return type

data (pandas.DataFrame)

Note

Functions can also be called separately with the following code

>>> data.data = pygeostat.read_gslib(flname)
>>> data.data = pygeostat.read_csv(flname)
>>> data.data = pygeostat.read_h5(flname, h5path='')
>>> data.data = pygeostat.read_gsb(flname)
>>> data.data = pygeostat.open_hdf5(flname)

Examples

>>> data.data = gs.read_gsb('testgsb.gsb')
>>> data = gs.DataFile('testgsb.gsb')

Read CSV

pygeostat.data.iotools.read_csv(flname, headeronly=False, tmin=None)

Reads in a GSLIB-style CSV data file.

Parameters

flname (str) – Path (or name) of file to read.

Keyword Arguments
  • headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing

  • delimiter (str) – Delimiter specified instead of sniffing

  • tmin (float) – values less than this number are converted to NaN, since NaNs are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters[‘data.tmin’].

Returns

Pandas DataFrame object with input data.

Return type

data (pandas.DataFrame)

Read GSLIB Python

pygeostat.data.iotools.read_gslib(flname, headeronly=False, delimiter='\\s*', tmin=None)

Reads in a GSLIB-style Geo-EAS data file

Parameters

flname (str) – Path (or name) of file to read.

Keyword Arguments
  • headeronly (bool) – If True, only reads in the 1st line from the data file which is useful for just getting column numbers or testing

  • delimiter (str) – Delimiter specified instead of sniffing

  • tmin (float) – values less than this number are converted to NaN, since NaNs are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters[‘data.tmin’].

Returns

Pandas DataFrame object with input data.

Return type

data (pandas.DataFrame)

Fortran Compile for GSB

pygeostat.data.iotools.compile_pygsb()

Compiles ‘pygeostat/fortran/src/pygsb.f90’ using ‘pygeostat/fortran/compile.py’ and tries to import pygsb.pyd

Note

How to install a gfortran compiler:

  • Install chocolatey from:

    chocolatey.org/install

(chocolatey is a package manager that lets you install software using the command prompt and PowerShell)

  • After installing chocolatey, then install the ‘gnu Fortran compiler’ by writing the below in a PowerShell:

    choco install mingw --version 8.1

    choco install visualstudio2019community

    choco install visualstudio2019-workload-vctools

  • When installing mingw through chocolatey, ensure that the path of the mingw bin folder is added to the PATH environment variable.

Read GSB

pygeostat.data.iotools.read_gsb(flname, ireal=-1, tmin=None, null=None)

Reads in a CCG GSB (GSLIB-Binary) file.

Parameters

flname (str) – Path (or name) of file to read.

Keyword Arguments
  • ireal (int) – 1-indexed realization number to read (reads 1 at a time), -1 to read all

  • tmin (float) – values less than this number are converted to NaN, since NaNs are naturally handled within matplotlib, pandas, numpy, etc. If None, set to pygeostat.Parameters[‘data.tmin’].

  • null (float) – when the gsb array has a keyout, on reconstruction this value fills the array in keyed out locations. If None taken from Parameters[‘data.null’]

Returns

Pandas DataFrame object with input data.

Return type

data (pandas.DataFrame)

Code author: Jared Deutsch 2016-02-19

Write GSLIB Python

pygeostat.data.iotools.write_gslib(data, flname, title=None, variables=None, fmt=None, sep=' ', null=None)

Writes out a GSLIB-style data file.

Parameters
  • data (pygeostat.DataFile or pandas.DataFrame) – data to write out

  • flname (str) – Path (or name) of file to write out.

Keyword Arguments
  • title (str) – Title for output file.

  • variables (List(str)) – List of variables to write out if only a subset is desired.

  • fmt (str) – Format to use for floating point numbers.

  • sep (str) – Delimiter to use for file output, generally don’t need to change.

  • null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.Parameters[‘data.null’].
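
The GSLIB/Geo-EAS layout (title line, number of columns, one column name per line, then space-delimited values) can be sketched in a few lines; this minimal `write_gslib` is an illustrative stand-in, not pygeostat's implementation:

```python
import io
import numpy as np
import pandas as pd

def write_gslib(data, buf, title='data', null=-999.0, fmt='%.4f'):
    """Minimal Geo-EAS/GSLIB writer: title line, number of columns,
    one column name per line, then space-delimited values."""
    df = data.fillna(null)
    buf.write(f"{title}\n{df.shape[1]}\n")
    for col in df.columns:
        buf.write(f"{col}\n")
    np.savetxt(buf, df.values, fmt=fmt, delimiter=' ')

df = pd.DataFrame({'East': [100.0], 'Bitumen': [np.nan]})
buf = io.StringIO()
write_gslib(df, buf)
lines = buf.getvalue().splitlines()
```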

Write CSV

pygeostat.data.iotools.write_csv(data, flname, variables=None, fmt='%.5f', sep=', ', fltype='csv', null=None)

Writes out a CSV or Excel (XLSX) data file.

Parameters
  • data (pygeostat.DataFile or pandas.DataFrame) – data to write out

  • flname (str) – Path (or name) of file to write out.

Keyword Arguments
  • variables (List(str)) – List of variables to write out if only a subset is desired.

  • fmt (str) – Format to use for floating point numbers.

  • sep (str) – Delimiter to use for file output, generally don’t need to change.

  • fltype (str) – Type of file to write either csv or xlsx.

  • null (float) – NaN numbers are converted to this value prior to writing. If None, set to data.null. If data.null is None, set to pygeostat.Parameters[‘data.null’].

Write GSB

pygeostat.data.iotools.write_gsb(data, flname, tvar=None, nreals=1, variables=None, griddef=None, fmt=0)

Writes out a GSB (GSLIB-Binary) style data file. NaN values of tvar are compressed in the output; no tmin is applied.

Parameters
  • data (pygeostat.DataFile or pandas.DataFrame) – data to write out

  • flname (str) – Path (or name) of file to write out.

  • tvar (str) – Variable to trim by or None for no trimming. Note that all variables are trimmed in the data file (for compression) when this variable is trimmed.

  • nreals (int) – number of realizations in data

Keyword Arguments
  • griddef (pygeostat.griddef.GridDef) – This is required if the data is gridded and you want other gsb programs to read it

  • fmt (int) – if 0, all variables are written out as float64. Otherwise, should be a list with length equal to the number of variables, using the following format codes: 1=int32, 2=float32, 3=float64

  • variables (List(str)) – List of variables to write out if only a subset is desired.

Code author: Jared Deutsch 2016-02-19, modified by Ryan Barnett 2018-04-12

Write VTK

pygeostat.data.iotools.write_vtk(data, flname, dftype=None, x=None, y=None, z=None, variables=None, griddef=None, null=None, vdtype=None, cdtype=None)

Writes out an XML VTK data file. A required dependency is pyevtk, which may be installed using the following command:

>>> pip install pyevtk

Users are also recommended to install the latest Paraview, as versions from 2017 were observed to have odd precision bugs with the XML format.

Parameters
  • data (pygeostat.DataFile) – data to write out

  • flname (str) – Path (or name) of file to write out (without extension)

Keyword Arguments
  • dftype (str) – type of datafile options grid or point, which if None, is drawn from data.dftype

  • x (str) – name of the x-coordinate, which is used if point. Drawn from data.x if the kwarg=None. If not provided by these means for `sgrid`, calculated via sim.griddef.get_coordinates().

  • y (str) – name of the y-coordinate, which is used if point. Drawn from data.y if the kwarg=None. If not provided by these means for `sgrid`, calculated via sim.griddef.get_coordinates().

  • z (str) – name of the z-coordinate, which is used if point. Drawn from data.z if the kwarg=None. If not provided by these means for `sgrid`, calculated via sim.griddef.get_coordinates().

  • griddef (pygeostat.GridDef) – grid definition, which is required if grid. Drawn from data.griddef if the kwarg=None.

  • variables (list or str) – List or string of variables to write out. If None, then all columns aside from coordinates are written out by default.

  • null (float) – NaNs are converted to this value prior to writing. If None, set to pygeostat.Parameters[‘data.null_vtk’].

  • vdtype (dict(str)) – Dictionary of the format {‘varname’: dtype}, where dtype is a numpy data format. May be used for reducing file size, by specifying int, float32, etc. If a format string is provided instead of a dictionary, that format is applied to all variables. This is not applied to coordinate variables (if applicable). If None, the value is drawn from Parameters[‘data.write_vtk.vdtype’].

  • cdtype (str) – Numpy format to use for the output of coordinates, where valid formats are float64 (default) and float32. The latter is recommended for reducing file sizes, but may not provide the requisite precision for UTM coordinates. If None, the value is drawn from Parameters[‘data.write_vtk.cdtype’].

dftype should be one of:

  1. ‘point’ (irregular points) where data.x, data.y and data.z are columns in data.data

  2. ‘grid’ (regular or rectilinear grid) where data.griddef must be initialized

  3. ‘sgrid’ (structured grid) where data.x, data.y and data.z are columns in data.data. data.griddef should also be initialized, although only griddef.nx, griddef.ny and griddef.nz are utilized (since the grid is assumed to not be regular)
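The three modes impose different metadata requirements. A minimal sketch of the validation logic implied by the rules above (the function name is hypothetical and not pygeostat's internal code):

```python
def check_vtk_inputs(dftype, has_xyz, has_griddef):
    """Hypothetical check mirroring the dftype rules documented above."""
    if dftype == 'point':
        # irregular points require x, y and z coordinate columns
        if not has_xyz:
            raise ValueError("'point' requires data.x, data.y and data.z columns")
    elif dftype == 'grid':
        # regular/rectilinear grids require a grid definition
        if not has_griddef:
            raise ValueError("'grid' requires data.griddef to be initialized")
    elif dftype == 'sgrid':
        # structured grids require coordinate columns plus griddef.nx/ny/nz
        if not (has_xyz and has_griddef):
            raise ValueError("'sgrid' requires coordinate columns and data.griddef")
    else:
        raise ValueError("dftype must be 'point', 'grid' or 'sgrid'")
    return True
```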

Write HDF5 VTK

pygeostat.data.iotools.write_hvtk(data, flname, griddef, variables=None)

Writes out an H5 file and corresponding xdmf file that Paraview can read. Currently only supports 3D gridded datasets. This function will fail if the length of the DataFile or DataFrame does not equal griddef.count().

The extension xdmf is silently enforced. Any other extension passed is replaced.

Parameters
  • data (pd.DataFrame) – The DataFrame to write out

  • flname (str) – Path (or name) of file to write out.

  • griddef (GridDef) – Grid definitions for the realizations to be written out

  • variables (str or list) – optional set of variables to write out from the DataFrame
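The note above that the xdmf extension is "silently enforced" can be sketched as follows (assumed behavior, with a hypothetical helper name):

```python
from pathlib import Path

def enforce_xdmf(flname):
    """Hypothetical sketch: replace any extension with .xdmf, as write_hvtk
    is documented to do silently with the path it receives."""
    return str(Path(flname).with_suffix('.xdmf'))
```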

Count Lines in File

pygeostat.data.iotools.file_nlines(flname)

Open a file and count its total number of lines. Adapted from Stack Overflow: http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python

Parameters

flname (str) – Name of the file to read
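The linked Stack Overflow approach counts newline characters over buffered binary reads, which keeps memory use flat regardless of file size. A self-contained sketch of that technique:

```python
def count_lines(flname):
    """Count lines by summing newline bytes over fixed-size binary chunks."""
    lines = 0
    with open(flname, 'rb') as f:
        while True:
            buf = f.read(1024 * 1024)  # read 1 MB at a time
            if not buf:
                break
            lines += buf.count(b'\n')
    return lines
```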

Write CCG GMM

pygeostat.data.iotools.writeout_gslib_gmm(gmm, outfile)

Write out a fitted Gaussian mixture in the format consistent with gmmfit from the CCG Knowledge Base. Assumes gmm is an sklearn.mixture.GaussianMixture instance fitted to data.

Note

Recently, GMM was replaced with GaussianMixture in scikit-learn, and there are subtle differences in attributes between the two versions.

Parameters
  • gmm (GaussianMixture) – a fitted mixture model

  • outfile (str) – the output file
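A fitted sklearn.mixture.GaussianMixture exposes its parameters through the weights_, means_ and covariances_ attributes. A minimal sketch of exporting those to a plain-text file; the exact gmmfit column layout is not reproduced here, so treat the format below as illustrative only:

```python
import numpy as np

def export_gmm(gmm, outfile):
    """Illustrative export of Gaussian mixture parameters to text.
    Assumes gmm exposes sklearn-style weights_, means_ and covariances_."""
    with open(outfile, 'w') as f:
        f.write(f"{len(gmm.weights_)} components\n")
        for w, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
            f.write(f"weight {w:.6f}\n")
            f.write("mean " + " ".join(f"{m:.6f}" for m in mu) + "\n")
            # one 'cov' row per row of the component covariance matrix
            for row in np.atleast_2d(cov):
                f.write("cov " + " ".join(f"{c:.6f}" for c in row) + "\n")
```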

HDF5 I/O

Write HDF5

pygeostat.data.h5_io.write_h5(data, flname, h5path=None, datasets=None, dtype=None, gridstr=None, trim_variable=None, var_min=-998.0)

Write data to an HDF5 file using the Python package h5py. The file is appended to; if a dataset already exists, it is overwritten.

Parameters
  • data – A 1-D np.array/pd.Series or a pd.DataFrame containing different variables as columns

  • flname (str) – Path of the HDF5 you wish to write to or create

  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to place the dataset(s) specified by the argument datasets into. The dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.

  • datasets (str or list) – Name of the dataset(s) to write out. If a pd.DataFrame is passed, the values passed by the argument datasets must match the DataFrame’s columns.

  • dtype (str) – The data type to write. Currently, only the following values are permitted: ['int32', 'float32', 'float64']. If a pd.DataFrame is passed and this argument is left at its default value of None, the DataFrame’s dtypes must be of the types listed above.

  • gridstr (str) – Grid definition string that is saved to the HDF5 file as an attribute of the group defined by the parameter h5path.

  • trim_variable (str) – Variable to use for trimming the data. An index is written to the HDF5 file so that the dataset can be rebuilt; only non-trimmed data are written out.

  • var_min (float) – minimum trimming limit used if trim_variable is passed

Examples

Write a single pd.Series or np.array to an HDF5 file:

>>> gs.write_h5(array, 'file.h5', h5path='Modeled/Var1', datasets='Realization_0001')

Write a whole pd.DataFrame in group (folder) ‘OriginalData’ that contains a dataset for every column in the pd.DataFrame:

>>> gs.write_h5(DataFrame, 'file.h5', h5path='OriginalData')

Read HDF5

pygeostat.data.h5_io.read_h5(flname, h5path=None, datasets=None, fill_value=-999)

Return a 1-D array from an HDF5 file or build a pd.DataFrame() from a list of datasets in a single group.

The argument h5path must be a path to a group. If one or more specific variables should be loaded, pass a list to datasets to specify which to read.

Parameters
  • flname (str) – Path of the HDF5 file you wish to read from

  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to read the dataset(s) specified by the argument datasets from. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None reads the dataset(s) from the root directory of the HDF5 file. A value of False loads a blank pd.DataFrame().

  • datasets (str or list) – Name of the dataset(s) to read from the group specified by h5path. Does nothing if h5path points to a dataset.

  • fill_value (float or np.NaN) – value to fill in grid with if trimmed data was written out. default is -999

Returns

DataFrame containing one or more columns, each containing a single 1-D array of a variable.

Return type

data (pd.DataFrame)
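The trim_variable/fill_value pair works together: write_h5 stores only the non-trimmed values plus the index needed to restore them, and read_h5 rebuilds the full-length array by filling the trimmed slots. A minimal sketch of that round trip using plain lists (independent of the actual HDF5 layout, which is not specified here):

```python
def trim(values, var_min):
    """Keep only values above var_min, plus the index needed to rebuild."""
    kept, index = [], []
    for i, v in enumerate(values):
        if v > var_min:
            kept.append(v)
            index.append(i)
    return kept, index, len(values)

def rebuild(kept, index, n, fill_value=-999):
    """Restore the full-length array, filling trimmed slots with fill_value."""
    out = [fill_value] * n
    for v, i in zip(kept, index):
        out[i] = v
    return out
```

The round trip preserves every valid value while storing only the compressed subset on disk.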

Is HDF5

pygeostat.data.h5_io.ish5dataset(h5fl, dataset, h5path=None)

Check to see if a dataset exists within an HDF5 file

The argument h5path must be a path to a group and cannot contain the dataset name. Can only check for one dataset at a time.

Parameters
  • h5fl (str) – Path of the HDF5 file you wish to check

  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to check for the specified dataset. The dataset name cannot be passed using this argument; it is interpreted as a group name only. A value of None checks the root directory of the HDF5 file.

  • dataset (str) – Name of the dataset to check for in the group specified by h5path.

Returns

Indicator if the specified dataset exists

Return type

exists (bool)

Combine Datasets from Multiple Paths

pygeostat.data.h5_io.h5_combine_data(flname, h5paths, datasets=None)

Combine data into one DataFrame from multiple paths in a HDF5 file.

Parameters
  • flname (str) – Path of the HDF5 you wish to read from

  • h5paths (list) – A list of forward slash (/) delimited paths through the group hierarchy from which to read the dataset(s) specified by the argument datasets. Dataset names cannot be passed using this argument; they are interpreted as group names.

  • datasets (list of lists) – If only a specific set of datasets from each path is desired, pass a list of lists of equal length to the h5paths list. An empty list within the list will cause all datasets in the corresponding path to be read in.

Returns

DataFrame

Example:

>>> flname = 'drilldata.h5'
... h5paths = ['/Orig_data/series4870/', 'NS/Declus/series4870/']
... datasets = [['LOCATIONX', 'LOCATIONY', 'LOCATIONZ'], []]
... data = gs.h5_combine_data(flname, h5paths, datasets=datasets)

Pygeostat HDF5 Class

class pygeostat.data.h5_io.H5Store(flname, replace=False)

A simple class within pygeostat to manage and use HDF5 files.

Variables
  • flname (str) – Path to a HDF5 file to create or use

  • h5data (h5py.File) – h5py File object

  • paths (dict) – Dictionary containing all of the groups found in the HDF5 file that contain datasets

Parameters

flname (str) – Path to a HDF5 file to create or use

Usage:

Write a np.array or pd.Series to the HDF5 file:

>>> H5Store['Group1/Group2/Var1'] = np.array()

Write all the columns in a pd.DataFrame to the HDF5 file:

>>> H5Store['Group1/Group2'] = pd.DataFrame()

Retrieve a single 1-D array:

>>> array = H5Store['Group1/Group2/Var1']

Retrieve a single 1-D array within the root directory of the HDF5 file:

>>> array = H5Store['Var1']

Retrieve the first value from the array:

>>> value = H5Store['Var1', 0]

Retrieve a slice of values from the array:

>>> values = H5Store['Var1', 10:15]

Write Data

H5Store.__setitem__(key, value)

Write to the HDF5 file using the self[key] notation.

If a pd.Series or np.array is passed, the last entry in the path is used as the dataset name. If a pd.DataFrame is passed, all columns are written as datasets to the specified path, with names taken from the pd.DataFrame’s columns. If more flexible usage is required, please use gs.write_h5().

Example

Write a np.array or pd.Series to the HDF5 file:

>>> H5Store['Group1/Group2/Var1'] = np.array()

Write all the columns in a pd.DataFrame to the HDF5 file:

>>> H5Store['Group1/Group2'] = pd.DataFrame()

Read Data

H5Store.__getitem__(key)

Retrieve an array using the self[key] notation. The passed key is the path to the desired array, including navigation through groups if required, followed by the dataset name. The array may be selectively queried, allowing a specific value or range of values to be loaded into memory rather than the whole array.

Example

Retrieve a single 1-D array:

>>> array = H5Store['Group1/Group2/Var1']

Retrieve a single 1-D array within the root directory of the HDF5 file:

>>> array = H5Store['Var1']

Retrieve the first value from the array:

>>> value = H5Store['Var1', 0]

Retrieve a slice of values from the array:

>>> values = H5Store['Var1', 10:15]

Close the HDF5 File

H5Store.close()

Release the open HDF5 file from python.

Datasets in H5 Store

H5Store.datasets(h5path=None)

Return the datasets found in the specified group.

Keyword Arguments

h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to retrieve the lists of datasets from. A dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.

Returns

List of the datasets found within the specified h5path

Return type

datasets (list)

Generate Iterator

H5Store.iteritems(h5path=None, datasets=None, wildcard=None)

Produces an iterator that can be used to iterate over HDF5 datasets.

Can use the parameter h5path to indicate which group to retrieve the datasets from. If a set of specific datasets are required, the parameter datasets will restrict the iterator to those. The parameter wildcard allows a string wild-card value to restrict which datasets are iterated over.

Keyword Arguments
  • h5path (str) – Forward slash (/) delimited path through the group hierarchy you wish to retrieve datasets from. A dataset name cannot be passed using this argument, it is interpreted as a group name. A value of None places the dataset into the root directory of the HDF5 file.

  • datasets (list) – List of specific dataset names found within the specified group to iterator over

  • wildcard (str) – String to search for within the names of the datasets found within the specified group to iterate over

Examples

Load a HDF5 file to pygeostat:

>>> data = gs.H5Store('data.h5')

Iterate over all datasets within the root directory of a HDF5 file:

>>> for dataset in data.iteritems():
...     gs.histplt(dataset)

Iterate over the datasets within a specific group that are realizations:

>>> for dataset in data.iteritems(h5path='Simulation/NS_AU', wildcard='Realization'):
...     gs.histplt(dataset)

DictFile Class

class pygeostat.data.data.DictFile(flname=None, readfl=False, dictionary={})

Class containing dictionary file information

Read Dictionary

DictFile.read_dict()

Read dictionary information from file

Write Dictionary

DictFile.write_dict()

Write dictionary information to a CSV-style dictionary file
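The read/write behavior of DictFile is not fully specified above. A minimal sketch of a CSV-style dictionary round trip using only the standard library (the actual pygeostat file layout may differ, and these helper names are hypothetical):

```python
import csv

def write_dict_csv(dictionary, flname):
    """Write key,value pairs to a CSV-style dictionary file, one pair per row."""
    with open(flname, 'w', newline='') as f:
        writer = csv.writer(f)
        for key, value in dictionary.items():
            writer.writerow([key, value])

def read_dict_csv(flname):
    """Read key,value rows back into a dict (values remain strings)."""
    with open(flname, newline='') as f:
        return {row[0]: row[1] for row in csv.reader(f) if row}
```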