elm.pipeline.steps

The examples below assume you have created a random ElmStore as follows:

from elm.pipeline.tests.util import random_elm_store
X = random_elm_store()
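
An ElmStore behaves like an xarray.Dataset, so it can be inspected the same way. A hedged sketch (the band names depend on random_elm_store's defaults):

print(list(X.data_vars))  # band DataArray names, e.g. ['band_1', 'band_2', ...]
# each band is a 2-D raster DataArray with spatial coordinates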

Operations to reshape an ElmStore

  • Flatten - Flatten each 2-D DataArray in an ElmStore to create an ElmStore with a single DataArray called flat that is 2-D (each band raster is raveled from 2-D to a 1-D column in flat). Example:
steps.Flatten().fit_transform(X)
  • Agg - Aggregate over a dimension or axis. Example:
steps.Agg(axis=0, func='mean').fit_transform(X)
  • DropNaRows - Remove null / NaN rows from an ElmStore that has been through steps.Flatten():
steps.DropNaRows().fit_transform(*steps.Flatten().fit_transform(X))
  • InverseFlatten - Convert a flattened ElmStore back to 2-D rasters as separate DataArray values in an ElmStore (see the round-trip sketch after this list). Example:
steps.InverseFlatten().fit_transform(*steps.Flatten().fit_transform(X))
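
The reshape steps above compose. A minimal round-trip sketch (hedged: it assumes X has no NaN cells, so Flatten followed by InverseFlatten recovers the original rasters):

flat, y, sample_weight = steps.Flatten().fit_transform(X)
# flat.flat is a 2-D (space, band) DataArray: one raveled column per band
restored, _, _ = steps.InverseFlatten().fit_transform(flat)
# restored again holds one 2-D raster DataArray per band, like the original X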

Use an unsupervised feature extractor

  • Transform - steps.Transform allows any sklearn.decomposition estimator to be used in an elm Pipeline. Partial fit of the feature extractor can be accomplished by passing partial_fit_batches at initialization:
from sklearn.decomposition import IncrementalPCA
X, y, sample_weight = steps.Flatten().fit_transform(X)
pca = steps.Transform(IncrementalPCA(),
                      partial_fit_batches=2)
pca.fit_transform(X)
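
steps.Transform is typically one step among several. A minimal sketch of a full Pipeline (hedged: it assumes elm.pipeline.Pipeline accepts (name, step) tuples like scikit-learn's Pipeline, and that X is the original unflattened ElmStore from the setup above):

from sklearn.cluster import MiniBatchKMeans
from elm.pipeline import Pipeline

pipe = Pipeline([('flat', steps.Flatten()),
                 ('pca', steps.Transform(IncrementalPCA(n_components=2))),
                 ('est', MiniBatchKMeans(n_clusters=4))])
pipe.fit(X)  # assumption: fitting a single ElmStore sample this way is supported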

Run a user-given callable

There are two choices for running a user-given callable in a Pipeline. ModifySample is the most general, accepting X, y and sample_weight arguments of any shape, while FunctionTransformer requires that the ElmStore has been through steps.Flatten().

  • ModifySample - The following shows an example function with the required signature for use with ModifySample. It divides the values in each DataArray by their maximum when normalize=True is given. Note that the function always returns a tuple of (X, y, sample_weight), even if y and sample_weight are not used by the function:
def modifier(X, y=None, sample_weight=None, **kwargs):
    # kwargs holds any keyword arguments given to ModifySample at init
    for band in X.data_vars:
        arr = getattr(X, band)
        if kwargs.get('normalize'):
            arr.values /= arr.values.max()
    return X, y, sample_weight

steps.ModifySample(modifier, normalize=True).fit_transform(X)
  • FunctionTransformer - Apply a callable to the 2-D flat DataArray of a flattened ElmStore. See also FunctionTransformer docs from sklearn. Example:
import numpy as np
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Xnew, y, sample_weight = steps.FunctionTransformer(func=np.log).fit_transform(Xnew)
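
Any elementwise numpy ufunc can be passed the same way. A hedged sketch (assuming func is applied to the 2-D flat array, as in sklearn's FunctionTransformer):

flat, _, _ = steps.Flatten().fit_transform(X)
steps.FunctionTransformer(func=np.expm1).fit_transform(flat)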

Preprocessing - Scaling / Normalization

Each of the following classes from scikit-learn has been wrapped for usage as a Pipeline step. Each requires that the ElmStore has been through steps.Flatten().

The examples below continue with Xnew, a flattened ElmStore:

Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
  • KernelCenterer - See also KernelCenterer scikit-learn docs.
steps.KernelCenterer().fit_transform(Xnew)
  • MaxAbsScaler - See also MaxAbsScaler scikit-learn docs.
steps.MaxAbsScaler().fit_transform(Xnew)
  • MinMaxScaler - See also MinMaxScaler scikit-learn docs.
steps.MinMaxScaler().fit_transform(Xnew)
  • Normalizer - See also Normalizer scikit-learn docs.
steps.Normalizer().fit_transform(Xnew)
  • RobustScaler - See also RobustScaler scikit-learn docs.
steps.RobustScaler().fit_transform(Xnew)
  • PolynomialFeatures - See also PolynomialFeatures scikit-learn docs.
step = steps.PolynomialFeatures(degree=3,
                                interaction_only=False)
step.fit_transform(Xnew)
  • StandardScaler - See also StandardScaler scikit-learn docs.
steps.StandardScaler().fit_transform(Xnew)
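
Scaled output can be mapped back to rasters. A hedged sketch combining StandardScaler with InverseFlatten from the reshape steps above:

scaled, _, _ = steps.StandardScaler().fit_transform(Xnew)
rasters, _, _ = steps.InverseFlatten().fit_transform(scaled)
# rasters holds one 2-D DataArray per band, now standardized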

Encoding Preprocessors from sklearn

Each method here requires that the ElmStore has been through steps.Flatten() as follows:

Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
  • Binarizer - Binarize features. See also Binarizer docs from sklearn.
steps.Binarizer().fit_transform(Xnew)
  • Imputer - Impute missing values. See also Imputer docs from sklearn.
steps.Imputer().fit_transform(Xnew)
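
These encoders chain like any other steps. A hedged sketch (assuming Binarizer's threshold keyword is passed through to the underlying sklearn class):

imputed, _, _ = steps.Imputer().fit_transform(Xnew)
steps.Binarizer(threshold=0.5).fit_transform(imputed)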

Feature selectors

The following list shows the feature selectors that may be used in a Pipeline. With the exception of VarianceThreshold, each method requires y to be not None.

Setup for the examples:

from sklearn.linear_model import LinearRegression
X, y = random_elm_store(return_y=True)
X = steps.Flatten().fit_transform(X)[0]
  • RFE - See also RFE in sklearn docs. Example:
steps.RFE(estimator=LinearRegression()).fit_transform(X, y)
  • RFECV - See also RFECV in sklearn docs. Example:
steps.RFECV(estimator=LinearRegression()).fit_transform(X, y)
  • SelectFdr - See also SelectFdr in sklearn docs. Example:
steps.SelectFdr().fit_transform(X, y)
  • SelectFpr - See also SelectFpr in sklearn docs. Example:
steps.SelectFpr().fit_transform(X, y)
  • SelectFromModel - See also SelectFromModel in sklearn docs. Example:
steps.SelectFromModel(estimator=LinearRegression()).fit_transform(X, y)
  • SelectFwe - See also SelectFwe in sklearn docs. Example:
steps.SelectFwe().fit_transform(X, y)
  • SelectKBest - See also SelectKBest in sklearn docs. Example:
steps.SelectKBest(k=2).fit_transform(X, y)
  • SelectPercentile - See also SelectPercentile in sklearn docs. Example:
steps.SelectPercentile(percentile=50).fit_transform(X, y)
  • VarianceThreshold - See also VarianceThreshold in sklearn docs. Note that VarianceThreshold does not require y. Example:
steps.VarianceThreshold(threshold=6.92).fit_transform(X)
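
A fitted selector reduces the band dimension of the flat DataArray. A hedged check of the SelectKBest result (assuming fit_transform returns the usual (X, y, sample_weight) tuple):

Xsel, ysel, _ = steps.SelectKBest(k=2).fit_transform(X, y)
# Xsel.flat now has 2 columns: the 2 best-scoring bands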

Normalizing time dimension of 3-D Cube

The following two steps take an ElmStore whose DataArray (of any name) is a 3-D cube with a time dimension. They run descriptive stats along the time dimension and flatten the spatial (x, y) dims to a single space dimension (essentially a ravel of the (x, y) points).

Setup - make a compatible ElmStore:

from elm.readers import ElmStore
import numpy as np
import xarray as xr
def make_3d():
    arr = np.random.uniform(0, 1, 100000).reshape(100, 10, 100)
    return ElmStore({'band_1': xr.DataArray(arr,
                            coords=[('time', np.arange(100)),
                                    ('x', np.arange(10)),
                                    ('y', np.arange(100))],
                            dims=('time', 'x', 'y'),
                            attrs={})}, attrs={}, add_canvas=False)
X = make_3d()
  • TSDescribe - Run scipy.stats.describe and other summary stats along the time axis of a 3-D cube DataArray. Example:
s = steps.TSDescribe(band='band_1', axis=0)
Xnew, y, sample_weight = s.fit_transform(X)
Xnew.flat.band

The above code shows that the band dimension of Xnew.flat consists of different summary statistics, mostly from scipy.stats.describe:

<xarray.DataArray 'band' (band: 8)>
array(['var', 'skew', 'kurt', 'min', 'max', 'median', 'std', 'np_skew'],
      dtype='<U7')
Coordinates:
  * band     (band) <U7 'var' 'skew' 'kurt' 'min' 'max' 'median' 'std' 'np_skew'
  • TSProbs - TSProbs bins the values along the time axis, counts them, and returns the probabilities associated with the bin counts. Example:
fixed_bins = steps.TSProbs(band='band_1',
                           bin_size=0.5,
                           num_bins=152,
                           log_probs=True,
                           axis=0)
Xnew, y, sample_weight = fixed_bins.fit_transform(X)

The above would create the DataArray Xnew.flat with 152 columns consisting of the log-transformed bin probabilities (152 fixed bins of width 0.5).

And the following would use irregular (numpy.histogram) bins rather than fixed-width bins, returning probabilities without a log transform:

irregular_bins = steps.TSProbs(band='band_1',
                               num_bins=152,
                               log_probs=False,
                               axis=0)
Xnew, y, sample_weight = irregular_bins.fit_transform(X)
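
In both cases the result is a flattened ElmStore with one column per bin. A hedged check of the expected shape (the 10 x 100 spatial grid from make_3d ravels to 1000 rows):

Xnew.flat.shape  # expected: (1000, 152)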