elm.pipeline.steps

The examples below assume you have created a random ElmStore as follows:

from elm.pipeline.tests.util import random_elm_store
X = random_elm_store()
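
An ElmStore behaves like an xarray.Dataset, so it can be inspected the same way. A hedged sketch (the band names depend on random_elm_store's defaults):

print(list(X.data_vars))  # band DataArray names, e.g. ['band_1', 'band_2', ...]
# each band is a 2-D raster DataArray with spatial coordinates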

Operations to reshape an ElmStore

  • Flatten - Flatten each 2-D DataArray in an ElmStore to create an ElmStore with a single DataArray called flat that is 2-D (each band raster is raveled from 2-D to a 1-D column in flat). Example:
steps.Flatten().fit_transform(X)
  • Agg - Aggregate over a dimension or axis. Example:
steps.Agg(axis=0, func='mean').fit_transform(X)
  • DropNaRows - Remove null / NaN rows from an ElmStore that has been through steps.Flatten():
steps.DropNaRows().fit_transform(*steps.Flatten().fit_transform(X))
  • InverseFlatten - Convert a flattened ElmStore back to 2-D rasters as separate DataArray values in an ElmStore (see the round-trip sketch after this list). Example:
steps.InverseFlatten().fit_transform(*steps.Flatten().fit_transform(X))
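
The reshape steps above compose. A minimal round-trip sketch (hedged: it assumes X has no NaN cells, so Flatten followed by InverseFlatten recovers the original rasters):

flat, y, sample_weight = steps.Flatten().fit_transform(X)
# flat.flat is a 2-D (space, band) DataArray: one raveled column per band
restored, _, _ = steps.InverseFlatten().fit_transform(flat)
# restored again holds one 2-D raster DataArray per band, like the original X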

Use an unsupervised feature extractor

  • Transform - steps.Transform allows any sklearn.decomposition estimator to be used in an elm Pipeline. Partial fit of the feature extractor can be accomplished by passing partial_fit_batches at initialization:
from sklearn.decomposition import IncrementalPCA
X, y, sample_weight = steps.Flatten().fit_transform(X)
pca = steps.Transform(IncrementalPCA(),
                      partial_fit_batches=2)
pca.fit_transform(X)
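
steps.Transform is typically one step among several. A minimal sketch of a full Pipeline (hedged: it assumes elm.pipeline.Pipeline accepts (name, step) tuples like scikit-learn's Pipeline, and that X is the original unflattened ElmStore from the setup above):

from sklearn.cluster import MiniBatchKMeans
from elm.pipeline import Pipeline

pipe = Pipeline([('flat', steps.Flatten()),
                 ('pca', steps.Transform(IncrementalPCA(n_components=2))),
                 ('est', MiniBatchKMeans(n_clusters=4))])
pipe.fit(X)  # assumption: fitting a single ElmStore sample this way is supported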

Run a user-given callable

There are two choices for running a user-given callable in a Pipeline. ModifySample is the most general, accepting X, y and sample_weight arguments of any shape, while FunctionTransformer requires that the ElmStore has been through steps.Flatten().

  • ModifySample - The following shows an example function with the required signature for use with ModifySample. It divides the values in each DataArray by their maximum when normalize=True is given. Note that the function always returns a tuple of (X, y, sample_weight), even if y and sample_weight are not used by the function:
def modifier(X, y=None, sample_weight=None, **kwargs):
    # kwargs holds any keyword arguments given to ModifySample at init
    for band in X.data_vars:
        arr = getattr(X, band)
        if kwargs.get('normalize'):
            arr.values /= arr.values.max()
    return X, y, sample_weight

steps.ModifySample(modifier, normalize=True).fit_transform(X)
  • FunctionTransformer - Apply a callable to the 2-D flat DataArray of a flattened ElmStore. See also FunctionTransformer docs from sklearn. Example:
import numpy as np
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Xnew, y, sample_weight = steps.FunctionTransformer(func=np.log).fit_transform(Xnew)
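
Any elementwise numpy ufunc can be passed the same way. A hedged sketch (assuming func is applied to the 2-D flat array, as in sklearn's FunctionTransformer):

flat, _, _ = steps.Flatten().fit_transform(X)
steps.FunctionTransformer(func=np.expm1).fit_transform(flat)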

Preprocessing - Scaling / Normalization

Each of the following classes from scikit-learn has been wrapped for usage as a Pipeline step. Each requires that the ElmStore has been through steps.Flatten().

The examples below continue with Xnew, a flattened ElmStore:

Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
  • KernelCenterer - See also KernelCenterer scikit-learn docs.
steps.KernelCenterer().fit_transform(Xnew)
  • MaxAbsScaler - See also MaxAbsScaler scikit-learn docs.
steps.MaxAbsScaler().fit_transform(Xnew)
  • MinMaxScaler - See also MinMaxScaler scikit-learn docs.
steps.MinMaxScaler().fit_transform(Xnew)
  • Normalizer - See also Normalizer scikit-learn docs.
steps.Normalizer().fit_transform(Xnew)
  • RobustScaler - See also RobustScaler scikit-learn docs.
steps.RobustScaler().fit_transform(Xnew)
  • PolynomialFeatures - See also PolynomialFeatures scikit-learn docs.
step = steps.PolynomialFeatures(degree=3,
                                interaction_only=False)
step.fit_transform(Xnew)
  • StandardScaler - See also StandardScaler scikit-learn docs.
steps.StandardScaler().fit_transform(Xnew)
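
Scaled output can be mapped back to rasters. A hedged sketch combining StandardScaler with InverseFlatten from the reshape steps above:

scaled, _, _ = steps.StandardScaler().fit_transform(Xnew)
rasters, _, _ = steps.InverseFlatten().fit_transform(scaled)
# rasters holds one 2-D DataArray per band, now standardized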

Encoding Preprocessors from sklearn

Each method here requires that the ElmStore has been through steps.Flatten() as follows:

Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
  • Binarizer - Binarize features. See also Binarizer docs from sklearn.
steps.Binarizer().fit_transform(Xnew)
  • Imputer - Impute missing values. See also Imputer docs from sklearn.
steps.Imputer().fit_transform(Xnew)
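
These encoders chain like any other steps. A hedged sketch (assuming Binarizer's threshold keyword is passed through to the underlying sklearn class):

imputed, _, _ = steps.Imputer().fit_transform(Xnew)
steps.Binarizer(threshold=0.5).fit_transform(imputed)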

Feature selectors

The following list shows the feature selectors that may be used in a Pipeline. With the exception of VarianceThreshold, each method requires y to be not None.

Setup for the examples:

from sklearn.linear_model import LinearRegression
X, y = random_elm_store(return_y=True)
X = steps.Flatten().fit_transform(X)[0]
  • RFE - See also RFE in sklearn docs. Example:
steps.RFE(estimator=LinearRegression()).fit_transform(X, y)
  • RFECV - See also RFECV in sklearn docs. Example:
steps.RFECV(estimator=LinearRegression()).fit_transform(X, y)
  • SelectFdr - See also SelectFdr in sklearn docs. Example:
steps.SelectFdr().fit_transform(X, y)
  • SelectFpr - See also SelectFpr in sklearn docs. Example:
steps.SelectFpr().fit_transform(X, y)
  • SelectFromModel - See also SelectFromModel in sklearn docs. Example:
steps.SelectFromModel(estimator=LinearRegression()).fit_transform(X, y)
  • SelectFwe - See also SelectFwe in sklearn docs. Example:
steps.SelectFwe().fit_transform(X, y)
  • SelectKBest - See also SelectKBest in sklearn docs. Example:
steps.SelectKBest(k=2).fit_transform(X, y)
  • SelectPercentile - See also SelectPercentile in sklearn docs. Example:
steps.SelectPercentile(percentile=50).fit_transform(X, y)
  • VarianceThreshold - See also VarianceThreshold in sklearn docs. Note that VarianceThreshold does not require y. Example:
steps.VarianceThreshold(threshold=6.92).fit_transform(X)
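
A fitted selector reduces the band dimension of the flat DataArray. A hedged check of the SelectKBest result (assuming fit_transform returns the usual (X, y, sample_weight) tuple):

Xsel, ysel, _ = steps.SelectKBest(k=2).fit_transform(X, y)
# Xsel.flat now has 2 columns: the 2 best-scoring bands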

Normalizing time dimension of 3-D Cube

The following two steps take an ElmStore whose DataArray (of any name) is a 3-D cube with a time dimension. They run descriptive stats along the time dimension and flatten the spatial (x, y) dims to a single space dimension (essentially a ravel of the (x, y) points).

Setup - make a compatible ElmStore:

from elm.readers import ElmStore
import numpy as np
import xarray as xr
def make_3d():
    arr = np.random.uniform(0, 1, 100000).reshape(100, 10, 100)
    return ElmStore({'band_1': xr.DataArray(arr,
                            coords=[('time', np.arange(100)),
                                    ('x', np.arange(10)),
                                    ('y', np.arange(100))],
                            dims=('time', 'x', 'y'),
                            attrs={})}, attrs={}, add_canvas=False)
X = make_3d()
  • TSDescribe - Run scipy.stats.describe and other summary stats along the time axis of a 3-D cube DataArray. Example:
s = steps.TSDescribe(band='band_1', axis=0)
Xnew, y, sample_weight = s.fit_transform(X)
Xnew.flat.band

The above code shows that the band dimension of Xnew.flat consists of different summary statistics, mostly from scipy.stats.describe:

<xarray.DataArray 'band' (band: 8)>
array(['var', 'skew', 'kurt', 'min', 'max', 'median', 'std', 'np_skew'],
      dtype='<U7')
Coordinates:
  * band     (band) <U7 'var' 'skew' 'kurt' 'min' 'max' 'median' 'std' 'np_skew'
  • TSProbs - TSProbs bins the values along the time axis, counts them, and returns the probabilities associated with the bin counts. Example:
fixed_bins = steps.TSProbs(band='band_1',
                           bin_size=0.5,
                           num_bins=152,
                           log_probs=True,
                           axis=0)
Xnew, y, sample_weight = fixed_bins.fit_transform(X)

The above would create the DataArray Xnew.flat with 152 columns consisting of the log-transformed bin probabilities (152 fixed bins of width 0.5).

And the following would use irregular (numpy.histogram) bins rather than fixed-width bins, returning probabilities without a log transform:

irregular_bins = steps.TSProbs(band='band_1',
                               num_bins=152,
                               log_probs=False,
                               axis=0)
Xnew, y, sample_weight = irregular_bins.fit_transform(X)
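
In both cases the result is a flattened ElmStore with one column per bin. A hedged check of the expected shape (the 10 x 100 spatial grid from make_3d ravels to 1000 rows):

Xnew.flat.shape  # expected: (1000, 152)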