elm.pipeline.steps
The examples below assume you have created a random ElmStore as follows:
from elm.pipeline.tests.util import random_elm_store
X = random_elm_store()
Operations to reshape an ElmStore
Flatten
- Flatten each 2-D DataArray in an ElmStore to create an ElmStore with a single DataArray called flat that is 2-D (each band raster is raveled from 2-D to a 1-D column in flat). Example:
steps.Flatten().fit_transform(X)
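For intuition, the raveling that Flatten performs can be sketched with plain NumPy (this illustrates the reshape only, not elm's internal implementation):

```python
import numpy as np

# Two 2-D "band" rasters, as an ElmStore might hold them
band_1 = np.arange(6).reshape(2, 3)
band_2 = np.arange(6, 12).reshape(2, 3)

# Flatten ravels each 2-D band into one 1-D column of a single 2-D array
flat = np.column_stack([band_1.ravel(), band_2.ravel()])
print(flat.shape)  # (6, 2): one row per pixel, one column per band
```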
Agg
- Aggregate over a dimension or axis. Example:
steps.Agg(axis=0, func='mean').fit_transform(X)
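The aggregation itself is just a reduction along the chosen axis; a NumPy sketch of what func='mean' with axis=0 computes:

```python
import numpy as np

arr = np.array([[1., 2.], [3., 4.], [5., 6.]])
# Aggregating with func='mean' over axis=0 collapses that dimension
agg = arr.mean(axis=0)
print(agg)  # [3. 4.]
```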
DropNaRows
- Remove null / NaN rows from an ElmStore that has been through steps.Flatten():
steps.DropNaRows().fit_transform(*steps.Flatten().fit_transform(X))
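On a plain 2-D array, the equivalent row-dropping operation looks like this (a NumPy sketch, not elm's internals):

```python
import numpy as np

flat = np.array([[1., 2.], [np.nan, 3.], [4., 5.]])
# Keep only the rows in which no column is NaN
kept = flat[~np.isnan(flat).any(axis=1)]
print(kept.shape)  # (2, 2)
```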
InverseFlatten
- Convert a flattened ElmStore back to 2-D rasters as separate DataArray values in an ElmStore. Example:
steps.InverseFlatten().fit_transform(*steps.Flatten().fit_transform(X))
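The inverse operation is a reshape of each 1-D column back to the original 2-D raster shape; sketched with NumPy:

```python
import numpy as np

shape = (2, 3)                      # original raster shape
band_1 = np.arange(6).reshape(shape)
flat_col = band_1.ravel()           # what Flatten stores for this band
restored = flat_col.reshape(shape)  # the reshape InverseFlatten applies
print(np.array_equal(restored, band_1))  # True
```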
Use an unsupervised feature extractor
Transform
- steps.Transform allows one to use any sklearn.decomposition method in an elm Pipeline. Partial fit of the feature extractor can be accomplished by giving partial_fit_batches at initialization:
from sklearn.decomposition import IncrementalPCA
X, y, sample_weight = steps.Flatten().fit_transform(X)
pca = steps.Transform(IncrementalPCA(),
                      partial_fit_batches=2)
pca.fit_transform(X)
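Independently of elm, the batched fitting that partial_fit_batches implies can be sketched directly with scikit-learn's IncrementalPCA (the batch splitting below is illustrative, not elm's exact batching logic):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
data = rng.uniform(0, 1, (200, 4))

# Fit incrementally over 2 batches, then transform the full sample
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(data, 2):
    ipca.partial_fit(batch)
components = ipca.transform(data)
print(components.shape)  # (200, 2)
```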
Run a user-given callable
There are two choices for running a user-given callable in a Pipeline. Using ModifySample is the most general, taking any shape of X, y and sample_weight arguments, while FunctionTransformer requires that the ElmStore has been through steps.Flatten().
ModifySample
- The following shows an example function with the required signature for use with ModifySample. It divides the values in each DataArray by their maximum. Note the function always returns a tuple of (X, y, sample_weight), even if y and sample_weight are not used by the function:
def modifier(X, y=None, sample_weight=None, **kwargs):
    for band in X.data_vars:
        arr = getattr(X, band)
        if kwargs.get('normalize'):
            arr.values /= arr.values.max()
    return X, y, sample_weight

steps.ModifySample(modifier, normalize=True).fit_transform(X)
FunctionTransformer
- Here is an example using the FunctionTransformer from sklearn:
import numpy as np
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Xnew, y, sample_weight = steps.FunctionTransformer(func=np.log).fit_transform(Xnew)
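The same transformation on a bare NumPy array, using scikit-learn's FunctionTransformer directly (illustrative; elm's wrapper additionally handles the ElmStore bookkeeping):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

data = np.array([[1., np.e], [np.e ** 2, 1.]])
# Apply np.log elementwise to every value in the array
logged = FunctionTransformer(func=np.log).fit_transform(data)
print(logged)  # [[0. 1.]
               #  [2. 0.]]
```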
Preprocessing - Scaling / Normalization
Each of the following classes from scikit-learn has been wrapped for usage as a Pipeline step. Each requires that the ElmStore has been through steps.Flatten().
The examples below continue with Xnew, a flattened ElmStore:
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
KernelCenterer
- See also KernelCenterer scikit-learn docs.
steps.KernelCenterer().fit_transform(Xnew)
MaxAbsScaler
- See also MaxAbsScaler scikit-learn docs.
steps.MaxAbsScaler().fit_transform(Xnew)
MinMaxScaler
- See also MinMaxScaler scikit-learn docs.
steps.MinMaxScaler().fit_transform(Xnew)
Normalizer
- See also Normalizer scikit-learn docs.
steps.Normalizer().fit_transform(Xnew)
RobustScaler
- See also RobustScaler scikit-learn docs.
steps.RobustScaler().fit_transform(Xnew)
PolynomialFeatures
- See also PolynomialFeatures scikit-learn docs.
step = steps.PolynomialFeatures(degree=3,
                                interaction_only=False)
step.fit_transform(Xnew)
StandardScaler
- See also StandardScaler scikit-learn docs.
steps.StandardScaler().fit_transform(Xnew)
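What these wrapped scalers do to the underlying flat array can be seen with scikit-learn directly; for example, StandardScaler centers each column to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1., 10.], [2., 20.], [3., 30.]])
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # ~[1. 1.]
```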
Encoding Preprocessors from sklearn
Each method here requires that the ElmStore has been through steps.Flatten() as follows:
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Binarizer
- Binarize features. See also Binarizer docs from sklearn.
steps.Binarizer().fit_transform(Xnew)
Imputer
- Impute missing values. See also Imputer docs from sklearn.
steps.Imputer().fit_transform(Xnew)
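Note that in recent scikit-learn releases the Imputer class has been replaced by SimpleImputer; the underlying operation on a plain array looks like this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1., 2.], [np.nan, 4.], [3., 6.]])
# Replace each NaN with the mean of its column (the default strategy)
filled = SimpleImputer(strategy='mean').fit_transform(data)
print(filled)  # NaN at [1, 0] becomes 2.0, the mean of 1.0 and 3.0
```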
Feature selectors
The following list shows the feature selectors that may be used in a Pipeline. With the exception of VarianceThreshold, each method requires that y is not None.
Setup for the examples:
from sklearn.linear_model import LinearRegression
X, y = random_elm_store(return_y=True)
X = steps.Flatten().fit_transform(X)[0]
RFE
- See also RFE in sklearn docs. Example:
steps.RFE(estimator=LinearRegression()).fit_transform(X, y)
RFECV
- See also RFECV in sklearn docs. Example:
steps.RFECV(estimator=LinearRegression()).fit_transform(X, y)
SelectFdr
- See also SelectFdr in sklearn docs. Example:
steps.SelectFdr().fit_transform(X, y)
SelectFpr
- See also SelectFpr in sklearn docs. Example:
steps.SelectFpr().fit_transform(X, y)
SelectFromModel
- See also SelectFromModel in sklearn docs. Example:
steps.SelectFromModel(estimator=LinearRegression()).fit_transform(X, y)
SelectFwe
- See also SelectFwe in sklearn docs. Example:
steps.SelectFwe().fit_transform(X, y)
SelectKBest
- See also SelectKBest in sklearn docs. Example:
steps.SelectKBest(k=2).fit_transform(X, y)
SelectPercentile
- See also SelectPercentile in sklearn docs. Example:
steps.SelectPercentile(percentile=50).fit_transform(X, y)
VarianceThreshold
- See also VarianceThreshold in sklearn docs. Example:
steps.VarianceThreshold(threshold=6.92).fit_transform(X)
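The selection step on the underlying array can be illustrated with scikit-learn directly; here SelectKBest keeps the single feature most associated with y (the synthetic data and the f_classif scorer below are illustrative choices):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X_arr = rng.uniform(0, 1, (100, 5))
y_arr = (X_arr[:, 2] > 0.5).astype(int)   # only column 2 is informative

# k=1 keeps the one column with the highest f_classif score
selected = SelectKBest(f_classif, k=1).fit_transform(X_arr, y_arr)
print(selected.shape)  # (100, 1)
```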
Normalizing time dimension of 3-D Cube
The following two functions take an ElmStore with a DataArray of any name that is a 3-D cube with a time dimension. The functions run descriptive stats along the time dimension and flatten the spatial (x, y) dims to space (essentially a ravel of the (x, y) points).
Setup - make a compatible ElmStore:
from elm.readers import ElmStore
import numpy as np
import xarray as xr

def make_3d():
    arr = np.random.uniform(0, 1, 100000).reshape(100, 10, 100)
    return ElmStore({'band_1': xr.DataArray(arr,
                         coords=[('time', np.arange(100)),
                                 ('x', np.arange(10)),
                                 ('y', np.arange(100))],
                         dims=('time', 'x', 'y'),
                         attrs={})}, attrs={}, add_canvas=False)

X = make_3d()
TSDescribe
- Runs scipy.stats.describe and other stats along the time axis of a 3-D cube DataArray. Example:
s = steps.TSDescribe(band='band_1', axis=0)
Xnew, y, sample_weight = s.fit_transform(X)
Xnew.flat.band
The above code would show that the band dimension of Xnew consists of different summary statistics, mostly from scipy.stats.describe:
<xarray.DataArray 'band' (band: 8)>
array(['var', 'skew', 'kurt', 'min', 'max', 'median', 'std', 'np_skew'],
      dtype='<U7')
Coordinates:
  * band     (band) <U7 'var' 'skew' 'kurt' 'min' 'max' 'median' 'std' 'np_skew'
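The per-point statistics come from running scipy.stats.describe along the time axis; on a bare cube of the same shape:

```python
import numpy as np
from scipy.stats import describe

cube = np.random.uniform(0, 1, (100, 10, 100))  # dims: (time, x, y)
# Summarize each (x, y) point's time series along axis=0
stats = describe(cube, axis=0)
print(stats.mean.shape)      # (10, 100)
print(stats.variance.shape)  # (10, 100)
```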
TSProbs
- TSProbs bins the time series, counts bin membership, and returns the probabilities associated with the bin counts. An example:
fixed_bins = steps.TSProbs(band='band_1',
                           bin_size=0.5,
                           num_bins=152,
                           log_probs=True,
                           axis=0)
Xnew, y, sample_weight = fixed_bins.fit_transform(X)
The above would create the DataArray Xnew.flat with 152 columns consisting of the log-transformed bin probabilities (152 bins of width 0.5).
And the following would use irregular (numpy.histogram) bins rather than fixed bins, and return probabilities without the log transform first:
irregular_bins = steps.TSProbs(band='band_1',
                               num_bins=152,
                               log_probs=False,
                               axis=0)
Xnew, y, sample_weight = irregular_bins.fit_transform(X)
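The probabilities themselves can be sketched with numpy.histogram on a single time series (illustrative of the computation, not elm's exact binning):

```python
import numpy as np

series = np.random.uniform(0, 1, 1000)   # one point's time series
counts, edges = np.histogram(series, bins=152)
probs = counts / counts.sum()            # bin probabilities, which sum to 1
print(probs.shape)  # (152,)
```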