elm.pipeline.steps
The examples below assume you have created a random ElmStore as follows:
from elm.pipeline import steps
from elm.pipeline.tests.util import random_elm_store
X = random_elm_store()
Operations to reshape an ElmStore
Flatten - Flatten each 2-D DataArray in an ElmStore to create an ElmStore with a single DataArray called flat that is 2-D (each band raster is raveled from 2-D to a 1-D column in flat). Example:
steps.Flatten().fit_transform(X)
Agg - Aggregate over a dimension or axis. Example:
steps.Agg(axis=0, func='mean').fit_transform(X)
DropNaRows - Remove null / NaN rows from an ElmStore that has been through steps.Flatten(). Example:
steps.DropNaRows().fit_transform(*steps.Flatten().fit_transform(X))
InverseFlatten - Convert a flattened ElmStore back to 2-D rasters as separate DataArray values in an ElmStore. Example:
steps.InverseFlatten().fit_transform(*steps.Flatten().fit_transform(X))
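To make the reshaping concrete, here is a minimal NumPy-only sketch of the idea behind Flatten, DropNaRows and InverseFlatten. It does not use elm: the dict named bands is a hypothetical stand-in for an ElmStore, and the row ordering and bookkeeping are assumptions made for illustration, not the library's implementation.

```python
import numpy as np

# Hypothetical stand-in for an ElmStore: two 2-D band rasters of equal shape.
rng = np.random.default_rng(0)
bands = {
    "band_1": rng.uniform(0, 1, (4, 5)),
    "band_2": rng.uniform(0, 1, (4, 5)),
}
bands["band_1"][0, 0] = np.nan  # one missing pixel

# "Flatten": ravel each raster to a 1-D column and stack the columns.
flat = np.column_stack([arr.ravel() for arr in bands.values()])

# "DropNaRows": keep only rows (pixels) with no NaN in any band,
# remembering which rows were kept so the rasters can be rebuilt.
keep = ~np.isnan(flat).any(axis=1)
flat_valid = flat[keep]

# "InverseFlatten": put the rows back and reshape each column to 2-D.
# Dropped pixels come back as NaN in every band.
restored = np.full(flat.shape, np.nan)
restored[keep] = flat_valid
rasters = {
    name: restored[:, i].reshape(4, 5) for i, name in enumerate(bands)
}
```

In this sketch a pixel that is missing in one band is dropped from every band, because the flattened array holds one row per pixel.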
Use an unsupervised feature extractor
Transform - steps.Transform allows one to use any sklearn.decomposition method in an elm Pipeline. Partial fit of the feature extractor can be accomplished by giving partial_fit_batches at initialization:
from sklearn.decomposition import IncrementalPCA
X, y, sample_weight = steps.Flatten().fit_transform(X)
pca = steps.Transform(IncrementalPCA(),
                      partial_fit_batches=2)
pca.fit_transform(X)
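For readers unfamiliar with partial fitting, the sketch below shows the underlying scikit-learn pattern on a plain NumPy array, without elm. The array named flat and the even two-way split are assumptions for illustration; how steps.Transform actually forms its batches is not described here.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
# Hypothetical flattened samples with shape (space, band).
flat = rng.uniform(0, 1, (200, 4))

pca = IncrementalPCA(n_components=2)
# Feed the rows in two batches; each call updates the fitted components.
for batch in np.array_split(flat, 2):
    pca.partial_fit(batch)

# Project all rows onto the two fitted components.
reduced = pca.transform(flat)
```

Each batch must contain at least n_components rows for partial_fit to succeed.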
Run a user-given callable
There are two choices for running a user-given callable in a Pipeline. Using ModifySample is the most general, taking any shape of X, y and sample_weight arguments, while FunctionTransformer requires that the ElmStore has been through steps.Flatten().
ModifySample - The following shows an example function with the required signature for use with ModifySample. When normalize=True is passed, it divides all the values in each DataArray by their maximum. Note the function always returns a tuple of (X, y, sample_weight), even if y and sample_weight are not used by the function:
def modifier(X, y=None, sample_weight=None, **kwargs):
    for band in X.data_vars:
        arr = getattr(X, band)
        if kwargs.get('normalize'):
            arr.values /= arr.values.max()
    return X, y, sample_weight
steps.ModifySample(modifier, normalize=True).fit_transform(X)
FunctionTransformer - Here is an example using the FunctionTransformer from sklearn:
import numpy as np
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Xnew, y, sample_weight = steps.FunctionTransformer(func=np.log).fit_transform(Xnew)
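The difference between the two callable styles can be sketched without elm. Below, scale_by_max is a hypothetical callable following the (X, y, sample_weight) contract, applied to a plain dict of arrays standing in for an ElmStore, while np.log is an element-wise function of the kind FunctionTransformer applies to a flattened 2-D array.

```python
import numpy as np

def scale_by_max(X, y=None, sample_weight=None, **kwargs):
    """Hypothetical ModifySample-style callable on a dict of arrays."""
    if kwargs.get("normalize"):
        # Build new arrays rather than mutating the caller's data.
        X = {name: arr / arr.max() for name, arr in X.items()}
    return X, y, sample_weight  # always the full 3-tuple

bands = {"band_1": np.array([[1.0, 2.0], [4.0, 8.0]])}
Xout, yout, wout = scale_by_max(bands, normalize=True)

# An element-wise function needs no special signature; it only has to
# accept and return a 2-D array.
logged = np.log(np.array([[1.0, np.e]]))
```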
Preprocessing - Scaling / Normalization
Each of the following classes from scikit-learn has been wrapped for usage as a Pipeline step. Each requires that the ElmStore has been through steps.Flatten().
The examples below continue with Xnew, a flattened ElmStore:
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
KernelCenterer - See also the KernelCenterer scikit-learn docs.
steps.KernelCenterer().fit_transform(Xnew)
MaxAbsScaler - See also the MaxAbsScaler scikit-learn docs.
steps.MaxAbsScaler().fit_transform(*steps.Flatten().fit_transform(X))
MinMaxScaler - See also the MinMaxScaler scikit-learn docs.
steps.MinMaxScaler().fit_transform(Xnew)
Normalizer - See also the Normalizer scikit-learn docs.
steps.Normalizer().fit_transform(Xnew)
RobustScaler - See also the RobustScaler scikit-learn docs.
steps.RobustScaler().fit_transform(Xnew)
PolynomialFeatures - See also the PolynomialFeatures scikit-learn docs.
step = steps.PolynomialFeatures(degree=3,
                                interaction_only=False)
step.fit_transform(Xnew)
StandardScaler - See also the StandardScaler scikit-learn docs.
steps.StandardScaler().fit_transform(Xnew)
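As a rough guide to what three of these scalers compute, the following NumPy-only sketch applies the column-wise formulas to a plain array. The array named flat is a hypothetical stand-in for the flattened (space, band) data, and the code illustrates the arithmetic only, not the wrapped classes themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
flat = rng.uniform(0, 10, (50, 3))  # hypothetical (space, band) array

# StandardScaler: zero mean and unit variance per column (population std).
standardized = (flat - flat.mean(axis=0)) / flat.std(axis=0)

# MinMaxScaler: map each column onto [0, 1].
col_min, col_max = flat.min(axis=0), flat.max(axis=0)
min_maxed = (flat - col_min) / (col_max - col_min)

# MaxAbsScaler: divide each column by its largest absolute value.
max_abs = flat / np.abs(flat).max(axis=0)
```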
Encoding Preprocessors from sklearn
Each method here requires that the ElmStore has been through steps.Flatten() as follows:
Xnew, y, sample_weight = steps.Flatten().fit_transform(X)
Binarizer - Binarize features. See also the Binarizer docs from sklearn.
steps.Binarizer().fit_transform(Xnew)
Imputer - Impute missing values. See also the Imputer docs from sklearn.
steps.Imputer().fit_transform(Xnew)
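The two operations can be illustrated on a small array with NumPy alone. This is a sketch that assumes a mean imputation strategy and a binarization threshold of 0.0; the array named flat is a hypothetical stand-in for flattened data.

```python
import numpy as np

flat = np.array([[0.5, np.nan],
                 [-1.0, 2.0],
                 [0.0, 4.0]])

# Imputation with a "mean" strategy: replace NaN by the column mean
# computed from the non-missing values.
col_means = np.nanmean(flat, axis=0)
imputed = np.where(np.isnan(flat), col_means, flat)

# Binarization with threshold 0.0: values above the threshold become 1,
# everything else (including values equal to the threshold) becomes 0.
binary = (imputed > 0.0).astype(float)
```

Imputing before binarizing matters here, because a comparison against NaN is always false and would silently turn missing values into 0.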
Feature selectors
The following list shows the feature selectors that may be used in a Pipeline. The methods, with the exception of VarianceThreshold, each require y to be not None.
Setup for the examples:
from sklearn.linear_model import LinearRegression
X, y = random_elm_store(return_y=True)
X = steps.Flatten().fit_transform(X)[0]
RFE - See also RFE in the sklearn docs. Example:
steps.RFE(estimator=LinearRegression()).fit_transform(X, y)
RFECV - See also RFECV in the sklearn docs. Example:
steps.RFECV(estimator=LinearRegression()).fit_transform(X, y)
SelectFdr - See also SelectFdr in the sklearn docs. Example:
steps.SelectFdr().fit_transform(X, y)
SelectFpr - See also SelectFpr in the sklearn docs. Example:
steps.SelectFpr().fit_transform(X, y)
SelectFromModel - See also SelectFromModel in the sklearn docs. Example:
steps.SelectFromModel(estimator=LinearRegression()).fit_transform(X, y)
SelectFwe - See also SelectFwe in the sklearn docs. Example:
steps.SelectFwe().fit_transform(X, y)
SelectKBest - See also SelectKBest in the sklearn docs. Example:
steps.SelectKBest(k=2).fit_transform(X, y)
SelectPercentile - See also SelectPercentile in the sklearn docs. Example:
steps.SelectPercentile(percentile=50).fit_transform(X, y)
VarianceThreshold - See also VarianceThreshold in the sklearn docs. Example:
steps.VarianceThreshold(threshold=6.92).fit_transform(X)
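VarianceThreshold is the one selector here that needs no y, since it looks only at the columns of X. The NumPy-only sketch below shows the idea on a hypothetical array named flat; the threshold value is illustrative and the code is not the wrapped class.

```python
import numpy as np

rng = np.random.default_rng(0)
flat = np.column_stack([
    rng.normal(0, 5.0, 500),   # high-variance column (variance near 25)
    rng.normal(0, 0.1, 500),   # low-variance column (variance near 0.01)
    np.full(500, 3.0),         # constant column (variance 0)
])

threshold = 1.0  # illustrative value
variances = flat.var(axis=0)
selected = variances > threshold   # boolean mask of kept columns
reduced = flat[:, selected]
```

The variance depends on the units of each column, so a threshold chosen for one dataset rarely transfers to another without rescaling.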
Normalizing time dimension of 3-D Cube
The following two functions take an ElmStore with a DataArray of any name that is a 3-D cube with a time dimension. The functions run descriptive stats along the time dimension and flatten the spatial (x, y) dims to space (essentially a ravel of the (x, y) points).
Setup - make a compatible ElmStore:
from elm.readers import ElmStore
import numpy as np
import xarray as xr
def make_3d():
    arr = np.random.uniform(0, 1, 100000).reshape(100, 10, 100)
    return ElmStore({'band_1': xr.DataArray(arr,
                                coords=[('time', np.arange(100)),
                                        ('x', np.arange(10)),
                                        ('y', np.arange(100))],
                                dims=('time', 'x', 'y'),
                                attrs={})}, attrs={}, add_canvas=False)
X = make_3d()
TSDescribe - Run scipy.stats.describe and other stats along the time axis of a 3-D cube DataArray. Example:
s = steps.TSDescribe(band='band_1', axis=0)
Xnew, y, sample_weight = s.fit_transform(X)
Xnew.flat.band
The above code would show the band dimension of Xnew consists of different summary statistics, mostly from scipy.stats.describe:
<xarray.DataArray 'band' (band: 8)>
array(['var', 'skew', 'kurt', 'min', 'max', 'median', 'std', 'np_skew'],
dtype='<U7')
Coordinates:
* band (band) <U7 'var' 'skew' 'kurt' 'min' 'max' 'median' 'std' 'np_skew'
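The shape of the result can be understood from a NumPy-only sketch: reduce along the time axis, then ravel the (x, y) grid so that each statistic becomes one column. The code below computes only a subset of the statistics listed above, assumes a (space, band) layout for the output, and is not the TSDescribe implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.uniform(0, 1, (100, 10, 100))  # dims: (time, x, y)

# Reduce along the time axis; each result has shape (x, y) == (10, 100).
stats = {
    "var": cube.var(axis=0, ddof=1),
    "min": cube.min(axis=0),
    "max": cube.max(axis=0),
    "median": np.median(cube, axis=0),
    "std": cube.std(axis=0),
}

# Ravel the (x, y) grid to "space" and use one column per statistic.
flat = np.column_stack([s.ravel() for s in stats.values()])
band_names = list(stats)
```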
TSProbs - TSProbs will bin the values along the time axis, count them, and return the probabilities associated with the bin counts. Example:
fixed_bins = steps.TSProbs(band='band_1',
                           bin_size=0.5,
                           num_bins=152,
                           log_probs=True,
                           axis=0)
Xnew, y, sample_weight = fixed_bins.fit_transform(X)
The above would create the DataArray Xnew.flat with 152 columns consisting of the log transformed bin probabilities (152 bins of 0.5 width).
And the following would use irregular (numpy.histogram) bins rather than fixed bins and return probabilities without log transform first:
irregular_bins = steps.TSProbs(band='band_1',
                               num_bins=152,
                               log_probs=False,
                               axis=0)
Xnew, y, sample_weight = irregular_bins.fit_transform(X)
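The fixed-bin case can be sketched with NumPy alone: build one time series per pixel, histogram it against shared bin edges, and divide the counts by their total. The small cube, the bin values, the edges starting at zero, and the handling of empty bins are all assumptions for illustration, not the behavior of TSProbs.

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.uniform(0, 1, (50, 3, 4))  # small (time, x, y) cube

bin_size, num_bins = 0.25, 4  # illustrative values covering [0, 1]
edges = np.arange(num_bins + 1) * bin_size

# One time series per pixel: move time last, then ravel (x, y) to space.
series = np.moveaxis(cube, 0, -1).reshape(-1, cube.shape[0])

# Count values per bin for each pixel and convert counts to probabilities.
counts = np.stack([np.histogram(ts, bins=edges)[0] for ts in series])
probs = counts / counts.sum(axis=1, keepdims=True)

# Optional log transform; empty bins are given -inf here (an assumption).
with np.errstate(divide="ignore"):
    log_probs = np.log(probs)
```

Each row of probs sums to 1, with one row per pixel and one column per bin, which mirrors the one-column-per-bin layout described above.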