Pipeline

Overview of Pipeline in elm

elm.pipeline.Pipeline allows a sequence of transformations on samples before fitting, transforming, and/or predicting from a scikit-learn estimator. elm.pipeline.Pipeline is similar in concept to the Pipeline in scikit-learn (sklearn.pipeline.Pipeline) but differs in several ways described below.
- Data sources for a Pipeline: In elm, fitting expects X to be an ElmStore or xarray.Dataset rather than a numpy array as in scikit-learn. This allows the Pipeline of transformations to include operations on cubes and other data structures common in satellite data machine learning.
- Transformations: In scikit-learn, each step in a Pipeline passes a numpy array to the next step by way of a fit_transform method. In elm, a Pipeline always passes a tuple of (X, y, sample_weight), where X is an ElmStore or xarray.Dataset and y and sample_weight are numpy arrays or None.
- Partial Fit for Large Samples: In elm, a transformer with a partial_fit method, such as sklearn.decomposition.IncrementalPCA, may be partially fit several times as a step in a Pipeline, and the final estimator may also use partial_fit several times, with dask-distributed for parallelization.
- Multi-Model / Multi-Sample Fitting: In elm, a Pipeline can be fit with:
  - fit_ensemble: This method repeats model fitting over a series of samples and/or an ensemble of Pipeline instances. The Pipeline instances in the ensemble may or may not have the same initialization parameters. fit_ensemble can run in generations, optionally applying user-given model selection logic between generations. fit_ensemble is aimed at improved model fitting in cases where a representative sample is large and/or there is a need to account for parameter uncertainty.
  - fit_ea: This method uses Distributed Evolutionary Algorithms in Python (deap) to run a genetic algorithm, typically NSGA-2, that selects the best Pipeline instance(s). The interfaces for fit_ea and fit_ensemble are similar, but fit_ea takes an evo_params argument to configure the genetic algorithm.
- Multi-Model / Multi-Sample Prediction: elm's Pipeline has a method predict_many that can use dask-distributed to predict from one or more Pipeline instances and/or one or more samples. predict_many will predict for all models in the final ensemble output by fit_ensemble.
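The (X, y, sample_weight) tuple-passing convention can be pictured with a toy stand-in in plain Python. The step functions below are hypothetical, not part of elm's API; they only illustrate that every step receives and returns the full tuple rather than a bare array:

```python
# Toy illustration of elm's step convention: each transformer receives and
# returns a (X, y, sample_weight) tuple instead of a bare numpy array.
# drop_nan_rows and center_columns are hypothetical step names.

def drop_nan_rows(X, y, sample_weight):
    # v == v is False only for NaN, so this keeps rows with no NaN values
    keep = [i for i, row in enumerate(X) if all(v == v for v in row)]
    X = [X[i] for i in keep]
    y = None if y is None else [y[i] for i in keep]
    sample_weight = None if sample_weight is None else [sample_weight[i] for i in keep]
    return X, y, sample_weight

def center_columns(X, y, sample_weight):
    # subtract each column's mean; y and sample_weight pass through unchanged
    means = [sum(col) / len(col) for col in zip(*X)]
    X = [[v - m for v, m in zip(row, means)] for row in X]
    return X, y, sample_weight

X = [[1.0, 2.0], [float('nan'), 3.0], [3.0, 4.0]]
y = [0, 1, 0]
sample_weight = None
for step in (drop_nan_rows, center_columns):
    X, y, sample_weight = step(X, y, sample_weight)
```

After the loop, X is the centered array with the NaN row removed and y has been filtered consistently, which is the point of threading the whole tuple through each step.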
The following discusses each step of making a Pipeline that uses most of the features described above.
Data Sources for a Pipeline
Pipeline can be used for fitting / transforming / predicting from a single sample or a series of samples. For the fit_ensemble, fit_ea or predict_many methods of a Pipeline instance:

- To fit to a single sample, use the X keyword argument, and optionally the y and sample_weight keyword arguments.
- To fit to a series of samples, use the args_list and sampler keyword arguments.
If X is given, it is assumed to be an ElmStore or xarray.Dataset. If sampler is given with args_list, then each element of args_list is unpacked as arguments to the callable sampler. There is a special case of giving sampler as earthio.band_selection.select_from_file, which allows using the functions from earthio for reading common formats and selecting bands from files (the band_specs argument). Here is an example that uses select_from_file to load multi-band HDF4 arrays:
import glob
import os

from earthio import LayerSpec, load_hdf4_meta
from earthio.band_selection import select_from_file
from earthio.metadata_selection import meta_is_day

# Assumes the ELM_EXAMPLE_DATA_PATH environment variable points at the
# example data directory
ELM_EXAMPLE_DATA_PATH = os.environ['ELM_EXAMPLE_DATA_PATH']

# Each LayerSpec matches a band by metadata and assigns it a name
band_specs = list(map(lambda x: LayerSpec(**x),
                 [{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
                  {'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
                  {'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
                  {'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
                  {'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
                  {'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
                  {'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'},
                  {'search_key': 'long_name', 'search_value': "Band 9 ", 'name': 'band_9'},
                  {'search_key': 'long_name', 'search_value': "Band 10 ", 'name': 'band_10'},
                  {'search_key': 'long_name', 'search_value': "Band 11 ", 'name': 'band_11'}]))

# Keep only daytime files
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if meta_is_day(load_hdf4_meta(f))]

data_source = {
    'sampler': select_from_file,
    'band_specs': band_specs,
    'args_list': HDF4_FILES,
}
Alternatively, to train on a single HDF4 file, we could have done:

from earthio import load_array, load_hdf4_meta
from earthio.metadata_selection import meta_is_day

HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if meta_is_day(load_hdf4_meta(f))]

data_source = {'X': load_array(HDF4_FILES[0], band_specs=band_specs)}
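The sampler / args_list contract described above (each element of args_list is unpacked as the sampler's arguments) can be sketched with a hypothetical sampler, independent of earthio:

```python
# Hypothetical sampler: elm unpacks each element of args_list as the
# arguments of the sampler callable, roughly like the loop below.
def sampler(path, band):
    # stand-in for loading an ElmStore from a file
    return {'path': path, 'band': band}

args_list = [('f1.hdf', 'band_1'), ('f2.hdf', 'band_2')]
samples = [sampler(*args) for args in args_list]
```

In the earthio example above, each element of args_list is a single file path, so select_from_file receives one path per sample.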
Transformations
A Pipeline is created by giving a list of steps. The steps before the final step are known as transformers, and the final step is the estimator. See also the full docs on elm.pipeline.steps.
Here is an example Pipeline of transformations before K-Means:
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

from elm.pipeline import steps, Pipeline

pipeline_steps = [steps.Flatten(),
                  ('scaler', steps.StandardScaler()),
                  ('pca', steps.Transform(IncrementalPCA(n_components=4),
                                          partial_fit_batches=2)),
                  ('kmeans', MiniBatchKMeans(n_clusters=4, compute_labels=True))]
pipe = Pipeline(pipeline_steps)
The example above calls:

- steps.Flatten first (see transformers-flatten), as a utility for flattening our multi-band raster HDF4 sample(s) into an ElmStore with a single xarray.DataArray, called flat, with each band as a column in flat.
- StandardScaler with default arguments from sklearn.preprocessing (all other transformers from sklearn.preprocessing and sklearn.feature_selection are also attributes of elm.pipeline.steps and could be used here).
- PCA with elm.pipeline.steps.Transform to wrap scikit-learn transformers, allowing multiple calls to partial_fit within a single fitting task of the final estimator. steps.Transform is initialized with:
  - a scikit-learn transformer as an argument
  - partial_fit_batches as a keyword, defaulting to 1. Note: using partial_fit_batches != 1 requires a transformer with a partial_fit method.
- Finally, MiniBatchKMeans as the estimator.
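The idea behind partial_fit_batches is that an estimator with a partial_fit method accumulates its fit incrementally over batches rather than requiring the whole sample in one call. A minimal toy transformer (hypothetical, not an elm or scikit-learn class) shows the mechanic:

```python
# Toy transformer with a partial_fit method: it keeps a running mean so it
# can be fit batch-by-batch, the way IncrementalPCA is used above.
class RunningMeanScaler:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def partial_fit(self, X):
        # incremental (online) mean update over one batch
        for x in X:
            self.n += 1
            self.mean += (x - self.mean) / self.n
        return self

    def transform(self, X):
        return [x - self.mean for x in X]

scaler = RunningMeanScaler()
for batch in ([1.0, 2.0], [3.0, 4.0]):   # analogous to partial_fit_batches=2
    scaler.partial_fit(batch)
```

After both batches, scaler.mean equals the mean of all four values, even though no single call ever saw the full data, which is why partial_fit suits samples too large for memory.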
Multi-Model / Multi-Sample Fitting
There are two multi-model approaches to fitting that can be used with a Pipeline: fit_ensemble or fit_ea. The examples above of giving a data source to a Pipeline and of the transformation steps within one Pipeline instance work similarly for fit_ensemble and fit_ea.
Other similarities between fit_ea and fit_ensemble include the following common keyword arguments:

- scoring: a callable with a signature like elm.model_selection.kmeans.kmeans_aic (see the API docs) or a string like f_classif, an attribute name from sklearn.metrics
- scoring_kwargs: kwargs passed to the scoring callable if needed
- saved_ensemble_size: an integer indicating how many Pipeline estimators to retain in the final ensemble
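A scoring callable follows the pattern described above for kmeans_aic: it receives the fitted model and the sample, plus any scoring_kwargs. Here is a hypothetical sketch; the function name, the penalty keyword, and the FakeModel stand-in are all illustrative assumptions, not elm API:

```python
# Hypothetical scoring callable: score(model, X, **scoring_kwargs) -> float,
# where lower is better (as with AIC).
def inertia_per_sample(model, X, **scoring_kwargs):
    penalty = scoring_kwargs.get('penalty', 0.0)
    return model.inertia_ / len(X) + penalty

class FakeModel:
    # stand-in for a fitted MiniBatchKMeans, which exposes inertia_
    inertia_ = 10.0

score = inertia_per_sample(FakeModel(), [[0.0, 0.0]] * 5, penalty=1.0)
```

scoring_kwargs passed to fit_ensemble or fit_ea would be forwarded to the callable the same way penalty is here.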
Read more on controlling ensemble or evolutionary algorithm approaches to fitting.
Multi-Model / Multi-Sample Prediction
After fit_ensemble or fit_ea has been called on a Pipeline instance, the instance will have the attribute ensemble, a list of (tag, pipeline) tuples which are the final Pipeline instances selected by either of the fitting functions (see also saved_ensemble_size under Controlling Ensemble Initialization). With a fitted Pipeline instance, predict_many can be called on the instance to predict with every ensemble member (Pipeline instance) on a single X sample, or with every ensemble member on every sample if sampler and args_list are given in place of X.
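Conceptually, predicting over the ensemble means mapping predict over the (tag, pipeline) pairs for each sample. The toy sketch below shows that shape only; it is not the predict_many implementation, which can additionally parallelize the work with dask-distributed:

```python
# Toy sketch of ensemble prediction: one prediction per (tag, model) pair.
class ConstantModel:
    # stand-in for a fitted Pipeline; always predicts its label
    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label] * len(X)

ensemble = [('tag_0', ConstantModel(0)), ('tag_1', ConstantModel(1))]
X = [[0.1, 0.2], [0.3, 0.4]]

# predict from every ensemble member for the single sample X
predictions = {tag: model.predict(X) for tag, model in ensemble}
```

With sampler and args_list in place of X, the same mapping runs once per generated sample as well as once per ensemble member.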
Read more on controlling predict_many.