Pipeline¶
Overview of Pipeline in elm¶
elm.pipeline.Pipeline allows a sequence of transformations on samples before fitting, transforming, and/or predicting from a scikit-learn estimator. elm.pipeline.Pipeline is similar to the concept of the Pipeline in scikit-learn (sklearn.pipeline.Pipeline) but differs in several ways described below.
- Data sources for a Pipeline: In elm, fitting expects X to be an ElmStore or xarray.Dataset rather than a numpy array as in scikit-learn. This allows the Pipeline of transformations to include operations on cubes and other data structures common in satellite data machine learning.
- Transformations: In scikit-learn each step in a Pipeline passes a numpy array to the next step by way of a fit_transform method. In elm, a Pipeline always passes a tuple of (X, y, sample_weight) where X is an ElmStore or xarray.Dataset and y and sample_weight are numpy arrays or None.
- Partial Fit for Large Samples: In elm a transformer with a partial_fit method, such as sklearn.decomposition.IncrementalPCA, may be partially fit several times as a step in a Pipeline, and the final estimator may also use partial_fit several times with dask-distributed for parallelization.
- Multi-Model / Multi-Sample Fitting: In elm, a Pipeline can be fit with:
  - fit_ensemble: This method repeats model fitting over a series of samples and/or an ensemble of Pipeline instances. The Pipeline instances in the ensemble may or may not have the same initialization parameters. fit_ensemble can run in generations, optionally applying user-given model selection logic between generations. This fit_ensemble method is aimed at improved model fitting in cases where a representative sample is large and/or there is a need to account for parameter uncertainty.
  - fit_ea: This method uses Distributed Evolutionary Algorithms in Python (deap) to run a genetic algorithm, typically NSGA-2, that selects the best Pipeline instance(s). The interfaces for fit_ea and fit_ensemble are similar, but fit_ea takes an evo_params argument to configure the genetic algorithm.
- Multi-Model / Multi-Sample Prediction: elm’s Pipeline has a method predict_many that can use dask-distributed to predict from one or more Pipeline instances and/or one or more samples (ElmStore instances). It will predict for all models in the final ensemble output by fit_ensemble.
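The (X, y, sample_weight) convention described in the list above can be illustrated with a minimal sketch. This is not elm's implementation: plain Python lists stand in for an ElmStore or xarray.Dataset, and run_steps, drop_missing and double are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, List, Optional, Tuple

# A sample is always the triple (X, y, sample_weight); y and sample_weight may be None.
Sample = Tuple[Any, Optional[list], Optional[list]]
Step = Callable[[Any, Optional[list], Optional[list]], Sample]


def run_steps(steps: List[Step], X, y=None, sample_weight=None) -> Sample:
    """Pass (X, y, sample_weight) through each step in order."""
    for step in steps:
        X, y, sample_weight = step(X, y, sample_weight)
    return X, y, sample_weight


def drop_missing(X, y=None, sample_weight=None) -> Sample:
    """Hypothetical step: drop rows containing None, keeping y and weights aligned."""
    keep = [i for i, row in enumerate(X) if None not in row]
    X = [X[i] for i in keep]
    y = None if y is None else [y[i] for i in keep]
    sample_weight = None if sample_weight is None else [sample_weight[i] for i in keep]
    return X, y, sample_weight


def double(X, y=None, sample_weight=None) -> Sample:
    """Hypothetical step: transform X only; y and sample_weight pass through."""
    return [[2 * v for v in row] for row in X], y, sample_weight


X_out, y_out, w_out = run_steps(
    [drop_missing, double],
    X=[[1, 2], [None, 3], [4, 5]],
    y=[0, 1, 1],
)
```

Passing the whole tuple, rather than X alone, is what lets a step such as drop_missing remove rows from X while keeping y and sample_weight in step with it.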
The following discusses each step of making a Pipeline that uses most of the features described above.
Data Sources for a Pipeline¶
Pipeline can be used for fitting / transforming / predicting from a single sample or series of samples. For the fit_ensemble, fit_ea or predict_many methods of a Pipeline instance:
- To fit to a single sample, use the X keyword argument, and optionally y and sample_weight keyword arguments.
- To fit to a series of samples, use the args_list and sampler keyword arguments.
If X is given, it is assumed to be an ElmStore or xarray.Dataset.
If sampler is given with args_list, then each element of args_list is unpacked as arguments to the callable sampler. There is a special case of giving sampler as elm.readers.band_selection.select_from_file which allows using the functions from elm.readers for reading common formats and selecting bands from files (the band_specs argument). Here is an example that uses select_from_file to load multi-band HDF4 arrays:
import glob
import os

from elm.readers import BandSpec
from elm.readers.band_selection import select_from_file
from elm.readers.metadata_selection import meta_is_day

# load_hdf4_meta and ELM_EXAMPLE_DATA_PATH are used below, but their imports
# are not shown in this example; bring them into scope before running it.

# One BandSpec per band: match on the long_name metadata value and assign a short name
band_specs = [BandSpec(**x) for x in
              [{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
               {'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
               {'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
               {'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
               {'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
               {'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
               {'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'},
               {'search_key': 'long_name', 'search_value': "Band 9 ", 'name': 'band_9'},
               {'search_key': 'long_name', 'search_value': "Band 10 ", 'name': 'band_10'},
               {'search_key': 'long_name', 'search_value': "Band 11 ", 'name': 'band_11'}]]

# Keep only the HDF4 files whose metadata passes meta_is_day
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if meta_is_day(load_hdf4_meta(f))]

# Keyword arguments describing a series of samples, one per file
data_source = {
    'sampler': select_from_file,
    'band_specs': band_specs,
    'args_list': HDF4_FILES,
}
Alternatively, to train on a single HDF4 file, we could have done:
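import glob
import os
from elm.readers import load_array
from elm.sample_util.metadata_selection import example_meta_is_day
# load_hdf4_meta, ELM_EXAMPLE_DATA_PATH and band_specs must already be in scope
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if example_meta_is_day(load_hdf4_meta(f))]
# Use the first matching file as the single sample X
data_source = {'X': load_array(HDF4_FILES[0], band_specs=band_specs)}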
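The choice between X and sampler / args_list can be sketched roughly as follows. This is an illustration under assumptions rather than elm's code: iter_samples and fake_reader are hypothetical names, each element of args_list is assumed to be a tuple of positional arguments, and extra keyword arguments (such as band_specs) are assumed to be forwarded to the sampler. How elm treats bare filenames in args_list is not covered here.

```python
from typing import Any, Callable, Iterable, Iterator, Optional


def iter_samples(
    X: Any = None,
    sampler: Optional[Callable[..., Any]] = None,
    args_list: Optional[Iterable[tuple]] = None,
    **sampler_kwargs: Any,
) -> Iterator[Any]:
    """Yield X once, or one sample per element of args_list."""
    if X is not None:
        yield X
        return
    if sampler is None or args_list is None:
        raise ValueError("Give either X or both sampler and args_list")
    for args in args_list:
        # Each element is unpacked as positional arguments to the sampler.
        yield sampler(*args, **sampler_kwargs)


def fake_reader(filename: str, scale: int = 1) -> dict:
    """Hypothetical sampler standing in for a file reader."""
    return {"source": filename, "values": [scale * n for n in (1, 2, 3)]}


# A series of samples: one call to the sampler per element of args_list
samples = list(iter_samples(sampler=fake_reader,
                            args_list=[("a.hdf",), ("b.hdf",)],
                            scale=10))

# A single in-memory sample
single = list(iter_samples(X={"source": "in-memory", "values": [0]}))
```

Because samples are produced lazily, one per call to the sampler, a long series of files never has to be held in memory at once.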
Transformations¶
A Pipeline is created by giving a list of steps - the steps before the final step are known as transformers and the final step is the estimator. See also the full docs on elm.pipeline.steps.
Here is an example Pipeline of transformations before K-Means:
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from elm.pipeline import steps, Pipeline
pipeline_steps = [steps.Flatten(),
                  ('scaler', steps.StandardScaler()),
                  ('pca', steps.Transform(IncrementalPCA(n_components=4), partial_fit_batches=2)),
                  ('kmeans', MiniBatchKMeans(n_clusters=4, compute_labels=True)),]
The example above calls:
- steps.Flatten first (see transformers-flatten), as a utility for flattening our multi-band raster HDF4 sample(s) into an ElmStore with a single xarray.DataArray, called flat, with each band as a column in flat.
- StandardScaler with default arguments from sklearn.preprocessing (all other transformers from sklearn.preprocessing and sklearn.feature_selection are also attributes of elm.pipeline.steps and could be used here).
- PCA with elm.pipeline.steps.Transform, which wraps scikit-learn transformers to allow multiple calls to partial_fit within a single fitting task of the final estimator. steps.Transform is initialized with:
  - A scikit-learn transformer as an argument
  - partial_fit_batches as a keyword, defaulting to 1. Note: using partial_fit_batches != 1 requires a transformer with a partial_fit method
- Finally MiniBatchKMeans
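To make the partial_fit_batches idea concrete, here is a small sketch of fitting a transformer incrementally. It is not how steps.Transform is implemented: RunningMeanCenterer and fit_in_batches are hypothetical, and the sketch assumes each batch is a contiguous slice of one sample, which may differ from how elm draws data for each partial_fit call.

```python
from typing import List, Sequence


class RunningMeanCenterer:
    """Hypothetical transformer that learns a mean incrementally."""

    def __init__(self) -> None:
        self.n_seen = 0
        self.mean_ = 0.0

    def partial_fit(self, values: Sequence[float]) -> "RunningMeanCenterer":
        # Update the mean with a new batch without revisiting earlier batches.
        total = self.mean_ * self.n_seen + sum(values)
        self.n_seen += len(values)
        self.mean_ = total / self.n_seen
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [v - self.mean_ for v in values]


def fit_in_batches(transformer, values: Sequence[float], partial_fit_batches: int = 1):
    """Call partial_fit once per batch; batches are contiguous slices (an assumption)."""
    if not hasattr(transformer, "partial_fit"):
        if partial_fit_batches != 1:
            raise ValueError("partial_fit_batches != 1 requires a partial_fit method")
        return transformer.fit(values)
    size = max(1, -(-len(values) // partial_fit_batches))  # ceiling division
    for start in range(0, len(values), size):
        transformer.partial_fit(values[start:start + size])
    return transformer


data = [1.0, 2.0, 3.0, 4.0]
centerer = fit_in_batches(RunningMeanCenterer(), data, partial_fit_batches=2)
centered = centerer.transform(data)
```

The check at the top of fit_in_batches mirrors the note in the list above: more than one batch only makes sense for a transformer that exposes partial_fit.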
Multi-Model / Multi-Sample Fitting¶
There are two multi-model approaches to fitting that can be used with a Pipeline: fit_ensemble or fit_ea. The examples above with a data source to a Pipeline and the transformation steps within one Pipeline instance work similarly in fit_ensemble and fit_ea.
- Other similarities between fit_ea and fit_ensemble include the following common keyword arguments:
  - scoring: a callable with a signature like elm.model_selection.kmeans.kmeans_aic (see API docs) or a string like f_classif, an attribute name from sklearn.metrics
  - scoring_kwargs: kwargs passed to the scoring callable if needed
  - saved_ensemble_size: an integer indicating how many Pipeline estimators to retain in the final ensemble
- Read more on controlling ensemble (fit_ensemble) or evolutionary algorithm (fit_ea) approaches to fitting.
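How scoring, scoring_kwargs and saved_ensemble_size might interact can be sketched as below. This is only an illustration: select_ensemble and toy_aic are hypothetical, the real signature of kmeans_aic is not reproduced, and the lower_is_better flag is an assumption added for the sketch. Plain dicts stand in for fitted Pipeline instances.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def select_ensemble(
    ensemble: List[Tuple[str, Any]],
    scoring: Callable[..., float],
    saved_ensemble_size: int,
    scoring_kwargs: Optional[Dict[str, Any]] = None,
    lower_is_better: bool = True,
) -> List[Tuple[str, Any]]:
    """Score each (tag, model) pair and keep the best saved_ensemble_size of them."""
    kwargs = scoring_kwargs or {}
    scored = [(scoring(model, **kwargs), tag, model) for tag, model in ensemble]
    scored.sort(key=lambda item: item[0], reverse=not lower_is_better)
    return [(tag, model) for _, tag, model in scored[:saved_ensemble_size]]


def toy_aic(model: Dict[str, float], penalty: float = 2.0) -> float:
    """Hypothetical AIC-like score: model error plus a penalty per cluster."""
    return model["error"] + penalty * model["n_clusters"]


candidates = [
    ("tag_0", {"error": 10.0, "n_clusters": 2}),
    ("tag_1", {"error": 4.0, "n_clusters": 4}),
    ("tag_2", {"error": 3.5, "n_clusters": 8}),
]
kept = select_ensemble(candidates, toy_aic, saved_ensemble_size=2,
                       scoring_kwargs={"penalty": 2.0})
```

Here tag_2 has the lowest raw error but is dropped, because the penalty on its cluster count gives it the worst score of the three.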
Multi-Model / Multi-Sample Prediction¶
After fit_ensemble or fit_ea has been called on a Pipeline instance, the instance will have the attribute ensemble, a list of (tag, pipeline) tuples which are the final Pipeline instances selected by either of the fitting functions (see also saved_ensemble_size in Controlling Ensemble Initialization). With a fitted Pipeline instance, predict_many can be called on the instance to predict from every ensemble member (Pipeline instance) on a single X sample, or from every ensemble member and every sample if sampler and args_list are given in place of X.
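The "every ensemble member on every sample" behaviour can be pictured with a short serial sketch. It is not predict_many itself: ThresholdModel and predict_all are hypothetical, lists of floats stand in for ElmStore samples, and keying the results by (tag, sample index) is an assumption made for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple


class ThresholdModel:
    """Hypothetical fitted model: labels each value 1 if it exceeds a cutoff."""

    def __init__(self, cutoff: float) -> None:
        self.cutoff = cutoff

    def predict(self, X: List[float]) -> List[int]:
        return [int(v > self.cutoff) for v in X]


def predict_all(
    ensemble: List[Tuple[str, Any]],
    samples: Iterable[List[float]],
) -> Dict[Tuple[str, int], List[int]]:
    """Predict from every ensemble member on every sample."""
    predictions = {}
    for sample_idx, X in enumerate(samples):
        for tag, model in ensemble:
            predictions[(tag, sample_idx)] = model.predict(X)
    return predictions


# ensemble mirrors the list of (tag, pipeline) tuples described above
ensemble = [("tag_0", ThresholdModel(0.5)), ("tag_1", ThresholdModel(2.0))]
result = predict_all(ensemble, samples=[[0.0, 1.0, 3.0], [2.5]])
```

Each (model, sample) pair is independent of the others, which is the property that makes this kind of loop a natural fit for parallel execution.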
Read more on controlling predict_many.