Pipeline

Overview of `Pipeline` in `elm`

elm.pipeline.Pipeline allows a sequence of transformations on samples before fitting, transforming, and/or predicting from a scikit-learn estimator. elm.pipeline.Pipeline is similar to the concept of the Pipeline in scikit-learn (sklearn.pipeline.Pipeline) but differs in several ways described below.

Data sources for a Pipeline: In elm, the fitting expects X to be an ElmStore or xarray.Dataset rather than a numpy array as in scikit-learn. This allows the Pipeline of transformations to include operations on cubes and other data structures common in satellite data machine learning.
Transformations: In scikit-learn each step in a Pipeline passes a numpy array to the next step by way of a fit_transform method. In elm, a Pipeline always passes a tuple of (X, y, sample_weight) where X is an ElmStore or xarray.Dataset and y and sample_weight are numpy arrays or None.
Partial Fit for Large Samples: In elm, a transformer with a partial_fit method, such as sklearn.decomposition.IncrementalPCA, may be partially fit several times as a step in a Pipeline, and the final estimator may also use partial_fit several times, with dask-distributed for parallelization.
Multi-Model / Multi-Sample Fitting: In elm, a Pipeline can be fit with:
- fit_ensemble: This method repeats model fitting over a series of samples and/or an ensemble of Pipeline instances. The Pipeline instances in the ensemble may or may not have the same initialization parameters. fit_ensemble can run in generations, optionally applying user-given model selection logic between generations. This method is aimed at improved model fitting in cases where a representative sample is large and/or there is a need to account for parameter uncertainty.
- fit_ea: This method uses Distributed Evolutionary Algorithms in Python (deap) to run a genetic algorithm, typically NSGA-2, that selects the best Pipeline instance(s). The interfaces of fit_ea and fit_ensemble are similar, but fit_ea takes an evo_params argument to configure the genetic algorithm.
Multi-Model / Multi-Sample Prediction: elm’s Pipeline has a method predict_many that can use dask-distributed to predict from one or more Pipeline instances and/or one or more samples (ElmStore instances). predict_many predicts for all models in the final ensemble output by fit_ensemble.
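To make the (X, y, sample_weight) convention from the Transformations item above concrete, here is a minimal sketch in plain Python. It is not elm's implementation: the classes Scale and DropZeroWeight and the helper run_steps are hypothetical, and X is a plain list of rows rather than an ElmStore or xarray.Dataset. The point is only that every step receives and returns the full tuple, so a step can change y or sample_weight alongside X.

```python
# Toy illustration only: these classes are hypothetical and are not elm's API.
class Scale:
    """Multiply every feature value in X by a constant factor."""
    def __init__(self, factor):
        self.factor = factor

    def fit_transform(self, X, y=None, sample_weight=None):
        X_new = [[value * self.factor for value in row] for row in X]
        # Return the full tuple so the next step sees y and sample_weight too.
        return X_new, y, sample_weight


class DropZeroWeight:
    """Drop rows whose sample weight is zero, keeping X, y and weights aligned."""
    def fit_transform(self, X, y=None, sample_weight=None):
        if sample_weight is None:
            return X, y, sample_weight
        keep = [i for i, w in enumerate(sample_weight) if w > 0]
        X_new = [X[i] for i in keep]
        y_new = None if y is None else [y[i] for i in keep]
        return X_new, y_new, [sample_weight[i] for i in keep]


def run_steps(steps, X, y=None, sample_weight=None):
    """Thread (X, y, sample_weight) through each step in order."""
    for step in steps:
        X, y, sample_weight = step.fit_transform(X, y, sample_weight)
    return X, y, sample_weight


X_out, y_out, w_out = run_steps(
    [Scale(2), DropZeroWeight()],
    X=[[1, 2], [3, 4], [5, 6]],
    y=[0, 1, 0],
    sample_weight=[1.0, 0.0, 0.5],
)
print(X_out, y_out, w_out)
```

Passing the whole tuple is what lets a step such as the hypothetical DropZeroWeight remove rows without leaving y and sample_weight out of step with X, which a numpy-array-only hand-off could not express.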

The following discusses each step of making a Pipeline that uses most of the features described above.

Data Sources for a `Pipeline`

Pipeline can be used for fitting / transforming / predicting from a single sample or series of samples. For the fit_ensemble, fit_ea or predict_many methods of a Pipeline instance:

To fit to a single sample, use the X keyword argument, and optionally y and sample_weight keyword arguments.
To fit to a series of samples, use the args_list and sampler keyword arguments.

If X is given, it is assumed to be an ElmStore or xarray.Dataset.

If sampler is given with args_list, then each element of args_list is unpacked as arguments to the callable sampler. As a special case, sampler can be earthio.band_selection.select_from_file, which allows using the functions from earthio for reading common formats and selecting bands from files (the band_specs argument). Here is an example that uses select_from_file to load multi-band HDF4 arrays:

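import glob
import os

from earthio import LayerSpec
from earthio.band_selection import select_from_file
from earthio.metadata_selection import meta_is_day
# load_hdf4_meta and ELM_EXAMPLE_DATA_PATH are assumed to be defined or imported elsewhere.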
band_specs = list(map(lambda x: LayerSpec(**x),
        [{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
         {'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
         {'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
         {'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
         {'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
         {'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
         {'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'},
         {'search_key': 'long_name', 'search_value': "Band 9 ", 'name': 'band_9'},
         {'search_key': 'long_name', 'search_value': "Band 10 ", 'name': 'band_10'},
         {'search_key': 'long_name', 'search_value': "Band 11 ", 'name': 'band_11'}]))
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if meta_is_day(load_hdf4_meta(f))]
data_source = {
    'sampler': select_from_file,
    'band_specs': band_specs,
    'args_list': HDF4_FILES,
}

Alternatively, to train on a single HDF4 file, we could have done:

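import glob
import os

from earthio import load_array
from earthio.metadata_selection import example_meta_is_day
# band_specs is the list built in the previous example; load_hdf4_meta and
# ELM_EXAMPLE_DATA_PATH are assumed to be defined or imported elsewhere.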
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if example_meta_is_day(load_hdf4_meta(f))]
data_source = {'X': load_array(HDF4_FILES[0], band_specs=band_specs)}
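The sampler / args_list convention described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not elm's code: toy_sampler and iter_samples are hypothetical names, the sketch assumes that a bare element such as a filename is treated as a single positional argument, and it assumes that extra keywords (like band_specs in the data source dictionary) are forwarded to the sampler.

```python
# Hypothetical stand-ins: toy_sampler and iter_samples are not part of elm or earthio.
def toy_sampler(filename, scale=1):
    """Pretend to load a sample from filename; returns a small dict instead."""
    return {"source": filename, "values": [scale * i for i in range(3)]}


def iter_samples(sampler, args_list, **sampler_kwargs):
    """Yield one sample per element of args_list by calling sampler on it."""
    for args in args_list:
        # Treat a bare (non-tuple) element as a single positional argument.
        if not isinstance(args, tuple):
            args = (args,)
        yield sampler(*args, **sampler_kwargs)


samples = list(iter_samples(toy_sampler, ["a.hdf", "b.hdf"], scale=10))
print(samples)
```

Because samples are produced lazily, one call per element of args_list, a series of large files never needs to be held in memory at the same time.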

Transformations

A Pipeline is created by giving a list of steps - the steps before the final step are known as transformers and the final step is the estimator. See also the full docs on elm.pipeline.steps.

Transformer steps must be taken from one of the classes in elm.pipeline.steps. The purpose of elm.pipeline.steps is to wrap preprocessors and transformers from scikit-learn for use with ElmStores or xarray.Datasets.

Here is an example Pipeline of transformations before K-Means:

from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

from elm.pipeline import steps, Pipeline
pipeline_steps = [steps.Flatten(),
                  ('scaler', steps.StandardScaler()),
                  ('pca', steps.Transform(IncrementalPCA(n_components=4), partial_fit_batches=2)),
                  ('kmeans', MiniBatchKMeans(n_clusters=4, compute_labels=True)),]

The example above calls:

steps.Flatten first (see the docs on the Flatten transformer), as a utility for flattening our multi-band raster HDF4 sample(s) into an ElmStore with a single xarray.DataArray, called flat, with each band as a column in flat.
StandardScaler with default arguments from sklearn.preprocessing (all other transformers from sklearn.preprocessing and sklearn.feature_selection are also attributes of elm.pipeline.steps and could be used here).
PCA with elm.pipeline.steps.Transform, which wraps scikit-learn transformers to allow multiple calls to partial_fit within a single fitting task of the final estimator. steps.Transform is initialized with:
- A scikit-learn transformer as an argument
- partial_fit_batches as a keyword, defaulting to 1. Note: using partial_fit_batches != 1 requires a transformer with a partial_fit method
Finally, MiniBatchKMeans as the final estimator.
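The partial_fit mechanism that steps.Transform relies on can be seen with scikit-learn alone. The sketch below uses synthetic data and assumes that partial_fit_batches=2 corresponds to two calls to partial_fit on roughly equal splits of the sample; how elm actually divides a sample between calls is not stated here, so treat the splitting as an assumption.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))   # 200 flattened pixels, 10 bands (synthetic data)

partial_fit_batches = 2          # mirrors the keyword used above; the splitting is assumed
ipca = IncrementalPCA(n_components=4)
for batch in np.array_split(X, partial_fit_batches):
    # Each call updates the running estimate of the principal components.
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (200, 4)
```

Each batch must contain at least n_components rows, which is one reason a transformer without a partial_fit method cannot be used with partial_fit_batches != 1.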

Multi-Model / Multi-Sample Fitting

There are two multi-model approaches to fitting that can be used with a Pipeline: fit_ensemble or fit_ea. The examples above with a data source to a Pipeline and the transformation steps within one Pipeline instance work similarly in fit_ensemble and fit_ea.

Other similarities between fit_ea and fit_ensemble include the following common keyword arguments:

scoring: a callable with a signature like elm.model_selection.kmeans.kmeans_aic (see the API docs), or a string naming an attribute of sklearn.metrics
scoring_kwargs: keyword arguments passed to the scoring callable, if needed
saved_ensemble_size: an integer indicating how many Pipeline estimators to retain in the final ensemble
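How these three keywords could interact is sketched below in plain Python. Nothing here is elm's API: ToyModel, toy_score and select_best are hypothetical, the scoring signature (model, X, **kwargs) is an assumption, and the sketch assumes a lower score is better, as it would be for an AIC-style criterion.

```python
# Hypothetical sketch: ToyModel, toy_score and select_best are not elm's API.
class ToyModel:
    def __init__(self, n_clusters):
        self.n_clusters = n_clusters


def toy_score(model, X, penalty=1.0):
    """Stand-in for a scoring callable: lower is better, as with an AIC."""
    return abs(len(X) / model.n_clusters - 2) + penalty * model.n_clusters


def select_best(ensemble, X, scoring, scoring_kwargs=None, saved_ensemble_size=2):
    """Score every (tag, model) pair and keep the lowest-scoring members."""
    scoring_kwargs = scoring_kwargs or {}
    scored = [(scoring(model, X, **scoring_kwargs), tag, model)
              for tag, model in ensemble]
    # Sort on the score only, so models themselves never need to be comparable.
    scored.sort(key=lambda item: item[0])
    return [(tag, model) for _, tag, model in scored[:saved_ensemble_size]]


X = list(range(8))
ensemble = [("tag_{}".format(k), ToyModel(n_clusters=k)) for k in (1, 2, 4, 8)]
best = select_best(ensemble, X, scoring=toy_score,
                   scoring_kwargs={"penalty": 0.5}, saved_ensemble_size=2)
print([tag for tag, _ in best])
```

The retained list has the same (tag, model) shape as its input, so a selection step like this could be applied again after each generation of fitting.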

For more on controlling the ensemble or evolutionary algorithm approaches to fitting, see the documentation for fit_ensemble and fit_ea.

Multi-Model / Multi-Sample Prediction

After fit_ensemble or fit_ea has been called on a Pipeline instance, the instance will have the attribute ensemble, a list of (tag, pipeline) tuples which are the final Pipeline instances selected by either of the fitting functions (see also saved_ensemble_size in Controlling Ensemble Initialization). With a fitted Pipeline instance, predict_many can be called to predict from every ensemble member (Pipeline instance) on a single X sample, or from every ensemble member and every sample if sampler and args_list are given in place of X.
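The "every ensemble member and every sample" behaviour amounts to a cross product, which the following plain-Python sketch illustrates. ConstantModel and predict_all are hypothetical names, the samples are plain lists rather than ElmStores, and the loop runs serially, whereas the text above says predict_many can use dask-distributed to parallelize the work.

```python
from itertools import product

# Hypothetical sketch: ConstantModel and predict_all are not elm's API.
class ConstantModel:
    """Toy fitted model that predicts the same label for every row."""
    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


def predict_all(ensemble, samples):
    """Return {(tag, sample_index): predictions} for every member/sample pair."""
    results = {}
    for (tag, model), (idx, X) in product(ensemble, enumerate(samples)):
        results[(tag, idx)] = model.predict(X)
    return results


ensemble = [("tag_0", ConstantModel(0)), ("tag_1", ConstantModel(1))]
samples = [[[0.1, 0.2]], [[0.3, 0.4], [0.5, 0.6]]]
predictions = predict_all(ensemble, samples)
print(sorted(predictions))
```

Keying the results by (tag, sample index) keeps each prediction traceable to the ensemble member and the sample that produced it.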