Pipeline

Overview of Pipeline in elm

elm.pipeline.Pipeline applies a sequence of transformations to samples before fitting, transforming, and/or predicting with a scikit-learn estimator. It is similar in concept to the Pipeline in scikit-learn (sklearn.pipeline.Pipeline) but differs in several ways described below.

  • Data sources for a Pipeline: In elm, the fitting expects X to be an ElmStore or xarray.Dataset rather than a numpy array as in scikit-learn. This allows the Pipeline of transformations to include operations on cubes and other data structures common in satellite data machine learning.
  • Transformations: In scikit-learn each step in a Pipeline passes a numpy array to the next step by way of a fit_transform method. In elm, a Pipeline always passes a tuple of (X, y, sample_weight) where X is an ElmStore or xarray.Dataset and y and sample_weight are numpy arrays or None.
  • Partial Fit for Large Samples: In elm, a transformer with a partial_fit method, such as sklearn.decomposition.IncrementalPCA, may be partially fit several times as a step in a Pipeline. The final estimator may also use partial_fit several times, with dask-distributed for parallelization.
  • Multi-Model / Multi-Sample Fitting: In elm, a Pipeline can be fit with:
    • fit_ensemble: This method repeats model fitting over a series of samples and/or an ensemble of Pipeline instances. The Pipeline instances in the ensemble may or may not have the same initialization parameters. fit_ensemble can run in generations, optionally applying user-given model selection logic between generations. It is aimed at improving model fitting in cases where a representative sample is large and/or there is a need to account for parameter uncertainty.
    • fit_ea: This method uses Distributed Evolutionary Algorithms in Python (deap) to run a genetic algorithm, typically NSGA-2, that selects the best Pipeline instance(s). The interfaces of fit_ea and fit_ensemble are similar, but fit_ea takes an evo_params argument to configure the genetic algorithm.
  • Multi-Model / Multi-Sample Prediction: elm’s Pipeline has a method predict_many that can use dask-distributed to predict from one or more Pipeline instances and/or one or more samples (ElmStore instances). It predicts for all models in the final ensemble output by fit_ensemble.
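
The tuple-passing convention in the Transformations item above can be illustrated without elm itself. The following is a minimal sketch, not elm's implementation: run_steps, scale_step and uniform_weight_step are hypothetical names, and plain Python lists stand in for an ElmStore or xarray.Dataset.

```python
# Hypothetical sketch: each step receives and returns (X, y, sample_weight).
# Plain Python lists stand in for an ElmStore / xarray.Dataset here.

def scale_step(X, y=None, sample_weight=None):
    """Divide every value in X by the largest absolute value in X."""
    biggest = max(abs(v) for row in X for v in row) or 1.0
    X_new = [[v / biggest for v in row] for row in X]
    return X_new, y, sample_weight


def uniform_weight_step(X, y=None, sample_weight=None):
    """Fill in uniform sample weights when none were given."""
    if sample_weight is None:
        sample_weight = [1.0] * len(X)
    return X, y, sample_weight


def run_steps(step_funcs, X, y=None, sample_weight=None):
    """Thread the (X, y, sample_weight) tuple through every step in order."""
    for func in step_funcs:
        X, y, sample_weight = func(X, y, sample_weight)
    return X, y, sample_weight


X_out, y_out, w_out = run_steps(
    [scale_step, uniform_weight_step],
    X=[[2.0, 4.0], [1.0, -8.0]],
)
```

Because every step returns the full tuple, a step can change y or sample_weight as well as X, which a pipeline that only passes X forward cannot do.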

The following discusses each step of making a Pipeline that uses most of the features described above.

Data Sources for a Pipeline

Pipeline can be used for fitting / transforming / predicting from a single sample or series of samples. For the fit_ensemble, fit_ea or predict_many methods of a Pipeline instance:
  • To fit to a single sample, use the X keyword argument, and optionally y and sample_weight keyword arguments.
  • To fit to a series of samples, use the args_list and sampler keyword arguments.

If X is given, it is assumed to be an ElmStore or xarray.Dataset.

If sampler is given with args_list, then each element of args_list is unpacked as arguments to the callable sampler. There is a special case of giving sampler as earthio.band_selection.select_from_file which allows using the functions from earthio for reading common formats and selecting bands from files (the band_specs argument). Here is an example that uses select_from_file to load multi-band HDF4 arrays:

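import glob
import os

from earthio import LayerSpec
from earthio.band_selection import select_from_file
from earthio.metadata_selection import meta_is_day

# load_hdf4_meta and ELM_EXAMPLE_DATA_PATH are assumed to be imported or defined elsewhere.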
band_specs = list(map(lambda x: LayerSpec(**x),
        [{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
         {'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
         {'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
         {'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
         {'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
         {'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
         {'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'},
         {'search_key': 'long_name', 'search_value': "Band 9 ", 'name': 'band_9'},
         {'search_key': 'long_name', 'search_value': "Band 10 ", 'name': 'band_10'},
         {'search_key': 'long_name', 'search_value': "Band 11 ", 'name': 'band_11'}]))
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if meta_is_day(load_hdf4_meta(f))]
data_source = {
    'sampler': select_from_file,
    'band_specs': band_specs,
    'args_list': HDF4_FILES,
}

Alternatively, to train on a single HDF4 file, we could have done:

from earthio import load_array
from earthio.metadata_selection import example_meta_is_day
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if example_meta_is_day(load_hdf4_meta(f))]
data_source = {'X': load_array(HDF4_FILES[0], band_specs=band_specs)}
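
The way args_list and sampler work together can be sketched in plain Python. This is an illustration under stated assumptions, not elm's code: toy_sampler and draw_samples are hypothetical names, forwarding extra keywords to the sampler is assumed, and so is wrapping a bare value (such as a single file name) in a tuple before unpacking.

```python
# Hypothetical sketch of the sampler / args_list convention.
# toy_sampler and draw_samples are illustrative names, not elm functions.

def toy_sampler(name, n_rows, **kwargs):
    """Return a fake (X, y, sample_weight) sample built from its arguments."""
    scale = kwargs.get('scale', 1)
    X = [[scale * (i + 1)] for i in range(n_rows)]
    return X, None, None


def draw_samples(sampler, args_list, **sampler_kwargs):
    """Call sampler once per element of args_list, unpacking each element."""
    for args in args_list:
        if not isinstance(args, (tuple, list)):
            args = (args,)  # assumption: allow a bare value such as a file name
        yield sampler(*args, **sampler_kwargs)


args_list = [('sample_a', 2), ('sample_b', 3)]
samples = list(draw_samples(toy_sampler, args_list, scale=10))
```

Each element of args_list produces one sample, so the length of args_list sets how many samples a fitting or prediction run sees.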

Transformations

A Pipeline is created by giving a list of steps. The steps before the final step are known as transformers, and the final step is the estimator. See also the full docs on elm.pipeline.steps.

  • Transformer steps must be taken from one of the classes in elm.pipeline.steps. The purpose of elm.pipeline.steps is to wrap preprocessors and transformers from scikit-learn for use with ElmStore or xarray.Dataset instances.

Here is an example Pipeline of transformations before K-Means:

from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

from elm.pipeline import steps, Pipeline
pipeline_steps = [steps.Flatten(),
                  ('scaler', steps.StandardScaler()),
                  ('pca', steps.Transform(IncrementalPCA(n_components=4), partial_fit_batches=2)),
                  ('kmeans', MiniBatchKMeans(n_clusters=4, compute_labels=True)),]

The example above calls:

  • steps.Flatten first (see transformers-flatten), as a utility for flattening our multi-band raster HDF4 sample(s) into an ElmStore with a single xarray.DataArray, called flat, with each band as a column in flat.
  • StandardScaler with default arguments from sklearn.preprocessing (all other transformers from sklearn.preprocessing and sklearn.feature_selection are also attributes of elm.pipeline.steps and could be used here).
  • PCA with elm.pipeline.steps.Transform, which wraps scikit-learn transformers to allow multiple calls to partial_fit within a single fitting task of the final estimator. steps.Transform is initialized with:
    • A scikit-learn transformer as an argument
    • partial_fit_batches as a keyword, defaulting to 1. Note: using partial_fit_batches != 1 requires a transformer with a partial_fit method
  • Finally MiniBatchKMeans
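
The idea behind partial_fit_batches can be sketched with a toy transformer. This is a sketch under assumptions, not how steps.Transform is implemented: RunningScaler and fit_in_batches are hypothetical names, and splitting one sample into partial_fit_batches chunks is an assumed reading of the keyword rather than something stated above.

```python
# Hypothetical sketch of fitting a transformer with partial_fit in batches.
# RunningScaler and fit_in_batches are illustrative, not part of elm or
# scikit-learn; a plain list of numbers stands in for one flattened band.

class RunningScaler:
    """Centers data using a mean accumulated over partial_fit calls."""

    def __init__(self):
        self.n_seen = 0
        self.mean_ = 0.0

    def partial_fit(self, values):
        # Update the running mean with one batch of values.
        for v in values:
            self.n_seen += 1
            self.mean_ += (v - self.mean_) / self.n_seen
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


def fit_in_batches(transformer, values, partial_fit_batches=1):
    """Split values into partial_fit_batches chunks and partially fit each."""
    if not hasattr(transformer, 'partial_fit'):
        if partial_fit_batches != 1:
            raise ValueError('partial_fit_batches != 1 requires partial_fit')
        return transformer.fit(values)
    size = -(-len(values) // partial_fit_batches)  # ceiling division
    for start in range(0, len(values), size):
        transformer.partial_fit(values[start:start + size])
    return transformer


scaler = fit_in_batches(RunningScaler(), [1.0, 2.0, 3.0, 6.0],
                        partial_fit_batches=2)
centered = scaler.transform([3.0, 5.0])
```

The check on partial_fit mirrors the note above: a transformer without partial_fit can only be fit in a single batch.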

Multi-Model / Multi-Sample Fitting

There are two multi-model approaches to fitting that can be used with a Pipeline: fit_ensemble or fit_ea. The examples above with a data source to a Pipeline and the transformation steps within one Pipeline instance work similarly in fit_ensemble and fit_ea.

Other similarities between fit_ea and fit_ensemble include the following common keyword arguments:
  • scoring: a callable with a signature like elm.model_selection.kmeans.kmeans_aic (see API docs), or a string such as f_classif giving an attribute name from sklearn.metrics
  • scoring_kwargs: keyword arguments passed to the scoring callable, if needed
  • saved_ensemble_size: an integer indicating how many Pipeline estimators to retain in the final ensemble
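
How scoring, scoring_kwargs and saved_ensemble_size might interact can be sketched with toy models. The names ThresholdModel, accuracy and fit_ensemble_sketch are hypothetical, the scoring signature (model, X, y, **kwargs) is an assumption, and the sketch covers a single generation on a single sample without any parallelization.

```python
# Hypothetical sketch of ensemble fitting with scoring and selection.
# ThresholdModel, accuracy and fit_ensemble_sketch are illustrative names.

class ThresholdModel:
    """Predicts 1 when x exceeds a fixed threshold, else 0."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if x > self.threshold else 0 for x in X]


def accuracy(model, X, y, **scoring_kwargs):
    """Scoring callable: fraction of correct predictions (higher is better)."""
    predicted = model.predict(X)
    return sum(p == t for p, t in zip(predicted, y)) / len(y)


def fit_ensemble_sketch(models, X, y, scoring, scoring_kwargs=None,
                        saved_ensemble_size=1):
    """Fit every (tag, model) pair, score it, and keep the best ones."""
    scoring_kwargs = scoring_kwargs or {}
    scored = []
    for tag, model in models:
        model.fit(X, y)
        scored.append((scoring(model, X, y, **scoring_kwargs), tag, model))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(tag, model) for _, tag, model in scored[:saved_ensemble_size]]


X = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
candidates = [('tag_%d' % t, ThresholdModel(t)) for t in (1, 3, 5)]
ensemble = fit_ensemble_sketch(candidates, X, y, scoring=accuracy,
                               saved_ensemble_size=2)
```

The result is a list of (tag, model) tuples, the same shape as the ensemble attribute described in the prediction section.
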
Read more on controlling ensemble or evolutionary algorithm approaches to fitting.

Multi-Model / Multi-Sample Prediction

After fit_ensemble or fit_ea has been called on a Pipeline instance, the instance has an attribute ensemble, a list of (tag, pipeline) tuples holding the final Pipeline instances selected by either fitting method (see also saved_ensemble_size in Controlling Ensemble Initialization). With a fitted Pipeline instance, predict_many can be called to predict from every ensemble member (Pipeline instance) on a single X sample, or from every ensemble member on every sample if sampler and args_list are given in place of X.
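
The pairing of ensemble members with samples can be sketched as a nested loop. This is an illustration only: ConstantShift and predict_many_sketch are hypothetical names, the return format is an assumption, and the loop runs serially where elm's predict_many can use dask-distributed.

```python
# Hypothetical sketch of predicting from every ensemble member and sample.
# ConstantShift and predict_many_sketch are illustrative names only.

class ConstantShift:
    """Toy fitted model: adds a constant to each input value."""

    def __init__(self, shift):
        self.shift = shift

    def predict(self, X):
        return [x + self.shift for x in X]


def predict_many_sketch(ensemble, X=None, sampler=None, args_list=None):
    """Return {(tag, sample_index): prediction} for each model and sample."""
    if X is not None:
        samples = [X]
    else:
        samples = [sampler(*args) for args in args_list]
    return {
        (tag, idx): model.predict(sample)
        for tag, model in ensemble
        for idx, sample in enumerate(samples)
    }


ensemble = [('tag_0', ConstantShift(1)), ('tag_1', ConstantShift(10))]

# Single sample given as X.
single = predict_many_sketch(ensemble, X=[1, 2])

# Series of samples: each args tuple is unpacked into the sampler.
many = predict_many_sketch(
    ensemble,
    sampler=lambda start, n: list(range(start, start + n)),
    args_list=[(0, 2), (5, 3)],
)
```

With two ensemble members and two samples the sketch yields four predictions, one per (model, sample) pair.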

Read more on controlling predict_many.