elm (Ensemble Learning Models) is a versatile set of tools for ensemble and evolutionary-algorithm approaches to training and selecting machine learning models, and for large-scale prediction from trained models.
elm has a focus on data structures that are common in satellite and weather data analysis, such as rasters representing bands of satellite data or cubes of weather model output.
Machine learning on satellite and weather data poses common computational challenges, such as large sample sizes and a variety of input file formats. To address these challenges, elm draws from existing Python packages:
- dask-distributed: elm uses dask-distributed for parallelism over ensemble fitting and prediction
- scikit-learn: elm can use unsupervised and supervised models, preprocessors, scoring functions, and postprocessors from scikit-learn, or any estimator that follows the scikit-learn initialize / fit / predict estimator interface
- xarray: elm wraps xarray data structures for n-dimensional arrays, such as 3-dimensional weather cubes, and for collections of 2-D rasters, such as a LANDSAT sample
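As a sketch of the scikit-learn estimator interface mentioned above, any object that is initialized with parameters and then exposes fit / predict can plug in (this example uses plain scikit-learn, not elm's wrappers):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Any estimator following the scikit-learn interface works:
# initialize with parameters, then fit / predict on 2-D arrays.
X = np.random.RandomState(0).rand(100, 3)   # 100 samples, 3 features
model = MiniBatchKMeans(n_clusters=4, n_init=3, random_state=0)
model.fit(X)
labels = model.predict(X)                   # one cluster label per sample
```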
Large-Scale Model Training
elm offers the following strategies for large scale training:
- Use of partial_fit for incremental training on a series of samples
- Ensemble modeling, training batches of models in generations in parallel, with model selection after each generation
- Use of a Pipeline with a sequence of transformation steps, with partial_fit for incremental training of transformers used in Pipeline steps, such as PCA
- Custom user-given model selection logic in ensemble approaches to training
elm can use dask to parallelize the activities above.
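The incremental-training strategy above can be sketched with scikit-learn's partial_fit alone, independent of elm's own API: each batch updates the model without ever loading the full data set into memory.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Incremental training: feed a series of sample batches to partial_fit
# instead of fitting one large in-memory array.
rng = np.random.RandomState(0)
ipca = IncrementalPCA(n_components=2)
for _ in range(5):                    # five batches of 50 samples each
    batch = rng.rand(50, 6)
    ipca.partial_fit(batch)
reduced = ipca.transform(rng.rand(10, 6))   # project new samples to 2 components
```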
Ensemble modeling can be used to account for uncertainty that arises from uncertain model parameters or uncertainty in the fitting process. The ensemble approach in elm allows training and prediction from an ensemble where model parameters are varied, including parameters related to preprocessing transformations, such as feature selection or PCA transforms. See the predict_many example.
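The general pattern of training an ensemble with a varied parameter, scoring each member, and keeping the best can be sketched in plain scikit-learn. This is only an illustration of the idea elm automates across generations, not elm's API:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(200, 4)

# Train an ensemble in which one model parameter (n_clusters) is varied.
ensemble = [KMeans(n_clusters=k, n_init=5, random_state=0).fit(X)
            for k in (3, 4, 5, 6)]

# Model selection after the "generation": rank by a score and keep the best.
scored = sorted(ensemble, key=lambda m: m.inertia_)  # lower inertia is better
survivors = scored[:2]
```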
Hyperparameterization / Model Selection
elm offers two different algorithms for multi-model training with model selection: ensemble fitting with model selection between generations, and evolutionary-algorithm hyperparameter search. In either of these algorithms, elm can use most of the model scoring features of scikit-learn or a user-given model scoring callable.
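A user-given scoring callable can be as simple as a function taking a fitted model and a sample; the sketch below (names are illustrative, not elm's API) uses scikit-learn's silhouette_score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def score_model(model, X):
    """User-given scorer: higher silhouette means better-separated clusters."""
    labels = model.predict(X)
    return silhouette_score(X, labels)

X = np.random.RandomState(0).rand(150, 3)
model = KMeans(n_clusters=3, n_init=5, random_state=0).fit(X)
score = score_model(model, X)   # silhouette is always in [-1, 1]
```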
One challenge in satellite and weather data processing is the variety of input data formats, including GeoTiff, NetCDF, HDF4, HDF5, and others.
elm offers a function load_array which can load spatial array data in the following formats:
- GeoTiff: Loads files from a directory of GeoTiffs, assuming each is a single-band raster
- NetCDF: Loads variables from a NetCDF file
- HDF4 / HDF5: Loads subdatasets from HDF4 and HDF5 files
load_array creates an ElmStore (read more here), a fundamental data structure in elm that is essentially an xarray.Dataset with metadata standardization over the various file types.
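Since an ElmStore is essentially an xarray.Dataset, the underlying structure can be sketched directly in xarray: named 2-D bands sharing coordinates, with file metadata in attrs (the variable names and the filename here are hypothetical):

```python
import numpy as np
import xarray as xr

# Two 2-D rasters (bands) on a shared y/x grid, as in a satellite sample.
y = np.linspace(35.0, 36.0, 4)
x = np.linspace(-107.0, -106.0, 5)
band_1 = xr.DataArray(np.random.rand(4, 5), dims=('y', 'x'),
                      coords={'y': y, 'x': x})
band_2 = xr.DataArray(np.random.rand(4, 5), dims=('y', 'x'),
                      coords={'y': y, 'x': x})
dset = xr.Dataset({'band_1': band_1, 'band_2': band_2},
                  attrs={'source': 'example.tif'})  # hypothetical metadata
```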
Preprocessing Input Data
elm has a wide range of support for preprocessing activities. One important feature of elm is its ability to train and/or predict from more than one sample, and for each sample to run a series of preprocessing steps that may include:
- Scaling, adding polynomial features, or other preprocessors from scikit-learn
- Feature selection using any feature selection class from scikit-learn
- Flattening collections of rasters to a single 2-D matrix for fitting / prediction
- Running user-given sample transformers
- Resampling one raster onto another raster’s coordinates
- In-polygon selection
- Feature extraction through transform models like PCA or ICA
See elm.pipeline.steps for more information on preprocessing.
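The flattening step in the list above can be sketched with numpy alone: each raster becomes one column of the 2-D matrix that estimators expect, with one row per pixel.

```python
import numpy as np

# Flatten a collection of 2-D rasters (bands) into a single 2-D matrix:
# one row per pixel, one column per band.
bands = [np.random.rand(4, 5) for _ in range(3)]   # three 4x5 rasters
X = np.stack([b.ravel() for b in bands], axis=1)   # (4*5 pixels, 3 bands)
```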
Predicting for Many Large Samples and/or Models
elm can use dask-distributed, a dask thread pool, or serial processing for predicting over a group (ensemble) of models and a single sample or series of samples.
elm’s interface for large scale prediction, described here, is via the predict_many method of a Pipeline.
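The underlying pattern of predicting with every model in an ensemble over every sample can be sketched with the standard-library thread pool standing in for dask-distributed (this is an illustration of the pattern only, not elm's predict_many):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)
samples = [rng.rand(30, 3) for _ in range(4)]      # a series of samples

# A small ensemble of fitted models.
models = [KMeans(n_clusters=k, n_init=3, random_state=0).fit(X_train)
          for k in (2, 3)]

# Predict for every (model, sample) pair in parallel; dask-distributed
# plays this role in elm, a thread pool sketches the same fan-out.
tasks = [(m, s) for m in models for s in samples]
with ThreadPoolExecutor(max_workers=4) as pool:
    predictions = list(pool.map(lambda t: t[0].predict(t[1]), tasks))
```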