This page walks through a Jupyter notebook that uses elm to fit an ensemble of K-Means models and predict from all members of the ensemble.
It demonstrates the common steps of using elm:
- Working with HDF5 and GeoTiff files, and controlling how a sample is composed of bands or separate rasters with LayerSpec. See also Creating an ElmStore from File
- Defining a Pipeline of transformers (e.g. normalization and PCA) and an estimator, where the transformers use classes from elm.pipeline.steps and the estimator is a model with a predict interface. See also Pipeline
- Calling fit_ensemble to train the Pipeline under varying parameters with one or more input samples
- Calling predict_many to predict from all trained ensemble members with one or more input samples
The LANDSAT classification notebook is from the elm examples. This section walks through that notebook, pointing out:
- How to use earthio for scientific data files like GeoTiffs
- How to set up an ensemble
- How to use a Pipeline in an ensemble and predict from many models
First the notebook sets some environment variables related to usage of a Client and the path to the GeoTiff example files from `elm-data`_:
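As a rough sketch of that setup (the variable names DASK_SCHEDULER and ELM_EXAMPLE_DATA_PATH are assumptions for illustration; the notebook's first cell has the actual names):

```python
import os

# Hypothetical environment-variable names -- check the notebook's first
# cell for the real ones.
os.environ.setdefault("DASK_SCHEDULER", "127.0.0.1:8786")  # where a dask Client would connect
os.environ.setdefault("ELM_EXAMPLE_DATA_PATH",
                      os.path.expanduser("~/elm-data"))    # GeoTiff example files

data_path = os.environ["ELM_EXAMPLE_DATA_PATH"]
```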
Each GeoTiff file has 1 raster (band of LANDSAT data):
See more information on ElmStore in ElmStore.
Using a list of LayerSpec objects, as shown below, is how one can control which bands, or individual GeoTiff files, become the sample dimensions for learning:
- buf_xsize: The size of the output raster's horizontal dimension
- buf_ysize: The size of the output raster's vertical dimension
- name: What to call the band in the returned ElmStore. For example, the name band_1 means you can use band_1 as a key in the ElmStore
- search_key: Where to look for the band-identifying info, in this case the file name
- search_value: What string token identifies a band, e.g. B1.TIF (see the file names printed above)
We are using buf_ysize below to downsample.
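To make the matching behavior concrete, here is a minimal pure-Python stand-in for the idea (not elm's actual LayerSpec class, which lives in earthio/elm and accepts more arguments):

```python
from collections import namedtuple

# Illustrative stand-in for LayerSpec, with only the fields discussed above.
LayerSpec = namedtuple("LayerSpec",
                       "search_key search_value name buf_xsize buf_ysize")

band_specs = [LayerSpec("name", "B%d.TIF" % n, "band_%d" % n, 50, 50)
              for n in range(1, 8)]

def matches(spec, file_attrs):
    # Look up the attribute named by search_key (here the file name),
    # then check whether search_value appears in it.
    return spec.search_value in file_attrs[spec.search_key]

chosen = [spec.name for spec in band_specs
          if matches(spec, {"name": "LC80150332013207LGN00_B1.TIF"})]
```

With this matching rule, the file above contributes only the raster named band_1 to the sample.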
Print the repr of the LayerSpec objects to see all possible arguments controlling the reading of bands:
load_array aims to find a file reader for an HDF5 or GeoTiff source.
The first argument to load_array is a directory if reading GeoTiff files; it is assumed that the directory contains GeoTiff files, each with a 1-band raster. For HDF5, the first argument is a single filename, and the bands are taken from the file's subdatasets.
band_specs (the list of LayerSpec objects from above) is passed to load_array to control which bands are read from the directory of GeoTiffs.
The notebook then goes through a number of examples similar to:
X.band_1.plot.pcolormesh() - The code uses names like band_1 and band_2. These are named DataArray objects in X because of the name argument to the LayerSpec objects above. The plot.pcolormesh() method comes from the data viz tools of xarray.DataArray. The output of each plotting call is shown in the notebook.
An elm.pipeline.Pipeline of transformations is similar to the idea of a Pipeline in scikit-learn.
- All steps but the last in a Pipeline must be instances of classes from elm.pipeline.steps - these are the transformers.
- The final step in a Pipeline should be an estimator from scikit-learn with a fit/predict interface.
- The notebook shows how to specify a several-step Pipeline of:
- Flattening rasters
- Assigning NaN where needed
- Dropping NaN rows
- Standardizing (Z-scoring) by band means and standard deviations
- Adding polynomial interaction terms of degree two
- Transforming with PCA
- K-Means with partial_fit several times per model
Preamble - Imports
This cell shows typical import statements for working with the elm.pipeline.steps classes that become part of a Pipeline, including importing a transformer and an estimator from scikit-learn:
This transform-flatten step essentially calls .ravel on each raster in X to create a single 2-D DataArray.
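The idea can be sketched without elm or xarray (an illustration of the flattening concept, not the transform-flatten implementation):

```python
def flatten_bands(bands):
    """Ravel each band's 2-D raster to 1-D and stack the raveled bands
    as columns of one 2-D matrix (rows = pixels, columns = bands)."""
    names = sorted(bands)
    raveled = {n: [v for row in bands[n] for v in row] for n in names}
    n_pixels = len(raveled[names[0]])
    return [[raveled[n][i] for n in names] for i in range(n_pixels)]

flat = flatten_bands({"band_1": [[1, 2], [3, 4]],
                      "band_2": [[10, 20], [30, 40]]})
# flat has 4 rows (pixels) and 2 columns (bands)
```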
The next step uses elm.pipeline.steps.ModifySample to run a custom callable in a Pipeline of transformations. This function sets NaN for the no-data perimeters of the rasters:
DropNaRows - Drop Null / NaN Rows
The transform-dropnarows step is a transformer that removes the NaN rows from flat (the flattened (ravel) rasters as a single 2-D DataArray).
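A minimal sketch of the row-dropping idea (not the DropNaRows implementation); note that the surviving row indices must be remembered so predictions can later be placed back in the raster:

```python
import math

def drop_na_rows(flat):
    # Keep only rows with no NaN; record which row indices survive.
    kept_idx = [i for i, row in enumerate(flat)
                if not any(math.isnan(v) for v in row)]
    return [flat[i] for i in kept_idx], kept_idx

clean, kept = drop_na_rows([[1.0, 2.0],
                            [float("nan"), 3.0],
                            [4.0, 5.0]])
```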
ModifySample - Log Transform (or pass through)
This usage of ModifySample allows the Pipeline to use a log transformation or not (see the usage of set_params several screenshots later).
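The callable might look roughly like this (a sketch; in elm the use_log switch would be a step parameter toggled via set_params rather than a plain keyword argument):

```python
import math

def log_or_passthrough(values, use_log=True):
    # Either log-transform the values or pass them through unchanged.
    if not use_log:
        return list(values)
    return [math.log(v) for v in values]
```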
Feature engineering in a Pipeline
Define a function that can compute normalized differences between bands (raster DataArray objects), adding the normalized differences to what will be the X data in the Pipeline of transformations.
Feature engineering - ModifySample with arguments
And here is how the function above can be used in a Pipeline (wrapping it with ModifySample):
We are calculating:
- NDWI: Normalized Difference Water Index: (band_4 - band_5) / (band_4 + band_5)
- NDVI: Normalized Difference Vegetation Index: (band_5 - band_4) / (band_5 + band_4)
- NDSI: Normalized Difference Snow Index: (band_2 - band_6) / (band_2 + band_6)
- NBR: Normalized Burn Ratio: (band_4 - band_7) / (band_7 + band_4)
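All four indices share one formula; a minimal sketch of the per-pixel computation (on plain lists rather than DataArray objects):

```python
def normed_diff(a, b):
    # Per-pixel normalized difference (a - b) / (a + b) of two bands.
    return [(x - y) / (x + y) for x, y in zip(a, b)]

# e.g. NDVI = normed_diff(band_5, band_4)
ndvi = normed_diff([0.6, 0.8], [0.2, 0.4])
```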
Using pcolormesh on normalized differences of bands
Here is the NDVI plotted with the `xarray-pcolormesh`_ method of the DataArray:
False Color with normalized differences of bands
The image below has an RGB (red, green, blue) matrix made up of the NDWI normalized differences:
Normalization and Adding Polynomial Terms
The following snippets show how to use a class from elm.pipeline.steps for standardizing by band and for adding polynomial interaction terms:
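The two ideas, sketched in plain Python (in the notebook these would be Pipeline steps wrapping scikit-learn's StandardScaler and PolynomialFeatures; the helpers below are illustrations only):

```python
import itertools
import math

def zscore_columns(rows):
    # Standardize each column (band) to zero mean and unit variance.
    cols = list(zip(*rows))
    scaled = []
    for col in cols:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scaled.append([(v - mean) / std for v in col])
    return [list(r) for r in zip(*scaled)]

def add_interactions(rows):
    # Append degree-2 interaction terms x_i * x_j (i < j) to each row.
    n = len(rows[0])
    pairs = list(itertools.combinations(range(n), 2))
    return [row + [row[i] * row[j] for i, j in pairs] for row in rows]
```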
Custom Feature Selection
By defining the function below, we will be able to choose among random combinations of the original data or normalized differences.
Use steps.Transform to wrap PCA or another method from sklearn.decomposition.
Use an estimator from scikit-learn: a model with a predict interface, such as sklearn.cluster.KMeans. Most scikit-learn models described here are supported.
The following uses all the steps we have created in a sequence of tuples and configures scoring for K-Means with the Akaike Information Criterion (AIC).
The next steps deal with controlling fit_ensemble (fitting with a group of models with different parameters).
See more info on Pipeline here.
This is an example ensemble_init_func to pass to fit_ensemble, using pipe.new_with_params(**new_params) to create a new unfitted Pipeline instance with new parameters.
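Schematically (the parameter name kmeans__n_clusters and the FakePipe class are hypothetical, standing in for a real Pipeline so the sketch runs on its own):

```python
import random

def ensemble_init_func(pipe, n_models=4):
    # Return a list of unfitted Pipeline copies with varied parameters.
    ensemble = []
    for _ in range(n_models):
        new_params = {"kmeans__n_clusters": random.choice([4, 5, 6, 7])}
        ensemble.append(pipe.new_with_params(**new_params))
    return ensemble

class FakePipe:
    """Stand-in for a Pipeline, only so this sketch is self-contained."""
    def new_with_params(self, **params):
        return ("unfitted pipeline", params)

ensemble = ensemble_init_func(FakePipe(), n_models=3)
```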
The fit_ensemble docs also show an example of an ensemble_init_func.
The following sets the number of generations (ngen) and the model_selection callable to run after each generation.
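A model_selection callable could, for example, keep the best half of the ensemble after each generation (a sketch; elm's actual callable signature differs, and here a lower score is better, as with AIC):

```python
def model_selection(models_and_scores):
    # Sort (model, score) pairs by score ascending and keep the best half.
    ranked = sorted(models_and_scores, key=lambda ms: ms[1])
    return ranked[:max(1, len(ranked) // 2)]

survivors = model_selection([("m1", 10.0), ("m2", 3.0), ("m3", 7.0)])
```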
fit_ensemble (which fits a group of models in generations, with model selection after each generation) and predict_many each take a client keyword accepting a dask Client. predict_many parallelizes over multiple models and samples, though here only one sample is used.
ElmStore from predict_many
predict_many has called transform-inverseflatten to reshape the 1-D numpy array from the sklearn.cluster.KMeans.predict method into a 2-D raster with the coordinates of the original data. Note also that inverse_flatten is typically able to preserve the NaN regions of the original data (the NaN borders of this image are preserved).
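The inverse-flatten idea can be sketched as placing the 1-D predictions back at the row indices kept earlier and restoring NaN everywhere else (an illustration, not the transform-inverseflatten implementation):

```python
def inverse_flatten(preds, kept_idx, shape):
    # Scatter 1-D predictions back into a 2-D raster, NaN elsewhere.
    n_rows, n_cols = shape
    full = [float("nan")] * (n_rows * n_cols)
    for pred, i in zip(preds, kept_idx):
        full[i] = float(pred)
    return [full[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

raster = inverse_flatten([1, 0, 2], kept_idx=[0, 2, 3], shape=(2, 2))
```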
The best prediction in terms of AIC: