Intro
This tutorial is a Hello World example using elm.
Step 1 - Choose Model(s)
First, import the model(s) from scikit-learn, plus Pipeline and steps from elm.pipeline:
from elm.config import client_context
from elm.pipeline.tests.util import random_elm_store
from elm.pipeline import Pipeline, steps
from sklearn.decomposition import PCA
from sklearn.cluster import AffinityPropagation
- random_elm_store is a function that returns random rasters (xarray.DataArray objects) in an ElmStore, a data structure similar to an xarray.Dataset
- steps is a module of all the transformation steps possible in a Pipeline
See the LANDSAT K-Means and other examples to see how to read an ElmStore from GeoTiff, HDF4, HDF5, or NetCDF.
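Before moving on to real data, here is a quick look at the synthetic data random_elm_store produces (a minimal sketch; it assumes random_elm_store returns the ElmStore directly and that ElmStore exposes the Dataset-like data_vars attribute):
from elm.pipeline.tests.util import random_elm_store

# Build one small synthetic ElmStore (same keywords as the sampler below)
es = random_elm_store(bands=['band_1', 'band_2'], width=60, height=60)

# ElmStore behaves much like an xarray.Dataset: one 2-D DataArray per band
print(es.data_vars)      # band_1 and band_2
print(es.band_1.shape)   # (60, 60)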
Step 2 - Define a sampler
If fitting more than one sample, then define a sampler function to pass to fit_ensemble. Here we are using a partial of random_elm_store (synthetic data). If using a sampler, we also need to define args_list, a list of tuples where each tuple can be unpacked as arguments to sampler. The length of args_list determines the number of samples potentially used. Here we have 2 empty tuples as args_list because our sampler needs no arguments and we want 2 samples. Alternatively, the arguments X, y, and sample_weight may be given in place of sampler and args_list, as sketched after the code block below.
from functools import partial
N_SAMPLES = 2
bands = ['band_{}'.format(idx + 1) for idx in range(10)]
sampler = partial(random_elm_store,
                  bands=bands,
                  width=60,
                  height=60)
args_list = [(),] * N_SAMPLES
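For completeness, the X-style alternative would look like this (a minimal sketch, not run in this tutorial; it calls the sampler once to make a single sample and assumes the pipe object defined in the next step):
# One sample passed directly instead of sampler / args_list
X = sampler()
# pipe.fit_ensemble(X=X, init_ensemble_size=2, ngen=1)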
Step 3 - Define a Pipeline
The code block below uses Flatten to convert each 2-D raster (DataArray) into a single 1-D column of a 2-D DataArray for machine learning. The output of Flatten is in turn passed to sklearn.decomposition.PCA, and the reduced feature set from PCA is passed to the sklearn.cluster.AffinityPropagation clustering model.
pipe = Pipeline([('flat', steps.Flatten()),
                 ('pca', steps.Transform(PCA())),
                 ('aff_prop', AffinityPropagation())])
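Conceptually, Flatten ravels each (height, width) band into one column, producing a (height * width, n_bands) feature matrix. A rough numpy sketch of the idea (not elm's actual implementation):
import numpy as np

# Ten synthetic 60 x 60 "bands", like the rasters in our ElmStore
rasters = [np.random.rand(60, 60) for _ in range(10)]

# Ravel each band into a 1-D column and stack the columns side by side
features = np.column_stack([r.ravel() for r in rasters])
print(features.shape)  # (3600, 10): one row per pixel, one column per band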
Step 4 - Call fit_ensemble with dask
Now we can use fit_ensemble to fit to one or more samples and one or more instances of the pipe Pipeline above. Below we are passing the sampler, args_list, and client, which will be a dask-distributed client, a ThreadPool, or None, depending on environment variables. init_ensemble_size sets the number of Pipeline instances, and models_share_sample=False means to fit all Pipeline / sample combinations (2 x 2 == 4 total members in this case).
with client_context() as client:
    pipe.fit_ensemble(sampler=sampler,
                      args_list=args_list,
                      client=client,
                      init_ensemble_size=2,
                      models_share_sample=False,
                      ngen=1)
The code block with fit_ensemble above would show the repr of the Pipeline object as follows:
<elm.pipeline.Pipeline> with steps:
    flat: <elm.steps.Flatten>:

    pca: <elm.steps.Transform>:
        copy: True
        iterated_power: 'auto'
        n_components: None
        partial_fit_batches: None
        random_state: None
        svd_solver: 'auto'
        tol: 0.0
        whiten: False

    aff_prop: AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
          damping=0.5, max_iter=200, preference=None, verbose=False)
We can confirm that we have 4 Pipeline instances in the trained ensemble:
>>> len(pipe.ensemble)
4
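To inspect an individual member, index into .ensemble (an assumption here, based on the pattern in other elm examples: each entry is a (tag, fitted Pipeline) pair; check your installed version's docs):
# Assumption: ensemble entries are (tag, fitted Pipeline) pairs
tag, fitted = pipe.ensemble[0]
print(tag)     # identifier for this ensemble member
print(fitted)  # a fitted Pipeline like the repr shown above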
Step 5 - Call predict_many
predict_many will by default predict from the ensemble that was just trained (4 models in this case). predict_many takes sampler and args_list like fit_ensemble. The args_list may differ from the one given to fit_ensemble or be the same. We have 4 trained models in the .ensemble attribute of pipe and 2 samples specified by args_list, so predict_many returns a list of 8 prediction ElmStores:
import matplotlib.pyplot as plt
with client_context() as client:
    preds = pipe.predict_many(sampler=sampler, args_list=args_list, client=client)
example = preds[0]
example.predict.plot.pcolormesh()
plt.show()
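Since each of the 4 ensemble members predicts on each of the 2 samples, we can confirm the count:
>>> len(preds)
8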
Read More: LANDSAT K-Means example