Quick Start

The following steps generate a visualization using elm on a synthetic dataset. As development on elm continues, we aim to condense this document into a smaller example. For now, it offers insight into elm’s customizability and extensive feature set.

Step 1 - Choose Model(s)

First, import the model(s) from scikit-learn, along with Pipeline and steps from elm.pipeline:

from elm.config import client_context
from earthio.filters.make_blobs import random_elm_store
from elm.pipeline import Pipeline, steps
from sklearn.decomposition import PCA
from sklearn.cluster import AffinityPropagation

See the LANDSAT K-Means and other examples to see how to read an ElmStore from GeoTiff, HDF4, HDF5, or NetCDF.

Step 2 - Define a sampler

If fitting more than one sample, define a sampler function to pass to fit_ensemble. Here we use a partial of random_elm_store (synthetic data). When using a sampler, we also need to define args_list, a list of tuples where each tuple can be unpacked as the arguments to sampler. The length of args_list determines the number of samples potentially used. Here args_list is two empty tuples because our sampler needs no arguments and we want 2 samples. Alternatively, the arguments X, y, and sample_weight may be given in place of sampler and args_list.

from functools import partial
N_SAMPLES = 2
bands = ['band_{}'.format(idx + 1) for idx in range(10)]
sampler = partial(random_elm_store, bands=bands)
args_list = [(),] * N_SAMPLES
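To make the sampler / args_list contract concrete, here is a plain-Python sketch using a stand-in sampler (fake_sampler is hypothetical, not part of elm): each tuple in args_list is unpacked as the positional arguments of one sampler call.

```python
from functools import partial

def fake_sampler(n_rows, scale=1.0):
    # Stand-in for random_elm_store: returns a list of scaled row indices
    return [i * scale for i in range(n_rows)]

# Bake keyword arguments in with partial, as done with random_elm_store above
sampler = partial(fake_sampler, scale=2.0)

# One tuple of positional arguments per sample; each tuple is
# unpacked as sampler(*args), so len(args_list) samples are drawn
args_list = [(3,), (2,)]
samples = [sampler(*args) for args in args_list]
print(samples)  # [[0.0, 2.0, 4.0], [0.0, 2.0]]
```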

Step 3 - Define a Pipeline

The code block below uses Flatten to convert each 2-D raster ( DataArray ) into a single 1-D column of a 2-D DataArray for machine learning. The output of Flatten is passed in turn to sklearn.decomposition.PCA , and the reduced feature set from PCA is passed to the sklearn.cluster.AffinityPropagation clustering model.

pipe = Pipeline([('flat', steps.Flatten()),
                 ('pca', steps.Transform(PCA())),
                 ('aff_prop', AffinityPropagation())])
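Conceptually, the Flatten step reshapes each 2-D band into one column of a (pixels, bands) matrix. A rough NumPy sketch of that reshaping (an illustration, not elm's actual implementation):

```python
import numpy as np

# Two synthetic 2-D rasters (bands), each of shape (height, width)
height, width = 4, 3
band_1 = np.arange(height * width, dtype=float).reshape(height, width)
band_2 = band_1 * 10

# Each raster is raveled into a 1-D column, giving a (pixels, bands)
# 2-D array, the layout scikit-learn estimators expect
X = np.column_stack([band_1.ravel(), band_2.ravel()])
print(X.shape)  # (12, 2)
```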

Step 4 - Call fit_ensemble with dask

Now we can use fit_ensemble to fit one or more instances of the pipe Pipeline above to one or more samples. Below we pass the sampler and args_list, plus client, which will be a dask-distributed client, a ThreadPool, or None, depending on environment variables. init_ensemble_size sets the number of Pipeline instances, and models_share_sample=False means to fit all Pipeline / sample combinations (2 X 2 == 4 total members in this case).

with client_context() as client:
    pipe.fit_ensemble(sampler=sampler,
                      args_list=args_list,
                      client=client,
                      init_ensemble_size=2,
                      models_share_sample=False)

After fit_ensemble runs, printing the Pipeline object shows its repr as follows:

>>> print(pipe)
<elm.pipeline.Pipeline> with steps:
    flat: <elm.steps.Flatten>:

    pca: <elm.steps.Transform>:
        copy: True
        iterated_power: 'auto'
        n_components: None
        partial_fit_batches: None
        random_state: None
        svd_solver: 'auto'
        tol: 0.0
        whiten: False
    aff_prop: AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
              damping=0.5, max_iter=200, preference=None, verbose=False)

We can confirm that we have 4 Pipeline instances in the trained ensemble:

>>> len(pipe.ensemble)
4

Step 5 - Call predict_many

By default, predict_many predicts from the ensemble that was just trained (4 models in this case). predict_many takes sampler and args_list like fit_ensemble; the args_list may be the same as the one given to fit_ensemble or differ from it. We have 4 trained models in the .ensemble attribute of pipe and 2 samples specified by args_list, so predict_many returns a list of 8 prediction :doc:`ElmStore<elm-store>`s:

preds = pipe.predict_many(sampler=sampler, args_list=args_list)
example = preds[0]
import matplotlib.pyplot as plt
example.predict.plot.pcolormesh()  # plot one prediction's 2-D raster
plt.show()
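As a sanity check on the counts above, the number of predictions is the cross product of trained models and samples (plain arithmetic mirroring the fit_ensemble settings described earlier):

```python
n_pipelines = 2                          # init_ensemble_size
n_fit_samples = 2                        # len(args_list) during fit_ensemble
# models_share_sample=False fits every Pipeline / sample combination
n_models = n_pipelines * n_fit_samples   # 4 trained ensemble members
n_predict_samples = 2                    # len(args_list) passed to predict_many
print(n_models * n_predict_samples)      # 8 prediction ElmStores
```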

Read More: LANDSAT K-Means example