Multi-Model Fitting II: Evolutionary Algorithms

elm can use an evolutionary algorithm for hyperparameter selection through the fit_ea method of Pipeline. Before continuing, it is helpful to read about Pipeline and how to configure a data source for the multi-model approaches in elm. That page summarizes how fit_ea and fit_ensemble may be fit to a single X matrix (when the keyword X is given) or to a series of samples (when sampler and args_list are given).

The example below walks through configuring an evolutionary algorithm to select the best K-Means model, with preprocessing steps for flattening and feature selection. First it sets up a sampler from HDF4 files (the setup of a data source is the same as in fit_ensemble):


import glob
import os
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np

from elm.config.dask_settings import client_context
from elm.model_selection.evolve import ea_setup
from elm.model_selection.kmeans import kmeans_model_averaging, kmeans_aic
from elm.pipeline import Pipeline, steps
from earthio import *
from earthio.filters.band_selection import select_from_file
from earthio.metadata_selection import example_meta_is_day


band_specs = list(map(lambda x: LayerSpec(**x),
        [{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
         {'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
         {'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
         {'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
         {'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
         {'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
         {'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'}]))
# Just get the daytime files
ELM_EXAMPLE_DATA_PATH = os.environ['ELM_EXAMPLE_DATA_PATH']  # path to the elm example data
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if example_meta_is_day(load_hdf4_meta(f))]
data_source = {
    'sampler': select_from_file,
    'band_specs': band_specs,
    'args_list': HDF4_FILES,
}

Next the example sets up a Pipeline of transformations:

def make_example_y_data(X, y=None, sample_weight=None, **kwargs):
    # Fabricate a y for the example by clustering the flattened X
    fitted = MiniBatchKMeans(n_clusters=5).fit(X.flat.values)
    y = fitted.predict(X.flat.values)
    return (X, y, sample_weight)

pipeline_steps = [steps.Flatten(),
                  ('get_y', steps.ModifySample(make_example_y_data)),
                  ('top_n', steps.SelectPercentile(percentile=80, score_func=f_classif)),
                  ('kmeans', MiniBatchKMeans(n_clusters=4))]
pipeline = Pipeline(pipeline_steps, scoring=kmeans_aic, scoring_kwargs=dict(score_weights=[-1]))

The example above uses elm.pipeline.steps.ModifySample to attach a y dataset to the X ElmStore so that the example can demonstrate SelectPercentile for (supervised) feature selection.
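To see what the 'top_n' step does on its own, here is a standalone scikit-learn illustration on synthetic NumPy data (independent of elm's ElmStore wrapper):

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic data: 100 samples, 10 features; labels depend only on feature 0
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = (X[:, 0] > 0).astype(int)

# Keep the top 20% of features ranked by ANOVA F-score
selector = SelectPercentile(percentile=20, score_func=f_classif).fit(X, y)
X_new = selector.transform(X)
print(X_new.shape)  # (100, 2) - 20% of 10 features
```

Because y was derived from feature 0, that feature scores highest and survives the selection.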

Next, evo_params are created by passing a param_grid dict to elm.model_selection.evolve.ea_setup. The param_grid uses scikit-learn syntax for parameter addressing (a named step such as "kmeans", then a double underscore, then a parameter name of that step, e.g. "n_clusters"), so this param_grid can run models with n_clusters in range(3, 10) and percentile in range(20, 100, 5). The control dict sets parameters of the evolutionary algorithm (described below).

param_grid =  {
    'kmeans__n_clusters': list(range(3, 10)),
    'top_n__percentile': list(range(20, 100, 5)),
    'control': {
        'select_method': 'selNSGA2',
        'crossover_method': 'cxTwoPoint',
        'mutate_method': 'mutUniformInt',
        'init_pop': 'random',
        'indpb': 0.5,
        'mutpb': 0.9,
        'cxpb':  0.3,
        'eta':   20,
        'ngen':  2,
        'mu':    4,
        'k':     4,
        'early_stop': {'abs_change': [10], 'agg': 'all'},
        # alternatively early_stop: {'percent_change': [10], 'agg': 'all'}
        # alternatively early_stop: {'threshold': [10], 'agg': 'any'}
    }
}

evo_params = ea_setup(param_grid=param_grid,
                      score_weights=[-1]) # minimization
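The double-underscore addressing above is the same convention scikit-learn itself uses for nested parameters; a standalone illustration with a plain scikit-learn Pipeline (independent of elm):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import Pipeline

pipe = Pipeline([('top_n', SelectPercentile(score_func=f_classif)),
                 ('kmeans', MiniBatchKMeans(n_clusters=4))])

# '<step name>__<parameter>' addresses a parameter of the named step
pipe.set_params(kmeans__n_clusters=6, top_n__percentile=40)
print(pipe.named_steps['kmeans'].n_clusters)   # 6
print(pipe.named_steps['top_n'].percentile)    # 40
```

fit_ea uses the same naming to substitute each candidate's parameter choices into the Pipeline before fitting.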

fit_ea can run with dask to parallelize over the individual solutions (Pipeline instances) and their calls to partial_fit.

Note: If you want to use dask-distributed as the client, first make sure a dask-scheduler and at least one dask-worker are running. See the dask-distributed documentation and the elm environment-variable instructions for details.
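A minimal sketch of starting the scheduler and a worker with dask's distributed CLI (the address shown is dask's illustrative default, not something elm requires):

```shell
# Start a scheduler; by default it serves on tcp://<host>:8786
dask-scheduler

# In another terminal, point each worker at the scheduler's address
dask-worker tcp://127.0.0.1:8786
```

With both running, client_context() below can connect to the cluster.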

with client_context() as client:
    fitted = pipeline.fit_ea(evo_params=evo_params,
                             client=client,
                             **data_source)
    preds = pipeline.predict_many(client=client, **data_source)

Reference: param_grid - control

In the example above the param_grid has a control dictionary specifying parameters of the evolutionary algorithm. The control dict names the functions to be used for crossover, mutation, and selection, and the other arguments are passed to those methods as needed. The following section describes each key/value of a control dictionary.

Note: While it is possible to change the select_method, crossover_method, and mutate_method below from the example shown, it is important to use methods that are consistent with how fit_ea expresses parameter choices. For each parameter in the param_grid, such as kmeans__n_clusters=list(range(3, 10)), fit_ea optimizes over indices into the kmeans__n_clusters list, i.e. choosing among list(range(7)), rather than optimizing an integer parameter between 3 and 10. This allows fit_ea to avoid custom treatment of string, float, or integer data types in the parameters' lists of choices. If changing the mutate_method, keep in mind that it must take individuals that are sequences of integers as arguments and return the same.
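The index encoding can be sketched in a few lines of plain Python (illustrative only, not elm's implementation; the decode helper is hypothetical):

```python
# Each parameter's list of choices; a genome is one index per parameter
choices = {
    'kmeans__n_clusters': list(range(3, 10)),      # 7 choices -> gene in range(7)
    'top_n__percentile': list(range(20, 100, 5)),  # 16 choices -> gene in range(16)
}

def decode(individual, choices):
    """Map a genome of indices back to concrete parameter values."""
    return {name: vals[idx]
            for (name, vals), idx in zip(choices.items(), individual)}

params = decode([2, 0], choices)
print(params)  # {'kmeans__n_clusters': 5, 'top_n__percentile': 20}
```

Because the genome is always a sequence of small non-negative integers, one mutation operator (e.g. mutUniformInt) works regardless of whether the underlying choices are ints, floats, or strings.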

  • select_method: Selection method applied on each generation of the evolutionary algorithm, typically selNSGA2, but any deap selection method may be used
  • crossover_method: Crossover method between two individuals, e.g. cxTwoPoint, or any other deap crossover method
  • mutate_method: Mutation method, typically mutUniformInt, or another deap mutation method
  • init_pop: Placeholder for initialization options - must always be 'random' (random initialization of solutions)
  • indpb: Probability that each attribute (feature) is mutated when an individual is mutated, e.g. 0.5 (passed to the mutation method)
  • mutpb: When two individuals crossover, the probability that they mutate immediately after crossover, e.g. 0.9
  • cxpb: Probability of crossover, e.g. 0.3
  • eta: Tuning parameter in NSGA-2, passed to the mutate and mate methods. With a higher eta, crowding is penalized and offspring are more different from their parents
  • ngen: Number of generations in the genetic algorithm
  • mu: Initial size of the population of solutions (individuals)
  • k: Number of individuals selected on each generation
  • early_stop: Controls stopping the algorithm before ngen generations are completed. Examples are below (agg refers to aggregation as all or any in the case of a multi-objective problem)
    • Stop on absolute change in objective: {'abs_change': [10], 'agg': 'all'}
    • Stop on percent change in objective: {'percent_change': [10], 'agg': 'all'}
    • Stop on reaching an objective threshold: {'threshold': [10], 'agg': 'any'}
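The all/any aggregation can be sketched as follows (illustrative only, not elm's code; the comparison direction is an assumption and the should_stop helper is hypothetical):

```python
def should_stop(values, spec):
    """Sketch of early_stop aggregation over objectives.

    values: one measure per objective (an absolute change, percent change,
            or raw objective value, matching the key used in spec).
    spec:   e.g. {'abs_change': [10], 'agg': 'all'}
    """
    agg = all if spec['agg'] == 'all' else any
    key = next(k for k in ('abs_change', 'percent_change', 'threshold') if k in spec)
    # Assumption: stopping fires once each measure reaches its limit
    return agg(v >= lim for v, lim in zip(values, spec[key]))

print(should_stop([12.0], {'abs_change': [10], 'agg': 'all'}))            # True
print(should_stop([4.0, 11.0], {'threshold': [10, 10], 'agg': 'any'}))   # True
```

With agg 'all', every objective must meet its limit before stopping; with 'any', a single objective suffices, which only matters for multi-objective problems.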

More Reading

fit_ea relies on deap for Pareto sorting and the genetic algorithm components described above; see the deap documentation for more detail.