Multi-Model Fitting II: Evolutionary Algorithms
elm can use an evolutionary algorithm for hyperparameterization. This involves using the fit_ea method of Pipeline. It is helpful at this point to first read about Pipeline and how to configure a data source for the multi-model approaches in elm. That page summarizes how fit_ea and fit_ensemble may be fit to a single X matrix (when the keyword X is given) or a series of samples (when sampler and args_list are given).
The example below walks through configuring an evolutionary algorithm to select the best K-Means model, with feature selection (SelectPercentile) as a preprocessing step. First it sets up a sampler from HDF4 files (note that the setup of a data source is the same as in fit_ensemble).
Example
import glob
import os
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
from elm.config.dask_settings import client_context
from elm.model_selection.evolve import ea_setup
from elm.model_selection.kmeans import kmeans_model_averaging, kmeans_aic
from elm.pipeline import Pipeline, steps
from earthio import *
from earthio.filters.band_selection import select_from_file
from earthio.metadata_selection import example_meta_is_day
ELM_EXAMPLE_DATA_PATH = os.environ['ELM_EXAMPLE_DATA_PATH']
band_specs = list(map(lambda x: LayerSpec(**x),
    [{'search_key': 'long_name', 'search_value': "Band 1 ", 'name': 'band_1'},
     {'search_key': 'long_name', 'search_value': "Band 2 ", 'name': 'band_2'},
     {'search_key': 'long_name', 'search_value': "Band 3 ", 'name': 'band_3'},
     {'search_key': 'long_name', 'search_value': "Band 4 ", 'name': 'band_4'},
     {'search_key': 'long_name', 'search_value': "Band 5 ", 'name': 'band_5'},
     {'search_key': 'long_name', 'search_value': "Band 6 ", 'name': 'band_6'},
     {'search_key': 'long_name', 'search_value': "Band 7 ", 'name': 'band_7'}]))
# Just get daytime files
HDF4_FILES = [f for f in glob.glob(os.path.join(ELM_EXAMPLE_DATA_PATH, 'hdf4', '*hdf'))
              if example_meta_is_day(load_hdf4_meta(f))]
data_source = {
    'sampler': select_from_file,
    'band_specs': band_specs,
    'args_list': HDF4_FILES,
}
Next, the example sets up a Pipeline of transformations:
def make_example_y_data(X, y=None, sample_weight=None, **kwargs):
    # Derive example y labels by clustering X, so SelectPercentile has a target
    fitted = MiniBatchKMeans(n_clusters=5).fit(X.flat.values)
    y = fitted.predict(X.flat.values)
    return (X, y, sample_weight)

pipeline_steps = [steps.Flatten(),
                  steps.ModifySample(make_example_y_data),
                  ('top_n', steps.SelectPercentile(percentile=80, score_func=f_classif)),
                  ('kmeans', MiniBatchKMeans(n_clusters=4))]
pipeline = Pipeline(pipeline_steps, scoring=kmeans_aic, scoring_kwargs=dict(score_weights=[-1]))
The example above uses elm.pipeline.steps.ModifySample to return a y data set corresponding to the X ElmStore so that the example can show SelectPercentile for feature selection.
Next, evo_params needs to be created by passing a param_grid dict to elm.model_selection.evolve.ea_setup. The param_grid uses scikit-learn syntax for parameter replacement (i.e. a named step like “kmeans”, then a double underscore, then a parameter name for that step, e.g. “n_clusters”), so this param_grid could potentially run models with n_clusters in range(3, 10) and percentile in range(20, 100, 5). The control dict sets parameters for the evolutionary algorithm (described below).
param_grid = {
    'kmeans__n_clusters': list(range(3, 10)),
    'top_n__percentile': list(range(20, 100, 5)),
    'control': {
        'select_method': 'selNSGA2',
        'crossover_method': 'cxTwoPoint',
        'mutate_method': 'mutUniformInt',
        'init_pop': 'random',
        'indpb': 0.5,
        'mutpb': 0.9,
        'cxpb': 0.3,
        'eta': 20,
        'ngen': 2,
        'mu': 4,
        'k': 4,
        'early_stop': {'abs_change': [10], 'agg': 'all'},
        # alternatively 'early_stop': {'percent_change': [10], 'agg': 'all'}
        # alternatively 'early_stop': {'threshold': [10], 'agg': 'any'}
    }
}
evo_params = ea_setup(param_grid=param_grid,
                      param_grid_name='param_grid_example',
                      score_weights=[-1])  # minimization
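The double-underscore naming used in the param_grid above can be illustrated with a short standalone sketch. It does not call elm or scikit-learn; split_param_name and by_step are hypothetical names introduced only for this illustration, and the 'control' entry is left out because it configures the algorithm rather than a pipeline step.

```python
from itertools import product

# Same parameter choices as the param_grid above, without the 'control' entry.
param_grid = {
    'kmeans__n_clusters': list(range(3, 10)),
    'top_n__percentile': list(range(20, 100, 5)),
}


def split_param_name(name):
    """Split 'step__param' into (step, param) at the first double underscore."""
    step, _, param = name.partition('__')
    return step, param


# Hypothetical grouping of the choices by the pipeline step they belong to.
by_step = {}
for name, choices in param_grid.items():
    step, param = split_param_name(name)
    by_step.setdefault(step, {})[param] = choices

# Total number of distinct parameter combinations in the grid.
n_combinations = len(list(product(*param_grid.values())))

print(by_step)
print(n_combinations)
```

The grid holds 7 × 16 = 112 combinations. With a small population and few generations, an evolutionary search would typically evaluate only a subset of them.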
The example runs with dask to parallelize over the individual solutions (Pipeline instances) and their calls to partial_fit.
Note: If you want dask-distributed as a client, first make sure you are running a dask-scheduler and dask-worker. Read more on dask-distributed and follow the instructions on environment variables.
with client_context() as client:
    fitted = pipeline.fit_ea(evo_params=evo_params,
                             client=client,
                             **data_source)
    preds = pipeline.predict_many(client=client, **data_source)
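The idea of evaluating the individual solutions of one generation in parallel can be sketched in a few lines. This sketch is only an analogy: it uses the standard library's concurrent.futures in place of dask, and evaluate is a made-up toy scoring function, not elm's fitting or scoring code.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(params):
    """Toy stand-in for fitting and scoring one candidate (lower is better)."""
    return (params['n_clusters'] - 6) ** 2 + params['percentile'] / 100.0


# A hypothetical population of four candidate parameter sets.
population = [
    {'n_clusters': 3, 'percentile': 80},
    {'n_clusters': 5, 'percentile': 40},
    {'n_clusters': 7, 'percentile': 20},
    {'n_clusters': 9, 'percentile': 95},
]

# Each candidate is independent of the others, so they can be scored concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, population))

best = population[scores.index(min(scores))]
print(best)
```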
Reference param_grid - control
In the example above, the param_grid has a control dictionary specifying parameters of the evolutionary algorithm. The control dict names the functions to be used for crossover, mutation, and selection, and the other arguments are passed to those methods as needed. The following section describes each key/value of a control dictionary.
Note: While it is possible to change the select_method, crossover_method and mutate_method from the example shown, it is important to use methods that are consistent with how fit_ea expresses parameter choices. For each parameter in the param_grid, such as kmeans__n_clusters=list(range(3, 10)), fit_ea optimizes with indices into the kmeans__n_clusters list, i.e. choosing among list(range(7)), not optimizing an integer parameter between 3 and 10. This allows fit_ea to avoid custom treatment of string, float, or integer data types in the parameters’ lists of choices. If changing the mutate_method, keep in mind that it needs to take individuals that are sequences of integers as arguments and return the same.
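The index-based encoding described in the note can be made concrete with a small standalone sketch. It does not use elm or deap. decode and mutate_uniform_int are hypothetical helpers, the gene order (sorted parameter names) is an assumption, and the mutation is only loosely modeled on a uniform integer mutation: it takes a sequence of integers and returns a sequence of integers.

```python
import random

param_grid = {
    'kmeans__n_clusters': list(range(3, 10)),
    'top_n__percentile': list(range(20, 100, 5)),
}
# Assumed fixed gene order: one index per parameter, in sorted name order.
names = sorted(param_grid)


def decode(individual):
    """Map a sequence of indices to the parameter values they select."""
    return {name: param_grid[name][idx] for name, idx in zip(names, individual)}


def mutate_uniform_int(individual, indpb, rng):
    """With probability indpb per gene, redraw that index within its valid range."""
    mutated = list(individual)
    for i, name in enumerate(names):
        if rng.random() < indpb:
            mutated[i] = rng.randint(0, len(param_grid[name]) - 1)
    return mutated


rng = random.Random(0)
individual = [2, 5]  # index 2 of n_clusters choices, index 5 of percentile choices
child = mutate_uniform_int(individual, indpb=0.5, rng=rng)

print(decode(individual))  # {'kmeans__n_clusters': 5, 'top_n__percentile': 45}
print(decode(child))
```

Because the individuals hold only list indices, the same mutation works whether the underlying choices are integers, floats, or strings.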
- select_method: Selection method applied on each generation of the evolutionary algorithm. The selection method is typically selNSGA2 but can be any deap.tools selection method (see the list of selection methods in the deap documentation).
- crossover_method: Crossover method between two individuals, e.g. cxTwoPoint, or any crossover method from deap.tools.
- mutate_method: Mutation method, typically mutUniformInt, or another mutation method from deap.tools.
- init_pop: Placeholder for initialization features; must always be random (random initialization of solutions).
- indpb: Probability that each attribute (feature) is mutated when an individual is mutated, e.g. 0.5 (passed to mutation methods in deap.tools).
- mutpb: When two individuals cross over, this is the probability that they mutate immediately after crossover, e.g. 0.9.
- cxpb: Probability of crossover, e.g. 0.3.
- eta: Tuning parameter in NSGA-2, passed to mutate and mate methods. In the deap operators that accept eta (a crowding degree), a higher eta produces offspring that resemble their parents more closely, while a lower eta produces offspring that differ more from their parents.
- ngen: Number of generations in the genetic algorithm.
- mu: Initial size of the population of solutions (individuals).
- k: Select the top k on each generation.
- early_stop: Controls stopping of the algorithm before ngen generations are completed. Examples are below (note that agg refers to aggregation as all or any in the case of a multi-objective problem):
  - Stop on absolute change in objective: {'abs_change': [10], 'agg': 'all'}
  - Stop on percent change in objective: {'percent_change': [10], 'agg': 'all'}
  - Stop on reaching objective threshold: {'threshold': [10], 'agg': 'any'}
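The early_stop options above can be read as a per-objective check that is then aggregated with all or any. The sketch below shows one plausible interpretation only; it is not elm's implementation. The function should_stop is hypothetical, and the exact semantics are assumptions: change is measured between the objective values of successive generations, and threshold assumes a minimized objective.

```python
def should_stop(prev_scores, new_scores, early_stop):
    """Return True if the (assumed) stopping rule is met for the objectives."""
    rule = next((k for k in ('abs_change', 'percent_change', 'threshold')
                 if k in early_stop), None)
    if rule is None:
        return False
    checks = []
    for old, new, limit in zip(prev_scores, new_scores, early_stop[rule]):
        if rule == 'abs_change':
            # Assumed: stop when the objective moved by less than the limit.
            checks.append(abs(new - old) < limit)
        elif rule == 'percent_change':
            # Assumed: relative change in percent; undefined when old is zero.
            checks.append(old != 0 and abs(new - old) / abs(old) * 100 < limit)
        else:
            # Assumed: minimization, so stop once the objective is at or below the limit.
            checks.append(new <= limit)
    # 'agg' decides whether all objectives or any one objective must pass.
    aggregate = all if early_stop.get('agg', 'all') == 'all' else any
    return aggregate(checks)


print(should_stop([100.0], [95.0], {'abs_change': [10], 'agg': 'all'}))  # True
print(should_stop([100.0], [80.0], {'abs_change': [10], 'agg': 'all'}))  # False
```

With a single objective, as in the example's score_weights=[-1], all and any give the same result; the distinction matters only when several objectives are tracked.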
More Reading
fit_ea relies on deap for Pareto sorting and the genetic algorithm components described above. Read more about deap: