elm
yaml
Specs - Deprecated Temporarily¶
elm-main is deprecated temporarily while ``elm`` and ``earthio`` undergo significant churn and changes in usage patterns. Around August 1, 2017 ``elm-main`` will be revisited as it provides a ``yaml`` based interface to ``elm`` and may assist in ``elm`` UI contexts or in interoperability.
Workflows involving ensemble and evolutionary methods and predict_many can also be specified in a yaml
config file for running with the elm-main console entry point. The yaml
config can refer to functions from elm
or user-given packages or modules. Read more the yaml configuration file format here
The repository elm examples has a number of example yaml
configuration files for GeoTiff and HDF4
files as input to K-Means or stochastic gradient descent classifiers.
This page walks through each part of a valid yaml
config.
ensembles
¶
The ensembles
section creates named dicts of keyword arguments to fit_ensemble. The example below creates example_ensemble
, an identifier we can use elsewhere in the config. If passing the keyword ensemble_init_func
in an ensemble here, then it should be given in “package.subpackage.module:callable” notation like a setup.py console entry point, e.g. "my_kmeans_module:make_ensemble"
.
ensembles: {
example_ensemble: {
init_ensemble_size: 1,
saved_ensemble_size: 1,
ngen: 3,
partial_fit_batches: 2,
},
}
data_sources
¶
The dicts in data_sources
create a named sampler
with their keyword arguments.
In the config, args_list
can be a callable. In this case, it is iter_files_recursively
a function which takes top_dir
and file_pattern
as arguments. The filenames returned by iter_files_recursively
will be filtered by example_meta_is_day
an example function for detecting whether a satellite data file is night or day based on its metadata. If args_list
is callable, it should take a variable number of keyword arguments (**kwargs
).
This examples creates ds_example
which selects from files to get bands 1 through 6, iterating recursively over .hdf
files in ELM_EXAMPLE_DATA_PATH
from the environment (env:SOMETHING
means take SOMETHING
from environment variables).
band_specs
in the data source are passed to earthio.LayerSpec
(See also ElmStore and LANDSAT Example ) and determine which bands (subdatasets in this HDF4 case) to include in a sample.
data_sources: {
ds_example: {
sampler: "earthio.filters.band_selection:select_from_file",
band_specs: [{search_key: long_name, search_value: "Band 1 ", name: band_1},
{search_key: long_name, search_value: "Band 2 ", name: band_2},
{search_key: long_name, search_value: "Band 3 ", name: band_3},
{search_key: long_name, search_value: "Band 4 ", name: band_4},
{search_key: long_name, search_value: "Band 5 ", name: band_5},
{search_key: long_name, search_value: "Band 6 ", name: band_6},],
args_list: "earthio.local_file_iterators:iter_files_recursively",
top_dir: "env:ELM_EXAMPLE_DATA_PATH",
metadata_filter: "earthio.metadata_selection:example_meta_is_day",
file_pattern: "\\.hdf",
},
}
See also Creating an ElmStore from File
model_scoring
¶
Each dict in model_scoring
has a scoring
callable and the other keys/values are passed as scoring_kwargs
. These in turn become the scoring
and scoring_kwargs
to initialize a Pipeline
instance. This example creates a scorer called kmeans_aic
model_scoring: {
kmeans_aic: {
scoring: "elm.model_selection.kmeans:kmeans_aic",
score_weights: [-1],
}
}
transform
¶
This section allows using transform model, such as IncrementalPCA
from sklearn.decomposition
. model_init_kwargs
can include any keyword argument to the model_init_class
, as well as partial_fit_batches
(partial_fit
operations on each Pipeline
fit
or partial_fit
).
transform: {
pca: {
model_init_class: "sklearn.decomposition:IncrementalPCA",
model_init_kwargs: {"n_components": 2, partial_fit_batches: 2},
}
}
sklearn_preprocessing
¶
This section configures scikit-learn preprocessing classes (sklearn.preprocessing
), such as PolynomialFeatures
, for use elsewhere in the config. Each key is an identifer and each dictionary contains a method
(imported from sklearn.preprocessing
) and keyword arguments to that method
.
sklearn_preprocessing: {
min_max: {
method: MinMaxScaler,
feature_range: [0, 1],
},
poly2_interact: {
method: PolynomialFeatures,
degree: 2,
interaction_only: True,
include_bias: True,
},
}
train
¶
The train
dict configures the final estimator in a Pipeline
, in this case MiniBatchKMeans
. This example shows how to run that estimator with the example_ensemble
keyword arguments from above, model scoring section from above (kmeans_aic
), passing drop_n
and evolve_n
to the model_selection
callable.
train: {
train_example: {
model_init_class: "sklearn.cluster:MiniBatchKMeans",
model_init_kwargs: {
compute_labels: True
},
ensemble: example_ensemble,
model_scoring: kmeans_aic,
model_selection: "elm.model_selection.kmeans:kmeans_model_averaging",
model_selection_kwargs: {
drop_n: 4,
evolve_n: 4,
}
}
}
feature_selection
¶
Each key in this section is an identifier and the each dict is a feature selector configuration, naming a method
to be imported from sklearn.preprocessing
and keyword arguments to that method
.
feature_selection: {
top_half: {
method: SelectPercentile,
percentile: 50,
score_func: f_classif
}
}
run
¶
The run
section names fitting and prediction jobs to be done by using identifiers created in the config’s dictionaries reviewed above.
- About the
run
section: - It is a list of actions
- Each action in the list is a dict
- Each action should have the key
pipeline
that is a list of dictionaries specifying steps (analogous to the interactive session Pipeline ) - Each action should have a
data_source
key pointing to one of thedata_sources
named above - Each action can have
predict
and/ortrain
key/value with the value being one of the namedtrain
dicts above
run:
- {pipeline: [{select_canvas: band_1},
{flatten: True},
{drop_na_rows: True},
{sklearn_preprocessing: poly2_interact},
{sklearn_preprocessing: min_max},
{transform: pca}],
data_source: ds_example,
predict: train_example,
train: train_example}
The example above showed a run
configuration with a pipeline
of transforms inclusive of flattening rasters, dropping null rows, adding polynomial interaction terms, min-max scaling, and PCA.
Valid steps for run
- pipeline
¶
This section shows all of the valid steps that can be a config’s run
- pipeline
lists (items that could have appeared in teh pipeline
list in preceding example).
flatten
Flattens 2-D each DataArray
raster to a column within a single DataArray
called flat
in an ElmStore.
{flatten: True}
See also transform-flatten.
See also: :docs:`elm.pipeline.steps<pipeline-steps>`
drop_na_rows
Drops null rows from an ElmStore
or xarray.Dataset
with a DataArray
called flat
(often this step follows {flatten: True} in a ``pipeline
).
{drop_na_rows: True}
See also transform-dropnarows.
modify_sample
Provides a callable and optionally keyword arguments to modify X
and optionally y
and sample_weight
. See example of interactive use of elm.pipeline.steps.ModifySample
here - TODO LINK and the function signature for a modify_sample
callable here - TODO LINK. This example shows how to run normalizer_func
imported from a package and subpackage, passing keyword_1
and keyword_2
.
{modify_sample: "mypackage.mysubpkg.mymodule:normalizer_func", keyword_1: 4, keyword_2: 99}
See also ModifySample
usage in a K-Means LANDSAT example .
transpose
Transpose the dimensions of the ElmStore
, like this example for converting from ("y", "x")
dims to ("x", "y")
dims.
{transpose: ['x', 'y']}
sklearn_preprocessing
If a config has a dict called sklearn_preprocessing
as in the example above, then named preprocessors in that dict can be used in the run
- pipeline
lists as follows:
{sklearn_preprocessing: poly2_interact}
where poly2_interact
is a key in sklearn_preprocessing
See also: elm.pipeline.steps.PolynomialFeatures
in elm.pipeline.steps
feature_selection
If a config has a dict called feature_selection
as in the example above, then named feature selectors there can be used in the run
- pipeline
section like this:
{feature_selection: top_half}
where top_half
is a named feature selector in feature_selection
.
transform
Note the config’s transform
section configures transform models like PCA but they are not used unless the config’s run
- pipeline
lists have a transform
action (dict) in them. Here is an example:
{transform: pca}
where pca
is a key in the config’s transform
dict.