elm yaml Specs - Temporarily Deprecated

``elm-main`` is temporarily deprecated while ``elm`` and ``earthio`` undergo significant churn and changes in usage patterns. Around August 1, 2017, ``elm-main`` will be revisited, as it provides a ``yaml``-based interface to ``elm`` and may assist in ``elm`` UI contexts or in interoperability.

Workflows involving ensemble and evolutionary methods and predict_many can also be specified in a yaml config file for running with the elm-main console entry point. The yaml config can refer to functions from elm or from user-supplied packages or modules. Read more about the yaml configuration file format here.

The elm examples repository has a number of example yaml configuration files using GeoTiff and HDF4 files as input to K-Means or stochastic gradient descent classifiers.

This page walks through each part of a valid yaml config.

ensembles

The ensembles section creates named dicts of keyword arguments to fit_ensemble. The example below creates example_ensemble, an identifier we can use elsewhere in the config. If the keyword ensemble_init_func is passed in an ensemble here, it should be given in “package.subpackage.module:callable” notation, like a setup.py console entry point, e.g. "my_kmeans_module:make_ensemble" (a sketch of such a module follows the example).

ensembles: {
  example_ensemble: {
    init_ensemble_size: 1,
    saved_ensemble_size: 1,
    ngen: 3,
    partial_fit_batches: 2,
  },
}
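
For illustration, here is a minimal sketch of what a module referenced as "my_kmeans_module:make_ensemble" might contain. The signature shown (the base Pipeline plus **kwargs, returning a list of Pipeline copies) is an assumption for this sketch, not a confirmed elm API:

import copy

# Hypothetical my_kmeans_module.py for an ensemble_init_func.
# Assumed convention: receive the base Pipeline and keyword arguments,
# and return the list of Pipeline instances forming the initial ensemble.
def make_ensemble(pipe, init_ensemble_size=1, **kwargs):
    # Copy the base Pipeline once per initial ensemble member
    return [copy.deepcopy(pipe) for _ in range(init_ensemble_size)]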

data_sources

Each dict in data_sources creates a named sampler from its keyword arguments.

In the config, args_list can be a callable. In this case, it is iter_files_recursively, a function which takes top_dir and file_pattern as arguments. The filenames returned by iter_files_recursively are filtered by example_meta_is_day, an example function for detecting whether a satellite data file was acquired during the day or at night, based on its metadata. If args_list is callable, it should take a variable number of keyword arguments (**kwargs), as in the sketch below.
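
As a rough sketch (not the actual earthio implementation), an args_list callable could look like this; the firm requirement is only that it accepts keyword arguments:

import os
import re

# Sketch of an args_list callable; earthio's iter_files_recursively may differ.
# It accepts **kwargs and uses top_dir and file_pattern, matching the config
# keys in the example below.
def iter_files_recursively(top_dir=None, file_pattern=None, **kwargs):
    # Walk top_dir and yield file paths whose names match file_pattern
    pattern = re.compile(file_pattern or '.*')
    for dirpath, _, fnames in os.walk(top_dir):
        for fname in fnames:
            if pattern.search(fname):
                yield os.path.join(dirpath, fname)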

This example creates ds_example, which selects bands 1 through 6 from files, iterating recursively over .hdf files in ELM_EXAMPLE_DATA_PATH from the environment (env:SOMETHING means take SOMETHING from the environment variables).

band_specs in the data source are passed to earthio.LayerSpec (see also ElmStore and the LANDSAT Example) and determine which bands (subdatasets, in this HDF4 case) to include in a sample.

data_sources: {
  ds_example: {
    sampler: "earthio.filters.band_selection:select_from_file",
    band_specs: [{search_key: long_name, search_value: "Band 1 ", name: band_1},
                 {search_key: long_name, search_value: "Band 2 ", name: band_2},
                 {search_key: long_name, search_value: "Band 3 ", name: band_3},
                 {search_key: long_name, search_value: "Band 4 ", name: band_4},
                 {search_key: long_name, search_value: "Band 5 ", name: band_5},
                 {search_key: long_name, search_value: "Band 6 ", name: band_6}],
    args_list: "earthio.local_file_iterators:iter_files_recursively",
    top_dir: "env:ELM_EXAMPLE_DATA_PATH",
    metadata_filter: "earthio.metadata_selection:example_meta_is_day",
    file_pattern: "\\.hdf",
  },
}
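
The metadata_filter is likewise a callable. A toy sketch (the signature and metadata key here are assumptions, not earthio's actual example_meta_is_day):

# Hypothetical metadata_filter; assumed convention: receive a filename and
# its metadata dict, and return True to keep the file in the sample.
def example_meta_is_day(filename, meta, **kwargs):
    # Keep files whose metadata marks a daytime acquisition (toy check)
    return 'DAY' in str(meta.get('DayNightFlag', '')).upper()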

See also Creating an ElmStore from File

model_scoring

Each dict in model_scoring has a scoring callable, and the other keys/values are passed as scoring_kwargs. These in turn become the scoring and scoring_kwargs used to initialize a Pipeline instance. This example creates a scorer called kmeans_aic:

model_scoring: {
  kmeans_aic: {
    scoring: "elm.model_selection.kmeans:kmeans_aic",
    score_weights: [-1],
  }
}
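
A custom scorer can be referenced the same way. As a sketch, assuming the scorer is called with the fitted model and the sample X (treat this signature as an assumption), a minimal minimized score might be:

# Hypothetical custom scorer - not elm's kmeans_aic.
# score_weights: [-1] in the config marks the score as one to minimize.
def my_inertia_score(model, X, **kwargs):
    # Lower within-cluster sum of squares is better
    return float(model.inertia_)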

transform

This section allows using a transform model, such as IncrementalPCA from sklearn.decomposition. model_init_kwargs can include any keyword argument to the model_init_class, as well as partial_fit_batches (the number of partial_fit operations on each Pipeline fit or partial_fit).

transform: {
  pca: {
    model_init_class: "sklearn.decomposition:IncrementalPCA",
    model_init_kwargs: {n_components: 2, partial_fit_batches: 2},
  }
}

sklearn_preprocessing

This section configures scikit-learn preprocessing classes (sklearn.preprocessing), such as PolynomialFeatures, for use elsewhere in the config. Each key is an identifier and each dictionary contains a method (imported from sklearn.preprocessing) and keyword arguments to that method.

sklearn_preprocessing: {
  min_max: {
    method: MinMaxScaler,
    feature_range: [0, 1],
  },
  poly2_interact: {
    method: PolynomialFeatures,
    degree: 2,
    interaction_only: True,
    include_bias: True,
  },
}

train

The train dict configures the final estimator in a Pipeline, in this case MiniBatchKMeans. This example shows how to run that estimator with the example_ensemble keyword arguments from above and the kmeans_aic scorer from the model_scoring section above, passing drop_n and evolve_n to the model_selection callable.

train: {
  train_example: {
    model_init_class: "sklearn.cluster:MiniBatchKMeans",
    model_init_kwargs: {
      compute_labels: True
    },
    ensemble: example_ensemble,
    model_scoring: kmeans_aic,
    model_selection: "elm.model_selection.kmeans:kmeans_model_averaging",
    model_selection_kwargs: {
      drop_n: 4,
      evolve_n: 4,
    }
  }
}
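
A model_selection callable can also live in a user package, referenced with the same "package.module:callable" notation. The following is a rough sketch only (the calling convention of elm's kmeans_model_averaging is not spelled out here; assume the callable receives the ensemble's models, best first, plus model_selection_kwargs, and returns the models for the next generation):

# Hypothetical model_selection callable - a toy sketch, not
# elm.model_selection.kmeans:kmeans_model_averaging.
def my_model_selection(models, drop_n=0, evolve_n=0, **kwargs):
    # Drop the drop_n worst models; evolve_n is accepted but unused here
    return models[:max(len(models) - drop_n, 1)]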

feature_selection

Each key in this section is an identifier and each dict is a feature selector configuration, naming a method to be imported from sklearn.feature_selection and keyword arguments to that method.

feature_selection: {
  top_half: {
    method: SelectPercentile,
    percentile: 50,
    score_func: f_classif,
  },
}

run

The run section names fitting and prediction jobs to be done, using identifiers created in the config’s dictionaries reviewed above.

About the run section:
  • It is a list of actions
  • Each action in the list is a dict
  • Each action should have the key pipeline, which is a list of dictionaries specifying steps (analogous to the interactive-session Pipeline)
  • Each action should have a data_source key pointing to one of the data_sources named above
  • Each action can have a predict and/or train key whose value is one of the named train dicts above

run:
  - {pipeline: [{select_canvas: band_1},
      {flatten: True},
      {drop_na_rows: True},
      {sklearn_preprocessing: poly2_interact},
      {sklearn_preprocessing: min_max},
      {transform: pca}],
     data_source: ds_example,
     predict: train_example,
     train: train_example}

The example above shows a run configuration with a pipeline of transforms: flattening rasters, dropping null rows, adding polynomial interaction terms, min-max scaling, and PCA.

Valid steps for run - pipeline

This section shows all of the valid steps that can appear in a config’s run - pipeline lists (items that could have appeared in the pipeline list in the preceding example).

flatten

Flattens each 2-D DataArray raster to a column within a single DataArray called flat in an ElmStore.

{flatten: True}

See also transform-flatten.

See also elm.pipeline.steps.

drop_na_rows

Drops null rows from an ElmStore or xarray.Dataset with a DataArray called flat (often this step follows {flatten: True} in a pipeline).

{drop_na_rows: True}

See also transform-dropnarows.

modify_sample

Provides a callable and optionally keyword arguments to modify X and optionally y and sample_weight. See an example of interactive use of elm.pipeline.steps.ModifySample here - TODO LINK, and the function signature for a modify_sample callable here - TODO LINK. This example shows how to run normalizer_func, imported from a package and subpackage, passing keyword_1 and keyword_2.

{modify_sample: "mypackage.mysubpkg.mymodule:normalizer_func", keyword_1: 4, keyword_2: 99}

See also ModifySample usage in a K-Means LANDSAT example .
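
Pending that link, here is a rough sketch of the shape such a callable might take, under the assumption that it receives X (and optionally y and sample_weight) plus keyword arguments and returns an (X, y, sample_weight) tuple:

# Hypothetical modify_sample callable in mypackage/mysubpkg/mymodule.py.
# Assumed convention: return the (X, y, sample_weight) tuple after modification.
def normalizer_func(X, y=None, sample_weight=None,
                    keyword_1=1, keyword_2=None, **kwargs):
    # Toy modification: scale the flattened values by keyword_1
    X.flat.values[:] = X.flat.values / float(keyword_1)
    return X, y, sample_weight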

transpose

Transposes the dimensions of the ElmStore, as in this example converting from ("y", "x") dims to ("x", "y") dims.

{transpose: ['x', 'y']}

sklearn_preprocessing

If a config has a dict called sklearn_preprocessing as in the example above, then named preprocessors in that dict can be used in the run - pipeline lists as follows:

{sklearn_preprocessing: poly2_interact}

where poly2_interact is a key in sklearn_preprocessing.

See also: elm.pipeline.steps.PolynomialFeatures in elm.pipeline.steps.

feature_selection

If a config has a dict called feature_selection as in the example above, then named feature selectors there can be used in the run - pipeline section like this:

{feature_selection: top_half}

where top_half is a named feature selector in feature_selection.

transform

Note that the config’s transform section configures transform models like PCA, but they are not used unless the config’s run - pipeline lists have a transform action (dict) in them. Here is an example:

{transform: pca}

where pca is a key in the config’s transform dict.