doenut.models

DoENUT: models

A model is an approximation of a system, built by fitting various equations to a dataset.

Currently, this module provides four classes: two model classes and two model-grouping classes.

Model is a basic model fitted via linear regression. It is a fairly thin layer over sklearn’s model and the base for the rest of the code.

AveragedModel is the more useful of the two. As well as generating a normal model, it also generates a set of models fitted via a leave-one-out methodology, which allows it to calculate the Q2 cross-validation correlation coefficient.

ModelSet can be used when you have multiple columns/parameters in your response data. It will generate one Model for each response column. Similarly, AveragedModelSet will generate one AveragedModel for each response column.

Package Contents

Classes

Model

A simple linear regression model.

AveragedModel

Model scored as the average of multiple models generated from a single set of inputs via a leave-one-out approach.

ModelSet

Class to train and hold a group of related models.

AveragedModelSet

Class to train and hold a group of related (averaged) models.

class doenut.models.Model(data: doenut.data.data_set.DataSet, fit_intercept: bool)

A simple linear regression model.

Note

This class mostly exists as a base - you probably want AveragedModel

Parameters:
  • data (doenut.data.DataSet) – The inputs and responses for this model

  • fit_intercept (bool) – Whether to fit an intercept term. If False, no intercept is fitted and the model is forced through the origin.
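
As a rough illustration of what fit_intercept changes, the sketch below fits a single-input line by ordinary least squares twice: once with an intercept term and once forced through the origin. It uses only the standard library rather than DoENUT or sklearn, and fit_line is a hypothetical helper that is not part of this module.

```python
def fit_line(xs, ys, fit_intercept=True):
    """Least-squares fit of y = intercept + slope * x for a single input.

    Hypothetical helper for illustration only; not part of doenut.
    """
    if fit_intercept:
        x_bar = sum(xs) / len(xs)
        y_bar = sum(ys) / len(ys)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        sxx = sum((x - x_bar) ** 2 for x in xs)
        slope = sxy / sxx
        return y_bar - slope * x_bar, slope
    # No intercept term: the line is forced through the origin.
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return 0.0, slope


xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # exactly y = 1 + 2x

with_intercept = fit_line(xs, ys, fit_intercept=True)
through_origin = fit_line(xs, ys, fit_intercept=False)
```

With the intercept fitted, the sketch recovers the generating line. Without it, the slope has to absorb the offset, so it comes out steeper than the true value.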

get_predictions_for(inputs: pandas.DataFrame) -> pandas.DataFrame

Generates the predictions of the model for a set of inputs.

Parameters:

inputs (pd.DataFrame) – The inputs to test against.

Returns:

the predictions from the model

Return type:

pd.DataFrame

get_r2_for(data: doenut.data.data_set.DataSet)

Calculate the R2 Pearson coefficient for a given pairing of inputs and responses.

Parameters:

data (doenut.data.DataSet) – The data to test.

Returns:

The calculated R2 value as a float

Return type:

float
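
For reference, the usual definition of R2 (one minus the ratio of the residual sum of squares to the total sum of squares) can be sketched as follows. This is an assumption about the metric, not a copy of DoENUT's implementation, and r_squared is a hypothetical helper.

```python
def r_squared(observed, predicted):
    """R2 = 1 - SS_res / SS_tot (hypothetical helper, not part of doenut)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]

# Predictions that match the observations exactly.
perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])

# Predictions that are no better than always guessing the mean.
mean_only = r_squared(observed, [2.5, 2.5, 2.5, 2.5])
```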

class doenut.models.AveragedModel(data: doenut.data.modifiable_data_set.ModifiableDataSet, scale_data: bool = True, scale_run_data: bool = True, fit_intercept: bool = True, response_key: str = None, drop_duplicates: str = 'yes')

Bases: doenut.models.model.Model

Model scored as the average of multiple models generated from a single set of inputs via a leave-one-out approach.
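
The leave-one-out idea can be sketched without DoENUT: refit the model once per run with that run held out, predict the held-out run, and accumulate the squared prediction errors (PRESS). The sketch assumes the common definition Q2 = 1 - PRESS / SS_tot and a single input column; fit_line and leave_one_out_q2 are hypothetical helpers, and the library's own calculation may differ in detail.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leave_one_out_q2(xs, ys):
    """Q2 = 1 - PRESS / SS_tot, using one refit per held-out run."""
    y_bar = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit on every run except run i, then predict run i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        press += (ys[i] - (intercept + slope * xs[i])) ** 2
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - press / ss_tot


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
noisy = [0.1, 0.9, 2.2, 2.8, 4.1]  # roughly y = x with some scatter
exact = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x

q2_noisy = leave_one_out_q2(xs, noisy)
q2_exact = leave_one_out_q2(xs, exact)
```

Because every prediction is made for a run the fit never saw, Q2 is a stricter score than R2 on the same data and drops quickly when the model overfits.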

Parameters:
  • data (doenut.data.ModifiableDataSet) – the data to run / test against.

  • scale_data (bool, default True) – Whether to scale the overall data before running it.

  • scale_run_data (bool, default True) – Whether to normalise the data for each run

  • fit_intercept (bool, default True) – Whether to fit an intercept term. If False, the intercept is fixed at zero.

  • response_key (str, optional) – for multi-column responses, which one to test on

  • drop_duplicates ({'yes', 'drop', 'average'}) – Whether to drop duplicate values. May also be ‘average’, which drops the duplicates but sets the response value(s) of the remaining row to the average of all the duplicates.
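
The ‘average’ behaviour described for drop_duplicates can be pictured with a small standard-library sketch: rows with identical inputs are collapsed into one, and that row's response becomes the mean of the group. average_duplicates is a hypothetical helper working on plain lists; it is not how DoENUT stores or processes its data sets.

```python
from collections import defaultdict


def average_duplicates(inputs, responses):
    """Collapse rows with identical inputs, averaging their responses.

    Hypothetical helper, not part of doenut. `inputs` is a list of tuples
    and `responses` a list of floats of the same length.
    """
    grouped = defaultdict(list)
    for row, response in zip(inputs, responses):
        grouped[row].append(response)
    # Dicts keep insertion order, so the first occurrence sets the position.
    unique_inputs = list(grouped)
    averaged = [sum(vals) / len(vals) for vals in grouped.values()]
    return unique_inputs, averaged


inputs = [(20, 1.0), (40, 1.0), (20, 1.0), (40, 2.0)]
responses = [0.50, 0.70, 0.60, 0.90]

unique_inputs, averaged = average_duplicates(inputs, responses)
```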

classmethod tune_model(data: doenut.data.modifiable_data_set.ModifiableDataSet, fit_intercept: bool = True, response_key: str = None, drop_duplicates: str = 'yes') -> Tuple[AveragedModel, AveragedModel]

Generate a pair of models from the same set of data: one using scaled data, the other unscaled.

The scaled model can then be used to decide which columns to drop for later models, and the unscaled model to check the model's performance against validation data (or simply as the final model once tuning is done).

Parameters:
  • data (doenut.data.ModifiableDataSet) – The dataset to test against. This should be unscaled.

  • fit_intercept (bool, default True) – Whether to fit the intercept or not (usually yes)

  • response_key (str, optional) – If there is more than one response column, which one to use.

  • drop_duplicates ({'yes', 'drop', 'average'}) – Whether to drop duplicate values. May also be ‘average’, which drops the duplicates but sets the response value(s) of the remaining row to the average of all the duplicates.

Returns:

  • AveragedModel – The generated scaled model

  • AveragedModel – The generated unscaled model

class doenut.models.ModelSet(default_inputs=None, default_responses=None, default_scale_data=True, default_fit_intercept=True)

Class to train and hold a group of related models. When constructing the ModelSet, you can define default values. Then when adding a new model to the set you only have to specify the parameters which differ from the default.

Note

This class mostly exists as a base - you probably want AveragedModelSet

Parameters:
  • default_inputs (pd.DataFrame, optional) – The default inputs to the model

  • default_responses (pd.DataFrame, optional) – The default responses for the model

  • default_scale_data (bool, optional) – Whether to scale the data before adding to the model by default

  • default_fit_intercept (bool, optional) – Whether to fit an intercept term for each model by default

_validate_value(name: str, value: Any = None) -> Any

add_model(inputs: pandas.DataFrame = None, responses: pandas.DataFrame = None, scale_data: bool = None, fit_intercept: bool = None)

Builds and adds a model to the set. For each parameter not specified, the default is used instead.

Parameters:
  • inputs (pd.DataFrame, optional) – The inputs to the model

  • responses (pd.DataFrame, optional) – The responses for the model

  • scale_data (bool, optional) – Whether to scale the data before adding to the model

  • fit_intercept (bool, optional) – Whether to fit an intercept term for the model

Returns:

The generated model

Return type:

doenut.models.Model
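
The default-and-override pattern described for ModelSet can be sketched in a few lines. DefaultsSketch and its resolve method are hypothetical stand-ins, and raising ValueError when neither a value nor a default is available is an assumption rather than documented behaviour.

```python
class DefaultsSketch:
    """Hypothetical stand-in for the default/override pattern; not doenut code."""

    def __init__(self, **defaults):
        self.defaults = defaults

    def resolve(self, name, value=None):
        # An explicit argument wins; otherwise fall back to the stored default.
        if value is not None:
            return value
        if self.defaults.get(name) is not None:
            return self.defaults[name]
        # Assumed behaviour: fail loudly when nothing has been supplied.
        raise ValueError(f"No value or default given for {name!r}")


settings = DefaultsSketch(scale_data=True, fit_intercept=True, inputs=None)

scale_data = settings.resolve("scale_data")               # uses the default
fit_intercept = settings.resolve("fit_intercept", False)  # explicit override
```

Using None as the "not specified" marker is what lets an explicit False override a default of True, which matches the None defaults in the add_model signature.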

get_r2s()

Get the Pearson R2 values for the models in the set

Returns:

The R2 value for each model in the set.

Return type:

List[float]

get_attributes(attribute: str) -> List[Any]

Get a specified attribute from each model. Frustratingly, some are in the model, others in the sklearn model.

Parameters:

attribute (str) – The attribute you want from the model

Returns:

A list of the value of that attribute for each model in the set.

Return type:

List[Any]

Raises:

ValueError – If the attribute is not present in either the model or the inner sklearn model.

Note

If the attribute exists in both the model and the sklearn model, the model attribute will be the one returned.
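
The lookup order described above (the model first, then the inner sklearn model, otherwise ValueError) can be sketched with two stand-in classes. WrapperModel, InnerModel, the attribute name model for the wrapped estimator, and the example attribute r2 are all assumptions made for illustration.

```python
class InnerModel:
    """Stands in for the wrapped sklearn estimator (hypothetical)."""

    coef_ = [0.5, -1.2]
    r2 = -1.0  # shadowed by the wrapper's attribute of the same name


class WrapperModel:
    """Stands in for a model that wraps an inner estimator (hypothetical)."""

    def __init__(self, r2):
        self.r2 = r2
        self.model = InnerModel()


def get_attributes(models, attribute):
    """Collect one attribute per model, preferring the wrapper's own value."""
    values = []
    for wrapper in models:
        if hasattr(wrapper, attribute):
            values.append(getattr(wrapper, attribute))
        elif hasattr(wrapper.model, attribute):
            values.append(getattr(wrapper.model, attribute))
        else:
            raise ValueError(f"Attribute {attribute!r} not found")
    return values


models = [WrapperModel(0.91), WrapperModel(0.87)]
r2_values = get_attributes(models, "r2")        # found on the wrapper
coefficients = get_attributes(models, "coef_")  # found on the inner model
```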

class doenut.models.AveragedModelSet(default_inputs: pandas.DataFrame = None, default_responses: pandas.DataFrame = None, default_scale_data: bool = True, default_scale_run_data: bool = True, default_fit_intercept: bool = True, default_response_key: list = [0], default_drop_duplicates: str = 'yes', default_input_selector: list = [])

Bases: doenut.models.model_set.ModelSet

Class to train and hold a group of related (averaged) models. When constructing the AveragedModelSet, you can define default values. Then when adding a new model to the set you only have to specify the parameters which differ from the default.

Parameters:
  • default_inputs (pd.DataFrame, optional) – The default inputs to the model

  • default_responses (pd.DataFrame, optional) – The default responses for the model

  • default_scale_data (bool, optional) – Whether to scale the data before adding to the model by default

  • default_scale_run_data (bool, optional) – Whether to scale the data for each train/test set by default

  • default_fit_intercept (bool, optional) – Whether to fit an intercept term for each model by default

  • default_response_key (str, optional) – The default column to pick from the responses

  • default_drop_duplicates ({'no', 'yes', 'averages'}, optional) – What to do with duplicates in the inputs, by default

  • default_input_selector (List, optional) – What columns from the input data to select by default

classmethod multiple_response_columns(inputs: pandas.DataFrame = None, responses: pandas.DataFrame = None, scale_data: bool = True, scale_run_data: bool = True, fit_intercept: bool = True, drop_duplicates: str = 'yes', input_selector: list = []) -> AveragedModelSet

add_model(inputs=None, responses=None, scale_data=None, scale_run_data=None, fit_intercept=None, response_key=None, drop_duplicates=None, input_selector=None)

Add a new AveragedModel to the set

Parameters:
  • inputs (pd.DataFrame, optional) – The inputs to the model

  • responses (pd.DataFrame, optional) – The responses for the model

  • scale_data (bool, optional) – Whether to scale the data before adding to the model

  • scale_run_data (bool, optional) – Whether to scale the data for each train/test set

  • fit_intercept (bool, optional) – Whether to fit an intercept term for the model

  • response_key (str, optional) – The column to pick from the responses

  • drop_duplicates ({'no', 'yes', 'averages'}, optional) – What to do with duplicates in the inputs

  • input_selector (List, optional) – What columns from the input data to select

Returns:

The generated model

Return type:

doenut.models.AveragedModel