`doenut.models`

A model is an approximation of a system, built by fitting equations to a dataset.

Currently, this module provides four classes - two model classes and two model grouping classes.

Model is a basic model fitted via linear regression. It is a fairly thin layer over sklearn’s linear regression model and the base for the rest of the code.

AveragedModel is the more useful one. As well as generating a normal model fitted to all of the data, it also generates a set of models fitted via a leave-one-out methodology. This allows it to calculate the Q2 cross-validation coefficient.
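The leave-one-out idea can be sketched with plain NumPy. This is not doenut's implementation: the helper names (`fit_linear`, `predict`, `q2_leave_one_out`) are hypothetical, and Q2 is assumed to follow the conventional definition 1 - PRESS / SS_total, which may differ in detail from what AveragedModel computes.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients to a 2D array of inputs."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def r2(X, y):
    """R2 of one model fitted on, and scored against, all of the data."""
    residuals = y - predict(fit_linear(X, y), X)
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

def q2_leave_one_out(X, y):
    """Assumed definition: Q2 = 1 - PRESS / SS_total, one run left out at a time."""
    press = 0.0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i              # every run except run i
        coef = fit_linear(X[keep], y[keep])        # fit without run i
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic data: a linear response in two inputs plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(12, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=12)

r2_value = r2(X, y)
q2_value = q2_leave_one_out(X, y)
print(f"R2 = {r2_value:.3f}, Q2 = {q2_value:.3f}")
```

With this definition Q2 can never exceed R2 for a least-squares fit, because each held-out residual is at least as large as the corresponding in-sample residual. A large gap between the two is the usual sign of overfitting.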

ModelSet can be used when you have multiple columns/parameters in your response data. It will generate one Model for each response column. Similarly, AveragedModelSet will generate one AveragedModel for each response column.
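As a rough illustration of "one model per response column", the sketch below fits a separate least-squares model for each named response using NumPy only. It does not use doenut; the `fit_linear` helper, the column names and the dictionary layout are all assumptions made for the example (in doenut the responses would be columns of a DataFrame).

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Four runs of a two-factor design (columns: a, b).
inputs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Two response columns measured for the same runs (hypothetical names).
responses = {
    "yield": np.array([1.0, 3.0, 2.0, 4.0]),   # 1 + 2*a + 1*b
    "purity": np.array([5.0, 4.0, 5.0, 4.0]),  # 5 - 1*a + 0*b
}

# One independent model per response column, keyed by column name.
models = {name: fit_linear(inputs, column) for name, column in responses.items()}

for name, coef in models.items():
    print(name, np.round(coef, 3))
```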

Package Contents

Classes

`Model`	A simple linear regression model.
`AveragedModel`	Model scored as the average of multiple models generated from a single set of inputs via a leave-one-out approach.
`ModelSet`	Class to train and hold a group of related models.
`AveragedModelSet`	Class to train and hold a group of related (averaged) models.

class doenut.models.Model(data: doenut.data.data_set.DataSet, fit_intercept: bool)

A simple linear regression model.

Note

This class mostly exists as a base - you probably want AveragedModel

Parameters:

data (doenut.data.DataSet) – The inputs and responses for this model
fit_intercept (bool) – Whether to fit an intercept term

get_predictions_for(inputs: pandas.DataFrame) → pandas.DataFrame

Generates the predictions of the model for a set of inputs

Parameters: inputs (pd.DataFrame) – The inputs to test against.
Returns: the predictions from the model
Return type: pd.DataFrame

get_r2_for(data: doenut.data.data_set.DataSet)

Calculate the R2 Pearson coefficient for a given pairing of inputs and responses.

Parameters: data (doenut.data.DataSet) – The data to test.
Returns: The calculated R2 value as a float
Return type: float
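For reference, a minimal sketch of an R2 calculation from observed and predicted responses is given here. It assumes the usual coefficient of determination, 1 - SS_residual / SS_total, which is also what sklearn's `LinearRegression.score` returns. Whether `get_r2_for` uses exactly this formula is an assumption, and `r2_score` below is a hypothetical local helper rather than a doenut function.

```python
import numpy as np

def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]

# Predictions that match exactly score 1.
perfect = r2_score(observed, [1.0, 2.0, 3.0, 4.0])

# Always predicting the mean of the observations scores 0.
mean_only = r2_score(observed, [2.5, 2.5, 2.5, 2.5])

print(perfect, mean_only)
```

For a least-squares fit with an intercept, scored on its own training data, this value equals the squared Pearson correlation between observed and predicted responses. On separate validation data the two can differ, and R2 can become negative.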

class doenut.models.AveragedModel(data: doenut.data.modifiable_data_set.ModifiableDataSet, scale_data: bool = True, scale_run_data: bool = True, fit_intercept: bool = True, response_key: str = None, drop_duplicates: str = 'yes')

Bases: doenut.models.model.Model

Model scored as the average of multiple models generated from a single set of inputs via a leave-one-out approach.

Parameters:

data (doenut.data.ModifiableDataSet) – the data to run / test against.
scale_data (bool, default True) – Whether to scale the overall data before running it.
scale_run_data (bool, default True) – Whether to normalise the data for each run
fit_intercept (bool, default True) – Whether to fit an intercept term
response_key (str, optional) – for multi-column responses, which one to test on
drop_duplicates ({'yes', 'drop', 'average'}) – whether to drop duplicate values or not. May also be ‘average’ which will cause them to be dropped, but the one left will have its response value(s) set to the average of all the duplicates.
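The 'average' behaviour described for drop_duplicates can be pictured with the following pure-Python sketch. It is not doenut's code: `average_duplicates` is a hypothetical helper, and it assumes that duplicates are rows with identical inputs and that the first occurrence is the one kept.

```python
def average_duplicates(inputs, responses):
    """Collapse rows with identical inputs, averaging their responses.

    Keeps one row per distinct set of inputs, in order of first appearance.
    """
    groups = {}
    for row, response in zip(inputs, responses):
        # Rows are compared as tuples so they can be used as dictionary keys.
        groups.setdefault(tuple(row), []).append(response)
    unique_inputs = [list(key) for key in groups]
    averaged = [sum(values) / len(values) for values in groups.values()]
    return unique_inputs, averaged

inputs = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]]   # first and last rows are duplicates
responses = [10.0, 20.0, 30.0]

unique_inputs, averaged = average_duplicates(inputs, responses)
print(unique_inputs)
print(averaged)
```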

classmethod tune_model(data: doenut.data.modifiable_data_set.ModifiableDataSet, fit_intercept: bool = True, response_key: str = None, drop_duplicates: str = 'yes') → Tuple[AveragedModel, AveragedModel]

Generate a pair of models from the same set of data: one using scaled data, the other unscaled.

The scaled model can then be used for determining which columns to drop for later models, and the unscaled model for checking the model’s performance against validation data (or just for using once done).

Parameters:

data (doenut.data.ModifiableDataSet) – The dataset to test against. This should be unscaled.
fit_intercept (bool, default True) – Whether to fit the intercept or not (usually yes)
response_key (str, optional) – If there is more than one response column, which to use.
drop_duplicates ({'yes', 'drop', 'average'}) – whether to drop duplicate values or not. May also be ‘average’ which will cause them to be dropped, but the one left will have its response value(s) set to the average of all the duplicates.

Returns:

AveragedModel – The generated scaled model
AveragedModel – The generated unscaled model
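Why a scaled model is the better guide for dropping columns can be shown with a small NumPy sketch. It does not call `tune_model`. The data, the `fit_linear` helper and the choice of standardisation (zero mean, unit variance) are assumptions made for illustration, and doenut's own scaling may differ.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
temperature = rng.uniform(20, 120, size=30)     # wide numeric range
loading = rng.uniform(0.01, 0.05, size=30)      # narrow numeric range
X = np.column_stack([temperature, loading])
y = 0.05 * temperature + 20.0 * loading + rng.normal(scale=0.05, size=30)

# Unscaled fit: coefficients are in "response per raw unit" and not comparable.
unscaled = fit_linear(X, y)[1:]

# Scaled fit: every input has unit variance, so coefficient sizes are comparable.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = fit_linear(X_scaled, y)[1:]

print("unscaled coefficients:", np.round(unscaled, 3))
print("scaled coefficients:  ", np.round(scaled, 3))
```

In the unscaled fit the loading coefficient is numerically far larger, yet after scaling temperature is seen to drive most of the variation. The unscaled coefficients remain the ones to use for predictions in real units.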

class doenut.models.ModelSet(default_inputs=None, default_responses=None, default_scale_data=True, default_fit_intercept=True)

Class to train and hold a group of related models. When constructing the ModelSet, you can define default values. Then when adding a new model to the set you only have to specify the parameters which differ from the default.

Note

This class mostly exists as a base - you probably want AveragedModelSet

Parameters:

default_inputs (pd.DataFrame, optional) – The default inputs to the model
default_responses (pd.DataFrame, optional) – The default responses for the model
default_scale_data (bool, optional) – Whether to scale the data before adding to the model by default
default_fit_intercept (bool, optional) – Whether to fit an intercept term by default

_validate_value(name: str, value: Any = None) → Any

add_model(inputs: pandas.DataFrame = None, responses: pandas.DataFrame = None, scale_data: bool = None, fit_intercept: bool = None)

Builds and adds a model to the set. For each parameter not specified, the default will be used instead.

Parameters:

inputs (pd.DataFrame, optional) – The inputs to the model
responses (pd.DataFrame, optional) – The responses for the model
scale_data (bool, optional) – Whether to scale the data before adding to the model
fit_intercept (bool, optional) – Whether to fit an intercept term

Returns:

The generated model

Return type:

doenut.models.Model
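The defaults pattern described for ModelSet can be sketched as follows. `DefaultingSet` and its `_resolve` method are hypothetical stand-ins written for this example, not doenut's classes. In particular, raising `ValueError` when neither a value nor a default is available is an assumption, and the "model" stored here is just a dictionary of resolved settings.

```python
class DefaultingSet:
    """Hypothetical illustration of per-set defaults; not doenut's ModelSet."""

    def __init__(self, default_inputs=None, default_responses=None,
                 default_fit_intercept=True):
        self.defaults = {
            "inputs": default_inputs,
            "responses": default_responses,
            "fit_intercept": default_fit_intercept,
        }
        self.models = []

    def _resolve(self, name, value=None):
        # "is not None" rather than truthiness, so False and empty or
        # DataFrame-like values are still treated as explicit choices.
        if value is not None:
            return value
        default = self.defaults.get(name)
        if default is None:
            raise ValueError(f"No value or default given for '{name}'")
        return default

    def add_model(self, inputs=None, responses=None, fit_intercept=None):
        settings = {
            "inputs": self._resolve("inputs", inputs),
            "responses": self._resolve("responses", responses),
            "fit_intercept": self._resolve("fit_intercept", fit_intercept),
        }
        self.models.append(settings)   # a real set would fit a model here
        return settings

# Shared inputs are given once; each model only states what differs.
model_set = DefaultingSet(default_inputs=[[0.0], [1.0], [2.0]])
first = model_set.add_model(responses=[1.0, 3.0, 5.0])
second = model_set.add_model(responses=[0.0, 2.0, 4.0], fit_intercept=False)
```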

get_r2s()

Get the Pearson R2 values for the models in the set

Returns: The R2 value for each model in the set.
Return type: List[float]

get_attributes(attribute: str) → List[Any]

Get a specified attribute from each model. Some attributes live on the model itself and others on the inner sklearn model, so both are checked.

Parameters: attribute (str) – The attribute you want from the model
Returns: A list of the value of that attribute for each model in the set.
Return type: List[Any]
Raises: ValueError – If the attribute is not present in either the model or the inner sklearn model.

Note

If the attribute exists in both the model and the sklearn model, the model attribute will be the one returned.
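The lookup order described above can be sketched in a few lines. The objects here are `SimpleNamespace` stand-ins rather than real doenut or sklearn models, and the attribute name `inner` for the wrapped sklearn model is hypothetical. `coef_` is used because it is a real attribute of sklearn's fitted linear models.

```python
from types import SimpleNamespace

def get_attributes(models, attribute):
    """Collect one attribute per model, preferring the outer model's value."""
    values = []
    for model in models:
        if hasattr(model, attribute):
            values.append(getattr(model, attribute))        # outer model wins
        elif hasattr(model.inner, attribute):
            values.append(getattr(model.inner, attribute))  # fall back to inner
        else:
            raise ValueError(
                f"Attribute '{attribute}' not found in model or inner model"
            )
    return values

# Stand-in models: 'r2' exists on both levels, 'coef_' only on the inner one.
models = [
    SimpleNamespace(r2=0.95, inner=SimpleNamespace(coef_=[2.0, -1.0], r2=0.0)),
    SimpleNamespace(r2=0.80, inner=SimpleNamespace(coef_=[0.5, 0.3], r2=0.0)),
]

r2s = get_attributes(models, "r2")
coefficients = get_attributes(models, "coef_")
print(r2s, coefficients)
```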

class doenut.models.AveragedModelSet(default_inputs: pandas.DataFrame = None, default_responses: pandas.DataFrame = None, default_scale_data: bool = True, default_scale_run_data: bool = True, default_fit_intercept: bool = True, default_response_key: list = [0], default_drop_duplicates: str = 'yes', default_input_selector: list = [])

Bases: doenut.models.model_set.ModelSet

Class to train and hold a group of related (averaged) models. When constructing the AveragedModelSet, you can define default values. Then when adding a new model to the set you only have to specify the parameters which differ from the default.

Parameters:

default_inputs (pd.DataFrame, optional) – The default inputs to the model
default_responses (pd.DataFrame, optional) – The default responses for the model
default_scale_data (bool, optional) – Whether to scale the data before adding to the model by default
default_scale_run_data (bool, optional) – Whether to scale the data for each train/test set by default
default_fit_intercept (bool, optional) – Whether to fit an intercept term by default
default_response_key (str, optional) – The default column to pick from the responses
default_drop_duplicates ({'no', 'yes', 'averages'}, optional) – What to do with duplicates in the inputs, by default
default_input_selector (List, optional) – What columns from the input data to select by default

classmethod multiple_response_columns(inputs: pandas.DataFrame = None, responses: pandas.DataFrame = None, scale_data: bool = True, scale_run_data: bool = True, fit_intercept: bool = True, drop_duplicates: str = 'yes', input_selector: list = []) → AveragedModelSet

add_model(inputs=None, responses=None, scale_data=None, scale_run_data=None, fit_intercept=None, response_key=None, drop_duplicates=None, input_selector=None)

Add a new AveragedModel to the set

Parameters:

inputs (pd.DataFrame, optional) – The inputs to the model
responses (pd.DataFrame, optional) – The responses for the model
scale_data (bool, optional) – Whether to scale the data before adding to the model
scale_run_data (bool, optional) – Whether to scale the data for each train/test set
fit_intercept (bool, optional) – Whether to fit an intercept term
response_key (str, optional) – The column to pick from the responses
drop_duplicates ({'no', 'yes', 'averages'}, optional) – What to do with duplicates in the inputs
input_selector (List, optional) – What columns from the input data to select

Returns:

The generated model

Return type:

doenut.models.AveragedModel
