doenut.models
DoENUT: models
A model is an approximation of a system, built by fitting various equations to a dataset.
Currently, this module provides four classes: two model classes and two model grouping classes.
Model is a basic model fitted via linear regression. It’s a fairly thin layer over sklearn’s model and the base for the rest of the code.
AveragedModel is the more useful one. As well as generating a normal model, it also generates a set of models fitted via a leave-one-out methodology. This allows it to calculate the Q2 cross-validation correlation coefficient.
ModelSet can be used when you have multiple columns/parameters in your response data. It will generate one Model for each response column. Similarly, AveragedModelSet will generate one AveragedModel for each response column.
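The leave-one-out idea behind AveragedModel can be sketched in a few lines of plain Python. This is an illustration only, not DoENUT's implementation: it assumes a single input column, uses the common definition Q2 = 1 - PRESS / SS_tot, and the helper names fit_line and q2_leave_one_out are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def q2_leave_one_out(xs, ys):
    """Assumed definition: Q2 = 1 - PRESS / SS_tot.

    Each point is predicted by a model fitted on all the other points,
    and the squared prediction errors are summed into PRESS.
    """
    y_bar = mean(ys)
    press = 0.0
    for i in range(len(xs)):
        # Leave point i out of the training data.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - press / ss_tot


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(q2_leave_one_out(xs, ys))
```

Because every point is predicted by a model that never saw it, Q2 is a stricter measure than R2 computed on the training data.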
Submodules
Package Contents
Classes
Model | A simple linear regression model.
AveragedModel | Model scored as the average of multiple models generated from a single set of inputs via a leave-one-out approach.
ModelSet | Class to train and hold a group of related models.
AveragedModelSet | Class to train and hold a group of related (averaged) models.
- class doenut.models.Model(data: doenut.data.data_set.DataSet, fit_intercept: bool)[source]
A simple linear regression model.
Note
This class mostly exists as a base; you probably want AveragedModel.
- Parameters:
data (doenut.data.DataSet) – The inputs and responses for this model
fit_intercept (bool) – Whether to fit an intercept term for the model (if False, the fit is forced through the origin)
- get_predictions_for(inputs: pandas.DataFrame) pandas.DataFrame[source]
Generates the predictions of the model for a set of inputs
- Parameters:
inputs (pd.DataFrame) – The inputs to test against.
- Returns:
the predictions from the model
- Return type:
pd.DataFrame
- get_r2_for(data: doenut.data.data_set.DataSet)[source]
Calculate the R2 Pearson coefficient for a given pairing of inputs and responses.
- Parameters:
data (doenut.data.DataSet) – The data to test.
- Returns:
The calculated R2 value as a float
- Return type:
float
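As a rough guide to what an R2 value measures, the sketch below computes the usual coefficient of determination from observed and predicted responses. It assumes that definition (1 - SS_res / SS_tot); whether DoENUT computes exactly this is not stated here, and the function name r2_score is a hypothetical stand-in.

```python
from statistics import mean


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    y_bar = mean(observed)
    # Squared error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Squared error of always predicting the mean response.
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


print(r2_score([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2_score([1, 2, 3], [2, 2, 2]))  # no better than predicting the mean
```

A value of 1 means the predictions match the responses exactly, while 0 means the model does no better than predicting the mean response.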
- class doenut.models.AveragedModel(data: doenut.data.modifiable_data_set.ModifiableDataSet, scale_data: bool = True, scale_run_data: bool = True, fit_intercept: bool = True, response_key: str = None, drop_duplicates: str = 'yes')[source]
Bases: doenut.models.model.Model
Model scored as the average of multiple models generated from a single set of inputs via a leave-one-out approach.
- Parameters:
data (doenut.data.ModifiableDataSet) – the data to run / test against.
scale_data (bool, default True) – Whether to scale the overall data before running it.
scale_run_data (bool, default True) – Whether to normalise the data for each run
fit_intercept (bool, default True) – Whether to fit an intercept term (if False, the intercept is fixed at zero)
response_key (str, optional) – for multi-column responses, which one to test on
drop_duplicates ({'yes', 'drop', 'average'}) – whether to drop duplicate values or not. May also be ‘average’ which will cause them to be dropped, but the one left will have its response value(s) set to the average of all the duplicates.
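The 'average' behaviour described for drop_duplicates can be pictured with the following plain-Python sketch. It is not DoENUT's code: it assumes inputs are given as rows of values with one response per row, and average_duplicates is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(inputs, responses):
    """Collapse rows with identical inputs, averaging their responses.

    Hypothetical helper: `inputs` is a list of rows and `responses`
    a list with one value per row.
    """
    grouped = defaultdict(list)
    for row, response in zip(inputs, responses):
        grouped[tuple(row)].append(response)
    # Dicts keep insertion order, so the first occurrence fixes the row order.
    unique_inputs = list(grouped)
    averaged = [mean(grouped[row]) for row in unique_inputs]
    return unique_inputs, averaged


rows = [(1, 2), (3, 4), (1, 2)]
values = [10.0, 5.0, 20.0]
print(average_duplicates(rows, values))
```

Here the two (1, 2) rows collapse into one whose response is the mean of 10.0 and 20.0.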
- classmethod tune_model(data: doenut.data.modifiable_data_set.ModifiableDataSet, fit_intercept: bool = True, response_key: str = None, drop_duplicates: str = 'yes') Tuple[AveragedModel, AveragedModel][source]
Generate a pair of models from the same set of data: one using scaled data, the other unscaled.
The scaled model can then be used for determining which columns to drop for later models, and the unscaled model for checking the model’s performance against validation data (or just for using once done).
- Parameters:
data (doenut.data.ModifiableDataSet) – The dataset to test against. This should be unscaled.
fit_intercept (bool, default True) – Whether to fit the intercept or not (usually yes)
response_key (str, optional) – If there is more than one response column, which one to use.
drop_duplicates ({'yes', 'drop', 'average'}) – whether to drop duplicate values or not. May also be ‘average’ which will cause them to be dropped, but the one left will have its response value(s) set to the average of all the duplicates.
- Returns:
AveragedModel – The generated scaled model
AveragedModel – The generated unscaled model
- class doenut.models.ModelSet(default_inputs=None, default_responses=None, default_scale_data=True, default_fit_intercept=True)[source]
Class to train and hold a group of related models. When constructing the ModelSet, you can define default values. Then when adding a new model to the set you only have to specify the parameters which differ from the default.
Note
This class mostly exists as a base; you probably want AveragedModelSet.
- Parameters:
default_inputs (pd.DataFrame, optional) – The default inputs to the model
default_responses (pd.DataFrame, optional) – The default responses for the model
default_scale_data (bool, optional) – Whether to scale the data before adding to the model by default
default_fit_intercept (bool, optional) – Whether to fit the model’s intercept to the axis by default
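The default-and-override behaviour of the model set can be sketched as below. This is a minimal illustration of the pattern, assuming that None means "use the default"; the class DefaultsHolder and its dictionary of settings are hypothetical and stand in for real model construction.

```python
class DefaultsHolder:
    """Hypothetical stand-in showing defaults with per-call overrides."""

    def __init__(self, default_scale_data=True, default_fit_intercept=True):
        self.default_scale_data = default_scale_data
        self.default_fit_intercept = default_fit_intercept
        self.models = []

    def add_model(self, scale_data=None, fit_intercept=None):
        # None means "not specified": fall back to the set-wide default.
        if scale_data is None:
            scale_data = self.default_scale_data
        if fit_intercept is None:
            fit_intercept = self.default_fit_intercept
        # A real set would build a model here; this sketch records settings.
        settings = {"scale_data": scale_data, "fit_intercept": fit_intercept}
        self.models.append(settings)
        return settings


holder = DefaultsHolder(default_scale_data=False)
print(holder.add_model(fit_intercept=False))
print(holder.add_model(scale_data=True))
```

Comparing against None rather than testing truthiness matters here, since False is a legitimate explicit value for both flags.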
- add_model(inputs: pandas.DataFrame = None, responses: pandas.DataFrame = None, scale_data: bool = None, fit_intercept: bool = None)[source]
Builds and adds a model to the set. For each parameter not specified, the defaults will be used instead.
- Parameters:
inputs (pd.DataFrame, optional) – The inputs to the model
responses (pd.DataFrame, optional) – The responses for the model
scale_data (bool, optional) – Whether to scale the data before adding to the model
fit_intercept (bool, optional) – Whether to fit the model’s intercept to the axis
- Returns:
The generated model
- Return type:
- get_r2s()[source]
Get the Pearson R2 values for the models in the set
- Returns:
The R2 value for each model in the set.
- Return type:
List[float]
- get_attributes(attribute: str) List[Any][source]
Get a specified attribute from each model. Frustratingly, some are in the model, others in the sklearn model.
- Parameters:
attribute (str) – The attribute you want from the model
- Returns:
A list of the value of that attribute for each model in the set.
- Return type:
List[Any]
- Raises:
ValueError – If the attribute is not present in either the model or the inner sklearn model.
Note
If the attribute exists in both the model and the sklearn model, the model attribute will be the one returned.
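The lookup order described in the note can be illustrated with a short sketch. It assumes the wrapped sklearn model is reachable through an attribute called model, which is an assumption made for this example, and the function names are hypothetical rather than DoENUT's own.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel, so that None remains a valid attribute value


def get_attribute(model, attribute, inner_name="model"):
    """Look on the wrapper first, then on the inner model (assumed name)."""
    value = getattr(model, attribute, _MISSING)
    if value is not _MISSING:
        return value
    inner = getattr(model, inner_name, None)
    value = getattr(inner, attribute, _MISSING)
    if value is not _MISSING:
        return value
    raise ValueError(f"attribute {attribute!r} not found on model or inner model")


def get_attributes(models, attribute):
    """Collect one attribute from every model in a collection."""
    return [get_attribute(m, attribute) for m in models]


# Stand-in objects for the demonstration; not real DoENUT or sklearn models.
inner = SimpleNamespace(coef_=[1.0], shared="inner")
wrapper = SimpleNamespace(r2=0.9, shared="outer", model=inner)
print(get_attributes([wrapper], "shared"))
```

Checking the wrapper before the inner model is what makes the wrapper's value win when both define the same attribute.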
- class doenut.models.AveragedModelSet(default_inputs: pandas.DataFrame = None, default_responses: pandas.DataFrame = None, default_scale_data: bool = True, default_scale_run_data: bool = True, default_fit_intercept: bool = True, default_response_key: list = [0], default_drop_duplicates: str = 'yes', default_input_selector: list = [])[source]
Bases: doenut.models.model_set.ModelSet
Class to train and hold a group of related (averaged) models. When constructing the AveragedModelSet, you can define default values. Then when adding a new model to the set you only have to specify the parameters which differ from the default.
- Parameters:
default_inputs (pd.DataFrame, optional) – The default inputs to the model
default_responses (pd.DataFrame, optional) – The default responses for the model
default_scale_data (bool, optional) – Whether to scale the data before adding to the model by default
default_scale_run_data (bool, optional) – Whether to scale the data for each train/test set by default
default_fit_intercept (bool, optional) – Whether to fit the model’s intercept to the axis by default
default_response_key (str, optional) – The default column to pick from the responses
default_drop_duplicates ({'no', 'yes', 'averages'}, optional) – What to do with duplicates in the inputs, by default
default_input_selector (List, optional) – What columns from the input data to select by default
- classmethod multiple_response_columns(inputs: pandas.DataFrame = None, responses: pandas.DataFrame = None, scale_data: bool = True, scale_run_data: bool = True, fit_intercept: bool = True, drop_duplicates: str = 'yes', input_selector: list = []) AveragedModelSet[source]
- add_model(inputs=None, responses=None, scale_data=None, scale_run_data=None, fit_intercept=None, response_key=None, drop_duplicates=None, input_selector=None)[source]
Add a new AveragedModel to the set
- Parameters:
inputs (pd.DataFrame, optional) – The inputs to the model
responses (pd.DataFrame, optional) – The responses for the model
scale_data (bool, optional) – Whether to scale the data before adding to the model
scale_run_data (bool, optional) – Whether to scale the data for each train/test set
fit_intercept (bool, optional) – Whether to fit the model’s intercept to the axis
response_key (str, optional) – The column to pick from the responses
drop_duplicates ({'no', 'yes', 'averages'}, optional) – What to do with duplicates in the inputs
input_selector (List, optional) – What columns from the input data to select
- Returns:
The generated model
- Return type: