LECA.fit.WorkFlow

class LECA.fit.WorkFlow(data: DataFrame, features: str | List[str], objective_list: str | List[str], random_state: int | None = None, n_jobs: int | None = -1, polynomial_degree: int = 3, validation_holdout: int | float = 0, composition_features: str | List[str] | None = None)

Bases: object

Fundamental object for training/optimizing LECA regression models, tracking their performance via cross-validation and bootstrapping datasets for error-estimation.

Parameters:
  • data (pd.DataFrame) – DataFrame containing full feature set and objective functions.

  • features (Union[str, List[str]]) – str or List[str] enumerating the features to be used by the regression models. Must match column names in the data DataFrame.

  • objective_list (Union[str, List[str]]) – str or List[str] enumerating the objective functions (i.e. target values) for the regression task. Must match column names in the data DataFrame.

  • random_state (Optional[int]) –

    Sets the random_state parameter for any stochastic process used in the regression models for reproducibility.

    Default value None.

  • polynomial_degree (int) –

    Sets the maximum degree for the polynomial features generated for linear regression (see sklearn.preprocessing.PolynomialFeatures).

    Default value 3.

  • validation_holdout (Union[int, float]) –

    Select number of datapoints to hold out as a validation set (unseen by regression models). An int holdout declares an explicit number of datapoints to exclude, while a float is then the fraction of the total dataset to reserve for validation.

    Default value 0.

  • composition_features (Optional[Union[str, List[str]]]) –

    str or List[str] defining which features characterize a unique composition. This parameter is for test/train/validation split grouping purposes. For example, an electrolyte would have salt / solvent / additive concentrations as composition features, but temperature would be exempt. By defining the composition_features as the component concentrations we can then group each unique composition and split test/train/validation sets to avoid overfitting by including an identical electrolyte composition in the training set at only a slightly different temperature than in the test/validation sets.

    Default value None.
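The int/float convention for validation_holdout can be illustrated with a small standalone sketch. The helper name resolve_holdout and the choice to round fractional counts down are assumptions made for illustration; they are not taken from the LECA source.

```python
import math


def resolve_holdout(n_total, validation_holdout):
    """Translate an int/float holdout setting into a number of rows (hypothetical helper)."""
    if isinstance(validation_holdout, float):
        # A float is read as the fraction of the dataset to reserve.
        if not 0.0 <= validation_holdout < 1.0:
            raise ValueError("a float holdout must lie in [0, 1)")
        # Assumption: fractional row counts are rounded down.
        return math.floor(validation_holdout * n_total)
    # An int is read as an explicit number of datapoints.
    if not 0 <= validation_holdout <= n_total:
        raise ValueError("an int holdout must lie in [0, n_total]")
    return validation_holdout


print(resolve_holdout(200, 25))    # explicit count
print(resolve_holdout(200, 0.25))  # fraction of the dataset
print(resolve_holdout(200, 0))     # default: no validation set
```

When composition_features is set, the actual split is additionally grouped by composition, so the realized holdout size may differ from this simple count.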

supported_models

List of supported regression model names (i.e. regr_name in add_regr()).

Type                        Name
Neural Network              "nn"
Random Forest               "rf"
Linear Reg.                 "poly_lr"
Lasso Lin. Reg.             "lasso_lr"
Ridge Lin. Reg.             "ridge_lr"
GPR - Iso RBF               "gpr_RBF_iso"
GPR - Aniso RBF             "gpr_RBF_aniso"
GPR - Iso Matern            "gpr_Matern_iso"
GPR - Aniso Matern          "gpr_Matern_aniso"
GPR - Rational Quadratic    "gpr_RQ_iso"
GPR - Custom Kernel         "gpr_custom"

Type:

List[str]

supported_metrics

List of supported scoring metrics for regression models. Currently supported metrics are: “r2”, “MAE”, “MSE”, and “time”.

Type:

List[str]

features

List[str] of the features used by the regression models.

Type:

List[str]

objective_list

List[str] of the objective functions for this WorkFlow instance.

Type:

List[str]

X

DataFrame sliced from the DataFrame provided during WorkFlow initialization. Columns are the declared features, standardized to zero mean and unit variance (uses: scikit-learn StandardScaler); each row corresponds to a measured objective function value.

Type:

pd.DataFrame
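As a rough illustration of the rescaling applied to X, the following standalone sketch standardizes a single feature column without scikit-learn. The function name standardize and the sample values are hypothetical; the population standard deviation (ddof=0) is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Shift a column to zero mean and scale it to unit variance."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation (ddof=0)
    return [(value - mu) / sigma for value in column]


temperatures = [10.0, 20.0, 30.0, 40.0]  # hypothetical raw feature values
scaled = standardize(temperatures)
print(scaled)
```

Standardization only shifts and rescales the values; the shape of the feature's distribution is unchanged.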

poly_X

DataFrame of mixed polynomial features built from the X DataFrame. All polynomial combinations up to a max degree of polynomial_degree are generated. (uses: scikit-learn PolynomialFeatures)

Type:

pd.DataFrame
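The idea of mixed polynomial features can be sketched by enumerating the exponent tuples directly. This is a standalone illustration, not the code path used by LECA or scikit-learn; the function names are hypothetical, and leaving out the constant (degree 0) term is an assumption.

```python
import math
from itertools import product


def polynomial_exponents(n_features, degree):
    """Exponent tuples of all mixed terms with total degree 1..degree."""
    combos = [
        exps
        for exps in product(range(degree + 1), repeat=n_features)
        if 1 <= sum(exps) <= degree  # assumption: constant term excluded
    ]
    # Order by total degree, then by descending power of the earlier features.
    return sorted(combos, key=lambda e: (sum(e), tuple(-p for p in e)))


def polynomial_row(row, degree):
    """Evaluate every polynomial term for one feature vector."""
    return [
        math.prod(x ** p for x, p in zip(row, exps))
        for exps in polynomial_exponents(len(row), degree)
    ]


print(polynomial_exponents(2, 2))
print(polynomial_row((2.0, 3.0), 2))
```

Each exponent tuple lists the power applied to each input feature, which is the same information the "Poly Degrees" columns of poly_lr_coefs() report.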

X_unscaled

DataFrame of the raw input features, analogous to the X attribute but unscaled.

Type:

pd.DataFrame

X_validate

DataFrame of the raw input features set aside for the validation set, analogous to X_unscaled (unseen by the ML models and excluded from X, X_unscaled and poly_X).

Type:

pd.DataFrame

y, y_validate, std, std_validate

Grouped for brevity; analogous to the X DataFrame set. Each attribute is a DataFrame storing either the objective function values (rows corresponding to the rows of the X... DataFrames) or their standard deviations (available if measurements were repeated and combined, so that the initialization DataFrame has columns named <objective function name>_std). The ..._validate attributes hold the values split from the main set and withheld for validation purposes. y and std support multiple objective functions.

Type:

pd.DataFrame

_random_state

Optional value declared at WorkFlow initialization and passed to any functions/methods called by the WorkFlow which accept a ‘random_state’.

Type:

Optional[int]

_n_jobs

Optional value declared at WorkFlow initialization and passed to any functions/methods called by the WorkFlow which accept a parallelization parameter to use n cores. Note -1 will use all available cores. None uses only one core.

Type:

Optional[int]

results

Dictionary with a key for each objective function declared at initialization. This dictionary stores the ML models trained for the corresponding objective function and their performance scores.

Type:

Dict

__init__(data: DataFrame, features: str | List[str], objective_list: str | List[str], random_state: int | None = None, n_jobs: int | None = -1, polynomial_degree: int = 3, validation_holdout: int | float = 0, composition_features: str | List[str] | None = None) None

RBMAL(uncertainty_fn: Callable | None = None, pool: DataFrame | None = None, batch_size: int = 10, target_uncertainty: float | None = 0) Tuple[List[int], DataFrame]

Method to apply Ranked Batch Mode Active Learning approach to return recommended queries.

Adapted from modAL RBMAL implementation https://github.com/modAL-python/modAL/blob/7f72997b6dc26e8fe063b90d409c7cfcf4ef418e/modAL/batch.py

Based on RBMAL approach proposed by Cardoso et al. https://www.sciencedirect.com/science/article/pii/S0020025516313949

Parameters:
  • uncertainty_fn (Optional[Callable]) –

    Optional function which takes the query pool and returns a DataFrame of uncertainty values analogous to the model’s prediction uncertainty. If None the prediction uncertainty of the workflow’s highest scoring model is used. In the case of multiple objective functions, the automatic uncertainty calculation will take the highest estimated uncertainty from the predictions for each objective function.

    Default value None

  • pool (Optional[pd.DataFrame]) –

    DataFrame of candidate points in feature space to query. If None is passed, the generate_bins method is used to automatically generate a grid in feature space.

    Default value None

  • batch_size (int) –

    Number of points to query.

    Default value 10

  • target_uncertainty (Optional[float]) –

    Sets lower bound on model uncertainty. If any prediction is estimated to have an uncertainty lower than this value that space is considered well defined and excluded from the candidate points to query.

    Default value 0

Returns:

List of integer indices and a DataFrame of the ranked points in feature space, ordered from highest to lowest priority for further training the ML models.

Return type:

Tuple[List[int], pd.DataFrame]
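The ranking idea behind ranked batch-mode selection can be sketched in plain Python. This is a simplified reading of the scoring rule (a weighted sum of dissimilarity to already covered points and model uncertainty), not the LECA or modAL implementation. The function name ranked_batch, the 1 / (1 + distance) similarity and the way target_uncertainty filters candidates are assumptions for illustration.

```python
import math


def ranked_batch(pool, labeled, uncertainty, batch_size, target_uncertainty=0.0):
    """Return pool indices ordered from highest to lowest query priority."""
    covered = list(labeled)  # points already measured or selected (must be non-empty)
    # Assumption: candidates below the target uncertainty are dropped up front.
    remaining = [i for i in range(len(pool)) if uncertainty[i] >= target_uncertainty]
    chosen = []
    for _ in range(min(batch_size, len(remaining))):
        # Weight shifts from diversity to uncertainty as the pool shrinks.
        alpha = len(remaining) / (len(remaining) + len(covered))
        best_idx, best_score = None, -math.inf
        for i in remaining:
            nearest = min(math.dist(pool[i], x) for x in covered)
            similarity = 1.0 / (1.0 + nearest)
            score = alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty[i]
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
        remaining.remove(best_idx)
        covered.append(pool[best_idx])  # a selected point counts as covered
    return chosen


pool = [(0.0,), (0.1,), (5.0,)]
ranking = ranked_batch(pool, labeled=[(0.0,)], uncertainty=[0.2, 0.2, 0.2], batch_size=2)
print(ranking)
```

With equal uncertainties, the point farthest from the existing data is ranked first, and each selected point then discourages picking its close neighbours.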

add_regr(fit_name: str, regr_name: str, objective_funcs: str | List[str] | None = None, **hyperparameters) None

Adds a regression model to the WorkFlow. See **kwargs parameter description below for further info on LECA specific hyperparameters infer_alpha, polynomials and degree.

Parameters:
  • fit_name (str) – Unique identifying string for regression model. If the fit_name already exists for the enumerated objective_funcs it/they will be overwritten by calling this function.

  • regr_name (str) – Regression model type. Must be a supported model (see LECA.WorkFlow.supported_models).

  • objective_funcs (Optional[Union[str, List[str]]]) – Objective function(s) regression model should predict. The string or list of strings given must match the name(s) of an enumerated objective function during the creation of the WorkFlow object. If None a regression model for each objective function loaded into the workflow will be created.

  • **hyperparameters (kwargs) –

    Kwargs hyperparameters to be passed on when creating the regression model objects. Aside from the LECA specific model hyperparameters listed below, hyperparameters are passed on to the scikit-learn model object and the user is directed to the relevant scikit-learn docs.

    infer_alpha – Boolean hyperparameter for GPR models

    Hyperparameter for GPR type models. If False, the models are constructed directly as scikit-learn objects with the specified kernel (+ WhiteKernel). If True, LECA uses the extended GPR model class AlphaGPR. This is essentially identical to the standard scikit-learn GPR model, but it takes two objective functions when training: y and y_std. y_std is then stored as the alpha array (representing training data biases) and the model is trained on the objective values stored in the y array. AlphaGPR models are trained with no white noise in the kernel.

    polynomials – List[int] hyperparameter for polynomial linear regression models (poly_lr)

    LECA has its own modified implementation for polynomial regression (see LECA.estimators.PolynomialRegression) which allows the user to specify which polynomials (as a list of indices corresponding to polynomial features in WorkFlow.poly_X) are used.

    degree – int hyperparameter for polynomial linear regression models (poly_lr)

    Accepts an integer value which will automatically include all polynomials less than or equal to the given degree (i.e. autofills the polynomials indices).

Return type:

None

arrhenius_cross_validate(original_objective: str, df: DataFrame, beta_0: float, models: List[str] | None = None, arrhenius_objectives: List[str] | None = ['S0', 'S1', 'S2'], save_loc: bool | str = False, show_title: bool = False, log: bool = True, deviate_by_salt: bool = True, custom_label=None, highlight_extrema=False) Dict[str, float]

Output validation plots and prediction error scores for regression models after back-transforming Arrhenius surrogate model to original objective function.

Uses the WorkFlow’s CV folds to score the performance of each CV-trained model on its corresponding test set.

Parameters:
  • original_objective (str) – String name of pre-Arrhenius surrogate model objective function.

  • df (pd.DataFrame) – DataFrame of pre-Arrhenius surrogate model measurement data.

  • beta_0 (float) – beta_0 value used in Arrhenius fits. See: prep.arrhenius() for more information.

  • models (Optional[List[str]]) –

    Optionally define which models (by string name identifier) to use to predict the Arrhenius coefficients. The list ordering matches arrhenius_objectives. If None, the best scoring model (MSE) is used by default.

    Default value None

  • arrhenius_objectives (Optional[List[str]]) –

    List of the three objective function names passed to the Arrhenius function, in the form ['S0', 'S1', 'S2'].

    Default value ['S0', 'S1', 'S2']

  • save_loc (Union[bool, str]) –

    Name under which to save the plot (if desired). If False, the plot is only shown, not saved.

    Saving filename convention is: save_loc + objective function + ‘-arrhenius-cross-validate.pdf’

  • log (Optional[bool]) –

    Whether to compare to logarithmic conductivity.

    Default value True

  • deviate_by_salt (Optional[bool]) –

    Whether log(conductivity/x_LiSalt) should be plotted

    Default value True

Returns:

Dictionary of mean CV scores with the following keys (sem: standard error of the mean):

"r2_train", "r2_train_sem", "MAE_train", "MAE_train_sem", "MSE_train", "MSE_train_sem", "RMSE_train", "RMSE_train_sem", "r2_test", "r2_test_sem", "MAE_test", "MAE_test_sem", "MSE_test", "MSE_test_sem", "RMSE_test", "RMSE_test_sem"

Return type:

Dict[str, float]

arrhenius_validate(original_objective: str, df: DataFrame, beta_0: float, models: List[str] | None = None, arrhenius_objectives: List[str] | None = ['S0', 'S1', 'S2'], save_loc: bool | str = False, log: bool = True, deviate_by_salt: bool = True, show_title: bool = True, custom_label=None) Dict[str, float]

Output validation plots and prediction error scores for regression models after back-transforming Arrhenius surrogate model to original objective function.

Uses the WorkFlow’s validation holdout dataset to identify validation-set compositions in the passed DataFrame.

Parameters:
  • original_objective (str) – String name of pre-Arrhenius surrogate model objective function.

  • df (pd.DataFrame) – DataFrame of pre-Arrhenius surrogate model measurement data.

  • beta_0 (float) – beta_0 value used in Arrhenius fits. See: prep.arrhenius() for more information.

  • models (Optional[List[str]]) –

    Optionally define which models (by string name identifier) to use to predict the Arrhenius coefficients. The list ordering matches arrhenius_objectives. If None, the best scoring model (MSE) is used by default.

    Default value None

  • arrhenius_objectives (Optional[List[str]]) –

    List of the three objective function names passed to the Arrhenius function, in the form ['S0', 'S1', 'S2'].

    Default value ['S0', 'S1', 'S2']

  • save_loc (Union[bool, str]) –

    Name under which to save the plot (if desired). If False, the plot is only shown, not saved.

    Saving filename convention is: save_loc + objective function + ‘-arrhenius-validate.pdf’

  • log (Optional[bool]) –

    Whether to compare to logarithmic conductivity.

    Default value True

  • deviate_by_salt (Optional[bool]) –

    Whether log(conductivity/x_LiSalt) should be plotted

    Default value True

Returns:

Dictionary of model accuracy scores with keys:

"r2_train", "r2_test", "MAE_train", "MAE_test", "MSE_train", "MSE_test", "RMSE_train", "RMSE_test"

Return type:

Dict[str, float]

autoML(k_fold: int = 5, verbose=True) None

Train a set of baseline models on the dataset and output info on best performer.

The set of baseline models are chosen depending on the number and shape of the input data:

if dataset_size < 500:

“iRBF”, “iMatern” (read: GPR with isotropic kernel)

if input feature dimensions > 1, additionally use: “aRBF”, “aMatern” (read: GPR with anisotropic kernel)

if the data was given with objective_std information, additionally use: “…_alpha” for all aforementioned GPR models (use AlphaGPR)

elif dataset_size < 1000:

use GPR models as above, in addition, train 3 PolynomialRegression models: “PR 1”, “PR 2”, “PR 3” (read: model trained on polynomial features up to nth degree) and train one random forest model with scikit-learn default hyperparameters: “RF”

elif dataset_size < 5000:

exclude GPR models, use aforementioned PR and RF models and, in addition, use a MLPRegressor with lbfgs solver: “NN”

else:

Use a MLPRegressor with lbfgs solver: “NN”
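The size-based selection rule above can be read as the following standalone sketch. The function name baseline_models and its arguments are hypothetical, and the exact spelling of the "_alpha" model names is an assumption based on the "…_alpha" pattern; the thresholds and model labels are taken from the description.

```python
def baseline_models(n_samples, n_features, has_objective_std):
    """List the baseline model labels trained for a given dataset shape."""
    gpr = ["iRBF", "iMatern"]                  # isotropic GPR kernels
    if n_features > 1:
        gpr += ["aRBF", "aMatern"]             # anisotropic GPR kernels
    if has_objective_std:
        # Assumption: one AlphaGPR variant per GPR model, suffixed "_alpha".
        gpr += [name + "_alpha" for name in gpr]
    poly_and_forest = ["PR 1", "PR 2", "PR 3", "RF"]

    if n_samples < 500:
        return gpr
    if n_samples < 1000:
        return gpr + poly_and_forest
    if n_samples < 5000:
        return poly_and_forest + ["NN"]
    return ["NN"]


print(baseline_models(120, 1, False))
print(baseline_models(800, 3, False))
print(baseline_models(3000, 3, True))
```

GPR models drop out for larger datasets, which is consistent with their training cost growing quickly with the number of samples.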

Parameters:
  • k_fold (int) –

    Number of folds to use for CV scoring. If scoring isn’t set to None this parameter is moot.

    Default value 5

  • verbose (bool) –

    Whether to output information on the fitting process (models used by autoML, best performing model overview).

    Default value True

Return type:

None

best_model(objectives: str | List[str] | None = None) str | Dict[str, str]

Return best scoring model name(s) (MSE) for objective(s).

Parameters:

objectives (Optional[Union[str, List[str]]]) – Optional parameter to define which models to return. If None, the best scoring model for each objective is returned.

Returns:

If a single objective is passed, the name of the best model is returned. If a list is passed, a dictionary in the form {'objective_name': 'model_name'} is returned.

Return type:

Union[str, Dict[str, str]]

cross_validate(cv: int = 5, objective_funcs: str | List[str] | None = None, verbose: bool = True) None

Method to score regression model performance with k-fold cross validation.

If composition_features are defined for the WorkFlow, the grouped inputs are split into CV-folds, rather than individual data, i.e. each group of data with identical composition_features are assigned together to either the training or test set for each fold (to prevent data leakage).
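The grouping behaviour can be illustrated with a small standalone sketch that assigns whole composition groups to folds. It is not LECA's splitting code; the function name grouped_folds, the greedy balancing strategy and the sample column names are assumptions for illustration.

```python
from collections import defaultdict


def grouped_folds(rows, composition_features, n_folds):
    """Assign row indices to folds so that each composition stays in one fold."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        key = tuple(row[name] for name in composition_features)
        groups[key].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: largest groups first, each into the smallest fold so far.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


# Four hypothetical electrolyte compositions, each measured at two temperatures.
rows = [{"x_salt": c, "temperature": t} for c in (0.1, 0.2, 0.3, 0.4) for t in (20, 40)]
folds = grouped_folds(rows, ["x_salt"], n_folds=2)
print(folds)
```

Because temperature is not a composition feature, both measurements of a composition always land in the same fold, so a model is never tested on a composition it has already seen at another temperature.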

Parameters:
  • cv (int) – Number of cross validation folds.

  • objective_funcs (Optional[Union[str, List[str]]]) –

    Name or list of names of objective functions to score. All regression models initiated for these objective functions will have k-fold cross validated scores recorded. If None then all objective functions defined for the WorkFlow will be scored.

    Default value None.

  • verbose (bool) –

    Toggles whether to output scores (instead of storing them in the workflow metrics database).

    Default value True.

Return type:

None

estimate_uncertainty(regressions: str | None = None, objective_funcs: str | List[str] | None = None) None

Method to enable prediction uncertainty estimation for regression models. For GPR models this method does nothing. For PolynomialRegression models, 200 bootstrapped models are trained. For any other model type, 30 models are trained.

This method implements MapieRegressor to generate bootstrapped models and enable uncertainty estimation using Jackknife+-After-bootstrap for non-GPR models. For further details see: Mapie jackknife+-AB
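The general idea of estimating uncertainty from bootstrapped models can be sketched without MAPIE. The example below uses a plain bootstrap of a one-dimensional least-squares line and takes the spread of the ensemble predictions; it does not implement the Jackknife+-after-bootstrap conformity scores. All names and data are hypothetical.

```python
import random
from statistics import fmean, stdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def bootstrap_predictions(xs, ys, x_new, n_models=30, seed=0):
    """Predict x_new with n_models lines, each fitted on a resampled dataset."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    while len(predictions) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample, a line cannot be fitted
        slope, intercept = fit_line(bx, [ys[i] for i in idx])
        predictions.append(slope * x_new + intercept)
    return predictions


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(xs)]
preds = bootstrap_predictions(xs, ys, x_new=4.5)
print(fmean(preds), stdev(preds))
```

The standard deviation of the ensemble predictions is a simple spread-based uncertainty estimate of the kind the min_max option of predict() describes.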

Parameters:
  • fit_name (str) – Regression model unique identifying name.

  • objective_funcs (Optional[Union[str, List[str]]]) –

    String or list of strings enumerating the objective functions for which the bootstrapped models should be trained. If None, the named bootstrapped models on all objective functions will be trained.

    Default value None.

Return type:

None

generate_bins(objective_funcs: str | List[str] | None = None, features: str | List[str] | None = None, fixed_values: Dict[str, float] | None = None, min_bins: int = 3, feature_importance_bins: int = 10, manual_bins: Dict[str, List[float]] | None = None, manual_min_max_bounds: Dict[str, Tuple[float]] | None = None, validity_test: Callable | None = None) DataFrame

Method to create a DataFrame of discrete values spanning the design space. Features which show a higher importance (via random-forest feature importance metric) will have more bins. This method will either automatically determine the number of (equidistant) bins in the min-to-max range of a feature, or alternately accepts user defined bins for a feature in the form of a list of values.

The key formula for bin generation is:

\[\text{bin\_count} = \max(\text{importance}_{\text{objective}_1}, \ldots) \times \text{feature\_importance\_bins} + \text{min\_bins}\]

If automatic bin generation is used, equidistant bins from the min-to-max value range of the feature in the WorkFlow DataFrame are generated scaling with the feature’s highest importance for the given objective_funcs.
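The bin-count formula can be tried out with a short standalone sketch. The function name equidistant_bins is hypothetical. Truncating the fractional bin count and treating "bins" as equally spaced values that include both endpoints are assumptions.

```python
def equidistant_bins(low, high, importances, feature_importance_bins=10, min_bins=3):
    """Equally spaced values for one feature, scaled by its highest importance."""
    # Assumption: the fractional part of the importance term is truncated.
    bin_count = int(max(importances) * feature_importance_bins) + min_bins
    if bin_count < 2:
        return [low]
    step = (high - low) / (bin_count - 1)
    return [low + i * step for i in range(bin_count)]


# Importance of one feature for two objectives; the larger value (0.5) is used.
bins = equidistant_bins(0.0, 7.0, importances=[0.2, 0.5])
print(len(bins), bins)

# A feature with no measured importance falls back to min_bins values.
print(equidistant_bins(0.0, 1.0, importances=[0.0]))
```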

Parameters:
  • objective_funcs (Optional[Union[str, List[str]]]) –

    String or list of strings of the objective functions to be considered. If None, defaults to all objective functions in the WorkFlow.

    Default value None

  • features (Optional[Union[str, List[str]]]) –

    String or list of strings of the binned features. If None, defaults to all features in the WorkFlow.

    Default value None

  • fixed_values (Optional[Dict[str,float]]) –

    Dictionary of fixed {'feature name': value} pairs to be included in the output DataFrame (e.g. a fixed inverse temperature could be included with fixed_values = {'inverse temperature': 3.0}). NOTE: Any features included as a fixed value will be excluded from the random forest fits used for determining feature importance.

    Default value None

  • min_bins (int) –

    Minimum number of bins automatically assigned to a feature. Features with no measured importance from the random forest fit will have this many bins.

    Default value 3

  • feature_importance_bins (int) –

    Scaling factor for the number of importance-weighted bins automatically assigned to a feature. Per the formula above, a feature with a measured importance of 1 from the random forest fit receives this many bins in addition to min_bins.

    Default value 10

  • manual_bins (Optional[Dict[str,List[float]]]) –

    Optional parameter to explicitly define the bins for a feature. Accepts a dictionary in the form {'feature': [list of values]}. Any features not included in this dictionary will have automatically selected bins.

    Default value None

  • manual_min_max_bounds (Optional[Dict[str,tuple[float]]]) –

    Optional parameter to define the min-to-max range for a feature. Passed as a dictionary of tuples in the form {'feature': (min, max)}. Any features not included in this dictionary will take the min-to-max range from the WorkFlow training DataFrame.

    Default value None

  • validity_test (Optional[Callable f(x: pd.DataFrame) -> pd.DataFrame]) –

    Callable function which takes a feature DataFrame input (i.e. df.columns == feature_list) and returns a boolean list where True represents a valid composition, and False indicates a composition to be excluded.

    Default value None

Returns:

DataFrame of discrete feature values spanning the design space.

Return type:

pd.DataFrame

get_estimator(estimator_name: str, objective_funcs: str | List[str] | None = None) BaseEstimator | List[BaseEstimator] | None

Returns the named fitted estimator object(s) (if exists) for objective function(s).

Parameters:
  • estimator_name (str) – Unique model identifier name.

  • objective_funcs (Optional[Union[str, List[str]]]) – String or list of strings enumerating the objective functions for which to get estimators. If None, all objective functions will have the associated named model returned.

Returns:

If a singular str objective_funcs is passed, the named estimator object is returned, otherwise a list of the estimator objects corresponding to the List[str] of objective_funcs is returned.

Return type:

Union[BaseEstimator, List[BaseEstimator]]

hyperparameter_optimize(fit_name: str, regr_name: str, objective_funcs: str | List[str] | None = None, verbose: bool = True, scoring: Callable[[...], float] | None = None, k_fold: int = 5, **opt_params) None

Automated Bayesian Optimization of regression model hyperparameters using the GPyOpt library.

This method encapsulates 4 different optimization methods:

For regr_name="rf":

The GPyOpt library is used to perform Bayesian hyperparameter-optimization for a scikit-learn random-forest regression model. This method explores the following hyperparameter dimensions for the architecture which scores best with k-fold cross-validation:

"min_samples_split" range(2,20)

"min_samples_leaf" range(1,10)

"max_depth" range(1,31,5)

"n_estimators" range(100,2200,300)

"max_features" (0.1,1.0)

For regr_name="nn":

The GPyOpt library is used to perform Bayesian hyperparameter-optimization for a scikit-learn MLPRegressor model. This method explores the following hyperparameter dimensions for the architecture which scores best with k-fold cross-validation:

"hidden_layer_1" range(0,20)

"hidden_layer_2" range(0,20)

"hidden_layer_3" range(0,20)

"hidden_layer_4" range(0,20)

"alpha" (0.0001,0.01)

"batch_size" range(10,200,5)

"solver" ['lbfgs', 'sgd', 'adam']

"activation" ['identity', 'logistic', 'tanh', 'relu']

"max_iter" range(500,5001,500)

For regr_name="poly_lr":

Polynomial features up to the max degree defined during WorkFlow initialization are recursively eliminated by estimating the training error of the PolynomialRegression model in the limit of infinite training data, as a function of the polynomial features used. This is done by extrapolating the linear trend of the training error as a function of 1/N_training_data to x=0. This error(N_inf) value is scored for models trained on each set of polynomials excluding one, and the pool of polynomials with the lowest error(N_inf) is selected for the next run of the algorithm. The stopping condition is, by default, the point where error(N_inf) exceeds the minimum error by more than 10%. The reduced set of polynomials is then saved as the optimized model. This method is based on a similar approach used in previous work.
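The extrapolation step can be illustrated on its own: fit a straight line to the training error as a function of 1/N and read off the intercept at 1/N = 0. This is a minimal sketch of that single step, not the elimination loop used by LECA; the function name and the synthetic error values are hypothetical.

```python
from statistics import fmean


def error_at_infinite_data(train_sizes, errors):
    """Intercept of a least-squares line through (1/N, error) points."""
    xs = [1.0 / n for n in train_sizes]
    mx, my = fmean(xs), fmean(errors)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (e - my) for x, e in zip(xs, errors)) / sxx
    return my - slope * mx  # value of the line at 1/N = 0


train_sizes = [20, 40, 80, 160]
# Synthetic training errors that approach 0.5 as N grows.
errors = [0.5 - 3.0 / n for n in train_sizes]
print(error_at_infinite_data(train_sizes, errors))
```

Comparing this intercept across candidate polynomial sets gives a score that does not depend on the particular training set size.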

For regr_name="lasso_lr"

The training data is fit with a scikit-learn LassoCV model with polynomial features up to the max degree defined during WorkFlow initialization. The polynomials which are eliminated from the Lasso model are tracked, and a PolynomialRegression model is saved with matching polynomial features.

Parameters:
  • fit_name (str) – Unique name under which the model is saved. If another training is later done with the same name, it will overwrite the existing model.

  • regr_name (str) – Which regression model to use. See: WorkFlow.supported_models. In addition, ridge_lr and lasso_lr can be selected.

  • objective_funcs (Optional[Union[str, List[str]]]) – Default value None.

  • verbose (bool) –

    Toggles whether to output information on optimization process

    Default value True.

  • scoring (Optional[Callable[..., float]]) –

    If callable, signature scorer(estimator, X, y) -> float, otherwise, if None, k-fold cross-validation optimizing MSE.

    Default value None.

  • k_fold (int) –

    Number of folds to use for CV scoring. If scoring isn’t set to None this parameter is moot.

    Default value 5.

  • **opt_params (kwargs) – Accepts kwargs parameters for bayesian optimization algorithm.

Return type:

None

mean_cv_scores(objective_funcs: str | List[str] | None = None, cv: int | None = 5, verbose: bool = True) Dict[str, DataFrame]

Method to calculate the mean scores and Standard Error of the Mean (SEM) of WorkFlow models. The metrics calculated are: time, MAE train, MAE test, MSE train, MSE test, R2 train, R2 test, where test/train declares whether the prediction scoring is on the training or test slice of the dataset. The scores are returned as a Dict of DataFrames with an entry for each objective function. The scores are also stored in the WorkFlow scoring database.

Parameters:
  • objective_funcs (Optional[Union[str, List[str]]]) –

    str or List[str] of objective functions (string names) for which to calculate the mean cross validated metric scores. If None all objective functions for the WorkFlow will be calculated.

    Default value None.

  • cv (Optional[int]) –

    Number of cross validation folds to use (only relevant if models not already cross-val-scored).

    Default value 5.

  • verbose (bool) –

    Whether to print out the scores.

    Default value True.

Returns:

Dictionary of DataFrames keyed by objective function name (str). The DataFrames have the following columns:

time    time_sem    …_sem

I.e. each metric and its Standard Error of the Mean is returned as a column, where the SEM is computed as:

\[\mathrm{SEM} = \frac{\mathrm{std}(\mathrm{metric})}{\sqrt{n_{\mathrm{samples}}}}\]
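The SEM expression can be reproduced with the standard library. The sketch below mirrors np.std with its default ddof=0 by using the population standard deviation; reading n_samples as the number of cross-validation scores is an assumption, and the fold scores are made up.

```python
import math
from statistics import pstdev


def standard_error_of_mean(scores):
    """Population standard deviation divided by sqrt(number of scores)."""
    return pstdev(scores) / math.sqrt(len(scores))


fold_scores = [0.8, 0.9, 1.0, 0.9, 0.9]  # hypothetical per-fold metric values
print(standard_error_of_mean(fold_scores))
```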

Return type:

Dict[str, pd.DataFrame]

optimize(strategy: str = 'max', obj_fn: Callable | None = None, fixed_values: Dict[str, float] | None = None, bounds: Dict[str, Tuple[float]] | None = None, n_restarts_optimizer: int = 100, min_max: bool = False) DataFrame

Optimizer to search design space for max/min objective value, bayesian expected improvement, upper/lower confidence bound and maximum uncertainty strategies. Returns optimal input feature set to query for given strategy.

Parameters:
  • strategy (str) –

    Optimization strategy to use:

    • max : Maximize obj_fn

    • min : Minimize obj_fn

    • EI : Maximize bayesian expected improvement

    • UCB : Maximize upper confidence bound (obj_fn+std)

    • LCB : Minimize lower confidence bound -(obj_fn+std)

    • max_uncert : Maximize obj_fn uncertainty

    Default value max.

  • obj_fn (Optional[Callable f(x: pd.DataFrame) -> pd.DataFrame]) –

    Callable function which takes a feature DataFrame input (using the same features as the workflow) and returns a DataFrame with the two columns ['objective', 'objective_std'].

    Default value None

  • fixed_values (Optional[Dict[str,float]]) –

    Dictionary of fixed {‘feature name’ : value}s for optimization task.

    Default value None

  • bounds (Optional[Dict[str,Tuple[float]]]) –

    Dictionary of {‘feature name’ : (min, max)} for setting the boundaries to search for optimization task.

    Default value None

  • n_restarts_optimizer (int) –

    Number of random points in the design space from which the acquisition function will be optimized. Higher -> more computationally expensive, but higher chance of finding global best acquisition point.

    Default value 100.

  • min_max (bool) –

    Whether to use the min_max method to estimate uncertainty, or MAPIE with conformity scores. min_max:True returns the standard deviations of the predictions from all the bootstrapped models for each point. min_max:False uses the MAPIE uncertainty estimation outlined in: Mapie jackknife+-AB. This parameter is moot for GPR models.

    Default value False.
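How the strategies turn a prediction and its uncertainty into a quantity to maximize can be sketched as follows. This is an illustration under assumptions, not the optimizer used by LECA: the function name acquisition is hypothetical, EI uses the textbook closed form for maximization relative to a best observed value, and LCB is left out.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def acquisition(strategy, mean, std, best_so_far=None):
    """Score to be maximized for one candidate point."""
    if strategy == "max":
        return mean
    if strategy == "min":
        return -mean  # maximizing the negative minimizes the objective
    if strategy == "UCB":
        return mean + std
    if strategy == "max_uncert":
        return std
    if strategy == "EI":
        improvement = mean - best_so_far
        if std == 0.0:
            return max(improvement, 0.0)
        z = improvement / std
        return improvement * normal_cdf(z) + std * normal_pdf(z)
    raise ValueError(f"unknown strategy: {strategy}")


print(acquisition("UCB", mean=1.0, std=0.5))
print(acquisition("EI", mean=1.0, std=1.0, best_so_far=1.0))
```

An optimizer would evaluate such a score from many random starting points (n_restarts_optimizer) and keep the best candidate.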

Returns:

DataFrame of input features and objective prediction

Return type:

pd.DataFrame

poly_lr_coefs(estimator_name: str) Dict[str, DataFrame]

Method to output polynomial coefficients and deviations for linear regression models. The coefficients listed are the coefficients for the model fit on the full training dataset, whereas the STD values (ddof=1) are based on the standard deviations of the coefficients for the cross-validated fits.

Parameters:

estimator_name (str) – Name of trained / cross validated estimator from workflow.

Returns:

Dictionary of DataFrames keyed by objective function name (str). The DataFrames have the following columns:

Poly Index    Poly Degrees    Coefs    Coef STDs    Rel STDs

where Poly Degrees is in the form:

Feature_1    Feature_2    Feature_n
Poly deg.    Poly deg.    Poly deg.

Poly deg. is the power of the corresponding input feature in the generated polynomial feature, and Rel STDs is abs(Coef STD / Coefficient).
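The relationship between the three numeric columns can be shown with a small standalone sketch: coefficients come from the fit on the full training set, while the standard deviations (ddof=1) are taken across the cross-validated fits. The function name coef_spread and the numbers are hypothetical.

```python
from statistics import stdev


def coef_spread(full_fit_coefs, cv_fold_coefs):
    """Per-coefficient value, sample standard deviation over CV fits, and relative spread."""
    rows = []
    for j, coef in enumerate(full_fit_coefs):
        fold_values = [fold[j] for fold in cv_fold_coefs]
        coef_std = stdev(fold_values)  # sample standard deviation (ddof=1)
        rows.append({
            "Coefs": coef,
            "Coef STDs": coef_std,
            "Rel STDs": abs(coef_std / coef),
        })
    return rows


full_fit = [2.0, -0.5]                                # fit on the full training set
cv_fits = [[1.9, -0.4], [2.1, -0.6], [2.0, -0.5]]     # one coefficient list per CV fold
table = coef_spread(full_fit, cv_fits)
for row in table:
    print(row)
```

A large Rel STDs value flags a coefficient whose value changes strongly between folds and is therefore poorly determined by the data.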

Return type:

Dict[str, pd.DataFrame]

polynomial_convert(X: DataFrame, X_scaled: bool = False) DataFrame

Takes unscaled or scaled input DataFrame X (must match the DataFrame format used in WorkFlow) and transforms it into a scaled polynomial to match with the expected input for polynomial regression models.

Parameters:
  • X (pd.DataFrame) – DataFrame of input feature vectors, matching the DataFrame format for WorkFlow.

  • X_scaled (bool) –

    Whether DataFrame X has scaled features (True) or not.

    Default value False.

Returns:

DataFrame of scaled polynomial features matching the format for polynomial regression models with this WorkFlow.

Return type:

pd.DataFrame

predict(X: DataFrame, objectives: str | Dict[str, str] | None = None, X_scaled: bool = False, min_max: bool = False, return_std: bool = False) DataFrame

Call model to predict objective function for given input feature vectors.

Parameters:
  • X (pd.DataFrame) – DataFrame of input feature vectors, matching the DataFrame format for WorkFlow.

  • objectives (Optional[Union[str,Dict[str, str]]]) –

    Dictionary with format:

    Obj fn 1 name:    Model A string name
    Obj fn 2 name:    Model B string name

    If a string is passed, the best scoring model will be selected for the objective function matching with the string.

    If None the model with the best mean cross-validated MSE score for each objective function in WorkFlow is selected.

    Default value None.

  • X_scaled (bool) –

    Whether DataFrame X has scaled features (True) or not.

    Default value False.

  • min_max (bool) –

    Whether to use the min_max method to estimate uncertainty, or MAPIE with conformity scores. min_max:True returns the standard deviations of the predictions from all the bootstrapped models for each point. min_max:False uses the MAPIE uncertainty estimation outlined in: Mapie jackknife+-AB. This parameter is moot for GPR models.

    Default value False.

  • return_std (bool) –

    Whether to also return the uncertainty estimate for predictions. If the bootstrapped models have not yet been trained, WorkFlow.estimate_uncertainty() is called automatically for the selected models.

    Default value False.

Returns:

DataFrame with the following conditional named columns of predictions (and their one-sigma uncertainty):

Obj fn 1 name    Obj fn 2 name

If return_std = True:

Obj fn 1 name    Obj fn 1 name_std    …_std

Return type:

pd.DataFrame

reinit_data_sets(random_state: int | None = None, validation_holdout: int | float = 0) None

Reset the WorkFlow data splitting / shuffling with (optionally) modified random_state and validation_holdout values.

Parameters:
  • random_state (Optional[int]) – Sets the random_state parameter for any stochastic process used in the regression models for reproducibility.

  • validation_holdout (Optional[Union[int, float]]) – Sets the size of the validation holdout set. If an int is given, that fixed number of data points is stored in the validation set. If a float is given, that fraction of the data points is stored in the validation set.

Return type:

None

remove_regr(fit_name: str, objective_funcs: str | List[str] | None = None) None

Removes a regression model from the WorkFlow object.

Parameters:
  • fit_name (str) – Unique model identifier name.

  • objective_funcs (Optional[Union[str, List[str]]]) –

    String or list of strings enumerating the objective functions for which the models should be deleted. If None, all objective functions will have the associated named model removed.

    Default value None.

Return type:

None

retrain()

Retrains all models of the WorkFlow. Cross validation scores and uncertainties from previous training are removed.

Return type:

None

validate(name: str, objective_funcs=None, save_loc: bool | str = False, show_title: bool = True) None

Output validation plots and r2 scores for regression models using stored validation dataset.

Parameters:
  • name (str) – String name of regression model.

  • objective_funcs (Optional[Union[str, List[str]]]) – String name, or list of string names of objective function to score on validation dataset.

  • save_loc (Union[bool, str]) –

    Name under which to save the plot (if desired). If False, the plot is only shown, not saved.

    Saving filename convention is: save_loc + objective function + ‘-unseen-validate.pdf’

Return type:

None