LECA.analyze.comparative_datasize_performance
- LECA.analyze.comparative_datasize_performance(wf: WorkFlow, estimators: str | List[str] | None = None, objective_funcs: str | List[str] | None = None, test_size: int | float = 0.1, N_min: int | None = None, sample_count: int = 5, repeat: int = 100, log_scale: bool = False, plot: bool = False, y_lim: Tuple[float] | None = None, random_state: int | None = None, confidence=1.0, save_loc: str | bool = False) → Dict[str, Tuple[float, DataFrame]]
Compares model performance as a function of training set size (N_training). For each given objective function, the MSE scores of the named regression models are calculated on randomly selected datapoints at fixed fractions of the dataset, showing how model performance depends on the number of training datapoints.
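The procedure described above can be sketched in plain Python. This is only an illustration of repeated random subsampling, not LECA's implementation: `fit_line`, `mse`, `datasize_curve`, the synthetic data, and the use of the sample standard deviation are all assumptions made for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; a stand-in for a workflow model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def datasize_curve(xs, ys, train_sizes, n_test, repeat=20, seed=0):
    """Mean and standard deviation of train/test MSE for each training size."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    curve = {}
    for n_train in train_sizes:
        train_scores, test_scores = [], []
        for _ in range(repeat):
            # Draw a fresh random test set and training subset each repeat.
            rng.shuffle(indices)
            test_idx = indices[:n_test]
            train_idx = indices[n_test:n_test + n_train]
            x_tr = [xs[i] for i in train_idx]
            y_tr = [ys[i] for i in train_idx]
            x_te = [xs[i] for i in test_idx]
            y_te = [ys[i] for i in test_idx]
            model = fit_line(x_tr, y_tr)
            train_scores.append(mse(model, x_tr, y_tr))
            test_scores.append(mse(model, x_te, y_te))
        curve[n_train] = {
            "test": statistics.mean(test_scores),
            "train": statistics.mean(train_scores),
            "test_std": statistics.stdev(test_scores),
            "train_std": statistics.stdev(train_scores),
        }
    return curve


# Synthetic noisy line (made-up data, for illustration only).
data_rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 0.5) for x in xs]

curve = datasize_curve(xs, ys, train_sizes=[5, 30, 90], n_test=10)
for n_train, row in curve.items():
    print(n_train, row)
```

Each entry of `curve` carries the same four quantities that the function reports per dataset slice: the mean test and train MSE and their deviations over the repeats.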
- Parameters:
wf (WorkFlow) – WorkFlow object with models to analyze.
estimators (Optional[Union[str, List[str]]]) –
String or list of model name(s) to include in the datasize performance analysis.
None uses all workflow models.
Default value None.
objective_funcs (Optional[Union[str, List[str]]]) –
String or list of strings naming the objective functions on which to perform the datasize performance analysis. When None is passed, defaults to all objective functions. If a list of objective functions is passed, results are returned for each of them.
Default value None.
test_size (Union[int, float]) –
If int: Explicit number of datapoints used as the test set.
If float: Defines fraction of the whole dataset to use as the test set.
Default value 0.1.
N_min (Optional[int]) –
Number of training datapoints for the first (smallest) sample. If None, N_min is taken from the first non-zero E_training result.
Default value None.
sample_count (int) –
Number of dataset sizes at which to score the training/test MSE. The sample sizes are selected automatically, equidistant on the N_training scale over the range from N_min to N_total.
Default value 5.
repeat (int) –
Number of times to repeat the datasize performance test. The mean values over the repeated analyses and their standard deviations are recorded as the results.
Default value 100.
log_scale (bool) –
Whether to plot the y axis with a log scale.
Default value False.
plot (bool) –
Whether to output a plot of the training/test MSE scores as a function of N_training.
Default value False.
random_state (Optional[int]) –
Sets a numpy random seed for reproducibility.
Default value None.
confidence (float) –
Sets the confidence interval for the error bars: confidence * standard deviation is shown as the error bar.
Default value 1.0.
save_loc (Union[str, bool]) –
Destination for saving the result plot (if provided as a string argument). The figure is saved to save_loc + 'model_compare_N_data-' + obj.replace("/", "-") + ".pdf", where obj is the objective function of the model prediction.
Default value False.
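A few of the parameter rules above can be written out as small helpers. This is a sketch under stated assumptions, not LECA's code: the helper names are hypothetical, rounding to the nearest integer is a guess at how fractional counts are resolved, and N_total is taken here as whatever upper bound the caller supplies. Only the file name pattern is taken directly from the save_loc description.

```python
def resolve_test_count(test_size, n_total):
    """int: explicit number of test points; float: fraction of the whole dataset."""
    if isinstance(test_size, float):
        # Assumption: fractional counts are rounded to the nearest integer.
        return int(round(test_size * n_total))
    return int(test_size)


def training_sizes(n_min, n_total, sample_count=5):
    """sample_count sizes, equidistant on the N_training scale from n_min to n_total."""
    if sample_count == 1:
        return [n_total]
    step = (n_total - n_min) / (sample_count - 1)
    return [int(round(n_min + i * step)) for i in range(sample_count)]


def plot_path(save_loc, obj):
    """File name pattern quoted in the save_loc description."""
    return save_loc + "model_compare_N_data-" + obj.replace("/", "-") + ".pdf"


n_test = resolve_test_count(0.1, 200)       # float: fraction of 200 points
n_test_explicit = resolve_test_count(25, 200)  # int: used as given
sizes = training_sizes(20, 180, sample_count=5)
# "energy/volume" is a hypothetical objective name containing a slash.
path = plot_path("results/", "energy/volume")
```

Because save_loc is concatenated directly with the file name, a directory passed as save_loc needs its trailing separator, as in the `"results/"` example.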
- Returns:
A dictionary keyed by the string name of each listed objective function. Each value is itself a dictionary of the form 'model_name': results DataFrame (for that model). The results DataFrame holds the test/train MSE and their deviations for each model's performance on the different dataset slices. The DataFrame has the form:
            slice 1 size    slice 2 size    …
test        …               …               …
train       …               …               …
test_std    …               …               …
train_std   …               …               …
Here test / train are the means of the MSE scores on the test / train dataset for the models, and test_std / train_std are their standard deviations.
- Return type:
Dict[str, Dict[str, pd.DataFrame]]
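The nested layout (objective, then model, then results table) can be walked as in the sketch below. To keep the example self-contained, each table is a plain dict standing in for the results DataFrame, and every number, objective name, and model name is made up. `error_bars` and `best_model` are hypothetical helpers; the error-bar width follows the confidence * standard deviation rule stated for the confidence parameter.

```python
# Made-up numbers in the documented layout: objective -> model -> table.
# Each table is a plain dict standing in for the results DataFrame
# (rows: test, train, test_std, train_std; columns: slice sizes).
results = {
    "objective_a": {
        "model_x": {
            "test": {20: 1.0, 100: 0.5},
            "train": {20: 0.25, 100: 0.375},
            "test_std": {20: 0.25, 100: 0.125},
            "train_std": {20: 0.125, 100: 0.0625},
        },
        "model_y": {
            "test": {20: 0.75, 100: 0.25},
            "train": {20: 0.125, 100: 0.25},
            "test_std": {20: 0.25, 100: 0.125},
            "train_std": {20: 0.0625, 100: 0.0625},
        },
    }
}


def error_bars(table, row, confidence=1.0):
    """(lower, upper) bounds per slice size: mean -/+ confidence * std."""
    means, stds = table[row], table[row + "_std"]
    return {
        size: (means[size] - confidence * stds[size],
               means[size] + confidence * stds[size])
        for size in means
    }


def best_model(results, objective):
    """Model with the lowest mean test MSE on the largest slice."""
    tables = results[objective]

    def final_test_mse(name):
        test_row = tables[name]["test"]
        return test_row[max(test_row)]

    return min(tables, key=final_test_mse)


bars = error_bars(results["objective_a"]["model_x"], "test", confidence=2.0)
winner = best_model(results, "objective_a")
```

With an actual pandas DataFrame in place of the inner dicts, a row such as the mean test MSE would be read with `df.loc["test"]` instead of `table["test"]`.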