LECA.analyze.comparative_datasize_performance

LECA.analyze.comparative_datasize_performance(wf: WorkFlow, estimators: str | List[str] | None = None, objective_funcs: str | List[str] | None = None, test_size: int | float = 0.1, N_min: int | None = None, sample_count: int = 5, repeat: int = 100, log_scale: bool = False, plot: bool = False, y_lim: Tuple[float] | None = None, random_state: int | None = None, confidence=1.0, save_loc: str | bool = False) → Dict[str, Tuple[float, DataFrame]]

Compare model performance as a function of the training set size (N_training). The MSE scores of the named regression models are calculated for the given objective functions, using randomly selected datapoints at fixed fractions of the dataset, to show how model performance depends on the number of training datapoints.
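The procedure described above can be illustrated with a minimal, standard-library-only sketch. It is not LECA's implementation: the helper names (`fit_line`, `mse`, `datasize_performance`), the one-dimensional least-squares model, the synthetic data, and the use of the population standard deviation are all assumptions made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, mean_y - a * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return statistics.fmean((a * x + b - y) ** 2 for x, y in zip(xs, ys))


def datasize_performance(xs, ys, train_sizes, test_size=0.1, repeat=100,
                         random_state=None):
    """Mean and spread of test/train MSE for each training set size."""
    rng = random.Random(random_state)
    n_total = len(xs)
    # int: explicit test count; float: fraction of the whole dataset
    # (rounding to the nearest integer is an assumption).
    n_test = test_size if isinstance(test_size, int) else round(test_size * n_total)
    results = {}
    for n_train in train_sizes:
        test_scores, train_scores = [], []
        for _ in range(repeat):
            # Draw a fresh random split for every repetition.
            idx = list(range(n_total))
            rng.shuffle(idx)
            test_idx = idx[:n_test]
            train_idx = idx[n_test:n_test + n_train]
            x_tr = [xs[i] for i in train_idx]
            y_tr = [ys[i] for i in train_idx]
            x_te = [xs[i] for i in test_idx]
            y_te = [ys[i] for i in test_idx]
            model = fit_line(x_tr, y_tr)
            test_scores.append(mse(model, x_te, y_te))
            train_scores.append(mse(model, x_tr, y_tr))
        results[n_train] = {
            "test": statistics.fmean(test_scores),
            "train": statistics.fmean(train_scores),
            # Population standard deviation; which estimator LECA uses is not stated.
            "test_std": statistics.pstdev(test_scores),
            "train_std": statistics.pstdev(train_scores),
        }
    return results


# Synthetic data: a noisy straight line with 200 points.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 1) for x in xs]

curve = datasize_performance(xs, ys, train_sizes=[5, 50, 180], repeat=50,
                             random_state=1)
for n_train, row in curve.items():
    print(n_train, round(row["test"], 3), round(row["train"], 3))
```

Each entry of `curve` mirrors one column of the results table described under Returns: the mean test and train MSE for one training set size, plus their spread over the repetitions.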

Parameters:
  • wf (WorkFlow) – WorkFlow object with models to analyze.

  • estimators (Optional[Union[str, List[str]]]) –

    Name or list of names of the model(s) to include in the datasize performance analysis. None uses all workflow models.

    Default value None.

  • objective_funcs (Optional[Union[str, List[str]]]) –

    String or list of strings naming the objective functions on which to perform the datasize performance analysis. When None is passed, all objective functions are used. The returned dictionary contains one entry per objective function.

    Default value None.

  • test_size (Union[int, float]) –

    If int: Explicit number of datapoints used as the test set.

    If float: Defines fraction of the whole dataset to use as the test set.

    Default value 0.1.

  • N_min (Optional[int]) –

    Number of training datapoints for the first (smallest) sample. If None, the first non-zero E_training result is used as N_min.

    Default value None.

  • sample_count (int) –

    Number of points to sample for the training/prediction error, i.e. the number of different datasizes to score for training/test MSE. The sample sizes are selected automatically, equidistant on the N_training scale over the range from N_min to N_total.

    Default value 5.

  • repeat (int) –

    How many times to repeat the datasize performance test. The mean values of the repeated analyses and their standard deviations are recorded as the results.

    Default value 100.

  • log_scale (bool) –

    Whether to plot the y axis with a log scale.

    Default value False.

  • plot (bool) –

    Whether to output a plot of the training/test MSE scores as a function of N_training.

    Default value False.

  • random_state (Optional[int]) –

    Sets a numpy random seed for reproducibility.

    Default value None.

  • confidence (float) –

    Sets the width of the error bars: confidence * standard deviation is shown as the error bar.

    Default value 1.0.

  • save_loc (Union[str, bool]) –

    Destination to save the result plot (if provided as a string argument). The figure is saved to save_loc + 'model_compare_N_data-' + obj.replace("/", "-") + ".pdf", where obj is the objective function of the model prediction.

    Default value False.
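Several of the parameter rules above are simple arithmetic and can be sketched as small helpers. These functions are hypothetical and not part of LECA: the rounding choices, the helper names, and the example objective name "a/b" are assumptions. Only the file name pattern is taken directly from the save_loc description.

```python
def resolve_test_count(test_size, n_total):
    """int: explicit number of test points; float: fraction of the whole dataset."""
    if isinstance(test_size, int):
        return test_size
    return round(test_size * n_total)  # rounding to nearest is an assumption


def training_sizes(n_min, n_total, sample_count=5):
    """sample_count sizes, equally spaced from n_min to n_total (inclusive)."""
    if sample_count == 1:
        return [n_total]
    step = (n_total - n_min) / (sample_count - 1)
    return [round(n_min + i * step) for i in range(sample_count)]


def error_bar(std, confidence=1.0):
    """Half-length of the error bar: confidence * standard deviation."""
    return confidence * std


def plot_path(save_loc, obj):
    """File name pattern quoted in the save_loc description."""
    return save_loc + "model_compare_N_data-" + obj.replace("/", "-") + ".pdf"


print(resolve_test_count(25, 200))   # 25
print(resolve_test_count(0.1, 200))  # 20
print(training_sizes(10, 90, 5))     # [10, 30, 50, 70, 90]
print(error_bar(0.5, 2.0))           # 1.0
print(plot_path("plots/", "a/b"))    # plots/model_compare_N_data-a-b.pdf
```

Note that save_loc is concatenated directly with the file name, so a directory destination needs its trailing separator.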

Returns:

Returns a dictionary keyed by the string name of each listed objective function. Each objective key maps to a dictionary of the form 'model_name': results DataFrame (for that model). The results DataFrame holds the test/train MSE and their standard deviations for each model's performance on the different dataset slices. The DataFrame has the form:

             slice 1 size    slice 2 size
test
train
test_std
train_std

Here test and train are the mean MSE scores of the models on the test and train datasets, and test_std and train_std are the corresponding standard deviations.

Return type:

Dict[str, Dict[str, pd.DataFrame]]