LECA.analyze.datasize_performance

LECA.analyze.datasize_performance(wf: WorkFlow, estimator_name: str, objective_funcs: str | List[str] | None = None, test_size: int | float = 0.1, N_min: int | None = None, sample_count: int = 5, repeat: int = 100, plot: bool = False, random_state: int | None = None, confidence=1.0, save_loc: str | bool = False) → Dict[str, Tuple[float, DataFrame]]

Performance metrics as a function of 1/N_training. The MSE scores of the named regression model are calculated for the given objective functions on randomly selected datapoints at fixed fractions of the dataset, showing how model performance depends on the number of training datapoints.

Parameters:
  • wf (WorkFlow) – WorkFlow object with models to analyze.

  • estimator_name (str) – Name of the model on which to perform the datasize performance analysis.

  • objective_funcs (Optional[Union[str, List[str]]]) –

    String or list of strings naming the objective functions on which to perform the datasize performance analysis. When None is passed, all objective functions are used. If a list of objective functions is passed, the function returns a result for each one.

    Default value None.

  • test_size (Union[int, float]) –

    If int: Explicit number of datapoints used as the test set.

    If float: Defines fraction of the whole dataset to use as the test set.

    Default value 0.1.

  • N_min (Optional[int]) –

    Number of training datapoints for the first (smallest) sample. If None, the first non-zero E_training result is used as N_min.

    Default value None.

  • sample_count (int) –

    Number of points to sample for training/prediction error, i.e. the number of different dataset sizes at which the training/test MSE is scored. The sample sizes are selected automatically, equidistant on the 1/N_training scale over the range from N_min to N_total.

    Default value 5.

  • repeat (int) –

    How many times to repeat the datasize performance test. The mean values over the repeats and their standard deviations are recorded as the results.

    Default value 100.

  • plot (bool) –

    Whether to output a plot of the training/test MSE scores as a function of 1/N_training.

    Default value False.

  • random_state (Optional[int]) –

    Sets a numpy random seed for reproducibility.

    Default value None.

  • save_loc (Union[str, bool]) –

    Destination to save the result plot (if provided as a string argument). The figure is saved to save_loc + "N_plot-" + estimator_name + "-" + obj + ".pdf", where obj is the objective function of the model prediction and estimator_name is the string name of the model saved in the WorkFlow object.

    Default value False.
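
The 1/N_training spacing described for sample_count can be illustrated with a short sketch. This is not LECA's implementation: the helper name inverse_spaced_sizes and the rounding to the nearest integer are assumptions made for illustration only.

```python
from typing import List


def inverse_spaced_sizes(n_min: int, n_total: int, sample_count: int = 5) -> List[int]:
    """Return sample sizes evenly spaced on the 1/N scale (hypothetical helper)."""
    if not 0 < n_min <= n_total:
        raise ValueError("need 0 < n_min <= n_total")
    if sample_count < 2:
        raise ValueError("sample_count must be at least 2")

    largest_inverse = 1.0 / n_min      # smallest sample, largest 1/N
    smallest_inverse = 1.0 / n_total   # full dataset, smallest 1/N
    step = (largest_inverse - smallest_inverse) / (sample_count - 1)

    # Walk down the 1/N axis in equal steps, then map back to sample sizes.
    # Rounding to the nearest integer is an assumption of this sketch.
    inverse_grid = [largest_inverse - i * step for i in range(sample_count)]
    return [round(1.0 / x) for x in inverse_grid]


sizes = inverse_spaced_sizes(n_min=20, n_total=100, sample_count=5)
print(sizes)  # [20, 25, 33, 50, 100]
```

Equal steps in 1/N place most samples at small N, where the error typically changes fastest, and only a few near the full dataset size.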

Returns:

A dictionary keyed by the string name of each listed objective function. Each value is a tuple: the first element is the estimated eta squared value for infinite training data points, and the second is a DataFrame with the test/train MSE and their deviations for the different dataset slices. The DataFrame has the form:

                 slice 1 size    slice 2 size
    test
    train
    test_std
    train_std

Here test / train are the mean MSE scores of the models on the test / train dataset, and test_std / train_std are their standard deviations.

Return type:

Dict[str, Tuple[float, pd.DataFrame]]
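
One way to read the "infinite training data" estimate in the returned tuple is as the value a score approaches as 1/N_training goes to zero. The sketch below fits a straight line to scores against 1/N by ordinary least squares and reports the intercept. Whether LECA computes its estimate this way is an assumption; the function name extrapolate_to_infinite_data and the synthetic scores are hypothetical.

```python
from typing import Sequence, Tuple


def extrapolate_to_infinite_data(
    sizes: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float]:
    """Fit score = intercept + slope * (1/N) and return (intercept, slope).

    The intercept is the extrapolated score at 1/N = 0 (hypothetical helper).
    """
    if len(sizes) != len(scores) or len(sizes) < 2:
        raise ValueError("need at least two (size, score) pairs of equal length")

    xs = [1.0 / n for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)

    # Ordinary least-squares slope and intercept for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all sample sizes are identical; the fit is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic test scores that follow 0.5 + 2.0 / N exactly (illustrative data only).
example_sizes = [20, 25, 33, 50, 100]
example_scores = [0.5 + 2.0 / n for n in example_sizes]
limit, slope = extrapolate_to_infinite_data(example_sizes, example_scores)
```

With real results, the sizes and scores would come from the slice sizes and the test entries of the returned DataFrame.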