LECA.prep.outlier_filter

LECA.prep.outlier_filter(data: DataFrame, objective_funcs: str | List[str], cluster_dimensions: str | List[str], filter_threshhold: float = 0.98, quantile_filter: bool = False, min_cluster_size: int = 5, show_plots: bool = True) → DataFrame

Detects and removes outliers using the HDBSCAN clustering algorithm. HDBSCAN requires only one parameter (min_cluster_size); the outlier scores (outlier_scores) it assigns to the fitted data points are used for filtering. The most important parameter is cluster_dimensions, which defines the data the clusterer actually considers: pass [objective function] to filter outliers in the measured results only, or [objective function] + feature_list to cluster in all dimensions. The function can also plot kept vs. filtered data points for each feature (see show_plots) to help with parameter tuning.
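As a rough illustration of the two choices for cluster_dimensions described above, the sketch below builds both lists for a small made-up DataFrame. The column names, the values, and the import path in the commented-out call are assumptions for illustration only and are not taken from the LECA documentation.

```python
import pandas as pd

# Hypothetical measurements; the column names are placeholders, not part of LECA.
data = pd.DataFrame({
    "temperature": [20.0, 30.0, 40.0, 50.0],
    "salt_concentration": [0.5, 1.0, 1.5, 2.0],
    "conductivity": [4.1, 6.3, 8.0, 9.2],
})

objective = "conductivity"
feature_list = ["temperature", "salt_concentration"]

# Option 1: cluster on the measured result only.
objective_only = [objective]

# Option 2: cluster in all dimensions (objective plus features).
all_dimensions = [objective] + feature_list

# The call itself might then look like this (assumed import path, requires LECA):
# from LECA import prep
# filtered = prep.outlier_filter(data, objective_funcs=objective,
#                                cluster_dimensions=all_dimensions)
```

Every entry in either list has to be a column label of data, since the clusterer is fitted on exactly those columns.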

Parameters:
  • data (DataFrame) – Dataframe of experimental measurements.

  • objective_funcs (Union[str, List[str]]) – Name or list of names of the objective functions. Used to plot each objective as a function of the features considered for clustering.

  • cluster_dimensions (Union[str, List[str]]) – Column labels of data that define which dimensions the HDBSCAN clustering algorithm considers.

  • filter_threshhold (float (range [0-1])) –

    Outlier score threshold. Data points with an outlier score higher than this value are filtered out.

    Default value 0.98.

  • quantile_filter (bool) –

    • quantile_filter = True: keep data points whose outlier score is below the (filter_threshhold × 100)th percentile of the outlier scores of the whole dataset

    • quantile_filter = False: keep data points whose outlier score is below filter_threshhold

    Default value False.

  • min_cluster_size (int) –

    Minimum cluster size passed to HDBSCAN as its min_cluster_size parameter; see the HDBSCAN documentation.

    Default value 5.

  • show_plots (bool) –

    Whether to show plots of each objective function against each cluster dimension.

    Default value True.
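The two quantile_filter modes can be pictured with the following minimal sketch. It is not LECA's implementation: keep_mask is a hypothetical helper, the scores are made up, and treating a score exactly equal to the cut-off as kept is an assumption.

```python
import numpy as np

def keep_mask(outlier_scores, filter_threshhold=0.98, quantile_filter=False):
    """Return a boolean mask that is True for points to keep (illustrative only)."""
    scores = np.asarray(outlier_scores, dtype=float)
    if quantile_filter:
        # Relative cut-off: the score at the given quantile of all scores.
        cutoff = np.quantile(scores, filter_threshhold)
    else:
        # Absolute cut-off: the threshold is compared to the scores directly.
        cutoff = filter_threshhold
    # Assumption: scores equal to the cut-off are kept.
    return scores <= cutoff

# Made-up outlier scores in [0, 1]; higher means more outlier-like.
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.99])

absolute = keep_mask(scores, 0.5, quantile_filter=False)
relative = keep_mask(scores, 0.5, quantile_filter=True)
```

With a threshold of 0.5, the absolute mode drops only the point scoring 0.99, while the quantile mode keeps only the points at or below the median score, regardless of their absolute values.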

Returns:

DataFrame with outlier values removed

Return type:

DataFrame