`LECA.prep`.outlier_filter

LECA.prep.outlier_filter(data: DataFrame, objective_funcs: str | List[str], cluster_dimensions: str | List[str], filter_threshhold: float = 0.98, quantile_filter: bool = False, min_cluster_size: int = 5, show_plots: bool = True) → DataFrame

Detects outliers in the data using the HDBSCAN clustering algorithm. HDBSCAN requires only one parameter (min_cluster_size); the outlier scores it assigns to the fitted data points are then used for filtering. The most important parameter is cluster_dimensions, which defines the data the clusterer actually considers: pass [objective function] to filter outliers among the measured results only, or [objective function] + feature_list to cluster in all dimensions. The function also plots kept vs. filtered data points for each feature to help with parameter tuning.

Parameters:

data (DataFrame) – Dataframe of experimental measurements.
objective_funcs (Union[str, List[str]]) – Objective function(s). Used for plotting each objective as a function of the features considered for clustering.
cluster_dimensions (Union[str, List[str]]) – Defines which dimensions will be considered by the HDBSCAN clustering algorithm. Takes label names from data columns.
filter_threshhold (float (range [0-1])) –
Outlier score threshold value. Datapoints with an outlier score higher than this value will be filtered.

Default value 0.98.
quantile_filter (bool) –
- quantile_filter = True: keep data points whose outlier score lies under the (filter_threshhold*100)th percentile of outlier scores, computed over the total dataset
- quantile_filter = False: keep data points whose outlier score lies under filter_threshhold
Default value False
min_cluster_size (int) –
Value passed to HDBSCAN as its min_cluster_size parameter; see the HDBSCAN documentation.

Default value 5.
show_plots (bool) –
Whether to show plots of each objective in objective_funcs against each of the cluster_dimensions.

Default value True
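To make the difference between the two filter_threshhold modes concrete, here is a small pure-Python sketch. The helpers percentile and keep_mask are hypothetical names introduced for illustration and are not part of LECA; the linear-interpolation percentile and the inclusive comparison (keeping scores equal to the cutoff) are assumptions, since the documentation does not specify either.

```python
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Linear-interpolation quantile of values for q in [0, 1] (assumed method)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def keep_mask(scores: List[float],
              filter_threshhold: float = 0.98,
              quantile_filter: bool = False) -> List[bool]:
    """Return True for each score that would be kept (hypothetical helper)."""
    if quantile_filter:
        # Cutoff is relative: a percentile of the scores in this dataset.
        cutoff = percentile(scores, filter_threshhold)
    else:
        # Cutoff is absolute: the threshold is compared to the scores directly.
        cutoff = filter_threshhold
    return [s <= cutoff for s in scores]


# Made-up outlier scores; none of them exceeds 0.98.
scores = [0.0, 0.1, 0.2, 0.3, 0.5]
absolute = keep_mask(scores)                        # keeps every point
relative = keep_mask(scores, quantile_filter=True)  # drops the highest score
```

In this sketch the absolute mode removes nothing when no score exceeds the threshold, whereas the quantile mode trims the highest-scoring part of the dataset regardless of how large those scores are.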

Returns:

DataFrame with outlier values removed

Return type:

DataFrame
