LECA.prep.outlier_filter
- LECA.prep.outlier_filter(data: DataFrame, objective_funcs: str | List[str], cluster_dimensions: str | List[str], filter_threshhold: float = 0.98, quantile_filter: bool = False, min_cluster_size: int = 5, show_plots: bool = True) → DataFrame
Data outlier detection using the HDBSCAN clustering algorithm. HDBSCAN requires only one parameter (min_cluster_size), and the resulting outlier_scores of the datapoints it is fitted on are used for filtering.
Most important parameter: cluster_dimensions defines the data the clusterer actually considers, i.e. [objective function] if you only want to filter outlying measured results, or [objective function] + feature_list if you want to cluster in all dimensions.
The function also plots kept vs. filtered data points for each feature to help with tuning the parameters.
- Parameters:
data (DataFrame) – Dataframe of experimental measurements.
objective_funcs (Union[str, List[str]]) – The objective function(s), used for plotting each objective as a function of the features considered for clustering.
cluster_dimensions (Union[str, List[str]]) – Defines which dimensions will be considered by the HDBSCAN clustering algorithm. Takes label names from data columns.
filter_threshhold (float (range [0-1])) – Outlier score threshold value. Datapoints with an outlier score higher than this value will be filtered. Default value 0.98.
quantile_filter (bool) –
    quantile_filter = True: keep data under the filter_threshhold*100 percentile outlier score (from total dataset)
    quantile_filter = False: keep data under the filter_threshhold outlier score
    Default value False.
min_cluster_size (int) – HDBSCAN parameter min_cluster_size, see HDBSCAN documentation. Default value 5.
show_plots (bool) – Whether to show plots of each objective function against the cluster_dimensions. Default value True.
- Returns:
DataFrame with outlier values removed
- Return type:
DataFrame
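
The two meanings of filter_threshhold can be illustrated with a minimal, library-free sketch. It is not LECA's implementation: the helper names `percentile` and `keep_mask`, the linear-interpolation percentile, the inclusive `<=` comparison at the cutoff, and the example scores are all assumptions made for illustration. In practice the scores would come from the fitted HDBSCAN clusterer.

```python
def percentile(values, q):
    """Percentile for q in [0, 1] using linear interpolation (hypothetical helper)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = q * (len(ordered) - 1)          # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def keep_mask(scores, filter_threshhold=0.98, quantile_filter=False):
    """Return True for each point that would be kept (hypothetical helper).

    quantile_filter=False: filter_threshhold is an absolute outlier score.
    quantile_filter=True:  filter_threshhold is a quantile of all scores.
    Assumption: points exactly at the cutoff are kept.
    """
    if quantile_filter:
        cutoff = percentile(scores, filter_threshhold)
    else:
        cutoff = filter_threshhold
    return [score <= cutoff for score in scores]


# Made-up outlier scores in [0, 1] for five datapoints.
scores = [0.05, 0.10, 0.20, 0.30, 0.99]

absolute = keep_mask(scores, filter_threshhold=0.98, quantile_filter=False)
relative = keep_mask(scores, filter_threshhold=0.50, quantile_filter=True)
print(absolute)
print(relative)
```

With an absolute threshold, the number of removed points depends on how many scores exceed the value, and can be zero. With quantile_filter = True, roughly the top (1 - filter_threshhold) fraction of the dataset is always removed, regardless of how extreme those scores are.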