dataclr.feature_selector
The feature_selector module is the main component for running the feature selection algorithm in the dataclr package. It provides the FeatureSelector class, which orchestrates the evaluation of machine learning models, applies various filter and wrapper feature selection methods, and identifies the best subset of features based on a specified performance metric.
Important
To use most wrapper methods, the model must have either the coef_ or the feature_importances_ attribute. These attributes are used to evaluate the importance of features during the selection process. Ensure that your model exposes at least one of them for compatibility.
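A quick pre-flight check can catch an incompatible model before selection starts. The sketch below is not part of dataclr: `has_importance_attribute` and the two stand-in model classes are hypothetical names used only for illustration. Estimators in libraries such as scikit-learn typically expose these attributes only after fitting, so a check like this is most useful on a fitted model.

```python
def has_importance_attribute(model) -> bool:
    """Return True if the model exposes coef_ or feature_importances_."""
    return hasattr(model, "coef_") or hasattr(model, "feature_importances_")


class LinearLikeModel:
    """Hypothetical stand-in for a fitted linear model."""

    def __init__(self):
        self.coef_ = [0.4, -1.2, 0.0]


class OpaqueModel:
    """Hypothetical stand-in for a model with no importance attribute."""

    def predict(self, rows):
        return [0 for _ in rows]


print(has_importance_attribute(LinearLikeModel()))  # True
print(has_importance_attribute(OpaqueModel()))      # False
```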
- class dataclr.feature_selector.FeatureSelector(model: BaseModel, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series)
A class for selecting the best features from a dataset using a combination of filter and wrapper methods.
The FeatureSelector evaluates the base performance of a model, applies various feature selection techniques, and determines the optimal set of features based on the given metric. It also validates that the data has been properly preprocessed, encoded, and scaled.
- Parameters:
model (BaseModel) – The model to be used for evaluation.
metric (Metric) – The metric used to assess model performance.
X_train (pd.DataFrame) – Training feature data. Must be numeric and either normalized or standardized.
X_test (pd.DataFrame) – Testing feature data. Must be numeric and either normalized or standardized.
y_train (pd.Series) – Training target data.
y_test (pd.Series) – Testing target data.
- Raises:
ValueError – If X_train contains non-numeric data.
ValueError – If X_train is not normalized or standardized.
ValueError – If X_train or X_test contains incompatible features that cannot be aligned.
Notes
Features with only a single unique value are removed.
It is necessary to preprocess the data (e.g., encoding, scaling) prior to
passing it to this class for feature selection.
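The two notes above can be illustrated on plain Python data. This is a minimal sketch, not dataclr's implementation: `drop_constant_columns` and `standardize` are hypothetical helpers, and columns are represented as a dict of lists rather than a pandas DataFrame to keep the example dependency-free.

```python
from statistics import mean, pstdev


def drop_constant_columns(columns):
    """Remove columns that contain only a single unique value."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}


def standardize(columns):
    """Scale each column to zero mean and unit (population) variance.

    Assumes constant columns were already dropped, so sigma is non-zero.
    """
    scaled = {}
    for name, vals in columns.items():
        mu, sigma = mean(vals), pstdev(vals)
        scaled[name] = [(v - mu) / sigma for v in vals]
    return scaled


raw = {
    "age": [20.0, 30.0, 40.0],
    "constant_flag": [1.0, 1.0, 1.0],  # single unique value, carries no signal
    "income": [10.0, 20.0, 60.0],
}

prepared = standardize(drop_constant_columns(raw))
print(sorted(prepared))  # ['age', 'income']
```

With pandas, the constant-column filter is commonly written as `X.loc[:, X.nunique() > 1]`.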
- select_features(n_results: int = 3, max_depth: int = 3, max_method_results: int = 2, start_wrappers: bool = True, level_wrapper_results: int = 1, final_wrapper_results: int = 2, level_cutoff_threshold: int = 100, filter_methods: list[FilterMethod] = None, wrapper_methods: list[WrapperMethod] = None, verbose: bool = True, n_jobs: int = -1, seed: int = None, max_console_width: int = 110, keep_features: list[str] = [], max_features: int = -1, features_remove_coeff: float = 1.5, mode: str = 'normal') → list[MethodResult]
Selects the best features using filter and wrapper methods and evaluates performance.
This method evaluates the base performance of the provided model on the dataset. It then applies a combination of feature selection methods to identify the optimal feature set. The best results are extracted and printed.
- Steps:
Compute the base model performance.
Construct a Graph object with filter and wrapper methods.
Retrieve and display the best results from the graph.
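The last step, picking the best results, amounts to ranking candidate feature subsets by the chosen metric. The sketch below illustrates that idea only. It is not dataclr's Graph implementation, and `ToyResult`, `LOWER_IS_BETTER`, and `top_results` are hypothetical names. It assumes that rmse is minimized while the other metrics are maximized.

```python
from dataclasses import dataclass

# Assumption: only rmse is an error metric where smaller values are better.
LOWER_IS_BETTER = {"rmse"}


@dataclass(frozen=True)
class ToyResult:
    """Hypothetical stand-in for a scored feature subset."""

    features: tuple
    score: float


def top_results(results, metric, n_results=3):
    """Return the n_results best candidates for the given metric."""
    descending = metric not in LOWER_IS_BETTER
    ranked = sorted(results, key=lambda r: r.score, reverse=descending)
    return ranked[:n_results]


candidates = [
    ToyResult(("a", "b", "c"), 0.81),
    ToyResult(("a", "b"), 0.84),
    ToyResult(("a",), 0.70),
    ToyResult(("b", "c"), 0.79),
]

best = top_results(candidates, metric="accuracy", n_results=2)
print([r.features for r in best])  # [('a', 'b'), ('a', 'b', 'c')]
```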
- Parameters:
n_results (int) – The number of top results to return. Defaults to 3.
seed (int) – Seed that controls the randomness of the algorithm. Defaults to None.
max_depth (int) – The maximum depth of exploration for the graph. Defaults to 3.
max_method_results (int) – The maximum number of results returned by a single method. Defaults to 2.
start_wrappers (bool) – Whether to initiate wrapper methods at the beginning. Defaults to True.
level_wrapper_results (int) – The number of top results to be used for running wrapper methods after entering a new level in the graph. Defaults to 1.
final_wrapper_results (int) – The number of top results to be used for running wrapper methods after the graph exploration ends. Defaults to 2.
level_cutoff_threshold (int) – The threshold for stopping exploration at the current level after a specified number of runs with no improvement. Defaults to 100.
filter_methods (list[dataclr.methods.FilterMethod]) – A set of filtering methods to be applied. Defaults to filter_classes.
wrapper_methods (list[dataclr.methods.WrapperMethod]) – A set of wrapper methods to be applied. Defaults to wrapper_classes.
verbose (bool) – Whether to display a UI with a progress bar during the algorithm’s runtime. Defaults to True.
n_jobs (int) – The number of parallel jobs to use. Set to -1 to utilize all available processors. Defaults to -1.
max_console_width (int) – The maximum width of the console output for UI display purposes. Defaults to 110.
keep_features (list[str]) – List of features that must not be dropped. Defaults to an empty list.
max_features (int) – The maximum number of features in the final results. Defaults to -1 (the total number of features).
features_remove_coeff (float) – Coefficient used to determine how many features a result may contain at a given level. The exact formula is max_features * features_remove_coeff^(remaining_levels_count). Defaults to 1.5.
mode (str) – Determines how time-consuming methods will be used in feature selection. Possible values: ‘normal’, ‘fast’, ‘super_fast’. ‘normal’ is the best choice for datasets with up to a few hundred features. ‘fast’ is suitable for datasets with fewer than a thousand features. ‘super_fast’ is scalable for datasets with more than a few thousand features. Defaults to ‘normal’.
- Returns:
A list of the best results encapsulated as MethodResult objects.
- Return type:
list[MethodResult]
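To make the features_remove_coeff formula concrete, the sketch below evaluates max_features * features_remove_coeff^(remaining_levels_count) for a few levels. The function name `feature_cap` is hypothetical, and the documentation does not say how dataclr rounds the value or counts the remaining levels, so the raw value is returned unrounded.

```python
def feature_cap(
    max_features: int,
    features_remove_coeff: float,
    remaining_levels_count: int,
) -> float:
    """Evaluate the documented formula without any rounding."""
    return max_features * features_remove_coeff ** remaining_levels_count


# Example values: max_features=10 with the default coefficient of 1.5.
for remaining in (3, 2, 1, 0):
    print(remaining, feature_cap(10, 1.5, remaining))
# 3 33.75
# 2 22.5
# 1 15.0
# 0 10.0
```

Under this reading of the formula, the allowance shrinks toward max_features as fewer levels remain.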