dataclr.methods#
The methods module provides the core implementations of various feature selection methods. It includes both filter methods and wrapper methods.
Method Overview#
The following tables provide an overview of the available feature selection methods and their suitability for regression and classification tasks.
Filter Methods#
Method | Regression | Classification
---|---|---
ANOVA | Yes | Yes
Chi2 | No | Yes
CohensD | No | Yes
CramersV | No | Yes
CumulativeDistributionFunction | Yes | Yes
DistanceCorrelation | Yes | Yes
Entropy | Yes | Yes
KendallCorrelation | Yes | Yes
Kurtosis | Yes | Yes
LinearCorrelation | Yes | Yes
MaximalInformationCoefficient | Yes | Yes
MeanAbsoluteDeviation | Yes | Yes
MutualInformation | Yes | Yes
Skewness | Yes | Yes
SpearmanCorrelation | Yes | Yes
VarianceInflationFactor | Yes | Yes
VarianceThreshold | Yes | Yes
ZScore | Yes | Yes
mRMR | Yes | Yes
Wrapper Methods#
Method | Regression | Classification
---|---|---
BorutaMethod | Yes | Yes
HyperoptMethod | Yes | Yes
OptunaMethod | Yes | Yes
RecursiveFeatureAddition | Yes | Yes
RecursiveFeatureElimination | Yes | Yes
ShapMethod | Yes | Yes
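Every method follows the same workflow: construct it with a model and a metric, then call fit_transform() (or fit() followed by transform()) to obtain a list of Result objects describing candidate feature subsets. The snippet below is a minimal sketch, assuming a scikit-learn-compatible estimator is accepted as the model and that the caller prepares the train/test split; the toy data and LinearRegression are illustrative placeholders, not prescriptions from the library.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from dataclr.methods import ANOVA

# Toy data purely for illustration; substitute your own feature matrix and target.
X = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6, 7, 8],
    "f2": [8, 6, 7, 5, 3, 0, 9, 1],
    "f3": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
y = pd.Series([1.0, 1.5, 2.2, 2.4, 3.1, 3.3, 4.0, 4.2])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Rank features with the ANOVA filter and evaluate the resulting subsets.
method = ANOVA(model=LinearRegression(), metric="r2", n_results=3, seed=42)
results = method.fit_transform(X_train, X_test, y_train, y_test)

for result in results:
    print(result)  # each Result describes a feature subset and its metric value
```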
- class dataclr.methods.ANOVA(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
ANOVA filter method for feature selection.
This method ranks features based on the ANOVA F-statistic, which evaluates the variance between groups relative to the variance within groups. It supports both regression and classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) ANOVA #
Computes the ANOVA F-statistics for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
ANOVA
- class dataclr.methods.BorutaMethod(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Boruta-based wrapper method for feature selection.
This method utilizes a model’s
feature_importances_
attribute to iteratively identify important features. Boruta performs a rigorous test to distinguish real features from noise.
- Inherits from:
WrapperMethod
: The base class that provides the structure for wrapper methods.
- fit(X_train: DataFrame, y_train: Series) BorutaMethod #
Fits the Boruta feature selection process.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
BorutaMethod
- Raises:
ValueError – If the model does not have a
feature_importances_
attribute.
- transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Transforms the dataset by selecting the top-ranked features.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
X_test (pd.DataFrame) – Testing feature matrix.
y_train (pd.Series) – Training target variable.
y_test (pd.Series) – Testing target variable.
max_features (int) – Maximum number of features allowed in a returned result. Defaults to -1 (no limit; all features may be used).
- Returns:
List of results for the selected features.
- Return type:
list[Result]
- Raises:
ValueError – If
fit()
has not been called prior to transform.
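As a hedged illustration of the requirement above, the sketch below pairs BorutaMethod with a random forest, which exposes the feature_importances_ attribute Boruta needs. The synthetic data and the assumption that a scikit-learn estimator can be passed as model are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from dataclr.methods import BorutaMethod

# Synthetic regression data purely for illustration.
X_arr, y_arr = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y = pd.Series(y_arr)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# RandomForestRegressor exposes feature_importances_, which Boruta requires.
boruta = BorutaMethod(model=RandomForestRegressor(random_state=42), metric="rmse", n_results=3, seed=42)

# fit() runs the Boruta selection; transform() evaluates subsets built from the ranking.
results = boruta.fit(X_train, y_train).transform(X_train, X_test, y_train, y_test, max_features=5)
```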
- class dataclr.methods.Chi2(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Chi-squared (Chi2) filter method for feature selection.
This method evaluates the dependency between each feature and the target variable using the chi-squared statistic. It is applicable only for classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) Chi2 #
Computes the Chi-squared statistic for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
Chi2
- Raises:
ValueError – If the target task is regression, as Chi2 is only applicable to classification tasks.
- class dataclr.methods.CohensD(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Cohen’s D filter method for feature selection.
This method calculates the effect size (Cohen’s D) for each feature, comparing the mean differences between two target classes. It is applicable only for binary classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) CohensD #
Computes Cohen’s D effect size for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
CohensD
- Raises:
ValueError – If the target task is regression or if the target variable has more than two unique classes, as Cohen’s D is only applicable to binary classification tasks.
- class dataclr.methods.CramersV(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Cramér’s V filter method for feature selection.
This method measures the association between categorical features and the target variable using Cramér’s V statistic. It is applicable only for classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) CramersV #
Computes Cramér’s V statistic for each categorical feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
CramersV
- Raises:
ValueError – If the target task is regression, as Cramér’s V is only applicable to classification tasks.
- class dataclr.methods.CumulativeDistributionFunction(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], bins: int = 4, n_results: int = 3, seed: int = 42)#
Cumulative Distribution Function (CDF) filter method for feature selection.
This method evaluates the separability of feature distributions across target bins or classes using the Kolmogorov-Smirnov test, applicable for both regression and classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- Extended Arguments:
- bins (int, optional): The number of bins to divide the target variable
into for distribution comparison. Defaults to 4.
- fit(X_train: DataFrame, y_train: Series) CumulativeDistributionFunction #
Fits the CDF feature selection process by computing feature scores based on distribution separability.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
CumulativeDistributionFunction
- class dataclr.methods.DistanceCorrelation(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Distance Correlation filter method for feature selection.
This method evaluates the dependency between each feature and the target variable using distance correlation, a measure of both linear and non-linear relationships.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) DistanceCorrelation #
Computes distance correlation for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
DistanceCorrelation
- class dataclr.methods.Entropy(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], bins: int = 4, n_results: int = 3, seed: int = 42)#
Entropy-based filter method for feature selection.
This method evaluates the importance of features by calculating the information gain with respect to the target variable. For regression tasks, the target is discretized into bins before calculating entropy.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- Extended Arguments:
- bins (int, optional): The number of bins to discretize the target variable into
for regression tasks. Defaults to 4.
- fit(X_train: DataFrame, y_train: Series) Entropy #
Computes the information gain for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
Entropy
- Raises:
ValueError – If the target variable cannot be discretized for regression tasks.
- class dataclr.methods.FilterMethod(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
A base class for filter feature selection methods.
Filter methods evaluate the relevance of features using statistical tests, correlations, or other criteria independent of a machine learning model. This class serves as the foundation for implementing specific filter-based feature selection algorithms.
- Inherits from:
Method
: The base class that provides the structure for feature selection methods.
- transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Returns results based on the ranked features determined during fitting.
This method applies the feature selection process to the provided datasets using the rankings obtained from the fit method. It raises an error if fit has not been called beforehand.
- Parameters:
X_train (pd.DataFrame) – The training features.
X_test (pd.DataFrame) – The test features.
y_train (pd.Series) – The training target variable.
y_test (pd.Series) – The test target variable.
max_features (int) – Maximum number of features allowed in a returned result. Defaults to -1 (no limit; all features may be used).
- Returns:
- A list of results generated by optimizing the feature subset
based on the ranked features.
- Return type:
list[Result]
- Raises:
ValueError – If fit has not been called before transform.
Notes
The method relies on the ranked_features_ attribute, which must be populated during the fit process.
Internally, it calls the _optimize method to perform the feature selection.
See also
fit: Method to determine the ranked features.
_optimize: Internal method for feature subset optimization.
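The notes above imply a simple recipe for a custom filter: populate self.ranked_features_ in fit() and inherit transform() unchanged. The subclass below is a hypothetical sketch, not part of dataclr; in particular, the assumption that ranked_features_ holds a pandas Series of per-feature scores sorted best-first should be verified against the library's source.

```python
import pandas as pd

from dataclr.methods import FilterMethod


class AbsoluteCorrelation(FilterMethod):
    """Hypothetical filter: rank features by |Pearson correlation| with the target."""

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> "AbsoluteCorrelation":
        # Score each column by the absolute correlation with the target.
        scores = X_train.corrwith(y_train).abs()
        # Assumption: ranked_features_ stores per-feature scores sorted best-first;
        # the inherited transform() then builds and evaluates subsets from it.
        self.ranked_features_ = scores.sort_values(ascending=False)
        return self
```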
- class dataclr.methods.HyperoptMethod(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_trials: int = None, n_results: int = 3, seed: int = 42)#
Hyperparameter optimization (Hyperopt) wrapper method for feature selection.
This method evaluates feature subsets by optimizing a given metric through a hyperparameter optimization process. It leverages a specified number of trials to explore feature combinations and returns the best subsets.
- Inherits from:
WrapperMethod
: The base class that provides the structure for wrapper methods.
- Extended Arguments:
- n_trials (int, optional): The number of trials for hyperparameter optimization.
Defaults to
config.HYPEROPT_METHOD_N_TRIALS
.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) HyperoptMethod #
Placeholder method for fitting the feature selection process.
- Parameters:
X_train (pd.DataFrame, optional) – Feature matrix of the training data. Defaults to an empty DataFrame.
y_train (pd.Series, optional) – Target variable of the training data. Defaults to an empty Series.
- Returns:
The instance itself.
- Return type:
HyperoptMethod
- transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Applies the feature selection process by evaluating subsets using hyperparameter optimization.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
X_test (pd.DataFrame) – Testing feature matrix.
y_train (pd.Series) – Training target variable.
y_test (pd.Series) – Testing target variable.
max_features (int) – Maximum number of features allowed in a returned result. Defaults to -1 (no limit; all features may be used).
- Returns:
- A list of results containing feature subsets and their
corresponding performance metrics.
- Return type:
list[Result]
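A hypothetical classification example follows; the synthetic data, the LogisticRegression model, and the trial budget are illustrative assumptions. Note that the search itself happens inside transform(), reached here via fit_transform().

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from dataclr.methods import HyperoptMethod

# Synthetic classification data purely for illustration.
X_arr, y_arr = make_classification(n_samples=300, n_features=12, n_informative=5, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])
y = pd.Series(y_arr)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# n_trials=50 is an arbitrary search budget; larger values explore more subsets.
hyperopt = HyperoptMethod(
    model=LogisticRegression(max_iter=1000), metric="accuracy", n_trials=50, n_results=3, seed=42
)
results = hyperopt.fit_transform(X_train, X_test, y_train, y_test)
```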
- class dataclr.methods.KendallCorrelation(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Kendall’s Tau Correlation filter method for feature selection.
This method evaluates the monotonic relationship between each feature and the target variable using Kendall’s Tau correlation. It is suitable for both regression and classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) KendallCorrelation #
Computes Kendall’s Tau correlation for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
KendallCorrelation
- class dataclr.methods.Kurtosis(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Kurtosis filter method for feature selection.
This method evaluates the shape of the distribution of each feature by calculating kurtosis, which measures the “tailedness” of the distribution. Features with higher kurtosis may capture more extreme values or outliers.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) Kurtosis #
Computes the kurtosis for each feature and ranks them in descending order.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series, optional) – Target variable of the training data. Not used in this method but included for compatibility. Defaults to an empty Series.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
Kurtosis
- class dataclr.methods.LinearCorrelation(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Linear Correlation filter method for feature selection.
This method evaluates the linear relationship between each feature and the target variable using Pearson’s correlation coefficient. It is suitable for both regression and classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) LinearCorrelation #
Computes Pearson’s correlation coefficient for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
LinearCorrelation
- class dataclr.methods.MaximalInformationCoefficient(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], bins: int = 4, n_results: int = 3, seed: int = 42)#
Maximal Information Coefficient (MIC) filter method for feature selection.
This method measures the strength of the relationship between each feature and the target variable. MIC is capable of detecting both linear and non-linear relationships, making it a versatile metric for feature selection.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- Extended Arguments:
- bins (int, optional): Number of bins used for discretizing the data during
the MIC calculation. Defaults to 4.
- fit(X_train: DataFrame, y_train: Series) MaximalInformationCoefficient #
Computes the Maximal Information Coefficient (MIC) for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
MaximalInformationCoefficient
- class dataclr.methods.MeanAbsoluteDeviation(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Mean Absolute Deviation (MAD) filter method for feature selection.
This method calculates the average absolute deviation of each feature from its mean. Features with lower deviation are considered less informative for distinguishing patterns.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) MeanAbsoluteDeviation #
Computes the Mean Absolute Deviation (MAD) for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series, optional) – Target variable of the training data. Not used in this method but included for compatibility. Defaults to an empty Series.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
MeanAbsoluteDeviation
- class dataclr.methods.Method(model: BaseModel, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int, seed: int = 42)#
A base class for feature selection methods.
This class defines the structure for methods that integrate with a machine learning model to select the best features from a dataset. It includes the main functionality for fitting a model and returning results.
- n_results#
The number of top features or results to select.
- Type:
int
- total_combinations#
The total number of feature combinations evaluated.
- Type:
int
- seed#
Random seed controlling reproducibility.
- Type:
int
- fit(X_train: DataFrame, y_train: Series, keep_features: list[str] = []) Method #
Fits the model using the provided training data.
This method is intended to be implemented by child classes to define specific fitting logic.
- Parameters:
X_train (pd.DataFrame) – The training features.
y_train (pd.Series) – The training target variable.
- Returns:
The instance of the class itself after fitting.
- Return type:
Method
- Raises:
NotImplementedError – If the method is not implemented in a subclass.
- fit_transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Fits the model using the training data and returns results based on the model.
This method combines the functionality of fit and transform to perform both steps in sequence.
- Parameters:
X_train (pd.DataFrame) – The training features.
X_test (pd.DataFrame) – The test features.
y_train (pd.Series) – The training target variable.
y_test (pd.Series) – The test target variable.
- Returns:
- A list of results generated by the transformation.
Returns an empty list if fitting the model fails.
- Return type:
list[Result]
- abstractmethod transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Returns results based on the fitted model.
This method is intended to be implemented by child classes to define specific transformation logic.
- Parameters:
X_train (pd.DataFrame) – The training features.
X_test (pd.DataFrame) – The test features.
y_train (pd.Series) – The training target variable.
y_test (pd.Series) – The test target variable.
- Returns:
A list of results generated by the transformation.
- Return type:
list[Result]
- Raises:
NotImplementedError – If the method is not implemented in a subclass.
- class dataclr.methods.MutualInformation(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Mutual Information filter method for feature selection.
This method evaluates the dependency between each feature and the target variable using mutual information, which measures both linear and non-linear relationships. It supports both regression and classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) MutualInformation #
Computes mutual information between each feature and the target variable, and ranks the features.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
MutualInformation
- class dataclr.methods.OptunaMethod(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_trials: int = None, n_results: int = 3, seed: int = 42)#
Optuna-based wrapper method for feature selection.
This method utilizes Optuna for hyperparameter optimization to evaluate and select feature subsets. It performs a specified number of trials to explore feature combinations and identify the best subsets based on the provided metric.
- Inherits from:
WrapperMethod
: The base class that provides the structure for wrapper methods.
- Extended Arguments:
- n_trials (int, optional): The number of trials for Optuna’s optimization
process. Defaults to
config.OPTUNA_METHOD_N_TRIALS
.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) OptunaMethod #
Placeholder method for fitting the feature selection process.
- Parameters:
X_train (pd.DataFrame, optional) – Feature matrix of the training data. Defaults to an empty DataFrame.
y_train (pd.Series, optional) – Target variable of the training data. Defaults to an empty Series.
- Returns:
The instance itself.
- Return type:
OptunaMethod
- transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Applies the feature selection process by evaluating subsets using Optuna optimization.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
X_test (pd.DataFrame) – Testing feature matrix.
y_train (pd.Series) – Training target variable.
y_test (pd.Series) – Testing target variable.
max_features (int) – Maximum number of features allowed in a returned result. Defaults to -1 (no limit; all features may be used).
- Returns:
- A list of results containing feature subsets and their
corresponding performance metrics.
- Return type:
list[Result]
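A sketch of the same pattern with Optuna, again under the assumption that a scikit-learn-compatible regressor is accepted as model; the n_trials and max_features values below are arbitrary illustrative choices.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

from dataclr.methods import OptunaMethod

# Synthetic regression data purely for illustration.
X_arr, y_arr = make_regression(n_samples=300, n_features=15, n_informative=6, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(15)])
y = pd.Series(y_arr)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The search runs inside transform(); fit() is only a placeholder for this method.
optuna_method = OptunaMethod(
    model=GradientBoostingRegressor(random_state=0), metric="rmse", n_trials=30, n_results=5, seed=0
)
results = optuna_method.fit_transform(X_train, X_test, y_train, y_test, max_features=8)
```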
- class dataclr.methods.RecursiveFeatureAddition(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Recursive Feature Addition (RFA) feature selection method.
This method iteratively adds the most important feature and evaluates the model’s performance to determine the optimal subset of features.
- Inherits from:
WrapperMethod
: The base class for wrapper-based feature selection methods.
- fit(X_train: ~pandas.core.frame.DataFrame = Empty DataFrame Columns: [] Index: [], y_train: ~pandas.core.series.Series = Series([], dtype: object)) RecursiveFeatureAddition #
Fits the model.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training target variable.
- Returns:
Returns self.
- Return type:
RecursiveFeatureAddition
- transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Performs Recursive Feature Addition and selects the optimal subset of features.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
X_test (pd.DataFrame) – Testing feature matrix.
y_train (pd.Series) – Training target variable.
y_test (pd.Series) – Testing target variable.
max_features (int) – Maximum number of features allowed in a returned result. Defaults to -1 (no limit; all features may be used).
- Returns:
A list of feature subsets and their corresponding performance metrics.
- Return type:
list[Result]
- class dataclr.methods.RecursiveFeatureElimination(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Recursive Feature Elimination (RFE) feature selection method.
This method iteratively removes the least important feature and evaluates the model’s performance to determine the optimal subset of features.
- Inherits from:
WrapperMethod
: The base class for wrapper-based feature selection methods.
- fit(X_train: ~pandas.core.frame.DataFrame = Empty DataFrame Columns: [] Index: [], y_train: ~pandas.core.series.Series = Series([], dtype: object)) RecursiveFeatureElimination #
Fits the model.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
y_train (pd.Series) – Training target variable.
- Returns:
Returns self.
- Return type:
RecursiveFeatureElimination
- transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Performs Recursive Feature Elimination and selects the optimal subset of features.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
X_test (pd.DataFrame) – Testing feature matrix.
y_train (pd.Series) – Training target variable.
y_test (pd.Series) – Testing target variable.
max_features (int) – Maximum number of features allowed in a returned result. Defaults to -1 (no limit; all features may be used).
- Returns:
A list of feature subsets and their corresponding performance metrics.
- Return type:
list[Result]
- class dataclr.methods.ShapMethod(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
SHAP-based wrapper method for feature selection.
This method utilizes SHAP (SHapley Additive exPlanations) values to evaluate the importance of features based on the model’s predictions. It supports models with
feature_importances_
(e.g., tree-based models) or coef_
(e.g., linear models).
- Inherits from:
WrapperMethod
: The base class that provides the structure for wrapper methods.
- fit(X_train: DataFrame, X_test: DataFrame) ShapMethod #
Computes SHAP values for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
X_test (pd.DataFrame) – Testing feature matrix.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
ShapMethod
- Raises:
ValueError – If the model lacks both
feature_importances_
and coef_
attributes, which are required for SHAP computation.
- fit_transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Fits the model using the training data and returns results based on the model.
This method combines the functionality of fit and transform to perform both steps in sequence.
- Parameters:
X_train (pd.DataFrame) – The training features.
X_test (pd.DataFrame) – The test features.
y_train (pd.Series) – The training target variable.
y_test (pd.Series) – The test target variable.
- Returns:
- A list of results generated by the transformation.
Returns an empty list if fitting the model fails.
- Return type:
list[Result]
- transform(X_train: DataFrame, X_test: DataFrame, y_train: Series, y_test: Series, max_features: int = -1) list[Result] #
Applies the SHAP-based feature selection process to evaluate and optimize subsets.
- Parameters:
X_train (pd.DataFrame) – Training feature matrix.
X_test (pd.DataFrame) – Testing feature matrix.
y_train (pd.Series) – Training target variable.
y_test (pd.Series) – Testing target variable.
max_features (int) – Maximum number of features allowed in a returned result. Defaults to -1 (no limit; all features may be used).
- Returns:
- A list of results containing feature subsets and their
corresponding performance metrics.
- Return type:
list[Result]
- Raises:
ValueError – If
fit()
has not been called prior to transform.
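Unlike most other wrappers, ShapMethod.fit() takes the training and test feature matrices rather than a target, so fit_transform() is usually the most convenient entry point. The sketch below is illustrative only; the random forest and synthetic data are assumptions, chosen because tree-based models expose feature_importances_.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from dataclr.methods import ShapMethod

# Synthetic regression data purely for illustration.
X_arr, y_arr = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y = pd.Series(y_arr)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Tree-based models expose feature_importances_, which ShapMethod can explain.
shap_method = ShapMethod(model=RandomForestRegressor(random_state=1), metric="rmse", n_results=3, seed=1)

# fit_transform() trains the model, computes SHAP-based rankings, and evaluates subsets.
results = shap_method.fit_transform(X_train, X_test, y_train, y_test)
```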
- class dataclr.methods.Skewness(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Skewness filter method for feature selection.
This method evaluates the asymmetry of the distribution of each feature by calculating skewness. Features with higher skewness may indicate potential outliers or non-normal distributions.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) Skewness #
Computes the skewness for each feature and ranks them in descending order.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series, optional) – Target variable of the training data. Not used in this method but included for compatibility. Defaults to an empty Series.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
Skewness
- class dataclr.methods.SpearmanCorrelation(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Spearman Correlation filter method for feature selection.
This method evaluates the monotonic relationship between each feature and the target variable using Spearman’s rank correlation coefficient. It is suitable for both regression and classification tasks.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) SpearmanCorrelation #
Computes Spearman’s rank correlation coefficient for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
SpearmanCorrelation
- class dataclr.methods.VarianceInflationFactor(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], threshold: int = 5, n_results: int = 3, seed: int = 42)#
Variance Inflation Factor (VIF) filter method for feature selection.
This method identifies features with multicollinearity by calculating the Variance Inflation Factor (VIF) for each feature. Features with a VIF above the specified threshold are considered collinear and ranked based on their VIF values.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- Extended Arguments:
- threshold (int, optional): The VIF threshold above which features are considered
multicollinear. Defaults to 5.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) VarianceInflationFactor #
Computes the Variance Inflation Factor (VIF) for each feature and filters those exceeding the specified threshold.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series, optional) – Target variable of the training data. Not used in this method but included for compatibility. Defaults to an empty Series.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
VarianceInflationFactor
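A brief sketch of the threshold parameter with deliberately collinear toy data; the values, the LinearRegression model, and the assumption that a scikit-learn estimator is accepted as model are all illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

from dataclr.methods import VarianceInflationFactor

# Toy data with two nearly identical (collinear) columns, purely for illustration.
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],  # almost identical to "a"
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})
y = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 5.0])

# A stricter (lower) threshold flags multicollinearity earlier; 5 is the default.
vif = VarianceInflationFactor(model=LinearRegression(), metric="r2", threshold=5, n_results=3, seed=42)
vif.fit(X, y)
print(vif.ranked_features_)  # features ranked by their VIF values
```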
- class dataclr.methods.VarianceThreshold(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Variance Threshold filter method for feature selection.
This method ranks features based on their variance. Features with higher variance are considered more informative, while features with low variance may contribute less to the model’s performance.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) VarianceThreshold #
Computes the variance for each feature and ranks them.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series, optional) – Target variable of the training data. Not used in this method but included for compatibility. Defaults to an empty Series.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
VarianceThreshold
- class dataclr.methods.WrapperMethod(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
A base class for wrapper feature selection methods.
Wrapper methods use a machine learning model to evaluate feature subsets by training and validating the model on different combinations of features. This class serves as the foundation for implementing specific wrapper-based feature selection algorithms.
- Inherits from:
Method
: The base class that provides the structure for feature selection methods.
- class dataclr.methods.ZScore(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Z-Score filter method for feature selection.
This method evaluates the importance of features by calculating the mean of the absolute Z-scores for each feature. Features with higher mean absolute Z-scores are considered more informative.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: ~pandas.core.frame.DataFrame, y_train: ~pandas.core.series.Series = Series([], dtype: object)) ZScore #
Computes the mean absolute Z-score for each feature and ranks them in descending order.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series, optional) – Target variable of the training data. Not used in this method but included for compatibility. Defaults to an empty Series.
- Returns:
The fitted instance with ranked features stored in
self.ranked_features_.
- Return type:
ZScore
- class dataclr.methods.mRMR(model, metric: Literal['rmse', 'r2', 'accuracy', 'precision', 'recall', 'f1'], n_results: int = 3, seed: int = 42)#
Minimum Redundancy Maximum Relevance (mRMR) filter method for feature selection.
This method selects features by maximizing relevance to the target variable while minimizing redundancy among the selected features. Relevance is measured with the ANOVA F-statistic (for classification or regression tasks), and redundancy is computed using Pearson correlation.
- Inherits from:
FilterMethod
: The base class that provides the structure for filter methods.
- fit(X_train: DataFrame, y_train: Series) mRMR #
Selects features by optimizing for both relevance and minimal redundancy.
- Parameters:
X_train (pd.DataFrame) – Feature matrix of the training data.
y_train (pd.Series) – Target variable of the training data.
- Returns:
The fitted instance with ranked features stored in self.ranked_features_.
- Return type:
mRMR
- Raises:
ValueError – If the input is incompatible with regression or classification tasks.
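A final illustrative sketch for a classification task; the decision tree, the synthetic data, and the parameter values are assumptions chosen for the example, not prescriptions from the library.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from dataclr.methods import mRMR

# Synthetic classification data purely for illustration.
X_arr, y_arr = make_classification(n_samples=300, n_features=12, n_informative=5, random_state=7)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])
y = pd.Series(y_arr)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

# Select features that are relevant to the target but not redundant with each other.
mrmr = mRMR(model=DecisionTreeClassifier(random_state=7), metric="f1", n_results=3, seed=7)
results = mrmr.fit_transform(X_train, X_test, y_train, y_test, max_features=6)
```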