Overview
dataclr is a Python library for streamlined feature selection in tabular datasets.
It offers a variety of filter and wrapper methods, delivering robust and interpretable
feature rankings to enhance model performance and simplify feature engineering.
Key Features
Comprehensive Feature Selection Methods:
Filter Methods:
- Evaluate features independently of the predictive model.
- Include techniques such as MutualInformation, VarianceThreshold, ANOVA, KendallCorrelation, and more.
Wrapper Methods:
- Evaluate subsets of features using a predictive model.
- Include methods such as BorutaMethod, ShapMethod, HyperoptMethod, and OptunaMethod.
Customizable Evaluation Metrics:
Supports both regression and classification tasks with a wide range of metrics.
Automatically adapts feature selection strategies based on the nature of the target variable.
Highly Configurable and Scalable:
Allows fine-grained control over the number of selected features, optimization trials, and thresholds.
Scales efficiently to handle large datasets and high-dimensional feature spaces.
Interpretable Results:
Provides ranked lists of features with detailed importance scores.
Supports visualization and reporting for better interpretability.
Seamless Integration:
Compatible with popular Python libraries such as pandas and scikit-learn.
Designed to integrate seamlessly into existing machine learning pipelines.
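To make the idea of a filter method concrete, here is a minimal sketch in plain Python. It is not dataclr's API: the function names `kendall_tau` and `rank_features`, the `min_variance` parameter, and the toy data are all assumptions made for illustration. It drops constant columns (a variance-threshold filter) and then ranks the remaining columns by the absolute value of Kendall's tau with the target, without training any model.

```python
from itertools import combinations
from statistics import pvariance


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def rank_features(columns, target, min_variance=0.0):
    """Drop low-variance columns, then rank the rest by |tau| with the target."""
    scores = {}
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # variance-threshold filter: constant columns are skipped
        scores[name] = abs(kendall_tau(values, target))
    # Highest score first: a ranked list of (feature, score) pairs.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, made up for illustration only.
columns = {
    "monotone": [1, 2, 3, 4, 5],
    "noisy": [2, 1, 4, 3, 5],
    "constant": [7, 7, 7, 7, 7],
}
target = [10, 20, 30, 40, 50]

ranking = rank_features(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Each feature is scored on its own, which is what makes this a filter method: the cost grows with the number of features rather than with the number of feature subsets.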
Use Cases
Dimensionality Reduction: Select the most relevant features for high-dimensional datasets, reducing computational overhead and improving model performance.
Feature Engineering: Identify redundant or irrelevant features to focus on meaningful transformations.
Explainable AI (XAI): Use interpretable methods like ShapMethod to understand feature importance and model behavior.
Optimization: Improve the generalization of machine learning models by using well-curated feature subsets.
How It Works
dataclr operates by:
Accepting Tabular Data: Input datasets in the form of pandas DataFrames.
Applying Feature Selection Methods:
Filter methods evaluate features based on statistical metrics or relationships with the target.
Wrapper methods iteratively select subsets by evaluating feature combinations against a predictive model.
Returning Ranked Feature Sets: Output a ranked list of feature sets, along with the methods used and additional metrics.
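The wrapper step described above can be sketched in plain Python as a greedy forward selection. This is an illustration under stated assumptions, not dataclr's implementation: the helper names `loo_accuracy` and `forward_select`, the choice of a 1-nearest-neighbour classifier, the leave-one-out scoring, and the toy data are all hypothetical stand-ins for whatever model and metric a real run would use.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given feature indices."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # hold out the row being predicted
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, n_features):
    """Greedily add the feature that most improves the model's score;
    stop as soon as no remaining feature improves it."""
    selected, best_score = [], 0.0
    remaining = set(range(n_features))
    while remaining:
        scored = [
            (loo_accuracy(rows, labels, selected + [f]), f)
            for f in sorted(remaining)
        ]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data, made up for illustration: feature 0 separates the classes,
# feature 1 is unrelated to the label.
rows = [
    [0.0, 5.0], [0.1, 1.0], [0.2, 9.0],
    [1.0, 4.9], [1.1, 1.1], [1.2, 9.1],
]
labels = [0, 0, 0, 1, 1, 1]

selected, score = forward_select(rows, labels, n_features=2)
print("selected feature indices:", selected, "score:", score)
```

Unlike a filter method, every candidate subset here is scored by actually fitting and evaluating a model, which is why wrapper methods tend to be more expensive but can account for interactions between features.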
dataclr enables machine learning practitioners to perform feature selection efficiently and with ease.