Feature Selection and EDA in Machine Learning
Updated: May 25, 2021
How to Use Data Visualization to Guide Feature Selection
In the machine learning lifecycle, feature selection is a critical process that chooses a subset of input features relevant to the prediction target. Including irrelevant variables, especially those with poor data quality, can contaminate the model output.
Additionally, feature selection has the following advantages:
1) it avoids the curse of dimensionality, since some algorithms perform poorly in high-dimensional data, e.g. generalized linear models and decision trees
2) it reduces computational cost and the complexity that comes with a large amount of data
3) it reduces overfitting, so the model is more likely to generalize to new data
4) it increases the explainability of models
In this article, we will discuss two main feature selection techniques: filter methods and wrapper methods, as well as how to use data visualization to guide decision making.
Before jumping into feature selection, we should always perform data preprocessing and data transformation:
1. Load dataset and import libraries
I am using the Credit Card Customers dataset from Kaggle to predict which customers are more likely to churn.
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype, is_numeric_dtype

df = pd.read_csv("../input/credit-card-customers/BankChurners.csv")
```
Now let's have a glimpse of the raw data.
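A quick glimpse can be taken with `head()`, `shape`, and `info()`. A minimal sketch (the small DataFrame below stands in for the Kaggle file, and its column names are illustrative, not the full original schema):

```python
import pandas as pd

# Stand-in for the Kaggle BankChurners data: a few rows with a similar shape.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Customer_Age": [45, 49, 51],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})

print(df.head())   # first rows of the raw data
print(df.shape)    # (number of rows, number of columns)
df.info()          # column dtypes and non-null counts
```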
2. Define the prediction: this exercise is a classification problem in which we predict the binary variable "Attrition_Flag", which indicates whether a customer is an existing customer or an attrited customer.
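As a sanity check, it is worth inspecting the class balance of the target. A sketch using a toy DataFrame in place of the real data:

```python
import pandas as pd

# Toy stand-in for the real target column (8 existing vs. 2 attrited customers).
df = pd.DataFrame({"Attrition_Flag": ["Existing Customer"] * 8 + ["Attrited Customer"] * 2})

# Share of each class; an imbalanced target affects both modeling and evaluation.
counts = df["Attrition_Flag"].value_counts(normalize=True)
print(counts)
```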
3. Examine missing data:
Luckily, this dataset does not contain any missing values, but that is not always the case. If you would like to know more about how to address missing data, you may find these articles helpful.
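A quick way to verify this is to count nulls per column. A minimal sketch with a toy frame (on the real data you would run the same calls on `df` loaded above):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value to show the output format.
df = pd.DataFrame({
    "Customer_Age": [45, np.nan, 51],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})

missing = df.isnull().sum()
print(missing)                # missing count per column
print(missing[missing > 0])   # only the columns that actually have gaps
```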
4. Variable transformation: this process consists of encoding categorical variables and transforming all variables onto the same scale. I chose label encoding and min-max scaling, respectively.
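These two transformations can be sketched with scikit-learn's `LabelEncoder` and `MinMaxScaler` (toy data stands in for the real columns):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy stand-in for the real data.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Customer_Age": [45, 49, 51],
})

# Encode the categorical target into integers (0/1).
le = LabelEncoder()
df["Attrition_Flag"] = le.fit_transform(df["Attrition_Flag"])

# Rescale numeric columns into the [0, 1] range.
scaler = MinMaxScaler()
df[["Customer_Age"]] = scaler.fit_transform(df[["Customer_Age"]])

print(df)
```

Note that min-max scaling should be fit on the training split only and then applied to the test split, to avoid leaking information from the test data.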
Exploratory Data Analysis (EDA)
Data visualization and EDA can be great complementary tools to the feature selection process, and they can be applied in the following ways:
Univariate Analysis: histograms and bar charts help to visualize the distribution and variance of each variable
Correlation Analysis: a heatmap facilitates the identification of highly correlated explanatory variables and helps reduce collinearity
Bivariate Analysis: box plots and grouped bar charts help to spot dependencies between explanatory variables and the response variable
(To get a better understanding of the data, exploratory data analysis can also be performed before data transformation.)