Feature Selection and EDA in Machine Learning

Updated: May 25, 2021

How to Use Data Visualization to Guide Feature Selection

In the machine learning lifecycle, feature selection is the critical step of choosing the subset of input features that is relevant to the prediction. Including irrelevant variables, especially those with poor data quality, can contaminate the model output.

Additionally, feature selection has the following advantages:

1) avoid the curse of dimensionality, as some algorithms, e.g. general linear models and decision trees, perform badly on high-dimensional data

2) reduce computational cost and the complexity that comes with a large amount of data

3) reduce overfitting, so the model is more likely to generalize to new data

4) increase the explainability of models

In this article, we will discuss two main feature selection techniques: filter methods and wrapper methods, as well as how to use data visualization to guide decision making.



Data Preprocessing


Before jumping into feature selection, we should always perform data preprocessing and data transformation:


1. Load dataset and import libraries

I am using the Credit Card Customer Dataset from Kaggle to predict which customers are more likely to churn.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype, is_numeric_dtype

df = pd.read_csv("../input/credit-card-customers/BankChurners.csv")

Now let's have a glimpse of the raw data.
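A first look usually means checking the shape, the first rows, and which columns are numeric versus categorical. The snippet below is a minimal sketch of that step. It runs on a tiny hand-made frame standing in for BankChurners.csv: the values are invented, and every column name other than Attrition_Flag is an illustrative assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

# Toy stand-in for BankChurners.csv; values are invented for illustration.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Customer_Age": [45, 49, 51],
    "Gender": ["M", "F", "M"],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows of the table

# Split the columns by type; this grouping is reused when encoding and scaling.
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
cat_cols = [c for c in df.columns if is_string_dtype(df[c])]
print("numeric:", num_cols)
print("categorical:", cat_cols)
```

On the real dataset, the same calls would run directly on the df loaded with pd.read_csv.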


2. Define the prediction: this exercise is a classification problem in which we predict the binary variable "Attrition_Flag", which labels each customer as either an existing customer or an attrited customer.
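Once the target is fixed, it is worth checking how the two classes are balanced, since a heavily skewed target affects how later results should be read. This is a small sketch on made-up labels; the exact label strings are an assumption about how the column is spelled.

```python
import pandas as pd

# Made-up target column: 4 existing customers and 1 attrited customer.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 4 + ["Attrited Customer"],
})

# Share of each class in the target variable.
class_share = df["Attrition_Flag"].value_counts(normalize=True)
print(class_share)
```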


3. Examine missing data:

Luckily, this dataset does not contain any missing values, but that is not always the case. If you would like to know more about how to address missing data, you may find these articles helpful:

How to Address Missing Data

How to Address Common Data Quality Issues Without Code
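One common way to run the missing-data check is to count null values per column. The sketch below uses a toy frame with gaps added on purpose, so that the output is not all zeros; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (the real dataset has none).
df = pd.DataFrame({
    "Customer_Age": [45, np.nan, 51, 40],
    "Gender": ["M", "F", None, "F"],
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0],
})

missing_count = df.isnull().sum()   # number of missing values per column
missing_share = df.isnull().mean()  # fraction of missing values per column

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```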


4. Variable transformation: this process consists of encoding categorical variables and transforming all variables onto the same scale. I chose a label encoder and min-max scaling, respectively.
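A minimal sketch of this step, assuming scikit-learn's LabelEncoder and MinMaxScaler are the tools behind "label encoder" and "min-max scaling" (the article does not show the original code). The toy frame and its column names are illustrative.

```python
import pandas as pd
from pandas.api.types import is_string_dtype
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy stand-in data; values are invented for illustration.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer"],
    "Gender": ["M", "F", "M", "F"],
    "Customer_Age": [45, 49, 51, 40],
})

# Label encoding: each string column becomes integer codes,
# assigned in sorted order of the category names.
encoded = df.copy()
for col in encoded.columns:
    if is_string_dtype(encoded[col]):
        encoded[col] = LabelEncoder().fit_transform(encoded[col])

# Min-max scaling: every column is rescaled to the [0, 1] range.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)
print(scaled)
```

scikit-learn documents LabelEncoder as intended for target labels; OrdinalEncoder is its counterpart for input features and produces the same kind of integer codes.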




Exploratory Data Analysis (EDA)


Data visualization and EDA are great complementary tools to the feature selection process, and they can be applied in the following ways:

  1. Univariate Analysis: histograms and bar charts help to visualize the distribution and variance of each variable

  2. Correlation Analysis: a heatmap helps to identify highly correlated explanatory variables, so that collinearity can be reduced

  3. Bivariate Analysis: box plots and grouped bar charts help to spot the dependency and relationship between the explanatory variables and the response variable

(To get a better understanding of the data, exploratory data analysis can also be performed before data transformation.)