Mar 30, 2021 · 6 min read

Simple Logistic Regression using Python scikit-learn

Updated: Aug 7, 2021

Step-by-Step Guide from Data Preprocessing to Model Evaluation

What is Logistic Regression?

Don't let the name logistic regression trick you: despite the name, it falls into the category of classification algorithms rather than regression algorithms.
 
Then, what is a classification model? Simply put, a classification model predicts a categorical value, e.g. cat or dog, yes or no, true or false. In contrast, a regression model predicts a continuous numeric value.
 
Logistic regression makes predictions based on the sigmoid function, the S-shaped curve shown below. Although the function itself returns probabilities, the final output is a label assigned by comparing each probability with a threshold, which ultimately makes it a classification algorithm. In this example, I implement a logistic regression that returns binary outputs, also known as binomial logistic regression, but logistic regression can also tackle multiclass predictions.


 
In this article, I will walk through the following steps to build a simple logistic regression model using Python scikit-learn:

  1. Data Preprocessing

  2. Feature Engineering and EDA

  3. Model Building

  4. Model Evaluation

The data is taken from the Kaggle public dataset "Rain in Australia". The objective is to predict the binary target variable "RainTomorrow" based on existing knowledge, e.g. temperature, humidity, wind speed, etc. If you would like to have access to the full code, please check out the Code Snippet page.


 

1. Data Preprocessing

Firstly let’s load the libraries and the dataset.

Use df.describe() to have an overview of raw data.
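
The original snippet is not reproduced here, so the sketch below illustrates the same calls. The file name weatherAUS.csv and the sample values are my assumptions; the real frame comes from the Kaggle download.

```python
import numpy as np
import pandas as pd

# With the real data this would be: df = pd.read_csv("weatherAUS.csv")
# (file name assumed). A tiny made-up frame stands in for it here.
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "MaxTemp": [22.9, 25.1, 25.7, 28.0],
    "Rainfall": [0.6, 0.0, np.nan, 0.0],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# describe() summarizes count, mean, std, min, quartiles and max
# for the numeric columns.
print(df.describe())
```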

We cannot always expect the data provided to be perfect for further analysis; in fact, that is rarely the case. Data preprocessing is therefore crucial, and handling missing values in particular is an imperative step to ensure the usability of the dataset.
 
We can use the isnull() function to get a view of the scope of missing data. The following code snippet calculates the missing value percentage per column.
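
A minimal version of that calculation, on a made-up frame with the kind of gaps the real data has:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real frame comes from the Kaggle dataset.
df = pd.DataFrame({
    "Evaporation": [4.8, np.nan, np.nan, np.nan],
    "Sunshine": [8.4, np.nan, 9.1, np.nan],
    "Rainfall": [0.6, 0.0, np.nan, 0.0],
})

# Share of missing values per column, as a percentage.
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing_pct)
```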

There are four fields with 38% to 48% missing data. I dropped these columns since these values are most probably missing not at random. For example, a large number of evaporation figures are missing, which may be limited by the capacity of the measuring instrument; days with more extreme evaporation may never have been recorded in the first place, so the remaining numbers are already biased. Retaining these fields may therefore contaminate the input data.
 
If you would like to distinguish three types of missing data, you may find this article "How to Address Missing Data" helpful.

After performing column-wise deletions, I deleted rows that are missing labels, "RainTomorrow", through dropna(). To build a machine learning model, we need labels to train or test the model, hence rows with no labels don't help much with either process. However, this section of the dataset can be separated out as the prediction set after the model implementation.
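
Sketched on a toy frame, the two deletions could look like this (assuming the four sparse columns are Evaporation, Sunshine, Cloud9am and Cloud3pm):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Kaggle frame.
df = pd.DataFrame({
    "Evaporation": [4.8, np.nan, np.nan],
    "Sunshine": [8.4, np.nan, 9.1],
    "Cloud9am": [np.nan, 7.0, np.nan],
    "Cloud3pm": [np.nan, np.nan, 2.0],
    "Rainfall": [0.6, 0.0, 0.2],
    "RainTomorrow": ["No", np.nan, "Yes"],
})

# Column-wise deletion of the sparsely populated fields...
df = df.drop(columns=["Evaporation", "Sunshine", "Cloud9am", "Cloud3pm"])
# ...then row-wise deletion of records with no label.
df = df.dropna(subset=["RainTomorrow"])
print(df.shape)
```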
 
While handling missing data, it is inevitable that the data shape changes, so the df.shape attribute is handy for keeping track of the data size. After the data manipulation above, the data shape changed from 145460 rows and 23 columns to 142193 rows and 19 columns.
 

For the remaining columns, I imputed the categorical variables and numerical variables separately. The code below classifies columns into a categorical list and a numerical list, which will also be helpful in the later EDA process.

  • Numerical Variables: impute missing values with the mean of the variable. Notice that combining df.fillna() and df.mean() is enough to transform only the numerical variables.

  • Categorical Variables: iterate through the cat_list and replace missing values with "Unknown"
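
Put together, the split-and-impute logic might look like this (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 9.2],
    "WindGustDir": ["W", np.nan, "NE"],
    "RainToday": ["No", "Yes", np.nan],
})

# Split column names by dtype: object columns are treated as categorical.
cat_list = [c for c in df.columns if df[c].dtype == "object"]
num_list = [c for c in df.columns if df[c].dtype != "object"]

# Numerical: fill gaps with the column mean (mean() only covers numerics).
df[num_list] = df[num_list].fillna(df[num_list].mean())

# Categorical: fill gaps with the placeholder "Unknown".
for col in cat_list:
    df[col] = df[col].fillna("Unknown")

print(df)
```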


 
2. Feature Engineering and EDA


 
Coupling these two processes together is beneficial for choosing the appropriate feature engineering techniques based on the distribution and characteristics of the dataset.

In this example, I did not go in-depth into the exploratory data analysis (EDA) process. If you are interested in knowing more, feel free to have a read of my article on a more comprehensive EDA guide.
 
I automated the univariate analysis with a for loop: if a numerical variable is encountered, a histogram is generated to visualize the distribution; for a categorical variable, a bar chart is created instead.
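
A compact version of that loop, on a toy frame (the headless Agg backend is assumed so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Rainfall": [0.6, 0.0, 1.2, 0.0],      # numerical -> histogram
    "WindGustDir": ["W", "NE", "W", "S"],  # categorical -> bar chart
})

chart_types = {}
for col in df.columns:
    fig, ax = plt.subplots()
    if df[col].dtype == "object":
        df[col].value_counts().plot(kind="bar", ax=ax)
        chart_types[col] = "bar"
    else:
        df[col].plot(kind="hist", ax=ax)
        chart_types[col] = "hist"
    ax.set_title(col)
    plt.close(fig)

print(chart_types)
```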


 
Address Outliers
 
Now that we have a holistic view of the data distribution, it is much easier to spot outliers. For instance, Rainfall has a heavily right-skewed distribution, indicating that there is at least one significantly high record.

To eliminate the outliers, I used quantile(0.9) to limit the dataset to values that fall within the 90th percentile. As a result, the upper bound of Rainfall dropped significantly from 350 to 6.
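
In sketch form, with made-up Rainfall values including one extreme record:

```python
import pandas as pd

df = pd.DataFrame({"Rainfall": [0.0, 0.2, 0.6, 1.0, 1.4, 2.0, 2.6, 3.4, 5.0, 350.0]})

# Keep only rows at or below the 90th percentile of Rainfall.
upper = df["Rainfall"].quantile(0.9)
df = df[df["Rainfall"] <= upper]
print(df["Rainfall"].max())
```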

Feature Transformation
 
The Date variable was transformed into Month. Date has such high cardinality that it is nearly impossible to bring out patterns, whereas Month may suggest whether it is more likely to rain in certain months of the year.
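
The transformation itself is a one-liner with the pandas datetime accessors (the dates below are made up):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2021-03-30", "2021-08-07", "2021-12-15"]})

# Parse the string dates and keep only the month number.
df["Month"] = pd.to_datetime(df["Date"]).dt.month
df = df.drop(columns=["Date"])
print(df["Month"].tolist())
```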


 
Feature Encoding
 
Logistic regression only accepts numeric values as the input, therefore, it is necessary to encode the categorical data into numbers.
 
The most common techniques are one-hot encoding and label encoding. I found this article offers an excellent comparison of the two.
 
Take RainToday as an example:

  • label encoding: better for ordinal data or data with high cardinality

  • one-hot encoding: better for low-cardinality, non-ordinal data

I chose label encoding even though these columns are not ordinal, because most fields have no fewer than 17 unique values and one-hot encoding would make the dataset grow too wide.
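
One way to apply label encoding column by column, using scikit-learn's LabelEncoder on a toy frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "RainToday": ["No", "Yes", "Unknown", "No"],
    "WindGustDir": ["W", "NE", "W", "S"],
})

# Fit a fresh encoder per column; classes are sorted alphabetically,
# so for RainToday: No -> 0, Unknown -> 1, Yes -> 2.
for col in df.columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```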

Now all variables are transformed into either integer or float.


 
Feature Selection
 
The correlation matrix is a common multivariate EDA method that assists in identifying highly correlated variables.

For example:

  • MinTemp, MaxTemp, Temp9am and Temp3pm

  • RainFall and RainToday

  • Pressure9am and Pressure3pm

Since logistic regression requires there to be little multicollinearity among predictors, I tried to keep only one variable in each group of highly correlated variables.
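
A sketch with synthetic data: two strongly correlated temperature-like columns, of which one is dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = rng.normal(size=100)
df = pd.DataFrame({
    "MinTemp": t,
    "MaxTemp": t + rng.normal(scale=0.1, size=100),  # nearly a copy of MinTemp
    "Humidity3pm": rng.normal(size=100),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr.round(2))

# Keep only one variable per highly correlated group.
df = df.drop(columns=["MaxTemp"])
```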


 

 
3. Model Building
 

Previously, I mentioned that the objective of this exercise is to predict RainTomorrow. Therefore, the first task is to separate the input features (independent variables - X) and the label (dependent variable - y). df.iloc[:, :-1] is a handy way to grab all rows and all columns except the last one.

Secondly, both features and labels are broken down into a subset for training and another for testing. As a result, four portions are returned: X_train, X_test, y_train and y_test. To achieve this, we introduce the train_test_split function and specify the test_size parameter. In the example below, test_size = 0.33, hence roughly 2/3 of the data is used for training and 1/3 for testing.
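
On a small synthetic frame (column names mimic the weather data), the split looks like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: the last column plays the role of RainTomorrow.
df = pd.DataFrame({
    "Humidity3pm": range(100),
    "Pressure9am": range(100, 200),
    "RainTomorrow": [0, 1] * 50,
})

X = df.iloc[:, :-1]  # every column except the last
y = df.iloc[:, -1]   # the label column

# test_size=0.33 -> roughly 1/3 of rows held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```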

Thanks to scikit-learn, we can avoid the tedious process of implementing all the math and algorithms from scratch. Instead, all we need to do is import LogisticRegression from the sklearn library and fit the model to the training data. However, there is still the flexibility of changing the model by specifying several parameters, e.g. max_iter, solver, penalty. More complicated machine learning models would usually involve a hyperparameter tuning process that searches through the possible hyperparameter values and finds the optimal combinations.

For this beginner-friendly model, I only alter the max_iter parameter so that the logistic regression converges; at the same time, the number should not be so high that it causes overfitting.
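
The fit itself, sketched here on synthetic classification data (the real model is of course fitted on the preprocessed weather features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Raise max_iter above the default of 100 so the solver converges.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(model.score(X_test, y_test))  # accuracy on the test portion
```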


 

4. Model Evaluation
 

ROC, AUC, confusion matrix and accuracy are widely used for evaluating logistic regression models. All of these metrics are based on comparing the y values predicted by the model (y_pred) with the actual y values of the test set (y_test). There are four possible scenarios when comparing the two:

  1. True Positive: it does rain tomorrow when rain was predicted

  2. True Negative: it doesn't rain tomorrow when no rain was predicted

  3. False Positive: it doesn't rain tomorrow when rain was predicted

  4. False Negative: it does rain tomorrow when no rain was predicted


 
Confusion Matrix

I used plot_confusion_matrix() to provide a visual representation that clearly shows the counts of the four scenarios above. As shown, there are 33122 true negative cases, suggesting that the model is good at predicting no rain tomorrow when it is actually not going to rain. However, it still needs improvement on the true positive rate, i.e. successfully predicting rain tomorrow (only 2756 cases).
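
Note that plot_confusion_matrix() has since been removed from scikit-learn (in version 1.2); confusion_matrix() plus ConfusionMatrixDisplay now covers the same ground. A minimal sketch with made-up predictions:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = rain tomorrow, 0 = no rain.
y_test = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
print(cm)

# For the visual version (needs matplotlib):
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
```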
 

Accuracy

Accuracy calculates the ratio of all correct predictions: (true positive + true negative) / (true positive + true negative + false positive + false negative)
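
With made-up predictions, the scikit-learn helper gives the same ratio:

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 1 = rain tomorrow, 0 = no rain.
y_test = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# 6 of the 8 predictions match the actual labels.
print(accuracy_score(y_test, y_pred))
```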
 

 

ROC and AUC

The ROC curve plots the true positive rate against the false positive rate at various thresholds. For example, one point indicates the true positive rate and false positive rate when the threshold is set to 0.7, i.e. RainTomorrow = Yes when the predicted probability is greater than 0.7. As the threshold drops to 0.4, more cases are predicted as positive (RainTomorrow = Yes), so both the true positive rate and the false positive rate go up. AUC stands for area under the curve; different models have different ROC curves and hence different AUC scores. In this example, model 2 has a larger AUC than model 1, which makes it the better model: at the same false positive rate, model 2 achieves a higher true positive rate.
 
Three functions are used to plot ROC and calculate AUC:

  • predict_proba(): generates the probability score for each instance

  • roc_curve(): returns the false positive rate, true positive rate and thresholds, which are essential to plot the curve.

  • roc_auc_score(): calculates the AUC
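
Putting the three together, sketched on synthetic data (the real curve would use the fitted weather model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each test instance.
y_prob = model.predict_proba(X_test)[:, 1]

# fpr/tpr at each threshold; plotting them gives the ROC curve,
# e.g. plt.plot(fpr, tpr) with matplotlib.
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
print(auc)
```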


 

 

 
Take-Home Message

This article covers some fundamental steps in a logistic regression model building process:

  1. Data Preprocessing: with the focus on missing value imputation

  2. Feature Engineering and EDA: univariate analysis and multivariate analysis; handling outliers and feature transformation

  3. Model Building: split dataset and fit the data logistic regression

  4. Model Evaluation: confusion matrix, accuracy, ROC, and AUC

However, this is just a basic guide that aims to give you a grasp of using logistic regression, hopefully in a timely manner. There is ample space to improve the current model by introducing hyperparameter tuning, feature importance analysis, and feature standardization. As always, let's keep learning.
