# Top Machine Learning Algorithms for Regression

Updated: Apr 18, 2022

**A Comprehensive Guide to Implementation and Comparison**


In my previous post "__Top Machine Learning Algorithms for Classification__", we walked through common classification algorithms. Now let’s dive into the other category of supervised learning: regression, where the output variable is continuous and numeric. This article covers four common types of regression models.

Linear Regression

Lasso Regression

Ridge Regression

Polynomial Regression


**Linear Regression**

Linear regression finds the optimal linear relationship between the independent variables and the dependent variable, and makes predictions accordingly. The simplest form is *y = b0 + b1x*. When there is only one input feature, the linear regression model fits a line in two-dimensional space so as to minimize the residuals between predicted values and actual values. The common cost function for measuring the magnitude of the residuals is the residual sum of squares (RSS).

As more features are introduced, simple linear regression evolves into multiple linear regression: *y = b0 + b1x1 + b2x2 + ... + bnxn*. Feel free to visit my __article__ if you want a specific guide to the simple linear regression model.
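To make the idea of minimizing RSS concrete, here is a minimal sketch on synthetic data. The data, the random seed and the true coefficients (2 and 3) are assumptions chosen for illustration; they are not taken from the happiness dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2 + 3x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)

# scikit-learn expects a 2D feature matrix, so reshape the single feature
model = LinearRegression().fit(x.reshape(-1, 1), y)
b0, b1 = model.intercept_, model.coef_[0]

# Residual sum of squares: the quantity ordinary least squares minimizes
residuals = y - model.predict(x.reshape(-1, 1))
rss = float(np.sum(residuals ** 2))
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, RSS = {rss:.2f}")
```

The fitted b0 and b1 should land close to the values used to generate the data, and nudging either coefficient away from the fitted value can only increase the RSS.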

**Lasso Regression**

Lasso regression is a variation of linear regression with L1 regularization. Sounds daunting? Simply put, it adds an extra penalty term to the cost function that the regression model is trying to minimize. It is called L1 regularization because the added term is proportional to the **absolute value of the coefficients** - degree of 1. Compared to Ridge Regression, it is better at driving the coefficients of some features to exactly 0, which makes it a suitable technique for feature elimination. You’ll see this in the later section “__Feature Importance__”.

**Ridge Regression**

Ridge regression is another variation of linear regression, this time with L2 regularization. As the name suggests, the regularization term is based on the **squared value of the coefficients** - degree of 2. Compared to Lasso Regression, Ridge Regression has the advantage of **faster convergence and lower computation cost**, since the squared penalty is smooth and the solution can be computed in closed form.

The regularization strength of Lasso and Ridge is determined by the lambda value. Larger lambda values shrink the coefficient values, which makes the model flatter and reduces its variance. Regularization techniques are therefore commonly used to prevent model overfitting.
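Written out, the two penalties look as follows. This is the textbook form, using the b coefficients from the linear regression section; the symbol m for the number of samples is my own notation, and libraries such as scikit-learn may scale the RSS term differently.

```latex
\text{RSS} = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2

J_{\text{lasso}} = \text{RSS} + \lambda \sum_{j=1}^{n} \lvert b_j \rvert
\qquad
J_{\text{ridge}} = \text{RSS} + \lambda \sum_{j=1}^{n} b_j^{2}
```

The intercept b0 is usually left out of the penalty, so only the feature coefficients b1 to bn are shrunk.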

**Polynomial Regression**

Polynomial regression is a variation of linear regression with polynomial feature transformation. It adds powers of, and interactions between, the independent variables. *PolynomialFeatures(degree = 2)* is applied to transform the input features up to a maximum degree of 2. For example, if the original input features are x1, x2, x3, this expands the features into x1, x2, x3, x1^2, x1x2, x1x3, x2^2, x2x3, x3^2 (plus a constant bias column unless *include_bias = False* is set). As a result, the relationship between the original features and the target is no longer linear; the model instead provides a non-linear fit to the data.

**Regression Models in Practice**

Let's implement and compare these 4 types of regression models, and explore how different lambda values affect model performance.

Please check out __code snippet__ if you are interested in getting the full code of this project.

**1. Objectives and Dataset Overview**

This project aims to use regression models to predict country happiness scores based on six other factors: “GDP per capita”, “Social support”, “Healthy life expectancy”, “Freedom to make life choices”, “Generosity” and “Perceptions of corruption”.

I used the “World Happiness Report” dataset on Kaggle, which includes 156 entries and 9 features. *df.describe()* is applied to provide an overview of the dataset.

**2. Data Exploration and Feature Engineering**

**1) drop redundant features**

Feature “Overall rank” is dropped as it is a direct reflection of the target “Score”. Additionally, “Country or Region” is dropped because it doesn’t bring any value to the prediction.
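A minimal sketch of this step is shown below. The tiny DataFrame is a placeholder with made-up values standing in for the Kaggle CSV, and the column names are spelled as in this article; the capitalization in the actual file may differ.

```python
import pandas as pd

# Tiny placeholder frame standing in for the Kaggle CSV (values are made up)
df = pd.DataFrame({
    "Overall rank": [1, 2, 3],
    "Country or Region": ["A", "B", "C"],
    "Score": [7.5, 7.0, 6.5],
    "GDP per capita": [1.3, 1.2, 1.1],
})

# Drop the rank (a restatement of the target) and the country label
df = df.drop(columns=["Overall rank", "Country or Region"])
print(df.columns.tolist())
```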

**2) univariate analysis**

Apply a histogram to understand the distribution of each feature. As shown below, “Social support” appears to be heavily left skewed whereas “Generosity” and “Perceptions of corruption” are right skewed - which informs the feature engineering techniques for transformation.

```
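# univariate analysis: one histogram per column
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(16, 8))
for i, column in enumerate(df.columns):
    sub = fig.add_subplot(2, 4, i + 1)  # 2 x 4 grid, enough for up to 8 columns
    sub.set_xlabel(column)
    df[column].plot(kind='hist', ax=sub)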
```

We can also combine the histograms with the skewness measure below to quantify whether a feature is heavily left or right skewed.

```
# measure skewness and flag columns beyond the threshold in either direction
skew_limit = 0.7
for col in df.columns:
    skewness = df[col].skew()
    if skewness < -skew_limit:
        print(col, ": left skewed", str(skewness))
    elif skewness > skew_limit:
        print(col, ": right skewed", str(skewness))
    else:
        print(col, ": not skewed", str(skewness))
```

**3) square root transformation**

*np.sqrt* is applied to transform the **right skewed features** - “Generosity” and “Perceptions of corruption”. As a result, both features become more normally distributed.

**4) log transformation**

*np.log(2-df['Social support'])* is applied to transform the **left skewed feature**. The skewness is significantly reduced, from 1.13 to 0.39.
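Both transformations can be sketched on synthetic columns. The data below is an assumption for illustration and not the Kaggle data; in particular, the left skewed column is deliberately constructed so that the reflect-and-log transform suits it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for the real columns (assumptions, not the Kaggle data)
df = pd.DataFrame({
    "right_skewed": rng.exponential(scale=0.2, size=n),               # long right tail
    "left_skewed": 2 - rng.lognormal(mean=-0.5, sigma=0.5, size=n),   # long left tail
})

before = df.skew()

# Square root compresses large values, pulling in a right tail
df["right_skewed"] = np.sqrt(df["right_skewed"])

# Reflect around 2 so the left tail becomes a right tail, then take the log.
# The constant must exceed the column maximum so the argument stays positive.
df["left_skewed"] = np.log(2 - df["left_skewed"])

after = df.skew()
print(pd.DataFrame({"before": before, "after": after}).round(2))
```

Note that the reflection flips the direction of the skew, so it is the magnitude of the skewness that should be compared before and after.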

**5) bivariate analysis**

*sns.pairplot(df)* can be used to visualize the correlation between features after the transformation. The scatter plots suggest that “GDP per capita”, “Social support” and “Healthy life expectancy” are correlated with the target feature “Score”, and hence may have higher coefficient values. Let’s find out if that’s the case in a later section.

**6) feature scaling**

Because regularization techniques penalize the coefficient values, model performance becomes sensitive to the scale of the features. Therefore, features should be transformed to the same scale. I experimented with three scalers - StandardScaler, MinMaxScaler and RobustScaler.

Check out my article on “3 Common Techniques for Data Transformation” for a more comprehensive guide to data transformation techniques.

Please note that **the scaler is fit on the training set only, and the transform is then applied to both the training and testing sets.** So, the dataset should be split first.

Then, iterate through these 3 scalers to compare their outcomes.

As you can see, scalers won’t affect the distribution and shape of the data but will change the range of the data.
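The split-then-scale order described above can be sketched as follows. The feature matrix and target are synthetic placeholders with the same shape as the happiness data (156 rows, 6 features), and the 80/20 split and random seeds are my assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic feature matrix and target (placeholders for the happiness data)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(156, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=156)

# Split first so the test set never influences the scaler's statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scalers = {
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
}
scaled = {}
for name, scaler in scalers.items():
    X_train_s = scaler.fit_transform(X_train)  # learn parameters on train only
    X_test_s = scaler.transform(X_test)        # reuse them on the test set
    scaled[name] = (X_train_s, X_test_s)
    print(f"{name:>8}: train range [{X_train_s.min():.2f}, {X_train_s.max():.2f}]")
```

Fitting on the training set only means the scaled test values can fall slightly outside the training range, which is expected and avoids leaking test information into the model.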

**3. Regression Model Comparisons**

Now let’s compare three linear regression models below - linear regression, ridge regression and lasso regression.

```
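from sklearn.linear_model import LinearRegression, Ridge, Lasso

# baseline, L2-regularized and L1-regularized models fit on the same training set
lr = LinearRegression().fit(X_train, y_train)
l2 = Ridge(alpha = 0.1).fit(X_train, y_train)
l1 = Lasso(alpha = 0.001).fit(X_train, y_train)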
```

**1) Prediction Comparison**

First, visualize the predicted values vs. actual values of the three models in one scatter plot, which suggests that their predictions mostly overlap under the current parameter settings.

**2) Feature Importance**

The second step is to experiment with how different lambda (alpha in scikit-learn) values affect the models. Most importantly, how **feature importance and coefficient values** change as the alpha value increases from 0.0001 to 1.

Based on the coefficient values generated from both the Lasso and Ridge models, “GDP per capita”, “Social support” and “Healthy life expectancy” appear to be the three most important features. This is aligned with the findings from the previous scatter plots, suggesting that they are the main drivers of the country happiness score. The side-by-side comparison also indicates that increasing alpha affects Lasso and Ridge to different degrees: features in Lasso are suppressed more strongly. That’s why **Lasso is often chosen for the purpose of feature selection.**
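The different behavior of the two penalties can be reproduced with a short loop over alpha values. This is a sketch on synthetic data in which only the first three of six features drive the target; the data, the true coefficients and the alpha grid are assumptions and not the article's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three of six features drive the target
rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(156, 6)))
true_coef = np.array([1.0, 0.8, 0.6, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.3, size=156)

alphas = [0.0001, 0.001, 0.01, 0.1, 1]
lasso_coefs, ridge_coefs = {}, {}
for alpha in alphas:
    lasso_coefs[alpha] = Lasso(alpha=alpha).fit(X, y).coef_
    ridge_coefs[alpha] = Ridge(alpha=alpha).fit(X, y).coef_
    n_zero = int(np.sum(lasso_coefs[alpha] == 0))
    print(f"alpha={alpha:<6} lasso zeros={n_zero}  "
          f"ridge min |coef|={np.abs(ridge_coefs[alpha]).min():.4f}")
```

As alpha grows, Lasso sets more coefficients to exactly zero, while Ridge keeps every coefficient non-zero and only shrinks them.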

**3) Apply Polynomial Effect**

Additionally, polynomial features are introduced to enhance the baseline linear regression - which increases the number of features from 6 to 27 (the 6 original features, 6 squared terms and 15 pairwise interaction terms).

```
# apply polynomial effects
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree = 2, include_bias = False)
X_train_poly = pf.fit_transform(X_train)
# reuse the transformer fitted on the training set; do not refit on test data
X_test_poly = pf.transform(X_test)
```

Have a look at their distribution after the polynomial transformation.

**4. Model Evaluation**

As the last step, evaluate and compare the performance of the Lasso and Ridge regression models, before and after the polynomial effect. I created four models:

l2: Ridge regression without polynomial features

l2_poly: Ridge regression with polynomial features

l1: Lasso regression without polynomial features

l1_poly: Lasso regression with polynomial features

Commonly used regression model evaluation metrics are MAE, MSE, RMSE and R Squared - check out my article on “A Practical Guide to Linear Regression” for a detailed explanation. Here I used MSE (mean squared error) to evaluate model performance.
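A comparison of this kind could be sketched roughly as follows. The data is synthetic with a mild interaction effect built in, and the helper `fit_and_score`, the alpha grid and the `max_iter` setting are my own assumptions rather than the original project code; only the four model names follow the list above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a mild interaction effect (placeholder for the real set)
rng = np.random.default_rng(2)
X = rng.normal(size=(156, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
     + rng.normal(scale=0.2, size=156))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler and the polynomial transformer on the training split only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
pf = PolynomialFeatures(degree=2, include_bias=False).fit(X_train)
X_train_poly, X_test_poly = pf.transform(X_train), pf.transform(X_test)


def fit_and_score(model, X_tr, X_te):
    """Fit on the training split and return MSE on the test split."""
    model.fit(X_tr, y_train)
    return mean_squared_error(y_test, model.predict(X_te))


results = []
for alpha in [0.0001, 0.001, 0.01, 0.1, 1]:
    results.append({
        "alpha": alpha,
        "l2": fit_and_score(Ridge(alpha=alpha), X_train, X_test),
        "l2_poly": fit_and_score(Ridge(alpha=alpha), X_train_poly, X_test_poly),
        "l1": fit_and_score(Lasso(alpha=alpha, max_iter=10000), X_train, X_test),
        "l1_poly": fit_and_score(
            Lasso(alpha=alpha, max_iter=10000), X_train_poly, X_test_poly
        ),
    })

for row in results:
    print(row)
```

The numbers from this synthetic run will not match the charts discussed next; the sketch only shows how the four models and the MSE comparison fit together.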

1) Comparing **Ridge and Lasso in one chart** indicates that they have similar accuracy when alpha values are low, but Lasso deteriorates significantly as alpha gets closer to 1.

2) Comparing the models **with and without polynomial effect in one chart,** we can tell that polynomial features decrease MSE in general - hence enhance model performance. This effect is more significant in Ridge regression when alpha increases to 1, and more significant in Lasso regression when alpha is closer to 0.0001.

However, even though polynomial transformation improves the performance of regression models, it makes the model harder to interpret - it is hard to tell the main model drivers from a polynomial regression. Lower error does not always guarantee a better model; it is about finding the right balance between predictive power and interpretability based on the project objectives.
