
Top Machine Learning Algorithms for Classification

Updated: Mar 21


machine learning algorithms in python
grab the cheatsheet from our infographics gallery

Types of Machine Learning Algorithms

Supervised vs. Unsupervised vs. Reinforcement Learning

The easiest way to distinguish supervised learning from unsupervised learning is to check whether the data is labelled.

Supervised learning learns a function that predicts a defined label from the input data. The task can be either classifying data into a category (classification problem) or forecasting a continuous outcome (regression problem).

Unsupervised learning reveals underlying patterns in the dataset that are not explicitly presented, such as the similarity of data points (clustering algorithms) or hidden relationships between variables (association rule algorithms) ...

Reinforcement learning is another category of machine learning, in which an agent learns to take actions based on its interactions with the environment, with the aim of maximizing rewards. It is the most similar to the human learning process, following a trial-and-error method.


Classification vs Regression

Supervised learning can be further categorized into classification and regression algorithms. A classification model identifies which category an object belongs to, whereas a regression model predicts a continuous output.

Sometimes the line between classification algorithms and regression algorithms is blurry. Many algorithms can be used for both classification and regression, and a classifier can often be viewed as a regression model with a threshold applied: when the predicted number is higher than the threshold it is classified as true, and when it is lower it is classified as false.
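The thresholding idea can be sketched in a few lines. This is only an illustration under my own assumptions: the toy data and the 0.5 threshold are made up, and a plain LinearRegression is used just to produce a continuous score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data (assumed for illustration): one feature, binary label
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

# a regression model outputs a continuous score for each sample
reg = LinearRegression().fit(X, y)
scores = reg.predict(X)

# applying a threshold turns the continuous score into a class label
threshold = 0.5
labels = (scores > threshold).astype(int)
print(labels)
```

In practice a dedicated classifier such as logistic regression is usually a better choice, because its scores are constrained to lie between 0 and 1.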


In this article, we will discuss the top 6 machine learning algorithms for classification problems: logistic regression, decision tree, random forest, support vector machine, k-nearest neighbour and naive Bayes. I will summarize the theory behind each one and show how to implement it in Python. Check out the code for the model pipeline here.

 

1. Logistic Regression

logistic regression

Logistic regression uses the sigmoid function above to return the probability of a label. It is widely used when the classification problem is binary, for example true or false, win or lose, positive or negative.

The sigmoid function generates a probability output. This probability is then compared with a pre-defined threshold, and the object is assigned to a label accordingly. Check out my posts on logistic regression for a detailed walkthrough.
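To make the sigmoid-plus-threshold step concrete, here is a minimal sketch in plain Python. The helper names sigmoid and classify are hypothetical, and z stands for the raw score that the model computes from the input features; the 0.5 threshold is an assumed default.

```python
import math

def sigmoid(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return label 1 if the sigmoid probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# a negative score, a score of zero, and a positive score
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Raising the threshold makes the model more conservative about predicting the positive label, while lowering it does the opposite.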

Below is the code snippet for a default logistic regression. I have also listed the common hyperparameters to experiment with, to see which combination gives the best result.

from sklearn.linear_model import LogisticRegression

# fit a default logistic regression and predict labels for the test set
reg = LogisticRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

logistic regression common hyperparameters: penalty, max_iter, C, solver
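One way to experiment with these hyperparameters is a grid search with cross-validation. The sketch below is only an example under my own assumptions: the synthetic dataset stands in for X_train and y_train, and the candidate values for C and solver are illustrative rather than recommended.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic data stands in for X_train / y_train (assumption)
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

# candidate values are illustrative, not recommendations
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["lbfgs", "liblinear"],
}

# try every combination with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern applies to the other classifiers in this article: swap in the estimator and list its hyperparameters in param_grid.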


2. Decision Tree

decision tree

A decision tree builds its branches hierarchically, and each branch can be considered an if-else statement. The branches develop by partitioning the dataset into subsets based on the most important features. The final classification happens at the leaves of the decision tree.

from sklearn.tree import DecisionTreeClassifier

# fit a default decision tree and predict labels for the test set
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

decision tree common hyperparameters: criterion, max_depth, min_samples_split, min_samples_leaf, max_features
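To see the if-else structure of a fitted tree, scikit-learn can print the learned splits as text. This is a small sketch rather than part of the original pipeline: the iris dataset and max_depth=2 are my own choices to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# a shallow tree keeps the printed rules short
dtc = DecisionTreeClassifier(max_depth=2, random_state=0)
dtc.fit(iris.data, iris.target)

# print the learned splits as nested if-else style rules
rules = export_text(dtc, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a condition on one feature, and the lines starting with "class:" are the leaves where the final classification happens.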


3. Random Forest

random forest

As the name suggests, a random forest is a collection of decision trees. It is a common type of ensemble method, which aggregates results from multiple predictors. Random forest additionally uses the bagging technique, in which each tree is trained on a random sample of the original dataset, and takes the majority vote from the trees. Compared to a single decision tree, it generalizes better but is less interpretable, because a prediction combines many trees instead of following one path of if-else rules.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

random forest common hyperparameters: n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, bootstrap
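The bagging and majority-vote idea can be sketched by hand with individual decision trees. This is a simplified illustration, not how RandomForestClassifier is implemented internally: the synthetic dataset, the 25 trees and the variable names are assumptions, and scikit-learn's own forest averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary vote cannot tie
votes = []
for i in range(n_trees):
    # bootstrap: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # each tree also considers a random subset of features at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

# majority vote across trees for each test sample (labels are 0/1)
votes = np.array(votes)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((y_pred == y_test).mean())
```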


4. Support Vector Machine (SVM)

support vector machine

Support vector machine finds the best way to classify data points based on their position relative to a border between the positive class and the negative class. This border is known as the hyperplane, and it is chosen to maximize the distance (the margin) to the nearest data points from each class. Similar to decision tree and random forest, support vector machine can be used for both classification and regression; SVC (support vector classifier) is chosen for classification problems.

from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
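To look at the hyperplane and the margin directly, here is a small sketch on toy data. The six points and the linear kernel are my own assumptions for illustration; the default SVC above uses an RBF kernel, for which coef_ is not available.

```python
import numpy as np
from sklearn.svm import SVC

# toy, linearly separable data (assumed for illustration)
X = np.array([[0, 0], [1, 1], [0, 1], [3, 3], [4, 4], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# a linear kernel makes the hyperplane coefficients directly available
svc = SVC(kernel="linear")
svc.fit(X, y)

# signed score: positive on one side of the hyperplane, negative on the other
scores = svc.decision_function(X)
# for a linear kernel, the margin width is 2 / ||w||
margin = 2 / np.linalg.norm(svc.coef_)

print(svc.support_vectors_)
print(scores)
print(margin)
```

The support vectors are the training points closest to the border, and they are the only points that determine where the hyperplane sits.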