How to Self Learn Data Science in 2022

Updated: Apr 17

A Project-Based Approach to Get Started in Data Science

Although I don’t hold a degree in data science, I am truly passionate about this field, so I decided to experiment with building my own curriculum to self-learn data science in my spare time. I would like to share my experience and hope it offers some insights if you are on the same journey.

Project-based learning is a good starting point for people who already have some technical background and want to explore the building blocks of data science. A typical data science / machine learning project follows a lifecycle - from defining the objectives, through data preprocessing, exploratory data analysis, feature engineering and model implementation, to model evaluation. Each phase requires different skill sets, mainly statistics, programming, SQL, data visualization, mathematics and business knowledge.

I highly recommend Kaggle as the platform for experimenting with your data science projects. With plenty of interesting datasets and a cloud-based programming environment, you can easily get data sources, code and notebooks from Kaggle for free. As a reader and writer on Medium, I also recommend using that platform to gain data science knowledge from professionals and share your own projects, all in the same place.

Why a Project-Based Approach?

  1. It is practical and gives us a sense of achievement that we are doing something real!

  2. It highlights the rationale for learning each piece of content. This goal-oriented approach provides a bird's-eye view of how the little pieces work together to form the big picture.

  3. It allows us to actively retrieve information as we learn. “Active Recall” has been shown to significantly enhance information retention, compared to conventional learning that only requires passively consuming knowledge.

Let's break down the project lifecycle into the following 5 steps and see how each step connects to various knowledge domains.

1. Business Problem & Data Science Solution

The first step of a data science project is to identify the business problem and define the objectives of an experiment design or model deployment.

Skillset - Business Knowledge

This stage doesn’t need technical skills yet, but it demands business understanding to identify the problem and define the objectives. The first task is to understand the domain-specific terminology that appears in the dataset, and the second is to translate a business requirement into a technical solution. Building up this knowledge takes years of experience in the field. I can only recommend some websites that increase your exposure to various business domains, for example Harvard Business Review, Hubspot, Investopedia and TechCrunch. Additionally, I recommend the book "Data Science for Business" for an integrated view of data science and business.

Skillset - Statistics (Experiment Design)

After defining the problem, the next task is to frame it as a data science solution. This starts with knowledge of experiment design, such as hypothesis testing, sampling, bias / variance, different types of errors, and overfitting / underfitting.

In the article "An Interactive Guide to Hypothesis Testing in Python", I introduced various types of statistical tests - t test, ANOVA, Chi-Square test etc.
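
To make the logic of hypothesis testing concrete, here is a minimal sketch using only the Python standard library. Note that it swaps in a permutation test, a simpler technique than the t test, ANOVA or Chi-Square test named above, because it needs no statistical tables. The function name `permutation_test` and both samples are made up for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups come from the same distribution,
    so the group labels are interchangeable.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, for illustration only.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]

p_value = permutation_test(control, treatment)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value means that a difference this large would rarely appear if the labels were interchangeable, which is the same reasoning that underlies the tests listed above.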

Machine learning can fundamentally be considered a hypothesis testing process, where we need to search the hypothesis space for a model that best fits our observed data and allows us to make predictions on unobserved data.
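
As a toy sketch of that idea, assume a deliberately tiny hypothesis space: models of the form y = w * x, with only a handful of candidate slopes. The data points and the candidate list are hypothetical. Real learning algorithms search far larger spaces, usually by optimization and not by enumeration.

```python
# Observed data, made up for illustration: y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A tiny hypothesis space: each candidate slope w defines the model y = w * x.
candidate_slopes = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]


def squared_error(w):
    """Sum of squared residuals of the model y = w * x on the observed data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))


# "Learning" here means picking the hypothesis that best fits the observations.
best_w = min(candidate_slopes, key=squared_error)

# The chosen model can then predict for an input that was never observed.
prediction = best_w * 5.0
print(best_w, prediction)
```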

2. Data Extraction & Data Preprocessing

The second step is to collect data from various sources and transform the raw data into a digestible format.

Skillset - SQL

SQL is a powerful language for communicating with and extracting data from structured databases.

Additionally, learning SQL helps you build a mental model for generating insights through data querying techniques, such as grouping, filtering, sorting, and joining. You will find similar logic in other tools, such as Pandas and SAS.
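
The four querying techniques mentioned above can be sketched in one short query. This example assumes an in-memory SQLite database accessed through Python's built-in `sqlite3` module. The `customers` and `orders` tables and their rows are hypothetical.

```python
import sqlite3

# An in-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'US'), (2, 'UK'), (3, 'US');
    INSERT INTO orders VALUES
        (1, 1, 50.0), (2, 1, 20.0), (3, 2, 70.0), (4, 3, 5.0), (5, 3, 40.0);
""")

query = """
    SELECT c.country, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id   -- joining
    WHERE o.amount >= 10                          -- filtering
    GROUP BY c.country                            -- grouping
    ORDER BY total_amount DESC                    -- sorting
"""
rows = conn.execute(query).fetchall()
conn.close()

for country, total_amount in rows:
    print(country, total_amount)
```

The query joins each order to its customer, keeps only orders of at least 10, totals the amounts per country, and lists the largest total first.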

Skillset - Python (Pandas)

It is essential to get comfortable with a programming language. Its simple syntax makes Python a relatively easy language to start with. Here is a great video tutorial if you are new to Python: Python for Beginners - Learn Python in 1 Hour.

Once you have a basic understanding, it is worth spending some time learning the Pandas library. Pandas is almost unavoidable if you use Python for data extraction. It loads data from a database into a DataFrame - a table-like format that most of us are familiar with. In the data preprocessing stage, the following data quality issues need to be examined and addressed, and all of this can be done using Pandas.

  • address missing data

  • transform inconsistent data types

  • remove duplicate values
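
The three fixes above can be sketched with a few Pandas calls. This is a minimal example under stated assumptions: the DataFrame and its column names are made up, and filling missing ages with the median is only one of several reasonable strategies.

```python
import pandas as pd

# Hypothetical raw data: ages stored as text, one missing age, one repeated row.
raw = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Cara"],
    "age": ["34", "41", "41", None],
    "spend": [120.5, 80.0, 80.0, 95.0],
})

clean = raw.copy()

# Transform inconsistent data types: convert the text column to numbers.
clean["age"] = pd.to_numeric(clean["age"])

# Remove duplicates: drop rows that repeat an earlier row exactly.
clean = clean.drop_duplicates().reset_index(drop=True)

# Address missing data: fill the missing age with the median of known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```

Duplicates are dropped before the median is computed so that the repeated row does not count twice in the fill value.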

Useful Resources: