Upgrade to Get Unlimited Access
($10 One Off Payment)

How to Build ML Model using BigQuery



While taking the first step into the field of machine learning, it is so easy to get overwhelmed by all sorts of complex algorithms and ugly symbols. Therefore, hopefully, this article can lower the entry barrier by providing a beginner-friendly guide. Allow you to get a sense of achievement by building your own ML model using BigQuery and SQL. That's right, we can use SQL to implement machine learning. If you are looking for several lines of code to get your hands dirty in the ML field, please continue reading :)


1. Set Up the Basic Structure 📁


In a nutshell, BigQuery project contains datasets and a dataset contains tables and models.


Follow these steps to set up a project using Google public dataset - "Google Analytics Sample". If you are familiar with the procedure, feel free to jump to the second section of this article.


1) Navigate to BigQuery in the left sidebar of the Google Cloud Platform Console and then create a project in Google Cloud Platform, name it as anything you want (e.g. “project” in this example)


2) With the "SQL Workspace" panel, select your project and then click “Create Dataset” and give your dataset a name (e.g. analytics).


3) Click “Add data" and select "Explore public data sets". From the public dataset library, search google analytics sample and add it to your project. Have a read of the Google Analytics sample schema if interested.


4) Explore the dataset by navigating to the "ga_sessions_" in the "bigquery-public-data" and select the "Preview" tab. It is a massive dataset with multiple columns and a collection of subsets which are partitioned by date (e.g. "2017-08-01" in the example below). Notice that the dataset sits within the project "bigquery-public-data" rather than our own project "project-303707", but we can still query this dataset from our project as it is publicly accessible.



2. What's Your Problem 🤨 ?


Before jumping into building the model, clarify the problem we are trying to solve by asking the following questions:


1) What is the label?

A label is a variable to be predicted. In this example, I will predict whether the website visitor will make any transactions and I gave this label the name "purchase". This can be derived from the existing variable "totals.transactions".


2) How is the label value defined?

For simplicity, let's make this prediction a black or white situation, either "purchase" or "no purchase". Since the model training cannot handle string value as the output result, therefore it is necessary to code them into numbers.