How to read BigQuery data from TensorFlow 2.0 efficiently

Using tensorflow_io.bigquery.BigQueryClient to create a tf.data.Dataset

Lak Lakshmanan
Towards Data Science
5 min read · Feb 6, 2020


TensorFlow’s BigQueryClient uses the BigQuery Storage API to efficiently read data directly out of BigQuery storage (i.e., without having to issue a BigQuery query). In this article, I show you how to use this class in your Keras/TensorFlow 2.0 model to create a tf.data.Dataset. You can follow along with this notebook on GitHub.

Use BigQuery as a Data Lake: read data directly from BigQuery into TensorFlow

As an illustration, I will use a dataset of credit card transactions. Some 0.17% of these transactions are fraudulent, and the challenge is to train a classification model on this very, very unbalanced dataset. So, along the way, you will also learn some tricks on how to deal with imbalanced data.

Benchmark

A best practice whenever developing a machine learning model is to have a simple benchmark. In this case, I will develop the benchmark model using BigQuery ML:

CREATE OR REPLACE MODEL advdata.ulb_fraud_detection 
TRANSFORM(
* EXCEPT(Amount),
SAFE.LOG(Amount) AS log_amount
)
OPTIONS(
INPUT_LABEL_COLS=['class'],
AUTO_CLASS_WEIGHTS = TRUE,
DATA_SPLIT_METHOD='seq',
DATA_SPLIT_COL='Time',
MODEL_TYPE='logistic_reg'
) AS
SELECT
*
FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`

Note that I do several things in this super-simple model:

  • I am training a logistic regression model (or linear classifier) — this is the simplest possible ML model for this problem and can serve as a check against more complex models.
  • I am dividing the data using the Time column, so that the first 80% of transactions are the training dataset and the last 20% are the evaluation dataset. This way, we don’t leak knowledge in case there is a time-dependence to the fraudulent activities.
  • I am asking BigQuery ML to automatically weight the classes based on their frequency in the training dataset.
  • I am transforming the Amount column, which has a very wide range, using the log function so that it becomes a relatively small number. I do this with the TRANSFORM clause of BigQuery ML.
  • Because the Amount in some cases is zero, I use SAFE.LOG to avoid numeric errors.

This gives me a model with an AUC (area under the curve) of 0.9863, illustrating how powerful BigQuery ML can be. Let’s see if we can beat this with a more complex machine learning model written in Keras.
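
If you want to retrieve that number yourself, you can query ML.EVALUATE, which by default uses the evaluation split defined at training time. Here is a minimal sketch, run through the google.cloud.bigquery client so it fits in the same notebook (the use of that client is my addition; the column names follow ML.EVALUATE's standard output for a logistic regression model):

from google.cloud import bigquery

# minimal sketch: pull the standard evaluation metrics for the benchmark model
bq = bigquery.Client()
sql = """
SELECT roc_auc, precision, recall, accuracy
FROM ML.EVALUATE(MODEL advdata.ulb_fraud_detection)
"""
print(bq.query(sql).to_dataframe())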

Creating a tf.data.Dataset

To read the BigQuery data into Keras, I am going to use the BigQueryClient. Here’s code that will create a dataset:

import tensorflow as tf
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession

def features_and_labels(features):
    label = features.pop('Class')  # this is what we will train for
    return features, label

def read_dataset(client, row_restriction, batch_size=2048):
    GCP_PROJECT_ID = 'your_project_name'  # CHANGE
    COL_NAMES = ['Time', 'Amount', 'Class'] + ['V{}'.format(i) for i in range(1, 29)]
    COL_TYPES = [dtypes.float64, dtypes.float64, dtypes.int64] + [dtypes.float64 for i in range(1, 29)]
    DATASET_GCP_PROJECT_ID, DATASET_ID, TABLE_ID = 'bigquery-public-data.ml_datasets.ulb_fraud_detection'.split('.')
    bqsession = client.read_session(
        "projects/" + GCP_PROJECT_ID,
        DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,  # note: table_id comes before dataset_id
        COL_NAMES, COL_TYPES,
        requested_streams=2,
        row_restriction=row_restriction)
    dataset = bqsession.parallel_read_rows()
    return dataset.prefetch(1).map(features_and_labels).shuffle(batch_size*10).batch(batch_size)

client = BigQueryClient()
train_df = read_dataset(client, 'Time <= 144803', 2048)
eval_df = read_dataset(client, 'Time > 144803', 2048)

Essentially, we use client.read_session() to create a read session, passing in the table to read, the columns of the table we want, and a simple restriction on the rows we care about. This data is read in parallel into the tf.data.Dataset, and I use it to create a training dataset and an evaluation dataset.

Note that the code above uses some of the tf.data.dataset best practices such as prefetching, shuffling and batching.
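
Before wiring this into a model, it is worth pulling a single batch to confirm what the pipeline yields; a minimal sketch (the printed shapes assume a full batch of 2048 rows):

# minimal sketch: inspect one batch of features and labels
for features, label in train_df.take(1):
    print('feature keys:', sorted(features.keys()))
    print('V1 batch shape:', features['V1'].shape)   # (2048,) for a full batch
    print('label batch shape:', label.shape)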

Create Keras model input layer

Creating a Keras model to read structured data involves feature columns. To learn more, see my article on creating wide-and-deep models in Keras and on keeping transformation code separate from inputs. So, without repeating myself, here’s the code to create the input layer to Keras:

# create inputs, and pass them into appropriate types of feature columns (here, everything is numeric)
inputs = {
    'V{}'.format(i): tf.keras.layers.Input(name='V{}'.format(i), shape=(), dtype='float64')
    for i in range(1, 29)
}
inputs['Amount'] = tf.keras.layers.Input(name='Amount', shape=(), dtype='float64')
input_fc = [tf.feature_column.numeric_column(colname) for colname in inputs.keys()]

# transformations: only the Amount is transformed
transformed = inputs.copy()
transformed['Amount'] = tf.keras.layers.Lambda(
    lambda x: tf.math.log(tf.math.maximum(x, 0.01)), name='log_amount')(inputs['Amount'])
input_layer = tf.keras.layers.DenseFeatures(input_fc, name='inputs')(transformed)

Handling class imbalance

Handling class imbalance in a Keras model consists of two steps:

  • Specify an initial bias for the output layer equal to log(pos/neg)
  • Weight the infrequent class much more heavily than the frequent class, in such a way that the total weights add up to the number of training samples

We can compute the necessary values using BigQuery:

WITH counts AS (
  SELECT
    APPROX_QUANTILES(Time, 5)[OFFSET(4)] AS train_cutoff
    , COUNTIF(Class > 0) AS pos
    , COUNTIF(Class = 0) AS neg
  FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`
)
SELECT
  train_cutoff
  , SAFE.LOG(SAFE_DIVIDE(pos, neg)) AS output_bias
  , 0.5 * SAFE_DIVIDE(pos + neg, pos) AS weight_pos
  , 0.5 * SAFE_DIVIDE(pos + neg, neg) AS weight_neg
FROM counts

This gives me the following numbers: the Keras model’s output bias needs to be set to -6.36, and the class weights need to be 289.4 (for the rare, fraudulent class) and 0.5 (for the normal class).
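
As a sanity check on this arithmetic, the same numbers fall out of the formulas in Python; the class counts below (492 fraudulent, 284,315 legitimate) are my assumption about this public dataset and can be verified with a simple COUNTIF query:

import numpy as np

# assumed class counts for the ULB fraud dataset (verify with COUNTIF in BigQuery)
pos, neg = 492, 284315

output_bias = np.log(pos / neg)          # ≈ -6.36
weight_pos = 0.5 * (pos + neg) / pos     # ≈ 289.4
weight_neg = 0.5 * (pos + neg) / neg     # ≈ 0.5
print(output_bias, weight_pos, weight_neg)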

Creating Keras model

We can then create a Keras model with two fully connected hidden layers and a dropout layer (to limit overfitting), taking care to provide an initial bias to the output layer and class weights to the loss function:

# metrics to track (names chosen to match the reported val_ metrics)
metrics = [tf.keras.metrics.BinaryAccuracy(name='accuracy'),
           tf.keras.metrics.Precision(name='precision'),
           tf.keras.metrics.Recall(name='recall'),
           tf.keras.metrics.AUC(name='roc_auc')]

# Deep learning model: two hidden layers with dropout in between
d1 = tf.keras.layers.Dense(16, activation='relu', name='d1')(input_layer)
d2 = tf.keras.layers.Dropout(0.25, name='d2')(d1)
d3 = tf.keras.layers.Dense(16, activation='relu', name='d3')(d2)
# initialize the output bias to log(pos/neg) computed above
output = tf.keras.layers.Dense(1, activation='sigmoid', name='d4',
                               bias_initializer=tf.keras.initializers.Constant(-6.36))(d3)
model = tf.keras.Model(inputs, output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=metrics)

# weight the rare (fraudulent) class much more heavily than the normal class
class_weight = {0: 0.5, 1: 289.4}
history = model.fit(train_df, validation_data=eval_df, epochs=20, class_weight=class_weight)

The result? After 20 epochs, I got:

val_accuracy: 0.9718 - val_precision: 0.0401 - val_recall: 0.8831 - val_roc_auc: 0.9865

This is better than the benchmark logistic regression model, but only barely. We will need to tune the deep learning model’s hyperparameters (the number of nodes, the dropout rate, etc.) to do even better than this.
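
If you want to recompute these validation metrics outside of the training loop, a minimal sketch:

# minimal sketch: re-evaluate the trained model on the evaluation dataset
results = model.evaluate(eval_df)
for name, value in zip(model.metrics_names, results):
    print(name, value)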

Loading TensorFlow model into BigQuery

You can load the trained TensorFlow model into BigQuery and use it to do inference. To load the model, call this from Keras:

BUCKET = 'your_bucket_name'  # CHANGE to a GCS bucket you can write to
model.save('gs://{}/bqexample/export'.format(BUCKET))

Then, in BigQuery, do:

CREATE OR REPLACE MODEL advdata.keras_fraud_detection 
OPTIONS(model_type='tensorflow',
model_path='gs://BUCKETNAME/bqexample/export/*')

You can predict with this model just as if it were a native BigQuery ML logistic regression model:

SELECT d4, Class
FROM ML.PREDICT( MODEL advdata.keras_fraud_detection,
(SELECT * FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection` WHERE Time = 85285.0)
)

The reason it’s called d4 above is that my Keras output node was called d4.

Summary

In this article, you saw how to:

  • Read data directly from BigQuery into a TensorFlow 2.0/Keras model
  • Load the trained Keras model into BigQuery

Along the way, you also saw how to train a BigQuery ML model and a Keras model on highly imbalanced data.

Next steps:

  1. The code in this article is in a notebook on GitHub. Try it out in AI Platform Notebooks.
  2. For more on BigQuery, read the book BigQuery: The Definitive Guide, published by O’Reilly Media.

Note: Depending on your company’s pricing plan, use of the Storage API beyond a certain limit may incur BigQuery charges. At the time I am writing this, customers on the flat rate plan start incurring these charges if they read in excess of 300 TB/month.
