Machine Learning in your Database? The Case For and Against BigQuery ML
The week has been full of announcements from the Google NEXT conference, where Google has been unveiling its next-generation cloud technologies. Artificial intelligence (AI) was front and center, with new AI capabilities released across different areas of the Google Cloud stack. Among those announcements, the release of BigQuery ML attracted a lot of attention, and for a very good reason: BigQuery ML brings machine learning (ML) capabilities to one of the most popular cloud data warehouse platforms on the planet. Unfortunately, as happens with any overhyped release, the internet was immediately bombarded with uninformed articles that greatly exaggerate the capabilities of BigQuery ML. Today, I would like to take a pragmatic look at BigQuery ML through the lens of a machine learning practitioner and highlight its benefits and limitations.
What is BigQuery ML?
In a nutshell, BigQuery ML is a series of SQL extensions that allow data scientists to build and deploy machine learning models that use data stored in the BigQuery platform. The use of SQL is the key theoretical contribution of BigQuery ML, as it abstracts the complex lifecycle of a machine learning model behind a simple and familiar SQL syntax. From that perspective, BigQuery ML automatically handles aspects such as the training/test split, regularization, optimization, and feature standardization, all driven by a specific SQL syntax.
The promise of BigQuery ML is to make machine learning more accessible to developers familiar with SQL and abstract many of the painful and highly mathematical aspects of machine learning methods into simple SQL statements. What does it do specifically? Well, the current release only supports two types of machine learning models:
- Linear regression: These models can be used for predicting a numerical value.
- Binary logistic regression: These models can be used for predicting one of two classes (such as identifying whether an email is spam).
Developers can create a model by using the CREATE MODEL or CREATE OR REPLACE MODEL syntax, as shown in the following code:
CREATE MODEL dataset.model_name
OPTIONS(model_type='linear_reg', input_label_cols=['input_label'])
AS SELECT * FROM input_table;
The CREATE MODEL statement trains the model as part of its execution. After that, we can inspect its training statistics:
SELECT
*
FROM
ML.TRAINING_INFO(MODEL `dataset.model_name`)
A model can be evaluated at any time to obtain metrics such as precision, recall, and accuracy for a classification model. The evaluation data must include the label column:
WITH eval_table AS (
SELECT
*
FROM
`dataset.evaluation_table` )
SELECT
*
FROM
ML.EVALUATE(MODEL `dataset.model_name`,
TABLE eval_table)
Predictions can be generated with a simple ML.PREDICT statement:
SELECT
game_id,
predicted_label
FROM
ML.PREDICT(MODEL `dataset.model_name`,
TABLE dataset.dataset_to_predict) AS predict
So there you have it, simple machine learning in SQL 😉 As you can see, BigQuery ML abstracts many of the regularization and optimization aspects of machine learning solutions while still offering experts some level of control over them. The capabilities of BigQuery ML are available through the BigQuery UI and command-line interface as well as programmatically via APIs. Not surprisingly, the platform integrates seamlessly with other Google Cloud services such as Dataflow or Data Studio.
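To make that "some level of control" concrete, here is a minimal sketch of a model definition that sets a few training options explicitly instead of relying on the defaults. The dataset, model, table, and label names are placeholders I made up, and the option values are arbitrary rather than recommendations. The option names are the ones I understand the CREATE MODEL statement to accept for linear and logistic regression, so check them against the documentation for your release before relying on them.

```sql
-- Hypothetical names: my_dataset, my_model, training_table, and the `label` column.
-- Option values are illustrative only, not tuned recommendations.
CREATE OR REPLACE MODEL `my_dataset.my_model`
OPTIONS(
  model_type = 'logistic_reg',         -- binary classifier
  input_label_cols = ['label'],        -- column the model learns to predict
  l2_reg = 0.1,                        -- L2 regularization strength
  max_iterations = 20,                 -- upper bound on training iterations
  learn_rate_strategy = 'constant',    -- fixed step size instead of line search
  learn_rate = 0.1,                    -- only used with the 'constant' strategy
  early_stop = TRUE,                   -- stop when the loss stops improving
  data_split_method = 'random',        -- random training/evaluation split
  data_split_eval_fraction = 0.2       -- hold out 20% of the rows for evaluation
) AS
SELECT
  *
FROM
  `my_dataset.training_table`;
```

Anything left out of OPTIONS falls back to a default chosen by BigQuery ML, which is what the two-line version shown earlier does.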
Google positions BigQuery ML as an intermediate step between the complete self-service capabilities of AutoML and the low-level, data-science expert model of TensorFlow and Cloud ML.
BigQuery ML is not the first attempt to bring machine learning to a database engine. Microsoft has been trying a similar approach with its SQL Server Machine Learning Services offering, although it hasn’t been widely successful. Compared to SQL Server, BigQuery is definitely a better candidate to incorporate machine learning capabilities, and BigQuery ML capitalizes well on the flexible, serverless programming model and highly scalable architecture of the platform.
What Is BigQuery ML Not?
Despite its unquestionable benefits, I think we would be well served by putting the technology in the right context. From the case studies presented at Google NEXT, I got the impression that customers have difficulty delineating the value of BigQuery ML compared to full-blown machine learning stacks.
I’ve been skeptical of these simpler forms of machine learning for a while, but my position has little to do with technology. For the record, I think BigQuery ML is the best-designed in-database machine learning stack I’ve seen thus far. However, in my experience, there is a knowledge friction that happens when you give the power of data science to non-data-scientists 😉 If a person doesn’t really understand machine learning, then even if they can create a basic model, they typically struggle to reason about its performance and architecture. For self-service machine learning to be successful, I believe we need a different mathematical abstraction for representing models, one that is not based on hyperparameters, multi-layer neural networks, etc. I know this is a controversial point, but hopefully my explanation makes sense.
In addition to the previous topic, I think there are very tangible limitations to the BigQuery ML approach:
- Machine learning problems, deep learning in particular, are not just about selecting an algorithm such as linear or logistic regression but about selecting the right architecture (number of layers, number of nodes, sequence of algorithms, etc.). You can’t simply abstract that in a two-line SQL statement.
- A complement to the previous point is the fact that dynamic computation graphs are becoming more and more relevant in deep learning models with technologies like PyTorch. Again, this will be very hard to implement in SQL.
- Many relevant deep learning disciplines such as reinforcement learning or generative models require computational graphs that, I believe, are nearly impossible to express in SQL syntax.
- Regularization and optimization are rarely a matter of changing one hyperparameter and moving on. They are highly methodical processes that often require changes in the architecture of the model, such as incorporating new hidden layers.
All in all, I think BigQuery ML is an intriguing value proposition and certainly a great addition to the Google Cloud stack. As for my skepticism, think of it as a set of practical observations from a machine learning practitioner. As machine learning evolves, we are going to see more and more attempts to expand its value proposition to non-data scientists. It’s just human nature. I think most of the self-service machine learning technologies in the space are causing more harm than good, but BigQuery ML might have found an interesting middle ground.