Understanding the Data Science Pipeline
Defining "data science" is tricky business. People explain the term in a myriad of ways. I believe the crux of the confusion is that data science is not a single task but rather a process.
In his excellent article The Age of the Data Product, Ben Bengfort spells out the concept of a data science pipeline. (I highly encourage you to read the article in full.) Based on my experiences, I've adapted the pipeline presented in the article, which is shown below.
Let's review each step.
Ingestion. This is the beginning of the pipeline. It centers on pinpointing and obtaining the necessary data, which can take many forms. We might need to write SQL queries, construct a web scraper, or simply track down a handful of CSVs.
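To make this concrete, here is a minimal ingestion sketch in Python. The connection string, the orders table, and the customers.csv file are hypothetical placeholders standing in for whatever sources your project actually uses.

```python
# A minimal ingestion sketch; all source names below are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

def ingest() -> pd.DataFrame:
    # Pull records from a relational database with a SQL query...
    engine = create_engine("postgresql://user:password@host:5432/warehouse")
    orders = pd.read_sql("SELECT * FROM orders WHERE order_date >= '2023-01-01'", engine)

    # ...and join them to a flat file we tracked down separately.
    customers = pd.read_csv("data/customers.csv")
    return orders.merge(customers, on="customer_id", how="left")
```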
Exploration. Once we have our data, we can begin exploring it. This stage takes the form of computing summary statistics and building visualizations. Oftentimes, we will want to inspect the distributions of continuous variables and the sparsity of categorical features. We will also want to search for issues with our data: missing observations, outliers, nonsensical values. Beyond that, our work in this phase should inspire ideas for feature engineering; that is, creating new features from our current data. Relatedly, our efforts might spark ideas for more data to ingest.
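As an illustration, here is a small exploration sketch with pandas and matplotlib. It assumes the DataFrame produced by the hypothetical ingestion step above, with order_amount standing in for any continuous variable of interest.

```python
# An exploration sketch; `df` and the column names are assumptions.
import matplotlib.pyplot as plt
import pandas as pd

def explore(df: pd.DataFrame) -> None:
    # Summary statistics for continuous variables.
    print(df.describe())

    # Cardinality of categorical features (a rough check on sparsity).
    print(df.select_dtypes(include="object").nunique())

    # Missing observations per column.
    print(df.isna().sum().sort_values(ascending=False))

    # Distribution of a continuous variable; outliers often show up here.
    df["order_amount"].hist(bins=50)
    plt.title("Distribution of order_amount")
    plt.show()
```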
Wrangling. We now need to clean our data and engineer features. This step is focused on developing modular functions to "get our data into shape." The takeaways from our exploration step should directly inform how we wrangle our data.
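A wrangling sketch under the same assumptions might look like the following: small, composable functions whose logic comes directly from what exploration revealed. The specific columns and rules are illustrative, not prescriptive.

```python
# A wrangling sketch; column names and cleaning rules are illustrative.
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Drop nonsensical values spotted during exploration.
    df = df[df["order_amount"] > 0]
    # Make missingness explicit rather than silent.
    df["region"] = df["region"].fillna("unknown")
    return df

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # New features suggested by the exploration step.
    df["log_order_amount"] = np.log1p(df["order_amount"])
    df["is_repeat_customer"] = (df["order_count"] > 1).astype(int)
    return df

def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    return engineer_features(clean(df))
```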
Modeling. This is the sexy part of data science, the one that perhaps attracts the most attention. In many cases, "modeling" is synonymous with machine learning, though that is not always the case. In some instances, we might simply need traditional or Bayesian statistics.
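For instance, a modeling sketch with scikit-learn might look like this. The target column (churned) and the features are assumptions carried over from the wrangling sketch above; any estimator could sit in place of the random forest.

```python
# A modeling sketch with scikit-learn; target and features are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train(df: pd.DataFrame) -> RandomForestClassifier:
    X = df[["log_order_amount", "is_repeat_customer"]]
    y = df["churned"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # Hold-out evaluation before we consider deployment.
    print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    return model
```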
Deployment. The goal of any data science project should be to deploy a model. A model is substantially more powerful when it's not sitting on your isolated laptop. Deploying a model might mean developing a REST API, or it could involve using AWS Batch to run the model and push predictions to an endpoint at given intervals. The main idea is to have the model act autonomously.
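To give a flavor of the REST API route, here is a minimal deployment sketch using Flask. The serialized model file and the expected feature names are assumptions; a production service would add validation, logging, and authentication.

```python
# A minimal deployment sketch with Flask; model file and features are assumed.
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # The model trained and saved earlier.

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload containing the same features used in training.
    payload = pd.DataFrame([request.get_json()])
    score = model.predict_proba(payload)[:, 1][0]
    return jsonify({"churn_probability": float(score)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```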
Tracking. Placing our model into a production environment means it will be seeing new data; that is, data not used for training and testing in our modeling stage. Models "in the wild" are a whole new ballgame. It is incumbent on us to track our model's performance, recording predictions and comparing them to actual outcomes. These comparisons will allow us to understand if our model is performing as anticipated and if we need to adjust our approach. In that vein, our tracking leads directly into ingestion, which kicks off the cycle once more.
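A tracking sketch might be as simple as joining logged predictions to observed outcomes and recomputing the evaluation metric on live data. The table layouts below are assumptions.

```python
# A tracking sketch; the prediction and outcome table layouts are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

def track(predictions: pd.DataFrame, outcomes: pd.DataFrame) -> float:
    # predictions: prediction_id, predicted_probability, predicted_at
    # outcomes:    prediction_id, actual_label
    joined = predictions.merge(outcomes, on="prediction_id", how="inner")
    live_auc = roc_auc_score(joined["actual_label"], joined["predicted_probability"])
    print(f"Live AUC over {len(joined)} scored records: {live_auc:.3f}")
    return live_auc
```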
Thus far, I have presented the pipeline as a linear process. That is only partially correct; the first few stages are iterative. After exploring our data, we might decide we need to ingest more data. Once we've wrangled our data, we often want to kick off another round of exploration. After building a handful of models, we might decide to wrangle the data in new ways in an attempt to improve model performance.
We can also view the pipeline as automatic vs. manual. The first few steps in our pipeline, the iterative ones, typically require human judgment. At a certain point, however, we can move straight from ingestion to modeling and deployment. Model re-trains and re-deployments can be performed either in batch (i.e., we wait until we have accumulated X amount of new data to refresh our model) or as a streaming process (i.e., we update our model as soon as new data becomes available). In either case, we bypass human intervention and shorten our pipeline. Once we have a well-oiled system and a proven model (which should have feature creation and cleaning embedded as parameterized steps), we can transition from a manual approach to an automated one.
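As a rough illustration of the batch flavor, the sketch below refreshes the model only once enough new labeled data has accumulated, reusing the hypothetical wrangle and train functions from the earlier sketches. The threshold is arbitrary.

```python
# A batch re-train sketch; the threshold and helper functions are assumptions
# (wrangle and train come from the earlier sketches).
import joblib
import pandas as pd

RETRAIN_THRESHOLD = 10_000  # Arbitrary: refresh once this many new labeled rows arrive.

def maybe_retrain(new_records: pd.DataFrame, model_path: str = "model.joblib"):
    if len(new_records) < RETRAIN_THRESHOLD:
        return None  # Not enough new data yet; keep serving the current model.

    # Cleaning and feature creation are embedded in the pipeline itself,
    # so no human intervention is needed between ingestion and re-deployment.
    model = train(wrangle(new_records))
    joblib.dump(model, model_path)
    return model
```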
Framing data science as a pipeline has a number of benefits. First, it allows us to clearly segment and identify work. Second, it supplies a common language for discussing data science projects. Beyond that, it communicates that the discipline is multi-faceted and customizable within given parameters.