Iterating Quickly On Large Data and Slow Models

It’s no longer a secret that deep learning has been a huge boon to the field of AI, and to machine learning in particular. The ability of deep networks to build hierarchical representations that lead to accurate predictions is truly impressive, even if they are really just sophisticated curve fitters.

Many smart people in AI have pointed out that deep networks are not the future, but they are the best that we’ve got right now, so it’s important that we learn how to work with them until the next best thing comes along.

At Apteo, we’ve been using them extensively since our founding, so we’ve come to appreciate them while working around their quirks, one of which is the relatively long time they take to train.

At the beginning, we had a lot of inefficient processes in our data science pipeline (and we still have various steps that we can continue to optimize). For one, we had to compile and transform a large dataset every time we wanted to retrain our models or evaluate a new feature. We also had networks that weren’t really optimized to do any sort of dimensionality reduction, which made our training times extremely long. More recently, we switched to a method of time-dependent cross-validation that required us to rebuild a network for a large number of folds in order to evaluate our performance.

In this post, I wanted to share the techniques we’ve chosen to address these three issues.

Recreating a large dataset every time we wanted to train a model

As I mentioned above, in our earlier days (and in startup life, that only equates to a few months ago), we were inefficiently recreating our dataset from scratch every time we needed to train our models. This entailed a series of very slow activities, including downloading data locally, looking up values from our database, and making calls to API providers. The entire process could take upwards of half a day, if not longer.

So we optimized this process heavily by separating the creation of our dataset from the training of our model. The key piece is a “golden set” of our last known good data, which contains the feature values and labels for every data instance we have available to work with. Now, when we train a new model, we download the latest golden set, load it into memory, and then perform any transformations we need before feeding the data into the network’s training routine.

We could have included the results of those transformations in the golden set itself, but keeping the data in a more generic format gives us flexibility in the transformations we apply to the raw data.
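To make the pattern concrete, here’s a minimal sketch of the idea, assuming the golden set is stored as a Parquet file and loaded with pandas. The file path, columns, and transform_features() helper are hypothetical stand-ins for illustration, not our actual code.

```python
# Minimal sketch of the "golden set" pattern; the path and the specific
# transformation are hypothetical, not a real pipeline.
import pandas as pd

GOLDEN_SET_PATH = "golden_set.parquet"  # assumed: last-known-good features + labels


def load_golden_set(path: str = GOLDEN_SET_PATH) -> pd.DataFrame:
    """Load the pre-built dataset instead of recompiling it from raw sources."""
    return pd.read_parquet(path)


def transform_features(df: pd.DataFrame) -> pd.DataFrame:
    """Training-time transformations, kept separate so the raw set stays generic."""
    # Example transformation only: fill missing numeric values with column medians.
    return df.fillna(df.median(numeric_only=True))


raw = load_golden_set()            # fast: one file read, no DB lookups or API calls
train_df = transform_features(raw)  # transformations can change without rebuilding the set
```

The design choice is simply that dataset creation happens on its own schedule, while training only ever reads the latest artifact.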

Now, this process is still not fast. It takes some time to download our data and load it up into memory, but it is significantly faster than what we used to do.

In general, we recreate our golden set every week. Getting this process up and running did take some engineering work, but now that we have this infrastructure in place, we’re able to iterate much more quickly on model and feature evaluation.

Dimensionality reduction

LSTMs have been a great gift to the world of NLP. However, LSTMs on large datasets can be slow, and they certainly are for us. Fortunately, one of the great things about deep networks today is that they can be built from many different building blocks. One of those building blocks, the CNN, allows us to reduce the input dimensionality of the data that goes into our LSTMs, which speeds up the training process.

The idea behind this is fairly simple. CNNs are great at reducing input data to its most salient features. Traditionally, they’ve done this in two-dimensional image space, but they can also be used to perform time-based convolutions. In this approach, which we use, sequence-based data is fed through multiple convolutional layers that learn to find the most important patterns in local space. Those patterns are then passed into an LSTM as a sequence, and since they have been condensed, the LSTM has less data to process.
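Here’s a minimal sketch of that structure, assuming tf.keras, integer-encoded sequences, and a binary label. The layer sizes, vocabulary, and sequence length are illustrative assumptions, not our production architecture.

```python
# Sketch of 1D convolutions feeding an LSTM; all hyperparameters below
# are illustrative assumptions, not a production configuration.
import tensorflow as tf

MAX_LEN = 500       # assumed padded sequence length
VOCAB_SIZE = 20000  # assumed vocabulary size

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    # Time-based convolutions plus pooling condense local patterns,
    # shrinking the sequence the LSTM has to step through.
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    # The LSTM now sees roughly MAX_LEN / 16 condensed steps instead of MAX_LEN.
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

With these (made-up) settings, the LSTM processes about 30 timesteps instead of 500, which is where the training speedup comes from.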

There are other dimensionality reduction techniques we’ve examined as well, including PCA and t-SNE, and we’ve also looked into capping the maximum sequence length that we pass into our CNNs/LSTMs. Though we don’t use the more traditional dimensionality reduction techniques today, the combination of capped sequence lengths and CNNs in front of LSTMs has helped us immensely.
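Capping the sequence length is the simplest of these. A minimal sketch of one way to do it, assuming integer-encoded sequences and tf.keras’s pad_sequences utility; the cutoff and toy data below are placeholders.

```python
# Sketch of capping sequence length; MAX_LEN and the toy sequences are
# placeholders, not real data or an actual cutoff.
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 500  # assumed cap: longer sequences are truncated, shorter ones padded
token_sequences = [[12, 7, 3, 99, 4], [5, 8, 41]]  # toy integer-encoded sequences

X = pad_sequences(token_sequences, maxlen=MAX_LEN,
                  padding="post", truncating="post")
print(X.shape)  # (2, 500)
```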

GPUs

On a side note, using time-based convolutions has allowed us to heavily leverage GPUs, which has also significantly reduced the time it takes for us to train a new model. What used to take us days on a CPU now takes hours on a GPU. Anecdotally, I’ve seen a 10x-50x improvement in speed from being able to use GPUs, so using CNNs for dimensionality reduction has provided us two benefits for the price of one.

Efficient cross-validation

The final technique that I’ll discuss in this post is highly specific to our domain, but its principles should be applicable to many fields.

In our world, we deal heavily with time-dependent data: data whose distribution is constantly changing over time. To evaluate our models, we adopted time-dependent cross-validation, whereby we separate our data into multiple folds, each consisting of a training set from the past and a testing set from the future. Each fold is also contiguous with the next, creating a sliding window of training and testing data that captures a large amount of our data for evaluating our models.
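For illustration, here’s a minimal sketch of that kind of sliding-window split, assuming a pandas DataFrame with a datetime column. The window sizes, fold count, and column name are made-up parameters, not the ones we use.

```python
# Sketch of sliding-window, time-ordered folds: train on a past window,
# test on the window immediately after it. Parameters are illustrative.
import pandas as pd


def sliding_time_folds(df: pd.DataFrame, date_col: str,
                       train_days: int = 365, test_days: int = 30,
                       n_folds: int = 10):
    """Yield (train, test) frames where every test window follows its train window."""
    df = df.sort_values(date_col)
    end = df[date_col].max()
    for i in range(n_folds):
        test_end = end - pd.Timedelta(days=i * test_days)
        test_start = test_end - pd.Timedelta(days=test_days)
        train_start = test_start - pd.Timedelta(days=train_days)
        train = df[(df[date_col] >= train_start) & (df[date_col] < test_start)]
        test = df[(df[date_col] >= test_start) & (df[date_col] < test_end)]
        yield train, test


# Hypothetical usage: one network is built and evaluated per fold.
# for train, test in sliding_time_folds(df, "date"): ...
```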

As robust as this methodology is, it’s incredibly slow. When we were using ten folds, we needed to create ten different networks and evaluate them ten separate times. After the evaluation process, we also created a single production network to be used on all of our data. This process could take days or weeks, depending on the exact model we were training. Obviously that’s terrible for productivity.

Our solution was to select a small number of time periods that were significant for our problem at hand and use those periods as the test sets for our networks. We continue to separate our training and testing data by time, but we now use a discrete set of time periods, some far in the past (allowing for less training data but correspondingly less training time), some more recent. We cross-validate our network on those time periods and compute a weighted average of the metrics from each period, resulting in a cross-validated set of metrics that are representative of important periods in history.
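As a toy example of the weighting step only: the per-period scores and weights below are placeholders, not real results, and the weighting toward recent periods is an assumption for illustration.

```python
# Sketch of combining per-period metrics with a weighted average.
# All numbers are placeholders, not real evaluation results.
import numpy as np

# Metric (e.g. AUC) from a model evaluated on each significant test period.
period_scores = np.array([0.71, 0.74, 0.78])
weights = np.array([0.2, 0.3, 0.5])  # assumed weighting toward more recent periods

cross_validated_metric = float(np.average(period_scores, weights=weights))
print(cross_validated_metric)  # 0.754
```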

Using this methodology, we do miss out on the ability to cross-validate our network on every day available to us, but we capture the time periods that have the biggest impact on our domain, and we treat the metrics we derive as representative of our overall performance.

Wrapping up

All in all, it still takes us a significant amount of time to train and evaluate our networks. However, using the techniques mentioned above, we’ve been able to dramatically reduce the time it takes to iterate on a new feature or an improvement to our network structure: it’s gone from a matter of weeks to a matter of hours.

We’re continually investigating new techniques and approaches to both improve our accuracy and decrease the time it takes to build models. The approaches in this post were simply a few of our most significant methodologies.



If you’re interested in learning more about optimizing model training, or have a general interest in learning more about Apteo, please don’t hesitate to reach out to us at info@apteo.co.
