Data Infrastructure
The technology world has come a long way in a short time. What was considered best practice a few years ago is now out of date and old-fashioned. I still remember when “SOA” was a thing and OLAP cubes were how people handled big-data analytics. Though some parts of the world may still use those terms and technologies, Silicon Valley has zoomed way past them.
What hasn’t changed, though, is the importance of accessing and retrieving data efficiently. If anything, the importance of ready access to one’s data has grown immensely with the emergence of large-scale machine learning, and that holds true for both large and small companies.
At Apteo, we spent a lot of time on our Phase 1 R&D. It was heavily geared towards utilizing the best ML methods available to us while finding and integrating new data sources that we found to be useful for our purposes. We came out of this effort with what we branded as our “V1” platform — a set of deep networks, analytical techniques, datasets, and a strategy for putting them all together into a smart index investment product.
V1 was a lot of fun. Not only did we learn a lot, but we also worked on some really fascinating data science problems, all while coming together as a small but productive team.
We’re now getting close to what we’re calling our “V2” platform. In contrast to V1, this version has been all about infrastructure. We’ve spent the past couple of months solidifying our prediction mechanisms, creating tools for generating our ML models robustly and repeatably, implementing best practices in both technology and data science, squashing bugs, and paying off tech debt. When we had a chance, we also built out a user-facing dashboard.
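To give a flavor of what “generating models repeatably” can mean in practice, here is a minimal, hypothetical sketch (not our actual code): pin the random seed and derive a stable run identifier by hashing the full training configuration, so any run can be reproduced and compared later. The function and parameter names are illustrative assumptions.

```python
import hashlib
import json
import random

def make_run_config(params: dict, seed: int = 42) -> dict:
    """Build a reproducible run config with a stable, hash-based run ID.

    Hypothetical example: a real pipeline would also seed numpy, torch,
    etc., and record dataset and code versions alongside the params.
    """
    random.seed(seed)  # pin Python's RNG so the run is repeatable
    config = {"seed": seed, "params": params}
    # Hashing the canonical (sorted-key) JSON gives every distinct
    # configuration its own deterministic identifier.
    config["run_id"] = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]
    return config

cfg = make_run_config({"model": "deep_net", "learning_rate": 0.001})
```

The point of the content hash is that two runs with identical configs get identical IDs, which makes it trivial to detect duplicate experiments and to tie a stored model artifact back to exactly the settings that produced it.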
Now why would we actually spend all this time on backend engineering tasks when we have an entire world of data science and user-facing product to build?
Because without it, we would be far too unproductive and slow in the future.
The fact we had to face is that data infrastructure enables everything else in our world, and I suspect the same is true for nearly every machine learning company out there.
Before we solidified our infrastructure, our jobs would error out every so often, we wouldn’t reliably get the predictions we needed to move our investment strategy forward, we had very little insight into the status of our scheduled apps, and we would spend far too much time in devops mode.
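As an illustration of the kind of fix involved, here is a hedged, hypothetical sketch of one small reliability pattern: wrapping a scheduled job with retries and status logging, so a transient failure gets retried with backoff and every outcome is visible instead of silently erroring out. The names (`run_with_retries`, `backoff_seconds`) are illustrative, not our actual tooling.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scheduled_jobs")

def run_with_retries(job, max_attempts: int = 3, backoff_seconds: float = 1.0):
    """Run a zero-argument job, retrying on failure with linear backoff.

    Logs each attempt so the status of scheduled work is observable;
    re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
            log.info("job succeeded on attempt %d", attempt)
            return result
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise  # surface the failure to the scheduler/alerting
            time.sleep(backoff_seconds * attempt)
```

In a real setup the log lines would typically feed a monitoring or alerting system, which is what turns "the job errored out again" from a surprise into a notification.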
All of this took time away from what we should really be spending our time on: research, development, and analysis of better investment models.
We’re small, but this issue affects companies of every size. Google and Uber have both published posts on the massive infrastructure they built so that their ML engineers and data scientists could easily create and deploy models. The same thing happened at my former company.
Quick and efficient access to data is a luxury that many data scientists don’t get in their jobs. As we grow, we’ll undoubtedly have to rebuild our platform from scratch at least once, if not more often, to accommodate our needs. For now, I’m happy to be nearing the end of a rather long engineering effort that will hopefully pay dividends for the foreseeable future.