Accelerating AI @Intuit With Feature Pipelines and Store
For someone new to the machine learning space, a feature is a measurable attribute of a focal entity under observation. Features are often derived from raw input data available in a real-time event stream or in tables in a data warehouse. For example:
- Aggregates (max, mean, median, sum, min, etc.) over a window of time, for example the last 5 days (see the sketch after this list)
- Word embeddings for natural language processing
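To make the first example concrete, here is a minimal sketch of computing a windowed aggregate feature with pandas; the column names, entities, and the 5-day window are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical raw events: one row per transaction, keyed by an entity id.
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(
        ["2021-03-01", "2021-03-03", "2021-03-05", "2021-03-02", "2021-03-04"]
    ),
    "amount": [120.0, 40.0, 75.0, 300.0, 10.0],
})

# Derived feature: sum of `amount` per entity over a rolling 5-day window.
events = events.sort_values("event_time").set_index("event_time")
amount_sum_5d = (
    events.groupby("entity_id")["amount"]
    .rolling("5D")
    .sum()
    .rename("amount_sum_5d")
)
print(amount_sum_5d)
```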
Most of the time, data scientists derive several features from the available raw input data and, as part of model training, evaluate which features produce the best model performance.
Operating an ML pipeline in production and dealing with complex infrastructure like AWS and streaming technologies such as Kafka, Spark Streaming and Flink is hard and not the best use of data scientists' time. We want data scientists to focus on solving critical customer problems using the right transformation logic for features and the right parameters/algorithms for models. At Intuit, we have been on this journey to build a sophisticated Machine Learning Platform for more than two years now. As part of model development lifecycle management, we have built several key capabilities covering the major areas of feature engineering, feature store, model training, inference and feedback, which make it easier for data scientists to focus on feature transformation logic and model algorithms while infrastructure management is handled by the ML Platform. The diagram below shows a high-level view of the different phases of developing a machine learning model.
In this article, the feature engineering and feature store sections of the platform are covered. While building the feature management capabilities, our focus has been to solve customer problems around three foundational areas:
- Feature Availability - make features available in as close to real time as possible, especially for use cases involving fraud detection and in-product (or in-experience) personalization.
- Feature Consistency - ensure that features are calculated through the same mechanism for training and inference, and that they are validated against the defined schema.
- Feature Usability - make features easier to discover, understand and re-use through proper metadata and lineage tracking, while guaranteeing that the features are well governed from a compliance standpoint.
Feature Engineering
Feature Engineering involves building a feature pipeline to process raw event streams or data in the lake, depending on whether it's a stream or batch use case.
We have created a completely self-serve experience to on-board a feature pipeline through a developer portal, where the user provides a few details about their project so that we can create the code repository in the appropriate GitHub location for the team and tag resources for billing purposes. After the user on-boards, they get:
- A Git repository with boilerplate code for writing feature engineering code
- CI/CD automation to build and deploy
- Artifactory locations for snapshots and releases
- A runtime environment to promote the pipeline through dev, pre-prod and prod environments
The feature engineering jobs use the featurizer interface to easily read data from a source, transform it and publish the resulting features to the Feature Store. We have made this interface available for both batch and stream use cases and in different programming languages such as Java and Python. More details are included in the Feature Publishing APIs section below.
The features are persisted in the Feature Store using the ingestion infrastructure, which also ensures feature validation and consistency across the online and offline stores. The next section covers the Feature Store in detail.
Feature Store
A Feature Store is at the heart of feature management. Once the user has written the business logic for features, the rest is handled by the interfaces around the Feature Store. The diagram below shows the different interfaces around the store.
There are four major components to focus on in the above diagram:
- Feature Registration APIs
- Feature Publishing APIs
- Online and Offline Feature Store
- Feature Serving APIs
Feature Registration APIs
A Feature Set is an important concept we introduced for managing features. It provides a logical grouping of related features associated with the same entity. It serves multiple purposes:
- Ensure data consistency - Using the specifications provided for the feature set and its features, the framework validates published features before storing them in the online and offline stores
- Easier feature discovery and consumption - Once registered, feature sets can be looked up in the Data Catalog service. This is key to making features discoverable and reusable. The entire feature set can be read for model training or inference with a much simpler access pattern
- Better management of entities - A Feature Set provides a better way to organize features associated with a given entity, especially as usage grows and teams create new custom entities that weren't anticipated initially. It also provides an easier way to manage entities that need special handling for compliance reasons
- Optimized writes to the Online Store - Instead of frequent writes to low-latency stores such as DynamoDB, writing a composite feature vector consisting of all features in a feature set reduces the number of writes and hence the cost
An important step in building a feature pipeline is to register a Feature Set, which involves providing metadata such as its name, description and focal entity, along with metadata for the enclosed features. For example:
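As a sketch of what such a registration might look like (the field names, feature definitions, and the registration call are hypothetical, not the platform's actual schema):

```python
# Hypothetical Feature Set registration payload.
feature_set_spec = {
    "name": "customer_transaction_profile",
    "description": "Aggregated transaction behavior per customer",
    "entity": "customer_id",  # the focal entity the features are keyed on
    "features": [
        {
            "name": "txn_amount_sum_5d",
            "type": "double",
            "description": "Sum of transaction amounts over the last 5 days",
        },
        {
            "name": "txn_count_30d",
            "type": "long",
            "description": "Number of transactions in the last 30 days",
        },
    ],
}

# Registration validates the spec and creates a Data Catalog entry, e.g.:
# registry_client.register_feature_set(feature_set_spec)
```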
Registering the Feature Set is self-served through a developer portal where users fill out a form; behind the scenes, the platform also makes an entry in the Data Catalog for easier discovery.
Feature Publishing APIs
The publishing APIs are used in the feature engineering jobs to send features to the Feature Store. These APIs abstract away the complexity of preparing feature vectors and expose user-friendly interfaces for sending transformed data. An example code snippet of the interface is shown below:
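As a rough sketch of how such a publishing interface might be used (the class, method names, and payload shape here are hypothetical, not the platform's actual API):

```python
# Hypothetical featurizer: read raw events, transform them into features,
# and publish the results to the Feature Store. Names are illustrative only.
class TransactionFeaturizer:
    def __init__(self, feature_store_client, feature_set="customer_transaction_profile"):
        self.client = feature_store_client
        self.feature_set = feature_set

    def transform(self, raw_events):
        """Aggregate raw events into feature values keyed by entity id."""
        features = {}
        for event in raw_events:
            entity = event["customer_id"]
            row = features.setdefault(
                entity, {"txn_amount_sum_5d": 0.0, "txn_count_30d": 0}
            )
            row["txn_amount_sum_5d"] += event["amount"]
            row["txn_count_30d"] += 1
        return features

    def publish(self, features):
        """Send the transformed feature values to the Feature Store."""
        for entity_id, values in features.items():
            self.client.publish_features(
                feature_set=self.feature_set,
                entity_id=entity_id,
                features=values,
            )
```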
Offline and Online Feature Store
The offline feature store is used for batch-oriented activities such as exploration, model training and batch inference. Typically, users start by exploring existing feature sets in the offline store to discover and re-use any features that fit their models. If they find something useful, they just need to use a simple interface to read those features in their training and batch inference jobs. The next section talks more about the feature serving interface. The offline store contains all feature values over time.
The online feature store, on the other hand, is used to serve models that need low-latency access to features at inference time. Models used in scenarios such as fraud detection and onsite personalization have strict SLAs and need to respond to the application in milliseconds. To support such real-time use cases, the latest value of each feature is stored in the online store. The online store can be looked up for a given entity, for example a person logging into a TurboTax account, to get all the features and make predictions for that person.
Feature Serving APIs
The features can be easily consumed in model training and inference jobs in a declarative way, using just a few lines of configuration that specify the features the model wants to use. For example:
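A minimal sketch of such a declaration, with hypothetical feature set and feature names:

```python
# Hypothetical declarative feature spec: the model only names the feature set
# and the features it needs; the serving layer resolves where they live.
feature_spec = {
    "feature_set": "customer_transaction_profile",
    "features": ["txn_amount_sum_5d", "txn_count_30d"],
}
```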
In the case of real-time inference, the feature serving layer returns the latest values retrieved from the online store. For batch inference and training, the serving layer can either return the entire dataset for the specified Feature Set or filter it using a query, for example to include only feature values for entities from the last six months.
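To make the two serving modes concrete, the sketch below shows what the corresponding retrieval calls could look like; the serving client and its methods are assumptions for illustration.

```python
from datetime import datetime, timedelta


def load_online_features(serving_client, entity_id):
    """Real-time path: latest feature values for one entity from the online store."""
    return serving_client.get_online_features(
        feature_set="customer_transaction_profile",
        features=["txn_amount_sum_5d", "txn_count_30d"],
        entity_id=entity_id,
    )


def load_training_features(serving_client):
    """Batch path: offline read, filtered to roughly the last six months."""
    return serving_client.get_offline_features(
        feature_set="customer_transaction_profile",
        features=["txn_amount_sum_5d", "txn_count_30d"],
        start_time=datetime.utcnow() - timedelta(days=180),
    )
```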
Active Areas of Development
Versioning
We are working on adding versioning for feature sets, with both major and minor versions. With minor versions, additive changes to feature sets (new features or feature changes) can be made in a backwards-compatible manner. Major version changes will involve removing features from the feature set, changing the nature of a feature, etc., and will be backwards incompatible.
Last Mile Transformations
The feature store was designed assuming that all the features a model needs would be pre-calculated and made available for consumption, in real-time or batch mode, for inference. But we are uncovering use cases where the final feature value is ephemeral, for example an aggregated feature based on the last N events, which becomes obsolete as soon as the next event occurs. Another example is when features are only needed for a small subset of users who interact with a product experience. We want to build a capability that allows transformations to be applied to the base features stored in the Feature Store. This would give users flexibility in how they calculate final feature values just before invoking the model, and would also reduce the amount of pre-calculated features we otherwise have to store, with potential cost savings.
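As an illustration of the idea (a sketch of the concept, not an existing platform capability), a last-mile transformation could be a small function applied to base features and recent events just before the model is invoked:

```python
def last_mile_transform(base_features, recent_events, n=10):
    """Hypothetical last-mile step: derive an ephemeral aggregate over the
    last N events at request time, on top of pre-computed base features."""
    window = recent_events[-n:]
    mean_amount = (
        sum(event["amount"] for event in window) / len(window) if window else 0.0
    )
    return {**base_features, "txn_amount_mean_last_n": mean_amount}
```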
In future articles, we will share progress on the topics mentioned above, as well as many more model development lifecycle capabilities of the platform.
Acknowledgements
Shoutout to the entire ML Platform team at Intuit for developing such awesome capabilities for our data science users. And many thanks to Anwar Habeeb, Manuela Wei, Srivathsan Canchi and Suresh Raman for reviewing and providing valuable feedback on this article.