Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast, real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To backtest our production models across our full data set, we needed to parallelize our workloads while still running the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, the decisions we made, and the other options available when integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times, and I am sure others will benefit from the gotchas we identified.