Today we’re announcing a new Netflix-OSS project called Surus. Over the next year we plan to release a handful of our internal user-defined functions (UDFs) that have broad adoption across Netflix. The use cases for these functions are varied (e.g. scoring predictive models, outlier detection, and pattern matching), and together they extend the analytical capabilities of big data platforms.
The first function we’re releasing allows for efficient scoring of predictive models in Apache Pig using the Predictive Model Markup Language (PMML). PMML is an open-source standard that supports a concise representation of predictive models in XML, hence the name of the new function: ScorePMML.
At Netflix, we use predictive models everywhere. Although the applications for each model are different, the process by which each of these predictive models is built and deployed is consistent. The process usually looks like this:
- Someone proposes an idea and builds a model on “small” data
- We decide to “scale up” the prototype to see how well the model generalizes to a larger dataset
- We may eventually put the model into “production”
At Netflix, we have different tools for each step above. When scoring data in our Hadoop environment, we noticed a proliferation of custom scoring approaches in steps two and three. These custom approaches added overhead as individual developers migrated models through the process. Our solution was to adopt PMML as a standard way to represent model output and to write ScorePMML, a UDF for scoring PMML files at scale.
ScorePMML aligns Netflix predictive modeling capabilities around the open-source PMML standard. By leveraging this standard, we enable a flexible and consistent representation of predictive models for each of the steps mentioned above. Using the same PMML representation of a model at each step saves time and money by reducing both the risk and the cost of custom code. PMML provides an effective foundation for iterating quickly on the modeling methods it supports. Our data scientists have started adopting ScorePMML where it allows them to iterate and deploy models more effectively than the legacy approach.
Now for the practical part. Let’s imagine that you’re building a model in R. You might do something like this:
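# Required Dependencies (assumes the randomForest, gbm, pmml and XML packages)
library(randomForest)
library(gbm)
library(pmml)  # pmml() converts a fitted model to a PMML document
library(XML)   # saveXML() writes that document to disk
# Column Names must NOT contain periods
names(iris) <- gsub("\\.","_",tolower(names(iris)))
# Build Models
iris.rf <- randomForest(Species ~ ., data=iris, ntree=5)
iris.gbm <- gbm(Species ~ ., data=iris, n.trees=5)
# Convert to pmml and Output to File (paths match the Pig script below)
saveXML(pmml(iris.rf), file="~/iris.rf.xml")
saveXML(pmml(iris.gbm), file="~/iris.gbm.xml")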
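For orientation, a PMML file is ordinary XML: a DataDictionary element declares each field's name and data type, and the model definition follows it. The sketch below parses a small hand-written fragment (not the actual output of the R step above) and lists the declared fields. The fragment and the helper name `list_fields` are illustrative assumptions, and Python is used only because it is convenient for a quick look at the file.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed PMML-style fragment for illustration only.
# A real file exported from R would also contain the full model definition.
PMML_SNIPPET = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="3">
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_width" optype="continuous" dataType="double"/>
    <DataField name="species" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>
"""


def list_fields(pmml_text):
    """Return {field name: dataType} from the DataField elements of a PMML document."""
    root = ET.fromstring(pmml_text)
    fields = {}
    for elem in root.iter():
        # Tags look like '{namespace}DataField'; drop the namespace so the
        # sketch does not depend on a particular PMML version namespace.
        if elem.tag.rsplit("}", 1)[-1] == "DataField":
            fields[elem.get("name")] = elem.get("dataType")
    return fields


if __name__ == "__main__":
    for name, data_type in list_fields(PMML_SNIPPET).items():
        print(name, data_type)
```

These declared names and types are what the input columns have to line up with when the file is scored.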
And, now let’s say that you want to score 100 billion rows…
DEFINE pmmlRF com.netflix.pmml.ScorePMML('~/iris.rf.xml');
DEFINE pmmlGBM com.netflix.pmml.ScorePMML('~/iris.gbm.xml');
-- LOAD Data
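-- Schema assumed from the renamed iris columns in the R step
iris = load '~/iris.csv' using PigStorage(',') as
    (sepal_length:double, sepal_width:double, petal_length:double, petal_width:double, species:chararray);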
-- Score two models in one pass over the data
scored = foreach iris generate pmmlRF(*) as RF, pmmlGBM(*) as GBM;
That’s how easy it should be.
There are a few things you should think about, though, before trying to score 100 billion records in Pig.
- We throw a Pig FrontendException when the Pig/Hive data types and column names don’t match the data types and column names in PMML. This means that you don’t need to wait for the Hadoop MR job to start before getting the feedback that something is wrong.
- The ScorePMML constructor accepts local or remote file locations. This means that you can reference an HDFS or S3 path, or you can reference a local path (see the example above).
- We’ve made scoring multiple models in parallel trivial. Furthermore, models are only read into memory once, so there isn’t a penalty when processing multiple models at the same time.
- When scoring big (and usually uncontrolled) datasets, it’s important to handle errors gracefully. You don’t want to rescore 100 records because you fail on the 101st record. Rather than throwing an exception (and failing the job), we’ve added an indicator to the output tuple that can be used for alerting.
- Although ScorePMML is currently written to run in Pig, we may port it to other platforms in the future.
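Two of the points above, failing fast on a schema mismatch and flagging bad records instead of aborting, are general patterns that can be sketched outside Pig. The Python below is a minimal illustration under stated assumptions, not ScorePMML's implementation: `check_schema`, `score_rows`, and `toy_model` are hypothetical names, and the (prediction, error) pair merely stands in for the indicator in ScorePMML's output tuple, whose actual layout is not shown here.

```python
def check_schema(input_schema, model_fields):
    """Fail fast: compare input columns with the fields the model expects.

    Both arguments map column name -> type name. Raises ValueError before any
    row is scored, analogous to failing before the Hadoop job starts.
    """
    problems = []
    for name, expected in model_fields.items():
        actual = input_schema.get(name)
        if actual is None:
            problems.append(f"missing column: {name}")
        elif actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    if problems:
        raise ValueError("; ".join(problems))


def score_rows(rows, model):
    """Score each row; on failure yield (None, error message) instead of raising."""
    for row in rows:
        try:
            yield model(row), None
        except Exception as exc:  # deliberately broad: one bad row must not stop the run
            yield None, f"{type(exc).__name__}: {exc}"


def toy_model(row):
    """Stand-in for a real model: a single hand-picked threshold."""
    return "setosa" if float(row["petal_length"]) < 2.5 else "other"


if __name__ == "__main__":
    # Schema check happens once, up front.
    check_schema(
        {"petal_length": "double", "petal_width": "double"},
        {"petal_length": "double"},
    )
    rows = [{"petal_length": "1.4"}, {"petal_length": "oops"}, {"petal_length": "5.1"}]
    for prediction, error in score_rows(rows, toy_model):
        print(prediction, error)
```

Downstream, rows with a non-empty error field can be counted or filtered for alerting while the remaining rows keep their scores.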
Obviously, more can be done. We welcome ideas on how to make the code better. Feel free to make a pull request!
We’re excited to introduce Surus and, over the coming months, to share the UDFs we find helpful when analyzing data at Netflix. ScorePMML was a big win for Netflix as we sought to streamline our processing and minimize the time to production for our models. We hope that with this function (and others soon to be released) you’ll be able to spend more time making cool stuff and less time struggling with the mundane.
Known issues:
- ScorePMML is built on jPMML 1.0.19, which doesn’t fully support the 4.2 PMML specification (as defined by the Data Mining Group). At the time of this writing not all enumerated missing value strategies are supported. This caused problems when we wanted to implement GBMs in PMML, so we had to add extra nodes in each tree to properly handle missing values.
- Hive 0.12.0 (and thus Pig) has strict naming conventions for columns and relations, which are relaxed in PMML. Column names containing characters other than letters, digits, and underscores are not supported in ScorePMML (this is why the R example above replaces periods with underscores). Please see the Hive documentation for more details on column naming in the Hive metastore.
Additional resources:
- The Data Mining Group PMML Spec: The 4.1.2 specification is currently supported. The 4.2 version of the PMML spec is not currently supported. The DMG page will give you a sense of which model types are supported and how they are described in PMML.
- RPMML: An R package for creating PMML files from common predictive modeling objects.