An open source project now under Apache’s guidance uses a template system for easy training and deployment of Spark-powered machine learning models
The Apache Software Foundation has added a new machine learning project to its roster: Apache PredictionIO, an open source version of a product originally developed by a startup that Salesforce later acquired.
What PredictionIO does for machine learning and Spark
Apache PredictionIO is built atop Spark and Hadoop, and serves Spark-powered predictions from data using customizable templates for common tasks. Apps send data to PredictionIO’s event server to train a model, then query the engine for predictions based on the model.
Spark, MLlib, HBase, Spray, and Elasticsearch all come bundled with PredictionIO, and the project offers supported SDKs for working in Java, PHP, Python, and Ruby. Data can be stored in a variety of back ends: JDBC-compatible databases, Elasticsearch, HBase, HDFS, and the local file system are all supported out of the box. Back ends are pluggable, so a developer can create a custom back-end connector.
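As a rough sketch of that send-then-query flow with the Python SDK, the client side looks something like the code below. The access key, server URLs, event fields, and query fields are placeholder values, and the exact query format depends on which engine template is deployed:

```python
# Sketch of the PredictionIO client workflow using the Python SDK
# (pip install predictionio). Access key, URLs, and field names are placeholders;
# the query format depends on the deployed engine template.
import predictionio

# Send a training event to the event server (default port 7070).
event_client = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",
    url="http://localhost:7070")
event_client.create_event(
    event="rate",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="i42",
    properties={"rating": 4})

# After the engine is trained and deployed, query it (default port 8000).
engine_client = predictionio.EngineClient(url="http://localhost:8000")
print(engine_client.send_query({"user": "u1", "num": 5}))
```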
How PredictionIO templates make it easier to serve predictions from Spark
PredictionIO’s most notable advantage is its template system for creating machine learning engines. Templates reduce the heavy lifting needed to set up the system to serve specific kinds of predictions. They also describe any third-party dependencies that might be needed for the job, such as the Apache Mahout machine learning framework.
Some existing templates include:
- A universal recommendation engine.
- Text classification.
- Survival analysis (for time-between-failure predictions).
- Labeling topics using Wikipedia as a knowledge base.
- Similarity analysis.
Some templates also integrate with other machine learning products. For example, two of the prediction templates currently in PredictionIO’s gallery, for churn rate detection and general recommendations, use H2O.ai’s Sparkling Water enhancements for Spark.
PredictionIO can also automatically evaluate a prediction engine to determine the best hyperparameters to use with it. The developer has to choose and set metrics for the evaluation, but that is generally less work than tuning hyperparameters by hand.
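PredictionIO’s evaluations are defined inside an engine’s own code and run with its tooling, so the snippet below is only a conceptual Python sketch of the idea rather than the project’s API: score each candidate hyperparameter value with the metric the developer chose, and keep the best one.

```python
# Conceptual sketch only -- this is not PredictionIO's evaluation API. It just
# illustrates metric-driven hyperparameter selection: evaluate each candidate
# against validation data and keep the one with the best metric score.

def accuracy(threshold, labeled_points):
    """Metric: fraction of (value, label) pairs a simple threshold rule gets right."""
    hits = sum(1 for value, label in labeled_points if (value >= threshold) == label)
    return hits / len(labeled_points)

def pick_best_threshold(candidates, validation_data):
    """Evaluate every candidate hyperparameter and return the best-scoring one."""
    return max(candidates, key=lambda t: accuracy(t, validation_data))

validation = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
print(pick_best_threshold([0.1, 0.3, 0.5, 0.7], validation))  # prints 0.5
```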
When running as a service, PredictionIO can answer prediction queries singly or in batches. Batched predictions are automatically parallelized across a Spark cluster, as long as the algorithms used in a batch prediction job are all serializable. (PredictionIO’s default algorithms are.)
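The serializability requirement exists because Spark has to ship a copy of the model to each executor. The following is not PredictionIO code, just a minimal PySpark illustration (assuming pyspark is installed) of why a model object must be serializable to take part in a parallelized batch job:

```python
# Minimal PySpark illustration (not PredictionIO code): batch predictions are
# parallelized by shipping a serialized copy of the model to each executor,
# which is why the model/algorithm object must be serializable.
from pyspark import SparkContext

class ThresholdModel:
    """A trivially serializable model: plain data, no sockets or file handles."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return 1 if x >= self.threshold else 0

if __name__ == "__main__":
    sc = SparkContext(appName="batch-predict-sketch")
    model = ThresholdModel(threshold=0.5)
    queries = sc.parallelize([0.1, 0.4, 0.7, 0.9])
    # Spark pickles `model` and sends it to the executors; a model holding
    # non-serializable state would make this step fail.
    predictions = queries.map(lambda x: model.predict(x)).collect()
    print(predictions)  # [0, 0, 1, 1]
    sc.stop()
```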
Where to download PredictionIO
PredictionIO’s source code is available on GitHub; for convenience, various Docker images and a Heroku buildpack are also offered.