With version 2.2 of Apache Spark, a long-awaited feature for the multipurpose in-memory data processing framework is now available for production use.
Structured Streaming, as that feature is called, allows Spark to process streams of data in ways that are native to Spark’s batch-based data-handling metaphors. It’s part of Spark’s long-term push to become, if not all things to all people in data science, then at least the best thing for most of them.
Structured Streaming in 2.2 benefits from a number of other changes aside from losing its experimental designation. It can now read from and write to Apache Kafka, acting as either a source or a sink, with lower latency for Kafka connections than before.
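A minimal sketch of what that looks like in Scala, Spark's native language: a streaming query that consumes one Kafka topic and writes results to another. The broker address, topic names, and checkpoint path here are placeholders, not values from the Spark documentation.

```scala
import org.apache.spark.sql.SparkSession

object KafkaPassthrough {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaPassthrough").getOrCreate()

    // Kafka as a source: subscribe to a topic on a (hypothetical) broker.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
      .option("subscribe", "events")                    // placeholder topic
      .load()

    // Kafka delivers binary key/value columns; cast them to strings first.
    val lines = input.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS value")

    // Kafka as a sink: write the stream back out to another topic.
    val query = lines.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "events-out")                    // placeholder topic
      .option("checkpointLocation", "/tmp/checkpoints") // required for fault tolerance
      .start()

    query.awaitTermination()
  }
}
```

The same query could previously have been wired up only through the older DStream-based Kafka connectors; the point of the Structured Streaming integration is that the stream is just a DataFrame, so the usual Spark SQL operations apply to it unchanged.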
Kafka, itself an Apache Software Foundation project, is a distributed messaging bus widely used in streaming applications. Kafka has typically been paired with another stream-processing framework, Apache Storm, but Storm is limited to stream processing only, and Spark presents less complex APIs to the developer.
Structured Streaming jobs can now use Spark’s triggering mechanism to run a streaming job once and quit. Databricks, the chief commercial outfit supporting Spark development, claims this is a more efficient execution model than running Spark batch jobs intermittently.
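In code, the run-once behavior is a one-line change to an existing streaming query: attach a `Trigger.Once()` trigger, and the query processes whatever data has accumulated since the last checkpoint, then stops. The output path and checkpoint location below are illustrative placeholders, and `lines` stands in for any streaming DataFrame.

```scala
import org.apache.spark.sql.streaming.Trigger

// Process all data available since the last run, then shut the query down.
// Spark uses the checkpoint to pick up where the previous run left off,
// so scheduling this job periodically behaves like an incremental batch job.
val query = lines.writeStream
  .format("parquet")
  .option("path", "/data/out")                      // placeholder output path
  .option("checkpointLocation", "/tmp/checkpoints") // placeholder checkpoint dir
  .trigger(Trigger.Once())
  .start()

query.awaitTermination() // returns once the single batch has completed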
The native collection of machine learning libraries in Spark, MLlib, has been outfitted with new algorithms for tasks like performing PageRank on datasets, or running multiclass logistic regression analysis (e.g., predicting which current hit movie a person in a given demographic category is most likely to enjoy). Machine learning is a common use case for Spark.
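A brief sketch of multiclass logistic regression with the MLlib pipeline API: setting the model family to "multinomial" requests a multiclass model rather than a binary one. The data path is a placeholder; the snippet assumes a labeled dataset in LIBSVM format with more than two classes.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

object MulticlassLR {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MulticlassLR").getOrCreate()

    // Placeholder path: any LIBSVM-format file with integer class labels works.
    val data = spark.read.format("libsvm").load("data/multiclass_sample.txt")

    // "multinomial" selects multiclass (softmax) logistic regression.
    val lr = new LogisticRegression()
      .setFamily("multinomial")
      .setMaxIter(50)
      .setRegParam(0.01)

    val model = lr.fit(data)

    // One row of coefficients per class, rather than a single weight vector.
    println(s"Coefficients:\n${model.coefficientMatrix}")
    println(s"Intercepts: ${model.interceptVector}")

    spark.stop()
  }
}
```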
Machine learning in Spark also gets a major boost from improved support for the R language. Earlier versions of Spark had wider support for Java and Python than for R, but Spark 2.2 adds R support for 10 distributed algorithms. Structured Streaming and the Catalog API (used for accessing query metadata in Spark SQL) can now also be used from R.