In the past couple of years, there’s been a clear trend toward coding with streams. Ever wondered why everything is turning into a stream? What are the advantages? What should you do if you want to use streams? Which libraries are out there, and which ones are good? In this article, we’ll look for answers to these questions.

Streams are a form of glue that binds components together with minimal coupling. As with the listener pattern, instead of components knowing about each other directly, they all just know about the connective tissue, in this case the stream, which means they can easily evolve separately; this is what creates the loose coupling. Unlike the listener pattern, though, there are just two operations: pushing data and reading data. This restriction, with data flowing in just one direction, allows for some very effective optimizations that do not require any locking.
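
To make the decoupling concrete, here is a minimal sketch in Scala in which a producer and a consumer share nothing but the stream itself. The Event type, the queue capacity, and the use of a blocking queue as the stream are all illustrative assumptions, not part of any particular library:

    import java.util.concurrent.{ArrayBlockingQueue, BlockingQueue}

    object StreamAsGlue {
      // The producer and consumer never reference each other;
      // both only know about the stream that connects them.
      final case class Event(payload: String)

      val stream: BlockingQueue[Event] = new ArrayBlockingQueue[Event](1024)

      def push(e: Event): Unit = stream.put(e)   // the only write operation: push data
      def read(): Event = stream.take()          // the only read operation: read data
    }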

Processing new records

Updates to local state driven by data read from a stream typically turn into appends, which do not require any locking. This is why updates from streams tend to be so fast and computationally cheap. Another property of streaming is that it works on only one chunk of data at a time, so the full dataset never has to fit on a single computer, or even a single cluster, which again enables streams to work with large, even infinite, datasets.
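
Here is a minimal sketch of this chunk-at-a-time, append-only style. The Record type, the chunk size, and the iterator standing in for a real stream are illustrative assumptions:

    import scala.collection.mutable.ArrayBuffer

    object AppendOnly {
      final case class Record(key: String, value: Double)

      def consume(stream: Iterator[Record]): Unit = {
        val log = ArrayBuffer.empty[Record]       // local state
        stream.grouped(1000).foreach { chunk =>   // only one chunk in memory at a time
          log ++= chunk                           // a pure append: no locks, no seeks
        }
      }
    }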

Updating existing records and error correction

If a record key already exists, the record is simply replaced by the new one, again without any locking: the last one wins. For many systems this is just as efficient as an append. This is often referred to as idempotent behavior: retransmitting the same record with the same key has no effect, which is great for error correction, since duplicates are easily introduced onto a stream over flaky communication links. When a link breaks, the stream can simply be replayed with some overlap, and the duplicates will naturally fall out on their own.
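
A minimal sketch of this last-one-wins, idempotent upsert might look as follows; the Record type and the map standing in for real local state are illustrative assumptions:

    import scala.collection.mutable

    object LastWriteWins {
      final case class Record(key: String, value: Double)

      private val state = mutable.Map.empty[String, Record]

      // The last one wins; applying the same record twice leaves state unchanged.
      def upsert(r: Record): Unit = state(r.key) = r

      // After a broken link, replay the stream with some overlap:
      // the duplicated records are absorbed by the idempotent upsert.
      def replay(records: Iterator[Record]): Unit = records.foreach(upsert)
    }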

What applications benefit the most from using streams?

So what would be a good example of a streaming application? To answer this question, let’s look at why streams have become so popular. Data grows at a phenomenal rate, which is no surprise given that there are now 3.8 billion people on the Net. That’s a lot! They use their apps, purchase online services, and so on, leaving behind a trail of rich data. This massive growth in data generation calls for faster and better ways of processing it. Big Data applications, which generate effectively infinite data streams, are perhaps the greatest beneficiaries of streaming technologies, because they are the most desperate for efficient processing.


Example streaming problem: Credit card fraud detection

It’s not uncommon for a decent-sized bank to receive tens of millions of credit card transactions a day, and each of those transactions has to be analyzed for suspicious activity. The work breaks down easily into sequential processing of chunks of data, because the transactions already have this natural separation. We can add as many processors as we want simply by hooking more computers up to the stream. This is a perfect example of a streaming application.
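
A sketch of the per-chunk work might look like the following; the Transaction shape and the threshold rule are made-up placeholders, not a real fraud model:

    object FraudScan {
      final case class Transaction(cardId: String, amountUsd: BigDecimal)

      private def suspicious(t: Transaction): Boolean =
        t.amountUsd > BigDecimal(10000)   // a placeholder rule, not a real model

      // Each processor scores its own chunk independently, so adding capacity
      // is just a matter of hooking more machines up to the stream.
      def scoreChunk(chunk: Seq[Transaction]): Seq[Transaction] =
        chunk.filter(suspicious)
    }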


3 hot streaming technologies on the market


Akka Streams

This is a low-level streaming technology that requires advanced programming knowledge. It is based on Akka Actors and allows very fine-grained control when building data streams.
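
As a taste of that fine-grained control, here is a minimal Akka Streams pipeline using the Scala DSL; the source data and the doubling stage are just illustrative:

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    object AkkaStreamsDemo extends App {
      // In Akka 2.6+ an implicit ActorSystem also acts as the stream materializer.
      implicit val system: ActorSystem = ActorSystem("demo")
      import system.dispatcher

      Source(1 to 10)                        // emit the integers 1 to 10
        .map(_ * 2)                          // one fine-grained processing stage
        .runWith(Sink.foreach(println))      // consume the stream
        .onComplete(_ => system.terminate())
    }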

Spark Streaming

This one is based on Spark, a batch-based transformation technology. The greatest benefit of using Spark is that it can easily be spread over a cluster, so that many computers can act together to hold the data and run complex transformations on it, without requiring deep programming knowledge: one only needs to understand SQL to create transformations with Spark.
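
For example, here is a minimal, self-contained sketch of that SQL-first style; the transactions table and its columns are made up for illustration:

    import org.apache.spark.sql.SparkSession

    object SparkSqlDemo extends App {
      val spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()
      import spark.implicits._

      // Toy in-memory rows standing in for a real table.
      Seq(("card-1", 12.50), ("card-1", 99.99), ("card-2", 5.00))
        .toDF("card_id", "amount_usd")
        .createOrReplaceTempView("transactions")

      // The whole transformation is plain SQL.
      spark.sql(
        "SELECT card_id, SUM(amount_usd) AS total FROM transactions GROUP BY card_id"
      ).show()

      spark.stop()
    }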

Spark Streaming essentially reads rows from a streamed source, most often Kafka, and processes them as micro-batches. It is not a native streaming technology: it breaks the stream into these micro-batches and then simply applies the batch Spark API to them.
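
The newer Structured Streaming API reflects this design. Below is a minimal sketch that reads a Kafka topic and echoes each micro-batch to the console; it assumes Spark’s Kafka connector is on the classpath, and the broker address and topic name are placeholders:

    import org.apache.spark.sql.SparkSession

    object KafkaMicroBatches extends App {
      val spark = SparkSession.builder.appName("stream-demo").master("local[*]").getOrCreate()

      // The broker address and topic name are assumptions for this sketch.
      val lines = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "transactions")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")

      // Each micro-batch is handled with the ordinary batch DataFrame API.
      lines.writeStream
        .format("console")
        .start()
        .awaitTermination()
    }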

Despite the fact that streaming was an afterthought for Spark, this is a very popular streaming platform, because it can draw on all the complex transformations and aggregations that have been developed for Spark over many years. That includes the advanced data science library Spark ML, which is the tool of choice for many analytics projects. For instance, you can build regression models from Kafka streams, using Spark packages for the processing and a Jupyter Notebook as your front end, with very minimal programming knowledge.

Flink

Then there is the technology that aims to put the Spark folks out of business. Flink was built from the ground up to be stream-based: it does not turn its data streams into micro-batches.
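
A minimal sketch with Flink’s Scala API shows the record-at-a-time model; the sample events and the uppercase transformation are illustrative:

    import org.apache.flink.streaming.api.scala._

    object FlinkDemo extends App {
      val env = StreamExecutionEnvironment.getExecutionEnvironment

      // Records flow through one at a time; there are no micro-batches.
      env.fromElements("visa 12.50", "amex 99.99", "visa 5.00")
        .map(_.toUpperCase)
        .print()

      env.execute("flink-demo")
    }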

While this is undoubtedly a more modern streaming technology, it does not have the rich ecosystem of Spark transformations, which limits its popularity.