Had it with Apache Storm? Heron swoops to the rescue

Datetime: 2016-08-23 02:36:16          Topic: Heron

Last year, Twitter dropped two bombshells. First, it would no longer use Apache Storm in production. Second, it had replaced Storm with a homegrown data processing system, Heron.

Although Twitter released a paper detailing the architecture of Heron, its alternative to Storm remained hidden in Twitter's data centers. That all changed last week when Twitter released Heron under an open source license. So what is Heron, and where does it fit in the world of data processing at scale?

A directed acyclic graph (DAG) data processing engine, Heron is another entry in a very crowded field right now. But Heron is not a "look, me too!" solution or an attempt to turn DAG engines into big data's equivalent of FizzBuzz.

Heron grew out of real concerns Twitter was having with its large deployment of Storm topologies. These included difficulties with profiling and reasoning about Storm workers when scaled at the data level and at the topology level, the static nature of resource allocation in comparison to a system that runs on Mesos or YARN, the lack of back-pressure support, and more.
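To make the back-pressure concern concrete, here is a minimal sketch in plain Java. It is not how Storm or Heron implement back pressure; every name in it (BackPressureSketch, run, the capacity and sleep values) is hypothetical and chosen only for illustration. It assumes a fast producer standing in for a spout and a slow consumer standing in for a bolt, joined by a bounded queue. When the queue fills, the producer blocks instead of piling up tuples without limit, which is the basic idea behind back pressure.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: none of these names come from Heron or Storm.
final class BackPressureSketch {

    /**
     * Sends `items` tuples from a fast producer to a slow consumer through a
     * bounded queue. Returns {tuplesConsumed, maxObservedQueueDepth}.
     */
    static int[] run(int capacity, int items) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger consumed = new AtomicInteger(0);
        AtomicInteger maxDepth = new AtomicInteger(0);

        // Slow consumer, standing in for a bolt that cannot keep up.
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    queue.take();
                    consumed.incrementAndGet();
                    Thread.sleep(1); // simulated processing cost (assumed value)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Fast producer, standing in for a spout.
        for (int i = 0; i < items; i++) {
            queue.put(i); // blocks while the queue is full: the back pressure
            maxDepth.accumulateAndGet(queue.size(), Math::max);
        }

        consumer.join();
        return new int[] {consumed.get(), maxDepth.get()};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = run(4, 50);
        System.out.println("consumed=" + result[0] + " maxDepth=" + result[1]);
    }
}
```

Without the bound, the producer would keep enqueueing and memory use would grow with the speed mismatch. With it, the observed queue depth never exceeds the configured capacity, and the producer simply slows to the consumer's pace.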

Although Twitter could have adopted Apache Spark or Apache Flink, that would have involved rewriting all of Twitter's existing code. (Don't forget, Twitter has used Storm longer than anybody else, acquiring BackType, Storm's creator, back in 2011 before Storm was open sourced.) Instead, Twitter took a different approach: a new stream processing framework with a Storm-compatible API.

At this point in our walk through a new framework, I'd normally go through some examples to show you what coding in the framework feels like, but there's little point with Heron -- you write spouts and bolts in exactly the same manner as you would with Storm. All you need to do to run your Storm code on Heron is to add this section to your pom.xml's dependencies:

<dependency>
  <groupId>com.twitter.heron</groupId>
  <artifactId>heron-api</artifactId>
  <version>SNAPSHOT</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>com.twitter.heron</groupId>
  <artifactId>heron-storm</artifactId>
  <version>SNAPSHOT</version>
  <scope>compile</scope>
</dependency>

Then you remove your storm-core and clojure-plugin dependencies. Recompile, and your code will run on Heron with no further changes necessary. Simple! (Mostly, anyhow, but we'll come back to that.)

Operationally, Heron's current implementation runs on top of Apache Mesos, using Apache Aurora, the Mesos scheduling framework developed by Twitter (surprise!). Since switching all its Storm topologies over to Heron, Twitter managed to reduce hardware resources dedicated to the topologies by a factor of three while increasing throughput and reducing latency in processing -- not bad.

Perhaps one of the most interesting aspects of Heron is that while code for it will be written in Java (or Scala), and the web-based UI components are written in Python, the critical parts of the framework (the code that manages the topologies and network communications) are not written in a JVM language at all.

Indeed, at the heart of Heron, you’ll find code in a language you might not expect: C++. I think this is an aspect of the big data world that we'll see more of in the years to come.

The Apache Storm maintainers have removed many elements of its original Clojure code in favor of Java reimplementations, and the Apache Spark project currently generates Java code on-the-fly to speed up its DataFrame processing. But both are still tied to the JVM -- and the JVM has problems at scale. Don't get me wrong, the JVM is an amazing creation that has stood the test of time for 20 years, but when running on machines with huge amounts of RAM and processing tremendous amounts of data, problems with garbage collection emerge, no matter what fancy collector scheme you use.

At which point, moving back to a language like C++ starts to look appealing. As an example, Scylla, a C++ reimplementation of Apache Cassandra, has 10 times the throughput of Cassandra with none of the GC pauses that Cassandra is notorious for in large deployments. I'm fairly confident we'll see Heron's approach spread to other frameworks soon. This may be helped by Project Panama's attempt to improve the interface between Java and other languages.

Given that Heron requires fewer resources and provides more throughput and less latency than Apache Storm, you should move all your topologies over to Heron right now, yes? Well, maybe. Heron is currently tied to Mesos, so if you don't have existing Mesos infrastructure, you'll need to set that up as well, which is no small undertaking. Also, if you're making use of Storm's DRPC features, they're deprecated in Heron.

On the plus side, Heron has been running all of Twitter's processing needs in production for more than a year, so it should be able to handle anything you can throw at it. Plus, Twitter points out that Heron is used at Microsoft and other Fortune 500 companies, so you can be relatively confident it's going to stick around.

On the other hand, Storm hasn't been standing still. The Apache Storm team might quibble with Twitter's description of Heron as the "next generation of Apache Storm." While Twitter was working on Heron, Apache Storm reached 1.0 -- which includes support for back pressure, improved debugging and profiling options, a 60 percent decrease in latency, and up to a 16-fold speed improvement.

In addition, Storm 1.0 adds Pacemaker, a daemon for offloading heartbeat traffic from ZooKeeper, freeing larger topologies from the infamous ZooKeeper bottleneck. Heron's speed improvements are measured from the Storm 0.8.x code it diverged from, not the current version. If you have already migrated to Storm 1.0, you might not see much improvement over your current Storm topologies, and you may run into incompatibilities between the Storm and Heron implementations of new features like back-pressure support.

All in all, I don't believe that Heron is likely to make much of a dent in the uptake of data processing frameworks such as Apache Spark, Apache Flink, or Apache Beam. Their higher-level abstractions and APIs provide a much more developer-friendly experience than the lower-level Storm/Trident APIs. However, I believe the blend of JVM code with non-JVM modules for the critical paths is going to be a more popular approach going forward, and in this respect, Heron shows us all the direction we'll be traveling in the months and years to come.




