Pig is Flying: Apache Pig on Apache Spark by Mayur Rustagi.
From the post:
Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.
We focus our efforts on three simple initiatives:
- Make data processing more powerful
- Make data processing simpler
- Make data processing 100x faster than before
As a data mashing platform, the first key initiative is to combine the power and simplicity of Apache Pig on Apache Spark, making existing ETL pipelines 100x faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark.
DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for asymmetrical joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement data operators used in applications like Pig and Apache Hive on Spark.
For the uninitiated, Spark is open source Big Data infrastructure that enables distributed fault-tolerant in-memory computation. As the kernel for the distributed computation, it empowers developers to write testable, readable, and powerful Big Data applications in a number of languages including Python, Java, and Scala.
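To make the operator mapping concrete, here is a minimal sketch in plain Python (no Spark installation required) of the shape a Pig GROUP BY takes when translated to Spark: key extraction followed by per-key combination, the same pattern as Spark's reduceByKey. The Pig relation and field names in the comments are hypothetical, purely for illustration.

```python
from collections import defaultdict

# Hypothetical Pig script this mirrors:
#   logs  = LOAD 'logs' AS (user, bytes:int);
#   grp   = GROUP logs BY user;
#   total = FOREACH grp GENERATE group, SUM(logs.bytes);

def group_and_sum(records):
    """Key extraction + per-key aggregation: the shape Pig's
    GROUP/FOREACH pipeline takes when compiled down to a
    reduceByKey-style operator on Spark."""
    totals = defaultdict(int)
    for user, nbytes in records:   # "map" side: pull out the grouping key
        totals[user] += nbytes     # "reduce" side: combine values per key
    return dict(totals)

logs = [("ann", 120), ("bob", 300), ("ann", 80)]
print(group_and_sum(logs))  # {'ann': 200, 'bob': 300}
```

On a real cluster the per-key combination happens across partitions after a shuffle, but the data flow each operator implements is the same.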
An introduction to Spork (Pig-on-Spark) and how to get started using it.
I know, more proof that Phil Karlton was correct in saying:
There are only two hard things in Computer Science: cache invalidation and naming things.