Announcing Apache Pig 0.14.0

Datetime: 2016-08-23 02:38:15          Topic: Apache Pig

With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. Apache Tez supports YARN-based, high-performance batch and interactive data processing applications in Hadoop that need to handle datasets scaling to terabytes or petabytes.

The Apache community just released Apache Pig 0.14.0, and the main feature is Pig on Tez. In this release, we closed 334 Jira tickets from 35 Pig contributors. Specific credit goes to the virtual team consisting of Cheolsoo Park, Rohini Palaniswamy, Olga Natkovich, Mark Wagner and Alex Bain, who were instrumental in getting Pig on Tez working!

This blog gives a brief overview of Pig on Tez and other new features included in the release.

Pig on Tez

Apache Tez is an alternative execution engine focusing on performance. It offers a more flexible interface, so Pig can compile into a better execution plan than is possible with MapReduce. The result is consistent performance improvements in both large and small queries.

To run a Pig script in Tez mode, simply add "-x tez" to the Pig command line:

pig -x tez script.pig

If you see the error message "tez.lib.uris is not defined", you need to add the Tez conf directory (the one containing tez-site.xml) to the PIG_CLASSPATH environment variable. The only required entry in tez-site.xml is "tez.lib.uris". Here is a sample tez-site.xml:

<property>
    <name>tez.lib.uris</name>
    <value>${fs.default.name}/apps/tez/tez-0.5.2.tar.gz</value>
</property>
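For example, assuming your Tez configuration lives in /etc/tez/conf (a hypothetical path; adjust it to wherever your installation keeps tez-site.xml), you could set PIG_CLASSPATH like this:

```shell
# /etc/tez/conf is an assumed location; point this at the
# directory that actually contains your tez-site.xml.
export PIG_CLASSPATH=/etc/tez/conf:$PIG_CLASSPATH
```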

There is also a Tez local mode, which runs a Pig script locally using the Tez execution engine. Tez local mode is still in the experimental stage. To enable it, add "-x tez_local" to the Pig command line:

pig -x tez_local script.pig

If you run explain in Tez mode, you will see the Tez plan instead of the MapReduce plan.
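As a quick sketch (the relation and field names here are illustrative, not from the release notes):

```
-- With "pig -x tez", explain prints a Tez DAG of vertices and
-- edges rather than a chain of MapReduce jobs.
A = LOAD 'student' AS (name:chararray, age:int, gpa:double);
B = GROUP A BY name;
EXPLAIN B;
```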

There are two known limitations in Tez mode:

  1. Illustrate is not yet supported.
  2. The Tez UI is not ready yet, but you can see AM/container logs in the NodeManager web UI.

OrcStorage

ORC is a binary data format used in Hive since Hive 0.11. OrcStorage provides a way to read and write ORC files directly in Pig.

Here is an example on the loader side:

A = load 'student.orc' using OrcStorage();

Here is an example on the storer side:

store A into 'student.orc' using OrcStorage(['options']);

You can specify a number of options on the storer side controlling how your ORC file is written, such as the stripe size or whether or not to use compression. Check the Apache Pig documentation for details.
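For instance, a sketch using the option flags documented for Pig 0.14's OrcStorage (verify the exact flags and values against the Pig documentation for your version):

```
-- '-c' picks the compression codec, '-s' the stripe size in bytes;
-- here: Snappy compression with 64 MB stripes.
store A into 'student.orc' using OrcStorage('-c SNAPPY -s 67108864');
```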

Predicate pushdown

In OrcStorage, we also implemented a new interface: LoadPredicatePushdown. With predicate pushdown, Pig can use the statistics stored at the ORC file/stripe/row-group level to eliminate some blocks entirely. For blocks that do contain data satisfying the predicate, Pig applies the filter again to remove the individual records that should be filtered out.

Here is one example:

A = LOAD 'student.orc' USING OrcStorage();
B = filter A by age > 25 and gpa < 3;
dump B;

Pig will figure out that the filter condition "age > 25 and gpa < 3" is eligible for pushdown and push it to OrcStorage.

If your LoadFunc can be optimized with predicate pushdown, implement the LoadPredicatePushdown interface.

Automatic UDF-dependent jars

Some LoadFunc/StoreFunc/EvalFunc implementations depend on external jars at runtime. Previously, Pig users needed to register those jars manually in the Pig script. With Pig 0.14, a UDF can declare its runtime dependencies itself, so users no longer need to register them manually.

Here is the implementation in AvroStorage:

class AvroStorage {
    ….
    public List<String> getShipFiles() {
        Class[] classList = new Class[] {
            org.apache.avro.Schema.class,
            org.apache.avro.mapred.AvroInputFormat.class };
        return FuncUtils.getShipFiles(classList);
    }
}

Jar refactor

In previous Pig releases, we shipped an uber jar called pig-withouthadoop.jar. However, we never published pig-withouthadoop.jar to Maven, and that may have created some trouble for downstream projects. In general, an uber jar is not good practice.

So we decided to remove pig-withouthadoop.jar and pig-withouthadoop-h2.jar in this release, and instead we ship pig-core.jar and pig-core-h2.jar, along with the dependent jars in the lib directory. There are also lib/h1 and lib/h2 directories, which contain jars applicable only to Hadoop 1 or Hadoop 2. The pig script figures out which version of Hadoop you are using and weaves the right jars into the CLASSPATH.

Download Apache Pig and Learn More
