Introducing the CUBE operator for Apache Pig

Datetime: 2016-08-23 02:40:07          Topic: Apache Pig, Database

Guest post by Prasanth Jayachandran, who has been working on implementing CUBE support for Pig as part of the large-scale distributed cubing effort.

The next version of Apache Pig will support the CUBE operator (patch available here). The CUBE operator computes grouped aggregations over all possible combinations of the input dimensions, for a given input measure.

This patch adds syntactic sugar on top of the existing built-in CubeDimensions UDF. With this addition, aggregations across multiple dimensions can be expressed easily using the CUBE operator. The following example illustrates CUBE operator usage in Apache Pig:

Basic Usage:
inp = LOAD '/pig/data/salesdata' USING PigStorage() AS (product:chararray, location:chararray, year:int, sales:long);
cubed_inp = CUBE inp BY (product,location,year);
out = FOREACH cubed_inp GENERATE FLATTEN(group) AS (product, location, year), COUNT(cube) AS total, SUM(cube.sales) AS sales;

Sample output:
For a sample input tuple (ipod, miami, 2012, 200000), the above query generates every combination of the tuple's dimensions, with NULL standing in for each dimension that is aggregated away:
(ipod, miami, 2012, 1, 200000)
(ipod, NULL, NULL, 1, 200000)
(NULL, miami, NULL, 1, 200000)
(NULL, NULL, 2012, 1, 200000)
(ipod, miami, NULL, 1, 200000)
(ipod, NULL, 2012, 1, 200000)
(NULL, miami, 2012, 1, 200000)
(NULL, NULL, NULL, 1, 200000)
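
The expansion above can be emulated outside Pig. The following is a minimal Python sketch, not Pig's implementation: the helper name `cube_expand` is hypothetical, and Python's `None` stands in for the NULL marker shown in the sample output. It emits one tuple per subset of the dimensions, so three dimensions give 2^3 = 8 combinations. The order of the tuples differs from the listing above and is not significant.

```python
from itertools import product

def cube_expand(dimensions):
    """Yield every combination of the dimensions with any subset set to None.

    Hypothetical helper for illustration only; None stands in for NULL.
    """
    # For each dimension, choose either its real value or None.
    choices = [(value, None) for value in dimensions]
    for combo in product(*choices):
        yield combo

row = ("ipod", "miami", 2012, 200000)
dims, sales = row[:3], row[3]

# Each expanded tuple carries a count of 1 and the original sales value.
expanded = [combo + (1, sales) for combo in cube_expand(dims)]
for t in expanded:
    print(t)
```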

Output Schema for CUBE operator:
grunt> describe cubed_inp;
cubed_inp: {group: (dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int), cube: {(dimensions::product: chararray, dimensions::location: chararray, dimensions::year: int,sales: long)}}

Note the second column of cubed_inp, the 'cube' field, which is a bag of all tuples that belong to 'group'. Also note that the measure attribute 'sales' from the LOAD statement is carried through the CUBE statement so that it can be referenced later when computing aggregates on the measure, such as SUM in this case.
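
To make the relationship between 'group' and 'cube' concrete, here is a rough Python emulation of what the CUBE statement followed by the FOREACH computes. It is a sketch under assumptions, not how Pig executes the query: the function name `cube_aggregate` and the second sample row are invented for illustration, `None` plays the role of NULL, and the input is assumed to contain no null dimension values.

```python
from collections import defaultdict
from itertools import product

def cube_aggregate(rows, num_dims):
    """Group rows by every dimension combination, then COUNT and SUM the measure."""
    # groups maps a 'group' key to its 'cube' bag (the original rows in that group).
    groups = defaultdict(list)
    for row in rows:
        dims = row[:num_dims]
        choices = [(value, None) for value in dims]
        for key in product(*choices):
            groups[key].append(row)
    # COUNT the bag and SUM the last field (sales) of each tuple in it.
    return {key: (len(bag), sum(r[-1] for r in bag)) for key, bag in groups.items()}

# Invented sample data: (product, location, year, sales).
rows = [
    ("ipod", "miami", 2012, 200000),
    ("ipod", "boston", 2012, 150000),
]
result = cube_aggregate(rows, num_dims=3)
print(result[("ipod", None, 2012)])    # both rows fall into this group
print(result[(None, "miami", None)])   # only the miami row does
```

Keeping the full rows in each bag mirrors the schema above: the measure travels with the dimensions, so any aggregate can be computed per group afterwards.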

Upcoming Enhancements:
The current implementation is equivalent to the naive implementation in MRCube. The following are the core features I plan to implement as part of the Google Summer of Code 2012 program:

  • Optimize the naive implementation
  • Support for hierarchical dimensions
  • Support for ROLLUP/GROUPING SETS operations, similar to those in SQL Server and Oracle
  • Distributed cubing for holistic measures

All of these features should be available by the end of this summer. Keep an eye on PIG-2167 (and all its sub-tasks) for more updates!
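
For readers unfamiliar with ROLLUP, the Python sketch below shows how it differs from CUBE in standard SQL: ROLLUP keeps only the hierarchical prefixes of the dimension list, so n dimensions yield n + 1 combinations instead of 2^n. The helper name `rollup_expand` is hypothetical, `None` stands in for NULL, and whether the planned Pig operator will match these SQL semantics exactly is an assumption.

```python
def rollup_expand(dimensions):
    """Yield hierarchical prefixes; trailing dimensions are replaced by None.

    Hypothetical helper modelling SQL-style ROLLUP, for illustration only.
    """
    n = len(dimensions)
    # Keep the first `keep` dimensions and null out the rest, from most to least detailed.
    for keep in range(n, -1, -1):
        yield tuple(dimensions[:keep]) + (None,) * (n - keep)

combos = list(rollup_expand(("ipod", "miami", 2012)))
for c in combos:
    print(c)
```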




