Big Data SQL Quick Start. Partition Pruning - Part 7.

Datetime: 2016-08-22 21:53:44          Topic: SQL, Hive

Partitioning is a very common technique in data warehousing and in all kinds of databases. I assume the reader knows what partitioning is, so I will not explain the theoretical part; if you want a refresher, you could consider the Oracle RDBMS example.

Instead, I will start directly with the practical part.

Hive partitioning.

Hive was originally created as an easy way to write MapReduce jobs over HDFS. HDFS is a file system with a Linux-like directory structure, so it's natural to treat a partition as just a sub-directory. I used the Intel BigBench dataset to create a partitioned Hive table. I took two tables - the big fact table store_sales and the small dimension table date_dim. They have the following relationship:

The fact table (store_sales) doesn't have an explicit time identifier; instead, it is related to the dimension (dictionary) table date_dim, which has columns for an explicit date definition (d_dom - day of month, d_moy - month of year, d_year - year). Now I'm going to create the partitioned store_sales table:

CREATE TABLE store_sales_part(
ss_sold_date_sk bigint,
...
ss_net_profit double)
partitioned by (
year INT,
month INT,
day INT)
stored as ORC;

The statement above creates a partitioned table with 3 virtual columns - year, month, day. Now I will insert data into this Hive table (I added a few parameters which are mandatory for dynamic partitioning):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=10000;

INSERT INTO TABLE store_sales_part PARTITION (year, month, day)
SELECT store_sales.*, dt.d_year, dt.d_moy, dt.d_dom
FROM store_sales, date_dim dt
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk;
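As a quick sanity check (my addition, not part of the original run), you can ask Hive to list the partitions that the dynamic insert just registered; each row corresponds to one year=/month=/day= sub-directory on HDFS:

hive> SHOW PARTITIONS store_sales_part;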

After this insert I want to check the file distribution on HDFS:

$ hadoop fs -du -h /user/hive/warehouse/orc.db/store_sales_part/*/*/ | tail -2
168.5 M  505.5 M  /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=9/day=8
168.7 M  506.0 M  /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=9/day=9

So the new table store_sales_part has three virtual columns that are not actually stored on disk (and don't occupy space), but can be used to avoid unnecessary IO. Those columns can also be queried from the Hive console:

hive> select ss_sold_date_sk, year, month, day from store_sales_part limit 2;
OK
36890   2001    1       1
36890   2001    1       1
Time taken: 6.783 seconds, Fetched: 2 row(s)
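If you want to double-check that year, month and day are partition columns rather than stored ones, DESCRIBE FORMATTED lists them in a separate "# Partition Information" section (a side check, not part of the original walkthrough):

hive> DESCRIBE FORMATTED store_sales_part;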

Great! Now let's turn to the Oracle RDBMS and create a table there that will be linked with this Hive table:

CREATE TABLE STORE_SALES_ORC_PART
(SS_SOLD_DATE_SK NUMBER(10,0),
....
SS_NET_PROFIT BINARY_DOUBLE,
YEAR NUMBER,
MONTH NUMBER,
DAY NUMBER)
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS
  ( com.oracle.bigdata.cluster=bds30
    com.oracle.bigdata.tablename=orc.store_sales_part) )
REJECT LIMIT UNLIMITED
PARALLEL;

Now in Oracle we have an external table whose year/month/day columns can be used to prune unnecessary partitions. Let's verify this!

Query the table without a partition predicate:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_PART;

and after the query has finished, check the statistics. Statistic 462 is "cell XT granule bytes requested for predicate offload"; the query below converts its value to gigabytes:

SQL> SELECT n.name, round(s.value / 1024 / 1024 / 1024)
FROM v$mystat s, v$statname n
WHERE s.statistic# = 462
AND s.statistic# = n.statistic#;
----------------------------------------------------------------
cell XT granule bytes requested for predicate offload         298
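A side note from me: statistic numbers such as 462 are not stable across Oracle versions, so a more portable variant of the same check filters by the statistic name instead:

SQL> SELECT n.name, round(s.value / 1024 / 1024 / 1024) gb
FROM v$mystat s, v$statname n
WHERE s.statistic# = n.statistic#
AND n.name = 'cell XT granule bytes requested for predicate offload';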

Then check the size of the directory on HDFS:

$ hadoop fs -du -h -s /user/hive/warehouse/orc.db/store_sales_part/
297.8 G  893.5 G  /user/hive/warehouse/orc.db/store_sales_part

It matches (297.8 G on HDFS against the 298 GB requested by the database), so everything works correctly. Now try to query one particular year:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_PART WHERE YEAR=2004;

Check the database statistics:

SQL> SELECT n.name, round(s.value / 1024 / 1024 / 1024)
FROM v$mystat s, v$statname n
WHERE s.statistic# = 462
AND s.statistic# = n.statistic#;
----------------------------------------------------------------
cell XT granule bytes requested for predicate offload          60

Compare it with the HDFS file size:

$ hadoop fs -du -h -s /user/hive/warehouse/orc.db/store_sales_part/year=2004
60.3 G  893.5 G  /user/hive/warehouse/orc.db/store_sales_part/year=2004
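To see the per-year breakdown in one shot (again my addition, not from the original post), drop the -s flag so that du reports each year= sub-directory separately:

$ hadoop fs -du -h /user/hive/warehouse/orc.db/store_sales_part/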

But the most interesting thing is that you can also filter on a sub-partition. For example, if I want statistics for all Decembers (across all years), I will read only the December partitions, like this:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_PART WHERE MONTH=12;

Check the database statistics:

SQL> SELECT n.name, round(s.value / 1024 / 1024 / 1024)
FROM v$mystat s, v$statname n
WHERE s.statistic# = 462
AND s.statistic# = n.statistic#;
----------------------------------------------------------------
cell XT granule bytes requested for predicate offload          23

And the great thing is that on HDFS we have exactly the same amount of data:

$ hadoop fs -du -h -s /user/hive/warehouse/orc.db/store_sales_part/*/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2001/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2002/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2003/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2004/month=12
2.1 G  6.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=12

Summing these up: 4 × 5.1 G + 2.1 G = 22.5 G, which rounds to the 23 GB reported by the database statistic. Simple, but a very powerful feature!




