Creating Hive tables on compressed files

Datetime: 2016-08-22 21:49:37          Topic: Hive, HDFS

Stuck with creating Hive tables on compressed files?

Well, the documentation on apache.org says that Hive natively supports compressed files: https://cwiki.apache.org/confluence/display/Hive/CompressedStorage

Let's try that out.

Store a snappy-compressed file on HDFS. …thinking, I do not have such a file… Wait! What?

Okay, so let's create one.

Upload a text file to HDFS. Create the folder /tmp/input if it does not exist – $ hdfs dfs -mkdir -p /tmp/input

$ hdfs dfs -put sample.txt /tmp/input

sample.txt content:

1,a
2,b
3,c
4,d
5,e

Use the following Pig script to compress the file with the Snappy compression codec.

grunt> SET mapred.output.compress 'true';
grunt> SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.SnappyCodec';
grunt> data = LOAD '/tmp/input' USING PigStorage();
grunt> STORE data INTO '/tmp/output_snappy' USING PigStorage();

The /tmp/output_snappy directory now contains a snappy-compressed file.

…thinking, how do I know? … Wait! What?

Eh, let's check the contents.

Run the command below to fetch and print the contents:

$ hdfs dfs -cat /tmp/output_snappy/part-m-00000* | head -1

Garbled? Try this:

$ hdfs dfs -text /tmp/output_snappy/part-m-00000* | head -1

Looks good? Okay!
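Why the difference? `hdfs dfs -cat` streams the raw compressed bytes, while `hdfs dfs -text` detects the codec from the file's suffix and decompresses before printing. The same effect can be reproduced locally with Python's zlib, the format behind Hadoop's DefaultCodec (snappy itself would need a third-party module, so treat this as an analogue, not the actual cluster codec):

```python
import zlib

# The plain rows Pig wrote, before compression.
lines = "1,a\n2,b\n3,c\n4,d\n5,e\n"

# What lands on disk: compressed bytes, unreadable as-is (what -cat shows).
compressed = zlib.compress(lines.encode())
print(compressed[:10])

# What -text effectively does: decompress first, then print.
print(zlib.decompress(compressed).decode())
```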

Let's build our Hive table on top of /tmp/output_snappy as we normally would:

create external table sample_snappy(id int, name string)
row format delimited 
fields terminated by ',' 
location '/tmp/output_snappy';
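Notice there is no compression clause in the DDL: Hive, like `hdfs dfs -text`, picks a decompressor at read time by matching the part file's suffix against its registered codecs. A rough Python analogue of that lookup (the suffix table here is illustrative, in the spirit of Hadoop's CompressionCodecFactory, not copied from it):

```python
# Hypothetical subset of codec registrations, keyed by default file suffix.
EXTENSION_TO_CODEC = {
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
}

def codec_for(path):
    """Return the codec class for a path, or None for plain text."""
    for ext, codec in EXTENSION_TO_CODEC.items():
        if path.endswith(ext):
            return codec
    return None

print(codec_for("/tmp/output_snappy/part-m-00000.snappy"))
```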

Tempted to test?

0: jdbc:hive2://localhost:10000> select * from sample_snappy;
+-------------------+---------------------+--+
| sample_snappy.id  | sample_snappy.name  |
+-------------------+---------------------+--+
| 1                 | a                   |
| 2                 | b                   |
| 3                 | c                   |
| 4                 | d                   |
| 5                 | e                   |
+-------------------+---------------------+--+
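Under the hood, the query above amounts to: open every file in the table's location, decompress it, then split each line on the field delimiter. A compact Python sketch of that read path (zlib stands in for the cluster codec, and the hand-rolled split stands in for Hive's delimited SerDe):

```python
import zlib

# Pretend this blob is one compressed part file under /tmp/output_snappy.
part_file = zlib.compress(b"1,a\n2,b\n3,c\n4,d\n5,e\n")

def read_delimited(blob, delimiter=","):
    """Decompress, then split each line on the field delimiter."""
    text = zlib.decompress(blob).decode()
    return [tuple(line.split(delimiter)) for line in text.splitlines()]

rows = read_delimited(part_file)
print(rows)  # [('1', 'a'), ('2', 'b'), ('3', 'c'), ('4', 'd'), ('5', 'e')]
```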

Enjoy :wink:

All-in-one commands and scripts

[ $ shell ]
echo -e '1,a\n2,b\n3,c\n4,d\n5,e' > sample.txt
hdfs dfs -mkdir -p /tmp/input
hdfs dfs -put sample.txt /tmp/input
[ grunt> ]
SET mapred.output.compress 'true';
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.SnappyCodec';
data = LOAD '/tmp/input' USING PigStorage();
STORE data INTO '/tmp/output_snappy' USING PigStorage();
[ $ shell ]
hdfs dfs -cat /tmp/output_snappy/part-m-00000* | head -1
hdfs dfs -text /tmp/output_snappy/part-m-00000* | head -1
[ beeline > ]
create external table sample_snappy(id int, name string)
row format delimited
fields terminated by ','
location '/tmp/output_snappy';
select * from sample_snappy;



