Apache Hadoop has always been associated with storing & processing vasts amount of data. But did you know it’s also an awesome engine to power interactive data exploration and visualization? With the development ofApache Hive LLAP(a recent innovation included in the Hortonworks Data Platform ), you can use Hadoop with Business Intelligence tools (like Tableau) to analyze and visualize large data sets. Plus, being able to do this in a cloud environment means you can get started quickly while working seamlessly and directly with data available in cloud storage systems (like Amazon S3).
In this blog, let’s take a look at how we can use Hortonworks Data Cloud and the Hive LLAP Analytics cluster configuration to model, interact and visualize data available in Amazon S3. To do so, you’ll need the following:
- An Amazon Web Services account.
- A Hortonworks Data Cloud deployment in that account.
- Tableau with the Hortonworks ODBC Driver for Apache Hive installed in your local environment (to be able to connect to Hive from Tableau).
First Things First: Hortonworks Data Cloud
For those that aren’t familiar: Hortonworks Data Cloud is a platform for analyzing and processing data in the cloud. “HDCloud” enables businesses to achieve insights more quickly and with greater flexibility than ever before with a greatly reduced operational effort. You can choose from a set of prescriptive cluster configurations to start modeling and analyzing your data sets in minutes. When you are done with your analysis, you can give the resources back to the cloud, reducing your costs.
To get started, you need to launch the HDCloud “cloud controller” into your AWS environment. You can follow the instructions here:
Once the controller is ready, login and create a cluster for “Analytics with Hive LLAP”. The general instructions can be found at the link below but you’ll want to be sure to use the following parameters and options when you create your cluster:
|CREATE CLUSTER PROPERTY||VALUE|
|HDP Version||HDP 2.6 (Cloud)|
|Cluster Configuration Type||EDW-Analytics: Apache Hive 2 LLAP, Apache Zeppelin 0.7.0|
|Master Instance Type||m4.xlarge|
|Worker Instance Type||m3.2xlarge|
|Worker Instance Count||3|
You Have a Cluster…Now What?
For our scenario, we have some “airports” data available in Amazon S3 to use. To use this data, we need to create the Hive schema for the data. To create the schema, we can use the Hive View that comes pre-installed in the cluster you just created. Open the Hive View from the Ambari dropdown menu on the cluster details page.
From a Worksheet in the Hive View , copy the content from this DDL and execute:
Let’s briefly highlight this particular part of the DDL:
STORED AS ORC LOCATION 's3a://hw-sampledata/airline_orc/airline_ontime.db/airports'
We are creating schema using LOCATION as the Amazon S3 data. This type of Hive external “managed table” allows the actual table data to live in S3 but the schema be defined in Hive. External tables are optimal when the data is already present in Amazon S3 and once populated, data can be accessed multiple times and it is not deleted even if the tables are accidentally deleted. Plus, we are storing the data in Apache ORC , a high-performant columnar storage format which is great for Hadoop workloads.
Next: even though the DDL creates the schema definition in Hive, we need populate partition-related information. In a new Hive View Worksheet, run the following (be sure to switch to the “hwxdemo” database that was created with the DDL). Now you have some data to play with.
MSCK REPAIR TABLE flights;
You Have Data…Again, Now What?
Ok…well…you have data. It’s time to point your Tableau to Hive LLAP and connect. We have some instructions here on how to do that:
Time to Model and Visualize
With Tableau connected to Hive LLAP, select the “hwxdemo” schema and start creating a data model. You can using the following 4 tables:
For example, you can create a data model using the following tables loaded in order with associations as described below:
|Load Order||Table||Inner Join Condition|
|2||airlines||Uniquecarrier = Code|
|3||planes||Tailnum = Tailnum(Planes)|
|4||airports||Origin = Iata, rename “airports” to “airports_origin”|
|5||airports||Dest = Iata(Airports), rename “airports” to “airports_dest”|
The final table associations should look like this:
Once you have the data model built, go to town and starting visualizing airline growth over the years, or airports with the most delays.
Where to go from here?
If you are ready to get started with Hortonworks Data Cloud for AWS , go here for a 5 day free trial from the AWS Marketplace listing. For more information, please refer to the following links.
|“Get Started with HDCloud” Webinar||http://hortonworks.com/webinar/hadoop-in-the-cloud-aws/|