Adrift in a Data Lake – an Oracle Developer’s first steps in Hadoop

Datetime: 2016-08-23 01:47:37    Topic: HDFS, Hive

We live in interesting times. As I write, the political life of a great nation is currently in thrall to a wealthy right-wing populist with a rather distinctive hairstyle. But enough about Boris Johnson.

For someone used to the way things are done in Oracle, Hadoop can be something of a culture shock.

My initial introduction to Hadoop and its “vibrant ecosystem” was some internally written documentation.

As with most technical documentation, it was written with the assumption of a certain level of prior knowledge on the part of the reader. For this particular reader, it proved to be an erroneous assumption.

After half an hour of struggling through this text, I was left wondering what you would use a Khafka Banana Oozie for.

Maybe it’s for killing zombie processes ? Let’s face it, with all that Java running on your system there are bound to be a few knocking around. I’m a little hazy on my Zombie Lore, so I’m not entirely clear as to why a rapid application of fresh fruit would pose an existential threat to your average zombie. Maybe they’re potassium intolerant ?

There is a bewildering array of tools associated with Hadoop, many of which appear to provide almost identical functionality. For example, a relational database traditionally requires only one SQL Engine…and I’ll be taking a look at two of them.

Moving from Oracle to Hadoop can feel rather like trading your car in for a box of Lego.

If the box does contain any instructions, they seem to have been translated from Java into English…by someone who doesn’t speak either.

Note to reader : please substitute your own language as appropriate.

Fortunately, there are Hadoop distributions available which bundle the core tools required to get up and running. We’ll gloss over the fact that these distributions – Cloudera and Hortonworks – don’t contain exactly the same tools.

In my search to find a coherent explanation of how to use Hadoop, I’ve found that the VM provided by Cloudera, together with the introductory tutorial, is a valuable aid to familiarising myself with the basics.

To start with then, I’ll cover getting hold of the Cloudera VM and setting it up in VirtualBox.

Then I’ll go through some of the tools available and what they do.

I’ll do this from the perspective of an Oracle developer (mainly because I don’t have much option) and will point out the driftwood of familiarity that you might be able to cling to in order to stay afloat in your Data Lake.

What I’ll cover is :

  • The core components of Hadoop
  • HDFS commands
  • Transferring data between a relational database and Hadoop using SQOOP
  • Querying structured data using Hive and Impala
  • Uploading data using Kite

Ready to Dive in ?

Cloudera VM setup for Virtualbox

Before we head off to get the VM, there are a couple of points worth mentioning.

First of all, the VM is running a 64-bit version of CentOS 6.4. Therefore you need to make sure that your host operating system ( i.e. the one that you’ll be running VirtualBox on) is also 64-bit.

Secondly, the minimum recommended memory allocation for the VM is 4GB.

Finally, I’m using the Cloudera Quickstart 5.5 VM. The behaviour of some of the tools covered here differs between 5.4 and 5.5.

The Cloudera VM can be downloaded from here .

You’ll need to fill in some details about yourself before moving on.

Once you’re at the downloads page, select VirtualBox from the Platform drop-down list. VMs for other platforms are also available.

Once you hit the “Download Now” button you’ll be asked for some further details. However, these do not appear to be validated.

The download is quite chunky, just over 5GB.

Once it’s completed you should have a zip file called :

cloudera-quickstart-vm-5.5.0-0-virtualbox.zip

You can use a standard unzip utility to extract this file ( e.g. Winzip on Windows, plain old unzip on Linux).

Once extracted, you’ll see a directory called cloudera-quickstart-vm-5.5.0-0-virtualbox which contains two files :

cloudera-quickstart-vm-5.5.0-0-virtualbox-disk1.vmdk
cloudera-quickstart-vm-5.5.0-0-virtualbox.ovf

The .ovf file is the VirtualBox appliance definition and the .vmdk is the VM’s hard-drive image.

To set up the VM in Virtualbox, you’ll need to import the .ovf file as an Appliance.

The steps to do this are the same as those for importing an Oracle Developer Day 12c Image .
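If you prefer the command line to the VirtualBox GUI, VBoxManage can do the import as well. This is just a sketch – the VM name, memory and CPU settings below are my own choices rather than anything mandated by Cloudera :

# import the appliance, overriding the name, RAM and CPU count
VBoxManage import cloudera-quickstart-vm-5.5.0-0-virtualbox.ovf \
    --vsys 0 --vmname "Cloudera Quickstart 5.5" --memory 8192 --cpus 2
# ...and start it with the normal GUI console
VBoxManage startvm "Cloudera Quickstart 5.5" --type gui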

When you first start the VM it does take a while to come up. When it does, you should see the VM’s desktop.

Tweaking the VM settings

If you don’t happen to live in Boston and/or you don’t have a US keyboard, you may want to make a couple of configuration changes to the VM.

To change the Location ( and consequently the Time Zone) :

  1. Click on the Date and Time on the Status Bar. It’s in the top right-hand corner of the screen.
  2. Hit the Edit button next to Locations
  3. Add your location by typing it in the Location Name dialog box (a drop-down list will appear as you start typing).
  4. Now highlight Boston and click the Remove button. Your new location should now show at the bottom of the Time window
  5. Move your mouse over your location and an Edit button should appear. Click this and the Date and Time should now reflect your new location

You will be prompted for a password to confirm this change. You need to enter cloudera

To change the keyboard layout :

  1. Go to the System Menu and select Preferences and then Keyboard
  2. Navigate to the Layouts tab and click Add
  3. Select your preferred layout from the drop-down list
  4. Once your new layout is shown, click the Default radio button next to it

These changes should take effect without the need to re-start the VM.

OK, so let’s get started….

The Core components of Hadoop

Hadoop itself comprises three main components :

  • HDFS – a distributed file system
  • The MapReduce framework
  • Yarn – a Job scheduler / resource manager for the parallel execution of MapReduce jobs

MapReduce

The MapReduce framework consists of a Mapper and a Reducer.

In SQL terms, the Mapper program applies a query predicate against the data you are querying – essentially, it does the filtering.

The Reducer then does any aggregation on the result set produced from the Mapper process.
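To make that a bit more concrete, here is a minimal sketch using Hadoop Streaming, which lets you supply any executable as the Mapper and Reducer. The input path is made up for illustration and the location of the streaming jar is an assumption (on the Cloudera VM it should be under /usr/lib/hadoop-mapreduce) :

# the mapper does the filtering (a WHERE clause, in effect)...
# ...and the single reducer does the aggregation (a COUNT(*), more or less)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/some_data.csv \
    -output /user/cloudera/engine_count \
    -mapper 'grep "query engine"' \
    -reducer 'wc -l' \
    -numReduceTasks 1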

Yarn

Yarn is the default job scheduler and resource manager. It facilitates parallel execution of MapReduce jobs.
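You rarely talk to Yarn directly, but it does come with a command-line client that can be handy for seeing what it is up to. The application id below is simply the one reported by the SQOOP job later in this post – substitute whatever id Yarn gives you for your own job :

# list the applications Yarn is currently managing
yarn application -list
# pull the logs for a particular application (assuming log aggregation is enabled)
yarn logs -applicationId application_1458734644938_0001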

HDFS

HDFS – the Hadoop Distributed File System – is, as the name suggests, a distributed file system. The idea is that data files are split up and replicated across multiple nodes (physical servers) in a cluster. Essentially, any program can run on any node and, by running it on a node that already holds a copy of the data, network latency is minimised.

It comes with its own set of commands which you can use interactively.

These appear to be largely a subset of those you’d find on a Linux OS.

The format of these commands is :

hadoop fs -command [some arguments]

However, the way that the VM is configured, we need to run these commands as the hdfs user. Therefore, in the examples that follow, the commands will take the format :

sudo -u hdfs hadoop fs -command [some arguments]
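If you want to see the full list of commands on offer, or the switches for a particular one, the built-in help is a reasonable place to start (no sudo needed here as we’re not touching any files) :

# list every fs sub-command with a short description
hadoop fs -help
# show the switches for a single command, e.g. ls
hadoop fs -usage ls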

To start with, let’s see what we’d need to do to upload a csv file to HDFS.

The file in question is called tools.csv, which I’ve created in the cloudera user’s home directory on the VM. It contains the following :

tools.csv

tool,description
hue,Web-based UI for Hadoop
sqoop,Transfer structured data between an RDBMS and Hadoop
flume,stream a file into Hadoop
impala,a query engine
hive,another query engine
spark,a query engine that is not hive or impala
khafka,a scheduler
banana,a web UI framework
oozie,another scheduler

To start with, let’s see what’s currently in hdfs :

sudo -u hdfs hadoop fs -ls /

Found 5 items
drwxrwxrwx   - hdfs  supergroup          0 2015-11-18 10:57 /benchmarks
drwxr-xr-x   - hbase supergroup          0 2016-03-23 12:08 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2016-03-22 12:23 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2015-11-18 11:01 /user
drwxr-xr-x   - hdfs  supergroup          0 2015-11-18 11:00 /var

This shows us the top-level directories. Let’s take a look at what’s in /user :

sudo -u hdfs hadoop fs -ls /user

Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2015-11-18 10:56 /user/cloudera
drwxr-xr-x   - mapred   hadoop              0 2015-11-18 10:57 /user/history
drwxrwxrwx   - hive     supergroup          0 2015-11-18 11:01 /user/hive
drwxrwxrwx   - hue      supergroup          0 2015-11-18 10:58 /user/hue
drwxrwxrwx   - jenkins  supergroup          0 2015-11-18 10:58 /user/jenkins
drwxrwxrwx   - oozie    supergroup          0 2015-11-18 10:59 /user/oozie
drwxrwxrwx   - root     supergroup          0 2015-11-18 10:58 /user/root
drwxr-xr-x   - hdfs     supergroup          0 2015-11-18 11:01 /user/spark

For the purposes of this test, I’ll create a directory under the /user/cloudera directory, and then check that it’s been created as expected :

sudo -u hdfs hadoop fs -mkdir /user/cloudera/test
sudo -u hdfs hadoop fs -ls /user/cloudera
Found 1 items
drwxr-xr-x   - hdfs cloudera          0 2016-03-23 14:25 /user/cloudera/test

Notice that only the directory owner has write permissions on the directory.

As I’m feeling reckless, I want to grant write permissions to everyone.

This can be done as follows :

sudo -u hdfs hadoop fs -chmod a+w /user/cloudera/test
sudo -u hdfs hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - hdfs cloudera          0 2016-03-23 14:25 /user/cloudera/test

In HDFS, the chmod command seems to accept the same arguments as its Linux counterpart.
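Incidentally, chown is also available should you prefer to hand the directory over to the cloudera user rather than opening the permissions up to everyone. I haven’t done that here – which is why the listings that follow still show hdfs as the owner – but it would look something like this :

# give ownership of the test directory to the cloudera user and group
sudo -u hdfs hadoop fs -chown cloudera:cloudera /user/cloudera/test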

To check that we can now see the directory :

sudo -u hdfs hadoop fs -ls /user/cloudera/test

The simplest way to load our csv is to use the put command :

sudo -u hdfs hadoop fs -put tools.csv /user/cloudera/test/put_tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 1 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 14:49 /user/cloudera/test/put_tools.csv

There is another way to do this :

sudo -u hdfs hadoop fs -copyFromLocal tools.csv /user/cloudera/test/tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 2 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 14:49 /user/cloudera/test/put_tools.csv
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 17:04 /user/cloudera/test/tools.csv

If we want to delete a file then :

sudo -u hdfs hadoop fs -rm /user/cloudera/test/put_tools.csv
16/03/23 17:06:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/cloudera/test/put_tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 1 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 17:04 /user/cloudera/test/tools.csv

You can display the contents of a file in hdfs by using the cat command :

sudo -u hdfs hadoop fs -cat /user/cloudera/test/tools.csv
tool,description
hue,Web-based UI for Hadoop
sqoop,Transfer structured data between an RDBMS and Hadoop
flume,stream a file into Hadoop
impala,a query engine
hive,another query engine
spark,a query engine that is not hive or impala
khafka,a scheduler
banana,a web UI framework
oozie,another scheduler

In order to demonstrate copying a file from hdfs to the local filesystem in the VM, we’ll need to create a directory that the hdfs user has access to :

mkdir test
chmod a+rw test
cd test
ls -ld
drwxrwxrwx 2 cloudera cloudera 4096 Mar 23 17:13 .

Now, as the hdfs user, we can retrieve our file from hdfs onto the local file system using copyToLocal :

sudo -u hdfs hadoop fs -copyToLocal /user/cloudera/test/tools.csv /home/cloudera/test/welcome_back.csv
ls -l welcome_back.csv
-rw-r--r-- 1 hdfs hdfs 301 Mar 23 17:18 welcome_back.csv

Another method of doing this is using get :

sudo -u hdfs hadoop fs -get /user/cloudera/test/tools.csv /home/cloudera/test/got_it.csv
ls -l got_it.csv
-rw-r--r-- 1 hdfs hdfs 301 Mar 23 17:21 got_it.csv

One final hdfs command that may come in useful is du, which shows the amount of space used by a directory or file on hdfs :

sudo -u hdfs hadoop fs -du /user/cloudera
301  301  /user/cloudera/test
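As with its Linux counterpart, du takes a -h switch for human-readable sizes, and df will report the overall capacity of the filesystem :

sudo -u hdfs hadoop fs -du -h /user/cloudera
sudo -u hdfs hadoop fs -df -h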

SQOOP

The VM comes with a MySQL database from which data is loaded into Hadoop via SQOOP. SQOOP is a tool for transferring structured data between an RDBMS and Hadoop.

The documentation does say that SQOOP is capable of importing data from Oracle using the command (run from the Local File System) :

sqoop import --connect jdbc:oracle:thin:@//db_name --table table_name

However, said documentation says that it’s been tested with Oracle 10.2 Express Edition, so you may want to have a play around with it before using it in anger.
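For what it’s worth, a slightly more complete single-table import might look something like the sketch below. The host, service name, schema and table are all made up for illustration – the only things I’d point out are that -P makes Sqoop prompt for the password, and that Oracle object names usually need to be supplied in upper case :

sqoop import \
    --connect jdbc:oracle:thin:@//oradb01:1521/ORCL \
    --username hr \
    -P \
    --table EMPLOYEES \
    --target-dir /user/cloudera/employees \
    -m 1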

The tutorial directs us to use SQOOP to ingest all of the data from the MySQL database by running the following command :

sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

There’s a fair amount going on here : we’re connecting to MySQL, then writing each table out to the /user/hive/warehouse directory on hdfs as compressed Parquet files, registering the tables in Hive along the way ( that’s the --hive-import flag).

The compression library being used is Snappy .
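Before kicking off an import of this size it can be worth checking that the connection details are right. Sqoop’s list-tables sub-command does this fairly cheaply :

sqoop list-tables \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username retail_dba \
    --password cloudera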

It’s instructive to see the output when we run this command as it shows both MapReduce and Yarn in action. You’ll probably see lines like :

...
16/03/23 17:36:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1458734644938_0001
16/03/23 17:36:08 INFO impl.YarnClientImpl: Submitted application application_1458734644938_0001
16/03/23 17:36:08 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1458734644938_0001/
16/03/23 17:36:08 INFO mapreduce.Job: Running job: job_1458734644938_0001
16/03/23 17:37:28 INFO mapreduce.Job: Job job_1458734644938_0001 running in uber mode : false
16/03/23 17:37:28 INFO mapreduce.Job:  map 0% reduce 0%
16/03/23 17:38:29 INFO mapreduce.Job:  map 100% reduce 0%
16/03/23 17:38:34 INFO mapreduce.Job: Job job_1458734644938_0001 completed successfully
...

After a fair amount of time, the command should end with :

...
16/03/23 17:53:53 INFO mapreduce.ImportJobBase: Transferred 46.1318 KB in 157.9222 seconds (299.1283 bytes/sec)
16/03/23 17:53:53 INFO mapreduce.ImportJobBase: Retrieved 1345 records.

If we now check, we can see that a directory has been created for each table :

hadoop fs -ls /user/hive/warehouse
Found 6 items
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:38 /user/hive/warehouse/categories
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:41 /user/hive/warehouse/customers
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:44 /user/hive/warehouse/departments
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:48 /user/hive/warehouse/order_items
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:51 /user/hive/warehouse/orders
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:53 /user/hive/warehouse/products

Digging further into the categories directory and its children, we find that the table structure has also been transferred across :

hadoop fs -cat /user/hive/warehouse/categories/.metadata/schemas/1.avsc
{
  "type" : "record",
  "name" : "categories",
  "doc" : "Sqoop import of categories",
  "fields" : [ {
    "name" : "category_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_id",
    "sqlType" : "4"
  }, {
    "name" : "category_department_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_department_id",
    "sqlType" : "4"
  }, {
    "name" : "category_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "category_name",
    "sqlType" : "12"
  } ],
  "tableName" : "categories"
}

The data is stored in a .parquet file :

hadoop fs -ls /user/hive/warehouse/categories
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-03-23 17:35 /user/hive/warehouse/categories/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-03-23 17:38 /user/hive/warehouse/categories/.signals
-rw-r--r--   1 cloudera supergroup       1956 2016-03-23 17:38 /user/hive/warehouse/categories/a3db2e78-2861-4906-a769-e8035c03d7d2.parquet

There are a number of file formats you can use with Hadoop, each with their own pros and cons.

You can see a discussion of some of these formats here .

NOTE – the first time I ran this, I shut down the VM after I got to this point. When I started it again, Impala (see below) refused to see any databases. I had to trash and re-create the VM to get it to work. I’m not sure why this happened ( I did have a hunt around) but just thought I’d mention it in case you’re thinking of doing the same.

Anyway, now we have data in Hadoop, it would be good if we could interrogate it…

Not Quite SQL – Hive and Impala

Cloudera comes with a Web-based UI for Hadoop in the form of Hue. Note that Hortonworks seem to be standardising on a different UI tool – Apache Ambari.

In the Cloudera VM, there is a link to Hue on the Bookmarks bar in Firefox. Click on this link and then connect as cloudera (password cloudera).

The favoured query engine for interactive queries in Cloudera is called Impala. I believe that, at the time of writing, Hortonworks are sticking with Hive.

As far as I can tell, it seems that Hive has been retained by Cloudera to handle what, in RDBMS terms, would be called the catalog. In Oracle terms this would be the Data Dictionary.

This is essentially the metadata for the tables in the database. This metadata seems to be read by all of the SQL Engines irrespective of where it is updated from.
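One way to have a poke at this shared metadata for yourself is via the hive command-line client on the VM – DESCRIBE FORMATTED gives you, roughly speaking, the sort of detail you’d pull from USER_TABLES and USER_TAB_COLUMNS in Oracle :

# dump the column list, storage format and location for one of the SQOOP-imported tables
hive -e 'describe formatted categories;'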

Once you’re connected via Hue you can select either of these tools from the Query Editors drop-down.

Initially, the tutorial directs you to Impala.

The Data Dictionary

The first thing to note about Impala is that it doesn’t bother re-reading the table metadata unless you tell it to. Therefore, if you make any DML or DDL changes, you will probably need to tell Impala to re-read the metadata before those changes show up in query results. For example, we know that we’ve got six tables in our database which we’ve created via SQOOP. However, if you ask Impala about it :

show tables;

…you get the not entirely helpful :

The operation has no results

By contrast, if you try this in Hive (Query Editors/Hive), the tables are all present and correct.

To persuade Impala to see these changes you need to run :

invalidate metadata;

Note that you can also run this command for individual tables should the need arise, e.g. :

invalidate metadata categories;
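Both statements can also be run from the command line via impala-shell, which the VM includes. As far as I can tell, REFRESH is the lighter-weight option when a table Impala already knows about has simply gained new data files, whereas INVALIDATE METADATA is needed when tables have been created, dropped or altered outside of Impala :

# re-read all metadata for one table
impala-shell -q 'invalidate metadata categories;'
# just pick up new data files for a table Impala already knows about
impala-shell -q 'refresh categories;'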

Anyway, now that we can see the tables in Impala, we can run some queries against them. Whilst we’re at it, we can do a simple comparison between Impala and Hive in terms of how they process the same query.

Comparative performance

The query in question (taken from the Getting Started Tutorial provided with the VM) is :

-- Most popular product categories
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;

The syntax looks reassuringly familiar.

In Impala this ran in around 35 seconds. I then ran the same query in Hive, which took about 7 minutes.

Once the query is run, Hue shows an Explain button. Click on this and you can see the execution plan for the query.
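If you’d rather not click around in Hue, both engines will also print a plan if you prefix the statement with EXPLAIN. A sketch using the command-line shells on the VM (the same prefix works for Hive via hive -e) :

impala-shell -q 'explain
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;'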

In Hive the plan looks like this :

STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-3 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-4

STAGE PLANS:
  Stage: Stage-9
    Map Reduce Local Work
      Alias -> Map Local Tables:
        c
          Fetch Operator
            limit: -1
        p
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        c
          TableScan
            alias: c
            Statistics: Num rows: 24 Data size: 2550 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: category_id is not null (type: boolean)
              Statistics: Num rows: 12 Data size: 1275 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col10 (type: int)
                  1 category_id (type: int)
        p
          TableScan
            alias: p
            Statistics: Num rows: 5737 Data size: 45896 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (product_id is not null and product_category_id is not null) (type: boolean)
              Statistics: Num rows: 1435 Data size: 11479 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 order_item_product_id (type: int)
                  1 product_id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: oi
            Statistics: Num rows: 206028 Data size: 1648231 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: order_item_product_id is not null (type: boolean)
              Statistics: Num rows: 103014 Data size: 824115 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 order_item_product_id (type: int)
                  1 product_id (type: int)
                outputColumnNames: _col3, _col10
                Statistics: Num rows: 113315 Data size: 906526 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col10 (type: int)
                    1 category_id (type: int)
                  outputColumnNames: _col3, _col20
                  Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col20 (type: string), _col3 (type: int)
                    outputColumnNames: _col20, _col3
                    Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(_col3)
                      keys: _col20 (type: string)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col1 (type: bigint)
              sort order: -
              Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col0 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 10
            Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

The Impala plan looks a bit different :

Estimated Per-Host Requirements: Memory=4.16GB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
default.categories, default.order_items, default.products

11:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: count(order_item_quantity) DESC
|  limit: 10
|
06:TOP-N [LIMIT=10]
|  order by: count(order_item_quantity) DESC
|
10:AGGREGATE [FINALIZE]
|  output: count:merge(order_item_quantity)
|  group by: c.category_name
|
09:EXCHANGE [HASH(c.category_name)]
|
05:AGGREGATE
|  output: count(order_item_quantity)
|  group by: c.category_name
|
04:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: p.product_category_id = c.category_id
|
|--08:EXCHANGE [BROADCAST]
|  |
|  02:SCAN HDFS [default.categories c]
|     partitions=1/1 files=1 size=1.91KB
|
03:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: oi.order_item_product_id = p.product_id
|
|--07:EXCHANGE [BROADCAST]
|  |
|  01:SCAN HDFS [default.products p]
|     partitions=1/1 files=1 size=43.80KB
|
00:SCAN HDFS [default.order_items oi]
   partitions=1/1 files=1 size=1.57MB

As well as being somewhat shorter, the Impala plan appears more familiar to someone looking at an Oracle Query Plan.

To check for the possible effect of caching, I then ran these queries again.

For Impala, the runtime dropped to around 13 seconds.

For Hive, the runtime was the same as for the first run.

The Impala plan was unchanged, despite the fact that the runtime was drastically reduced. From this behaviour I would infer that there is some caching effect for Impala, although whatever is being cached – unlike Oracle’s Buffer Cache – is clearly not shared between the two query engines.

The explanation offered in the tutorial is that Hive compiles SQL queries into MapReduce jobs, whereas Impala was designed as a SQL engine. It would appear that Impala is better for small-scale interactive queries, whereas Hive is more suited to large-scale ETL.

There’s a benchmarking comparison between Hive, Impala, and Spark that may be of interest here .

Optimizer Statistics

You’ll notice that, at the start of the Impala query plan, there’s a warning about missing statistics.

If you roll the mouse over one of the tables on the left-hand side of the Hue Query Editor Window, you’ll see an icon appear which enables you to explore the table’s metadata.

Looking at one of these tables we can see that there are no stats present :

We can fix this easily enough in Impala by gathering stats for each of our tables. For example :

compute stats categories;

If we now check the metadata for the CATEGORIES table, we can see that stats are present :
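If you prefer a command-line check to the Hue metadata pane, Impala will report the same information via SHOW TABLE STATS and SHOW COLUMN STATS :

# row counts, file counts and sizes for the table...
impala-shell -q 'show table stats categories;'
# ...and per-column statistics
impala-shell -q 'show column stats categories;'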

If we now re-run the original query, the plan will no longer show the warning. However, the rest of the plan remains unchanged. Given that these are quite small tables, this is probably not surprising.

There’s more information about stats gathering in Impala here .

External Tables

One thing that Hive is apparently good for is creating External Tables.

In the tutorial, some unstructured data ( a log file) is loaded and external tables are then created in Hive using the following code :

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = '%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s')
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

Once again, it looks familiar for the most part.

One point of interest is the INSERT OVERWRITE command used to insert data into the table.

This has the effect of removing any pre-existing data in the table before inserting the new data.
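By way of contrast, INSERT INTO would append to whatever is already there. A sketch of the two, run through the hive CLI (the ADD JAR is repeated because the source table uses the RegexSerDe) – you wouldn’t normally want to run both :

# appends to the existing contents of the target table
hive -e 'add jar /usr/lib/hive/lib/hive-contrib.jar;
         insert into table tokenized_access_logs select * from intermediate_access_logs;'
# replaces the existing contents of the target table
hive -e 'add jar /usr/lib/hive/lib/hive-contrib.jar;
         insert overwrite table tokenized_access_logs select * from intermediate_access_logs;'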

String functions and other SQL Stuff

We can query the new external table in Impala :

invalidate metadata tokenized_access_logs;
select * from tokenized_access_logs;

The values in the url column contain %20 instead of spaces. Let’s see what Impala can do in terms of the standard SQL string manipulation functions.

Unlike Oracle, there’s no REPLACE function; there is, however, a REGEXP_REPLACE…

select regexp_replace(url, '%20', ' ')
from tokenized_access_logs;

When we run this, we can see that the REGEXP_REPLACE has done the job :

The logs we’re really interested in are the ones where a specific product has been viewed. If we can get the name of the product from the url, then maybe that will help when relating this data back to the most popular items in terms of sales…

select substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%';

When we run this, we can see that the INSTR function also works as expected…to a point. Note that the ‘/’ characters need to be escaped with a ‘\’. The result looks like this :

Now, there are a couple of records with the product name and “/add_to_cart” appended. With INSTR in Oracle, you can search for a pattern starting at the end of a string – e.g. :

select instr('/product/Pelican Sunstream 100 Kayak/add_to_cart', '/',-1,1)
from dual;

INSTR('/PRODUCT/PELICANSUNSTREAM100KAYAK/ADD_TO_CART','/',-1,1)
---------------------------------------------------------------
							     37

There is no such option in the Impala equivalent. It simply searches the string from the start and reports the first occurrence. Fortunately, we want to strip out these results for the stuff we’re going to do in a minute. Therefore, I’ve just amended the query to be :

select substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%';

We can use fairly standard SQL to get a listing of the products by number of views :

select count(*), substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%'
group by url
order by 1 desc;

In the tutorial, it mentions that there is one product that has a lot of views but no sales. I wonder if we can find this just using SQL ?

with viewed_products as
(
select count(*) as times_viewed,
substr(regexp_replace( url, '%20', ' '),
       instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9) as product_name
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%'
group by url
)
select v.times_viewed, v.product_name
from viewed_products v
where upper(v.product_name) not in
(
  select upper(p.product_name)
  from products p
  inner join
  (
    select oi.order_item_product_id as product_id,
        count(oi.order_item_id) as units_sold
    from order_items oi
    inner join orders o
        on oi.order_item_order_id = o.order_id
    where o.order_status not in ('CANCELED', 'SUSPECTED_FRAUD')
    group by order_item_product_id
  ) s
    on p.product_id = s.product_id
);

OK, it’s not the most elegant SQL I’ve ever written but it does demonstrate that :

  • You can define in-line views using the WITH clause
  • You can use a NOT IN subquery

…better than that, it gives the correct answer :

adidas Kids' RG III Mid Football Cleat

In the tutorial, the reason for the lack of sales is a coding error in the product page. You can ask your own snarky question about Unit Testing practices here.

Going back to the csv files we were playing around with on hdfs earlier, how would we get those into our database ?

Go fly a Kite

Yes, it’s another tool. This one is called Kite. Its purpose, in this case at least, is to help us create a table based on the tools.csv file and allow us to then make use of it in Impala.

First of all, I’ve created the tools.csv file on the local file system :

ls -l tools.csv
-rw-rw-r-- 1 cloudera cloudera 303 Mar 26 17:21 tools.csv

Now to get kite to create a “table” definition based on the csv :

kite-dataset csv-schema tools.csv --class Tools -o tools.avsc
ls -l tools.*
-rw-rw-r-- 1 cloudera cloudera 373 Mar 26 17:22 tools.avsc
-rw-rw-r-- 1 cloudera cloudera 303 Mar 26 17:21 tools.csv

If we have a look at the new Avro schema file that kite has generated, it looks similar in terms of format to the one that SQOOP generated earlier :

{
  "type" : "record",
  "name" : "Tools",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "tool",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from 'hue'",
    "default" : null
  }, {
    "name" : "description",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from 'Web-based UI for Hadoop'",
    "default" : null
  } ]
}

The next step is to create the table metadata ( in kite terminology this is a dataset)…

kite-dataset create tools -s tools.avsc

…and finally add the data itself to our table…

kite-dataset csv-import tools.csv tools
The url to track the job: http://localhost:8080/
Added 9 records to "tools"

To check in Impala, we can head over to Hue, select Impala from the Query Editors drop-down and run :

invalidate metadata tools;
select * from tools;

The result should look like this :

Back on the command line, we can see that kite has added files to hdfs :

hadoop fs -ls /user/hive/warehouse/tools
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-03-26 17:32 /user/hive/warehouse/tools/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-03-26 17:35 /user/hive/warehouse/tools/.signals
-rw-r--r--   1 cloudera supergroup        622 2016-03-26 17:35 /user/hive/warehouse/tools/8baf8440-41b0-4f29-8750-16aeb3aec9b0.avro

The .avro file appears to have been compressed using the Snappy compression library also used by SQOOP earlier. You can, however, read the file by running :

hadoop fs -text /user/hive/warehouse/tools/8baf8440-41b0-4f29-8750-16aeb3aec9b0.avro
{"tool":{"string":"banana"},"description":{"string":"a web UI framework"}}
{"tool":{"string":"flume"},"description":{"string":"stream a file into Hadoop"}}
{"tool":{"string":"hive"},"description":{"string":"another query engine"}}
{"tool":{"string":"hue"},"description":{"string":"Web-based UI for Hadoop"}}
{"tool":{"string":"impala"},"description":{"string":"a query engine"}}
{"tool":{"string":"khafka"},"description":{"string":"a scheduler"}}
{"tool":{"string":"oozie"},"description":{"string":"another scheduler"}}
{"tool":{"string":"spark"},"description":{"string":"a query engine that is not hive or impala ?"}}
{"tool":{"string":"sqoop"},"description":{"string":"Transfer structured data between an RDBMS and Hadoop"}}

Summary

The Getting Started Tutorial goes on to cover various other tools available in the Cloudera distribution for doing data manipulation and analysis.

Additionally, you may find this presentation on Hadoop internals for Oracle Devs by Tanel Poder useful.

For now though, I have enough to keep my head above water.




