A First Look at Hive and HiveQL

Datetime: 2016-08-22 21:53:10

Like many of my colleagues, I’ve been looking at various Big Data tools and technologies. It’s quite difficult (for me anyway) to completely grasp the similarities, differences, and relationships among the large number of Big Data tools.

At a low level of abstraction is HDFS, the Hadoop Distributed File System, which stores data across multiple machines. Unless you design system-level tools, it’s unlikely you’ll want to work directly with HDFS, in much the same way that developers rarely work directly with Windows or Linux at a very low level.

Residing on an HDFS-based system is data. Hadoop is a framework that combines HDFS storage with Java libraries that let developers process that data. The original way to get at Hadoop data is to write a Java program that uses the so-called MapReduce paradigm. The Map part runs in parallel on chunks of the distributed data and emits intermediate key-value pairs, and the Reduce part combines the intermediate results for each key into the final output.
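To make the Map and Reduce idea concrete, here is a minimal single-machine sketch in Python of the classic word-count example. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step that a real framework performs across machines is simulated here with an ordinary dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key (simulated here with a dict)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is the part a real
    # cluster would run in parallel on different chunks of the data.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["hive and pig", "pig and spark"]
    print(word_count(sample))
```

The point of the split is that map_phase never needs to see more than one line at a time, and reduce_phase never needs to see more than one key at a time.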

Writing MapReduce programs in Java is time-consuming and not easy, so there are several tools that make it easier. Pig is one such tool: you write scripts in a language called Pig Latin, which has SQL-like operations (filter, group, join) but is written as a step-by-step data flow, and the scripts are translated to MapReduce jobs behind the scenes.

Hive is another tool that makes writing MapReduce code easier. The language part of the Hive system is HiveQL, which looks very similar to SQL; queries are translated to MapReduce jobs behind the scenes. In other words, Pig and Hive do the same sort of thing.

(Note: I found the image above on the Internet. When I explored Hive I used a Docker image and a command shell.)
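To give a feel for the SQL-like style, here is a small sketch of a grouped aggregate query. Note the assumptions: it runs against SQLite through Python’s built-in sqlite3 module, not against Hive, and the employees table, its columns, and the average_salary_by_dept function are all made up for illustration. A SELECT with GROUP BY of this shape should read much the same in HiveQL, but I have not run this exact query in Hive.

```python
import sqlite3


def average_salary_by_dept(rows):
    """Load (name, dept, salary) rows into an in-memory table and
    return the average salary per department, sorted by department."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

    # You state what result you want; the engine decides how to compute it.
    # With MapReduce you would write the grouping and averaging yourself.
    query = """
        SELECT dept, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept
        ORDER BY dept
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample_rows = [
        ("ann", "eng", 100.0),
        ("bob", "eng", 200.0),
        ("cat", "ops", 50.0),
    ]
    for dept, avg_salary in average_salary_by_dept(sample_rows):
        print(dept, avg_salary)
```

Compared with the hand-written Map and Reduce steps, the grouping and aggregation are expressed in four lines of query text.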

A third tool is Spark, which is a replacement for MapReduce rather than a layer on top of it: it has its own processing engine and can read the same HDFS data. You can use Spark from Java, Scala (a language that runs on the Java Virtual Machine), Python, or R. Spark works at a lower level than Pig and Hive, and keeps intermediate results in memory rather than writing them to disk between steps, so Spark is very fast and useful for scenarios where you access data multiple times.

I’ve left out tons of details and over-simplified a bit, but I think I’ve hit the main ideas correctly. In particular, Hive is more complicated than I’ve described. Like anything, the only way to learn Hive is to use it. So I played around a bit. I’m starting to understand Hive and HiveQL. I think.