This tutorial explains the basics of Apache Hive in detail. It covers the need for Hive, its characteristics, and the internals of the Hive architecture. Follow this quickstart guide for installation of Hive.
Apache Hive is a data warehousing solution for Hadoop that provides data summarization, querying, and ad-hoc analysis. It is used to process structured and semi-structured data in Hadoop. Hive supports analysis of large datasets stored in Hadoop’s HDFS and also in the Amazon S3 filesystem. Like SQL, Hive provides its own query language, named HiveQL, which can be used to run ad-hoc queries for data analysis. Earlier we had to write complex MapReduce jobs, but with Hive we only need to submit SQL queries. Hive converts these SQL queries into MapReduce jobs.
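To make the idea of turning a query into a MapReduce job more concrete, here is a minimal Python sketch of how a grouped count such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be expressed as map, shuffle, and reduce phases. It is an illustration only and not Hive's actual query planner; the table, the column names, the sample rows, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [
    ("asha", "sales"),
    ("bo", "engineering"),
    ("chen", "sales"),
    ("dee", "engineering"),
    ("eli", "sales"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def grouped_count(rows):
    """Chain the three phases for a single grouped count."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(grouped_count(ROWS))
```

With Hive, the user writes only the one-line query; the work of splitting it into phases like these and running them across the cluster is done for them.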
In 2007, engineers at Facebook started building their own warehouse infrastructure on Hadoop, because traditional warehousing solutions were very expensive to buy. Hive was the result. It was later developed further under the Apache Software Foundation and released as an open source warehousing solution named Apache Hive. It is now used and developed by a number of companies such as IBM, Yahoo, Amazon, Netflix, the Financial Industry Regulatory Authority (FINRA), and many more.
4. Need for Hive
Hive saves developers from writing complex MapReduce jobs for ad-hoc requirements. It provides summarization, querying, and analysis of data. Hive is fast, scalable, and highly extensible. Hive has a large user base: with its help, hundreds of users can run thousands of jobs on the cluster at the same time. Because HiveQL is similar to SQL, it is easy for SQL developers to learn and to write Hive queries.
In Hadoop it is complex and challenging to write a custom MapReduce job for every ad-hoc requirement. There is a shortage of Hadoop talent in the industry, and we would need an army of MapReduce developers. This is where Hive plays a crucial role: it reduces the complexity of MapReduce by providing an interface where users can submit SQL queries. Business analysts can now work with Big Data using Hive and generate insights. Hive provides access to data in various data stores such as HBase and HDFS. One of the most useful features of Hive is that we do not need to learn Java in order to learn Hive.
5. Hive Architecture
The diagram below shows the basic architecture of Hive. We have a Hadoop cluster with master and slave nodes, on which HDFS and MapReduce run. Hive is a high-level component that sits on top of Hadoop. It is usually installed on the master (or on an edge node in the case of a big cluster). Hive needs a Metastore, which is an RDBMS (it can be any RDBMS, such as MySQL or Oracle). Hive stores its metadata inside this RDBMS, which can be deployed on the master or on a third machine; follow this tutorial for configuring MySQL as the Metastore for Hive. When a user submits a SQL query in Hive, such as one that creates a table, the table's metadata (number of rows, columns, and so on) is stored inside the RDBMS. Hive converts SQL queries into MapReduce jobs and submits them to the cluster, so the data is processed on the slave nodes. Every time a user submits a DDL query, Hive updates its Metastore.
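The separation between metadata and data can be sketched in a few lines of Python. In this sketch SQLite, from the Python standard library, stands in for the Metastore RDBMS. The tables `tbls` and `tbl_cols`, their columns, and the helper functions are a simplified, hypothetical layout and not Hive's real Metastore schema. The point is only that a DDL statement changes the metadata store, while the data itself stays in the cluster's storage.

```python
import sqlite3


def open_metastore():
    """Create an in-memory stand-in for the Metastore RDBMS (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbls (tbl_name TEXT PRIMARY KEY, location TEXT)")
    conn.execute(
        "CREATE TABLE tbl_cols "
        "(tbl_name TEXT, col_name TEXT, col_type TEXT, position INTEGER)"
    )
    return conn


def create_table(conn, name, columns, location):
    """Record what a CREATE TABLE statement describes: name, columns, data location."""
    with conn:  # commit both inserts together, or roll back on error
        conn.execute("INSERT INTO tbls VALUES (?, ?)", (name, location))
        conn.executemany(
            "INSERT INTO tbl_cols VALUES (?, ?, ?, ?)",
            [(name, col, typ, i) for i, (col, typ) in enumerate(columns)],
        )


def describe(conn, name):
    """Look up a table's columns in declaration order."""
    cur = conn.execute(
        "SELECT col_name, col_type FROM tbl_cols "
        "WHERE tbl_name = ? ORDER BY position",
        (name,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    metastore = open_metastore()
    create_table(
        metastore,
        "employees",
        [("name", "string"), ("dept", "string")],
        "/user/hive/warehouse/employees",  # hypothetical storage path
    )
    print(describe(metastore, "employees"))
```

No rows of the employees table are read or written here; only the description of the table is stored, which is what allows the metadata store to live on a different machine from the data.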
6. Hive Shell
The shell is the primary way to interact with Hive; we can issue commands or queries in HiveQL (HQL) inside the Hive shell. The Hive shell is the command line interface for Hive and is very similar to the MySQL shell. Like SQL, HiveQL is case insensitive (except for string comparisons).
We can run the Hive shell in two modes: non-interactive mode and interactive mode.
Hive in Non-Interactive mode:
The Hive shell can be run in non-interactive mode. With the -f option we specify the location of a file that contains HQL queries. For example: hive -f my-script.q
Hive in Interactive mode:
In this mode we go directly to the Hive shell and run the queries there, submitting each query manually and getting the result back. For example, run $ bin/hive to go to the Hive shell.
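The two modes differ only in how the shell is started. The following Python sketch writes a small HQL script file and builds the command line for each mode. The helper names and the sample statement are hypothetical, and the sketch assumes the executable is called hive and is on the PATH; it only tries to run the script if such an executable is actually found.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_hive_command(script_path=None, hive_binary="hive"):
    """Return the argument list for the Hive shell.

    With a script path the shell runs non-interactively (hive -f <file>);
    without one it starts the interactive shell.
    """
    if script_path is None:
        return [hive_binary]
    return [hive_binary, "-f", str(script_path)]


def write_script(directory, statements, name="my-script.q"):
    """Write HQL statements to a script file, one per line, each ending in ';'."""
    path = Path(directory) / name
    path.write_text("".join(f"{stmt.rstrip(';')};\n" for stmt in statements))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = write_script(tmp, ["SHOW TABLES"])
        command = build_hive_command(script)
        print("non-interactive:", command)
        print("interactive:", build_hive_command())
        # Only attempt to run the script if a hive executable is on the PATH.
        if shutil.which(command[0]):
            subprocess.run(command, check=False)
```

Non-interactive mode suits scheduled or repeated work, since the script file can be versioned and rerun, while interactive mode is more convenient for exploring data one query at a time.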