Apache Hive 2.1 was released about a month ago and it’s a great opportunity to review how Hive 2 is drastically changing the landscape for SQL on Hadoop.
There is so much new in Hive it’s hard to pick highlights, but here are a few:
- Interactive query with Hive LLAP. LLAP was introduced in Hive 2.0 and improved in Hive 2.1 to deliver 25x faster performance than Hive 1 (covered in detail below).
- Robust SQL ACID support with more than 60 stabilization fixes.
- 2x Faster ETL through a smarter CBO, faster type conversions and dynamic partitioning optimizations.
- Procedural SQL support, dramatically simplifying migration from EDW solutions.
- Vectorization support for text files, introducing an option for fast analytics without any ETL.
- A host of new diagnostics and monitoring tools including a new HiveServer2 UI, a new LLAP UI and an improved Tez UI.
- In total, there are more than 2,100 features, improvements and fixes between Hive 2.0 and Hive 2.1. The rate of innovation is tremendous and momentum continues to grow.
We’ll explore these other topics in later posts. For now, we’ll focus on the most anticipated feature of Hive 2 and the massive performance gains it unlocks.
25x Faster Query Performance With Hive LLAP
The biggest story in Hive 2 is also its most anticipated feature: LLAP which stands for ‘Live Long and Process’. To summarize it, LLAP combines persistent query servers and optimized in-memory caching that allows Hive to launch queries instantly and avoids unnecessary disk I/O. Put another way, LLAP is a second-generation big data system: LLAP brings compute to memory (rather than compute to disk), it caches memory intelligently and it shares this data among all clients, while retaining the ability to scale elastically within a cluster.
Figure 1: The LLAP Architecture
Figure 2: Tez LLAP process compared to Tez execution process and MapReduce process
Comparing Hive With LLAP to Hive on Tez
To measure the improvement LLAP brings we ran 15 queries that were taken from the TPC-DS benchmark, similar to what we have done in the past. The entire process was run using the hive-testbench repository and data generation tools. The queries there are adapted to Hive SQL but are otherwise not modified from the standard TPC-DS queries using any of the tricks that some big data vendors routinely use to show better performance for their tools. This blog only covers 15 queries but a more comprehensive performance test is underway.
The full test environment is explored below but at a high level, the tests run using 10 powerful VMs with a 1TB dataset that is intended to show performance at data scales commonly used with BI tools. The same VMs and the same data are used both for Hive 1 and for Hive 2. All reported times represent the average across 3 runs in the respective Hive version.
Figure 3: Hive 1 with Tez versus Hive 2 with LLAP
As you can see, LLAP delivers a dramatic performance gain. Minimum query runtime with Hive LLAP is a mere 1.3 seconds, compared to 9.58 seconds in Hive 1.
Let’s discuss some of the main reasons for these performance gains.
Smarter Map Joins
Hive on Tez is a shared-nothing architecture: each processing unit works independently with its own memory and disk resources. LLAP is a multi-threaded process that allows memory sharing between workers. A map-side join requires a hash table to be distributed 1:1 into each map task. If you have 24 containers on a node you need to make 24 copies of the hash table and distribute it out. With LLAP you build the hash table once per node and cache it in-memory for all workers. This is especially important for low-latency SQL.
A great example of this is Query 55. In TPC-DS, Query 55 touches the smallest amount of data among any query that queries a fact table, just 1 month. To run this query in Hive on Tez the small date_dim and item tables must first be distributed to all Tez tasks. With LLAP this happens once per node, a large part of the reason LLAP’s average execution time is 1.3s, compared to Hive on Tez’s 24.72s.
Better MapJoin Vectorization for Joins
Many MapJoin optimizations have made their way into Hive 2. For example, joins against small dimension tables now run as fast as explicitly expanded lists.
A great example of where this helps is Query 43, which has a 37% selectivity in the store dimension join. Better MapJoin vectorization, which takes advantage of repeating sequences in the fact table, helps take Query 43 from 195.2s down to 4.2s.
A Fully Vectorized Pipeline
Hive 2 introduces Map Join vectorization in the reduce side with dynamically partitioned hash joins, essentially a reduce-side version of the MapJoin optimization. With this optimization, reducer inputs are unsorted and streamed through a hash table held on the reduce side. The optimization divides a large dimension table into many small disjoint dimension tables, allowing for the previous dimension table optimizations to scale upwards in size.
A great example here is Query 13 which touches a combination of several very large and several very small dimension tables, so it needs to run as a shuffle join for safety but gets high selectivity from the other dimension filters. This optimization helps take Query 13 from 90.2s down to 4.8s.
A Smarter CBO
The integration with Apache Calcite for sophisticated cost-based optimization continues to deepen and is reaping big rewards. For a few examples, Hive’s CBO can now factor join keys out of deeply nested predicates (avoiding cross joins), infer transitive predicates across joins, and apply basic transformations even when tables have no stats (a big win for ETL jobs).
Hive 1, Hive 2
- Hive 1.2.1
- Tez 0.7.0
- Hadoop 2.7.1
- Tez 0.9.0-snapshot
- Hadoop 2.7.1
Other Hive / Hadoop Settings:
All this software is deployed using Apache Ambari using HDP software that is currently in technical preview. In addition to the default settings from Ambari, some new optimizations are made for Hive 2. These optimizations will be set for default new installs at GA.
For the most part, OS defaults were used with 1 exception:
This screenshot gives a sense of how LLAP was configured to take advantage of the hardware within Ambari. Please note, LLAP configuration in Ambari is evolving at the time of this blog and current installation may look slightly different from this image.
Figure 4: Streamlined LLAP configuration in Apache Ambari
- The test was driven by the Hive Testbench https://github.com/t3rmin4t0r/hive-testbench/tree/hive14 in data generation and in queries. The same query text was used both for Hive 1 and for Hive 2.
- The 15 queries in this benchmark are a sample, a more comprehensive benchmark of Hive with LLAP will be published in a few weeks.
As always, Apache Hive is 100% open source and can be used on the Hadoop distribution of your choice, and that goes for the performance improvements as well as all the other features discussed in the blog.