Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Datetime:2016-08-22 21:52:10          Topic: Python           Share

This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.

Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla , PyHive , and other (largely pure Python) client libraries for talking to Hive and Impala.

We are excited to release this effort, dubbed hs2client, as a new Apache-licensed open source project on GitHub . As you may guess from the name, this library implement the HiveServer2 Thrift API as a C++ library, with careful handling of result sets to allow languages like Python to access data at high performance.

The initial alpha preview release contains:

  • A C++ library, libhs2client , which provides a clean C++ API for the HiveServer2 Thrift API. This can be built and dynamically or statically linked in C/C++ applications with no direct exposure to Apache Thrift.
  • Python bindings, with optimized reads to pandas.DataFrame

To try out the library, you can install a dev build of the project right now with conda:

condainstallhs2client -c cloudera/channel/dev

Optimizing for Speed and Memory Use

The Python bindings for libhs2client are careful about performance and memory in a few key ways:

  • Converting row batches directly into pandas-compatible NumPy arrays
  • Care with categorical data: string data is deduplicated while it’s being converted for pandas. In the future, it would be straightforward to add an option to return all string data as pandas.Categorical arrays.

hs2client Demo and Simple Benchmark

The initial API is oriented at modeling the HiveServer2 protocol closely. Connecting to a cluster yields a Service instance:

In [1]:  imporths2client
 
          service = hs2client.connect('localhost', 21050, 'wesm')

To execute queries, you open a session then use its execute method.

In [2]:  session = service.open_session()
 
          op = session.execute("select 1")

Queries are asynchronous by default.

In [3]:  op.is_finished
 
Out[3]:  True
 
In [4]:  op.get_state()
 
Out[4]:  'finished'

Result sets can be fetched currently to pandas.DataFrame objects:

In [5]:  df = op.fetchall_pandas()
 
In [6]:  df
 
Out[6]:      1
          0  1

Compared with comparable database clients, we’ve optimized the fetch path to have excellent performance (IO speeds) and memory use.

In [10]:  importnumpyas np
          importpandasas pd
 
          N, K = 1000000, 10
 
          df = pd.DataFrame(np.random.randn(N, K),
          columns=['data{0}'.format(i) for i in range(K)])
 
          importibis
          con = ibis.impala.connect('localhost', 
          hdfs_client=ibis.hdfs_connect(port=5070))
          con.drop_database('hs2_test', force=True)
          con.create_database('hs2_test')
          con.create_table('test_data', df, database='hs2_test')
 
In [16]:  op = session.execute_sync('select * from hs2_test.test_data')
          %timedf = op.fetchall_pandas()
          CPUtimes: user 200 ms, sys: 64 ms, total: 264 ms
          Walltime: 682 ms
 
In [21]:  speed = df.memory_usage().sum() / 0.682 / 2**20
          print('Speed: {0:.2f} MB/s'.format(speed))
          Speed: 111.87 MB/s

This benchmark is being run on localhost, but it shows we can move over 110MB/s through the Thrift protocol and convert to a fully-formed pandas DataFrame.

hs2client Compared with impyla

In Ibis and other Python projects, we have been using impyla to execute queries and access result sets. impyla either uses Apache Thrift’s official Python implementation (on Python 2) or thriftpy (on Python 3) for interacting with the Impala or Hive Thrift service. Because of this, the main difference is the performance in fetching large result sets.

Here, I will perform an equivalent fetch using impyla via Ibis :

In [29]: # Connect to Impala via impyla (on Python 2)
        con = ibis.impala.connect('localhost', port=21050)
 
        expr = con.sql('select * from hs2_test.test_data')
 
        %timedf = expr.execute(limit=None)
        CPUtimes: user 1.15 s, sys: 52 ms, total: 1.2 s
        Walltime: 1.66 s
 
In [30]: speed = df.memory_usage().sum() / 1.66 / 2**20
        print('Speed: {0:.2f} MB/s'.format(speed))
        Speed: 45.96 MB/s

In this particular use case with an all-numeric table (which is ideal for deserialization performance), the speed difference is only 2x or a bit more.

hs2client Roadmap

The C++ library does not yet implement some important features that are needed to be a drop-in replacement for impyla or other Hive or Impala drivers:

  • SSL transport (with certificate verification)
  • SASL Thrift transport for secure (i.e. Kerberos) clusters, or insecure clusters using SASL
  • Note that many Hive clusters, even unsecured ones, use the SASL transport, so using the library with Hive right now may be a challenge until that is completed.

On the Python side, we must implement a DB API 2.0 compatibility layer, since currently data can only be fetched to pandas, not Python tuples as with most Python database drivers.

We of course welcome contributions from the community to build out some of these features.

Wes McKinney is a Software Engineer at Cloudera. He is the creator of Python’s pandas library and the Ibis project, a committer to Apache Arrow and Apache Parquet, and the author of Python for Data Analysis

Marcel Kornacker is the architect of Impala and its tech lead.

Matt Jacobs and Thomas Tauber-Marshall are Software Engineers working on Cloudera’s Impala team.





About List