Introduction To Apache Spark SQL

Datetime: 2016-08-23 04:19:23          Topic: SQL

1. Objective

After understanding What is Apache Spark, in this tutorial we will discuss Spark SQL – the Spark module that enables users to run SQL queries over Spark. In this Spark SQL tutorial we will cover the various components and terminologies of Spark SQL: what DataSet and DataFrame are, what SQLContext and HiveContext are, and what the features of Spark SQL are.

2. Introduction

Spark SQL integrates relational processing with Spark's functional programming. It is the Spark module for structured data processing. Spark SQL blurs the line between RDD and relational table. It offers much tighter integration between relational and procedural processing through the declarative DataFrame API, which integrates with Spark code, and it provides better optimization. The DataFrame API and the Dataset API are the ways to interact with Spark SQL. With Spark SQL, Apache Spark becomes accessible to more users and improves optimization for existing ones.

Spark SQL runs on top of Spark Core. It allows developers to import relational data from Hive tables and Parquet files, run SQL queries over imported data and existing RDDs, and easily write RDDs out to Hive tables or Parquet files. Spark SQL provides the DataFrame API, which performs relational operations on both external data sources and Spark's built-in distributed collections. Spark SQL also introduces an extensible optimizer called Catalyst, which helps it support a wide range of data sources and algorithms in Big Data.

3. Interfaces of Spark SQL

3.1. DataFrame

Spark DataFrames are lazily evaluated, like transformations in Apache Spark. A DataFrame is equivalent to a relational table in Spark SQL and is used for storing data in tables. It is similar to a table in a relational database, but with richer optimizations. It is a data abstraction and domain-specific language (DSL) applicable to structured and semi-structured data. A DataFrame is a distributed collection of data organized into named columns and rows. To create and work with DataFrames, either an SQLContext or a HiveContext is needed.
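A minimal sketch of the DataFrame API (Spark 1.x style, assuming a spark-shell session where sc and sqlContext already exist; people.json is a hypothetical input file):

```scala
// Create a DataFrame from a JSON file; evaluation is lazy until an action runs.
val df = sqlContext.read.json("people.json")

df.printSchema()                    // inspect the inferred schema
df.select("name").show()           // DSL-style projection
df.filter(df("age") > 21).show()   // relational filter, planned by Catalyst
```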

3.2. SQLContext

SQLContext is the entry point for working with structured data (rows and columns) in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
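For example, a sketch (Spark 1.x API) of creating an SQLContext from an existing SparkContext and running SQL over a registered temporary table; the input file is hypothetical:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)            // sc: an existing SparkContext
val df = sqlContext.read.json("people.json")   // hypothetical input file

// Register the DataFrame as a temporary table and query it with SQL.
df.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()
```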

3.3. Hive Context

HiveContext is a descendant of SQLContext used for working with Hive tables. It is more battle-tested and provides richer functionality than SQLContext.
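A sketch (Spark 1.x API) of using HiveContext to run HiveQL against a Hive table; the table name src is a placeholder:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)          // sc: an existing SparkContext

// HiveQL statements run unchanged through the HiveContext.
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT key, value FROM src ORDER BY key").show()
```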

3.4. Datasets

A Dataset provides the benefits of RDDs – a strongly-typed, immutable collection of objects – while also being mapped to a relational schema, so the fields of a row can be accessed naturally by name (e.g. row.columnName). Datasets add the benefit of compile-time type safety: applications can be checked for errors before they are run. They also allow direct operations over user-defined classes.
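A sketch of the typed Dataset API (Spark 1.6 style, in spark-shell); the Person case class and its fields are made up for illustration:

```scala
import sqlContext.implicits._

case class Person(name: String, age: Long)

// Build a Dataset from a local collection of user-defined objects.
val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

// Compile-time type safety: fields are accessed as ordinary Scala members.
ds.filter(_.age > 30).map(_.name).show()
```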

3.5. JDBC Datasource

The JDBC data source can be used to read data from relational databases using the JDBC API. This is preferred over using an RDD because the data source returns the results as a DataFrame, which can be processed in Spark SQL or joined with other data sources.
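A sketch (Spark 1.x API) of reading a relational table over JDBC into a DataFrame; the connection URL, table name, and credentials below are placeholders:

```scala
// Read a database table into a DataFrame via the JDBC data source.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "employees")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()

// The result is an ordinary DataFrame, so it can be joined with other sources.
jdbcDF.show()
```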

3.6. Catalyst Optimizer

Catalyst is built on functional programming constructs in Scala and is the newest and most technically sophisticated component of Spark SQL. Catalyst is a query plan optimizer. It provides a general framework for transforming trees, which is used to perform analysis, optimization, physical planning, and runtime code generation. It supports both cost-based and rule-based optimization, and it makes queries run much faster than their RDD counterparts. Catalyst is a modular library built as a rule-based system; each rule in the framework focuses on a distinct optimization.
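As a rough illustration, calling explain(true) on a DataFrame query prints the plans Catalyst produces (Spark 1.x API; the Parquet file name is a placeholder):

```scala
val df = sqlContext.read.parquet("people.parquet")

// Prints the parsed, analyzed, optimized logical plans and the physical plan.
df.filter(df("age") > 21).select("name").explain(true)
```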

4. Features of Spark SQL

4.1. Integrated

Seamlessly mix SQL queries with Spark programs. Spark SQL lets you query structured data inside Spark programs, using either SQL or the DataFrame API, and is usable in Java, Scala, Python and R.
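For example, a sketch (Spark 1.x API) that mixes a SQL query with DataFrame operations in one program; file and column names are made up:

```scala
val logs = sqlContext.read.json("logs.json")   // hypothetical input file
logs.registerTempTable("logs")

// The SQL result is itself a DataFrame, so DSL operations chain onto it.
val errors = sqlContext.sql("SELECT level, message FROM logs WHERE level = 'ERROR'")
errors.groupBy("level").count().show()
```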

4.2. Uniform Data Access

DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC, and data can be joined across these sources.
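A sketch (Spark 1.x API) joining data read from two different sources; file and column names are hypothetical:

```scala
// The same read API covers different formats.
val users  = sqlContext.read.parquet("users.parquet")
val events = sqlContext.read.json("events.json")

// DataFrames from different sources can be joined directly.
users.join(events, users("id") === events("userId"))
     .select("name", "eventType")
     .show()
```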

4.3. Hive Compatibility

Run unmodified Hive queries on existing data. Spark SQL reuses the Hive frontend and metastore, giving full compatibility with existing Hive data, queries, and UDFs.

4.4. Standard Connectivity

Connect through JDBC or ODBC. Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity for business intelligence tools.

4.5. Performance & Scalability

Spark SQL incorporates a cost-based optimizer, code generation, and columnar storage to make queries fast while scaling to thousands of nodes on the Spark engine, which provides full mid-query fault tolerance.
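As a rough illustration of the columnar storage piece, caching a registered table keeps it in an in-memory columnar format (Spark 1.x API; the input file is hypothetical):

```scala
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")

sqlContext.cacheTable("people")                                   // in-memory columnar cache
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()  // served from the cache
sqlContext.uncacheTable("people")
```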




