How to Get More ROI from Big Data and Hadoop with Apache Drill
Many businesses have run a Hadoop Big Data system dedicated to a specific use case. Perhaps they are collecting call center records, analyzing sensor reports from the factory floor, or monitoring tweets to track customer experience in real time.
Confining Big Data projects to select initiatives made sense at first, because many early Big Data analysis solutions were optimized for a limited set of use cases. But solution options have matured and expanded, as have the data sources businesses draw from. To get the most out of your Big Data investment now, and to take full advantage of the famous "three Vs of Big Data" (volume, variety, velocity), you'll want to begin planning your shift from a single use case to multiple use cases.
Expand Your Use Case Options with Apache Drill
Utilizing data-driven intelligence across the enterprise requires solutions that enable interactive, self-service ways of working with historical and near real-time data. The core Hadoop platform has already solved many of the fundamental (legacy) Big Data access and availability problems. With the addition of the standalone query engine Apache Drill, data analysts finally have the freedom to follow their data queries easily across multiple data sources, on demand.
Apache Drill was designed to support a wide range of SQL use cases on Big Data. Drill is particularly well-suited to situations that require low-latency performance, including interactive query environments (OLAP, self-service BI, data visualization), investigative analytics (data science/exploration), and Day Zero analytics on near real-time data. It enables efficient analytics operations across a range of data sources and formats, including JSON, Parquet, and HBase tables.
Drill's efficiency across multiple use cases comes in great part from its architecture. Drill is built on hierarchically‐organized modules called drillbits, which are responsible for executing SQL statements. A drillbit is installed on each node that holds data, and is capable of executing SQL queries on the data that it manages. When data is stored across many nodes, all applicable drillbits process the query, parallelizing its execution. Applications accessing Drill are "connected" to different drillbits, avoiding availability bottlenecks and ensuring data locality.
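The scatter-gather pattern described above can be illustrated with a toy sketch. This is not Drill's implementation: the node names, the sample rows, and the `run_on_node` and `run_query` helpers are all hypothetical, and threads stand in for separate machines. Each simulated node computes a partial aggregate over the rows it holds locally, and a coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data layout: each "node" holds its own slice of call records.
NODE_DATA = {
    "node-a": [{"region": "east", "calls": 3}, {"region": "west", "calls": 1}],
    "node-b": [{"region": "east", "calls": 2}],
    "node-c": [{"region": "west", "calls": 4}, {"region": "north", "calls": 5}],
}


def run_on_node(node):
    """Partial aggregate: SUM(calls) GROUP BY region over one node's local rows."""
    partial = Counter()
    for row in NODE_DATA[node]:
        partial[row["region"]] += row["calls"]
    return partial


def run_query(nodes):
    """Fan the query out to every node in parallel, then merge the partial sums."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for partial in pool.map(run_on_node, nodes):
            total.update(partial)  # Counter.update adds counts together
    return dict(total)


if __name__ == "__main__":
    print(run_query(list(NODE_DATA)))
```

The point of the sketch is that only the small partial results travel to the coordinator, while the raw rows stay on the node that stores them.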
Self-Service Data Exploration On-Demand
Drill is the only SQL engine for Hadoop that doesn't demand schemas to be created and maintained, or data to be transformed, before it can be queried. Data analysts can query data in its native formats, including nested data, self-describing data, and data with dynamic schemas. There is no need to explicitly define and maintain schemas; Drill can automatically leverage the structure embedded in the data. Self-service data exploration is finally a reality. Data can be worked with immediately upon its arrival, with no need to prepare a schema. Analysts can change and expand their data sources on the fly without waiting for IT services to structure newly requested data.
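As a rough illustration of querying a file in place, the sketch below prepares a SQL query against a JSON file and packages it for Drill's REST interface. Several things here are assumptions rather than details from this article: a Drill instance reachable at `localhost` on port 8047, a `dfs` storage plugin, and a hypothetical file `/data/orders.json` with a nested `customer` object and an `amount` field. No table definition or schema is declared anywhere; the query names the file directly.

```python
import json
import urllib.request

# Assumption: a Drill instance with its REST interface on port 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical file path and field names. The table alias `t` lets the
# query reach into the nested `customer` object.
SQL = (
    "SELECT t.customer.name AS customer_name, t.amount "
    "FROM dfs.`/data/orders.json` t "
    "WHERE t.amount > 100"
)


def build_request(sql, url=DRILL_URL):
    """Build (but do not send) an HTTP POST carrying a SQL query for Drill."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the request; call run_query(SQL) against a live Drill.
    req = build_request(SQL)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request is kept separate from sending it so the payload can be inspected without a running cluster.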
Analysts can also leverage their existing SQL skills and BI tools to directly query self-describing data and process complex data types. Of course, Hadoop hasn't lacked for SQL or SQL-comparable solutions, but many were designed from a historical perspective, reengineering old-school tools for Big Data usage. These projects filled a real need, but solutions must now be built to support the myriad data-producing sources we utilize, as well as the ways that we transform Big Data into actionable intelligence.
Drill has been tested by the open source community - and it was designed to be extensible. New data sources, new file formats, new operators, and new query languages can be easily added via new user‐defined functions or custom-created storage plugins for traditional data sources.
Drill: The Future of Big Data Exploration
Apache Drill was initially inspired by Google's Dremel project, and the open source community has worked hard to make Drill the ideal interactive SQL engine for Hadoop. The success of these efforts was recently acknowledged officially by the Apache Software Foundation, which announced in December 2014 that it had promoted Drill to a top-level project at Apache.
As a top-level project, Drill joins other illustrious projects such as Apache Hadoop and httpd (the world's most popular Web server). Drill now has its own project management committee, and users can be confident that the project has proven itself, has a viable roadmap for development, and can be deployed for mission-critical use in the long term.
If you're ready to test-drive Drill, you can do so using the MapR Sandbox for Hadoop, which runs on PC, Mac, and Linux platforms. MapR Technologies is the provider of the top-ranked distribution for Apache Hadoop.
You can also view a tutorial on analyzing real-world data using Apache Drill.