Mylast article identified how certain types of record formatting problems can lead to corrupt inaccurate data when files are imported into Hadoop. In this post I’ll explore why Hadoop alone struggles to address these types of dirty data problems upon ingest. We’ll also look at how Podium Data overcomes these issues, under the covers.
Let’s start with a very simple example. In this case, assume we have a table with three columns of human resources data – first name, role, and tenure – and three rows representing three people – John, Jane and Jim. The delimiter between fields in the record is a comma, while the end of record marker is a new line character.
We start by moving this data into HDFS and registering this file with Hive. When we do this and Hive creates a table view of this data, Hive will correctly process the first record, showing that John is an SVP with 12 years of tenure.
However when Jane ’s record is processed Hive will not understand the double quoting schema around the role field in her record and will treat the two parts of her role as different fields. Her title will be recorded as VP and her tenure will be listed as AD . Her actual tenure date of 4 will be discarded.
Jim’s record presents a potentially more severe problem in that it contains an embedded end of record delimiter, a new line character, in the Role field. When Hive reads in Jim’s record, it will interpret this as two records. The first record captures Jim’s name and role but misses his tenure. The second record lists “ in the name field and 10 as the role and contains no data in the tenure field.
Clearly this is an erroneous view of the data. Moreover, when users run a Hive query against this table and try to average up the tenure, Hive will convert any of this bad data to nulls, and give the user an incorrect result, which will certainly not equal the average of 12 , 4 and 10 . In addition, the embedded record delimiters causes Hive to incorrectly calculate the number of records in the file.
If you are manually loading data into Hadoop, to fix these problems you would need to first write custom scripts and generally do sequential processing to clean up any record formatting issues before you could load the data into Hadoop.
Why can’t these problems be fixed in Hadoop? The answer lies with the way Hadoop stores its data for parallel processing. Every large file is broken into fixed-size blocks and stored on different data nodes in the cluster. When a job executes, each data node processes the data stored on it. Hadoop’s RecordReader knows that the last record in a block is actually split between that block and the next one, and it has special logic to recombine the record from the two blocks. Using our example, the last record being split across two blocks looks like this:
When this kind of split occurs in a record with extra embedded field or record delimiters, HDFS’s RecordReader logic will look for the first record delimiter to start processing its block. In our example, HDFS thinks the new line character embedded in double quotes in Jim ’s record is the end of that record, which is not correct.
To be able to handle this reliably and correctly, Podium Data developed a record reading process that is much more sophisticated. Our solution needed to detect and handle embedded enclosures and other types record formatting problems so that all files—regardless of quality—could be processed in parallel and accurately imported and rendered in Hadoop.
This enables Podium Data’s enterprise data lake management platform to load, manage, properly process and convert data with these kinds of structures into exactly the right answers in a single pass of the file using massively parallel processes.
Podium’s ability to handle these problems automatically is particularly important when importing data from enterprise systems where there is a business requirement that absolutely all records are accurately ingested into Hadoop and included in subsequent reporting and analysis. Those same enterprise systems, many of which include decades of accumulated transactional, customer or financial data are also some of the most likely systems to contain dirty data formatting problems as discussed here.
In our next post, I’ll explore another aspect of data quality issues in an enterprise data lake: the good, the bad and the ugly of data records on ingest.