Query escaped JSON string in Hive

Datetime:2016-08-22 21:50:00          Topic: Hive           Share

There are times when we want to parse a string that is actually a JSON. Usually that could be done with built in functions of Hive such as get_json_object(). Though get_json_object cannot parse JSON Array from my experience. These array needs to be exploded first using explode() and then use get_json_object on each element (JSON) of the exploded Array.

The one option that was left to me was to convert that JSON object Array to Hive Array.

  • We will be using Hive JSON Serde from – https://github.com/rcongiu/Hive-JSON-Serde
  • You can get help setting it up from here .

Below is a sample JSON for our exercise.

JSON Sample, having Message_json attribute containing a valid JSON Array, and Message_string attribute containing similar array in string notation. We are talking about parsing such values that are “JSON array”, “escaped” and present in “string format”.

{
"Message_json": [
{
"Key": "GetService1", 
"ResponseTime": "12"
}, 
{
"Key": "ServiceOperation", 
"ResponseTime": "2"
}, 
{
"Key": "Service", 
"ResponseTime": "16"
}
], 
"Message_string": "[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"
}

To prep your hive environment, use the following commands –

Create sample JSON file on local file system that will be loaded in a Hive managed table.

[]$ echo '{"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}' > /tmp/s.json

In hive/beeline shell create the sample table

create table sample (
Message_json array<struct<key:string, ResponseTime:string>>,
Message_string string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

Load JSON data in sample table

LOAD DATA local INPATH '/tmp/s.json' OVERWRITE INTO TABLE sample;

Lets check with a basic select if everything is in order

0: jdbc:hive2://localhost:10000> select * from sample;
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+
| sample.message_json | sample.message_string |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+
| [{"key":"GetService1","responsetime":"12"},{"key":"ServiceOperation","responsetime":"2"},{"key":"Service","responsetime":"16"}] | [{"Key":"GetService2","ResponseTime":"12"},{"Key":"ServiceOperation","ResponseTime":"2"},{"Key":"Service","ResponseTime":"16"}] |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+--+
1 row selected (0.147 seconds)

Following is how we would process a JSON array in normal circumstances. Message_json in our case –

0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000> select val_json.Key, val_json.ResponseTime from sample
0: jdbc:hive2://localhost:10000> lateral view explode(Message_json) tbl_json as val_json;
+-------------------+---------------+--+
|        key        | responsetime  |
+-------------------+---------------+--+
| GetService1       | 12            |
| ServiceOperation  | 2             |
| Service           | 16            |
+-------------------+---------------+--+
3 rows selected (0.151 seconds)

Lets try that with the other attribute, Message_string –

0: jdbc:hive2://localhost:10000> select val_json.Key, val_json.ResponseTime from sample
0: jdbc:hive2://localhost:10000> lateral view explode(Message_string) tbl_json as val_json;
Error: Error while compiling statement: FAILED: UDFArgumentException explode() takes an array or a map as a parameter (state=42000,code=40000)

We get the error as we expected.

So lets take a look at the solution, that involves converting JSON Array represented in string to real Hive array.

0: jdbc:hive2://localhost:10000> select get_json_object(val_json,'$.Key') as key, get_json_object(val_json,'$.ResponseTime') as ResponseTime from sample
0: jdbc:hive2://localhost:10000> lateral view explode(split(regexp_replace(regexp_replace(Message_string,'\\}\\,\\{','\\}\\#\\{'),'\\[|\\]',''),'\\#')) tbl_json as val_json;
+-------------------+---------------+--+
|        key        | responsetime  |
+-------------------+---------------+--+
| GetService2       | 12            |
| ServiceOperation  | 2             |
| Service           | 16            |
+-------------------+---------------+--+
3 rows selected (0.156 seconds)

All commands are below for easy executing –

-- sample JSON: {"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}
-- prep sample: echo '{"Message_json":[{"ResponseTime":"12","Key":"GetService1"},{"ResponseTime":"2","Key":"ServiceOperation"},{"ResponseTime":"16","Key":"Service"}],"Message_string":"[{\"Key\":\"GetService2\",\"ResponseTime\":\"12\"},{\"Key\":\"ServiceOperation\",\"ResponseTime\":\"2\"},{\"Key\":\"Service\",\"ResponseTime\":\"16\"}]"}' > /tmp/s.json
drop table sample;
create table sample (
Message_json array<struct<key:string, ResponseTime:string>>,
Message_string string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA local INPATH '/tmp/s.json' OVERWRITE INTO TABLE  sample;
--OKAY: Message_json
select val_json.Key, val_json.ResponseTime
from sample
lateral view explode(Message_json) tbl_json as val_json;
--NOT OKAY: Message_string - with previous approach
select val_json.Key, val_json.ResponseTime
from sample
lateral view explode(Message_string) tbl_json as val_json;
--SOLUTION: Message_string - conversion of string to array
select get_json_object(val_json,'$.Key') as key, get_json_object(val_json,'$.ResponseTime') as ResponseTime
from sample
lateral view explode(split(regexp_replace(regexp_replace(Message_string,'\\}\\,\\{','\\}\\#\\{'),'\\[|\\]',''),'\\#')) tbl_json as val_json;




About List