A Billion Taxi Rides: AWS S3 versus HDFS

Datetime: 2016-08-23 01:47:32          Topic: HDFS

In 2014, Saggi Neumann published an excellent article comparing the merits of AWS S3 and HDFS. He pointed out that he'd seen better throughput with HDFS on ephemeral storage than with S3. In this article I'll see how much faster queries run via Presto on a small EMR cluster are against HDFS-stored data versus S3-stored data.

AWS CLI Up and Running

All the following commands were run on a fresh install of Ubuntu 14.04.3.

To start, I'll install the AWS CLI tool and a few dependencies it needs to run.

$ sudo apt-get update
$ sudo apt-get -y install \
    python-pip \
    python-virtualenv
$ virtualenv amazon
$ source amazon/bin/activate
$ pip install awscli

I'll then enter my AWS credentials.

$ read AWS_ACCESS_KEY_ID
$ read AWS_SECRET_ACCESS_KEY
$ export AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY
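
I'll also set a default region for the CLI and run a quick listing to check that the credentials were accepted. This is just a sanity check, assuming the keys have permission to list S3 buckets.

$ export AWS_DEFAULT_REGION=eu-west-1
$ aws s3 ls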

5-node EMR Cluster Up and Running

I'll be launching a 5-node cluster of m3.xlarge instances using the 4.5.0 AMI. This AMI comes with Presto 0.140 and Hive 1.0.0. For a detailed run-down on the parameters provided below please see these notes.

$ aws emr create-cluster \
      --applications \
        Name=Hadoop \
        Name=Hive \
        Name=Presto-Sandbox \
      --ec2-attributes '{
          "KeyName": "emr",
          "InstanceProfile": "EMR_EC2_DefaultRole",
          "AvailabilityZone": "eu-west-1a",
          "EmrManagedSlaveSecurityGroup": "sg-89cd3eff",
          "EmrManagedMasterSecurityGroup": "sg-d4cc3fa2"
      }' \
      --service-role EMR_DefaultRole \
      --release-label emr-4.5.0 \
      --log-uri 's3n://aws-logs-591231097547-eu-west-1/elasticmapreduce/' \
      --name 'A Billion Taxi Trips' \
      --instance-groups '[{
          "InstanceCount": 2,
          "BidPrice": "0.048",
          "InstanceGroupType": "CORE",
          "InstanceType": "m3.xlarge",
          "Name": "Core instance group - 2"
      }, {
          "InstanceCount": 2,
          "BidPrice": "0.048",
          "InstanceGroupType": "TASK",
          "InstanceType": "m3.xlarge",
          "Name": "Task instance group - 3"
      }, {
          "InstanceCount": 1,
          "InstanceGroupType": "MASTER",
          "InstanceType": "m3.xlarge",
          "Name": "Master instance group - 1"
      }]' \
      --region eu-west-1
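
The create-cluster command prints the identifier of the new cluster. Rather than guessing when provisioning has finished, the cluster state can be polled; the following is a sketch where j-XXXXXXXXXXXXX stands in for the cluster ID returned above.

$ aws emr describe-cluster \
      --cluster-id j-XXXXXXXXXXXXX \
      --region eu-west-1 \
      --query 'Cluster.Status.State' \
      --output text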

After 20 minutes the machines had all been provisioned and bootstrapped, and I was able to SSH into the master node.

$ ssh -i ~/.ssh/emr.pem \
      hadoop@ec2-54-155-155-115.eu-west-1.compute.amazonaws.com

48 GB of Data on S3 and HDFS

I'll be using the 192 compressed ORC files I generated in my Billion Taxi Rides on Amazon EMR running Presto blog post. I keep these files stored on S3 and have used them for various benchmarks in the past.

$ aws s3 ls s3://<s3_bucket>/orc/
2016-03-14 13:54:41  398631347 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000000
2016-03-14 13:54:40  393489828 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000001
...
2016-03-14 14:00:23  265076102 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000190
2016-03-14 14:00:12  115110402 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000191
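
To see how much data that is in total, the CLI can summarise the listing. A quick check, with <s3_bucket> standing in for the real bucket name as above:

$ aws s3 ls s3://<s3_bucket>/orc/ \
    --recursive \
    --human-readable \
    --summarize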

I'll execute distcp to copy these files into HDFS.

$ hadoop distcp s3://<s3_bucket>/orc/ /

The above took 5 minutes and 16 seconds to complete. Not a bad overhead in my opinion.
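
To sanity-check the copy, the aggregate size of /orc/ on HDFS can be compared with the S3 listing:

$ hdfs dfs -du -s -h /orc/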

The following are the counters reported.

File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=2803656
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=46506
        HDFS: Number of bytes written=52457486697
        HDFS: Number of read operations=1521
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=430
        S3: Number of bytes read=52457486697
        S3: Number of bytes written=0
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
Job Counters
        Launched map tasks=22
        Other local map tasks=22
        Total time spent by all maps in occupied slots (ms)=159285792
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=4977681
        Total vcore-milliseconds taken by all map tasks=4977681
        Total megabyte-milliseconds taken by all map tasks=5097145344
Map-Reduce Framework
        Map input records=193
        Map output records=0
        Input split bytes=3014
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=17274
        CPU time spent (ms)=1436840
        Physical memory (bytes) snapshot=8793264128
        Virtual memory (bytes) snapshot=45056073728
        Total committed heap usage (bytes)=6723469312
File Input Format Counters
        Bytes Read=43492
File Output Format Counters
        Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESCOPIED=52457486697
        BYTESEXPECTED=52457486697
        COPY=193

After the files were copied to HDFS I made sure they were all in place. The output below is abbreviated to show just the file paths and their respective sizes.

$ hdfs dfs -ls /orc/
Found 192 items
...  398631347 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000000
...  393489828 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000001
...
...  265076102 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000190
...  115110402 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000191

Creating Tables in the Hive Metastore

I'm going to create two tables using Hive to represent both data locations. trips_orc_s3 will point to the data on S3 and trips_orc_hdfs will point to the data on HDFS.

$ hive
CREATE EXTERNAL TABLE trips_orc_s3 (
    trip_id                 INT,
    vendor_id               STRING,
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      STRING,
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE,
    pickup_latitude         DOUBLE,
    dropoff_longitude       DOUBLE,
    dropoff_latitude        DOUBLE,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    ehail_fee               DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    payment_type            STRING,
    trip_type               SMALLINT,
    pickup                  STRING,
    dropoff                 STRING,

    cab_type                STRING,

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          STRING,
    pickup_borocode         SMALLINT,
    pickup_boroname         STRING,
    pickup_ct2010           STRING,
    pickup_boroct2010       STRING,
    pickup_cdeligibil       STRING,
    pickup_ntacode          STRING,
    pickup_ntaname          STRING,
    pickup_puma             STRING,

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         STRING,
    dropoff_borocode        SMALLINT,
    dropoff_boroname        STRING,
    dropoff_ct2010          STRING,
    dropoff_boroct2010      STRING,
    dropoff_cdeligibil      STRING,
    dropoff_ntacode         STRING,
    dropoff_ntaname         STRING,
    dropoff_puma            STRING
) STORED AS orc
  LOCATION 's3://<s3_bucket>/orc/';
CREATE EXTERNAL TABLE trips_orc_hdfs (
    trip_id                 INT,
    vendor_id               STRING,
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      STRING,
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE,
    pickup_latitude         DOUBLE,
    dropoff_longitude       DOUBLE,
    dropoff_latitude        DOUBLE,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    ehail_fee               DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    payment_type            STRING,
    trip_type               SMALLINT,
    pickup                  STRING,
    dropoff                 STRING,

    cab_type                STRING,

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          STRING,
    pickup_borocode         SMALLINT,
    pickup_boroname         STRING,
    pickup_ct2010           STRING,
    pickup_boroct2010       STRING,
    pickup_cdeligibil       STRING,
    pickup_ntacode          STRING,
    pickup_ntaname          STRING,
    pickup_puma             STRING,

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         STRING,
    dropoff_borocode        SMALLINT,
    dropoff_boroname        STRING,
    dropoff_ct2010          STRING,
    dropoff_boroct2010      STRING,
    dropoff_cdeligibil      STRING,
    dropoff_ntacode         STRING,
    dropoff_ntaname         STRING,
    dropoff_puma            STRING
) STORED AS orc
  LOCATION '/orc/';
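
Both tables are external, so Hive only records their metadata and leaves the files where they are. A quick way to confirm the definitions registered (DESCRIBE FORMATTED also echoes back each table's LOCATION):

SHOW TABLES;
DESCRIBE FORMATTED trips_orc_s3;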

Benchmarking S3

I'll be using Presto 0.140 to benchmark both the S3- and HDFS-based data. The following shows the query times I saw for the S3-based data.

$ presto-cli \
    --catalog hive \
    --schema default

The following completed in 1 minute and 1 second.

SELECT cab_type,
       count(*)
FROM trips_orc_s3
GROUP BY cab_type;
Query 20160414_110943_00006_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
1:01 [1.11B rows, 48.8GB] [18.3M rows/s, 821MB/s]

The following completed in 58 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips_orc_s3
GROUP BY passenger_count;
Query 20160414_111100_00007_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
0:58 [1.11B rows, 48.8GB] [19.1M rows/s, 859MB/s]

The following completed in 1 minute and 43 seconds.

SELECT passenger_count,
       year(pickup_datetime),
       count(*)
FROM trips_orc_s3
GROUP BY passenger_count,
         year(pickup_datetime);
Query 20160414_111213_00008_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
1:43 [1.11B rows, 48.8GB] [10.8M rows/s, 484MB/s]

The following completed in 1 minute and 41 seconds.

SELECT passenger_count,
       year(pickup_datetime) trip_year,
       round(trip_distance),
       count(*) trips
FROM trips_orc_s3
GROUP BY passenger_count,
         year(pickup_datetime),
         round(trip_distance)
ORDER BY trip_year,
         trips desc;
Query 20160414_111407_00009_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
1:41 [1.11B rows, 48.8GB] [11.1M rows/s, 497MB/s]

Benchmarking HDFS

The following shows the query times I saw for the HDFS-based data.

The following completed in 35 seconds (1.75x faster than S3).

SELECT cab_type,
       count(*)
FROM trips_orc_hdfs
GROUP BY cab_type;
Query 20160414_110514_00002_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
0:35 [1.11B rows, 48.8GB] [31.9M rows/s, 1.4GB/s]

The following completed in 39 seconds (1.5x faster than S3).

SELECT passenger_count,
       avg(total_amount)
FROM trips_orc_hdfs
GROUP BY passenger_count;
Query 20160414_110607_00003_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
0:39 [1.11B rows, 48.8GB] [28.2M rows/s, 1.24GB/s]

The following completed in 1 minute and 4 seconds (1.6x faster than S3).

SELECT passenger_count,
       year(pickup_datetime),
       count(*)
FROM trips_orc_hdfs
GROUP BY passenger_count,
         year(pickup_datetime);
Query 20160414_110658_00004_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
1:04 [1.11B rows, 48.8GB] [17.5M rows/s, 786MB/s]

The following completed in 1 minute and 21 seconds (1.25x faster than S3).

SELECT passenger_count,
       year(pickup_datetime) trip_year,
       round(trip_distance),
       count(*) trips
FROM trips_orc_hdfs
GROUP BY passenger_count,
         year(pickup_datetime),
         round(trip_distance)
ORDER BY trip_year,
         trips desc;
Query 20160414_110810_00005_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
1:21 [1.11B rows, 48.8GB] [13.7M rows/s, 617MB/s]

Though the speed improvements using HDFS are considerable, S3 did perform pretty well. At worst there's a 1.75x overhead, in exchange for virtually unlimited scalability, 11 9's of durability and never having to worry about over- or under-provisioning storage space.




