April 29, 2024

Partitioning and bucketing S3 data

Let's assume we need to optimize the following query. The data is stored in S3 and amounts to a few TB:

SELECT * FROM "bucketing_blog"."noaa_remote_original"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

There are around 14,325 unique stations, and the report_type column contains around 13 distinct values.
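To verify these cardinalities yourself before choosing the layout (low-cardinality columns are partition candidates, high-cardinality columns are bucketing candidates), a quick aggregation over the source table does the job. A minimal sketch, assuming the same table and columns:

SELECT
    COUNT(DISTINCT report_type) AS report_types, -- ~13: good partition key
    COUNT(DISTINCT station)     AS stations      -- ~14,325: good bucketing key
FROM "bucketing_blog"."noaa_remote_original";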

1) The first thing to try is to create a table partitioned by report_type, since the query filters on it and it only has around 13 distinct values. (Note that with Athena CTAS the partition column must be the last column in the SELECT list, so SELECT * assumes report_type is the last column of the source table.)

CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-non-bucketed/',
    partitioned_by = ARRAY['report_type'],
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";
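The CTAS statement above registers the partitions automatically, so the original query can be pointed at the new table as-is. Because the query filters on report_type, Athena prunes down to the single CRN05 partition and scans only its files. A sketch, with the OR chain rewritten as an equivalent IN list:

SELECT * FROM "bucketing_blog"."athena_non_bucketed"
WHERE
    report_type = 'CRN05' -- partition pruning: only this partition's files are scanned
    AND station IN (
        '99999904237',
        '99999953132',
        '99999903061',
        '99999963856',
        '99999994644'
    );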

2) If you need faster and cheaper queries, combine partitioning with bucketing (by station):

CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-bucketed/',
    partitioned_by = ARRAY['report_type'],
    bucketed_by = ARRAY['station'],
    bucket_count = 16,
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";

I created 16 buckets because that is the maximum number of stations expected to appear in a single query.
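Within each report_type partition, Athena assigns a row to a bucket file by hashing its station value modulo the bucket count, so a query filtering on 5 stations reads at most 5 of the 16 bucket files in the CRN05 partition instead of all of them. A sketch against the bucketed table:

SELECT * FROM "bucketing_blog"."athena_bucketed"
WHERE
    report_type = 'CRN05' -- partition pruning on report_type
    AND station IN (      -- bucket pruning on station: at most 5 of 16 bucket files are read
        '99999904237',
        '99999953132',
        '99999903061',
        '99999963856',
        '99999994644'
    );

To see which files were actually read, Athena's "$path" pseudo-column can help, e.g. SELECT DISTINCT "$path" FROM "bucketing_blog"."athena_bucketed" WHERE report_type = 'CRN05' AND station = '99999904237';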

Source: https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/
