April 29, 2024

Partitioning and bucketing S3 data

Let's assume we need to optimize the following query. The data is stored in S3 and amounts to a few TB:

SELECT * FROM "bucketing_blog"."noaa_remote_original"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

There are around 14,325 unique stations, and the report_type column contains around 13 distinct values.
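To verify these cardinalities yourself before choosing the layout (low-cardinality columns are partition candidates, high-cardinality columns are bucketing candidates), a quick aggregation over the source table does the job. A minimal sketch, assuming the same table and columns:

SELECT
    COUNT(DISTINCT report_type) AS report_types, -- ~13: good partition key
    COUNT(DISTINCT station)     AS stations      -- ~14,325: good bucketing key
FROM "bucketing_blog"."noaa_remote_original";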

1) The first thing to try is to create a table partitioned by report_type, since the query filters on it and it only has around 13 distinct values. (Note that with Athena CTAS the partition column must be the last column in the SELECT list, so SELECT * assumes report_type is the last column of the source table.)

CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-non-bucketed/',
    partitioned_by = ARRAY['report_type'],
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";
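The CTAS statement above registers the partitions automatically, so the original query can be pointed at the new table as-is. Because the query filters on report_type, Athena prunes down to the single CRN05 partition and scans only its files. A sketch, with the OR chain rewritten as an equivalent IN list:

SELECT * FROM "bucketing_blog"."athena_non_bucketed"
WHERE
    report_type = 'CRN05' -- partition pruning: only this partition's files are scanned
    AND station IN (
        '99999904237',
        '99999953132',
        '99999903061',
        '99999963856',
        '99999994644'
    );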

2) If you need faster and cheaper queries, combine partitioning with bucketing (by station):

CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-bucketed/',
    partitioned_by = ARRAY['report_type'],
    bucketed_by = ARRAY['station'],
    bucket_count = 16,
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";

I created 16 buckets because that is the maximum number of stations expected to appear in a single query.
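Within each report_type partition, Athena assigns a row to a bucket file by hashing its station value modulo the bucket count, so a query filtering on 5 stations reads at most 5 of the 16 bucket files in the CRN05 partition instead of all of them. A sketch against the bucketed table:

SELECT * FROM "bucketing_blog"."athena_bucketed"
WHERE
    report_type = 'CRN05' -- partition pruning on report_type
    AND station IN (      -- bucket pruning on station: at most 5 of 16 bucket files are read
        '99999904237',
        '99999953132',
        '99999903061',
        '99999963856',
        '99999994644'
    );

To see which files were actually read, Athena's "$path" pseudo-column can help, e.g. SELECT DISTINCT "$path" FROM "bucketing_blog"."athena_bucketed" WHERE report_type = 'CRN05' AND station = '99999904237';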

Source: https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/
