Suppose we need to optimize the following query. The data is stored in S3 (around a few TB):
SELECT * FROM "bucketing_blog"."noaa_remote_original"
WHERE report_type = 'CRN05'
  AND station IN (
      '99999904237',
      '99999953132',
      '99999903061',
      '99999963856',
      '99999994644'
  );
There are around 14,325 unique stations, and the report_type column contains around 13 distinct values.
1) The first thing to try is creating a table partitioned by report_type:
CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-non-bucketed/',
partitioned_by = ARRAY['report_type'],
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";
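With partitioning in place, Athena can prune by report_type before reading any data: each partition lives under its own S3 prefix (e.g. .../report_type=CRN05/), so a query that filters on the partition column scans only that prefix instead of the whole dataset. A minimal sketch against the table created above:

```sql
-- Only the report_type=CRN05 partition prefix is scanned;
-- the other ~12 partitions are skipped entirely.
SELECT *
FROM "bucketing_blog"."athena_non_bucketed"
WHERE report_type = 'CRN05';
```

Note that partitioning alone does not help the station predicate: within the CRN05 partition, every file still has to be read.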
2) If you need faster and cheaper queries, add bucketing on the station column:
CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-bucketed/',
partitioned_by = ARRAY['report_type'],
bucketed_by = ARRAY['station'],
bucket_count = 16,
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";
I created 16 buckets because that comfortably covers the maximum number of stations expected in a single query. More generally, choose a bucket count that keeps each bucket's files at a size Athena reads efficiently (roughly 128 MB or more per file), since too many buckets produces many small files.
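Re-running the original query against the bucketed table lets Athena prune twice: the report_type partition narrows the scan to one prefix, and each station value hashes to exactly one of the 16 bucket files, so at most five of the 16 files in that partition are read (the hash-to-bucket mapping is handled internally by Athena's Hive-compatible bucketing):

```sql
-- Partition pruning (report_type) plus bucket pruning (station):
-- at most 5 of the 16 bucket files under report_type=CRN05 are scanned.
SELECT *
FROM "bucketing_blog"."athena_bucketed"
WHERE report_type = 'CRN05'
  AND station IN (
      '99999904237',
      '99999953132',
      '99999903061',
      '99999963856',
      '99999994644'
  );
```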
https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/