Shantanu's Blog
Database Consultant
July 01, 2025
Disable DynamoDB table
I have a DynamoDB table called "autocorrect" and I need to stop all traffic to/from this table. I can use a resource-based policy like this...
{
"Version": "2012-10-17",
"Id": "PolicyId",
"Statement": [
{
"Sid": "AccessDisabledTemporary",
"Effect": "Deny",
"Principal": "*",
"Action": "dynamodb:*",
"Resource": "arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect"
}
]
}
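To attach this policy to the table, a minimal boto3 sketch could look like the following. This assumes a boto3 release recent enough to expose DynamoDB's PutResourcePolicy API, and that the policy above has been saved as policy.json; the masked account ID is the one from the policy.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# read the deny-all policy shown above and attach it to the table
with open("policy.json") as f:
    policy_document = f.read()

dynamodb.put_resource_policy(
    ResourceArn="arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect",
    Policy=policy_document,
)
The policy can be removed later (DeleteResourcePolicy) to restore normal access to the table.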
Labels: aws
June 02, 2025
Manage AWS resources using command line
1) Add the access key and secret key of a read-only user:
aws configure
2) Install Amazon Q using the instructions found on this page...
https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing-ssh-setup-autocomplete.html
3) I can now start the Q CLI using the command...
./q/bin/q
Use natural-language instructions like "list all DynamoDB tables" or "list all S3 buckets".
4) Advanced users can create an MCP server and save database credentials (username and password) so that Q can query the database and return results.
https://awslabs.github.io/mcp/servers/dynamodb-mcp-server/
Labels: aws, usability
December 03, 2024
Avoid uploading a file to S3 again
Let's assume I uploaded a file to S3:
aws s3 cp dictionaries.xcu s3://cf-templates-us-east-1/
I need to upload the file only if it does not already exist. In that case I can use the --if-none-match parameter as shown below:
aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body dictionaries.xcu --if-none-match "*"
Returns "An error occurred (PreconditionFailed)"
This will help me while trying to upload a large file again.
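The same conditional write can be done from Python; here is a minimal sketch, assuming a boto3 release new enough to support conditional writes (the IfNoneMatch argument on put_object). The bucket and key are the ones used above.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

try:
    # the write succeeds only if the object does not already exist
    with open("dictionaries.xcu", "rb") as f:
        s3.put_object(
            Bucket="cf-templates-us-east-1",
            Key="dictionaries.xcu",
            Body=f,
            IfNoneMatch="*",
        )
except ClientError as e:
    # an existing object results in a 412 PreconditionFailed error
    print(e.response["Error"]["Code"])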
_____
The following features are available for S3 Express One Zone:
1) In directory buckets, clients can perform conditional delete checks on an object's last-modified time, size, and ETag using the x-amz-if-match-last-modified-time, x-amz-if-match-size, and HTTP If-Match headers.
2) Append data to an existing object. The --write-offset-bytes value must be an integer equal to the current size of the object, and the body is the data to append (file001.bin here):
aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body file001.bin --write-offset-bytes <current-object-size>
Or use python:
s3.put_object(Bucket='amzn-s3-demo-bucket--use2-az2--x-s3', Key='2024-11-05-sdk-test', Body=b'123456789', WriteOffsetBytes=9)
This cannot replace your database or messaging queues, because only a few thousand appends are possible per object.
3) You can configure S3 Lifecycle rules for S3 Express One Zone to expire objects on your behalf. For example, you can create an S3 Lifecycle rule that expires all objects smaller than 512 KB after 3 days and another rule that expires all objects in a prefix after 10 days.
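As a rough sketch of such rules from Python: the following assumes put_bucket_lifecycle_configuration works the same way for directory buckets, reuses the demo bucket name from the append example above, and makes up the rule IDs and prefix.
import boto3

s3 = boto3.client("s3")

# expire objects smaller than 512 KB after 3 days,
# and everything under the "tmp/" prefix after 10 days
s3.put_bucket_lifecycle_configuration(
    Bucket="amzn-s3-demo-bucket--use2-az2--x-s3",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-small-objects",
                "Status": "Enabled",
                "Filter": {"ObjectSizeLessThan": 512 * 1024},
                "Expiration": {"Days": 3},
            },
            {
                "ID": "expire-tmp-prefix",
                "Status": "Enabled",
                "Filter": {"Prefix": "tmp/"},
                "Expiration": {"Days": 10},
            },
        ]
    },
)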
Labels: aws
September 28, 2024
Firefox and Libreoffice in your browser
Kasm VNC is a modern open source VNC server.
Quickly connect to your Linux server's desktop from any web browser. No client software install required.
1) Firefox using VNC
docker run -d \
  --name=firefox \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=Etc/UTC \
  -p 3000:3000 \
  -p 3001:3001 \
  -v /path/to/config2:/config \
  --shm-size="1gb" \
  --restart unless-stopped \
  lscr.io/linuxserver/firefox:latest
2) Libreoffice using VNC
docker run -d \
  --name=libreoffice \
  --security-opt seccomp=unconfined `#optional` \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=Etc/UTC \
  -p 3000:3000 \
  -p 3001:3001 \
  -v /path/to/config:/config \
  --restart unless-stopped \
  lscr.io/linuxserver/libreoffice:latest
Labels: aws, docker, firefox, libreoffice, linux tips, usability
June 27, 2024
Disable DynamoDB table access
I can disable all access to a DynamoDB table using a resource-based policy. Here is an example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": { "AWS": "*" },
      "Action": "dynamodb:*",
      "Resource": "arn:aws:dynamodb:us-east-1:XXX885053566:table/sandhiDupe"
    }
  ]
}
There are many other advantages to managing access at the resource level.
Labels: aws, aws_lambda
April 29, 2024
Partitioning and bucketing S3 data
Let's assume we need to optimize this query. The data is stored in S3 (a few TB):
SELECT * FROM "bucketing_blog"."noaa_remote_original"
WHERE
report_type = 'CRN05'
AND ( station = '99999904237'
OR station = '99999953132'
OR station = '99999903061'
OR station = '99999963856'
OR station = '99999994644'
);
There are around 14325 unique stations, and the report_type column contains around 13 distinct types.
1) The first thing to try is creating a table partitioned by report_type:
CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-non-bucketed/',
partitioned_by = ARRAY['report_type'],
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";
2) If you need faster and cheaper queries, add bucketing on top of partitioning:
CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-bucketed/',
partitioned_by = ARRAY['report_type'],
bucketed_by = ARRAY['station'],
bucket_count = 16,
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original"
I created 16 buckets because that is the maximum number of stations that may appear in the SQL query.
https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/
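To compare the two layouts from Python, the same query can be run against each table with awswrangler (the module used in a later post on this blog). The database and table names follow the example above; this is only a sketch.
import awswrangler as wr

stations = ['99999904237', '99999953132', '99999903061', '99999963856', '99999994644']
sql = """
SELECT * FROM "bucketing_blog"."athena_bucketed"
WHERE report_type = 'CRN05'
  AND station IN ({})
""".format(", ".join("'" + s + "'" for s in stations))

# read_sql_query returns a pandas DataFrame; the data scanned (and cost)
# can be compared in the Athena console for the bucketed vs. non-bucketed table
df = wr.athena.read_sql_query(sql, database="bucketing_blog")
print(len(df))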
Labels: athena, aws
September 27, 2023
Enable S3 versioning
It is highly recommended to enable versioning for all your S3 buckets. In case you delete a file by mistake, you can find the version ID of the older version and download that copy.
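Versioning itself can be turned on from Python; a minimal boto3 sketch, using the bucket name from the commands below:
import boto3

s3 = boto3.client("s3")

# turn on versioning for the bucket
s3.put_bucket_versioning(
    Bucket="cdk-hnb",
    VersioningConfiguration={"Status": "Enabled"},
)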
1) Find the version ID:
aws s3api list-object-versions --bucket cdk-hnb --prefix tt/archive.tar
2) Download that version:
aws s3api get-object --bucket cdk-hnb --key tt/archive.tar archive.tar --version-id mqfya
3) List all versions:
aws s3api list-object-versions --bucket cdk-hnb
If versioning is enabled, you cannot remove the bucket easily. You first need to "empty" the bucket (including all object versions) and then delete it.
Labels: aws
November 01, 2022
Check for open ports
This code will check if any security group has a port open to the world (0.0.0.0/0) and send an alert to the subscribers of an SNS topic.
import boto3, json

ec2 = boto3.client('ec2', region_name='us-east-1')
sns_client = boto3.client('sns', region_name='us-east-1')

# check every ingress rule of every security group for a CIDR open to the world
for security_group in ec2.describe_security_groups()['SecurityGroups']:
    for permission in security_group['IpPermissions']:
        for ip_range in permission.get('IpRanges', []):
            print(security_group['GroupId'], ip_range)
            if '0.0.0.0' in ip_range.get('CidrIp', ''):
                message = {"alert": "open port found"}
                response = sns_client.publish(
                    TargetArn='arn:aws:sns:us-east-1:102378362623:NotifyMe',
                    Message=json.dumps({'default': json.dumps(message)}),
                    MessageStructure='json')
You may need to change the region name and SNS topic ARN in the code above.
This code can be written as a Lambda function and scheduled to run every day.
Labels: aws, aws_lambda, boto
July 06, 2021
Quick test tables in Athena
If you want to quickly test some data using an Athena query, use the WITH syntax as shown below:
WITH countries(country_code) AS (VALUES 'pol', 'CAN', 'USA')
SELECT upper(country_code) AS country_code
FROM countries
In the next example, an aggregate expression uses the "OVER" clause (a window function):
WITH students_results(student_id, result) AS (VALUES
('student_1', 17),
('student_2', 16),
('student_3', 18),
('student_4', 18),
('student_5', 10),
('student_6', 20),
('student_7', 16))
SELECT
student_id,
result,
count(*) OVER (
ORDER BY result
) AS close_better_scores_count
FROM students_results
Labels: athena, aws
May 28, 2021
Athena and Unicode text
Athena supports Unicode characters very well. For example, if the data file looks like this...
"Root_word";"Word";"Primary";"Type";"Code";"Position";"Rule"
"अँटिबायोटिक","अँटिबायोटिक","अँटिबायोटिक","Primary","","",""
"अँटिबायोटिक","अँटिबायोटिकअंती","अँटिबायोटिक","Suffix","A","7293","001: 0 अंती ."
"अँटिबायोटिक","अँटिबायोटिकअर्थी","अँटिबायोटिक","Suffix","A","7293","002: 0 अर्थी ."
"अँटिबायोटिक","अँटिबायोटिकआतून","अँटिबायोटिक","Suffix","A","7293","003: 0 आतून ."
"अँटिबायोटिक","अँटिबायोटिकआतूनचा","अँटिबायोटिक","Suffix","A","7293","004: 0 आतूनचा ."
"अँटिबायोटिक","अँटिबायोटिकआतूनची","अँटिबायोटिक","Suffix","A","7293","005: 0 आतूनची ."
"अँटिबायोटिक","अँटिबायोटिकआतूनचे","अँटिबायोटिक","Suffix","A","7293","006: 0 आतूनचे ."
"अँटिबायोटिक","अँटिबायोटिकआतूनच्या","अँटिबायोटिक","Suffix","A","7293","007: 0 आतूनच्या ."
"अँटिबायोटिक","अँटिबायोटिकआतूनला","अँटिबायोटिक","Suffix","A","7293","008: 0 आतूनला ."
This create table statement is all I need...
create external table myptg (
root_word varchar(255),
derived_word varchar(255),
stemmed_word varchar(255),
type varchar(255),
code varchar(255),
position varchar(255),
rule varchar(255)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '\;',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ptg1/mc/'
TBLPROPERTIES ("skip.header.line.count"="1");
I can create a supporting table like this...
create external table gamabhana (derived_word varchar(255))
LOCATION 's3://ptg1/mc2/'
TBLPROPERTIES ("skip.header.line.count"="1");
A new table can be created using syntax something like this...
create external table anoop (
serial_number int,
root_word varchar(255),
stem1_word varchar(255),
stem2_word varchar(255),
stem3_word varchar(255),
stem4_word varchar(255),
stem5_word varchar(255),
stem6_word varchar(255),
stem7_word varchar(255)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ',',
'escape.delim' = '\\',
'line.delim' = '\n'
)
LOCATION 's3://ptg1/mc3/'
TBLPROPERTIES ("skip.header.line.count"="1");
And then run a join statement like this...
create table gamabhana_match as
select a.derived_word, b.root_word, b.stemmed_word, b.type, b.code, b.position, b.rule, c.stem1_word, c.stem2_word, c.stem3_word, c.stem4_word, c.stem5_word, c.stem6_word, c.stem7_word
from gamabhana as a left join myptg as b
on b.derived_word = a.derived_word
left join anoop as c
on c.derived_word = a.derived_word
It will scan around 2 GB of data (in this case) and the cost will be around 1 cent per query. This can also be done in MySQL, but importing the data and building indexes is not easy. Unlike Athena, MySQL allows unlimited queries for free!
Athena is a good fit for data that is important but rarely accessed.
Labels: athena, aws, usability
April 17, 2021
Crawl Indian language websites
CommonCrawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.
https://commoncrawl.org/
Here are 3 easy steps to download the data in any language, for example Marathi or Gujarati.
1) Clone the repo:
git clone https://github.com/qburst/common-crawl-malayalam.git
2) Change the language code from mal to mar at the line -
AND content_languages LIKE 'mal%'"
The other codes are: hin (Hindi), mal (Malayalam), mar (Marathi), guj (Gujarati)
Run this shell script after replacing XXX with your AWS access key and secret key.
./extract_malayalam_warcs.sh XXX XXX 2021-10 s3://nlp-malayalam
3) Change the Unicode range from Malayalam: range(3328, 3456) to your desired language range, for example Devanagari: range(2304, 2432) or Gujarati: range(2688, 2816). (See the sketch after these steps.)
Run this shell script after replacing XXX with your AWS access key and secret key.
./filter_malayalam.sh XXX XXX s3://nlp-malayalam/2021-10/warcs s3://nlp-malayalam/2021-10/output2
Filtered and unfiltered data will be available in the output2 sub-folder of the S3 bucket.
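As a rough illustration of what the range-based filter in step 3 does (this is not the repository's actual code, and the 50% threshold is my assumption), a line can be kept when most of its characters fall in the target Unicode block:
# keep a line if most of its characters are Devanagari, i.e. code points in range(2304, 2432)
DEVANAGARI = set(range(2304, 2432))

def mostly_devanagari(line, threshold=0.5):
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    hits = sum(1 for c in chars if ord(c) in DEVANAGARI)
    return hits / len(chars) >= threshold

print(mostly_devanagari("अँटिबायोटिक उपचार"))   # True
print(mostly_devanagari("plain english text"))   # False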
_____
Update:
The good folks from Sorbonne University and Inria have extracted sentences from Common Crawl for all languages and made them available at
https://oscar-corpus.com/
I have no words to thank them. The paper is available here...
http://corpora.ids-mannheim.de/CMLC7-final/CMLC-7_2019-Oritz_et_al.pdf
Without this corpus, it would have been very difficult to extract reliable data for languages like Marathi and Gujarati.
Labels: athena, aws, python, shell script
October 28, 2020
Using Lambda Functions as UDF's in Redshift
Let's assume I have a list of client codes saved in a Redshift table and I need to fetch their details from an API.
# select client_code from some_table limit 10;
client_code |
--------------+
1001 |
2002 |
9009 |
1009 |
1898 |
5465 |
3244 |
5576 |
4389 |
8756 |
(10 rows)
I need to get the client addresses from a website. For example, the first client code is 1001 and its address should come from
http://some_site.com/Details?dest=1001
This cannot be done at the SQL query level alone. You would normally loop through the codes using Python, PHP, Java, etc. But you can also write the logic as an AWS Lambda function and use it as a UDF (User Defined Function) in Redshift. For example:
# select client_code, client_details(client_code) as c_address from some_table limit 10;
client_code | c_address
--------------+---------------------------------------------
1001 | 21,Tamilnadu,
2002 | 14,Madhya Pradesh & Chattisgarh,
9009 | 7,Gujarat,
1009 | 23,Uttar Pradesh (W) & Uttarakhand
1898 | 11,Karnataka
5465 | 3,Bihar & Jharkhand
3244 | 11,Karnataka
5576 | 6,Delhi
4389 | 13,Kolkata
8756 | 11,Karnataka
(10 rows)
The code of "client_details" Lambda function will look something like this...
import json
import requests

myurl = 'http://some_site.com/Details?dest='

def lambda_handler(event, context):
    ret = dict()
    res = list()
    for argument in event['arguments']:
        try:
            number = str(argument[0])
            page = requests.get(myurl + number[-10:])
            res.append((page.content).decode('utf-8'))
            ret['success'] = True
        except Exception as e:
            res.append(None)
            ret['success'] = False
            ret['error_msg'] = str(e)
    ret['results'] = res
    return json.dumps(ret)
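For reference, Redshift invokes the function with a payload whose "arguments" key holds one inner list per row, and that is the only key the handler above reads. A hypothetical local test would look like this (the HTTP calls will of course fail unless the placeholder URL is replaced with a real one):
# hypothetical test event: two rows, one argument (the client code) per row
event = {"arguments": [["1001"], ["2002"]]}
print(lambda_handler(event, None))
# prints a JSON string with "success", "results" (one entry per row) and, on failure, "error_msg"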
Notes:
1) We are using the "requests" module in this code. Since it is not available in the default AWS Lambda environment, I have added it using this layer...
# Layer: arn:aws:lambda:us-east-1:770693421928:layer:Klayers-python38-requests:9
2) You will also need to increase the Lambda timeout, up to 15 minutes if necessary. The API may take longer than the default 3 seconds to respond.
3) You will also have to update the IAM role associated with your Redshift cluster (Actions - Manage Role). You can add the policy called "AWSLambdaFullAccess" or grant access to a single function as explained in the documentation.
The Lambda function needs to be "linked" to Redshift using a CREATE EXTERNAL FUNCTION statement like this...
CREATE OR REPLACE EXTERNAL FUNCTION client_details (number varchar )
RETURNS varchar STABLE
LAMBDA 'client_details'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyUnload';
You need to change the IAM role name and the 12-digit account ID mentioned above.
You can now use your Lambda function in your Redshift query, for example:
# select client_code, client_details(client_code) as c_address from some_table limit 10;
You can read more here:
https://aws.amazon.com/blogs/big-data/accessing-external-components-using-amazon-redshift-lambda-udfs/
Labels: aws, aws_lambda, redshift
October 24, 2020
Using Bucketing in Amazon Athena
To reduce the data scanned per query, Amazon Athena provides an option to bucket your data. This optimization technique can significantly reduce cost.
Like partitioning, columns that are frequently used to filter the data are good candidates for bucketing. However, unlike partitioning, with bucketing it’s better to use columns with high cardinality as a bucketing key. For example, Year and Month columns are good candidates for partition keys, whereas userID and sensorID are good examples of bucket keys. By doing this, you make sure that all buckets have a similar number of rows.
Bucketing is a technique that groups data based on specific columns together within a single partition. These columns are known as bucket keys. By grouping related data together into a single bucket (a file within a partition), you significantly reduce the amount of data scanned by Athena, thus improving query performance and reducing cost.
For example, imagine collecting and storing clickstream data. If you frequently filter or aggregate by Sensor ID, then within a single partition it’s better to store all rows for the same sensor together.
CREATE TABLE TargetTable
WITH (
format = 'PARQUET',
external_location = 's3://<s3_bucket_name>/curated/',
partitioned_by = ARRAY['dt'],
bucketed_by = ARRAY['sensorID'],
bucket_count = 3)
AS SELECT *
FROM SourceTable
You can run the select query like this:
select * from TargetTable where dt= '2020-08-04-21' and sensorID = '1096'
Labels: athena, aws
August 24, 2020
Manage Athena tables using python module wrangler
# create a boto session
import boto3
my_session = boto3.Session(
aws_access_key_id="xxx",
aws_secret_access_key="xxx",
region_name="us-east-1",
)
# use wrangler to upload parquet files to S3
# use the database created in glue
import awswrangler as wr
wr.s3.to_parquet(
df=df,
path="s3://test1623/noaa/",
dataset=True,
database="awswrangler_test",
table="noaa",
partition_cols=["year"],
boto3_session=my_session,
)
# Read Athena query into pandas dataframe
df = wr.athena.read_sql_query(
"select * from noaa limit 10", database="awswrangler_test", boto3_session=my_session
)
https://aws.amazon.com/blogs/big-data/optimize-python-etl-by-extending-pandas-with-aws-data-wrangler/
Labels: athena, aws
July 31, 2020
Using Athena to preview large S3 files
Athena can also be used for quickly previewing the csv or text files saved in S3.
# vi symlink.txt
s3://some_bucket/abc_servers_backup/bulk_205_aws/db/xyz_no_details_202006.csv.gz
As you can see, I have saved the entire path in a symlink text file and uploaded it to S3 using this command...
aws s3 cp symlink.txt s3://132cols/symlink/
I will now use the path of the symlink file while creating the table, as shown below.
CREATE EXTERNAL TABLE test_bulk (
dummy_col string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '~'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://132cols/symlink/'
TBLPROPERTIES ('has_encrypted_data'='false');
The important point to note is that I have changed the comma (,) in the "FIELDS TERMINATED BY" clause to a tilde (~), or else I would get only the first column. Due to this change, I can see the entire row in a single column called "dummy_col". This tip is useful when you want to check what's in a large file without downloading it first. This method can also be used to re-check that a file you have uploaded to S3 is valid and readable.
Labels: athena, aws
April 12, 2020
Launch spot EC2 instances
The following code will launch a Linux instance of type m3.medium using spot pricing and associate it with the Elastic IP address 13.228.39.49. Make sure to use your own Elastic IP address and key pair. Do not forget to change the access_key and secret_key parameters.
!wget https://raw.githubusercontent.com/shantanuo/easyboto/master/easyboto.py
import easyboto
dev=easyboto.connect('access_key', 'secret_key')
dev.placement='us-east-1a'
dev.myaddress='13.228.39.49'
dev.key='dec15abc'
dev.MAX_SPOT_BID= '2.9'
dev.startEc2Spot('ami-0323c3dd2da7fb37d', 'm3.medium')
This will return the instance id and the ssh command that you can use to connect to your instance. The output will look something like...
job instance id: i-029a926e68118d089
ssh -i dec15a.pem ec2-user@13.228.39.49
You can list all instances along with details like launch time and image_id, and save the results as a pandas dataframe using the showEc2 method like this...
df=dev.showEc2()
Now "df" is a pandas dataframe object. You can sort or group the instances just like in an Excel sheet.
You can delete the instance by passing the instance ID generated in the first step to the deleteEc2 method.
dev.deleteEc2('i-029a926e68118d089')
_____
You can also use a CloudFormation template for this purpose. Visit the following link and look for the "Linux EC2 Instance on SPOT" section.
https://github.com/shantanuo/cloudformation
Click on the "Launch Stack" button. It is nothing but a GUI for the Python code mentioned above. You simply have to submit a form with parameters like the key name and IP address.
Labels: aws, aws_cloudformation, boto, usability
April 07, 2020
emailThis service using serverless API
There are times when I find a great article or web page but don't have time to read it. I use the EmailThis service to save text and images from a website to my email inbox. The concept is very simple: drag a bookmarklet to the bookmark toolbar and click on it to send the current web page to your inbox!
https://www.emailthis.me/
But I did not like the premium ads and partial content that the site sends, so I built my own serverless API with exactly the same functionality using Mailgun and Amazon Web Services.
https://www.mailgun.com/
Once you register with Mailgun, you will get a URL and an API key; copy them somewhere handy. You will need to provide this information when you launch the CloudFormation template by clicking on this link.
https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=emailThis&templateURL=https://datameetgeobk.s3.amazonaws.com/cftemplates/furl.yaml.txt
Once the resources are created, you will see a URL in the Outputs section, something like this...
https://ie5n05bqo0.execute-api.us-east-1.amazonaws.com/mycall
Now building the JavaScript bookmarklet is easy.
javascript:(function(){location.href='https://ie5n05bqo0.execute-api.us-east-1.amazonaws.com/mycall?email=shantanu.oak@gmail.com&title=emailThis&url='+encodeURIComponent(location.href);})();
Right click on any bookmark and then copy-paste the above link. Make sure that you have changed the URL and email address to your own. Now click on this bookmarklet while you are on an important web page that you need to send to your inbox. Enjoy!
Labels: api_gateway, aws, aws_lambda, usability
March 26, 2020
Packetbeat to elastic server hosted by AWS
These are the steps to follow if you want to push Packetbeat or Metricbeat data to an AWS Elasticsearch instance.
Make sure that your Elasticsearch instance has whitelisted the IP address of the server where you are installing Packetbeat.
1) Install
2) Configure
3) Start the service
1) Installation of Beats depends on the processor type, the OS, and whether you are connecting to AWS.
arm + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-arm64.deb
dpkg -i packetbeat-oss-7.11.0-arm64.deb
arm + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-arm64.deb
dpkg -i packetbeat-7.11.0-arm64.deb
arm + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-aarch64.rpm
rpm -ivh packetbeat-oss-7.11.0-aarch64.rpm
arm + redhat + connect to non-AWS
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-aarch64.rpm
rpm -ivh packetbeat-7.11.0-aarch64.rpm
Intel + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-amd64.deb
dpkg -i packetbeat-oss-7.11.0-amd64.deb
Intel + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-amd64.deb
dpkg -i packetbeat-7.11.0-amd64.deb
Intel + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-x86_64.rpm
rpm -ivh packetbeat-oss-7.11.0-x86_64.rpm
Intel + redhat + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-x86_64.rpm
rpm -ivh packetbeat-7.11.0-x86_64.rpm
2) Open config file and change 2 settings as shown below:
vi /etc/packetbeat/packetbeat.yml
hosts: ["search-audit-ay52ddaf7q7zudixfmluji4osmx.us-east-1.es.amazonaws.com:443"]
protocol: "https"
If you are using a server on the local network, mention its address in the hosts list. The protocol option can be removed if you are not using a secure HTTP connection.
Optionally, enable the "send_response" parameter to capture each query and its output, as shown below:
packetbeat.protocols.mysql:
ports: [3306]
send_response: true
max_row_length: 4000
The max_row_length option will allow you to capture lengthy queries and their large output.
Disable X-Pack if you are connecting to AWS or using the open-source license:
# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
setup.ilm.enabled: false
setup.pack.security.enabled: false
setup.xpack.graph.enabled: false
setup.xpack.watcher.enabled: false
setup.xpack.monitoring.enabled: false
setup.xpack.reporting.enabled: false
3) Start packetbeat service
cd /usr/bin/
./packetbeat -e -c packetbeat.yml
Labels: aws, elastic
snapshot and restore elastic data to S3
Here are the steps to back up and restore Elasticsearch data on an AWS Elasticsearch instance.
1) Create IAM role
2) Use the boto script to create snapshot repo
3) Use Kibana to take the actual snapshot
All the steps are explained here...
https://forums.aws.amazon.com/message.jspa?messageID=930345#930345
Here is step-by-step guidance for each of them.
1) Create IAM role:
Use the following CloudFormation template to create the role, and note the ARN of the role to be used in the boto script.
https://github.com/shantanuo/cloudformation/blob/master/updated/esbck.yml
2) Run this boto script:
Change the access and secret keys, change the Elasticsearch endpoint, and add the ARN created in the first step.
from boto.connection import AWSAuthConnection

class ESConnection(AWSAuthConnection):
    def __init__(self, region, **kwargs):
        super(ESConnection, self).__init__(**kwargs)
        self._set_auth_region_name(region)
        self._set_auth_service_name("es")

    def _required_auth_capability(self):
        return ["hmac-v4"]

client = ESConnection(
    region="us-east-1",
    host="search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
    is_secure=False,
)

headers = {"Content-Type": "application/json"}
resp = client.make_request(
    method="PUT",
    headers=headers,
    path="/_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt",
    data='{"type": "s3","settings": { "bucket": "esbck-essnapshotbucket-c9e6d7fy1cbt","region": "us-east-1", "role_arn": "arn:aws:iam::1234567890:role/esbck-EsSnapshotRole-GJGMPH4DBMM3"}}',
)
resp.read()
3) Take the backup from kibana
PUT /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1
GET /_cat/indices
DELETE /cwl-2020.03.26
POST /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1/_restore
GET /_snapshot/
GET /_cat/snapshots/esbck-essnapshotbucket-c9e6d7fy1cbt
Labels: aws, elastic
March 12, 2020
Analyze S3 storage usage using inventory
1) Enable inventory for the bucket
2) Create a table using Athena to read the inventory data
3) Run a select query
1) Enable inventory using boto as shown in this Stack Overflow thread (see also the sketch at the end of this post):
https://stackoverflow.com/questions/60615911/use-boto-to-enable-inventory
2) Replace $InventoryBucket, $InventoryPrefix, $Bucket, and $InventoryName with your configuration values:
CREATE EXTERNAL TABLE inventory(
bucket string,
key string,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://$InventoryBucket/$InventoryPrefix/$Bucket/$InventoryName/hive';
The location URL will look something like this...
LOCATION 's3://testme16/invent/testme16/myinventory/hive';
Repair table to read partitions:
MSCK REPAIR TABLE inventory;
3) Replace $Yesterday with yesterday’s timestamp (e.g., 2020-03-03-00-00):
SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, i.size
FROM inventory AS i WHERE i.dt = '$Yesterday'
) GROUP BY prefix ORDER BY total_size DESC;
https://cloudonaut.io/how-to-analyze-and-reduce-s3-storage-usage/
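As referenced in step 1, here is a rough boto3 sketch for enabling the inventory. The bucket names, prefix, and inventory ID mirror the example LOCATION above; the field list is an assumption, and the ORC format matches the SerDe used by the Athena table.
import boto3

s3 = boto3.client("s3")

# enable a daily ORC inventory for the bucket, delivered to $InventoryBucket/$InventoryPrefix
s3.put_bucket_inventory_configuration(
    Bucket="testme16",                      # the bucket being inventoried ($Bucket)
    Id="myinventory",                       # the $InventoryName
    InventoryConfiguration={
        "Id": "myinventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::testme16",   # the $InventoryBucket
                "Prefix": "invent",                  # the $InventoryPrefix
                "Format": "ORC",
            }
        },
        "OptionalFields": [
            "Size", "LastModifiedDate", "ETag", "StorageClass",
            "IsMultipartUploaded", "ReplicationStatus", "EncryptionStatus",
        ],
    },
)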
Labels: athena, aws