Shantanu's Blog
Database Consultant
July 01, 2025
Disable DynamoDB table
I have a DynamoDB table called "autocorrect" and I need to stop all traffic to/from this table. I can use a resource-based policy like this...
{
"Version": "2012-10-17",
"Id": "PolicyId",
"Statement": [
{
"Sid": "AccessDisabledTemporary",
"Effect": "Deny",
"Principal": "*",
"Action": "dynamodb:*",
"Resource": "arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect"
}
]
}
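To attach this policy to the table, a minimal boto3 sketch could look like the following. This assumes a boto3 release recent enough to expose DynamoDB's PutResourcePolicy API, and that the policy above has been saved as policy.json; the masked account ID is the one from the policy.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# read the deny-all policy shown above and attach it to the table
with open("policy.json") as f:
    policy_document = f.read()

dynamodb.put_resource_policy(
    ResourceArn="arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect",
    Policy=policy_document,
)
The policy can be removed later (DeleteResourcePolicy) to restore normal access to the table.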
Labels: aws
June 02, 2025
Manage AWS resources using command line
1) Add the access key and secret key of a read-only user:
aws configure
2) Install Amazon Q using the instructions found on this page...
https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing-ssh-setup-autocomplete.html
3) I can now start the Q CLI using the command...
./q/bin/q
Use natural-language instructions like "list all DynamoDB tables" or "list all S3 buckets".
4) Advanced users can create an MCP server and save database credentials (username and password) so that Q can query the database and return results.
https://awslabs.github.io/mcp/servers/dynamodb-mcp-server/
Labels: aws, usability
December 03, 2024
Avoid uploading a file to S3 again
Let's assume I uploaded a file to S3:
aws s3 cp dictionaries.xcu s3://cf-templates-us-east-1/
I need to upload the file only if it does not already exist. In that case I can use the --if-none-match parameter as shown below:
aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body dictionaries.xcu --if-none-match "*"
Returns "An error occurred (PreconditionFailed)"
This will help me while trying to upload a large file again.
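The same conditional write can be done from Python; here is a minimal sketch, assuming a boto3 release new enough to support conditional writes (the IfNoneMatch argument on put_object). The bucket and key are the ones used above.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

try:
    # the write succeeds only if the object does not already exist
    with open("dictionaries.xcu", "rb") as f:
        s3.put_object(
            Bucket="cf-templates-us-east-1",
            Key="dictionaries.xcu",
            Body=f,
            IfNoneMatch="*",
        )
except ClientError as e:
    # an existing object results in a 412 PreconditionFailed error
    print(e.response["Error"]["Code"])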
_____
The following features are available for S3 Express One Zone:
1) In directory buckets, clients can perform conditional delete checks on an object's last-modified time, size, and ETag using the x-amz-if-match-last-modified-time, x-amz-if-match-size, and HTTP If-Match headers.
2) Append data to an existing object. The --write-offset-bytes value must be an integer equal to the current size of the object, and the body is the data to append (file001.bin here):
aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body file001.bin --write-offset-bytes <current-object-size>
Or use python:
s3.put_object(Bucket='amzn-s3-demo-bucket--use2-az2--x-s3', Key='2024-11-05-sdk-test', Body=b'123456789', WriteOffsetBytes=9)
This cannot replace your database or messaging queues, because only a few thousand appends are possible per object.
3) You can configure S3 Lifecycle rules for S3 Express One Zone to expire objects on your behalf. For example, you can create an S3 Lifecycle rule that expires all objects smaller than 512 KB after 3 days and another rule that expires all objects in a prefix after 10 days.
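As a rough sketch of such rules from Python: the following assumes put_bucket_lifecycle_configuration works the same way for directory buckets, reuses the demo bucket name from the append example above, and makes up the rule IDs and prefix.
import boto3

s3 = boto3.client("s3")

# expire objects smaller than 512 KB after 3 days,
# and everything under the "tmp/" prefix after 10 days
s3.put_bucket_lifecycle_configuration(
    Bucket="amzn-s3-demo-bucket--use2-az2--x-s3",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-small-objects",
                "Status": "Enabled",
                "Filter": {"ObjectSizeLessThan": 512 * 1024},
                "Expiration": {"Days": 3},
            },
            {
                "ID": "expire-tmp-prefix",
                "Status": "Enabled",
                "Filter": {"Prefix": "tmp/"},
                "Expiration": {"Days": 10},
            },
        ]
    },
)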
Labels: aws
September 28, 2024
Firefox and Libreoffice in your browser
Kasm VNC is a modern open source VNC server.
Quickly connect to your Linux server's desktop from any web browser. No client software install required.
1) Firefox using VNC
docker run -d \
  --name=firefox \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=Etc/UTC \
  -p 3000:3000 \
  -p 3001:3001 \
  -v /path/to/config2:/config \
  --shm-size="1gb" \
  --restart unless-stopped \
  lscr.io/linuxserver/firefox:latest
2) Libreoffice using VNC
docker run -d \
  --name=libreoffice \
  --security-opt seccomp=unconfined `#optional` \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=Etc/UTC \
  -p 3000:3000 \
  -p 3001:3001 \
  -v /path/to/config:/config \
  --restart unless-stopped \
  lscr.io/linuxserver/libreoffice:latest
Labels: aws, docker, firefox, libreoffice, linux tips, usability
June 27, 2024
Disable DynamoDB table access
I can disable all access to a DynamoDB table using a resource-based policy. Here is an example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": { "AWS": "*" },
      "Action": "dynamodb:*",
      "Resource": "arn:aws:dynamodb:us-east-1:XXX885053566:table/sandhiDupe"
    }
  ]
}
There are many other advantages to managing access at the resource level.
Labels: aws, aws_lambda
April 29, 2024
Partitioning and bucketing S3 data
Let's assume we need to optimize this query. The data is stored in S3 (a few TB):
SELECT * FROM "bucketing_blog"."noaa_remote_original"
WHERE
report_type = 'CRN05'
AND ( station = '99999904237'
OR station = '99999953132'
OR station = '99999903061'
OR station = '99999963856'
OR station = '99999994644'
);
There are around 14325 unique stations, and the report_type column contains around 13 distinct types.
1) The first thing to try is creating a table partitioned by report_type:
CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-non-bucketed/',
partitioned_by = ARRAY['report_type'],
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";
2) If you need faster and cheaper queries, add bucketing on top of partitioning:
CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-bucketed/',
partitioned_by = ARRAY['report_type'],
bucketed_by = ARRAY['station'],
bucket_count = 16,
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original"
I created 16 buckets because that is the maximum number of stations that may appear in the SQL query.
https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/
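To compare the two layouts from Python, the same query can be run against each table with awswrangler (the module used in a later post on this blog). The database and table names follow the example above; this is only a sketch.
import awswrangler as wr

stations = ['99999904237', '99999953132', '99999903061', '99999963856', '99999994644']
sql = """
SELECT * FROM "bucketing_blog"."athena_bucketed"
WHERE report_type = 'CRN05'
  AND station IN ({})
""".format(", ".join("'" + s + "'" for s in stations))

# read_sql_query returns a pandas DataFrame; the data scanned (and cost)
# can be compared in the Athena console for the bucketed vs. non-bucketed table
df = wr.athena.read_sql_query(sql, database="bucketing_blog")
print(len(df))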
Labels: athena, aws
September 27, 2023
Enable S3 versioning
It is highly recommended to enable versioning for all your S3 buckets. In case you delete a file by mistake, you can find the version ID of the older version and download that copy.
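Versioning itself can be turned on from Python; a minimal boto3 sketch, using the bucket name from the commands below:
import boto3

s3 = boto3.client("s3")

# turn on versioning for the bucket
s3.put_bucket_versioning(
    Bucket="cdk-hnb",
    VersioningConfiguration={"Status": "Enabled"},
)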
1) Find the version ID:
aws s3api list-object-versions --bucket cdk-hnb --prefix tt/archive.tar
2) Download that version:
aws s3api get-object --bucket cdk-hnb --key tt/archive.tar archive.tar --version-id mqfya
3) List all versions:
aws s3api list-object-versions --bucket cdk-hnb
If versioning is enabled, you cannot remove the bucket easily. You first need to "empty" the bucket (including all object versions) and then delete it.
Labels: aws
November 01, 2022
Check for open ports
This code will check if any security group has a port open to the world (0.0.0.0/0) and send an alert to the subscribers of an SNS topic.
import boto3, json

ec2 = boto3.client('ec2', region_name='us-east-1')
sns_client = boto3.client('sns', region_name='us-east-1')

# check every ingress rule of every security group for a CIDR open to the world
for security_group in ec2.describe_security_groups()['SecurityGroups']:
    for permission in security_group['IpPermissions']:
        for ip_range in permission.get('IpRanges', []):
            print(security_group['GroupId'], ip_range)
            if '0.0.0.0' in ip_range.get('CidrIp', ''):
                message = {"alert": "open port found"}
                response = sns_client.publish(
                    TargetArn='arn:aws:sns:us-east-1:102378362623:NotifyMe',
                    Message=json.dumps({'default': json.dumps(message)}),
                    MessageStructure='json')
You may need to change the region name and SNS topic ARN in the code above.
This code can be written as a Lambda function and scheduled to run every day.
Labels: aws, aws_lambda, boto
July 06, 2021
Quick test tables in Athena
If you want to quickly test some data using an Athena query, use the WITH syntax as shown below:
WITH countries(country_code) AS (VALUES 'pol', 'CAN', 'USA')
SELECT upper(country_code) AS country_code
FROM countries
In the next example, an aggregate expression uses the "OVER" clause (a window function):
WITH students_results(student_id, result) AS (VALUES
('student_1', 17),
('student_2', 16),
('student_3', 18),
('student_4', 18),
('student_5', 10),
('student_6', 20),
('student_7', 16))
SELECT
student_id,
result,
count(*) OVER (
ORDER BY result
) AS close_better_scores_count
FROM students_results
Labels: athena, aws
May 28, 2021
Athena and Unicode text
Athena supports Unicode characters very well. For example, if the data file looks like this...
"Root_word";"Word";"Primary";"Type";"Code";"Position";"Rule"
"अँटिबायोटिक","अँटिबायोटिक","अँटिबायोटिक","Primary","","",""
"अँटिबायोटिक","अँटिबायोटिकअंती","अँटिबायोटिक","Suffix","A","7293","001: 0 अंती ."
"अँटिबायोटिक","अँटिबायोटिकअर्थी","अँटिबायोटिक","Suffix","A","7293","002: 0 अर्थी ."
"अँटिबायोटिक","अँटिबायोटिकआतून","अँटिबायोटिक","Suffix","A","7293","003: 0 आतून ."
"अँटिबायोटिक","अँटिबायोटिकआतूनचा","अँटिबायोटिक","Suffix","A","7293","004: 0 आतूनचा ."
"अँटिबायोटिक","अँटिबायोटिकआतूनची","अँटिबायोटिक","Suffix","A","7293","005: 0 आतूनची ."
"अँटिबायोटिक","अँटिबायोटिकआतूनचे","अँटिबायोटिक","Suffix","A","7293","006: 0 आतूनचे ."
"अँटिबायोटिक","अँटिबायोटिकआतूनच्या","अँटिबायोटिक","Suffix","A","7293","007: 0 आतूनच्या ."
"अँटिबायोटिक","अँटिबायोटिकआतूनला","अँटिबायोटिक","Suffix","A","7293","008: 0 आतूनला ."
This create table statement is all I need...
create external table myptg (
root_word varchar(255),
derived_word varchar(255),
stemmed_word varchar(255),
type varchar(255),
code varchar(255),
position varchar(255),
rule varchar(255)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '\;',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ptg1/mc/'
TBLPROPERTIES ("skip.header.line.count"="1");
I can create a supporting table like this...
create external table gamabhana (derived_word varchar(255))
LOCATION 's3://ptg1/mc2/'
TBLPROPERTIES ("skip.header.line.count"="1");
A new table can be created using syntax something like this...
create external table anoop (
serial_number int,
root_word varchar(255),
stem1_word varchar(255),
stem2_word varchar(255),
stem3_word varchar(255),
stem4_word varchar(255),
stem5_word varchar(255),
stem6_word varchar(255),
stem7_word varchar(255)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ',',
'escape.delim' = '\\',
'line.delim' = '\n'
)
LOCATION 's3://ptg1/mc3/'
TBLPROPERTIES ("skip.header.line.count"="1");
And then run a join statement like this...
create table gamabhana_match as
select a.derived_word, b.root_word, b.stemmed_word, b.type, b.code, b.position, b.rule, c.stem1_word, c.stem2_word, c.stem3_word, c.stem4_word, c.stem5_word, c.stem6_word, c.stem7_word
from gamabhana as a left join myptg as b
on b.derived_word = a.derived_word
left join anoop as c
on c.derived_word = a.derived_word
It will scan around 2 GB of data (in this case) and the cost will be around 1 cent per query. This can also be done in MySQL, but importing the data and building indexes is not easy. Unlike Athena, MySQL allows unlimited queries for free!
Athena is a good fit for data that is important but rarely accessed.
Labels: athena, aws, usability
April 17, 2021
Crawl Indian language websites
CommonCrawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.
https://commoncrawl.org/
Here are 3 easy steps to download the data in any language, for example Marathi or Gujarati.
1) Clone the repo:
git clone https://github.com/qburst/common-crawl-malayalam.git
2) Change the language code from mal to mar at the line -
AND content_languages LIKE 'mal%'"
The other codes are: hin (Hindi), mal (Malayalam), mar (Marathi), guj (Gujarati)
Run this shell script after replacing XXX with your AWS access key and secret key.
./extract_malayalam_warcs.sh XXX XXX 2021-10 s3://nlp-malayalam
3) Change the Unicode range from Malayalam: range(3328, 3456) to your desired language range, for example Devanagari: range(2304, 2432) or Gujarati: range(2688, 2816). (See the sketch after these steps.)
Run this shell script after replacing XXX with your AWS access key and secret key.
./filter_malayalam.sh XXX XXX s3://nlp-malayalam/2021-10/warcs s3://nlp-malayalam/2021-10/output2
Filtered and unfiltered data will be available in the output2 sub-folder of the S3 bucket.
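As a rough illustration of what the range-based filter in step 3 does (this is not the repository's actual code, and the 50% threshold is my assumption), a line can be kept when most of its characters fall in the target Unicode block:
# keep a line if most of its characters are Devanagari, i.e. code points in range(2304, 2432)
DEVANAGARI = set(range(2304, 2432))

def mostly_devanagari(line, threshold=0.5):
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    hits = sum(1 for c in chars if ord(c) in DEVANAGARI)
    return hits / len(chars) >= threshold

print(mostly_devanagari("अँटिबायोटिक उपचार"))   # True
print(mostly_devanagari("plain english text"))   # False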
_____
Update:
The good folks from Sorbonne University and Inria have extracted sentences from Common Crawl for all languages and made them available at
https://oscar-corpus.com/
I have no words to thank them. The paper is available here...
http://corpora.ids-mannheim.de/CMLC7-final/CMLC-7_2019-Oritz_et_al.pdf
Without this corpus, it would have been very difficult to extract reliable data for languages like Marathi and Gujarati.
Labels: athena, aws, python, shell script
October 28, 2020
Using Lambda Functions as UDF's in Redshift
Let's assume I have a list of client codes saved in a Redshift table and I need to fetch their details from an API.
# select client_code from some_table limit 10;
client_code |
--------------+
1001 |
2002 |
9009 |
1009 |
1898 |
5465 |
3244 |
5576 |
4389 |
8756 |
(10 rows)
I need to get the client addresses from a website. For example, the first client code is 1001 and its address should come from
http://some_site.com/Details?dest=1001
This cannot be done at the SQL query level alone. You would normally loop through the codes using Python, PHP, Java, etc. But you can also write the logic as an AWS Lambda function and use it as a UDF (User Defined Function) in Redshift. For example:
# select client_code, client_details(client_code) as c_address from some_table limit 10;
client_code | c_address
--------------+---------------------------------------------
1001 | 21,Tamilnadu,
2002 | 14,Madhya Pradesh & Chattisgarh,
9009 | 7,Gujarat,
1009 | 23,Uttar Pradesh (W) & Uttarakhand
1898 | 11,Karnataka
5465 | 3,Bihar & Jharkhand
3244 | 11,Karnataka
5576 | 6,Delhi
4389 | 13,Kolkata
8756 | 11,Karnataka
(10 rows)
The code of "client_details" Lambda function will look something like this...
import json
import requests

myurl = 'http://some_site.com/Details?dest='

def lambda_handler(event, context):
    ret = dict()
    res = list()
    for argument in event['arguments']:
        try:
            number = str(argument[0])
            page = requests.get(myurl + number[-10:])
            res.append((page.content).decode('utf-8'))
            ret['success'] = True
        except Exception as e:
            res.append(None)
            ret['success'] = False
            ret['error_msg'] = str(e)
    ret['results'] = res
    return json.dumps(ret)
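For reference, Redshift invokes the function with a payload whose "arguments" key holds one inner list per row, and that is the only key the handler above reads. A hypothetical local test would look like this (the HTTP calls will of course fail unless the placeholder URL is replaced with a real one):
# hypothetical test event: two rows, one argument (the client code) per row
event = {"arguments": [["1001"], ["2002"]]}
print(lambda_handler(event, None))
# prints a JSON string with "success", "results" (one entry per row) and, on failure, "error_msg"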
Notes:
1) We are using the "requests" module in this code. Since it is not available in the default AWS Lambda environment, I have added it using this layer...
# Layer: arn:aws:lambda:us-east-1:770693421928:layer:Klayers-python38-requests:9
2) You will also need to increase the Lambda timeout, up to 15 minutes if necessary. The API may take longer than the default 3 seconds to respond.
3) You will also have to update the IAM role associated with your Redshift cluster (Actions - Manage Role). You can add the policy called "AWSLambdaFullAccess" or grant access to a single function as explained in the documentation.
The Lambda function needs to be "linked" to Redshift using a CREATE EXTERNAL FUNCTION statement like this...
CREATE OR REPLACE EXTERNAL FUNCTION client_details (number varchar )
RETURNS varchar STABLE
LAMBDA 'client_details'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyUnload';
You need to change the IAM role name and the 12-digit account ID mentioned above.
You can now use your Lambda function in your Redshift query, for example:
# select client_code, client_details(client_code) as c_address from some_table limit 10;
You can read more here:
https://aws.amazon.com/blogs/big-data/accessing-external-components-using-amazon-redshift-lambda-udfs/
Labels: aws, aws_lambda, redshift
October 24, 2020
Using Bucketing in Amazon Athena
To reduce the data scanned per query, Amazon Athena provides an option to bucket your data. This optimization technique can significantly reduce cost.
Like partitioning, columns that are frequently used to filter the data are good candidates for bucketing. However, unlike partitioning, with bucketing it’s better to use columns with high cardinality as a bucketing key. For example, Year and Month columns are good candidates for partition keys, whereas userID and sensorID are good examples of bucket keys. By doing this, you make sure that all buckets have a similar number of rows.
Bucketing is a technique that groups data based on specific columns together within a single partition. These columns are known as bucket keys. By grouping related data together into a single bucket (a file within a partition), you significantly reduce the amount of data scanned by Athena, thus improving query performance and reducing cost.
For example, imagine collecting and storing clickstream data. If you frequently filter or aggregate by Sensor ID, then within a single partition it’s better to store all rows for the same sensor together.
CREATE TABLE TargetTable
WITH (
format = 'PARQUET',
external_location = 's3://<s3_bucket_name>/curated/',
partitioned_by = ARRAY['dt'],
bucketed_by = ARRAY['sensorID'],
bucket_count = 3)
AS SELECT *
FROM SourceTable
You can run the select query like this:
select * from TargetTable where dt= '2020-08-04-21' and sensorID = '1096'
Labels: athena, aws
August 24, 2020
Manage Athena tables using python module wrangler
# create a boto session
import boto3
my_session = boto3.Session(
aws_access_key_id="xxx",
aws_secret_access_key="xxx",
region_name="us-east-1",
)
# use wrangler to upload parquet files to S3
# use the database created in glue
import awswrangler as wr
wr.s3.to_parquet(
df=df,
path="s3://test1623/noaa/",
dataset=True,
database="awswrangler_test",
table="noaa",
partition_cols=["year"],
boto3_session=my_session,
)
# Read Athena query into pandas dataframe
df = wr.athena.read_sql_query(
"select * from noaa limit 10", database="awswrangler_test", boto3_session=my_session
)
https://aws.amazon.com/blogs/big-data/optimize-python-etl-by-extending-pandas-with-aws-data-wrangler/
Labels: athena, aws
July 31, 2020
Using Athena to preview large S3 files
Athena can also be used for quickly previewing the csv or text files saved in S3.
# vi symlink.txt
s3://some_bucket/abc_servers_backup/bulk_205_aws/db/xyz_no_details_202006.csv.gz
As you can see, I have saved the entire path in a symlink text file and uploaded it to S3 using this command...
aws s3 cp symlink.txt s3://132cols/symlink/
I will now use the path of the symlink file while creating the table, as shown below.
CREATE EXTERNAL TABLE test_bulk (
dummy_col string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '~'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://132cols/symlink/'
TBLPROPERTIES ('has_encrypted_data'='false');
The important point to note is that I have changed the comma (,) in the "FIELDS TERMINATED BY" clause to a tilde (~), or else I would get only the first column. Due to this change, I can see the entire row in a single column called "dummy_col". This tip is useful when you want to check what's in a large file without downloading it first. This method can also be used to re-check that a file you have uploaded to S3 is valid and readable.
Labels: athena, aws
April 12, 2020
Launch spot EC2 instances
The following code will launch a Linux instance of type m3.medium using spot pricing and associate it with the Elastic IP address 13.228.39.49. Make sure to use your own Elastic IP address and key pair. Do not forget to change the access_key and secret_key parameters.
!wget https://raw.githubusercontent.com/shantanuo/easyboto/master/easyboto.py
import easyboto
dev=easyboto.connect('access_key', 'secret_key')
dev.placement='us-east-1a'
dev.myaddress='13.228.39.49'
dev.key='dec15abc'
dev.MAX_SPOT_BID= '2.9'
dev.startEc2Spot('ami-0323c3dd2da7fb37d', 'm3.medium')
This will return the instance id and the ssh command that you can use to connect to your instance. The output will look something like...
job instance id: i-029a926e68118d089
ssh -i dec15a.pem ec2-user@13.228.39.49
You can list all instances along with details like launch time and image_id, and save the results as a pandas dataframe using the showEc2 method like this...
df=dev.showEc2()
Now "df" is a pandas dataframe object. You can sort or group the instances just like in an Excel sheet.
You can delete the instance by passing the instance ID generated in the first step to the deleteEc2 method.
dev.deleteEc2('i-029a926e68118d089')
_____
You can also use a CloudFormation template for this purpose. Visit the following link and look for the "Linux EC2 Instance on SPOT" section.
https://github.com/shantanuo/cloudformation
Click on the "Launch Stack" button. It is nothing but a GUI for the Python code mentioned above. You simply have to submit a form with parameters like the key name and IP address.
Labels: aws, aws_cloudformation, boto, usability
April 07, 2020
emailThis service using serverless API
There are times when I find a great article or web page but don't have time to read it. I use the EmailThis service to save text and images from a website to my email inbox. The concept is very simple: drag a bookmarklet to the bookmark toolbar and click on it to send the current web page to your inbox!
https://www.emailthis.me/
But I did not like the premium ads and partial content that the site sends, so I built my own serverless API with exactly the same functionality using Mailgun and Amazon Web Services.
https://www.mailgun.com/
Once you register with Mailgun, you will get a URL and an API key; copy them somewhere handy. You will need to provide this information when you launch the CloudFormation template by clicking on this link.
https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=emailThis&templateURL=https://datameetgeobk.s3.amazonaws.com/cftemplates/furl.yaml.txt
Once the resources are created, you will see a URL in the Outputs section, something like this...
https://ie5n05bqo0.execute-api.us-east-1.amazonaws.com/mycall
Now building the JavaScript bookmarklet is easy.
javascript:(function(){location.href='https://ie5n05bqo0.execute-api.us-east-1.amazonaws.com/mycall?email=shantanu.oak@gmail.com&title=emailThis&url='+encodeURIComponent(location.href);})();
Right click on any bookmark and then copy-paste the above link. Make sure that you have changed the URL and email address to your own. Now click on this bookmarklet while you are on an important web page that you need to send to your inbox. Enjoy!
Labels: api_gateway, aws, aws_lambda, usability
March 26, 2020
Packetbeat to elastic server hosted by AWS
These are the steps to follow if you want to push Packetbeat or Metricbeat data to an AWS Elasticsearch instance.
Make sure that your Elasticsearch instance has whitelisted the IP address of the server where you are installing Packetbeat.
1) Install
2) Configure
3) Start the service
1) Installation of Beats depends on the processor type, the OS, and whether you are connecting to AWS.
arm + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-arm64.deb
dpkg -i packetbeat-oss-7.11.0-arm64.deb
arm + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-arm64.deb
dpkg -i packetbeat-7.11.0-arm64.deb
arm + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-aarch64.rpm
rpm -ivh packetbeat-oss-7.11.0-aarch64.rpm
arm + redhat + connect to non-AWS
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-aarch64.rpm
rpm -ivh packetbeat-7.11.0-aarch64.rpm
Intel + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-amd64.deb
dpkg -i packetbeat-oss-7.11.0-amd64.deb
Intel + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-amd64.deb
dpkg -i packetbeat-7.11.0-amd64.deb
Intel + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-x86_64.rpm
rpm -ivh packetbeat-oss-7.11.0-x86_64.rpm
Intel + redhat + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-x86_64.rpm
rpm -ivh packetbeat-7.11.0-x86_64.rpm
2) Open config file and change 2 settings as shown below:
vi /etc/packetbeat/packetbeat.yml
hosts: ["search-audit-ay52ddaf7q7zudixfmluji4osmx.us-east-1.es.amazonaws.com:443"]
protocol: "https"
If you are using a server on the local network, mention its address in the hosts list. The protocol option can be removed if you are not using a secure HTTP connection.
Optionally, enable the "send_response" parameter to capture each query and its output, as shown below:
packetbeat.protocols.mysql:
ports: [3306]
send_response: true
max_row_length: 4000
The max_row_length option will allow you to capture lengthy queries and their large output.
Disable X-Pack if you are connecting to AWS or using the open-source license:
# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
setup.ilm.enabled: false
setup.pack.security.enabled: false
setup.xpack.graph.enabled: false
setup.xpack.watcher.enabled: false
setup.xpack.monitoring.enabled: false
setup.xpack.reporting.enabled: false
3) Start packetbeat service
cd /usr/bin/
./packetbeat -e -c packetbeat.yml
Labels: aws, elastic
snapshot and restore elastic data to S3
Here are the steps to back up and restore Elasticsearch data on an AWS Elasticsearch instance.
1) Create IAM role
2) Use the boto script to create snapshot repo
3) Use Kibana to take the actual snapshot
All the steps are explained here...
https://forums.aws.amazon.com/message.jspa?messageID=930345#930345
Here is step-by-step guidance for each of them.
1) Create IAM role:
Use the following CloudFormation template to create the role, and note the ARN of the role to be used in the boto script.
https://github.com/shantanuo/cloudformation/blob/master/updated/esbck.yml
2) Run this boto script:
Change the access and secret keys, change the Elasticsearch endpoint, and add the ARN created in the first step.
from boto.connection import AWSAuthConnection

class ESConnection(AWSAuthConnection):
    def __init__(self, region, **kwargs):
        super(ESConnection, self).__init__(**kwargs)
        self._set_auth_region_name(region)
        self._set_auth_service_name("es")

    def _required_auth_capability(self):
        return ["hmac-v4"]

client = ESConnection(
    region="us-east-1",
    host="search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
    is_secure=False,
)

headers = {"Content-Type": "application/json"}
resp = client.make_request(
    method="PUT",
    headers=headers,
    path="/_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt",
    data='{"type": "s3","settings": { "bucket": "esbck-essnapshotbucket-c9e6d7fy1cbt","region": "us-east-1", "role_arn": "arn:aws:iam::1234567890:role/esbck-EsSnapshotRole-GJGMPH4DBMM3"}}',
)
resp.read()
3) Take the backup from kibana
PUT /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1
GET /_cat/indices
DELETE /cwl-2020.03.26
POST /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1/_restore
GET /_snapshot/
GET /_cat/snapshots/esbck-essnapshotbucket-c9e6d7fy1cbt
Labels: aws, elastic
March 12, 2020
Analyze S3 storage usage using inventory
1) Enable inventory for the bucket
2) Create a table using Athena to read the inventory data
3) Run a select query
1) Enable inventory using boto as shown in this Stack Overflow thread (see also the sketch at the end of this post):
https://stackoverflow.com/questions/60615911/use-boto-to-enable-inventory
2) Replace $InventoryBucket, $InventoryPrefix, $Bucket, and $InventoryName with your configuration values:
CREATE EXTERNAL TABLE inventory(
bucket string,
key string,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://$InventoryBucket/$InventoryPrefix/$Bucket/$InventoryName/hive';
The location URL will look something like this...
LOCATION 's3://testme16/invent/testme16/myinventory/hive';
Repair table to read partitions:
MSCK REPAIR TABLE inventory;
3) Replace $Yesterday with yesterday’s timestamp (e.g., 2020-03-03-00-00):
SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, i.size
FROM inventory AS i WHERE i.dt = '$Yesterday'
) GROUP BY prefix ORDER BY total_size DESC;
https://cloudonaut.io/how-to-analyze-and-reduce-s3-storage-usage/
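As referenced in step 1, here is a rough boto3 sketch for enabling the inventory. The bucket names, prefix, and inventory ID mirror the example LOCATION above; the field list is an assumption, and the ORC format matches the SerDe used by the Athena table.
import boto3

s3 = boto3.client("s3")

# enable a daily ORC inventory for the bucket, delivered to $InventoryBucket/$InventoryPrefix
s3.put_bucket_inventory_configuration(
    Bucket="testme16",                      # the bucket being inventoried ($Bucket)
    Id="myinventory",                       # the $InventoryName
    InventoryConfiguration={
        "Id": "myinventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::testme16",   # the $InventoryBucket
                "Prefix": "invent",                  # the $InventoryPrefix
                "Format": "ORC",
            }
        },
        "OptionalFields": [
            "Size", "LastModifiedDate", "ETag", "StorageClass",
            "IsMultipartUploaded", "ReplicationStatus", "EncryptionStatus",
        ],
    },
)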
Labels: athena, aws