Shantanu's Blog

Database Consultant

July 01, 2025

 

Disable DynamoDB table

I have a DynamoDB table called "autocorrect" and I need to stop all traffic to/from this table. I can use a resource-based policy like this...

{
  "Version": "2012-10-17",
  "Id": "PolicyId",
  "Statement": [
    {
      "Sid": "AccessDisabledTemporary",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "dynamodb:*",
      "Resource": "arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect"
    }
  ]
}
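The policy can be attached from the console, or from Python. Here is a minimal sketch with boto3, assuming a recent boto3 release that exposes the put_resource_policy API (the account ID placeholder is the one from the policy above):

import json
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# deny-all policy, same as the document shown above
policy = {
    "Version": "2012-10-17",
    "Id": "PolicyId",
    "Statement": [
        {
            "Sid": "AccessDisabledTemporary",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "dynamodb:*",
            "Resource": "arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect",
        }
    ],
}

# attach the resource-based policy to the table; the blanket Deny also covers
# the policy APIs themselves, so the lock-out has to be confirmed explicitly
dynamodb.put_resource_policy(
    ResourceArn="arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect",
    Policy=json.dumps(policy),
    ConfirmRemoveSelfResourceAccess=True,
)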



June 02, 2025

 

Manage AWS resources using command line

1) add access key and secret key of a read-only user

aws configure

2) I need to install Amazon Q using the instructions found on this page...

https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing-ssh-setup-autocomplete.html

3) I can now start the Q service using the command...

./q/bin/q

Use natural language instructions like "list all DynamoDB tables" or "list all S3 buckets".

4) For advanced users, I can create an MCP server and save database credentials like username and password so that Q can query the database and return results.

https://awslabs.github.io/mcp/servers/dynamodb-mcp-server/



December 03, 2024

 

Avoid uploading a file to S3 again

Let's assume I uploaded a file to S3:

aws s3 cp dictionaries.xcu s3://cf-templates-us-east-1/

I need to upload that file only if it does not already exist. In that case I use the --if-none-match parameter as shown below:

aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body dictionaries.xcu --if-none-match "*"   

Returns "An error occurred (PreconditionFailed)"

This will help me avoid re-uploading a large file by mistake.
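The same conditional write can be done from Python. A minimal sketch with boto3, assuming a recent boto3 release that supports the IfNoneMatch parameter on put_object:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

try:
    # write the object only if no object with this key exists yet
    s3.put_object(
        Bucket="cf-templates-us-east-1",
        Key="dictionaries.xcu",
        Body=open("dictionaries.xcu", "rb"),
        IfNoneMatch="*",
    )
except ClientError as e:
    # 412 PreconditionFailed means the key already exists, so nothing was uploaded
    if e.response["Error"]["Code"] == "PreconditionFailed":
        print("object already exists, skipping upload")
    else:
        raise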

_____

The following features are available for S3 Express One Zone:

1) In directory buckets, clients can perform conditional delete checks on an object's last modified time, size, and ETag using the x-amz-if-match-last-modified-time, x-amz-if-match-size, and HTTP If-Match headers.

2) Append data to a file:

aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body file001.bin --write-offset-bytes <current-object-size-in-bytes>

Or use python:

import boto3
s3 = boto3.client('s3')
s3.put_object(Bucket='amzn-s3-demo-bucket--use2-az2--x-s3', Key='2024-11-05-sdk-test', Body=b'123456789', WriteOffsetBytes=9)

It cannot replace your database or messaging queues, because only a few thousand appends are possible for each object.

3) You can configure S3 Lifecycle rules for S3 Express One Zone to expire objects on your behalf. For example, you can create an S3 Lifecycle rule that expires all objects smaller than 512 KB after 3 days and another rule that expires all objects in a prefix after 10 days.
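A sketch of such a lifecycle configuration with boto3; the directory bucket name is taken from the append example above, and the "logs/" prefix is an illustrative assumption:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="amzn-s3-demo-bucket--use2-az2--x-s3",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-small-objects",
                "Status": "Enabled",
                # objects smaller than 512 KB expire after 3 days
                "Filter": {"ObjectSizeLessThan": 512 * 1024},
                "Expiration": {"Days": 3},
            },
            {
                "ID": "expire-prefix",
                "Status": "Enabled",
                # everything under this (hypothetical) prefix expires after 10 days
                "Filter": {"Prefix": "logs/"},
                "Expiration": {"Days": 10},
            },
        ]
    },
)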




September 28, 2024

 

Firefox and LibreOffice in your browser

Kasm VNC is a modern open source VNC server.

Quickly connect to your Linux server's desktop from any web browser.
No client software install required.

1) Firefox using VNC

docker run -d \
--name=firefox \
-e PUID=1000 \
-e PGID=1000 \
-e TZ=Etc/UTC \
-p 3000:3000 \
-p 3001:3001 \
-v /path/to/config2:/config \
--shm-size="1gb" \
--restart unless-stopped \
lscr.io/linuxserver/firefox:latest

2) LibreOffice using VNC

docker run -d \
  --name=libreoffice \
  --security-opt seccomp=unconfined `#optional` \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=Etc/UTC \
  -p 3000:3000 \
  -p 3001:3001 \
  -v /path/to/config:/config \
  --restart unless-stopped \
  lscr.io/linuxserver/libreoffice:latest



June 27, 2024

 

Disable DynamoDB table access

I can disable all access to a DynamoDB table using a resource-based policy. Here is an example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": {
        "AWS": "*"
      },
      "Action": "dynamodb:*",
      "Resource": "arn:aws:dynamodb:us-east-1:XXX885053566:table/sandhiDupe"
    }
  ]
}

There are many other advantages of managing access at the resource level.



April 29, 2024

 

Partitioning and bucketing S3 data

Let's assume we need to optimize this query. The data is stored in S3 (a few TB in total):

SELECT * FROM "bucketing_blog"."noaa_remote_original"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

There are around 14,325 unique stations, and the report_type column contains around 13 distinct values.

1) The first thing to try is creating a table partitioned by report_type:

CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-non-bucketed/',
    partitioned_by = ARRAY['report_type'],
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";

If you need faster and cheaper queries, add bucketing as well:

CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-bucketed/',
    partitioned_by = ARRAY['report_type'],
    bucketed_by = ARRAY['station'],
    bucket_count = 16,
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original"

I created 16 buckets because that is the maximum number of stations that may appear in the SQL query.

https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/



September 27, 2023

 

Enable S3 versioning

It is highly recommended to enable versioning for all your S3 buckets. In case you delete a file by mistake, you can find the version ID of the older version and download that copy.

1) Find the version ID:
aws s3api list-object-versions --bucket cdk-hnb  --prefix tt/archive.tar

2) Download that version:
aws s3api get-object --bucket cdk-hnb --key tt/archive.tar archive.tar --version-id mqfya

3) List all versions:
aws s3api list-object-versions --bucket cdk-hnb

If versioning is enabled, you cannot remove the bucket easily. You first need to "empty" the bucket (including all object versions) and then delete it.
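For completeness, versioning itself can be turned on with a single API call. A minimal boto3 sketch (bucket name reused from the commands above):

import boto3

s3 = boto3.client("s3")

# enable versioning on the bucket; use Status="Suspended" to stop creating new versions
s3.put_bucket_versioning(
    Bucket="cdk-hnb",
    VersioningConfiguration={"Status": "Enabled"},
)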



November 01, 2022

 

Check for open ports

This code will check whether any security group rule is open to the world (0.0.0.0/0) and send an alert to the subscribers of an SNS topic.

import boto3, json

ec2 = boto3.client('ec2', region_name='us-east-1')
sns_client = boto3.client('sns', region_name='us-east-1')

for security_group in ec2.describe_security_groups()['SecurityGroups']:
    for permission in security_group['IpPermissions']:
        for ip_range in permission.get('IpRanges', []):
            print(security_group['GroupId'], ip_range)
            # 0.0.0.0/0 means the rule is open to the whole internet
            if '0.0.0.0' in ip_range.get('CidrIp', ''):
                message = {"alert": "open port found"}
                response = sns_client.publish(
                    TargetArn='arn:aws:sns:us-east-1:102378362623:NotifyMe',
                    Message=json.dumps({'default': json.dumps(message)}),
                    MessageStructure='json')

You may need to change the region name and the SNS topic ARN in the code mentioned above.
This code can also be deployed as a Lambda function and scheduled to run every day, as sketched below.
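A minimal sketch of the same check packaged as a Lambda handler; the daily trigger would come from an EventBridge (CloudWatch Events) schedule, and the topic ARN is the one used above:

import boto3, json

SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:102378362623:NotifyMe"

def lambda_handler(event, context):
    # clients default to the Lambda function's own region
    ec2 = boto3.client("ec2")
    sns = boto3.client("sns")
    open_groups = []
    # collect every security group that has a rule open to 0.0.0.0/0
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for permission in sg["IpPermissions"]:
            for ip_range in permission.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    open_groups.append(sg["GroupId"])
    if open_groups:
        message = {"alert": "open port found", "groups": sorted(set(open_groups))}
        sns.publish(
            TargetArn=SNS_TOPIC_ARN,
            Message=json.dumps({"default": json.dumps(message)}),
            MessageStructure="json",
        )
    return {"open_groups": sorted(set(open_groups))}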



July 06, 2021

 

Quick test tables in Athena

If you quickly want to test data using an Athena query, use the WITH syntax as shown below:


WITH countries(country_code) AS (VALUES 'pol', 'CAN', 'USA')
SELECT upper(country_code) AS country_code
    FROM countries
 
In this example, the aggregate expression uses the OVER clause (a window function).
 
WITH students_results(student_id, result) AS (VALUES
    ('student_1', 17),
    ('student_2', 16),
    ('student_3', 18),
    ('student_4', 18),
    ('student_5', 10),
    ('student_6', 20),
    ('student_7', 16))
SELECT
    student_id,
    result,
    count(*) OVER (
        ORDER BY result
 ) AS close_better_scores_count
FROM students_results
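These queries can also be run from Python. A sketch using the awswrangler setup shown elsewhere on this blog; the database name here is only an assumption, and any existing Glue database works since the query reads inline VALUES:

import awswrangler as wr

sql = """
WITH countries(country_code) AS (VALUES 'pol', 'CAN', 'USA')
SELECT upper(country_code) AS country_code
FROM countries
"""

# returns a pandas dataframe with the query result
df = wr.athena.read_sql_query(sql, database="awswrangler_test")
print(df)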



May 28, 2021

 

Athena and Unicode text

Athena supports Unicode characters very well. For example, if the data file looks like this...

"Root_word";"Word";"Primary";"Type";"Code";"Position";"Rule"
"अँटिबायोटिक","अँटिबायोटिक","अँटिबायोटिक","Primary","","",""
"अँटिबायोटिक","अँटिबायोटिकअंती","अँटिबायोटिक","Suffix","A","7293","001: 0 अंती ."
"अँटिबायोटिक","अँटिबायोटिकअर्थी","अँटिबायोटिक","Suffix","A","7293","002: 0 अर्थी ."
"अँटिबायोटिक","अँटिबायोटिकआतून","अँटिबायोटिक","Suffix","A","7293","003: 0 आतून ."
"अँटिबायोटिक","अँटिबायोटिकआतूनचा","अँटिबायोटिक","Suffix","A","7293","004: 0 आतूनचा ."
"अँटिबायोटिक","अँटिबायोटिकआतूनची","अँटिबायोटिक","Suffix","A","7293","005: 0 आतूनची ."
"अँटिबायोटिक","अँटिबायोटिकआतूनचे","अँटिबायोटिक","Suffix","A","7293","006: 0 आतूनचे ."
"अँटिबायोटिक","अँटिबायोटिकआतूनच्या","अँटिबायोटिक","Suffix","A","7293","007: 0 आतूनच्या ."
"अँटिबायोटिक","अँटिबायोटिकआतूनला","अँटिबायोटिक","Suffix","A","7293","008: 0 आतूनला ."

This create table statement is all I need...

create external table myptg (
root_word varchar(255),
derived_word varchar(255),
stemmed_word varchar(255),
type varchar(255),
code varchar(255),
position varchar(255),
rule varchar(255)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = '\;',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
LOCATION 's3://ptg1/mc/'
TBLPROPERTIES ("skip.header.line.count"="1");

I can create a supporting table like this...

create external table gamabhana (derived_word varchar(255))
LOCATION 's3://ptg1/mc2/'
TBLPROPERTIES ("skip.header.line.count"="1");

A new table can be created using syntax something like this...
 
create external table anoop (
serial_number int,
root_word varchar(255),
stem1_word varchar(255),
stem2_word varchar(255),
stem3_word varchar(255),
stem4_word varchar(255),
stem5_word varchar(255),
stem6_word varchar(255),
stem7_word varchar(255)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  'escape.delim' = '\\',
  'line.delim' = '\n'
)
LOCATION 's3://ptg1/mc3/'
TBLPROPERTIES ("skip.header.line.count"="1");


And then run a join statement like this...

create table gamabhana_match as
select a.derived_word, b.root_word, b.stemmed_word, b.type, b.code, b.position, b.rule, c.stem1_word, c.stem2_word, c.stem3_word, c.stem4_word, c.stem5_word, c.stem6_word, c.stem7_word
from gamabhana as a left join myptg as b
on b.derived_word = a.derived_word
left join anoop as c
on c.derived_word = a.derived_word

It will scan around 2 GB of data (in this case), and the cost will be around 1 cent per query. This can also be done in MySQL, but importing data and building indexes is not easy. Unlike Athena, MySQL allows unlimited queries for free!

Athena is a good fit for data that is important but rarely accessed.



April 17, 2021

 

Crawl Indian language websites

CommonCrawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

https://commoncrawl.org/

Here are 3 easy steps to download the data in any language, e.g. Marathi or Gujarati.

1) Clone the repo: git clone https://github.com/qburst/common-crawl-malayalam.git

2) Change the language code from mal to mar in this line:
AND content_languages LIKE 'mal%'" 

The other codes are: hin (Hindi), mal (Malayalam), mar (Marathi), guj (Gujarati)

Run this shell script after replacing XXX XXX with your AWS access key and secret key.

./extract_malayalam_warcs.sh XXX XXX 2021-10 s3://nlp-malayalam

3) Change the Unicode range from mal: range(3328, 3456) to your desired language range, e.g. Devanagari: range(2304, 2432) or Gujarati: range(2688, 2816).

Run this shell script after replacing XXX XXX with your AWS access key and secret key.

./filter_malayalam.sh XXX XXX s3://nlp-malayalam/2021-10/warcs s3://nlp-malayalam/2021-10/output2

Filtered and unfiltered data will be available in the output2 sub-folder of the S3 bucket.

_____

Update:

The good folks at Sorbonne University and Inria have extracted sentences from Common Crawl for all languages and made them available at

https://oscar-corpus.com/

I have no words to thank them. The paper is available here... 

http://corpora.ids-mannheim.de/CMLC7-final/CMLC-7_2019-Oritz_et_al.pdf

Without this data, it would have been very difficult to extract reliable data for languages like Marathi and Gujarati.




October 28, 2020

 

Using Lambda Functions as UDF's in Redshift

Let's assume I have a list of client_codes saved in a Redshift table and I need to fetch the details from an API.

# select client_code from some_table limit 10;
  client_code   |
--------------+
 1001 |  
 2002 |  
 9009 |  
 1009 |  
 1898 |  
 5465 |  
 3244 |  
 5576 |  
 4389 |  
 8756 |  
(10 rows)

I need to get the client addresses from a website. For example, the first client code is 1001 and its address should come from
http://some_site.com/Details?dest=1001

This cannot be done at the SQL query level. You need to loop through an array using Python, PHP, Java, etc. You can also write your scripts as AWS Lambda functions and use them as UDFs (User Defined Functions) in Redshift. For example:

# select client_code, client_details(client_code) as c_address from some_table limit 10;
  client_code   |                  c_address
--------------+---------------------------------------------
 1001 | 21,Tamilnadu,
 2002 | 14,Madhya Pradesh & Chattisgarh,
 9009 | 7,Gujarat,
 1009 | 23,Uttar Pradesh (W) & Uttarakhand
 1898 | 11,Karnataka
 5465 | 3,Bihar & Jharkhand
 3244 | 11,Karnataka
 5576 | 6,Delhi
 4389 | 13,Kolkata
 8756 | 11,Karnataka
(10 rows)

The code of "client_details" Lambda function will look something like this...

import json
import requests

myurl = 'http://some_site.com/Details?dest='

def lambda_handler(event, context):
    # Redshift sends the UDF arguments as a list of rows in event['arguments']
    # and expects a JSON response with "success" and "results" keys
    ret = dict()
    res = list()
    for argument in event['arguments']:
        try:
            number = str(argument[0])
            page = requests.get(myurl + number[-10:])
            res.append((page.content).decode('utf-8'))
            ret['success'] = True
        except Exception as e:
            res.append(None)
            ret['success'] = False
            ret['error_msg'] = str(e)
    ret['results'] = res
    return json.dumps(ret)
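A quick way to test the handler locally (in the same module as the code above) is to invoke it with the same batch structure that Redshift sends, i.e. a list of argument rows under "arguments". A small sketch, with client codes taken from the table above:

# simulate the Redshift invocation payload and print the JSON response
event = {"arguments": [["1001"], ["2002"], ["9009"]]}
print(lambda_handler(event, None))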

Notes:
1) We are using "requests" module in this code. Since it is not available in AWS Lambda environment, I have added it using this layer...
# Layer: arn:aws:lambda:us-east-1:770693421928:layer:Klayers-python38-requests:9

2) You will also need to increase the Lambda timeout, up to a maximum of 15 minutes. The API may take more than the default 3 seconds to respond.

3) You will also have to update the IAM role associated with your Redshift cluster (Actions - Manage IAM roles). You can add the policy called "AWSLambdaFullAccess" or grant access to a single function as explained in the documentation.

The Lambda function needs to be "linked" to Redshift using a CREATE EXTERNAL FUNCTION statement like this...

CREATE OR REPLACE EXTERNAL FUNCTION client_details (number varchar )
RETURNS varchar STABLE
LAMBDA 'client_details'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyUnload';

You need to change the IAM role name and the 12-digit account ID in the IAM role ARN mentioned above.

You can now use your Lambda function in your Redshift query, for example:

# select client_code, client_details(client_code) as c_address from some_table limit 10;

You can read more...

# https://aws.amazon.com/blogs/big-data/accessing-external-components-using-amazon-redshift-lambda-udfs/



October 24, 2020

 

Using Bucketing in Amazon Athena

To reduce the data scan cost, Amazon Athena provides an option to bucket your data. This optimization technique can significantly reduce the amount of data scanned and, therefore, the cost.

Like partitioning, columns that are frequently used to filter the data are good candidates for bucketing. However, unlike partitioning, with bucketing it’s better to use columns with high cardinality as a bucketing key. For example, Year and Month columns are good candidates for partition keys, whereas userID and sensorID are good examples of bucket keys. By doing this, you make sure that all buckets have a similar number of rows.

Bucketing is a technique that groups data based on specific columns together within a single partition. These columns are known as bucket keys. By grouping related data together into a single bucket (a file within a partition), you significantly reduce the amount of data scanned by Athena, thus improving query performance and reducing cost.

For example, imagine collecting and storing clickstream data. If you frequently filter or aggregate by Sensor ID, then within a single partition it’s better to store all rows for the same sensor together.

CREATE TABLE TargetTable
WITH (
      format = 'PARQUET',
      external_location = 's3://<s3_bucket_name>/curated/',
      partitioned_by = ARRAY['dt'],
      bucketed_by = ARRAY['sensorID'],
      bucket_count = 3)
AS SELECT *
FROM SourceTable

You can run the select query like this:

select * from TargetTable where dt= '2020-08-04-21' and sensorID = '1096'



August 24, 2020

 

Manage Athena tables using python module wrangler

# create a boto session

import boto3
my_session = boto3.Session(
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
    region_name="us-east-1",
)

# use wrangler to upload parquet files to S3
# use the database created in glue

import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://test1623/noaa/",
    dataset=True,
    database="awswrangler_test",
    table="noaa",
    partition_cols=["year"],
    boto3_session=my_session,
)

# Read Athena query into pandas dataframe
df = wr.athena.read_sql_query(
    "select * from noaa limit 10", database="awswrangler_test", boto3_session=my_session
)

https://aws.amazon.com/blogs/big-data/optimize-python-etl-by-extending-pandas-with-aws-data-wrangler/



July 31, 2020

 

Using Athena to preview large S3 files

Athena can also be used to quickly preview CSV or text files saved in S3.

# vi symlink.txt
s3://some_bucket/abc_servers_backup/bulk_205_aws/db/xyz_no_details_202006.csv.gz

As you can see, I have saved the entire path in a symlink text file and uploaded it to S3 using this command...

aws s3 cp symlink.txt s3://132cols/symlink/

I will now use the path of the symlink file while creating the table, as shown below.

CREATE EXTERNAL TABLE test_bulk (
  dummy_col string)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '~'
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://132cols/symlink/'
TBLPROPERTIES ('has_encrypted_data'='false');

The important point to note is that I have changed the comma (,) in the "FIELDS TERMINATED BY" parameter to a tilde (~), or else I would get only the first column. Due to this change, I can see the entire row in a single column called "dummy_col". This tip is useful when you want to check what's in a large file without downloading it first. This method can also be used to re-check that a file you have uploaded to S3 is valid and readable.



April 12, 2020

 

Launch spot EC2 instances

The following code will initiate a Linux instance of type m3.medium using spot pricing and associate it with the IP address 13.228.39.49. Make sure to use your own Elastic IP address and key pair. Do not forget to change the access_key and secret_key parameters.

!wget https://raw.githubusercontent.com/shantanuo/easyboto/master/easyboto.py
 
import easyboto
dev=easyboto.connect('access_key', 'secret_key')
dev.placement='us-east-1a'
dev.myaddress='13.228.39.49'
dev.key='dec15abc'
dev.MAX_SPOT_BID= '2.9'
dev.startEc2Spot('ami-0323c3dd2da7fb37d', 'm3.medium')

This will return the instance id and the ssh command that you can use to connect to your instance. The output will look something like...

job instance id: i-029a926e68118d089
ssh -i dec15a.pem ec2-user@13.228.39.49

You can list all instances along with details like launch time and image_id, and save the results as a pandas dataframe using the showEc2 method like this...

df=dev.showEc2()

Now "df" is a pandas dataframe object. You can sort or groupby the instances just like an excel sheet.

You can delete the instance by passing the instance ID that was generated in the first step to the deleteEc2 method.

dev.deleteEc2('i-029a926e68118d089')
_____

You can also use a CloudFormation template for this purpose. Visit the following link and look for the "Linux EC2 Instance on SPOT" section.

https://github.com/shantanuo/cloudformation

Click on "Launch Stack" button. It is nothing but GUI for the python code mentioned above. You will simply have to submit a form for the methods like key and IP address.



April 07, 2020

 

emailThis service using serverless API

There are times when I find a great article or web page but don't have time to read it. I use the EmailThis service to save text and images from a website to my email inbox. The concept is very simple: drag and drop a bookmarklet to the bookmark toolbar and click on it to send the current web page to your inbox!

https://www.emailthis.me/

But I did not like the premium ads and partial content that the site sends. So I built my own serverless API to get exactly the same functionality using Mailgun and Amazon Web Services.

https://www.mailgun.com/

Once you register with Mailgun, you will get a URL and an API key that you need to copy-paste to a notepad. You will need to provide this information when you launch the CloudFormation template by clicking on this link.


https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=emailThis&templateURL=https://datameetgeobk.s3.amazonaws.com/cftemplates/furl.yaml.txt

Once the resources are created, you can see a URL in the Outputs section, something like this...

https://ie5n05bqo0.execute-api.us-east-1.amazonaws.com/mycall

Now building the JavaScript bookmarklet is easy.

javascript:(function(){location.href='https://ie5n05bqo0.execute-api.us-east-1.amazonaws.com/mycall?email=shantanu.oak@gmail.com&title=emailThis&url='+encodeURIComponent(location.href);})();

Right click on any bookmark and then copy-paste the above link. Make sure that you have changed the URL and email address to your own. Now click on this bookmarklet while you are on an important web page that you need to send to your inbox. Enjoy!
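For reference, the Lambda behind such an endpoint could look roughly like the sketch below. This is only an illustration of the idea, not necessarily what the linked CloudFormation template deploys; the Mailgun domain and API key environment variables and the simple page fetch are assumptions:

import json, os, urllib.request
import requests  # added via a Lambda layer, as in the Redshift UDF post

MAILGUN_DOMAIN = os.environ["MAILGUN_DOMAIN"]
MAILGUN_API_KEY = os.environ["MAILGUN_API_KEY"]

def lambda_handler(event, context):
    params = event.get("queryStringParameters") or {}
    email = params["email"]
    title = params.get("title", "emailThis")
    url = params["url"]

    # fetch the page the bookmarklet was clicked on
    page_html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

    # send it to the reader's inbox through the Mailgun messages API
    requests.post(
        "https://api.mailgun.net/v3/%s/messages" % MAILGUN_DOMAIN,
        auth=("api", MAILGUN_API_KEY),
        data={
            "from": "emailThis <mailgun@%s>" % MAILGUN_DOMAIN,
            "to": [email],
            "subject": title,
            "html": page_html,
        },
    )
    return {"statusCode": 200, "body": json.dumps({"sent_to": email})}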



March 26, 2020

 

Packetbeat to elastic server hosted by AWS

These are the 3 steps to follow if you want to push Packetbeat or Metricbeat data to an AWS Elasticsearch instance.

Make sure that your Elasticsearch instance has whitelisted the IP address of the server where you are installing Packetbeat.

1) Install
2) Configure
3) Start the service

1) Installation of Beats depends upon the processor type + OS + whether you are connecting to AWS!

arm + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-arm64.deb
dpkg -i packetbeat-oss-7.11.0-arm64.deb

arm + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-arm64.deb
dpkg -i packetbeat-7.11.0-arm64.deb

arm + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-aarch64.rpm
rpm -ivh packetbeat-oss-7.11.0-aarch64.rpm

arm + redhat + connect to non-AWS
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-aarch64.rpm
rpm -ivh packetbeat-7.11.0-aarch64.rpm

Intel + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-amd64.deb
dpkg -i packetbeat-oss-7.11.0-amd64.deb

Intel + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-amd64.deb
dpkg -i packetbeat-7.11.0-amd64.deb

Intel + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-x86_64.rpm
rpm -ivh packetbeat-oss-7.11.0-x86_64.rpm

Intel + redhat + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-x86_64.rpm
rpm -ivh packetbeat-7.11.0-x86_64.rpm


2) Open the config file and change two settings as shown below:
vi /etc/packetbeat/packetbeat.yml
hosts: ["search-audit-ay52ddaf7q7zudixfmluji4osmx.us-east-1.es.amazonaws.com:443"]
protocol: "https"

If you are using a server on the local network, mention its address in the hosts list. The protocol option can be disabled if you are not using a secure HTTPS connection.

Optionally, enable the "send_response" parameter to capture the query and its output as shown below:

packetbeat.protocols.mysql:
  ports: [3306]
  send_response: true
  max_row_length: 4000

The max_row_length option will allow you to capture lengthy queries and their large output.

Disable the X-Pack settings if connecting to AWS or using the open source license:

# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
setup.ilm.enabled: false
setup.pack.security.enabled: false
setup.xpack.graph.enabled: false
setup.xpack.watcher.enabled: false
setup.xpack.monitoring.enabled: false
setup.xpack.reporting.enabled: false

3) Start packetbeat service
cd /usr/bin/
./packetbeat  -e -c packetbeat.yml



 

snapshot and restore elastic data to S3

Here are the steps to back up and restore Elasticsearch data from an AWS Elasticsearch instance.

1) Create IAM role
2) Use the boto script to create snapshot repo
3) Use Kibana to take the actual snapshot

All the steps are explained here...

https://forums.aws.amazon.com/message.jspa?messageID=930345#930345

Here is step-by-step guidance for the same.

1) Create IAM role:
Use the following CloudFormation template to create the role, and note the ARN of the role to be used in the boto script.

https://github.com/shantanuo/cloudformation/blob/master/updated/esbck.yml

2) Run this boto script:
Change the access key and secret key. Also change the Elasticsearch endpoint. Add the role ARN created in the first step.

from boto.connection import AWSAuthConnection

class ESConnection(AWSAuthConnection):
    def __init__(self, region, **kwargs):
        super(ESConnection, self).__init__(**kwargs)
        self._set_auth_region_name(region)
        self._set_auth_service_name("es")

    def _required_auth_capability(self):
        return ["hmac-v4"]

client = ESConnection(
    region="us-east-1",
    host="search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
    is_secure=False,)

headers = {"Content-Type": "application/json"}
resp = client.make_request(
    method="PUT",
    headers=headers,
    path="/_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt",
    data='{"type": "s3","settings": { "bucket": "esbck-essnapshotbucket-c9e6d7fy1cbt","region": "us-east-1", "role_arn": "arn:aws:iam::1234567890:role/esbck-EsSnapshotRole-GJGMPH4DBMM3"}}')

resp.read()
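If you would rather not use the legacy boto library, the same repository registration can be done with the requests library and SigV4 signing. A sketch that assumes the requests-aws4auth package is installed; the endpoint, bucket, and role ARN are the same placeholders as above:

import requests
from requests_aws4auth import AWS4Auth

host = "https://search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com"
awsauth = AWS4Auth("xxx", "xxx", "us-east-1", "es")  # access key, secret key, region, service

payload = {
    "type": "s3",
    "settings": {
        "bucket": "esbck-essnapshotbucket-c9e6d7fy1cbt",
        "region": "us-east-1",
        "role_arn": "arn:aws:iam::1234567890:role/esbck-EsSnapshotRole-GJGMPH4DBMM3",
    },
}

# register the snapshot repository
resp = requests.put(
    host + "/_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt",
    auth=awsauth,
    json=payload,
)
print(resp.status_code, resp.text)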

3) Take the backup from Kibana

PUT /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1

GET /_cat/indices

DELETE /cwl-2020.03.26

POST /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1/_restore

GET /_snapshot/

GET /_cat/snapshots/esbck-essnapshotbucket-c9e6d7fy1cbt



March 12, 2020

 

Analyze S3 storage usage using inventory

1) Enable inventory for the bucket
2) Create a table using Athena to read the inventory data
3) Run a select query

1) Enable inventory using boto as shown in this Stack Overflow thread:

https://stackoverflow.com/questions/60615911/use-boto-to-enable-inventory
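A sketch of the same setup with boto3, reusing the bucket, prefix, and inventory name that appear in the LOCATION example below (ORC format matches the OrcSerde in the Athena DDL):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="testme16",                      # bucket to analyze
    Id="myinventory",
    InventoryConfiguration={
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::testme16",   # destination bucket (same bucket here)
                "Format": "ORC",
                "Prefix": "invent",
            }
        },
        "IsEnabled": True,
        "Id": "myinventory",
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": [
            "Size", "LastModifiedDate", "ETag", "StorageClass",
            "IsMultipartUploaded", "ReplicationStatus", "EncryptionStatus",
        ],
    },
)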

2) Replace $InventoryBucket, $InventoryPrefix, $Bucket, and $InventoryName with your configuration values.

CREATE EXTERNAL TABLE inventory(
  bucket string,
  key string,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://$InventoryBucket/$InventoryPrefix/$Bucket/$InventoryName/hive';

The location URL will look something like this...

  LOCATION 's3://testme16/invent/testme16/myinventory/hive';

Repair table to read partitions:

MSCK REPAIR TABLE inventory;

3) Replace $Yesterday with yesterday’s timestamp (e.g., 2020-03-03-00-00):

SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
  SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, i.size
  FROM inventory AS i WHERE i.dt = '$Yesterday'
) GROUP BY prefix ORDER BY total_size DESC;

https://cloudonaut.io/how-to-analyze-and-reduce-s3-storage-usage/


