How to create a JSON document from a pandas dataframe
If my data looks like this, how do I create a document per user?
Name Sem Subject Grade
0 John Sem1 Mathematics A
1 Sara Sem1 Biology B
2 John Sem2 Biology A+
3 Sara Sem2 Mathematics B++
In this case, John and Sara are the two users, each appearing in 2 semesters. I need a nested JSON document for each user.
John's data will look like this...
{
  "John": {
    "Sem1": {
      "Subject": "Mathematics",
      "Grade": "A"
    },
    "Sem2": {
      "Subject": "Biology",
      "Grade": "A+"
    }
  }
}
Here is how to create the dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Sara", "John", "Sara"],
        "Sem": ["Sem1", "Sem1", "Sem2", "Sem2"],
        "Subject": ["Mathematics", "Biology", "Biology", "Mathematics"],
        "Grade": ["A", "B", "A+", "B++"],
    }
)
This is obviously a groupby problem, but we also need to export the result to a dictionary. A dictionary comprehension makes sure all users end up in a single document.
ppdict = {
    n: grp.loc[n].to_dict("index")
    for n, grp in df.set_index(["Name", "Sem"]).groupby(level="Name")
}
To display the data correctly, use the json module like this...
import json
print(json.dumps(ppdict, indent=2))
The output will look like this...
{
  "John": {
    "Sem1": {
      "Subject": "Mathematics",
      "Grade": "A"
    },
    "Sem2": {
      "Subject": "Biology",
      "Grade": "A+"
    }
  },
  "Sara": {
    "Sem1": {
      "Subject": "Biology",
      "Grade": "B"
    },
    "Sem2": {
      "Subject": "Mathematics",
      "Grade": "B++"
    }
  }
}
https://kanoki.org/2020/03/24/convert-pandas-dataframe-to-dictionary/
Labels: pandas
Here is an interesting blog post about analyzing redshift queries.
https://thedataguy.in/reconstruct-redshift-stl-querytext-using-aws-athena/
The author suggests using tables like STL_QUERYTEXT, which retain only 2 to 5 days of data.
If you need to save all the queries and keep them for years, follow these steps (a boto3 sketch follows the list):
a) Create a new parameter group and name it something like "with_logs".
b) Set "enable_user_activity_logging" to "true" in that parameter group.
c) Use the newly created parameter group while creating the Redshift cluster.
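For reference, here is a minimal boto3 sketch of steps (a) and (b). The group name "with_logs" matches the example above; the region, description, and credentials setup are assumptions.

import boto3

# assumes credentials and region are available from the environment / AWS config
redshift = boto3.client("redshift", region_name="us-east-1")

# a) create a new parameter group ("redshift-1.0" is the current family name)
redshift.create_cluster_parameter_group(
    ParameterGroupName="with_logs",
    ParameterGroupFamily="redshift-1.0",
    Description="parameter group with user activity logging enabled",
)

# b) set enable_user_activity_logging to true in that group
# (the parameter is static, so it takes effect when a cluster is created with
# this group or after the cluster is rebooted)
redshift.modify_cluster_parameter_group(
    ParameterGroupName="with_logs",
    Parameters=[
        {"ParameterName": "enable_user_activity_logging", "ParameterValue": "true"}
    ],
)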
Once the logs are generated, you can download them and study the queries executed by the users.
1) Download log files for any given day:
aws s3 sync s3://logredshift/mycompanylogs/AWSLogs/1234567890/redshift/us-east-1/2020/03/27/ .
2) Extract:
gunzip *
3) Remove windows line breaks:
dos2unix *
4) Remove linux line breaks so that each query record sits on a single line:
cat *useractivitylog* | tr '\n' ' ' | sed "s/\('[0-9]\{4\}\)/\r\n\1/g" > mylog.txt
5) Select query text:
cat mylog.txt | awk -F 'LOG:' '{print $2}' | sort -u > to_study.txt
6) Study the queries:
cat to_study.txt | sed '0~1 a\\' | more
This includes the system-generated queries as well, so the log may be difficult to analyze.
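If the system-generated statements get in the way, a rough Python filter like the one below can thin them out before you start reading. The patterns list is only a guess at what counts as a system query in your logs and will need tuning.

# drop lines that look like system-generated queries (patterns are an assumption)
patterns = ("select current_schema()", "pg_", "padb_", "stl_", "svv_", "volt_tt_")

with open("to_study.txt") as src, open("user_queries.txt", "w") as dst:
    for line in src:
        if not any(p in line.lower() for p in patterns):
            dst.write(line)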
_____
1) Download and install the latest version of the pgbadger utility from:
https://github.com/darold/pgbadger/releases
2) Create a new directory:
mkdir /tmp/todel/
cd /tmp/todel/
3) Download the logs for a month, e.g. March 2020:
aws s3 sync s3://alogredshift/AWSLogs/1234567890/redshift/us-east-1/2020/03/ .
4) Analyze:
pgbadger --format redshift `find /tmp/todel/ -name "*tylog*"` --dbname vadb --outfile /tmp/myq1.txt
Or use docker:
docker run -i --rm -v $(pwd):/workdir -v /tmp/:/tmp/ shantanuo/pgbadger --format redshift $(find /tmp/todel/ -name "*tylog*") --dbname vdb --exclude-query 'FROM pg_' --outfile /tmp/myq123xx2.txt
Or use --dump-all-queries to get all queries in non-normalized form.
Labels: redshift
If you are already using Compose, then you already know how important it is for Docker users. If you want to learn more about it, here are a few templates to start with.
$ git clone https://github.com/docker/awesome-compose.git
$ cd awesome-compose
$ cd nginx-flask-mysql
$ docker-compose up -d
$ curl localhost:80
https://www.docker.com/blog/awesome-compose-app-samples-for-project-dev-kickoff/
Labels: docker
These are the steps to follow if you want to push Packetbeat or Metricbeat data to an AWS Elasticsearch instance.
Make sure that your Elasticsearch instance has whitelisted the IP address of the server where you are installing Packetbeat.
1) Install
2) Configure
3) Start the service
1) Installation of Beats depends upon the processor type, the OS, and whether you are connecting to AWS.
arm + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-arm64.deb
dpkg -i packetbeat-oss-7.11.0-arm64.deb
arm + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-arm64.deb
dpkg -i packetbeat-7.11.0-arm64.deb
arm + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-aarch64.rpm
rpm -ivh packetbeat-oss-7.11.0-aarch64.rpm
arm + redhat + connect to non-AWS
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-aarch64.rpm
rpm -ivh packetbeat-7.11.0-aarch64.rpm
Intel + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-amd64.deb
dpkg -i packetbeat-oss-7.11.0-amd64.deb
Intel + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-amd64.deb
dpkg -i packetbeat-7.11.0-amd64.deb
Intel + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-x86_64.rpm
rpm -ivh packetbeat-oss-7.11.0-x86_64.rpm
Intel + redhat + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-x86_64.rpm
rpm -ivh packetbeat-7.11.0-x86_64.rpm
2) Open config file and change 2 settings as shown below:
vi /etc/packetbeat/packetbeat.yml
hosts: ["search-audit-ay52ddaf7q7zudixfmluji4osmx.us-east-1.es.amazonaws.com:443"]
protocol: "https"
If you are using a server on the local network, mention its address in the hosts list. The protocol option can be removed if you are not using a secure HTTP connection.
Optionally, enable the "send_response" parameter to capture the query and its output as shown below:
packetbeat.protocols.mysql:
ports: [3306]
send_response: true
max_row_length: 4000
The max_row_length option allows you to capture lengthy queries and their large responses.
Disable the X-Pack related settings if you are connecting to AWS or using the OSS build:
# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
setup.ilm.enabled: false
setup.xpack.security.enabled: false
setup.xpack.graph.enabled: false
setup.xpack.watcher.enabled: false
setup.xpack.monitoring.enabled: false
setup.xpack.reporting.enabled: false
3) Start the packetbeat service:
cd /usr/bin/
./packetbeat -e -c packetbeat.yml
Labels: aws, elastic
Here are the steps to back up and restore Elasticsearch data on an AWS Elasticsearch instance.
1) Create IAM role
2) Use the boto script to create snapshot repo
3) Use Kibana to take the actual snapshot
All the steps are explained here...
https://forums.aws.amazon.com/message.jspa?messageID=930345#930345
Here is step-by-step guidance.
1) Create IAM role:
Use the following CloudFormation template to create the role, and note the ARN of the role; it will be used in the boto script.
https://github.com/shantanuo/cloudformation/blob/master/updated/esbck.yml
2) Run this boto script:
Change the access key and secret key, and also change the Elasticsearch endpoint. Add the ARN created in the first step.
from boto.connection import AWSAuthConnection


class ESConnection(AWSAuthConnection):
    def __init__(self, region, **kwargs):
        super(ESConnection, self).__init__(**kwargs)
        self._set_auth_region_name(region)
        self._set_auth_service_name("es")

    def _required_auth_capability(self):
        return ["hmac-v4"]


client = ESConnection(
    region="us-east-1",
    host="search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
    is_secure=False,
)

headers = {"Content-Type": "application/json"}

resp = client.make_request(
    method="PUT",
    headers=headers,
    path="/_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt",
    data='{"type": "s3","settings": { "bucket": "esbck-essnapshotbucket-c9e6d7fy1cbt","region": "us-east-1", "role_arn": "arn:aws:iam::1234567890:role/esbck-EsSnapshotRole-GJGMPH4DBMM3"}}',
)

resp.read()
3) Take the backup from Kibana (a Python alternative is sketched after these commands):
PUT /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1
GET /_cat/indices
DELETE /cwl-2020.03.26
POST /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1/_restore
GET /_snapshot/
GET /_cat/snapshots/esbck-essnapshotbucket-c9e6d7fy1cbt
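If you prefer to script step 3 instead of typing the commands in Kibana, here is a rough sketch using the requests and requests_aws4auth packages (both need to be installed; the keys, endpoint, and repository name are the same placeholders used in the boto script above).

import requests
from requests_aws4auth import AWS4Auth

awsauth = AWS4Auth("xxx", "xxx", "us-east-1", "es")
endpoint = "https://search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com"
repo = "esbck-essnapshotbucket-c9e6d7fy1cbt"

# same as PUT /_snapshot/<repo>/snapshot_1 in Kibana
resp = requests.put(f"{endpoint}/_snapshot/{repo}/snapshot_1", auth=awsauth)
print(resp.status_code, resp.text)

# same as GET /_cat/snapshots/<repo>
print(requests.get(f"{endpoint}/_cat/snapshots/{repo}", auth=awsauth).text)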
Labels: aws, elastic
Effective visualization in pandas in just 7 lines of code.
url = "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD&bom=true&format=true&delimiter=%3B"
from urllib.request import urlretrieve
urlretrieve(url, "data.csv")
import pandas as pd
df = pd.read_csv("data.csv", delimiter=";", index_col="Date", parse_dates=True)
%matplotlib inline
df.resample("w").sum().plot()
# https://www.youtube.com/watch?v=_ZEWDGpM-vM&list=WL&index=98&t=23s
Labels: pandas
bamboolib - a GUI for pandas dataframes. Stop googling pandas commands
1) Get your free 14-day trial key:
https://bamboolib.8080labs.com/
2) Install the required Python packages and enable the notebook extensions:
pip install bamboolib
jupyter nbextension enable --py qgrid --sys-prefix
jupyter nbextension enable --py widgetsnbextension --sys-prefix
jupyter nbextension install --py bamboolib --sys-prefix
jupyter nbextension enable --py bamboolib --sys-prefix
3) Restart the docker container.
4) Start exploring visual pandas!
import bamboolib as bam
import pandas as pd
df = pd.read_csv(bam.titanic_csv)
df
_____
Import modules automatically when required!
pip install pyforest
conda install nodejs
python -m pyforest install_extensions
Restart the docker container.
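Once everything is restarted, the commonly used aliases are lazy-imported on first use. A minimal sketch (in a plain script you need the star import; inside Jupyter it is added for you once the extension is installed):

from pyforest import *  # not needed inside Jupyter after installing the extension

# pandas has not been imported explicitly; pyforest imports it the moment "pd" is used
df = pd.DataFrame({"city": ["Pune", "Mumbai"], "count": [10, 20]})
print(df)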
Labels: pandas, usability
1) Enable inventory for the bucket
2) Create a table using athena to read the inventory data
3) Run a select query
1) Enable inventory using boto as shown in this Stack Overflow thread (a rough boto3 sketch follows the link):
https://stackoverflow.com/questions/60615911/use-boto-to-enable-inventory
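The linked thread uses the old boto library; a boto3 equivalent is sketched below. The bucket name, prefix, and inventory id are hypothetical values chosen to match the LOCATION example further down.

import boto3

s3 = boto3.client("s3")

# hypothetical names: source/destination bucket "testme16", prefix "invent", id "myinventory"
s3.put_bucket_inventory_configuration(
    Bucket="testme16",
    Id="myinventory",
    InventoryConfiguration={
        "Id": "myinventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::testme16",
                "Prefix": "invent",
                "Format": "ORC",  # the Athena DDL below expects ORC files
            }
        },
        "OptionalFields": [
            "Size", "LastModifiedDate", "ETag", "StorageClass",
            "IsMultipartUploaded", "ReplicationStatus", "EncryptionStatus",
        ],
    },
)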
2) Replace $InventoryBucket, $InventoryPrefix, $Bucket, and $InventoryName with your configuration values:
CREATE EXTERNAL TABLE inventory(
bucket string,
key string,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://$InventoryBucket/$InventoryPrefix/$Bucket/$InventoryName/hive';
The location URL will look something like this...
LOCATION 's3://testme16/invent/testme16/myinventory/hive';
Repair table to read partitions:
MSCK REPAIR TABLE inventory;
3) Replace $Yesterday with yesterday's timestamp (e.g., 2020-03-03-00-00) and run the query (a boto3 sketch for running it programmatically follows the source link below):
SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, i.size
FROM inventory AS i WHERE i.dt = '$Yesterday'
) GROUP BY prefix ORDER BY total_size DESC;
https://cloudonaut.io/how-to-analyze-and-reduce-s3-storage-usage/
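To run the query above programmatically instead of from the Athena console, here is a rough boto3 sketch (the database name and result bucket are placeholders):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
  SELECT regexp_extract(i.key, '([^/]*/).*', 1) AS prefix, i.size
  FROM inventory AS i WHERE i.dt = '2020-03-03-00-00'
) GROUP BY prefix ORDER BY total_size DESC
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://testme16/athena-results/"},  # placeholder bucket
)
print(resp["QueryExecutionId"])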
Labels: athena, aws
Instead of using a "match_phrase" clause for each value, I can use a "wildcard" clause whose pattern replaces sub-domains like config and cloudtrail with a star (*).
This saves typing additional clauses and keeps the query short.
Old query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "sourceIPAddress": "1.2.3.4"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "5.6.7.8"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "config.amazonaws.com"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "cloudtrail.amazonaws.com"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
_____
New Improved query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "sourceIPAddress": "1.2.3.4"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "5.6.7.8"
          }
        },
        {
          "wildcard": {
            "sourceIPAddress.keyword": {
              "value": "*\\.amazonaws\\.com*"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
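For completeness, the same query can also be fired from Python; here is a rough sketch with the elasticsearch client (the endpoint and index pattern are placeholders, and the 7.x style body= argument is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://search-audit-ay52ddaf7q7zudixfmluji4osmx.us-east-1.es.amazonaws.com:443")

query = {
    "query": {
        "bool": {
            "should": [
                {"match_phrase": {"sourceIPAddress": "1.2.3.4"}},
                {"match_phrase": {"sourceIPAddress": "5.6.7.8"}},
                {"wildcard": {"sourceIPAddress.keyword": {"value": "*\\.amazonaws\\.com*"}}},
            ],
            "minimum_should_match": 1,
        }
    }
}

resp = es.search(index="cloudtrail-*", body=query)  # index pattern is a placeholder
print(resp["hits"]["total"])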
Labels: elastic
There are times when I need to remove salutations like Mr. or Mrs. from the name column in Redshift. I can write a user-defined function that does this.
# select f_extract_name2('mr shantanu oak');
f_extract_name2
-----------------
SHANTANU OAK
(1 row)
The function is written in Python and the source code looks like this...
CREATE OR REPLACE FUNCTION f_extract_name2 (myname varchar(1000)) RETURNS varchar(1000) IMMUTABLE as $$
try:
    remove_list = ['MR', 'MR.', 'MRS.', 'MRS', 'MISS', 'MASTER', 'MISS.', 'MASTER.']
    new_list = list()
    for i in myname.upper().split():
        if i not in remove_list:
            new_list.append(i)
    if len(new_list) == 2:
        return " ".join(new_list)
except:
    pass
$$ LANGUAGE plpythonu
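To sanity-check the logic before creating it in Redshift, the same body can be run as plain Python:

def f_extract_name2(myname):
    # mirrors the UDF body above
    remove_list = ['MR', 'MR.', 'MRS.', 'MRS', 'MISS', 'MASTER', 'MISS.', 'MASTER.']
    new_list = [i for i in myname.upper().split() if i not in remove_list]
    if len(new_list) == 2:
        return " ".join(new_list)
    return None

print(f_extract_name2("mr shantanu oak"))  # SHANTANU OAK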
Labels: python, redshift