How to create a JSON document from a pandas dataframe
If my data looks like this, how do I create a document per user?
Name Sem Subject Grade
0 John Sem1 Mathematics A
1 Sara Sem1 Biology B
2 John Sem2 Biology A+
3 Sara Sem2 Mathematics B++
In this case, John and Sara are the two users, each appearing in 2 semesters. I need a nested JSON document for each user.
John's data will look like this...
{
  "John": {
    "Sem1": {
      "Subject": "Mathematics",
      "Grade": "A"
    },
    "Sem2": {
      "Subject": "Biology",
      "Grade": "A+"
    }
  }
}
Here is how to create the dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Sara", "John", "Sara"],
        "Sem": ["Sem1", "Sem1", "Sem2", "Sem2"],
        "Subject": ["Mathematics", "Biology", "Biology", "Mathematics"],
        "Grade": ["A", "B", "A+", "B++"],
    }
)
This is obviously a groupby problem, but we also need to export the result to a dictionary. A dictionary comprehension makes sure all users end up in a single document.
ppdict = {
    n: grp.loc[n].to_dict("index")
    for n, grp in df.set_index(["Name", "Sem"]).groupby(level="Name")
}
To display the data correctly, use the json module like this...
import json
print(json.dumps(ppdict, indent=2))
The output will look like this...
{
  "John": {
    "Sem1": {
      "Subject": "Mathematics",
      "Grade": "A"
    },
    "Sem2": {
      "Subject": "Biology",
      "Grade": "A+"
    }
  },
  "Sara": {
    "Sem1": {
      "Subject": "Biology",
      "Grade": "B"
    },
    "Sem2": {
      "Subject": "Mathematics",
      "Grade": "B++"
    }
  }
}
https://kanoki.org/2020/03/24/convert-pandas-dataframe-to-dictionary/
Labels: pandas
Here is an interesting blog post about analyzing redshift queries.
https://thedataguy.in/reconstruct-redshift-stl-querytext-using-aws-athena/
The author suggests using tables like STL_QUERYTEXT, which retain only 2 to 5 days of data.
If you need to save all the queries and keep them for years, follow these steps (a boto3 sketch follows the list):
a) Create a new parameter group and name it something like "with_logs".
b) Set "enable_user_activity_logging" to "true" in that parameter group.
c) Use the newly created parameter group while creating the Redshift cluster.
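For reference, here is a minimal boto3 sketch of steps (a) and (b). The group name "with_logs" matches the example above; the region, description, and credentials setup are assumptions.

import boto3

# assumes credentials and region are available from the environment / AWS config
redshift = boto3.client("redshift", region_name="us-east-1")

# a) create a new parameter group ("redshift-1.0" is the current family name)
redshift.create_cluster_parameter_group(
    ParameterGroupName="with_logs",
    ParameterGroupFamily="redshift-1.0",
    Description="parameter group with user activity logging enabled",
)

# b) set enable_user_activity_logging to true in that group
# (the parameter is static, so it takes effect when a cluster is created with
# this group or after the cluster is rebooted)
redshift.modify_cluster_parameter_group(
    ParameterGroupName="with_logs",
    Parameters=[
        {"ParameterName": "enable_user_activity_logging", "ParameterValue": "true"}
    ],
)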
Once the logs are generated, you can download them and study the queries executed by the users.
1) Download log files for any given day:
aws s3 sync s3://logredshift/mycompanylogs/AWSLogs/1234567890/redshift/us-east-1/2020/03/27/ .
2) Extract:
gunzip *
3) Remove windows line breaks:
dos2unix *
4) Remove linux line breaks so that each query record sits on a single line:
cat *useractivitylog* | tr '\n' ' ' | sed "s/\('[0-9]\{4\}\)/\r\n\1/g" > mylog.txt
5) Select query text:
cat mylog.txt | awk -F 'LOG:' '{print $2}' | sort -u > to_study.txt
6) Study the queries:
cat to_study.txt | sed '0~1 a\\' | more
This includes the system-generated queries as well, so the log may be difficult to analyze.
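If the system-generated statements get in the way, a rough Python filter like the one below can thin them out before you start reading. The patterns list is only a guess at what counts as a system query in your logs and will need tuning.

# drop lines that look like system-generated queries (patterns are an assumption)
patterns = ("select current_schema()", "pg_", "padb_", "stl_", "svv_", "volt_tt_")

with open("to_study.txt") as src, open("user_queries.txt", "w") as dst:
    for line in src:
        if not any(p in line.lower() for p in patterns):
            dst.write(line)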
_____
1) Download and install the latest version of the pgbadger utility from:
https://github.com/darold/pgbadger/releases
2) Create a new directory:
mkdir /tmp/todel/
cd /tmp/todel/
3) Download the logs for a month, e.g. March 2020:
aws s3 sync s3://alogredshift/AWSLogs/1234567890/redshift/us-east-1/2020/03/ .
4) Analyze:
pgbadger --format redshift `find /tmp/todel/ -name "*tylog*"` --dbname vadb --outfile /tmp/myq1.txt
Or use docker:
docker run -i --rm -v $(pwd):/workdir -v /tmp/:/tmp/ shantanuo/pgbadger --format redshift $(find /tmp/todel/ -name "*tylog*") --dbname vdb --exclude-query 'FROM pg_' --outfile /tmp/myq123xx2.txt
Or use --dump-all-queries to get all queries in non-normalized form.
Labels: redshift
If you are already using Compose, then you already know how important it is for Docker users. If you want to learn more about it, here are a few templates to start with.
$ git clone https://github.com/docker/awesome-compose.git
$ cd awesome-compose
$ cd nginx-flask-mysql
$ docker-compose up -d
$ curl localhost:80
https://www.docker.com/blog/awesome-compose-app-samples-for-project-dev-kickoff/
Labels: docker
These are the steps to follow if you want to push Packetbeat or Metricbeat data to an AWS Elasticsearch instance.
Make sure that your Elasticsearch instance has whitelisted the IP address of the server where you are installing Packetbeat.
1) Install
2) Configure
3) Start the service
1) Installation of Beats depends upon the processor type, the OS, and whether you are connecting to AWS.
arm + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-arm64.deb
dpkg -i packetbeat-oss-7.11.0-arm64.deb
arm + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-arm64.deb
dpkg -i packetbeat-7.11.0-arm64.deb
arm + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-aarch64.rpm
rpm -ivh packetbeat-oss-7.11.0-aarch64.rpm
arm + redhat + connect to non-AWS
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-aarch64.rpm
rpm -ivh packetbeat-7.11.0-aarch64.rpm
Intel + ubuntu + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-amd64.deb
dpkg -i packetbeat-oss-7.11.0-amd64.deb
Intel + ubuntu + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-amd64.deb
dpkg -i packetbeat-7.11.0-amd64.deb
Intel + redhat + connect to aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-oss-7.11.0-x86_64.rpm
rpm -ivh packetbeat-oss-7.11.0-x86_64.rpm
Intel + redhat + connect to non-aws
wget https://artifacts.elastic.co/downloads/beats/packetbeat/packetbeat-7.11.0-x86_64.rpm
rpm -ivh packetbeat-7.11.0-x86_64.rpm
2) Open config file and change 2 settings as shown below:
vi /etc/packetbeat/packetbeat.yml
hosts: ["search-audit-ay52ddaf7q7zudixfmluji4osmx.us-east-1.es.amazonaws.com:443"]
protocol: "https"
If you are using a server on the local network, mention its address in the hosts list. The protocol option can be removed if you are not using a secure HTTP connection.
Optionally, enable the "send_response" parameter to capture the query and its output as shown below:
packetbeat.protocols.mysql:
ports: [3306]
send_response: true
max_row_length: 4000
The max_row_length option allows you to capture lengthy queries and their large responses.
Disable the X-Pack related settings if you are connecting to AWS or using the OSS build:
# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
setup.ilm.enabled: false
setup.xpack.security.enabled: false
setup.xpack.graph.enabled: false
setup.xpack.watcher.enabled: false
setup.xpack.monitoring.enabled: false
setup.xpack.reporting.enabled: false
3) Start the packetbeat service:
cd /usr/bin/
./packetbeat -e -c packetbeat.yml
Labels: aws, elastic
Here are the steps to back up and restore Elasticsearch data on an AWS Elasticsearch instance.
1) Create IAM role
2) Use the boto script to create snapshot repo
3) Use Kibana to take the actual snapshot
All the steps are explained here...
https://forums.aws.amazon.com/message.jspa?messageID=930345#930345
Here is step-by-step guidance.
1) Create IAM role:
Use the following CloudFormation template to create the role, and note the ARN of the role; it will be used in the boto script.
https://github.com/shantanuo/cloudformation/blob/master/updated/esbck.yml
2) Run this boto script:
Change the access key and secret key, and also change the Elasticsearch endpoint. Add the ARN created in the first step.
from boto.connection import AWSAuthConnection


class ESConnection(AWSAuthConnection):
    def __init__(self, region, **kwargs):
        super(ESConnection, self).__init__(**kwargs)
        self._set_auth_region_name(region)
        self._set_auth_service_name("es")

    def _required_auth_capability(self):
        return ["hmac-v4"]


client = ESConnection(
    region="us-east-1",
    host="search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
    is_secure=False,
)

headers = {"Content-Type": "application/json"}

resp = client.make_request(
    method="PUT",
    headers=headers,
    path="/_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt",
    data='{"type": "s3","settings": { "bucket": "esbck-essnapshotbucket-c9e6d7fy1cbt","region": "us-east-1", "role_arn": "arn:aws:iam::1234567890:role/esbck-EsSnapshotRole-GJGMPH4DBMM3"}}',
)

resp.read()
3) Take the backup from Kibana (a Python alternative is sketched after these commands):
PUT /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1
GET /_cat/indices
DELETE /cwl-2020.03.26
POST /_snapshot/esbck-essnapshotbucket-c9e6d7fy1cbt/snapshot_1/_restore
GET /_snapshot/
GET /_cat/snapshots/esbck-essnapshotbucket-c9e6d7fy1cbt
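If you prefer to script step 3 instead of typing the commands in Kibana, here is a rough sketch using the requests and requests_aws4auth packages (both need to be installed; the keys, endpoint, and repository name are the same placeholders used in the boto script above).

import requests
from requests_aws4auth import AWS4Auth

awsauth = AWS4Auth("xxx", "xxx", "us-east-1", "es")
endpoint = "https://search-training-foz7enh73fbg6lof23z7kbtn3y.us-east-1.es.amazonaws.com"
repo = "esbck-essnapshotbucket-c9e6d7fy1cbt"

# same as PUT /_snapshot/<repo>/snapshot_1 in Kibana
resp = requests.put(f"{endpoint}/_snapshot/{repo}/snapshot_1", auth=awsauth)
print(resp.status_code, resp.text)

# same as GET /_cat/snapshots/<repo>
print(requests.get(f"{endpoint}/_cat/snapshots/{repo}", auth=awsauth).text)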
Labels: aws, elastic
Effective visualization in pandas in just 7 lines of code.
url = "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD&bom=true&format=true&delimiter=%3B"
from urllib.request import urlretrieve
urlretrieve(url, "data.csv")
import pandas as pd
df = pd.read_csv("data.csv", delimiter=";", index_col="Date", parse_dates=True)
%matplotlib inline
df.resample("w").sum().plot()
# https://www.youtube.com/watch?v=_ZEWDGpM-vM&list=WL&index=98&t=23s
Labels: pandas
bamboolib - a GUI for pandas dataframes. Stop googling pandas commands
1) Get your free 14-day trial key:
https://bamboolib.8080labs.com/
2) Install the required Python packages and enable the notebook extensions:
pip install bamboolib
jupyter nbextension enable --py qgrid --sys-prefix
jupyter nbextension enable --py widgetsnbextension --sys-prefix
jupyter nbextension install --py bamboolib --sys-prefix
jupyter nbextension enable --py bamboolib --sys-prefix
3) Restart the docker container.
4) Start exploring visual pandas!
import bamboolib as bam
import pandas as pd
df = pd.read_csv(bam.titanic_csv)
df
_____
Import modules automatically when required!
pip install pyforest
conda install nodejs
python -m pyforest install_extensions
Restart the docker container.
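Once everything is restarted, the commonly used aliases are lazy-imported on first use. A minimal sketch (in a plain script you need the star import; inside Jupyter it is added for you once the extension is installed):

from pyforest import *  # not needed inside Jupyter after installing the extension

# pandas has not been imported explicitly; pyforest imports it the moment "pd" is used
df = pd.DataFrame({"city": ["Pune", "Mumbai"], "count": [10, 20]})
print(df)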
Labels: pandas, usability
1) Enable inventory for the bucket
2) Create a table using athena to read the inventory data
3) Run a select query
1) Enable inventory using boto as shown in this Stack Overflow thread (a rough boto3 sketch follows the link):
https://stackoverflow.com/questions/60615911/use-boto-to-enable-inventory
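The linked thread uses the old boto library; a boto3 equivalent is sketched below. The bucket name, prefix, and inventory id are hypothetical values chosen to match the LOCATION example further down.

import boto3

s3 = boto3.client("s3")

# hypothetical names: source/destination bucket "testme16", prefix "invent", id "myinventory"
s3.put_bucket_inventory_configuration(
    Bucket="testme16",
    Id="myinventory",
    InventoryConfiguration={
        "Id": "myinventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::testme16",
                "Prefix": "invent",
                "Format": "ORC",  # the Athena DDL below expects ORC files
            }
        },
        "OptionalFields": [
            "Size", "LastModifiedDate", "ETag", "StorageClass",
            "IsMultipartUploaded", "ReplicationStatus", "EncryptionStatus",
        ],
    },
)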
2) Replace $InventoryBucket, $InventoryPrefix, $Bucket, and $InventoryName with your configuration values:
CREATE EXTERNAL TABLE inventory(
bucket string,
key string,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://$InventoryBucket/$InventoryPrefix/$Bucket/$InventoryName/hive';
The location URL will look something like this...
LOCATION 's3://testme16/invent/testme16/myinventory/hive';
Repair table to read partitions:
MSCK REPAIR TABLE inventory;
3) Replace $Yesterday with yesterday's timestamp (e.g., 2020-03-03-00-00) and run the query (a boto3 sketch for running it programmatically follows the source link below):
SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, i.size
FROM inventory AS i WHERE i.dt = '$Yesterday'
) GROUP BY prefix ORDER BY total_size DESC;
https://cloudonaut.io/how-to-analyze-and-reduce-s3-storage-usage/
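To run the query above programmatically instead of from the Athena console, here is a rough boto3 sketch (the database name and result bucket are placeholders):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
  SELECT regexp_extract(i.key, '([^/]*/).*', 1) AS prefix, i.size
  FROM inventory AS i WHERE i.dt = '2020-03-03-00-00'
) GROUP BY prefix ORDER BY total_size DESC
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://testme16/athena-results/"},  # placeholder bucket
)
print(resp["QueryExecutionId"])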
Labels: athena, aws
Instead of using a "match_phrase" clause for each value, I can use a "wildcard" clause whose pattern replaces sub-domains like config and cloudtrail with a star (*).
This saves typing additional clauses and keeps the query short.
Old query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "sourceIPAddress": "1.2.3.4"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "5.6.7.8"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "config.amazonaws.com"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "cloudtrail.amazonaws.com"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
_____
New Improved query:
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "sourceIPAddress": "1.2.3.4"
          }
        },
        {
          "match_phrase": {
            "sourceIPAddress": "5.6.7.8"
          }
        },
        {
          "wildcard": {
            "sourceIPAddress.keyword": {
              "value": "*\\.amazonaws\\.com*"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
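For completeness, the same query can also be fired from Python; here is a rough sketch with the elasticsearch client (the endpoint and index pattern are placeholders, and the 7.x style body= argument is assumed):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://search-audit-ay52ddaf7q7zudixfmluji4osmx.us-east-1.es.amazonaws.com:443")

query = {
    "query": {
        "bool": {
            "should": [
                {"match_phrase": {"sourceIPAddress": "1.2.3.4"}},
                {"match_phrase": {"sourceIPAddress": "5.6.7.8"}},
                {"wildcard": {"sourceIPAddress.keyword": {"value": "*\\.amazonaws\\.com*"}}},
            ],
            "minimum_should_match": 1,
        }
    }
}

resp = es.search(index="cloudtrail-*", body=query)  # index pattern is a placeholder
print(resp["hits"]["total"])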
Labels: elastic
There are times when I need to remove salutations like Mr. or Mrs. from the name column in Redshift. I can write a user-defined function that does this.
# select f_extract_name2('mr shantanu oak');
f_extract_name2
-----------------
SHANTANU OAK
(1 row)
The function is written in Python and the source code looks like this...
CREATE OR REPLACE FUNCTION f_extract_name2 (myname varchar(1000)) RETURNS varchar(1000) IMMUTABLE as $$
try:
    remove_list = ['MR', 'MR.', 'MRS.', 'MRS', 'MISS', 'MASTER', 'MISS.', 'MASTER.']
    new_list = list()
    for i in myname.upper().split():
        if i not in remove_list:
            new_list.append(i)
    if len(new_list) == 2:
        return " ".join(new_list)
except:
    pass
$$ LANGUAGE plpythonu
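To sanity-check the logic before creating it in Redshift, the same body can be run as plain Python:

def f_extract_name2(myname):
    # mirrors the UDF body above
    remove_list = ['MR', 'MR.', 'MRS.', 'MRS', 'MISS', 'MASTER', 'MISS.', 'MASTER.']
    new_list = [i for i in myname.upper().split() if i not in remove_list]
    if len(new_list) == 2:
        return " ".join(new_list)
    return None

print(f_extract_name2("mr shantanu oak"))  # SHANTANU OAK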
Labels: python, redshift