Shantanu's Blog
Database Consultant
May 05, 2024
Pandas as command prompt
You can use pandas at the command prompt like this. The pipeline below downloads the ECB reference exchange rates and reshapes them from wide to long format...
curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
gunzip | \
python3 -c 'import sys, pandas as pd
pd.read_csv(sys.stdin).melt("Date").to_csv(sys.stdout, index=False)'
The ECB file ends each data row with a trailing comma, which produces an empty last column; this variant drops that column before melting:
curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
gunzip | \
python3 -c 'import sys, pandas as pd
pd.read_csv(sys.stdin).iloc[:, :-1].melt("Date")\
.to_csv(sys.stdout, index=False)'
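The melt("Date") call reshapes the wide table (one column per currency) into long (Date, variable, value) rows. A minimal illustration with made-up numbers:
import pandas as pd
wide = pd.DataFrame({"Date": ["2024-05-03"], "USD": [1.07], "JPY": [165.6]})
print(wide.melt("Date"))
#          Date variable   value
# 0  2024-05-03      USD    1.07
# 1  2024-05-03      JPY  165.60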
Labels: pandas, shell script
July 21, 2023
langchain for pandas
LangChain is a module to query a pandas dataframe using natural language. It uses ChatGPT to build pandas commands!
!pip install langchain
import os
os.environ["OPENAI_API_KEY"] = "XXXX"
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
import pandas as pd
# df is any pandas dataframe you have already loaded
pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
pd_agent.run("Find the total sales for each product line in the year 2003")
_____
Something similar...
# https://github.com/gventuri/pandas-ai
!pip install pandasai
from pandasai import SmartDataframe, SmartDatalake
from pandasai.llm import OpenAI
llm = OpenAI(api_token="YOUR TOKEN")
sdf = SmartDataframe(df, config={"llm": llm})
sdf.chat("Return the top 5 countries by GDP")
sdf.chat("Plot a chart of the gdp by country")
print(sdf.last_code_generated)
If you have more than one dataframe, use the SmartDatalake class instead and supply a list of dataframes. For example:
sdf = SmartDatalake([df, df2, df3], config={"llm": llm})
Labels: machine_learning, openai, pandas, usability
March 16, 2022
Interactive pandas dataframe
iTables is a useful utility for pandas dataframes. It makes a dataframe or series interactive.
https://github.com/mwouts/itables
Install the package with:
pip install itables
Activate the interactive mode:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
or use itables.show to show just one Series or DataFrame as an interactive table.
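For example:
from itables import show
show(df)  # render just this dataframe as an interactive table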
_____
1) At the moment itables does not have an offline mode. While the table data is embedded in the notebook, the jQuery and datatables.net libraries are loaded from a CDN.
2) When the data in a table is larger than maxBytes (64KB by default), itables will display only a subset of the table. To show the table in full, modify the value of maxBytes, either locally:
show(df, maxBytes=0)
or globally:
import itables.options as opt
opt.maxBytes = 2 ** 20
Labels: pandas, python, usability
July 07, 2021
pandas case study 34
Moving average over a variable-length window
How do I calculate a moving average where the set of previous rows is not fixed? For example, this SQL query calculates the average of the totalprice column over the orders of the last 1 month:
SELECT avg(totalprice) OVER (
PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)
FROM orders
I have this dataframe:
from io import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""
u_cols = ['custkey', 'orderdate', 'totalprice']
orders = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
orders['orderdate'] = pd.to_datetime(orders['orderdate'])
# rolling('30D') needs orderdate to be monotonic within each group
orders = orders.sort_values(list(orders.columns))
Answer:
orders['my_average'] = (orders.groupby('custkey')
                        .apply(lambda d: d.rolling('30D', on='orderdate')['totalprice'].mean())
                        .reset_index(level=0, drop=True)
                        .astype(int))
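With the sample data above, the 30-day windows give (worked out by hand):
print(orders)
# cust_1: 100 -> (100+200)/2 = 150 -> (100+200+240)/3 = 180, then 140 alone
#         (2020-12-25 is more than 30 days after 2020-10-16)
# cust_2: 15, then 25, then (25+5)/2 = 15 for 2021-01-01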
https://stackoverflow.com/questions/68268531/window-function-for-moving-average
Labels: pandas
May 18, 2021
Pandas case study 33
Let's assume we have 2 dataframes of English words.
The words may or may not be the same in both tables, and the first dataframe has a second column called "count". How do I merge these 2 dataframes and still get to know if a word came from the first or the second dataframe?
from io import StringIO
import pandas as pd
u_cols = ['word', 'count']
audit_trail = StringIO('''
test 1
testing 24
again 52
begin 6
''')
oscar = pd.read_csv(audit_trail, sep=" ", names = u_cols)
_____
from io import StringIO
u_cols = ['word' ]
audit_trail = StringIO('''
newer
age
computing
begin
''')
ptg = pd.read_csv(audit_trail, sep=" ", names = u_cols)
Use a full outer join via the merge method of pandas.
Do not forget to turn the indicator on!
ndf=pd.merge(oscar,ptg,on="word", how="outer", indicator=True)
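indicator=True adds a _merge column whose value is left_only, right_only or both, which tells you exactly where each word came from:
print(ndf[['word', '_merge']])
# 'begin' shows as 'both'; words only in oscar show 'left_only',
# and words only in ptg show 'right_only'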
Labels: pandas
April 29, 2021
Manage your csv using clever csv
A command to analyze the csv file and report the delimiter and escaping:
$ clevercsv detect ./imdb.csv
Detected: SimpleDialect(',', '', '\\')
We can load the csv into a pandas dataframe without calling the read_csv method!
$ clevercsv explore -p imdb.csv
>>> df
The code to create a pandas dataframe using clevercsv:
$ clevercsv code -p ./imdb.csv
import clevercsv
df = clevercsv.read_dataframe("./imdb.csv", delimiter=",", quotechar="", escapechar="\\")
If you are using Jupyter Notebook, use this code...
import clevercsv
rows = clevercsv.read_table('./imdb.csv')
df = clevercsv.read_dataframe('./imdb.csv')
Labels: pandas, usability
December 13, 2020
get elastic data into pandas dataframe
These 2 lines of code will connect to an Elastic instance hosted in the Amazon cloud and download the data from a given index. The result is a pandas-style dataframe with the JSON data flattened to look like an Excel spreadsheet.
import eland as ed
df = ed.DataFrame("https://xxx.us-east-1.es.amazonaws.com:443", es_index_pattern="cwl-2020.12.13")
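An eland dataframe is evaluated lazily against Elasticsearch. If you need a regular in-memory pandas object, pull it across explicitly:
pandas_df = ed.eland_to_pandas(df)  # materialize as a plain pandas dataframe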
Labels: elastic, pandas
September 18, 2020
Remove junk from pandas dataframe
Non-Latin unicode characters in a pandas dataframe are a big problem. Several hours can be lost cleaning the data when obscure characters turn up in a dataframe imported from a csv or excel file. Here is an easy solution...
from unicodedata import normalize
def clean_normalize_whitespace(x):
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x
df_GDP = df_GDP.applymap(clean_normalize_whitespace)
#clean column headings as well
df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)
I have used the NFKC parameter, which stands for Normal Form Kompatibility Composition. The other forms are NFKD (Decomposition), NFD and NFC.
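A quick illustration of what NFKC fixes, for example a non-breaking space and a ligature:
from unicodedata import normalize
print(repr(normalize('NFKC', 'GDP\xa0(nominal)')))  # the '\xa0' becomes a plain space
print(normalize('NFKC', 'ﬁle'))  # the 'ﬁ' ligature becomes 'fi'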
https://unicode.org/reports/tr15/#Norm_Forms
Labels: pandas, unicode
July 13, 2020
Understanding your data
Pandas dataframe stores all the data in a single table, which makes it difficult to understand the relationships between columns. For example, I would like to know how Area Abbreviation is related to Item Code in the following data.
# Download the csv file from kaggle:
https://www.kaggle.com/dorbicycle/world-foodfeed-production
import pandas as pd
food_df = pd.read_csv('FAO.csv' , encoding='latin1')
food_df = food_df.drop(columns=food_df.columns[10:])
I will now import the autonormalize module from featuretools. It detects the internal relationships between columns and shows us a nice graph.
from featuretools.autonormalize import autonormalize as an
entityset = an.auto_entityset(food_df)
entityset.plot()
Entities:
Element Code_Item Code_Area Code [Rows: 21477, Columns: 4]
Element Code [Rows: 2, Columns: 2]
Item Code [Rows: 117, Columns: 2]
Area Code [Rows: 174, Columns: 5]
Area Abbreviation [Rows: 169, Columns: 2]
Relationships:
Element Code_Item Code_Area Code.Area Code -> Area Code.Area Code
Element Code_Item Code_Area Code.Item Code -> Item Code.Item Code
Element Code_Item Code_Area Code.Element Code -> Element Code.Element Code
Area Code.Area Abbreviation -> Area Abbreviation.Area Abbreviation
Do not forget to check the featuretools module itself as well. It can add new columns to your dataframe that can be useful in building a machine learning model.
import featuretools as ft
fm, features = ft.dfs(entityset=entityset, target_entity='Element Code_Item Code_Area Code')
print (fm)
Labels: machine_learning, pandas, usability
May 01, 2020
Pandas case study 32
Handling Outliers
Outliers can be removed or adjusted using statistical methods of IQR, Z-Score and Data Smoothing.
1) To calculate the IQR (Inter Quartile Range) of a dataset, first calculate its 1st quartile (Q1) and 3rd quartile (Q3), i.e. the 25th and 75th percentiles of the data, and then subtract Q1 from Q3.
import pandas as pd
data = [-2,8,13,19,34,49,50,53,59,64,87,89,1456]
df = pd.DataFrame(data)
df.columns = ['values']
ndf=df.describe().T
ndf['75%'] - ndf['25%']
# returns 45
To find outliers using the IQR we define a multiplier, ideally 1.5, that decides how far below Q1 and above Q3 a value must fall to be considered an outlier.
higher_limit = ndf['75%'] + 1.5 * 45
lower_limit = ndf['25%'] - 1.5 * 45
df[(df['values'] > higher_limit[0]) | (df['values'] < lower_limit[0])]
2) The Z-score tells how far a point is from the mean of the dataset in terms of standard deviations. An absolute z-score above 3 is considered an outlier.
from scipy import stats
df['z_score'] = stats.zscore(df['values'])
df[df['z_score'].abs() > 3]
#     values   z_score
# 12    1456  3.454979
3) Data smoothing is a process to adjust spikes and peaks. If your current value is 13, the previous value is 8 and the smoothing level is 0.6, then the smoothed value is 11, given by
13*0.6 + (1-0.6)*8
Pandas' smoothing function ewm() can be used to calculate the exponentially weighted moving average at different alpha levels. (Note that ewm uses adjust=True by default; pass adjust=False to reproduce the simple recursion above.)
df['ewm_alpha_1']=df['values'].ewm(alpha=0.1).mean()
df['ewm_alpha_3']=df['values'].ewm(alpha=0.3).mean()
df['ewm_alpha_6']=df['values'].ewm(alpha=0.6).mean()
df
https://kanoki.org/2020/04/23/how-to-remove-outliers-in-python/
Labels: machine_learning, pandas
April 16, 2020
Download and unzip any file using python
Usually I download a file and extract it using 2 Linux commands like this...
! wget https://github.com/d6t/d6tstack/raw/master/test-data.zip
! unzip -o test-data.zip
But it can also be done using python code as shown below!
import urllib.request
import zipfile
cfg_fname_sample = "test-data.zip"
urllib.request.urlretrieve(
    "https://github.com/d6t/d6tstack/raw/master/" + cfg_fname_sample,
    cfg_fname_sample,
)
with zipfile.ZipFile(cfg_fname_sample, "r") as zip_ref:
    zip_ref.extractall(".")
Labels: pandas, python, usability
April 10, 2020
Numpy Basics
Some basic numpy methods everyone should be aware of.
import numpy as np
mylist = [[100, 2, 3], [4, 5786, 6]]
a = np.array(mylist)  # create a 2x3 array from a nested list
a
np.ravel(a)  # flatten to 1-D
np.append(a, [2])  # append flattens and returns a new 1-D array
np.append(a, [10, 11, 12])
np.append(a, [""])  # mixing in a string upcasts everything to strings
b = np.array([[400], [800]])
np.append(a, b, axis=1)  # add a column (row counts must match)
np.append(a, [[50, 60, 70]], axis=0)  # add a row
np.insert(a, 2, [1, 2, 34])  # insert values at position 2 of the flattened array
a[:1]  # first row, still 2-D
a[0][2]  # element at row 0, column 2
a.size  # number of elements
a.shape  # (rows, columns)
np.where(a == 3)  # indices where the condition holds
np.sort(a, axis=1)  # sort within each row
np.sort(a, axis=0)  # sort within each column
a.tolist()  # back to a nested Python list
np.delete(a, 1, axis=0)  # drop the second row
np.delete(a, 1, axis=1)  # drop the second column
Labels: pandas, python
March 28, 2020
Pandas case study 31
How to create a json document from pandas dataframe
If my data looks like this, how do I create a document per user?
Name Sem Subject Grade
0 John Sem1 Mathematics A
1 Sara Sem1 Biology B
2 John Sem2 Biology A+
3 Sara Sem2 Mathematics B++
In this case, John and Sara are the two users who appear in 2 semesters. I need a nested json document for each user.
John's data will look like this...
{
"John": {
"Sem1": {
"Subject": "Mathematics",
"Grade": "A"
},
"Sem2": {
"Subject": "Biology",
"Grade": "A+"
}
}
}
Here is how to create the dataframe:
df = pd.DataFrame(
    {
        "Name": ["John", "Sara", "John", "Sara"],
        "Sem": ["Sem1", "Sem1", "Sem2", "Sem2"],
        "Subject": ["Mathematics", "Biology", "Biology", "Mathematics"],
        "Grade": ["A", "B", "A+", "B++"],
    }
)
This is obviously a groupby problem. But we also need to export it to a dictionary. A dictionary comprehension makes sure all users are included in a single document.
ppdict = {
    n: grp.loc[n].to_dict("index")
    for n, grp in df.set_index(["Name", "Sem"]).groupby(level="Name")
}
In order to display the data correctly, use json module like this...
import json
print(json.dumps(ppdict, indent=2))
The output will look like this...
{
"John": {
"Sem1": {
"Subject": "Mathematics",
"Grade": "A"
},
"Sem2": {
"Subject": "Biology",
"Grade": "A+"
}
},
"Sara": {
"Sem1": {
"Subject": "Biology",
"Grade": "B"
},
"Sem2": {
"Subject": "Mathematics",
"Grade": "B++"
}
}
}
https://kanoki.org/2020/03/24/convert-pandas-dataframe-to-dictionary/
Labels: pandas
March 14, 2020
Pandas case study 30
Effective visualization in pandas in just 7 lines of code.
url = "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD&bom=true&format=true&delimiter=%3B"
from urllib.request import urlretrieve
urlretrieve(url, "data.csv")
import pandas as pd
df = pd.read_csv("data.csv", delimiter=";", index_col="Date", parse_dates=True)
%matplotlib inline
df.resample("w").sum().plot()  # weekly totals over time
# https://www.youtube.com/watch?v=_ZEWDGpM-vM&list=WL&index=98&t=23s
Labels: pandas
March 13, 2020
Visual pandas - bamboolib
bamboolib - a GUI for pandas dataframes. Stop googling pandas commands
1) Get your free 14 days trial key.
https://bamboolib.8080labs.com/
2) Install required python package:
pip install bamboolib
jupyter nbextension enable --py qgrid --sys-prefix
jupyter nbextension enable --py widgetsnbextension --sys-prefix
jupyter nbextension install --py bamboolib --sys-prefix
jupyter nbextension enable --py bamboolib --sys-prefix
3) Restart the docker container.
4) Start exploring visual pandas!
import bamboolib as bam
import pandas as pd
df = pd.read_csv(bam.titanic_csv)
df
_____
pyforest: import modules automatically when required!
pip install pyforest
conda install nodejs
python -m pyforest install_extensions
Restart docker container.
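After the restart, common modules such as pd and np are available without explicit imports. A quick sketch, assuming pyforest is active in the notebook:
df = pd.DataFrame({'a': [1, 2, 3]})  # pyforest imports pandas on first use
active_imports()  # lists the imports pyforest actually performed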
Labels: pandas, usability
February 29, 2020
Pandas case study 29
Is there a way to identify leading and trailing NAs in a pandas DataFrame? Currently I do the following, but it seems not straightforward:
df = pd.DataFrame(dict(a=[0.1, 0.2, 0.2],
                       b=[None, 0.1, None],
                       c=[0.1, None, 0.1]))
lead_na = (df.isnull() == False).cumsum() == 0
trail_na = (df.iloc[::-1].isnull() == False).cumsum().iloc[::-1] == 0
trail_lead_nas = lead_na | trail_na
Any ideas how this could be expressed more efficiently?
Answer: after a forward fill only leading NAs remain missing, and after a backward fill only trailing NAs do, so:
df.ffill().isna() | df.bfill().isna()
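For the sample frame this flags the leading and trailing NAs in column b, while the interior NA in column c is (by design) not flagged:
print(df.ffill().isna() | df.bfill().isna())
#        a      b      c
# 0  False   True  False
# 1  False  False  False
# 2  False   True  False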
https://stackoverflow.com/questions/59820159/identify-leading-and-trailing-nas-in-pandas-dataframe
Labels: pandas
Pandas case study 28
I am attempting to generate a dataframe (or series) based on another dataframe, selecting a different column from the first frame dependent on the row using another series. In the simplified example below, I want the frame1 values from 'a' for the first three rows, and 'b' for the final two (the picked_values series).
frame1=pd.DataFrame(np.random.randn(10).reshape(5,2),index=range(5),columns=['a','b'])
picked_values=pd.Series(['a','a','a','b','b'])
Frame1
a b
0 0.283519 1.462209
1 -0.352342 1.254098
2 0.731701 0.236017
3 0.022217 -1.469342
4 0.386000 -0.706614
Trying to get to the series:
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
I was hoping frame1[picked_values] would work, but this ends up with five columns.
Answer:
pd.Series(frame1.lookup(picked_values.index,picked_values))
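Note that DataFrame.lookup was removed in pandas 2.0. An equivalent sketch using numpy positional indexing:
import numpy as np
pd.Series(frame1.to_numpy()[np.arange(len(frame1)),
                            frame1.columns.get_indexer(picked_values)],
          index=frame1.index)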
https://stackoverflow.com/questions/59898266/select-columns-in-a-dataframe-conditional-on-row
Labels: pandas
Pandas case study 27
I have a dataframe that looks like below.
dataframe1 =
In AA BB CC
0 10 1 0
1 11 2 3
2 10 6 0
3 9 1 0
4 10 3 1
5 1 2 0
Now I want to create a dataframe that gives me the count of the mode for each column. For column AA the count is 3 for mode 10; for column CC the count is 4 for mode 0; but BB has two modes, 1 and 2, so for BB I want the sum of the counts of the modes: 2+2=4.
Therefore the final dataframe that I want looks like below.
Columns Counts
AA 3
BB 4
CC 4
How to do it?
Answer:
You can compare columns with modes and count matches by sum:
df = pd.DataFrame({'Columns': df.columns,
                   'Val': [df[x].isin(df[x].mode()).sum() for x in df]})
print (df)
Columns Val
0 AA 3
1 BB 4
2 CC 4
https://stackoverflow.com/questions/59874756/counting-mode-occurrences-for-all-columns-in-a-dataframe
Labels: pandas
Pandas case study 26
I have a list of dictionaries, and I would like to obtain those that have the same value in a key:
my_list_of_dicts = [
    {'id': 3, 'name': 'John'},
    {'id': 5, 'name': 'Peter'},
    {'id': 2, 'name': 'Peter'},
    {'id': 6, 'name': 'Mariah'},
    {'id': 7, 'name': 'John'},
    {'id': 1, 'name': 'Louis'},
]
Answer:
import pandas as pd

df = pd.DataFrame(my_list_of_dicts)
# keep the rows whose name occurs more than once, then export as JSON records
df[df.name.isin(df[df.name.duplicated()]['name'])].to_json(orient='records')
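With the list above this returns the John and Peter records:
# [{"id":3,"name":"John"},{"id":5,"name":"Peter"},
#  {"id":2,"name":"Peter"},{"id":7,"name":"John"}]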
https://stackoverflow.com/questions/59822973/keep-duplicates-by-key-in-a-list-of-dictionaries/60465827#60465827
Labels: pandas
Pandas case study 25
I have this dataframe that I need to re-format for report purpose.
df = pd.DataFrame(data = {'RecordID' : [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5],
'DisplayLabel' : ['Source','Test','Value 1','Value 2','Value3','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2'],
'Value' : ['Web','Logic','S','I','Complete','Person','Voice','>20','P','Mail','OCR','A','I','Dictation','Understandable','S','I','Web','Logic','R','S']})
I am trying to "unmelt" (though not exactly) the Source and Test rows into columns of a new dataframe.
Answer 1: mask, pivot and join
mask = df['DisplayLabel'].str.contains('Value')
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')
dfpiv = (
    df[mask].rename(columns={'DisplayLabel': 'Result'})
            .set_index('RecordID')
            .join(df2)
            .reset_index()
)
Answer 2: set_index, unstack, then melt
df.set_index(['RecordID', 'DisplayLabel']).Value.unstack().reset_index() \
.melt(['RecordID', 'Source', 'Test'], var_name='Result', value_name='Value') \
.sort_values('RecordID').dropna(subset=['Value'])
https://stackoverflow.com/questions/59847074/unmelt-only-part-of-a-column-from-pandas-dataframe
Labels: pandas