Shantanu's Blog

Database Consultant

May 05, 2024

 

Pandas at the command prompt

You can use pandas at the command prompt like this...

curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
gunzip | \
python3 -c 'import sys, pandas as pd
pd.read_csv(sys.stdin).melt("Date").to_csv(sys.stdout, index=False)'

curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
gunzip | \
python3 -c 'import sys, pandas as pd
pd.read_csv(sys.stdin).iloc[:, :-1].melt("Date")\
.to_csv(sys.stdout, index=False)'
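The same pattern extends to other quick transformations; for example, a variation that keeps only the Date and USD columns (assuming the USD column name used in the ECB historical file):

curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
gunzip | \
python3 -c 'import sys, pandas as pd
pd.read_csv(sys.stdin)[["Date", "USD"]].to_csv(sys.stdout, index=False)'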



July 21, 2023

 

langchain for pandas

LangChain is a module that lets you query a pandas dataframe using natural language. It uses ChatGPT to build pandas commands!

!pip install langchain
import os
os.environ["OPENAI_API_KEY"] = "XXXX"

from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
import pandas as pd
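# df is assumed to be an existing pandas DataFrame with your data,
# e.g. (hypothetical file name): df = pd.read_csv("sales_2003.csv")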

pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)

pd_agent.run("Find the total sales for each product line in the year 2003")

_____

Something similar...


# https://github.com/gventuri/pandas-ai

!pip install pandasai
from pandasai import SmartDataframe, SmartDatalake
from pandasai.llm import OpenAI
llm = OpenAI(api_token="YOUR TOKEN")

sdf = SmartDataframe(df, config={"llm": llm})
sdf.chat("Return the top 5 countries by GDP")
sdf.chat("Plot a chart of the gdp by country")

print(sdf.last_code_generated)

If you have more than one dataframe, use the SmartDatalake class and supply a list of dataframes. For example:

sdf = SmartDatalake([df, df2, df3], config={"llm": llm})
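The datalake is then queried the same way as a single dataframe; a minimal sketch (the question is only an example):

sdf.chat("Which country appears in all three dataframes?")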



March 16, 2022

 

Interactive pandas dataframe

itables is a useful utility for pandas dataframes. It makes a DataFrame or Series interactive.

https://github.com/mwouts/itables

Install the package with:

pip install itables

Activate the interactive mode:

from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)

or use itables.show to show just one Series or DataFrame as an interactive table.
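For example, a minimal sketch of showing a single dataframe:

from itables import show
show(df)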

_____

1) At the moment itables does not have an offline mode. While the table data is embedded in the notebook, jQuery and datatables.net are loaded from a CDN.

2) When the data in a table is larger than maxBytes, which is equal to 64KB by default, itables will display only a subset of the table. To show the table in full, modify the value of maxBytes either locally:

show(df, maxBytes=0)

or globally:

import itables.options as opt

opt.maxBytes = 2 ** 20



July 07, 2021

 

Pandas case study 34

Moving average over a variable-length window


How do I calculate a moving average where the set of previous rows is not fixed? For example, this SQL query calculates the average of the totalprice column over the orders of the last 1 month.

SELECT avg(totalprice) OVER (
    PARTITION BY custkey
    ORDER BY orderdate
    RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)
FROM orders

I have this dataframe:

from io  import StringIO
import pandas as pd

myst="""cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""

u_cols=['custkey', 'orderdate', 'totalprice']

orders = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
orders['orderdate'] = pd.to_datetime(orders['orderdate'])
orders = orders.sort_values(list(orders.columns))


Answer:

orders['my_average'] = (orders.groupby('custkey')
                      .apply(lambda d: d.rolling('30D', on='orderdate')['totalprice'].mean())
                      .reset_index(level=0, drop=True)
                      .astype(int)
                   )
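For the sample orders above, the new column works out to the following values (computed by hand from the 30-day windows, so treat it as a sanity check rather than verified output):

#    custkey   orderdate  totalprice  my_average
# 0  cust_1   2020-10-10         100         100
# 2  cust_1   2020-10-15         200         150
# 3  cust_1   2020-10-16         240         180
# 5  cust_1   2020-12-25         140         140
# 1  cust_2   2020-10-10          15          15
# 4  cust_2   2020-12-20          25          25
# 6  cust_2   2021-01-01           5          15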

https://stackoverflow.com/questions/68268531/window-function-for-moving-average



May 18, 2021

 

Pandas case study 33

Let's assume we have 2 dataframes of English words.
The words may or may not be the same in both tables, and the first dataframe has a second column called "count". How do I merge these 2 dataframes and still know whether a word came from the first or the second dataframe?

from io import StringIO
import pandas as pd

u_cols = ['word', 'count']
audit_trail = StringIO('''
test 1
testing 24
again 52
begin 6
''')

oscar = pd.read_csv(audit_trail, sep=" ", names = u_cols)
_____

from io import StringIO
u_cols = ['word' ]
audit_trail = StringIO('''
newer
age
computing
begin
''')

ptg = pd.read_csv(audit_trail, sep=" ", names = u_cols)

Use a full outer join with the merge method of pandas.
Do not forget to turn the indicator on!

ndf = pd.merge(oscar, ptg, on="word", how="outer", indicator=True)
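The indicator adds a _merge column whose values are left_only, right_only or both; a quick way to inspect it (a sketch, row order may vary across pandas versions):

print(ndf[['word', '_merge']])
# 'begin' comes out as 'both', test/testing/again as 'left_only',
# and newer/age/computing as 'right_only'.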



April 29, 2021

 

Manage your CSV using CleverCSV

A command to analyze the CSV file and report the delimiter and escaping:

$ clevercsv detect ./imdb.csv
Detected: SimpleDialect(',', '', '\\')

We can import the CSV into a pandas dataframe without using the read_csv method!
$ clevercsv explore -p imdb.csv
>>> df

The code to create a pandas dataframe using CleverCSV:
$ clevercsv code  -p  ./imdb.csv
import clevercsv
df = clevercsv.read_dataframe("./imdb.csv", delimiter=",", quotechar="", escapechar="\\")

If you are using Jupyter Notebook, use this code...
import clevercsv
rows = clevercsv.read_table('./imdb.csv')
df = clevercsv.read_dataframe('./imdb.csv')



December 13, 2020

 

Get Elastic data into a pandas dataframe

These two lines of code connect to an Elasticsearch instance hosted in the Amazon cloud and read the data from a given index. The data is exposed as a pandas-style dataframe, with the JSON documents flattened to look like an Excel spreadsheet.

import eland as ed

df = ed.DataFrame("https://xxx.us-east-1.es.amazonaws.com:443", es_index_pattern="cwl-2020.12.13")
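An eland dataframe is evaluated lazily against Elasticsearch. If you need an in-memory copy for regular pandas operations, a minimal sketch (this pulls the whole index into memory, so mind its size):

pandas_df = ed.eland_to_pandas(df)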



September 18, 2020

 

Remove junk from pandas dataframe

Non-Latin Unicode characters in a pandas dataframe can be a big problem. Several hours are lost cleaning the data when obscure characters turn up in a dataframe imported from a CSV or Excel file. Here is an easy solution...

from unicodedata import normalize
def clean_normalize_whitespace(x):
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x

df_GDP = df_GDP.applymap(clean_normalize_whitespace)
#clean column headings as well
df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)
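A quick check of what the normalization does, using a hypothetical string containing a non-breaking space (U+00A0):

normalize('NFKC', 'GDP\xa0(nominal)')              # '\xa0' becomes a regular space
clean_normalize_whitespace(' GDP\xa0(nominal) ')   # 'GDP (nominal)'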

I have used the NFKC form, which stands for Normal Form Compatibility Composition. The other forms are NFKD (Compatibility Decomposition), NFD and NFC.

https://unicode.org/reports/tr15/#Norm_Forms



July 13, 2020

 

Understanding your data

A pandas dataframe stores all the data in a single table, which makes it difficult to understand the relationships between columns. For example, I would like to know how Area Abbreviation is related to Item Code in the following data.

# Download the csv file from kaggle:
https://www.kaggle.com/dorbicycle/world-foodfeed-production

import pandas as pd
food_df = pd.read_csv('FAO.csv' , encoding='latin1')
food_df = food_df.drop(columns=food_df.columns[10:])

I will now import the autonormalize add-on from featuretools. It detects the internal relationships between columns and shows us a nice graph.

from featuretools.autonormalize import autonormalize as an
entityset = an.auto_entityset(food_df)
entityset.plot()

  Entities:
    Element Code_Item Code_Area Code [Rows: 21477, Columns: 4]
    Element Code [Rows: 2, Columns: 2]
    Item Code [Rows: 117, Columns: 2]
    Area Code [Rows: 174, Columns: 5]
    Area Abbreviation [Rows: 169, Columns: 2]
  Relationships:
    Element Code_Item Code_Area Code.Area Code -> Area Code.Area Code
    Element Code_Item Code_Area Code.Item Code -> Item Code.Item Code
    Element Code_Item Code_Area Code.Element Code -> Element Code.Element Code
    Area Code.Area Abbreviation -> Area Abbreviation.Area Abbreviation

Do not forget to check the featuretools module as well. It will add new columns to your dataframe that can be useful when building a machine learning model.

import featuretools as ft
fm, features = ft.dfs(entityset=entityset, target_entity='Element Code_Item Code_Area Code')
print (fm)

https://innovation.alteryx.com/automatic-dataset-normalization-for-feature-engineering-in-python/



May 01, 2020

 

Pandas case study 32

Handling Outliers

Outliers can be removed or adjusted using statistical methods such as the IQR, the Z-score and data smoothing.

1) To calculate the IQR (Inter-Quartile Range) of a dataset, first calculate its 1st quartile (Q1) and 3rd quartile (Q3), i.e. the 25th and 75th percentiles of the data, and then subtract Q1 from Q3.

import pandas as pd
data = [-2,8,13,19,34,49,50,53,59,64,87,89,1456]
df = pd.DataFrame(data)
df.columns = ['values']
ndf=df.describe().T
ndf['75%'] - ndf['25%']
# returns 45

To find outliers using the IQR, we define a multiplier, ideally 1.5, which decides how far below Q1 and above Q3 a value must fall to be considered an outlier.

higher_limit = ndf['75%'] + 1.5 * 45
lower_limit = ndf['25%'] - 1.5 * 45
df[(df['values'] > higher_limit[0]) | (df['values'] < lower_limit[0])]

2) The Z-score tells how far a point is from the mean of the dataset in terms of standard deviations. An absolute z-score above 3 is considered an outlier.

from scipy import stats
df['z_score'] = stats.zscore(df['values'])
df[df['z_score'].abs() > 3]

#     values   z_score
# 12    1456  3.454979

3) Data smoothing is a process to adjust spikes and peaks. If your current value is 13, the previous value is 8 and the smoothing level is 0.6, then the smoothed value is 11, given by:
13*0.6 + (1-0.6)*8 = 11

The pandas smoothing function (ewm) can be used to calculate the exponentially weighted moving average at different alpha levels.

df['ewm_alpha_1']=df['values'].ewm(alpha=0.1).mean()
df['ewm_alpha_3']=df['values'].ewm(alpha=0.3).mean()
df['ewm_alpha_6']=df['values'].ewm(alpha=0.6).mean()
df

https://kanoki.org/2020/04/23/how-to-remove-outliers-in-python/



April 16, 2020

 

Download and unzip any file using python

Usually I download a file and extract it using two Linux commands like this...

! wget https://github.com/d6t/d6tstack/raw/master/test-data.zip
! unzip -o test-data.zip 

But it can also be done using Python code as shown below!

import urllib.request
import zipfile

cfg_fname_sample = "test-data.zip"
urllib.request.urlretrieve(
    "https://github.com/d6t/d6tstack/raw/master/" + cfg_fname_sample, cfg_fname_sample
)

with zipfile.ZipFile(cfg_fname_sample, "r") as zip_ref:
    zip_ref.extractall(".")



April 10, 2020

 

Numpy Basics

Some basic numpy methods everyone should be aware of.

import numpy as np

mylist = [[100, 2, 3], [4, 5786, 6]]
a = np.array(mylist)            # 2 x 3 array
a

np.ravel(a)                     # flatten to 1-D

np.append(a, [2])               # append flattens and returns a 1-D array
np.append(a, [10, 11, 12])
np.append(a, [""])              # appending a string casts the flattened result to strings
b = np.array([[400], [800]])
np.append(a, b, axis=1)         # append a column
np.append(a, [[50, 60, 70]], axis=0)   # append a row

np.insert(a, 2, [1, 2, 34])     # insert values before position 2 of the flattened array

a[:1]                           # first row (still 2-D)
a[0][2]                         # element at row 0, column 2
a.size                          # number of elements
a.shape                         # (rows, columns)

np.where(a == 3)                # row and column indices where the condition holds

np.sort(a, axis=1)              # sort within each row
np.sort(a, axis=0)              # sort within each column

a.tolist()                      # back to a nested Python list

np.delete(a, 1, axis=0)         # delete the second row
np.delete(a, 1, axis=1)         # delete the second column



March 28, 2020

 

Pandas case study 31

How to create a JSON document from a pandas dataframe

If my data looks like this, how do I create a document per user?

Name Sem Subject Grade
0 John Sem1 Mathematics A
1 Sara Sem1 Biology B
2 John Sem2 Biology A+
3 Sara Sem2 Mathematics B++

In this case, John and Sara are the two users who appear in 2 semesters. I need a nested JSON document for each user.
John's data will look like this...

{
  "John": {
    "Sem1": {
      "Subject": "Mathematics",
      "Grade": "A"
    },
    "Sem2": {
      "Subject": "Biology",
      "Grade": "A+"
    }
  }
}

Here is how to create the dataframe:

df = pd.DataFrame(
    {
        "Name": ["John", "Sara", "John", "Sara"],
        "Sem": ["Sem1", "Sem1", "Sem2", "Sem2"],
        "Subject": ["Mathematics", "Biology", "Biology", "Mathematics"],
        "Grade": ["A", "B", "A+", "B++"],
    }
)

This is obviously a groupby problem, but we also need to export the result to a dictionary. A dictionary comprehension makes sure all users end up in a single document.

ppdict = {
    n: grp.loc[n].to_dict("index")
    for n, grp in df.set_index(["Name", "Sem"]).groupby(level="Name")
}

In order to display the data correctly, use the json module like this...

import json
print(json.dumps(ppdict, indent=2))

The output will look like this...

{
  "John": {
    "Sem1": {
      "Subject": "Mathematics",
      "Grade": "A"
    },
    "Sem2": {
      "Subject": "Biology",
      "Grade": "A+"
    }
  },
  "Sara": {
    "Sem1": {
      "Subject": "Biology",
      "Grade": "B"
    },
    "Sem2": {
      "Subject": "Mathematics",
      "Grade": "B++"
    }
  }
}

https://kanoki.org/2020/03/24/convert-pandas-dataframe-to-dictionary/



March 14, 2020

 

Pandas case study 30

Effective visualization in pandas in just 7 lines of code.

url = "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD&bom=true&format=true&delimiter=%3B"

from urllib.request import urlretrieve
urlretrieve(url, "data.csv")

import pandas as pd
df = pd.read_csv("data.csv", delimiter=";", index_col="Date", parse_dates=True)

%matplotlib inline
df.resample("w").sum().plot()


# https://www.youtube.com/watch?v=_ZEWDGpM-vM&list=WL&index=98&t=23s




March 13, 2020

 

Visual pandas - bamboolib

bamboolib - a GUI for pandas dataframes. Stop googling pandas commands

1) Get your free 14-day trial key.

https://bamboolib.8080labs.com/

2) Install the required Python package:

pip install bamboolib

jupyter nbextension enable --py qgrid --sys-prefix
jupyter nbextension enable --py widgetsnbextension --sys-prefix
jupyter nbextension install --py bamboolib --sys-prefix
jupyter nbextension enable --py bamboolib --sys-prefix

3) Restart the Docker container.

4) Start exploring visual pandas!

import bamboolib as bam
import pandas as pd
df = pd.read_csv(bam.titanic_csv)
df

_____

Import modules automatically when required!

pip install pyforest
conda install nodejs

python -m pyforest install_extensions

Restart the Docker container.



February 29, 2020

 

Pandas case study 29

Is there a way to identify leading and trailing NAs in a pandas.DataFrame? Currently I do the following, but it does not seem straightforward:

df = pd.DataFrame(dict(a=[0.1, 0.2, 0.2],
                       b=[None, 0.1, None],
                       c=[0.1, None, 0.1]))
lead_na = (df.isnull() == False).cumsum() == 0
trail_na = (df.iloc[::-1].isnull() == False).cumsum().iloc[::-1] == 0
trail_lead_nas = lead_na | trail_na

Any ideas how this could be expressed more efficiently?

Answer:

df.ffill().isna() | df.bfill().isna()

Leading NAs remain NA after a forward fill, and trailing NAs remain NA after a backward fill, so the union of the two masks flags exactly the leading and trailing NAs.

https://stackoverflow.com/questions/59820159/identify-leading-and-trailing-nas-in-pandas-dataframe



 

Pandas case study 28

I am attempting to generate a dataframe (or series) based on another dataframe, selecting a different column from the first frame for each row using another series. In the simplified example below, I want the frame1 values from 'a' for the first three rows, and 'b' for the final two (the picked_values series).

frame1 = pd.DataFrame(np.random.randn(10).reshape(5, 2), index=range(5), columns=['a', 'b'])
picked_values = pd.Series(['a', 'a', 'a', 'b', 'b'])
frame1

    a           b
0   0.283519    1.462209
1   -0.352342   1.254098
2   0.731701    0.236017
3   0.022217    -1.469342
4   0.386000    -0.706614
Trying to get to the series:

0   0.283519
1   -0.352342
2   0.731701
3   -1.469342
4   -0.706614

I was hoping frame1[picked_values] would work, but this ends up with five columns.

Answer:

pd.Series(frame1.lookup(picked_values.index,picked_values))
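Note that DataFrame.lookup was deprecated and later removed in newer pandas versions; a minimal equivalent sketch using numpy positional indexing (pandas assumed imported as pd):

import numpy as np

rows = np.arange(len(frame1))
cols = frame1.columns.get_indexer(picked_values)
result = pd.Series(frame1.to_numpy()[rows, cols], index=frame1.index)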

https://stackoverflow.com/questions/59898266/select-columns-in-a-dataframe-conditional-on-row



 

Pandas case study 27

I have a dataframe that looks like the one below.

dataframe1 =
In  AA   BB  CC
0   10   1   0
1   11   2   3
2   10   6   0
3   9    1   0
4   10   3   1
5   1    2   0

Now I want to create a dataframe that gives me the count of the mode for each column. For column AA the count is 3 for mode 10, and for column CC the count is 4 for mode 0. But BB has two modes, 1 and 2, so for BB I want the sum of the counts for both modes: 2 + 2 = 4.

Therefore the final dataframe that I want looks like below.

Columns  Counts
AA        3
BB        4
CC        4

How to do it?


Answer:

You can compare each column with its modes and count the matches with sum:

df = pd.DataFrame({'Columns': df.columns,
                   'Val':[df[x].isin(df[x].mode()).sum() for x in df]})
print (df)
  Columns  Val
0      AA    3
1      BB    4
2      CC    4

https://stackoverflow.com/questions/59874756/counting-mode-occurrences-for-all-columns-in-a-dataframe



 

Pandas case study 26

I have a list of dictionaries, and I would like to obtain those that have the same value in a key:

my_list_of_dicts = [{
    'id': 3,
    'name': 'John'
  },{
    'id': 5,
    'name': 'Peter'
  },{
    'id': 2,
    'name': 'Peter'
  },{
    'id': 6,
    'name': 'Mariah'
  },{
    'id': 7,
    'name': 'John'
  },{
    'id': 1,
    'name': 'Louis'
  }
]

Answer:

df = pd.DataFrame(my_list_of_dicts)
df[df.name.isin(df[df.name.duplicated()]['name'])].to_json(orient='records')
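An alternative (not the original answer): duplicated with keep=False marks every occurrence of a repeated name, which avoids the nested isin lookup.

df[df['name'].duplicated(keep=False)].to_json(orient='records')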

https://stackoverflow.com/questions/59822973/keep-duplicates-by-key-in-a-list-of-dictionaries/60465827#60465827



 

Pandas case study 25

I have this dataframe that I need to re-format for reporting purposes.

df = pd.DataFrame(data = {'RecordID' : [1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5],
'DisplayLabel' : ['Source','Test','Value 1','Value 2','Value3','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2','Source','Test','Value 1','Value 2'],
'Value' : ['Web','Logic','S','I','Complete','Person','Voice','>20','P','Mail','OCR','A','I','Dictation','Understandable','S','I','Web','Logic','R','S']})

I am trying to "unmelt" (though not exactly) the Source and Test entries into their own columns in a new dataframe.

Answer 1: mask, pivot and join

mask = df['DisplayLabel'].str.contains('Value')
df2 = df[~mask].pivot(index='RecordID', columns='DisplayLabel', values='Value')

dfpiv = (
    df[mask].rename(columns={'DisplayLabel':'Result'})
            .set_index('RecordID')
            .join(df2)
            .reset_index()
)

Answer 2: set_index, unstack, then melt

df.set_index(['RecordID', 'DisplayLabel']).Value.unstack().reset_index() \
  .melt(['RecordID', 'Source', 'Test'], var_name='Result', value_name='Value') \
  .sort_values('RecordID').dropna(subset=['Value'])

https://stackoverflow.com/questions/59847074/unmelt-only-part-of-a-column-from-pandas-dataframe


