Shantanu's Blog

Database Consultant

normalize data using pandas

Here is now to convert a non-normalized excel worksheet to normalized data that can be easily imported into database like mysql.

import pandas as pd
import numpy as np

audit["type"] = audit["VENDOR"]
audit.loc[mask1, "type"] = np.nan
audit["type"].ffill(inplace=True)
_____

The date is in the first row (for e.g. 26/9/2016) and salesman name is in the next row (for e.g. RIZWAN JEE.)

This can be extracted using this code...

d1=mydate.columns[0]
datetime.datetime.strptime(d1,'%d/%m/%Y').strftime('%Y-%m-%d')

'2016-09-26'
_____

x=myname.iloc[0]
x.name[1].split(':')[1]

u' RIZWAN JEE.'

Labels: ,

pandas maths methods

import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

This will create a list something like this...
a b
1 3
2 4

sum method will take the total of each column.
df.sum()
a    3
b    7

But we need the sum of each row. so change the default axis value of 0 to 1.
df.sum(axis=1)
0    4
1    6

we can now simply divide each cell value with the total.
df.div(df.sum(axis=1), axis=0)
a b
0.250000 0.750000
0.333333 0.666667
_____

while using the division method, the default axis value is 1. If I do not mention the axis, I will get unexpected results.

df.div([4, 6])
a b
0.25 0.500000
0.50 0.666667

Changing axis to 0 from 1 will show the expected results.

df.div([4, 6], axis=0)
a b
0.250000 0.750000
0.333333 0.666667

Labels: ,

replace dictionary or lists with pandas dataframe

Let's assume we need to join these 2 lists and process the data later.

country=['china', 'usa', 'india']
medals=[34, 54, 0 ]

For e.g. we may need to know the country that has won the highest number of medals. For that we need to join these 2 lists and then sort on medals column. The first step is to join the lists using zip function.

zip(country, medals)

This will create a list of tuples. We can easily convert it to dictionary and then sort on values.

mydict=dict(zip(country, medals))
import operator
sorted_x = sorted(mydict.items(), key=operator.itemgetter(1))

Instead of creating a dictionary, we can import the 2 lists into a dataframe table that works like excel spreadsheet.

import pandas as pd
df=pd.DataFrame(zip(country, medals))
df.sort([1])

We can now sort on first column using sort method of the dataframe object.

Labels: ,

Pandas series for simple python problems

Let's assume we need to extract the data from a dictionary. The keys are stored in a list as follows:

mydict={'a':34, 'b':54, 'c':66}
mylist=['a', 'b', 'x']

The results expected are as follows:
a    34
b    54
x     N/A

We can do this using pure python...
newdict=dict()
for i in mylist:
if i in mydict:
newdict[i]= mydict[i]
else:
newdict[i]=0

Or use pandas:
import pandas as pd
myobj=pd.Series(mydict, index=mylist)

Labels: ,