# Shantanu's Blog

Database Consultant

## Rescaling data

Rescaling data is an important step toward getting better results from almost any machine learning algorithm. For example, take this list...

my_lst = [1.9, 1.5, 0.8]

After rescaling without using the mean, all values remain positive, which is useful for image processing.

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)
scaler.fit_transform(np.array(my_lst).reshape(-1, 1))
## (4.18, 3.3, 1.76)

Another option is to rescale using with_mean=True (the default):

scaler = StandardScaler(with_mean=True)
scaler.fit_transform(np.array(my_lst).reshape(-1, 1))
(1.1, 0.22, -1.33)

After rescaling using the mean, the new values have an average of 0 and a standard deviation of 1:

(1.1 + 0.22 - 1.33) / 3
## ~0

sqrt((((1.1 - 0) ** 2) + ((0.22 - 0) ** 2) + ((-1.33 - 0) ** 2)) / 3)
## ~1
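The same check can be done with NumPy, which computes the population mean and standard deviation directly. A quick sketch on the list from above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

my_lst = [1.9, 1.5, 0.8]
scaled = StandardScaler().fit_transform(np.array(my_lst).reshape(-1, 1))

# the standardized column has population mean 0 and standard deviation 1
print(np.isclose(scaled.mean(), 0), np.isclose(scaled.std(), 1))  # True True
```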

_____

We can also scale the data between 0 and 1: the maximum value becomes 1, the minimum becomes 0, and all other values fall in between.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

mm_scaler = MinMaxScaler()
mm_scaler.fit_transform(np.array(my_lst).reshape(-1, 1))

[1, 0.63, 0]

The same values can also be calculated using the built-in min and max functions with a list comprehension.

X_m2 = [(X - min(my_lst)) / (max(my_lst) - min(my_lst)) for X in my_lst]

To reverse the scaling and recover the original values, multiply by the range of my_lst and add back its minimum:

[y * (max(my_lst) - min(my_lst)) + min(my_lst) for y in X_m2]
## [1.9, 1.5, 0.8]
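MinMaxScaler can also undo its own scaling via inverse_transform, so there is no need to track min and max by hand. A short sketch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

my_lst = [1.9, 1.5, 0.8]
mm_scaler = MinMaxScaler()
scaled = mm_scaler.fit_transform(np.array(my_lst).reshape(-1, 1))

# inverse_transform maps the 0-1 values back onto the original scale
restored = mm_scaler.inverse_transform(scaled)
print(np.allclose(restored.ravel(), my_lst))  # True
```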

_____

Here is the low-level working of the standard deviation calculation. Note that this is the population standard deviation (dividing by n), which is the same formula StandardScaler uses.

from math import sqrt

my_mean = sum(my_lst) / len(my_lst)
my_std = sqrt(sum((x - my_mean) ** 2 for x in my_lst) / len(my_lst))

my_mean, my_std
## (1.4, 0.45)

[(i - my_mean) / my_std for i in my_lst]
## [1.09, 0.21, -1.31]

[(i - 0) / my_std for i in my_lst]
## [4.17, 3.29, 1.75]

(1.9 + 1.5 + 0.8) / 3
## 1.4

sqrt((((1.9 - 1.4) ** 2) + ((1.5 - 1.4) ** 2) + ((0.8 - 1.4) ** 2)) / 3)
## 0.45

((1.9 - 1.4) / 0.45), ((1.5 - 1.4) / 0.45), ((0.8 - 1.4) / 0.45)
## (1.1, 0.22, -1.33)

((1.9 - 0) / 0.45), ((1.5 - 0) / 0.45), ((0.8 - 0) / 0.45)
## (4.22, 3.33, 1.77)
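NumPy reproduces the same figures, since np.std defaults to the population formula (ddof=0) used above:

```python
import numpy as np

my_lst = [1.9, 1.5, 0.8]
print(round(float(np.mean(my_lst)), 2))  # 1.4
print(round(float(np.std(my_lst)), 2))   # 0.45
```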


## Changing CSV to parquet

Converting a CSV file to parquet is much easier than you think.
Simply copy the table using the "CREATE TABLE ... AS" syntax with a different format and external location!

CREATE TABLE ghcnblog.tblallyears_qa2
WITH (format = 'PARQUET', external_location = 's3://todel162/ghcnblog/allyearsqa2/')
AS SELECT * FROM "ghcnblog"."tblallyears1";


## Vectorize your data into a sparse matrix

A vectorizer is an important tool for processing unstructured data. For example, natural language processing is hardly possible without something like a count vectorizer.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer()
my_cv = cv.fit_transform(result['Reviews'])

# pd.SparseDataFrame was removed from pandas; the sparse accessor does the same job
cv_df = pd.DataFrame.sparse.from_spmatrix(my_cv, columns=cv.get_feature_names_out())

cvs = cv_df.iloc[1]
cvs[cvs > 0]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(result['Reviews'])
tf_ndf = pd.DataFrame.sparse.from_spmatrix(X, columns=tfidf.get_feature_names_out())

tf_row = tf_ndf.iloc[1]
tf_row[tf_row > 0]
_____

The default vectorizer only keeps tokens of 2 or more alphanumeric characters, which means single letters are ignored. Change the parameters to something like this...

cv = CountVectorizer(binary=True, max_features=200, token_pattern=r"(?u)\b\w+\b",
                     vocabulary=my_vocabulary, stop_words=my_stop_words)

Where the variables are...

my_vocabulary = ['movie', 'movies', 'movi']

from sklearn.feature_extraction import text
additions = ['br', 's', 't', 'don']
my_stop_words = text.ENGLISH_STOP_WORDS.union(additions)
