Shantanu's Blog
Database Consultant
January 14, 2025
RAG made easy using LlamaIndex
# use a virtual environment to install python and packages
uv init ai-app2
cd ai-app2
uv add llama-index
# download sample data
mkdir data
cd data
wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
cd ..
# start python prompt inside the project environment
uv run python
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
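To avoid re-reading and re-embedding the documents on every run, the index can be persisted to disk and reloaded later. A minimal sketch based on llama-index's default local storage (the "storage" directory name is arbitrary):
from llama_index.core import StorageContext, load_index_from_storage
# save the index built above into a local directory
index.storage_context.persist(persist_dir="storage")
# later: reload the saved index instead of rebuilding it
storage_context = StorageContext.from_defaults(persist_dir="storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()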
Labels: machine_learning, python
November 17, 2024
Language prediction
The fastText library by Facebook includes a language identification feature.
import fasttext
model = fasttext.load_model("/tmp/lid.176.ftz")
model.predict(" विकिपीडिया पर", k=2)
The above code correctly returns Hindi ("hi"). There is also the langdetect library, a port of Google's language-detection project. The following code correctly returns Marathi ("mr").
from langdetect import detect
detect("आत्मा आणि")
The polyglot library has supported language detection and other language tools for a long time.
https://github.com/saffsd/polyglot
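Note that lid.176.ftz is not bundled with the fasttext package; it has to be downloaded separately. A minimal sketch, assuming the model URL published in the fastText documentation:
import urllib.request
# fetch the compressed language-identification model into /tmp
url = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"
urllib.request.urlretrieve(url, "/tmp/lid.176.ftz")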
Labels: nlp, python
August 24, 2024
Using playwright on ARM processor
You can use Playwright on an ARM processor by following these steps:
1) Use docker to start a container:
docker run -it --rm --ipc=host mcr.microsoft.com/playwright:v1.46.1-jammy /bin/bash
2) Once inside the container, type these commands:
apt-get update && apt-get install -y python3-pip
pip install playwright
playwright install
3) Create or copy a test file:
vi app/app.py
from playwright.sync_api import sync_playwright

def handler(event, context):
    with sync_playwright() as p:
        browser = p.chromium.launch(args=["--disable-gpu", "--single-process", "--headless=new"], headless=True)
        page = browser.new_page()
        page.goto("https://stackoverflow.com/questions/9780717/bash-pip-command-not-found")
        print(page.title())
        browser.close()

# call the handler so the script prints the page title when run directly
handler(None, None)
4) Run the file:
python3 app/app.py
If you get the title of the page, i.e. "python - bash: pip: command not found - Stack Overflow" as output then everything is working ok.
_____
Here is another example:
import asyncio
from playwright.async_api import async_playwright  # 1.44.0

async def main():
    term = "\"टंकलेखन\""
    url = f"https://www.google.com/search?q={term}"
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(args=["--disable-gpu", "--single-process", "--headless=new"], headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        # Find the element with ID "result-stats" and get its text
        result_stats = await page.locator('#result-stats').text_content()
        if result_stats:
            print("Result stats:", result_stats)
        else:
            print("Element 'result-stats' not found.")
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())
Labels: python
October 14, 2022
Finding the number of google search results
Here is the Python code to return the number of results returned by a Google search query.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36','referer':'https://www.google.com'}
mylist = list()
def get_count(word):
    target_url = 'https://www.google.com/search?q=' + word + '&num=200'
    resp = requests.get(target_url, headers=headers)
    soup = BeautifulSoup(resp.text, 'html.parser')
    results = soup.find_all("div", {"class": "jGGQ5e"})
    print(word, len(results))
    for x in range(0, len(results)):
        domain = urlparse(results[x].find("div", {"class": "yuRUbf"}).find("a").get("href")).netloc
        mylist.append(domain)
    return mylist
Now when I call the function, I get to know how many results were returned for a given query:
get_count('india')
This will show
india 98
['en.wikipedia.org',
'www.india.gov.in',
'www.state.gov', ... ... ]
There are 98 domains returned for a given search term.
Labels: python
March 16, 2022
Interactive pandas dataframe
itables is a useful utility for pandas. It makes a DataFrame or Series interactive.
https://github.com/mwouts/itables
Install the package with:
pip install itables
Activate the interactive mode:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
or use itables.show to show just one Series or DataFrame as an interactive table.
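For example, a minimal sketch using show (the sample DataFrame here is made up):
import pandas as pd
from itables import show
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
show(df)  # renders this one DataFrame as an interactive, sortable, searchable table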
_____
1) At the moment itables does not have an offline mode. While the table data is embedded in the notebook, the jQuery and datatables.net libraries are loaded from a CDN.
2) When the data in a table is larger than maxBytes, which is equal to 64KB by default, itables will display only a subset of the table. To show the table in full, modify the value of maxBytes either locally:
show(df, maxBytes=0)
or globally:
import itables.options as opt
opt.maxBytes = 2 ** 20
Labels: pandas, python, usability
April 18, 2021
fuzzy matching
Record linking and fuzzy matching are terms used to describe the process of joining two data sets together that do not have a common unique identifier. Examples include trying to join files based on people’s names or merging data that only have organization’s name and address.
Everyone and his uncle seems to have a solution, but none of them work well. Here is a good article on this subject; it mentions the fuzzymatcher and recordlinkage modules.
https://pbpython.com/record-linking.html
_____
An easier solution is mentioned in the comments of that page...
#!pip install string-grouper
import pandas as pd
from string_grouper import match_strings
df = pd.read_csv("company_list.csv", encoding="ISO-8859-1", header=None)
df.columns = ["serial", "company_name"]
ndf = match_strings(df["company_name"])
ndf[ndf["left_index"] != ndf["right_index"]].to_csv("to_study.csv")
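If the default threshold returns too many or too few candidate pairs, it can be tuned; a hedged sketch, assuming string_grouper's min_similarity option (the 0.7 value is arbitrary):
# keep only pairs with cosine similarity of at least 0.7 (assumed parameter name)
ndf = match_strings(df["company_name"], min_similarity=0.7)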
_____
This code will find similar strings (similarity above 0.4) for each line of a given file.
from string_grouper import compute_pairwise_similarities
import pandas as pd

with open('/home/ubuntu/ne_list2.txt') as f:
    mylist = f.readlines()

nelist = list()
for i in mylist:
    nelist.append(i.replace('\n', ''))

df = pd.DataFrame(nelist)
df.columns = ['bookname']

lastlist = list()
for i in df.bookname:
    for y in df.bookname:
        x = compute_pairwise_similarities(pd.Series([i]), pd.Series([y]))
        if i != y and x.item() > 0.4:
            lastlist.append((i, y))

pd.DataFrame(lastlist).to_csv('/home/ubuntu/to_study.csv')
Labels: python
April 17, 2021
crawl indian language websites
CommonCrawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.
https://commoncrawl.org/
Here are 3 easy steps to download the data in any language, e.g. Marathi or Gujarati:
1) Clone the repo: git clone https://github.com/qburst/common-crawl-malayalam.git
2) Change language code from mal to mar at the line -
AND content_languages LIKE 'mal%'"
The other codes are: hin (Hindi), mal (Malayalam), mar (Marathi), guj (Gujarati)
Run this shell script after replacing XXX with your AWS access key and secret key.
./extract_malayalam_warcs.sh XXX XXX 2021-10 s3://nlp-malayalam
3) Change the Unicode range from Malayalam, range(3328, 3456), to your desired language range, e.g. Devanagari: range(2304, 2432) or Gujarati: range(2688, 2816). A small filtering sketch based on this idea follows these steps.
Run this shell script after replacing XXX with your AWS access key and secret key.
./filter_malayalam.sh XXX XXX s3://nlp-malayalam/2021-10/warcs s3://nlp-malayalam/2021-10/output2
Filtered and unfiltered data will be available in output2 sub-folder of s3 bucket.
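As an illustration of the Unicode-range idea in step 3, here is a minimal, self-contained sketch (independent of the repo's scripts) that keeps only lines that are mostly Devanagari; the 0.5 threshold is arbitrary:
def is_devanagari_line(line, threshold=0.5):
    # count characters that fall in the Devanagari block (code points 2304-2431)
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    hits = sum(1 for c in chars if 2304 <= ord(c) < 2432)
    return hits / len(chars) >= threshold

text = "मराठी मजकूर\nEnglish line\nગુજરાતી લખાણ"
for line in text.split("\n"):
    if is_devanagari_line(line):
        print(line)  # prints only the Marathi (Devanagari) line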
_____
Update:
The good folks from Sorbonne University and Inria have extracted sentences from Common Crawl for all languages and made them available at
https://oscar-corpus.com/
I have no words to thank them. The paper is available here...
http://corpora.ids-mannheim.de/CMLC7-final/CMLC-7_2019-Oritz_et_al.pdf
Without this data, it would have been very difficult to extract reliable text for languages like Marathi and Gujarati.
Labels: athena, aws, python, shell script
September 18, 2020
Speech to text using AssemblyAI
Run this Python code to submit an mp3 file from S3. You will have to register first to get the authorization API key.
https://app.assemblyai.com/login/
import requests
headers = {
"authorization": "XXX",
"content-type": "application/json"
}
endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
"audio_url": "https://s3-us-west-2.amazonaws.com/blog.assemblyai.com/audio/8-7-2018-post/7510.mp3"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json())
You will get id and status like this...
'id': 'g9j4q46h9-5d04-4f96-8186-b4def1b1b65b', 'status': 'queued',
Use the id to query the results.
endpoint = "https://api.assemblyai.com/v2/transcript/g9j4q46h9-5d04-4f96-8186-b4def1b1b65b"
response = requests.get(endpoint, headers=headers)
print(response.json())
And you will get the text of audio file. It will look something like this...
'text': 'You know, Demons on TV like that. And and for people to expose themselves to being rejected on TV or you know, her M humiliated by fear factor or you know.'
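Instead of querying manually until the job finishes, a small polling loop can wait for it. A minimal sketch reusing the endpoint and fields shown above (the 5-second interval is arbitrary):
import time
import requests

headers = {"authorization": "XXX"}
endpoint = "https://api.assemblyai.com/v2/transcript/g9j4q46h9-5d04-4f96-8186-b4def1b1b65b"

while True:
    result = requests.get(endpoint, headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(5)  # wait a few seconds before checking again

print(result.get("text"))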
Labels: machine_learning, python
April 17, 2020
High Level module for NLP tasks
GluonNLP provides pre-trained models for common NLP tasks. Its carefully designed APIs greatly reduce implementation complexity.
import mxnet as mx
import gluonnlp as nlp
glove = nlp.embedding.create('glove', source='glove.6B.50d')
def cos_similarity(embedding, word1, word2):
    vec1, vec2 = embedding[word1], embedding[word2]
    return mx.nd.dot(vec1, vec2) / (vec1.norm() * vec2.norm())
cos_similarity(glove, 'baby', 'infant').asnumpy()[0]
_____
This will load the WikiText-2 Wikipedia tokens as a Python list.
train = nlp.data.WikiText2(segment='train')
train[10000:10199]
Labels: machine_learning, nlp, python
April 16, 2020
Download and unzip any file using python
Usually I download a file and extract it using 2 Linux commands like this...
! wget https://github.com/d6t/d6tstack/raw/master/test-data.zip
! unzip -o test-data.zip
But it can also be done using python code as shown below!
import urllib.request
import zipfile

cfg_fname_sample = "test-data.zip"
urllib.request.urlretrieve(
    "https://github.com/d6t/d6tstack/raw/master/" + cfg_fname_sample, cfg_fname_sample
)
with zipfile.ZipFile(cfg_fname_sample, "r") as zip_ref:
    zip_ref.extractall(".")
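The intermediate file on disk can also be skipped by unzipping directly from memory; a minimal sketch using only the standard library:
import io
import urllib.request
import zipfile

url = "https://github.com/d6t/d6tstack/raw/master/test-data.zip"
with urllib.request.urlopen(url) as resp:
    data = resp.read()  # download the whole archive into memory

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    zf.extractall(".")  # extract into the current directory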
Labels: pandas, python, usability
April 10, 2020
Numpy Basics
Some basic numpy methods everyone should be aware of.
import numpy as np
mylist = [[100, 2, 3], [4, 5786, 6]]
a = np.array(mylist)                  # create a 2x3 array from a nested list
a
np.ravel(a)                           # flatten to a 1-D array
np.append(a, [2])                     # append a value (result is flattened)
np.append(a, [10, 11, 12])            # append several values (also flattened)
np.append(a, [""])                    # appending a string upcasts everything to strings
b = np.array([[400], [800]])
np.append(a, b, axis=1)               # add b as a new column
np.append(a, [[50, 60, 70]], axis=0)  # add a new row
np.insert(a, 2, [1, 2, 34])           # insert values at position 2 of the flattened array
a[:1]                                 # first row, kept as a 2-D slice
a[0][2]                               # element at row 0, column 2
a.size                                # total number of elements
a.shape                               # dimensions, here (2, 3)
np.where(a == 3)                      # indices where the condition holds
np.sort(a, axis=1)                    # sort each row
np.sort(a, axis=0)                    # sort each column
a.tolist()                            # convert back to a nested Python list
np.delete(a, 1, axis=0)               # remove the second row
np.delete(a, 1, axis=1)               # remove the second column
Labels: pandas, python
March 04, 2020
Redshift UDF to remove salutations
There are times when I need to remove salutations like Mr. or Mrs. from the name column in Redshift. I can write a user-defined function that will do the needful.
# select f_extract_name2('mr shantanu oak');
f_extract_name2
-----------------
SHANTANU OAK
(1 row)
The function is written in Python and the source code looks like this...
CREATE OR REPLACE FUNCTION f_extract_name2 (myname varchar(1000)) RETURNS varchar(1000) IMMUTABLE as $$
try:
    remove_list = ['MR', 'MR.', 'MRS.', 'MRS', 'MISS', 'MASTER', 'MISS.', 'MASTER.']
    new_list = list()
    for i in myname.upper().split():
        if i not in remove_list:
            new_list.append(i)
    if len(new_list) == 2:
        return " ".join(new_list)
except:
    pass
$$ LANGUAGE plpythonu
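The body can be tested as plain Python before creating the function in Redshift; a minimal sketch of the same logic:
def extract_name(myname):
    # same logic as the UDF body above
    remove_list = ['MR', 'MR.', 'MRS.', 'MRS', 'MISS', 'MASTER', 'MISS.', 'MASTER.']
    new_list = [i for i in myname.upper().split() if i not in remove_list]
    if len(new_list) == 2:
        return " ".join(new_list)

print(extract_name('mr shantanu oak'))  # SHANTANU OAK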
Labels: python, redshift
February 19, 2020
Manage athena tables using python
Here are a few lines of code to read data from an Athena table into a pandas dataframe.
import pandas as pd
from pyathena import connect
from pyathena.pandas.util import to_sql

bucket = "my_bucket_name"
conn = connect(
    aws_access_key_id=access,        # your AWS access key
    aws_secret_access_key=secret,    # your AWS secret key
    s3_staging_dir="s3://" + bucket + "/tutorial/staging/",
    region_name="us-east-1",
)
ndf = pd.read_sql("SELECT * FROM sampledb.todel limit 100", conn)

# pandas dataframe to Athena table
to_sql(ndf, "sample_table", conn, "s3://" + bucket + "/tutorial/s3dir/",
       schema="sampledb", index=False, if_exists="replace")
_____
The following script will display the output of the "show create table" command for all tables. It will also create a new Excel file called "output.xlsx", with 10 records from each table saved on separate sheets. Running this script is a good way to learn more about the tables you have saved in Athena.
import pandas as pd
from pyathena import connect

conn = connect(
    aws_access_key_id="XXX", aws_secret_access_key="XXX",
    s3_staging_dir="s3://as-athena-qquery-results-5134690XXXX-us-east-1/todel/",
    region_name="us-east-1",
)

dbname = pd.read_sql("show databases", conn)

mydict = dict()
for db in dbname["database_name"]:
    mydict[db] = pd.read_sql("show tables in {0}".format(db), conn)

newdict = dict()
for k in mydict.keys():
    for i in mydict[k].values:
        newdict["{0}.{1}".format(k, i[0])] = pd.read_sql("show create table {0}.{1}".format(k, i[0]), conn)

# print the create table output:
for i in newdict.keys():
    for x in newdict[i].values:
        print(x[0])
    print("\n")

# select 10 records from each table and save as excel sheets:
datadict = dict()
for k in mydict.keys():
    for i in mydict[k].values:
        try:
            datadict["{0}.{1}".format(k, i[0])] = pd.read_sql("select * from {0}.{1} limit 10".format(k, i[0]), conn)
        except:
            pass

with pd.ExcelWriter("output.xlsx") as writer:
    for k, v in datadict.items():
        v.to_excel(writer, sheet_name=k)
Labels: athena, aws, pandas, python
connect to redshift and import data into pandas
There are 2 ways to connect to the Redshift server and get the data into a pandas dataframe: use the module "sqlalchemy" or "psycopg2". As you can see from the connection string, sqlalchemy uses the psycopg2 module internally.
import pandas as pd
from sqlalchemy import create_engine

pg_engine = create_engine(
    "postgresql+psycopg2://%s:%s@%s:%i/%s" % (myuser, mypasswd, myserver, int(myport), mydbname)
)
my_query = "select * from some_table limit 100"
df = pd.read_sql(my_query, con=pg_engine)
since "create_engine" class can also be used to connect to mysql database, it is recommended for the sake of consistency.
_____
#!pip install psycopg2-binary
import psycopg2
import pandas as pd

pconn = psycopg2.connect("host=myserver port=myport dbname=mydbname user=myuser password=mypasswd")
my_query = "select * from some_table limit 100"
cur = pconn.cursor()
cur.execute(my_query)
myrows = cur.fetchall()
# build the dataframe and take column names from the cursor description
df = pd.DataFrame(myrows, columns=[col[0] for col in cur.description])
Labels: aws, pandas, python, redshift
February 16, 2020
Using Python in Redshift
Here is a Python function that can be installed in Redshift. It will normalize the text by removing junk characters and non-essential strings.
CREATE OR REPLACE FUNCTION f_file_split (mystr varchar(1000)) RETURNS varchar(1000) IMMUTABLE as $$
try:
    import itertools
    mylist = list()
    if mystr:
        for i in mystr[:100].split("_"):
            for x in i.split("-"):
                for y in x.split("/"):
                    mylist.append(y.split("."))
        news = ' '.join(itertools.chain(*mylist))
        newlist = list()
        stopwords = ['sanstha', 'vikas', 'society', 'seva', 'json']
        for i in news.split():
            if len(i) < 4 or i in stopwords or i.isdigit() or i.startswith('bnk') or not i.isalpha():
                pass
            else:
                newlist.append(i.lower().replace('vkss', '').replace('vks', ''))
        return ' '.join(set(newlist))
except:
    pass
$$ LANGUAGE plpythonu
I can add a new column in the table and populate that column with transformed values.
alter table final_details add column branch_name_fuzzy varchar(500);
update final_details set branch_name_fuzzy = f_file_split(filename);
Labels: aws, python, redshift
February 15, 2020
Manage Athena tables using PyAthena
PyAthena is an indispensable tool for Amazon Athena.
import pandas as pd
from pyathena import connect
from pyathena.pandas.util import to_sql
# create connection object
conn = connect(aws_access_key_id="xxx",
aws_secret_access_key="xxx",
s3_staging_dir="s3://testme162/tutorial/staging/",
region_name="us-east-1",
)
# You may have a very large dataframe instead of this...
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 11, 21, 545]})
# use the helper function
to_sql(df, "todel", conn, "s3://testme162/tutorial/s3dir/",
schema="sampledb", index=False, if_exists="replace")
# read the athena data into a new dataframe
ndf = pd.read_sql("SELECT * FROM sampledb.todel limit 100", conn)
Labels: athena, aws, python
February 08, 2020
Pandas case study 22
This is how easy it is to connect to a MySQL data source and get the query results into a dataframe.
# conda install -y sqlalchemy pymysql
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine('mysql+pymysql://root:XXXXX@172.17.0.1/examDB')
df = pd.read_sql_query('SELECT * FROM candidateresult limit 10', engine, index_col = 'resultid')
_____
You can also connect to redshift database and get the data into a pandas dataframe using this code...
import easyboto
x=easyboto.connect()
x.my_add='xxxx.us-east-1.redshift.amazonaws.com'
x.my_user='root'
x.my_pas='xxx'
x.my_db='xxx'
dlr=x.runQuery("select * from some_table limit 10 ")
dlr.columns=["useID","messageid","client_code", "message", "status", "mobilenos"]
dlr = dlr.set_index('messageid')  # set_index returns a new dataframe
You will need to save a file called "easyboto.py" and here is the code:
https://raw.githubusercontent.com/shantanuo/easyboto/master/easyboto.py
_____
For more advanced options, use awswrangler:
https://oksoft.blogspot.com/search?q=awswrangler
Labels: pandas, python
January 30, 2020
Google translate API
# This will translate Finnish text into Marathi using a high-level Python module that uses the Google Translate API internally.
import googletrans
translator = googletrans.Translator()
result = translator.translate("Mikä on nimesi", src="fi", dest="mr")
result.text
# 'तुझे नाव काय आहे'
# What is your name
# googletrans.LANGUAGES # for full list of supported languages
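The same Translator object can also be used for language detection; a minimal sketch (the lang and confidence attributes are how I recall the googletrans result object, so treat the exact field names as an assumption):
import googletrans

translator = googletrans.Translator()
detected = translator.detect("आत्मा आणि")
print(detected.lang, detected.confidence)  # e.g. "mr" with a confidence score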
Labels: google, python
January 23, 2020
Pandas case study 21
I have 2 dataframes having exactly the same data, but in a different order and with different column names.
Based on the numbers in the two data frames, I would like to be able to match each column name in df1 to each column name in df2.
from io import StringIO
import pandas as pd
audit_trail1 = StringIO(
"""a1 a2 a3 a4 a5 a6 a7
1 3 4 5 3 4 5
0 2 0 3 0 2 1
2 5 6 5 2 1 2
"""
)
df1 = pd.read_csv(audit_trail1, sep="\s+")
audit_trail2 = StringIO(
"""b1 b2 b3 b4 b5 b6 b7
3 5 4 5 1 4 3
0 1 2 3 0 0 2
2 2 1 5 2 6 5
"""
)
df2 = pd.read_csv(audit_trail2, sep="\s+")
# Solution: sort the columns of each transposed frame by their values; columns containing the same data end up at the same position, so zipping the two sorted indexes gives the column mapping.
m = df1.T.sort_values(by=list(df1.index)).index
n = df2.T.sort_values(by=list(df2.index)).index
dict(zip(m, n))
# https://stackoverflow.com/questions/59721120/use-data-in-pandas-data-frames-to-match-columns-together
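To double-check the result, df2 can be renamed with the inverse mapping and compared with df1; a minimal sketch (this assumes the rows are already in the same order, as they are in this example):
mapping = dict(zip(m, n))                 # df1 column -> matching df2 column
inverse = {v: k for k, v in mapping.items()}
print(df1.equals(df2.rename(columns=inverse)[df1.columns]))  # True if the mapping is correct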
Labels: pandas, python
January 22, 2020
Understanding Naive Bayes
Naive Bayes implicitly assumes that all the attributes are mutually independent, which is rarely true in practice. Also, if a categorical variable has a category in the test dataset that was not observed in the training dataset, the model may fail.
Naive Bayes can handle any type of data (e.g. continuous or discrete) and the size of the data does not really matter. It can be applied to the Iris dataset as shown below:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, y_pred)
Labels: machine_learning, python