Athena can also be used to quickly preview CSV or text files saved in S3.
# vi symlink.txt
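The symlink file is just a plain-text list of the full S3 paths of the data files, one per line. The paths below are hypothetical, for illustration only:
s3://132cols/data/file1.csv
s3://132cols/data/file2.csv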
As you can see, I have saved the full path of each data file in a symlink text file, which I then upload to S3 using this command:
aws s3 cp symlink.txt s3://132cols/symlink/
I will now use the S3 path of this symlink file while creating the table, as shown below.
CREATE EXTERNAL TABLE test_bulk (dummy_col string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '~'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://132cols/symlink/';
The important point to note is that I have changed the comma (,) in the "FIELDS TERMINATED BY" clause to a tilde (~); otherwise I would get only the first column. Because of this change, I can see the entire row in a single column called "dummy_col". This tip is useful when you want to check what's in a large file without downloading it first. It is also a quick way to re-check that a file you have uploaded to S3 is valid and readable.
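With the table in place, a simple query is enough to peek at the raw rows; a minimal sketch, assuming the DDL above:
-- Preview the first few raw lines of the file
SELECT dummy_col FROM test_bulk LIMIT 10;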
Labels: athena, aws
A Pandas dataframe stores all the data in a single table, which makes it difficult to understand the relationships between columns. For example, I would like to know how Area Abbreviation is related to Item Code in the following data.
# Download the csv file from kaggle:
import pandas as pd
food_df = pd.read_csv('FAO.csv', encoding='latin1')
food_df = food_df.drop(columns=food_df.columns[10:])
I will now import the autonormalize module from featuretools. It detects the internal relationships between columns and splits the dataframe into a set of linked entities.
from featuretools.autonormalize import autonormalize as an
entityset = an.auto_entityset(food_df)
Printing the entityset shows the detected entities and the relationships between them:
Entities:
  Element Code_Item Code_Area Code [Rows: 21477, Columns: 4]
  Element Code [Rows: 2, Columns: 2]
  Item Code [Rows: 117, Columns: 2]
  Area Code [Rows: 174, Columns: 5]
  Area Abbreviation [Rows: 169, Columns: 2]
Relationships:
  Element Code_Item Code_Area Code.Area Code -> Area Code.Area Code
  Element Code_Item Code_Area Code.Item Code -> Item Code.Item Code
  Element Code_Item Code_Area Code.Element Code -> Element Code.Element Code
  Area Code.Area Abbreviation -> Area Abbreviation.Area Abbreviation
Do not forget to check out the featuretools dfs function as well. It adds new feature columns to your dataframe that can be useful in building a machine learning model.
import featuretools as ft
fm, features = ft.dfs(entityset=entityset, target_entity='Element Code_Item Code_Area Code')
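To see what dfs produced, you can inspect the returned objects; a minimal sketch, continuing from the call above:
# fm is the feature matrix (a dataframe of the generated feature columns)
print(fm.shape)
# features is the list of feature definitions that were generated
print(features[:5])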
Labels: machine_learning, pandas, usability
Here are some useful Google Chrome extensions:
1) HTML5 Outliner: Generates a navigable page outline with heading and sectioning elements
2) Page load time: Displays page load time in the toolbar
3) Open Multiple URLs: Extract and open a list of URLs
4) Former2 Helper: Helps avoid CORS issues with former2.com while calling the AWS service API endpoints
5) Chrome Remote Desktop: Securely access another computer over the internet
6) Open in Colab: Open a Github-hosted notebook in Google Colab
Outliers can be removed or adjusted using statistical methods such as the IQR, Z-score and data smoothing.
1) To calculate the IQR (Inter-Quartile Range) of a dataset, first calculate its 1st quartile (Q1) and 3rd quartile (Q3), i.e. the 25th and 75th percentiles of the data, and then subtract Q1 from Q3.
import pandas as pd
data = [-2,8,13,19,34,49,50,53,59,64,87,89,1456]
df = pd.DataFrame(data)
df.columns = ['values']
# describe() returns count, mean, std and the quartile values of the column
ndf = df['values'].describe()
iqr = ndf['75%'] - ndf['25%']
# iqr is 45
To find outliers using the IQR, we define a multiplier (conventionally 1.5) that decides how far below Q1 and above Q3 a value must fall to be considered an outlier.
higher_limit = ndf['75%'] + 1.5 * iqr
lower_limit = ndf['25%'] - 1.5 * iqr
df[(df['values'] > higher_limit) | (df['values'] < lower_limit)]
2) The Z-score tells how far a point is from the mean of the dataset in terms of standard deviations. A point whose absolute z-score is above 3 is commonly treated as an outlier.
from scipy import stats
# compute the z-score of each value
df['z_score'] = stats.zscore(df['values'])
df[df['z_score'].abs() > 3]
# returns the row containing 1456
3) Data smoothing is a process to adjust spikes and peaks. If your current value is 13, the previous value is 8 and the smoothing level is 0.6, then the smoothed value is 11, given by:
13*0.6 + (1-0.6)*8
The Pandas smoothing function ewm() can be used to calculate the exponentially weighted moving average at different alpha levels.
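Here is a minimal sketch using the sample data from the IQR example above. Note that ewm() smooths against the previous smoothed value rather than the previous raw value, so its output differs slightly from the hand calculation:
import pandas as pd

data = [-2, 8, 13, 19, 34, 49, 50, 53, 59, 64, 87, 89, 1456]
df = pd.DataFrame({'values': data})
# adjust=False applies the recursive formula:
# smoothed_t = alpha * value_t + (1 - alpha) * smoothed_(t-1)
df['smoothed'] = df['values'].ewm(alpha=0.6, adjust=False).mean()
print(df)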
Labels: machine_learning, pandas
Here is how we can run a single query across multiple regions.
-- Create a table with a single id column
CREATE EXTERNAL TABLE ids (id bigint)
PARTITIONED BY (region string)
ROW FORMAT DELIMITED
-- a LOCATION is required; the bucket name here is a placeholder
LOCATION 's3://us-bucket-example/ids/';
-- Add two partitions, each pointing at a bucket in a different region (bucket names are placeholders)
ALTER TABLE ids ADD PARTITION (region='us') LOCATION 's3://us-bucket-example/ids/';
ALTER TABLE ids ADD PARTITION (region='eu') LOCATION 's3://eu-bucket-example/ids/';
-- Count the distinct ids across all regions
SELECT COUNT(DISTINCT id) FROM ids;
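The same table also answers per-region questions; a small variation of the query above:
-- Count the distinct ids in each region separately
SELECT region, COUNT(DISTINCT id) FROM ids GROUP BY region;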