October 28, 2024

awk Case Study - 14

1) Download the FreeDict dictionary files (TEI format):

git clone https://github.com/freedict/fd-dictionaries.git

2) Download PyGlossary, a Python package for converting dictionary files:

git clone https://github.com/ilius/pyglossary.git
cd pyglossary/
cp /home/ubuntu/fd-dictionaries/eng-hin/eng-hin.tei .
python3 main.py

# Use PyGlossary to convert the eng-hin.tei file to out.txt

Select the first 3 columns:

Replace each run of HTML tags with a single pipe (|) delimiter,
then write the first 3 columns of every line to test.csv:

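awk '{
    # Replace every HTML tag with a pipe
    gsub(/<[^>]*>/, "|")
    # Collapse runs of consecutive pipes into a single pipe
    gsub(/\|+/, "|")
    # Match the first three pipe-terminated fields (trailing pipe is kept)
    match($0, /([^|]*\|){3}/)
    first_three = substr($0, RSTART, RLENGTH)
    print first_three
}' out.txt > test.csv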
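
The same idea can be tried on a single line before running it on the whole file. The sketch below is a variation, not the script above: it uses split() instead of match(), so it does not depend on the {3} interval expression (which some older awk builds may not support), it drops the trailing pipe, and it still prints lines that have fewer than three fields. The function name first_three_fields and the sample line are made up for illustration; real lines in out.txt may look different.

```shell
# Hypothetical helper (not from the original post): reads lines on stdin
# and prints the first three pipe-separated fields of each line.
first_three_fields() {
    awk '{
        gsub(/<[^>]*>/, "|")      # every HTML tag becomes a pipe
        gsub(/\|+/, "|")          # collapse runs of pipes into one
        n = split($0, f, "|")     # split on the literal pipe character
        out = f[1]
        for (i = 2; i <= 3 && i <= n; i++)
            out = out "|" f[i]    # re-join at most three fields
        print out
    }'
}

# Made-up sample line, only to show the transformation.
printf '%s\n' 'apple<br><i>n.</i>seb<br>fruit' | first_three_fields
# prints: apple|n.|seb
```

To apply it to the converted file: first_three_fields < out.txt > test.csv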
