#5 Basic NLP with Command-line
Alex Flückiger
Faculty of Humanities and Social
Sciences
University of Lucerne
31 March 2022
.
├── README.md
└── lectures
├── images
│ └── ai.jpg
└── md
├── KED2022_01.md
└── KED2022_02.md
Example location of the course material:
/home/alex/KED2022
pwd            # get the path to the current directory
cd ..          # go one folder up
cd FOLDERNAME  # go one folder down into FOLDERNAME
ls -l          # see the content of the current folder
Common plain-text formats: .txt, .csv, .tsv, .xml
wc *.txt # count number of lines, words, characters
egrep -ir "computational" folder/ # search in all files in folder, ignore case
# common egrep options:
# -i search case-insensitive
# -r search recursively in all subfolders
# --colour highlight matches
# --context 2 show 2 lines above/below match
egrep -ic "big data" *.txt # count across all txt-files, ignore case
# piping steps to get word frequencies
cat text.txt | tr " " "\n" | sort | uniq -c | sort -h > wordfreq.txt
# explanation of individual steps:
tr " " "\n"  # replace spaces with newlines
sort         # sort lines alphabetically, so identical words are adjacent
uniq -c      # count repeated lines
sort -h      # sort by count (numeric sort)
relative frequency = n_occurrences / n_total_words
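The relative frequency can also be computed directly in the pipeline; a minimal sketch using awk (the file name text.txt is assumed, as above):

```shell
# word counts as before, then divide each count by the total;
# awk keeps a running total and prints word/ratio pairs at the end
tr " " "\n" < text.txt | sort | uniq -c | \
    awk '{count[$2]=$1; total+=$1} END {for (w in count) printf "%s\t%.4f\n", w, count[w]/total}'
```

The ratios across all words sum to 1, which makes texts of different lengths comparable.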
Save the word frequencies as a .tsv file:
# convert word frequencies into a tsv-file
# additional step: replace a sequence of spaces with a tabulator
cat text.txt | tr " " "\n" | sort | uniq -c | sort -h | \
tr -s " " "\t" > test.tsv
Print the following sentence in your command line using echo.
echo "There are a few related fields: NLP, computational linguistics, and computational text analysis."
How many words are in this sentence? Use the pipe operator to combine the command above with wc.
Match the word computational and colorize its occurrences in the sentence using egrep.
Get the frequencies of each word in this sentence using tr and other commands.
echo "ÜBER" | tr "A-ZÄÖÜ" "a-zäöü" # fold text to lowercase
echo "3x3" | tr -d "[:digit:]" # remove all digits
cat text.txt | tr -d "[:punct:]" # remove punctuation like .,:;?!-
tr "Y" "Z" # replace any Y with Z
# lowercase, no punctuation, no digits
cat speech.txt | tr "A-ZÄÖÜ" "a-zäöü" | \
tr -d "[:punct:]" | tr -d "[:digit:]" > speech_clean.txt
cat test.txt | tr -s "\n" " " # replace newlines with spaces
cat -n text.txt # show line numbers
sed "1,10d" text.txt # remove lines 1 to 10
# split the file into separate files at every occurrence of the delimiter
csplit huge_text.txt "/delimiter/" {*}
# show differences side-by-side and only differing lines
diff -y --suppress-common-lines text_raw.txt text_proc.txt
🤓 Published code and data are part of the endeavour of open science.
Change into your local copy of the GitHub course repository KED2022 and update it with git pull. If you haven't cloned the repository yet, follow section 5 of the installation guide.
You will find some party programmes (Grüne, SP, SVP) in materials/data/swiss_party_programmes/txt. The programmes are provided as plain text, which I have extracted from the publicly available PDFs.
Have a look at the content of some of these text files using more.
Compare the absolute frequencies of single terms or multi-word expressions of your choice (e.g., Ökologie, Sicherheit, Schweiz)…
Use the file names as filters to get various aggregations of the word counts.
Pick terms of your interest and look at their contextual use by extracting relevant passages. Does the usage differ across parties or time?
Share your insights with the class using Etherpad.
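For the frequency-comparison tasks above, a minimal sketch (the term is an example from the exercise; the exact file names in the folder are assumptions, so adapt the patterns to your local copy):

```shell
# absolute counts of a term per file, case-insensitive
egrep -ic "sicherheit" materials/data/swiss_party_programmes/txt/*.txt

# aggregate by party via the file names, e.g. only SVP programmes
egrep -ic "sicherheit" materials/data/swiss_party_programmes/txt/*svp*.txt

# extract relevant passages with two lines of context
egrep -i --context 2 "sicherheit" materials/data/swiss_party_programmes/txt/*.txt
```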
Export the word counts as a .tsv dataset. Compute the relative word frequency instead of the absolute frequency using any spreadsheet software (e.g. Excel). Are your conclusions still valid after accounting for the size of each programme?

Pro Tip 🤓: Use egrep to look up commands in the .md course slides.
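For example, to find every slide that mentions csplit (the path follows the repository layout shown at the top; adjust it to where your copy lives):

```shell
# search all markdown slides recursively, ignore case, highlight matches
egrep -ir --colour "csplit" lectures/md/
```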
When you look for useful primers on Bash, consider the following resources: