The ABC of Computational Text Analysis

#6 Learning Regular Expressions

Alex Flückiger

Faculty of Humanities and Social Sciences
University of Lucerne

7 April 2022

Recap last Lecture

  • well-solved assignment #1 🎊
  • counting words 🔢
    • particular words or entire vocabulary
  • preprocessing and cleaning 🧼

Outline

  • introducing regular expressions
  • practicing the writing of patterns 🎢

Text as Pattern

Formal Search Patterns

How to extract all email addresses in a text collection?

Please contact us via info@organization.org.
---
For specific questions ask Mrs. Green (a.green@mail.com).
---
Reach out to support@me.ch

👉 Solution: Write a single pattern to match any valid email address

[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}   # match any email address (case-insensitive, i.e. combined with egrep -i)

What are patterns for?

  • finding 🔎
  • extracting 🛠️
  • removing/cleaning 🗑️
  • replacing 🔁

… specific parts in texts

Data Cleaning is paramount!

What are Regular Expressions (RegEx)?

RegEx builds on two classes of symbols (examples below)

  • literal characters and strings
    • letters, digits, words, phrases, dates etc.
  • meta expressions with special meaning
    • e.g., \w represents alphanumeric characters
    • [Cc]o+l → Col, col, Cool, coool …
  • akin to regular languages
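
A quick sketch of both classes in action (made-up strings, to be tried in the terminal):

# literal characters: match the exact string "2022"
echo "Lucerne, 7 April 2022" | egrep -o "2022"

# meta expression: \w+ matches any run of alphanumeric characters
echo "Lucerne, 7 April 2022" | egrep -o "\w+"

# character set + quantifier
echo "a cool col" | egrep -o "[Cc]o+l"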

Finding + Extracting

extended: globally search for a regular expression and print (egrep)

  • tool to filter/keep matching lines only
# check a regular expression quickly
echo "check this pattern" | egrep "pattern" 

egrep "yes" file.txt        # search in a specific file
egrep -r "yes" folder       # search recursively within folder

egrep "yes" *.txt           # keep lines containing pattern (yes) across txt-files
egrep -i "yes" *.txt        # dito, ignore casing (Yes, yes, YES ...)
egrep -v "noisy" *.txt      # do NOT keep lines containing noisy

# extract raw match only to allow for subsequent counting
egrep -o "only" *.txt       # print match only instead of entire line
egrep -h "only" *.txt       # suppress file name
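
A sketch of the subsequent counting (pattern and files are placeholders):

egrep -oh "only" *.txt | sort | uniq -c | sort -h    # count how often each match occurs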

Quantifiers

repeat preceding character X times

  • ? zero or one
  • + one or more
  • * zero or more
  • {n}, {m,n} a specified number of times
egrep -r "Bundesrath?es"        # match old and new spelling
egrep -r "a+"                   # match one or more "a"
egrep -r "e{2}"                 # match sequence of two "e"

⚠️ Do not confuse regex with Bash wildcards!
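
Two more quantifier sketches with made-up strings, plus the wildcard caveat:

echo "col cool coool" | egrep -o "co{2,3}l"     # -> cool, coool
echo "cl col cool" | egrep -o "co*l"            # -> cl, col, cool

# note: on the command line, *.txt is a Bash wildcard (any file ending in .txt);
# inside a regex, * only quantifies the preceding character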

Character Sets

  • [...] any of the characters between brackets
    • any vowel: [auoei]
    • any digit: [0-9]
    • any letter: [A-Za-z]
  • [^...] any character but none of these (negation)
    • anything but the vowels: [^auoei]
# match the capitalized and non-capitalized form
egrep -r "[Gg]rüne"

# match sequences of 3 vowels
egrep -r "[aeiou]{3}"

# extract all bigrams (sequence of two words)
egrep -rohi "[a-z]+ [a-z]+"
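
A sketch of the negated set (made-up string):

# match runs of non-vowels
echo "computational" | egrep -o "[^aeiou]+"     # -> c, mp, t, t, n, l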

Special Symbols

  • . matches any character (excl. newline)
  • \ escapes to match literal
    • \. means the literal . instead of “any symbol”
  • \w matches any alpha-numeric character
    • same as [A-Za-z0-9_]
  • \s matches any whitespace (space, newline, tab)
    • same as [ \t\n]
# match anything between parentheses
egrep -r "\(.*\)"

The power of .*

matches any character any number of times
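
For example (made-up string):

echo "from start over anything to end" | egrep -o "start.*end"    # matches "start over anything to end"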

More Complex Examples

# extract the domain of URLs
egrep -ro "www\.\w+\.[a-z]{2,}"

# extract valid email addresses (case-insensitive)
egrep -iro "[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}" **/*.txt

Combining RegEx with Frequency Analysis

something actually useful

# count political areas by looking up words ending with "politik"
egrep -rioh "\w*politik" **/*.txt | sort | uniq -c | sort -h

# count ideologies/concepts by looking up words ending with "ismus"
egrep -rioh "\w*ismus" **/*.txt | sort | uniq -c | sort -h

Start simple,
add complexity subsequently.

In-class: Exercise

  1. Use the command line to navigate to the local copy of the GitHub repository KED2022 and make sure it is up-to-date with git pull. Change into the directory materials/data/swiss_party_programmes/txt.
  2. Use egrep to extract all uppercased words like UNO, OECD, SP and count their frequency.
  3. Use egrep to extract all plural nouns with female endings e.g. Schweizerinnen (starting with an uppercase letter, ending with innen, and any letter in between). Do the same for the male forms. Is there a qualitative or a quantitative difference between the gendered forms?
# Some not so random hints 
piping with |
sort
uniq -c
egrep -roh "PATTERN" **/*.txt

In-class: Solution

  1. Use egrep to extract all uppercased words like UNO, OECD, SP and count their frequency.
    • egrep -roh "[A-Z]{2,}" **/*.txt | sort | uniq -c | sort -h
  2. Use egrep to extract all plural nouns with female endings e.g. Schweizerinnen (starting with an uppercase letter, ending with innen, and any letter in between). Do the same for the male forms. Is there a qualitative or a quantitative difference between the gendered forms?
    • egrep -roh "[A-Z][a-z]+innen\b" **/*.txt | sort | uniq -c | sort -h
    • egrep -roh "[A-Z][a-z]+er\b" **/*.txt | sort | uniq -c | sort -h (there is no way with regular expression to extract only nouns of the male form but not Wasser and the like. For this, you have to use some kind of machine learning.)

Replacing + Removing

stream editor (sed)

  • advanced find + replace using regex
    • sed "s/WHAT/WITH/g" file.txt
  • sed can replace whole sequences (including regex matches), while tr only translates single characters (see the comparison below)
echo "hello" | sed "s/llo/y/g"      # replace "llo" with a "y"

# by setting the g flag in "s/llo/y/g",
# sed replaces all occurrences, not only the first one
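
A quick comparison sketch (made-up string):

echo "hello" | tr "lo" "xy"          # tr translates character by character -> hexxy
echo "hello" | sed -E "s/l+o/y/"     # sed replaces the whole (regex) sequence -> hey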

Contextual Replacing

reuse match with grouping

  • define a group with parentheses (group_pattern)
  • \1 refers to the match of the first pair of parentheses
  • \2 to the match of the second pair
# swap order of name (last first -> first last)
echo "Lastname Firstname" | sed -E "s/(.+) (.+)/\2 \1/"

# matching also supports grouping
# match any pair of two identical digits
egrep -r "([0-9])\1"

More Meta-Symbols

  • \b matches a word boundary
    • word\b does not match words
  • ^ matches the beginning of a line, $ the end of a line
    • ^A matches A only at the start of a line
  • | is a disjunction (OR)
    • (Mr|Mrs|Mr\.|Mrs\.) Green matches the alternative forms (see the sketch below)
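
Sketches for these symbols (made-up strings; file.txt is a placeholder):

# \b: match "politik" only as a whole word, not inside "aussenpolitik"
echo "politik aussenpolitik" | egrep -o "\bpolitik\b"

# ^: keep only lines that start with a digit
egrep "^[0-9]" file.txt

# |: match either form of address
echo "Mrs Green met Mr. Green" | egrep -o "(Mr|Mrs|Mr\.|Mrs\.) Green"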

Greediness Trap

  • greedy ~ match the longest string possible
  • quantifiers * or + are greedy
  • non-greedy by excluding some symbols
    • [^EXCLUDE_SYMBOLS] instead of .*
# greedy: an apple, other apple
echo "an apple, other apple" | egrep "a.*apple"

# non-greedy: an apple
echo "an apple, other apple" | egrep "a[^,]*apple"

Assignment #2 ✍️

  • get/submit via OLAT
    • starting tomorrow
    • deadline 15 April 2022, 23:59
  • use forum on OLAT
    • subscribe to get notifications
  • ask friends for support, not solutions

In-class: Exercises I

  1. Use egrep to extract capitalized words and count them. What are the most frequent nouns?
  2. Use egrep to extract words following any of these strings: der die das. Hint: Use a disjunction.
  3. Do the self-check on the next slide.
  4. Use sed -E to remove the table of content, the footer and the page number in the programme of the Green Party. Check the corresponding PDF to get a visual impression and test your regular expression with egrep first to see if you match the correct parts in the document.

In-class: Solution I

  1. Use egrep to extract capitalized words and count them. What are the most frequent nouns?
    • egrep -roh "[A-Z][a-z]+" **/*.txt | sort | uniq -c | sort -h
  2. Use egrep to extract words following any of these strings: der die das. Hint: Use a disjunction.
    • egrep -roh "(der|die|das) \w+" **/*.txt
  3. Use sed -E to remove the table of content, the footer and the page number in the programme of the Green Party. Check the corresponding PDF to get a visual impression and test your regular expression with egrep first to see if you match the correct parts in the document.
    • cat gruene_programme_2019.txt | sed "1,192d" | sed -E "s/^Wahlplattform.*2023$//g" | sed -E "s/^[0-9]+$//g"

In-class: Self-Check

equivalent patterns

a+ == aa*               # "a" once or more than once
a? == (a|_)             # "a" once or nothing ("_" stands for the empty string)
a{3} == aaa             # three "a"
a{2,3} == (aa|aaa)      # two or three "a"
[ab] == (a|b)           # "a" or "b"
[0-9] == (0|1|2|3|4|5|6|7|8|9)  # any digit
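
Such equivalences can be checked directly in the terminal, e.g. (made-up string):

echo "page 42" | egrep -o "[0-9]"                     # -> 4, 2
echo "page 42" | egrep -o "(0|1|2|3|4|5|6|7|8|9)"     # -> 4, 2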

In-class: Exercise II

  1. Count all the bigrams (sequence of two words) using character sets and quantifiers. What about trigrams (three words)?
  2. Extract the words following numbers (also consider numbers like: 1'000, 1,000 or 5%). Then, count all the words while excluding the numbers themselves. Hint: Pipe another grep to remove the digits.
  3. You are ready to come up with your own patterns…

In-class: Solution II

  1. Count all the bigrams (sequence of two words) using character sets and quantifiers. What about trigrams (three words)?
    • egrep -hoir "\b[a-z]+ [a-z]+\b" | sort | uniq -c | sort -h
    • egrep -hoir "\b[a-z]+ [a-z]+ [a-z]+\b" | sort | uniq -c | sort -h
  2. Extract the words following numbers (also consider numbers like: 1'000, 1,000 or 5%). Then, count all the words while excluding the numbers themselves. Hint: Pipe another grep to remove the digits.
    • egrep -rhoi "[0-9][0-9,'%]+ [a-z]+" | egrep -io "[a-z]+" | sort | uniq -c | sort -h
    • Alternative: egrep -rhoi "[0-9][0-9,'%]+ [a-z]+" | sed -E "s/[0-9][0-9,'%]+//g" | sort | uniq -c | sort -h

In-class: Exercise III

  1. Since you know about RegEx, we can use a more sophisticated tokenizer to split a text into words. What is the difference between the old and the new approach? Test it and consult the manual page with man.

    # new, improved approach
    cat text.txt | tr -sc "[a-zäöüA-ZÄÖÜ0-9-]" "\n"
    
    # old approach
    cat text.txt | tr " " "\n"   

More Resources

required

  • online regular expression editor

Questions?