#7 Working with (your own) Data
Alex Flückiger
Faculty of Humanities and Social Sciences
University of Lucerne
14 April 2022
Regular expression examples: abc   \w   \s   [^abc]   *   .*
👉 basically, any textual documents…
👉 check out other resources licensed by ZHB
👉 search for a topic followed by corpus, text collection or text as data
😓 There are still not many.
Make your web search more efficient by using dedicated search operators. Examples:
"computational social science"
nature OR environment
site:nytimes.com
digitally native documents (.pdf, .docx, .html) ⬇️ convert to .txt ⬇️ machine-readable ✅
scans of (old) documents (.pdf, .jpg, .png) ⬇️ Optical Character Recognition (OCR) ⬇️ machine-readable ✅
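Whether a PDF needs OCR or only conversion can be checked by trying to extract its text first; a minimal sketch using pdftotext (introduced below), where somefile.pdf and the threshold are only illustrative:

# count the words pdftotext can extract; a near-zero count suggests a scanned PDF
WORDS=$(pdftotext somefile.pdf - | wc -w)
if [ "$WORDS" -lt 10 ]; then
    echo "probably a scan -> needs OCR"
else
    echo "digitally native -> convert directly"
fi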
pandoc converts between many file formats (e.g., .docx files exported from Nexis)

# convert docx to txt
pandoc infile.docx -o outfile.txt

### Install first with
brew install pandoc # macOS
sudo apt install pandoc # Ubuntu
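To convert a whole folder of .docx files in one go, the same command can be wrapped in a loop; a small sketch (file names are illustrative):

# convert every docx in the current folder to txt
for f in *.docx; do
    pandoc "$f" -o "${f%.docx}.txt"
done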
pdftotext extracts text from non-scanned (digitally native) PDFs

# convert native pdf to txt
pdftotext -nopgbrk -eol unix infile.pdf
### Install first with
brew install poppler # macOS
sudo apt install poppler-utils # Ubuntu
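pdftotext also lends itself to batch conversion; a sketch that processes all PDFs in the current folder (each output gets the input name with a .txt extension, pdftotext's default):

# convert every native pdf in the current folder to txt
for f in *.pdf; do
    pdftotext -nopgbrk -eol unix "$f"
done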
tesseract performs OCR
# convert scanned pdf to tiff, control quality with parameters
convert -density 300 -depth 8 -strip -background white -alpha off \
infile.pdf temp.tiff
# run OCR for German ("eng" for English, "fra" for French etc.)
tesseract -l deu temp.tiff file_out
### Install first with
brew install imagemagick # macOS
sudo apt-get install imagemagick # Ubuntu
# disable ImageMagick's PDF security policy (Ubuntu/Linux)
sudo sed -i '/<policy domain="coder" rights="none" pattern="PDF"/d' /etc/ImageMagick-6/policy.xml
# increase memory limits
sudo sed -i -E 's/name="memory" value=".+"/name="memory" value="8GiB"/g' /etc/ImageMagick-6/policy.xml
sudo sed -i -E 's/name="map" value=".+"/name="map" value="8GiB"/g' /etc/ImageMagick-6/policy.xml
sudo sed -i -E 's/name="area" value=".+"/name="area" value="8GiB"/g' /etc/ImageMagick-6/policy.xml
sudo sed -i -E 's/name="disk" value=".+"/name="disk" value="8GiB"/g' /etc/ImageMagick-6/policy.xml
# output searchable pdf instead of txt
convert -density 300 -depth 8 -strip -background white -alpha off -compress group4 \
file_in.pdf temp.tiff
tesseract -l deu temp.tiff file_out pdf
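tesseract can only recognize languages whose models are installed, and several languages can be combined with +; a short sketch (temp.tiff as above):

# list the language models available to tesseract
tesseract --list-langs
# OCR a document that mixes German and French
tesseract -l deu+fra temp.tiff file_out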
wget downloads any file from the internet

# get a single file
wget EXACT_URL
# get all linked PDFs from a single webpage
wget --recursive --accept pdf -nH --cut-dirs=5 \
--ignore-case --wait 1 --level 1 --directory-prefix=data \
https://www.bk.admin.ch/bk/de/home/dokumentation/abstimmungsbuechlein.html
# --accept FORMAT_OF_YOUR_INTEREST
# --directory-prefix YOUR_OUTPUT_DIRECTORY
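If the exact URLs are already known (e.g., collected by hand from the webpage), wget can also read them from a file instead of crawling; a sketch assuming a hypothetical urls.txt with one URL per line:

# download every URL listed in urls.txt into data/
wget --input-file=urls.txt --directory-prefix=data --wait 1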
# loop over all txt files
for file in *.txt; do
	# indent all commands in the loop with a tab
	# rename each file
	# e.g. a.txt -> new_a.txt
	mv "$file" "new_$file"
done
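Before running such a renaming loop for real, it can be previewed by printing the commands instead of executing them; a sketch (remove echo to actually rename):

# dry run: only print what would be done
for file in *.txt; do
	echo mv "$file" "new_$file"
done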
for FILEPATH in *.pdf; do
	# convert pdf to image
	convert -density 300 "$FILEPATH" -depth 8 -strip \
		-background white -alpha off temp.tiff
	# define output name (remove .pdf from input)
	OUTFILE=${FILEPATH%.pdf}
	# perform OCR on the tiff image
	tesseract -l deu temp.tiff "$OUTFILE"
	# remove the intermediate tiff image
	rm temp.tiff
done
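If a long OCR run is interrupted, already processed files can be skipped on the next run; a sketch extending the loop above (language and file layout as before):

for FILEPATH in *.pdf; do
	OUTFILE=${FILEPATH%.pdf}
	# skip PDFs that already have a txt output
	if [ -f "$OUTFILE.txt" ]; then
		continue
	fi
	convert -density 300 "$FILEPATH" -depth 8 -strip \
		-background white -alpha off temp.tiff
	tesseract -l deu temp.tiff "$OUTFILE"
	rm temp.tiff
done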
Exercises

1. Update the course materials with git pull. Check out the data samples in materials/data and the scripts to extract their text in materials/code.
2. Install the tools pandoc, imagemagick and poppler.
3. Use wget to download cogito and its predecessor uniluAKTUELL issues (PDF files) from the UniLu website. Start by downloading one issue first and then try to automate the process to download all the listed issues using arguments for the wget command.
4. Extract the text from the downloaded issues with tesseract. Try with a single issue first and then write a loop to batch process all of them.
5. Use wget to download a book from Project Gutenberg and count some things (e.g., good/bad, joy/sad); see the sketch after this list.
6. wget is a powerful tool. Have a look at its arguments and search for more examples in tutorials on the web.
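For the Project Gutenberg exercise, the core steps are one download followed by simple counting; a minimal sketch (the URL and the words counted are only examples):

# download a plain-text book (URL is illustrative)
wget -O book.txt https://www.gutenberg.org/files/84/84-0.txt
# count occurrences of a few words (case-insensitive, whole words only)
grep -o -i -w "joy" book.txt | wc -l
grep -o -i -w "sad" book.txt | wc -l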