[ITP: Programming A2Z] Word Counting

Notes

  • “Artisanal data” coined by Sarah Groff-Palermo = small, fragmented, incomplete, human

    • Data to express who we are in the language of today

  • Concordance = list of principal words in a text, listing every instance with its immediate context

  • Sentiment analysis: pronouns hold the key! (James Pennebaker)

  • An associative array relates each key to a specific number or value

    • Also called a dictionary; an unordered collection of key/value pairs

  • loadStrings() function returns an array where each element is a line from the text

  • TF-IDF (term frequency inverse document frequency) = what words are important to this text versus others

    • score = term frequency * log(total # of docs / # of docs term appears in)

  • Corpus = collection of written texts
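
To make the TF-IDF score from the notes concrete, here is a minimal sketch in plain JavaScript. The three-document corpus and the function names (termFrequency, documentFrequency, tfidf) are made up for illustration, and it uses the natural log; other variants of the formula exist.

```javascript
// Hypothetical mini-corpus: each document is already split into lowercase tokens.
const corpus = [
  ["the", "book", "cover", "is", "leather"],
  ["the", "cat", "sat", "on", "the", "mat"],
  ["the", "leather", "mat"],
];

// Term frequency: how many times `term` occurs in one document.
function termFrequency(term, doc) {
  return doc.filter((token) => token === term).length;
}

// Number of documents in the corpus that contain `term` at least once.
function documentFrequency(term, docs) {
  return docs.filter((doc) => doc.includes(term)).length;
}

// score = term frequency * log(total # of docs / # of docs term appears in)
function tfidf(term, doc, docs) {
  const df = documentFrequency(term, docs);
  if (df === 0) return 0; // term appears nowhere, so avoid dividing by zero
  return termFrequency(term, doc) * Math.log(docs.length / df);
}

console.log(tfidf("the", corpus[1], corpus)); // 2 * log(3 / 3) = 0
console.log(tfidf("cat", corpus[1], corpus)); // 1 * log(3 / 1)
```

A word like "the" that shows up in every document scores 0 no matter how often it is used, while a word that only one document uses gets the biggest boost.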

Assignment

1. shows initial word counting
2. split tokens by space
3. remove German stop words

I’ve had this question written down in my notebook for the last week: “what are texts that are important/interesting to me?” I didn’t really get a chance to think too much about it before jumping into this assignment. I went to Project Gutenberg and the home page featured Meine Erinnerungen aus Ostafrika by General von Lettow-Vorbeck. German! I can understand that!

I wasn’t so sure about that specific text but I looked at the list of texts in German and I landed on this: Der Bucheinband: Seine Technik und seine Geschichte by Paul Adam, a book about the art and history of book binding from 1890. Pretty cool, I think!

I started by putting the plain text into a .txt file and removing the weird header and footer license stuff that was in English. I counted the words of this book using the code from the Coding Train tutorial.
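
For reference, the basic counting idea looks roughly like this. It is a sketch in plain JavaScript, not the tutorial's actual code: the sample lines are made up (standing in for the array loadStrings() returns), countWords is a hypothetical helper name, and I am assuming the split happens on runs of non-word characters.

```javascript
// Made-up sample lines standing in for what loadStrings() would return.
const lines = ["das Buch und das Leder", "das Leder und der Einband."];

// Count how often each word occurs, using a Map as the associative array.
function countWords(textLines) {
  const wordCounts = new Map();
  const tokens = textLines
    .join(" ")
    .split(/\W+/) // split on runs of non-word characters
    .filter((token) => token.length > 0); // drop empty strings left by punctuation
  for (const token of tokens) {
    const word = token.toLowerCase();
    wordCounts.set(word, (wordCounts.get(word) || 0) + 1);
  }
  return wordCounts;
}

const counts = countWords(lines);

// Turn the Map into [word, count] pairs, most frequent first.
const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
console.log(sorted);
```

I used a Map rather than a plain object so that a word like "constructor" cannot collide with a built-in property name.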

In the initial counting I noticed that a lot of single letters were being counted as words… which was strange. When I compared my token array to the text itself I figured out that the special German characters (ö, ä, ß, etc.) were causing issues with the token splitting: in a JavaScript regex, the non-word class \W treats any letter outside a-z as a non-word character, so words were being cut apart at every umlaut. I changed the split call from splitting on non-word characters to splitting on whitespace, which made the word list more sensible to me.
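
Here is a small sketch of what was going wrong, using a made-up sentence rather than a line from the book. The third variant, which splits on anything that is not a letter using Unicode property escapes, is not something I used in my sketch; it is just one possible alternative, and it needs a reasonably modern JavaScript engine.

```javascript
// Made-up example sentence with umlauts and ß.
const sentence = "Die Größe der Bücher.";

const nonEmpty = (token) => token.length > 0;

// Splitting on non-word characters: \W also matches ö, ß and ü,
// so the words break apart and stray fragments like "e" and "B" appear.
const byNonWord = sentence.split(/\W+/).filter(nonEmpty);

// Splitting on whitespace keeps the words whole,
// but punctuation stays attached to the last word.
const byWhitespace = sentence.split(/\s+/).filter(nonEmpty);

// Possible alternative: split on anything that is not a Unicode letter or mark.
const byLetters = sentence.split(/[^\p{L}\p{M}]+/u).filter(nonEmpty);

console.log(byNonWord);
console.log(byWhitespace);
console.log(byLetters);
```

The whitespace split is the simplest fix, with the trade-off that "Bücher." and "Bücher" are counted as two different words unless the punctuation is stripped separately.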

When I look at the word list now, at the top are words like der, die, das, which all mean “the”. Und is “and”, mit is “with”, in is “in”, zu is “to”: all words that don’t mean much on their own, right? These are considered stop words, which are just the most commonly used words in a language. We all know that pronouns can be really important, but I wanted to challenge myself to remove them from my word count. Maybe then the word count would be more representative of the content of the text.

The internet is amazing! With a quick search, I found a complete list of German stop words on GitHub. I uploaded the “plain” list to my p5 sketch and put all the words into an array. Then, before creating a div for each word and its count, I added a quick check to see whether that word is on the stop word list. My new word list has a lot of great book-words in it, like leather, fold, cover. Some other standout words for me are: weise = way, ganz = quite/all, genau = exactly/precisely, gut = good.
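
The stop word check can be sketched like this in plain JavaScript. The short stopWordList and the counts are made up for illustration (the real list from GitHub is much longer), and I use a Set for the lookup here even though an array with includes() would work just as well for a list this size.

```javascript
// Tiny made-up excerpt of a German stop word list.
const stopWordList = ["der", "die", "das", "und", "mit", "in", "zu"];
const stopWords = new Set(stopWordList);

// Made-up word counts, not the real numbers from the book.
const counts = new Map([
  ["der", 120],
  ["leder", 45],
  ["und", 98],
  ["falz", 30],
  ["die", 110],
]);

// Keep only the words that are not stop words, most frequent first.
const contentWords = [...counts.entries()]
  .filter(([word]) => !stopWords.has(word))
  .sort((a, b) => b[1] - a[1]);

console.log(contentWords);
```

In the p5 sketch, the same check would sit right before each div is created, so stop words simply never make it onto the page.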

This book has really great illustrations and images depicting all the book binding techniques and many beautiful books. Some of my favorites are below:

Since I dabble in illustration too, I thought I could try my hand at visualizing some of the top words from Der Bucheinband: