As part of the course Digital Methods and Contemporary Datasets, we learned to use R and the Tidyverse for simple text processing, in this case on longer texts. We looked at Tõde ja õigus, a canonical Estonian novel by A. H. Tammsaare, and used tidytext to count and locate tokens, find n-grams and content words, and distinguish keywords. We also used a set of Estonian fiction texts from an open collection to study word usage across different works.
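One common tidytext approach to distinguishing keywords across works is tf-idf; whether the original scripts used it is not stated here, so the following is only a minimal sketch. It assumes a data frame `texts` with one row per line and columns `work` and `text` (both names are illustrative, not from the original code).

```r
library(dplyr)
library(tidytext)

# Hypothetical input: `texts`, one row per line of a work,
# with columns `work` (title) and `text` (the line itself).
keywords <- texts %>%
  unnest_tokens(word, text) %>%    # tokenize to one word per row
  count(work, word) %>%            # raw counts per work
  bind_tf_idf(word, work, n) %>%   # high tf-idf marks work-specific words
  arrange(desc(tf_idf))
```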
The materials are posted on GitHub. The graphs we made are described below.
Five-grams were calculated with the `unnest_tokens()` function.
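A minimal sketch of that call, assuming the novel is loaded as a data frame `novel` with a `text` column (the actual script is in the GitHub repository):

```r
library(dplyr)
library(tidytext)

# Extract five-word sequences; token = "ngrams" with n = 5 gives five-grams.
fivegrams <- novel %>%
  unnest_tokens(fivegram, text, token = "ngrams", n = 5) %>%
  count(fivegram, sort = TRUE)   # most frequent five-grams first
```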
Words were lowercased, and proper names were kept in. The comparison was made across all 39 chapters.
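Something like the following would reproduce those settings; `unnest_tokens()` lowercases by default (`to_lower = TRUE`), and since no name list is filtered out, proper names stay in the counts. The `chapter` column is an assumption about how the text was stored.

```r
# Tokens per chapter; lowercasing is the unnest_tokens() default,
# and nothing is filtered, so proper names remain.
chapter_words <- novel %>%
  unnest_tokens(word, text) %>%
  count(chapter, word, sort = TRUE)
```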
The frequency of forest-, land-, and city-related words was calculated as a proportion of all tokens, matching words against the regular expressions `^mets`, `^maa`, and `^linn`.
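A sketch of that calculation, again under the assumed `novel`/`chapter` structure; taking `mean()` over the logical vector from `str_detect()` gives the proportion of matching tokens directly.

```r
library(stringr)

# Share of tokens per chapter starting with mets- (forest),
# maa- (land), or linn- (city).
theme_freq <- novel %>%
  unnest_tokens(word, text) %>%
  group_by(chapter) %>%
  summarise(
    mets = mean(str_detect(word, "^mets")),
    maa  = mean(str_detect(word, "^maa")),
    linn = mean(str_detect(word, "^linn"))
  )
```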