As part of the course Digital Methods and Contemporary Datasets, we learned to use R and the Tidyverse for simple text processing, in this case on longer texts. We looked at Tõde ja õigus, a canonical Estonian novel by A. H. Tammsaare, and used tidytext to count and locate tokens, find n-grams and content words, and distinguish keywords. We also used a set of Estonian fiction texts from an open collection to study word usage across different works.
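One common tidytext approach to distinguishing keywords across works is tf-idf; whether the original scripts used it is not stated here, so the following is only a minimal sketch. It assumes a data frame `texts` with one row per line and columns `work` and `text` (both names are illustrative, not from the original code).

```r
library(dplyr)
library(tidytext)

# Hypothetical input: `texts`, one row per line of a work,
# with columns `work` (title) and `text` (the line itself).
keywords <- texts %>%
  unnest_tokens(word, text) %>%    # tokenize to one word per row
  count(work, word) %>%            # raw counts per work
  bind_tf_idf(word, work, n) %>%   # high tf-idf marks work-specific words
  arrange(desc(tf_idf))
```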
The materials are posted on GitHub. The graphs we made are described below.
Five-grams were calculated with the `unnest_tokens()` function.
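A minimal sketch of that call, assuming the novel is loaded as a data frame `novel` with a `text` column (the actual script is in the GitHub repository):

```r
library(dplyr)
library(tidytext)

# Extract five-word sequences; token = "ngrams" with n = 5 gives five-grams.
fivegrams <- novel %>%
  unnest_tokens(fivegram, text, token = "ngrams", n = 5) %>%
  count(fivegram, sort = TRUE)   # most frequent five-grams first
```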
Words were lowercased, and proper names were kept in. The comparison was made across all 39 chapters.
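Something like the following would reproduce those settings; `unnest_tokens()` lowercases by default (`to_lower = TRUE`), and since no name list is filtered out, proper names stay in the counts. The `chapter` column is an assumption about how the text was stored.

```r
# Tokens per chapter; lowercasing is the unnest_tokens() default,
# and nothing is filtered, so proper names remain.
chapter_words <- novel %>%
  unnest_tokens(word, text) %>%
  count(chapter, word, sort = TRUE)
```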
The frequency of forest-, land-, and city-related words was calculated as a proportion of all tokens, matching words against the regular expressions `^mets`, `^maa`, and `^linn`.
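A sketch of that calculation, again under the assumed `novel`/`chapter` structure; taking `mean()` over the logical vector from `str_detect()` gives the proportion of matching tokens directly.

```r
library(stringr)

# Share of tokens per chapter starting with mets- (forest),
# maa- (land), or linn- (city).
theme_freq <- novel %>%
  unnest_tokens(word, text) %>%
  group_by(chapter) %>%
  summarise(
    mets = mean(str_detect(word, "^mets")),
    maa  = mean(str_detect(word, "^maa")),
    linn = mean(str_detect(word, "^linn"))
  )
```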