I gave a workshop at the Seminar of Digital Archives on the approach we are building at the National Library of Estonia for accessing text collections. The workshop was preceded by a day of interesting talks on the topic at the Digital Humanities and Digital Archives.
The current workflow relies on Jupyter Notebooks running on the same servers as the text files, plus a few custom R commands that retrieve the texts in a clean format. The custom R commands are available as a package on GitHub. The Jupyter Notebooks require a username on the servers: we used temporary usernames at the workshop, and more permanent ones are available on request.
The first tests look promising: texts can be retrieved fairly quickly and analysed with common tools.
The workshop materials were posted on HackMD. Some of the graphs we made:
The words that distinguish texts mentioning Georg Lurich from texts mentioning Konrad Mägi, 1886-1940.
The proportion of texts in Postimees 1886-1940 containing words related to steam, electricity, and horses.
Some selected examples of words containing “elekt” in Estonian and their distribution over time.
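The proportion graph above comes down to flagging which texts contain a given word stem and averaging those flags per year. Here is a minimal sketch of that idea in base R. It runs on a small invented data frame instead of the real collection, the stems are only illustrative, and it does not use the commands from our package, so treat every name in it as hypothetical.

```r
# Toy stand-in for a retrieved collection: one row per text.
# The years and texts are invented for illustration only.
texts <- data.frame(
  year = c(1890, 1890, 1910, 1910, 1910, 1930, 1930),
  text = c(
    "hobune vedas vankrit",
    "aurulaev saabus sadamasse",
    "elekter tuli linna",
    "hobune ja vanker",
    "uus elektrijaam avati",
    "elektrivalgus igas majas",
    "aurumasin ja elekter"
  ),
  stringsAsFactors = FALSE
)

# Illustrative word stems (assumption: not the exact word lists behind the graph).
stems <- c(steam = "auru", electricity = "elekt", horses = "hobu")

# For each stem, flag the texts that contain it, then take the mean of the
# flags within each year. The mean of a logical vector is the share of TRUEs,
# so the result is the proportion of that year's texts mentioning the topic.
proportions <- sapply(stems, function(stem) {
  hit <- grepl(stem, tolower(texts$text), fixed = TRUE)
  tapply(hit, texts$year, mean)
})

# Rows are years, columns are topics.
print(round(proportions, 2))
```

Matching on stems with `grepl` is a crude substitute for proper lemmatisation, but for a heavily inflected language like Estonian it is often enough to see a trend over time.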
Session A: Using the texts in DIGAR, the digital archive of the National Library of Estonia (hands-on workshop). What can be done with digital texts? Access to the Estonian articles in DIGAR through open-source code. Simple text processing in R and its results. We use RStudio; instructions for installing it will be sent to registered participants. The workshop does not assume any prior programming skills. Even so, we will write code ourselves during the workshop.