The Estonian National Bibliography (ENB) is a metadata set that aims to collect information on all publications produced in Estonia in any language and on all texts written in Estonian, regardless of the country of publication. The dataset has been compiled in digital format since 2002 and aggregates the work of multiple institutions and generations in collecting publication information.
The data here was extracted on 27 January 2022.
This dataset presents the Estonian National Bibliography in a wide format rather than the long format used in MARC 21, and keeps some of the variables that may be useful for text-mining studies. It includes the following information:
Information on coding the variables can be found here: [meta file on github]
Helpful information on the metadata available can be found here: http://data.digar.ee/#page5
The rules followed in adding information on older books can be found here: https://www.elnet.ee/wiki/lib/exe/fetch.php?media=kataloogimine:vanaraamatmarc_2019.pdf
The current code organizes the ENB metadata into a tidy format, with one publication per line and the basic metadata distributed into a few relevant categories. This has been done to facilitate further data analysis, particularly for users who may be unfamiliar with MARC 21 database structures.
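The reshaping described above can be illustrated with a minimal sketch. It is not the code used to build this dataset: the field names (`title`, `year`, `place`, `topic`), the sample records, and the `long_to_wide` helper are all hypothetical, and real MARC 21 records use numeric tags and subfields rather than these labels.

```python
from collections import defaultdict

# Hypothetical long-format rows: (record_id, field, value).
# Field names and values are illustrative, not actual ENB columns or MARC 21 tags.
long_rows = [
    ("rec1", "title", "Title A"),
    ("rec1", "year", "1905"),
    ("rec1", "place", "Tartu"),
    ("rec2", "title", "Title B"),
    ("rec2", "topic", "poetry"),
    ("rec2", "topic", "folklore"),
]


def long_to_wide(rows, sep="; "):
    """Collapse (id, field, value) rows into one dict per record."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record_id, field, value in rows:
        grouped[record_id][field].append(value)

    # Every field seen anywhere becomes a column in the wide table.
    columns = sorted({field for _, field, _ in rows})

    wide = []
    for record_id, fields in grouped.items():
        row = {"id": record_id}
        for column in columns:
            # Repeated fields are joined into one cell; absent fields become None.
            row[column] = sep.join(fields[column]) if column in fields else None
        wide.append(row)
    return wide


wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Joining repeated fields with a separator keeps one publication per line, at the cost of having to split those cells again when a field such as the topic is analysed on its own.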
In total, the dataset contains information on 272,411 printed items. Coverage is estimated to be better than 95% of all relevant published works.
The dataset is divided into two parts: works in Estonian and works in other languages. The two sets are shown here by year of publication, in different colors.
An overview of the coverage of the data is given below. Grey areas indicate records that have a value in the given column; black areas indicate records that do not. For example, 82% of the books have information on the publisher, but only 59% have information on the author, and 16% have a link to an online digital copy.
Data available in ENB
The names of the cities where the works were published have been harmonized manually and through a few heuristic algorithms. Tokens that appear more than 40 times should be mostly harmonized, while rarer tokens have been handled only through algorithmic processing. Shown below are the most common cities of publication and the number of publications in each, separately for the two language sets.
Publisher names have also been harmonized manually and algorithmically, with a focus on the more frequent names.
The bibliography marks each publication with some general topics. Some works have no topic marking, while many have several. The most common topics in the set are shown here by 20-year intervals. For visualization purposes, only the first 10 characters of each topic are shown.
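One way to combine a manual mapping with simple heuristics is sketched below. The actual heuristics used for this dataset are not described here, so the `normalize` rules, the `MANUAL_MAP` entries, and the `harmonize` helper are assumptions made for illustration; only the frequency cut-off of 40 comes from the description above.

```python
import re
from collections import Counter

# Hypothetical manual mapping from normalized tokens to a canonical city name.
MANUAL_MAP = {"reval": "Tallinn", "dorpat": "Tartu"}


def normalize(token):
    """Heuristic clean-up: lowercase, drop cataloguing punctuation, squeeze spaces."""
    token = token.strip().lower()
    token = re.sub(r"[\[\]\.,:;?]", "", token)
    return re.sub(r"\s+", " ", token).strip()


def harmonize(places, manual_map, min_count=40):
    """Return harmonized names plus frequent tokens not covered by the manual map."""
    normalized = [normalize(place) for place in places]
    counts = Counter(normalized)

    harmonized = [manual_map.get(token, token.title()) for token in normalized]

    # Tokens seen more than min_count times without a manual entry are
    # candidates for manual review.
    needs_review = sorted(
        token
        for token, n in counts.items()
        if n > min_count and token not in manual_map
    )
    return harmonized, needs_review


# Small threshold so the toy input produces a review candidate.
raw_places = ["Tallinn", "[Tallinn]", "Reval :", "Dorpat", "tartu."]
names, review = harmonize(raw_places, MANUAL_MAP, min_count=1)
print(names)
print(review)
```

Rare tokens pass through the heuristic clean-up only, which mirrors the split between manually checked frequent names and algorithmically processed rare ones.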
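The topic summary described above can be approximated as follows. This is a sketch under assumptions: the records are taken to be `(year, list_of_topics)` pairs, intervals are aligned to multiples of 20 (for example 1900 to 1919), and the sample topics and the `top_topics_by_interval` helper are hypothetical.

```python
from collections import Counter, defaultdict


def top_topics_by_interval(records, interval=20, top_n=3, label_len=10):
    """Count topics per interval and return the most common ones with short labels."""
    counts = defaultdict(Counter)
    for year, topics in records:
        start = (year // interval) * interval  # e.g. 1912 -> 1900
        for topic in topics:
            counts[start][topic] += 1

    # Count on full topic names; truncate only the labels used for display.
    return {
        start: [(topic[:label_len], n) for topic, n in counts[start].most_common(top_n)]
        for start in sorted(counts)
    }


# Hypothetical records: a work may have no topic or several.
sample_records = [
    (1905, ["religious literature", "sermons"]),
    (1912, ["religious literature"]),
    (1925, []),
    (1931, ["agriculture"]),
    (1938, ["agriculture", "textbooks"]),
]

summary = top_topics_by_interval(sample_records)
for start, topics in summary.items():
    print(start, topics)
```

Counting before truncating avoids merging distinct topics that happen to share their first 10 characters.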