ENB metadata

Abstract

The Estonian National Bibliography (ENB) is a metadata set that aims to collect information on all publications written in Estonia in any language and all texts written in Estonian in any country. The dataset has been compiled in digital format since 2002 and aggregates the work of multiple institutions and generations in collecting publication information.

The data here was extracted on 27 January 2022.

This dataset presents the Estonian National Bibliography in wide format, instead of the long format used in Marc21, and keeps a selection of variables that may be useful for text-mining studies. It includes the following information:

  • publication ID in National Library
  • time of publication (aeg)
  • place of publication (koht_raw)
  • partly standardized place of publication (koht)
  • publisher (kirjastus_raw)
  • partly standardized publisher information (kirjastus)
  • title (title, subtitle, comptitle)
  • author (name, date of birth, id)
  • other associated authors (translator, editor, etc.)
  • number of copies printed
  • font information
  • keyword
  • genre
  • link to fulltext

Information on coding the variables can be found here: [meta file on github]

Helpful information on the metadata available can be found here: http://data.digar.ee/#page5

The rules followed in adding information on older books can be found here: https://www.elnet.ee/wiki/lib/exe/fetch.php?media=kataloogimine:vanaraamatmarc_2019.pdf

Intro

The current code organizes the ENB metadata into a tidy format, with one publication per line and the basic metadata distributed into a few relevant categories. This is meant to ease further data analysis, particularly for users who are unfamiliar with Marc21 database structures.
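
To illustrate the long-to-wide reshaping described above, here is a minimal Python sketch. It is not the project's own code. The record IDs, the values, the `title` column name, and the `FIELD_TO_COLUMN` mapping are assumptions made for the example; only `koht_raw` and `aeg` are column names taken from this dataset.

```python
from collections import defaultdict

# Hypothetical long-format rows: (record ID, MARC field + subfield, value).
long_rows = [
    ("rec1", "245a", "Example title A"),
    ("rec1", "260a", "Tallinn"),
    ("rec1", "260c", "1912"),
    ("rec2", "245a", "Example title B"),
    ("rec2", "260a", "Tartu"),
]

# Assumed mapping from MARC field/subfield to a wide-format column name.
FIELD_TO_COLUMN = {"245a": "title", "260a": "koht_raw", "260c": "aeg"}


def long_to_wide(rows):
    """Collect all fields of a record into one dict (one publication per row)."""
    wide = defaultdict(dict)
    for record_id, field, value in rows:
        column = FIELD_TO_COLUMN.get(field)
        if column is not None:
            wide[record_id][column] = value
    # Fill absent columns with None so every row has the same shape.
    columns = sorted(set(FIELD_TO_COLUMN.values()))
    return [
        {"id": record_id, **{c: fields.get(c) for c in columns}}
        for record_id, fields in wide.items()
    ]


wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each resulting dict corresponds to one line of the tidy table; a field that a record lacks shows up as `None` rather than as a missing row.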

Summary

The dataset contains information on 272411 printed items in total. Coverage is estimated to be better than 95% of all relevant published works.

The dataset is divided into two parts: works in Estonian and works in other languages. The two sets are displayed here by year of publication, in different colors.

Coverage of metainformation

An overview of the coverage of the data is given below. Grey areas mark records that have a value in the given column; black areas mark records that do not. For example, 82% of the books have information on the publisher, but only 59% have information on the author, and 16% have a link to an online digital copy.

Data available in ENB
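
Coverage shares like the ones quoted above could be computed along the following lines. This is a sketch under assumptions, not the code behind the figure: the rows are made up, and `author` and `fulltext_link` are hypothetical column names (`kirjastus` is the publisher column from the variable list).

```python
# Hypothetical wide-format rows; None marks a missing value.
rows = [
    {"kirjastus": "Publisher A", "author": "Author A", "fulltext_link": None},
    {"kirjastus": "Publisher B", "author": None, "fulltext_link": None},
    {"kirjastus": None, "author": "Author B", "fulltext_link": "https://example.org/1"},
    {"kirjastus": "Publisher A", "author": None, "fulltext_link": None},
]


def coverage(rows, columns):
    """Return, per column, the share of rows with a non-empty value."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for r in rows if r.get(col) not in (None, "")) / total
        for col in columns
    }


shares = coverage(rows, ["kirjastus", "author", "fulltext_link"])
for col, share in shares.items():
    print(f"{col}: {share:.0%}")
```

Empty strings are counted as missing here; whether that matches how the published figure treats blank fields is an assumption.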

Cities

The names of the cities of publication have been harmonized manually and with a few heuristic algorithms. Tokens that appear more than 40 times should be mostly harmonized, while rarer tokens have been handled only by algorithmic processing. Depicted below are the most common cities of publication and the number of publications in each, separately for the two language sets.
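
The two-step approach described above (heuristic clean-up for everything, a manual table for frequent variants) might look roughly like this in Python. The actual heuristics and lookup table are not documented here, so `clean`, `MANUAL_MAP`, and the sample tokens are hypothetical stand-ins; only the threshold of 40 comes from the text.

```python
import re
from collections import Counter

# Hypothetical manual lookup table for frequent variants (keys are lower case).
MANUAL_MAP = {
    "tallinnas": "Tallinn",
    "reval": "Tallinn",
    "dorpat": "Tartu",
    "tartus": "Tartu",
}


def clean(token):
    """Heuristic clean-up: drop brackets and question marks, trim punctuation."""
    token = re.sub(r"[\[\]()?]", "", token)
    token = re.sub(r"\s+", " ", token)
    return token.strip(" .,:;")


def harmonize(token):
    """Apply the clean-up, then the manual table if the variant is listed."""
    cleaned = clean(token)
    return MANUAL_MAP.get(cleaned.lower(), cleaned)


def frequent_tokens(tokens, min_count=40):
    """Cleaned tokens seen more than min_count times: candidates for manual review."""
    counts = Counter(clean(t).lower() for t in tokens)
    return sorted(t for t, n in counts.items() if n > min_count)


raw = ["Tallinn", "[Tallinn]", "Tallinnas :", "Reval", "Dorpat", "Tartu."]
counts = Counter(harmonize(t) for t in raw)
print(counts.most_common())
```

With this split, rare tokens still benefit from the clean-up step even when nobody has added them to the manual table, which mirrors the situation described for tokens below the threshold.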

Authors

Publishers

Publisher names have also been harmonized manually and algorithmically, with a focus on the more frequent names.

Topics

The bibliography assigns general topics to publications. Some works have no topic, and many have several. The most common topics in the set are shown here by 20-year intervals. For visualization purposes, only the first 10 characters of each topic are shown.
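
The grouping described above can be sketched in Python as follows. The records, the topic labels, and the `topics` field name are invented for the example, and the alignment of intervals to multiples of 20 is an assumption; `aeg` is the year column from the variable list.

```python
from collections import Counter, defaultdict

# Hypothetical records: publication year (aeg) and a list of topic labels.
records = [
    {"aeg": 1905, "topics": ["theological works", "schoolbooks"]},
    {"aeg": 1912, "topics": ["theological works"]},
    {"aeg": 1925, "topics": ["schoolbooks", "agriculture"]},
    {"aeg": 1938, "topics": ["agriculture"]},
    {"aeg": None, "topics": ["agriculture"]},  # no year: skipped
    {"aeg": 1930, "topics": []},               # no topic marking
]


def interval_start(year, width=20):
    """First year of the interval containing `year` (e.g. 1912 -> 1900)."""
    return year - year % width


def top_topics(records, width=20, n=3, label_len=10):
    """Most common topics per interval, with labels cut to label_len characters."""
    per_interval = defaultdict(Counter)
    for rec in records:
        if rec["aeg"] is None:
            continue
        # A work with several topics counts once for each of them.
        per_interval[interval_start(rec["aeg"], width)].update(rec["topics"])
    return {
        start: [(topic[:label_len], count) for topic, count in counter.most_common(n)]
        for start, counter in sorted(per_interval.items())
    }


result = top_topics(records, n=1)
for start, topics in result.items():
    print(f"{start}-{start + 19}: {topics}")
```

Truncating only at display time, after counting, keeps two different topics that share their first 10 characters from being merged into one count.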