MIM-GOLD-NER

This Icelandic named entity (NE) corpus, MIM-GOLD-NER, is a version of the MIM-GOLD corpus tagged for NEs. Over 48 thousand NEs are tagged in this corpus of one million tokens, which can be used for training named entity recognizers for Icelandic.

  • Download MIM-GOLD-NER here

About the corpus

The MIM-GOLD-NER corpus was developed at Reykjavik University in 2018–2020, funded by the Strategic Research and Development Programme for Language Technology (LT). Two LT students were in charge of the corpus annotation and of training named entity recognizers using machine learning methods.

A semi-automatic approach was used for annotating the corpus. Lists of Icelandic person names, location names, and company names were compiled and used for extracting and classifying as many named entities as possible. Regular expressions were then used to find certain numerical entities in the corpus. After this automatic pre-processing step, the whole corpus was reviewed manually to correct any errors. The corpus is tagged for eight named entity types:

  • PERSON – names of humans, animals and other beings, real or fictional.
  • LOCATION – names of locations, real or fictional, e.g. buildings, streets and places. Also covers all geographical and geopolitical entities such as cities, countries, counties and regions, as well as planets and other outer space entities.
  • ORGANIZATION – companies and other organizations, public or private, real or fictional. Includes schools, churches, swimming pools, community centers, musical groups and other affiliations.
  • MISCELLANEOUS – proper nouns that don’t belong to the previous three categories, such as products, book and movie titles, and events (wars, sports tournaments, festivals, concerts, etc.).
  • DATE – absolute temporal units of a full day or longer, such as days, months, years, centuries, both written numerically and alphabetically.
  • TIME – absolute temporal units shorter than a full day, such as seconds, minutes, or hours, both written numerically and alphabetically.
  • MONEY – exact monetary amounts in any currency, both written numerically and alphabetically.
  • PERCENT – percentages, both written numerically and alphabetically.
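
The automatic pre-processing step described above (name lists followed by regular expressions) can be illustrated with a minimal Python sketch. This is not the project's actual tooling: the tiny name lists, the percentage pattern, and the function name pre_annotate are all assumptions made for illustration, and a real pipeline would also have to handle multi-token names and Icelandic inflection.

```python
import re

# Hypothetical, tiny gazetteers; the actual name lists are not reproduced here.
PERSON_NAMES = {"Jón", "Guðrún"}
LOCATION_NAMES = {"Reykjavík", "Akureyri"}

# Illustrative pattern for percentages written with digits, e.g. "12%" or "3,5%".
# It assumes the number and the percent sign form a single token.
PERCENT_RE = re.compile(r"\d+(?:[.,]\d+)?%")


def pre_annotate(tokens):
    """Assign a provisional label to each token; "O" means no entity was found."""
    labels = []
    for token in tokens:
        if token in PERSON_NAMES:
            labels.append("PERSON")
        elif token in LOCATION_NAMES:
            labels.append("LOCATION")
        elif PERCENT_RE.fullmatch(token):
            labels.append("PERCENT")
        else:
            labels.append("O")
    return labels


sentence_1 = ["Jón", "býr", "á", "Akureyri", "."]
sentence_2 = ["Verðið", "hækkaði", "um", "12%", "."]
for sentence in (sentence_1, sentence_2):
    print(list(zip(sentence, pre_annotate(sentence))))
```

Output of this kind is only a starting point; as noted above, the whole corpus was reviewed manually after the automatic step.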

MIM-GOLD-NER is intended for training named entity recognizers for Icelandic. It is in the CoNLL format, and the position of each token within an NE is marked using the BIO tagging format. Recognizers can be trained on the entire corpus or on the subsets of text types that best fit the intended domain.
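
As an illustration of the BIO scheme, the following Python sketch reads CoNLL-style lines and groups tagged tokens into entity spans. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences; the exact column layout and tag spellings (such as B-Person) are assumptions here and should be checked against the downloaded files. The function names read_conll and extract_entities are hypothetical.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs; blank lines end a sentence."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        # Assumption: token in the first column, BIO tag in the last column.
        sentence.append((parts[0], parts[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence):
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # an "O" tag closes any open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Made-up sample sentence; tag names are illustrative.
sample = """Jón\tB-Person
Jónsson\tI-Person
starfar\tO
við\tO
Háskólann\tB-Organization
í\tI-Organization
Reykjavík\tI-Organization
.\tO
""".splitlines()

for sentence in read_conll(sample):
    print(extract_entities(sentence))
```

The B- prefix marks the first token of an entity, I- marks a continuation, and O marks tokens outside any entity, so a multi-token name is recovered by joining a B- token with the I- tokens that follow it.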

MIM-GOLD-NER is distributed under the same special user license as MIM-GOLD, which is based on the MIM license, since the texts in MIM-GOLD were sampled from the MIM corpus.


Contact

Hrafn Loftsson
hrafn@ru.is

Svanhvít Lilja Ingólfsdóttir
svanhviti16@ru.is