This Icelandic named entity (NE) corpus, MIM-GOLD-NER, is a version of the MIM-GOLD corpus tagged for NEs. Over 48 thousand NEs are tagged in this corpus of one million tokens, which can be used for training named entity recognizers for Icelandic.
The MIM-GOLD-NER corpus was developed at Reykjavik University in 2018–2020, funded by the Strategic Research and Development Programme for Language Technology (LT). Two LT students were in charge of the corpus annotation and of training named entity recognizers using machine learning methods.
A semi-automatic approach was used for annotating the corpus. Lists of Icelandic person names, location names, and company names were compiled and used for extracting and classifying as many named entities as possible. Regular expressions were then used to find certain numerical entities in the corpus. After this automatic pre-processing step, the whole corpus was reviewed manually to correct any errors. The corpus is tagged for eight named entity types:
MIM-GOLD-NER is intended for training of named entity recognizers for Icelandic. It is in the CoNLL format, and the position of each token within the NE is marked using the BIO tagging format. The corpus can be used in its entirety or by training on subsets of the text types that best fit the intended domain.
The Named Entity Corpus corpus is distributed with the same special user license as MIM-GOLD, which is based on the MIM license since the texts in MIM-GOLD were sampled from the MIM corpus.
Hrafn Loftsson
hrafn@ru.is
Svanhvít Lilja Ingólfsdóttir
svanhviti16@ru.is