MIM-GOLD

MIM-GOLD is a corpus containing one million words of text. The texts were tagged automatically and the tags were then manually corrected. The texts in the MIM-GOLD corpus were sampled from the texts of the MIM corpus, so the use of MIM-GOLD is governed by a special license based on the MIM license. The MIM-GOLD corpus is intended as a gold standard for the training of data-driven PoS taggers. MIM-GOLD was a joint project between The Árni Magnússon Institute for Icelandic Studies, Reykjavík University and the University of Iceland.

When publishing results based on the texts in MIM-GOLD, please refer to: Hrafn Loftsson, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. Sarasola, Kepa, Francis M. Tyers and Mikel L. Forcada (eds.): 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, pp. 53-60. Valletta, Malta. Further information about the project can be found in Sigrún Helgadóttir et al. (2014) and Steinþór Steingrímsson et al. (2015).

About the MIM-GOLD corpus

The Tagged Icelandic Corpus (MÍM) was published in 2013. The corpus contains about 25 million running words of texts written during the first decade of the 21st century. 

While MÍM was being compiled, about one million tokens were sampled from 13 of the 23 domains of MÍM. The new corpus was intended to replace the corpus of the Icelandic Frequency Dictionary (IFD) as a gold standard for the training of data-driven taggers for Icelandic.

In 2013 version 0.9 of MIM-GOLD was published. Version 1.0 was published in 2018. The development of MIM-GOLD is described below. The process is divided into 5 phases, numbered 0 to 4. 

Phase 0 

Work on MIM-GOLD commenced in the summer of 2009 when a grant was secured from the Student Innovation Fund to hire a student to start the project. The texts were sampled at the Árni Magnússon Institute for Icelandic Studies, and the student, under the supervision of Hrafn Loftsson at Reykjavík University, developed a system for tagging the texts. The texts were tokenized with a tokenizer that is a part of the IceNLP system. The texts were then tagged with five taggers: fnTBL, MXPOST, IceTagger, Bidir and TnT (Loftsson et al., 2010). The tool CombiTagger was then used to vote between the proposed tags. The voting method chooses the tag that most taggers suggested. The taggers were trained on the corpus of the Icelandic Frequency Dictionary (IFD). The tagset of the IFD was therefore used.
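The voting step can be illustrated with a minimal Python sketch. It is not CombiTagger's actual implementation: the function names are hypothetical, the tags in the usage note are only illustrative, and the tie-breaking rule (prefer the tag of the earliest-listed tagger) is an assumption, since the text above does not say how ties are resolved.

```python
from collections import Counter


def vote(proposed_tags):
    """Return the tag proposed by the largest number of taggers.

    Assumption of this sketch: ties go to the tag that appears first
    in the list. The source text does not document tie-breaking.
    """
    counts = Counter(proposed_tags)
    best = max(counts.values())
    for tag in proposed_tags:
        if counts[tag] == best:
            return tag


def combine(tagger_outputs):
    """Combine aligned per-token tag sequences, one sequence per tagger."""
    return [vote(list(tags)) for tags in zip(*tagger_outputs)]
```

With five taggers, `vote(["nken", "nkeo", "nken", "nken", "nkeo"])` selects `nken`, the tag proposed by three of the five.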

During the winter of 2009-2010, a search for systematic errors in the MIM-GOLD corpus was performed. Noun phrase (NP), prepositional phrase (PP) and verb phrase (VP) error detection programs described by Loftsson (2009) were used. A large proportion of the errors detected were checked manually and corrected. Tagging accuracy was then estimated by inspecting every 100th word. A tag counts as correct only if the whole tag string (consisting of up to 6 characters) is correct. Mean tagging accuracy was estimated as 92.3%, ranging between 87.6% and 95.5% depending on text domain (Loftsson et al., 2010). This part of the project also received a contribution from a grant from the Icelandic Research Fund.
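The estimation procedure can be sketched as follows. This is a minimal illustration, not the project's own tooling: the function names are hypothetical, and the sketch assumes that sampling starts at the 100th token and that a human has supplied a verified tag for each sampled token.

```python
def sample_every_nth(items, n=100):
    """Return the n-th, 2n-th, 3n-th ... item of a sequence."""
    return items[n - 1::n]


def estimate_accuracy(sampled_tags, verified_tags):
    """Fraction of sampled tags that match the verified tag exactly.

    The whole tag string must match; a tag that differs in a single
    character counts as an error.
    """
    if len(sampled_tags) != len(verified_tags):
        raise ValueError("tag lists must have the same length")
    if not sampled_tags:
        raise ValueError("cannot estimate accuracy from an empty sample")
    correct = sum(1 for s, v in zip(sampled_tags, verified_tags) if s == v)
    return correct / len(sampled_tags)
```

Because only about 1% of the tokens are inspected, the result is an estimate for each text domain rather than an exact count of the remaining errors.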

Phase 1 

During the summer of 2010 another grant was secured from the Student Innovation Fund to employ a student to manually check and correct the tags of all the words in MIM-GOLD. The first job was to finish checking errors found during Phase 0 that had not been corrected (texts from Morgunblaðið). Work on checking texts from printed books was also started. The student was then hired part-time during term, and during 2010-2011 all the words in MIM-GOLD were manually checked and corrected. Version 0.9 of MIM-GOLD, which was made available on this website in 2013, contains the files after this correction phase. Mean accuracy was estimated as before by inspecting the tag for every 100th word. Mean accuracy was estimated as 96.4%, ranging between 89.9% and 98.5% depending on text domain (Helgadóttir et al., 2014). The project also received a contribution from META-NORD and the Ministry of Education, Science and Culture.

Phase 2 

The next correction phase started at the end of 2012. The corpus was first tagged automatically with the tagger IceTagger, which is a part of the IceNLP software. A script was written that compares the tags output by IceTagger with the (presumed) correct tags in the corpus. If a difference was found, the discrepancy was marked as an error candidate. A second student was employed during the summer of 2013 and part-time after that to manually inspect the error candidates. For each error candidate, the student was instructed to i) select the tag in the corpus; or ii) select the tag proposed by IceTagger; or iii) select a new correct tag when neither IceTagger nor the corpus contained the correct tag. After about 80% of the texts had been checked and corrected, tagging accuracy was estimated as 99.6%, ranging between 99.5% and 100.0% depending on text domain (Helgadóttir et al., 2014). One more student was employed in late 2013 to finish checking and correcting the tags. That work was finished in 2014. Tagging accuracy was not estimated at the end of this phase. This part of the project was supported in part by META-NORD and the Ministry of Education, Science and Culture.
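The comparison and the three-way decision can be sketched in Python. The original script is not reproduced here; the function names, the tuple layout and the choice labels are hypothetical, and the sketch assumes the corpus tags and the IceTagger tags are aligned token by token.

```python
def find_error_candidates(words, corpus_tags, tagger_tags):
    """List every position where the corpus tag and the tagger tag differ."""
    candidates = []
    for index, (word, gold, auto) in enumerate(zip(words, corpus_tags, tagger_tags)):
        if gold != auto:
            candidates.append((index, word, gold, auto))
    return candidates


def resolve(candidate, choice, new_tag=None):
    """Return the tag to keep for one candidate.

    choice is "corpus" (option i), "tagger" (option ii) or "new"
    (option iii, in which case new_tag must be given).
    """
    _, _, corpus_tag, tagger_tag = candidate
    if choice == "corpus":
        return corpus_tag
    if choice == "tagger":
        return tagger_tag
    if choice == "new" and new_tag is not None:
        return new_tag
    raise ValueError("unknown choice or missing new tag")
```

Only the positions returned by `find_error_candidates` need manual inspection, which is what makes this phase cheaper than rechecking every token.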

Phase 3 

Steinþór Steingrímsson, Sigrún Helgadóttir and Eiríkur Rögnvaldsson experimented in 2015 with training the tagger Stagger (Östling, 2013) on the IFD and MIM-GOLD (Steingrímsson et al., 2015). Hrafn Loftsson and Robert Östling had experimented in 2013 with developing a tagger for Icelandic by training and testing Stagger on the IFD and obtained 93.84% accuracy (Loftsson and Östling, 2013). Since this was the best result obtained so far in tagging Icelandic text, it was decided to test Stagger on MIM-GOLD. Comparing the accuracy obtained when training and testing Stagger on the IFD and on MIM-GOLD made it clear that there were still a number of inconsistencies and incorrect tags in MIM-GOLD (Steingrímsson et al., 2015). For the experiment, a version of MIM-GOLD after the completion of Phase 2 was used. The experiment with training and testing Stagger on the IFD reported by Loftsson and Östling (2013) was repeated for MIM-GOLD by using linguistic features (LF) and the unknown word guesser IceMorphy (part of the IceNLP software). An extended lexicon based on the Database of Icelandic Inflection (BÍN) was added. By applying ten-fold cross-validation, 92.76% accuracy was obtained for MIM-GOLD. As a result of this outcome, it was decided to work further on reducing the number of errors and inconsistencies in MIM-GOLD. Lists of inconsistencies and errors were made and students were employed to check them manually. The tagset was also modified slightly. Work on this phase was completed in 2017. This part of the project was funded by the Institute of Linguistics at the University of Iceland and the Icelandic Ministry of Education, Science and Culture.
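Ten-fold cross-validation, as used in the experiment above, can be sketched as follows. This is a generic illustration and not the setup used with Stagger: `train_and_tag` is a hypothetical callable standing in for any tagger, the round-robin assignment of sentences to folds is an assumption, and accuracy is pooled over all tokens, which may differ from how the published figure was averaged.

```python
def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs; each sentence is tested exactly once."""
    folds = [sentences[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


def cross_validated_accuracy(sentences, train_and_tag, k=10):
    """Token accuracy pooled over all k test folds.

    sentences: list of sentences, each a list of (word, tag) pairs.
    train_and_tag(train, test_words): hypothetical callable that trains
    on `train` and returns one list of tags per test sentence.
    """
    correct = total = 0
    for train, test in k_fold_splits(sentences, k):
        test_words = [[word for word, _ in sent] for sent in test]
        predicted = train_and_tag(train, test_words)
        for sent, tags in zip(test, predicted):
            for (_, gold), guess in zip(sent, tags):
                correct += int(gold == guess)
                total += 1
    return correct / total
```

Each model is always evaluated on sentences it has not seen during training, so the pooled figure reflects how the tagger generalizes and how consistent the gold tags are.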

Phase 4 

Starkaður Barkarson obtained the MIM-GOLD data after Phase 3 was completed and trained Stagger on the texts (Barkarson, 2018). Tagging accuracy was not estimated after Phase 3 by inspecting a sample, as had been done after previous correction phases. Starkaður repeated the experiment performed by Steinþór Steingrímsson, Sigrún Helgadóttir and Eiríkur Rögnvaldsson in 2015. He performed a comparable ten-fold cross-validation on MIM-GOLD and obtained 92.74% accuracy.

Despite the corrections made to MIM-GOLD, tagging accuracy did not seem to increase. To make sure that the experiments were completely comparable, the experiment performed by Steinþór Steingrímsson and his colleagues (Steingrímsson et al., 2015) was repeated as far as possible. The same version of MIM-GOLD (before Phase 3) and the same division into training and testing sets were used. The data from the Database of Icelandic Inflection (BÍN) were not completely comparable, since a later version was now used. Starkaður obtained 92.41% accuracy by using BÍN and IceMorphy, as compared to 92.76% in the experiment performed by Steinþór Steingrímsson and colleagues. Since the same setup gave 92.74% on the corrected corpus, Starkaður claims that the corrections made to MIM-GOLD resulted in an increase in accuracy of about 0.3 percentage points. He believes that the reason for the difference between his result and the earlier one on the uncorrected corpus may be found in the set of words and word endings that was available to IceMorphy, since there is a large difference in accuracy of unknown words (just under 15%) but a small difference in accuracy of known words (0.09%) (Barkarson, 2017).

Modified tagset 

To simplify grammatical analysis and reduce inconsistencies in tagging, the tagset of the IFD was slightly modified during the correction phases of MIM-GOLD. The following changes were made:

  • Foreign names were originally tagged as proper nouns. During Phase 3 they were tagged as foreign words (e) (Steingrímsson et al., 2015).
  • In the IFD, function words preceding að were classified as adverbs (aa). From Phase 2 onwards, on the other hand, they are classified as prepositions if they are followed by a complement clause. Thus, the word til in the sentence „Hann hljóp til að komast fyrr heim“ is classified as a preposition governing genitive case (ae) (Helgadóttir et al., 2014; Steingrímsson et al., 2015; Barkarson, 2017).
  • Further classification of proper nouns ended during Phase 3. Tags of all proper nouns now end in -s, instead of -m (person names), -ö (place names) and -s (other proper nouns). This reduces the number of tags by 68 (Steingrímsson et al., 2015).
  • During Phase 3, the tag v was adopted for e-mail addresses and web addresses (Steingrímsson et al., 2015).
  • During Phase 3, the tag as was adopted for abbreviations. In the IFD tagset, abbreviations were broken up into individual words and each letter tagged as the word it stood for (Steingrímsson et al., 2015).
  • During Phase 3, it was decided that all number constants that were tagged as cardinals (tf...) should be given the tag ta and not analyzed further according to gender, number and case, as is done when numbers are written with alphabetic characters (Steingrímsson et al., 2015).
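Two of the changes listed above are mechanical enough to sketch as a tag-rewriting function. This is only an illustration under stated assumptions, not the procedure used in the project: the function name is hypothetical, proper-noun tags are assumed to start with n and end in a hyphen plus a subclass letter (with -ö for place names, as in the IFD tagset), and a "number constant" is assumed to be any token containing a digit. The example tags in the usage note are illustrative.

```python
def modernize_tag(word, tag):
    """Rewrite an IFD-style tag according to two of the Phase 3 changes."""
    # Proper nouns: drop the person/place distinction, keep only -s.
    # Assumption: such tags start with "n" and end in "-m" or "-ö".
    if tag.startswith("n") and tag.endswith(("-m", "-ö")):
        return tag[:-1] + "s"
    # Number constants tagged as cardinals (tf...) get the plain tag "ta".
    # Assumption: a number constant is a token that contains a digit.
    if tag.startswith("tf") and any(ch.isdigit() for ch in word):
        return "ta"
    # Everything else is left unchanged.
    return tag
```

For example, `modernize_tag("12", "tfkfn")` gives `ta`, while a cardinal written out in letters keeps its full tag. The other changes (foreign names, abbreviations, function words before að) depend on context or manual judgement and are not covered by this sketch.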

In his Master's dissertation, Starkaður Barkarson (2018) discusses the effect of analysing foreign names with the tag e and the need to simplify the analysis of punctuation signs. 

Version 0.9 

Version 0.9 of MIM-GOLD was released in 2013 with 13 files containing the corrections performed during Phase 1. Mean accuracy was estimated as 96.4%, ranging between 89.9% and 98.5% depending on text domain. The text files are in Linux format and encoded in UTF-8. The files contain one token per line: each line consists of the word, followed by a tab and then the tag. Sentences are separated by empty lines.

Version 1.0 

Version 1.0 of MIM-GOLD, released in 2018, consists of 13 files that include the tag corrections made up to 2017 and use the modified tagset described above. The texts are comparable to the texts in version 0.9 apart from corrections of tokenization and corrections of tags. The files contain one token per line: each line consists of the word, followed by a tab and then the tag. Sentences are separated by empty lines.
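A file in this format (one word and tag per line, separated by a tab, with empty lines between sentences) can be read with a few lines of Python. The sketch below is a minimal reader with hypothetical function names; it assumes UTF-8 encoding, as stated for version 0.9, and the path passed to `read_corpus_file` is whatever file the user has obtained under the license.

```python
def read_sentences(lines):
    """Parse word<TAB>tag lines into a list of sentences.

    Each sentence is a list of (word, tag) pairs; an empty line ends
    the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        # A line without a tab raises ValueError, flagging malformed input.
        word, tag = line.split("\t", 1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


def read_corpus_file(path):
    """Read one corpus file, assuming UTF-8 encoding."""
    with open(path, encoding="utf-8") as handle:
        return read_sentences(handle)
```

The resulting list of sentences is a convenient input format for training or evaluating a tagger.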

The corpus is distributed with a special user license which is based on the MIM license, since the texts in MIM-GOLD were sampled from the MIM corpus.

The people behind the project 

  • Hrafn Loftsson
  • Eiríkur Rögnvaldsson
  • Sigrún Helgadóttir
  • Jökull H. Yngvason
  • Kristján Friðbjörn Sigurðsson
  • Steinunn Valbjörnsdóttir
  • Brynhildur Stefánsdóttir
  • Jón Friðrik Daðason
  • Starkaður Barkarson

References