Tagged Icelandic Corpus

The Tagged Icelandic Corpus (MÍM) is a morphosyntactically tagged corpus of Icelandic consisting of about 25 million tokens of contemporary Icelandic texts collected from varied sources during the years 2006-2010. The corpus is intended for use in Language Technology projects and for linguistic research. The corpus is available for search through a web interface and for download in TEI-conformant XML format. Each text in the corpus is accompanied by metadata.

The texts in MÍM are available for use in two different ways: for search through a web interface and for download in TEI-conformant XML format.

When publishing results based on the texts in the Tagged Icelandic Corpus, please refer to: Sigrún Helgadóttir, Ásta Svavarsdóttir, Eiríkur Rögnvaldsson, Kristín Bjarnadóttir and Hrafn Loftsson. 2012. The Tagged Icelandic Corpus (MÍM). Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages - SaLTMiL 8 – AfLaT2012, pp. 67-72. Istanbul, Turkey.

About MÍM

What is a tagged corpus?

A tagged corpus is a collection of electronic texts in a standard format. The texts are analysed in various ways to make them suitable for linguistic research and language technology projects. Each running word in the text is followed by a tag which shows its part of speech and often also morphosyntactic features such as case, number and gender for nominals, and person, number and tense for verbs. Each running word is also accompanied by a lemma, i.e. its base form: the nominative singular for nominals and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).
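The word, tag and lemma triple described above can be pictured as a small record. The following is a minimal sketch, not the corpus's actual data format: the `Token` class and its field names are hypothetical, and the two tag strings are illustrative values that should be checked against the tagset documentation. The only property taken from this page is that the first character of a tag gives the word class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One running word of a tagged corpus (hypothetical layout)."""
    word: str   # the word form as it appears in the text
    tag: str    # morphosyntactic tag; its first character is the word class
    lemma: str  # base form of the word

    @property
    def word_class(self) -> str:
        # Word class is the first character of the tag, e.g. "n" or "s".
        return self.tag[0]


# Illustrative two-word sentence; the tag values are assumptions.
sentence = [
    Token(word="Hesturinn", tag="nkeng", lemma="hestur"),
    Token(word="hleypur", tag="sfg3en", lemma="hlaupa"),
]

word_classes = [token.word_class for token in sentence]
lemmas = [token.lemma for token in sentence]
```

Keeping the word form, tag and lemma together per token makes it straightforward to query a text by any of the three, for example to collect all lemmas of a given word class.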

The compilation of the Icelandic Corpus

The project to compile a tagged corpus containing Icelandic contemporary texts was started in 2004 at the Institute of Lexicography. The project was continued in 2006 at the newly founded Árni Magnússon Institute for Icelandic Studies (AMI) when the Institute of Lexicography became a part of that institute. The corpus was to contain about 25 million running words of texts from different genres of Icelandic written in the twenty-first century. One of the main criteria for the compilation of the corpus was that it should contain a “balanced” or “representative” text collection. The texts were written during the years 2000–2010 and are, with one exception, original writings in Icelandic by native speakers of Icelandic. Only texts that were available electronically were collected.

To enable the use of the corpus in language technology projects, it was considered important to secure copyright clearance for the texts to be used. All owners of copyrighted text signed a special declaration and agreed that their material may be used free of licensing charges.

It was anticipated that most of the texts would be protected by copyright (the final figure is about 86%). Early on in the project, cooperation was secured from the Writers' Union of Iceland, the Association of Non-fiction and Educational Writers in Iceland, and the Icelandic Publishers' Association. All these associations recommended that their members cooperate with the project. The most important of these recommendations, and the most difficult to secure, was that of the publishers' association, since publishers are normally the keepers of digital copies of published material. When permission had been obtained from an author of a published book, the publisher was contacted to obtain an electronic copy of the text. Both informative writings and imaginative writings were collected. Texts from published books make up just under 24% of the texts in the corpus.

The second-largest portion of text, about 22%, is taken from newspapers, mostly from printed newspapers (less than 1% from two online newspapers). The printed newspapers are Morgunblaðið (20%) and Fréttablaðið (2%). Text from various printed periodicals makes up about 9.5% of the corpus. About 14% of the texts in the corpus are official texts and therefore not covered by copyright. These are speeches from the Icelandic Parliament, Alþingi (about 2% of the corpus texts), legal texts and adjudications (5.2%), and texts from the websites of government ministries (6.8%). All these texts, apart from the parliamentary speeches, which were obtained from the database of Alþingi, were harvested directly from the respective websites. A list of text categories and a list of all texts in the corpus are also available.

Copyright owners were given a copy of the user license that users have to agree to in order to be able to download the corpus texts.

Many kinds of useful information can be extracted from the corpus, such as the frequency of word classes, words and word forms, as well as information on phrases, syntax and semantics. Such data are useful for dictionary compilation and for building spell checkers and grammar checkers, translation software, tools for speech recognition and speech synthesis, and tools for people who are blind, hard of hearing, motor-impaired or dyslexic.

Cooperation and financing

During the first years, the project was financed by the Language Technology project of the Ministry of Education, Science and Culture. The research project Variation in Syntax supplied the spoken component of the corpus. The project was partly financed by the research project Viable Language Technology Beyond English, which received a Grant of Excellence from The Icelandic Research Fund during the years 2009–2011. From February 2011 to January 2013, the project was financed by the Icelandic part of the META-NORD project, which is a cooperation between the Nordic and Baltic countries and is a part of META-NET. Special parts of the project have been financed by grants from The University Research Fund and the Icelandic Student Innovation Fund. The Árni Magnússon Institute for Icelandic Studies is a partner of The Icelandic Centre for Language Technology (ICLT). Researchers affiliated with the ICLT have also taken part in the compilation of the corpus.

Tagging the corpus

The corpus was tagged automatically. The software used, CorpusTagger, was developed for the work on the MIM-GOLD corpus (Loftsson et al., 2010). The text was segmented into sentences and tokenized with the IceNLP software. It was then tagged with four taggers: fnTBL; MXPOST (Ratnaparkhi, 1996); TriTagger, which is part of the IceNLP software and is a re-implementation of the well-known Hidden Markov Model (HMM) tagger TnT (Brants, 2000); and IceTagger (Loftsson, 2008), which is a rule-based tagger and also part of the IceNLP software. The taggers fnTBL, MXPOST and TriTagger are all data-driven taggers that were trained on the IFD corpus. The IFD corpus was also used for the development of the rule-based tagger IceTagger. Finally, the software CombiTagger was used to vote between the tags proposed by the four taggers. The MÍM corpus is thus tagged with the tagset of the IFD corpus, with the exception that proper names are not subclassified as personal names, place names and other proper names. The text was lemmatized with the tool Lemmald (Ingason et al., 2008), which is also part of the IceNLP software. The automatic morphosyntactic tagging accuracy has been estimated at 88.1–95.1%, depending on text type (Loftsson et al., 2010), and the lemmatization accuracy at approximately 90%.
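The idea of voting between the tags proposed by several taggers can be illustrated with a simple majority vote. This is a minimal sketch under stated assumptions, not CombiTagger's actual algorithm or interface: the function name `vote` is hypothetical, the tag strings in the usage example are illustrative, and the tie-breaking rule (prefer the tag proposed by the earliest tagger in the list) is an assumption made for this example.

```python
from collections import Counter


def vote(proposals: list[str]) -> str:
    """Return the tag proposed by most taggers for a single token.

    `proposals` holds one tag per tagger, in a fixed tagger order.
    Ties are broken in favour of the earliest tagger in that order
    (an assumption of this sketch).
    """
    if not proposals:
        raise ValueError("at least one proposed tag is required")
    counts = Counter(proposals)
    best = max(counts.values())
    # Walk the proposals in tagger order so ties resolve deterministically.
    for tag in proposals:
        if counts[tag] == best:
            return tag
    raise AssertionError("unreachable: some tag always has the top count")


# Four taggers propose a tag for the same token (illustrative values).
chosen = vote(["nken", "nkeo", "nken", "nkeþ"])
tied = vote(["nkeo", "nken", "nken", "nkeo"])
```

With four taggers, two-against-two ties are possible, which is why a combination scheme needs an explicit rule for them; ordering the taggers by their expected accuracy is one plausible choice.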

The MÍM tagset

Word frequency

As already mentioned, the text was lemmatized with the tool Lemmald, and the lemmatization accuracy was estimated to be approximately 90%. Reliable figures for the frequency of lemmas would require considerably higher lemmatization accuracy. However, to give some idea of lemma frequency, the frequency of lemmas that occur more than 100 times is shown. The Excel file contains 14 sheets. The first sheet (freq) contains lemmas that occur more than 100 times, sorted by frequency. The word class (pos) is specified, i.e. the first character of the tag. These letters are used: a: adverbs; c: conjunctions; e: foreign words; f: pronouns; g: article; l: adjectives; n: nouns; s: verbs; t: numerals; x: unspecified. Note that prepositions are classified as adverbs. In the next sheet (alphabetic), lemmas are in alphabetical order. In the following sheet (freq(alphab)), lemmas are ordered by frequency, and lemmas with the same frequency are ordered alphabetically. In the next sheet (pos(freq(alphb))), lemmas are ordered by pos, then by frequency and finally alphabetically. The remaining sheets contain the lemmas for each word class, ordered by frequency and then alphabetically.
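The orderings used in the sheets can be reproduced from a list of (lemma, tag) pairs. The following is a minimal sketch, assuming the tokens have already been read into such pairs; the function names and the sample data are hypothetical, and the sketch sorts by Unicode code point, which differs from Icelandic alphabetical order for letters such as á, ð and þ.

```python
from collections import Counter


def lemma_frequencies(tokens, min_count=100):
    """Count (lemma, word class) pairs and keep those above `min_count`.

    `tokens` is an iterable of (lemma, tag) pairs; the word class is
    the first character of the tag. Rows are (lemma, pos, frequency),
    ordered by descending frequency, then alphabetically by lemma.
    """
    counts = Counter((lemma, tag[0]) for lemma, tag in tokens)
    rows = [(lemma, pos, n) for (lemma, pos), n in counts.items() if n > min_count]
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


def by_word_class(rows):
    """Reorder rows by word class, then frequency, then lemma."""
    return sorted(rows, key=lambda row: (row[1], -row[2], row[0]))


# Tiny illustrative sample; a low threshold stands in for the 100 used above.
sample = (
    [("vera", "sfg3en")] * 3
    + [("og", "c")] * 3
    + [("hestur", "nken")] * 2
    + [("hlaupa", "sfg3en")]
)
freq_rows = lemma_frequencies(sample, min_count=1)
pos_rows = by_word_class(freq_rows)
```

Counting on the pair of lemma and word class, rather than the lemma alone, keeps homographs from different word classes apart.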

Project Manager

Project group

Other co-workers

  • Auður Þórunn Rögnvaldsdóttir (preparatory stage)
  • Eyrún Ellý Valsdóttir (text collection and text cleaning)
  • Hjördís Stefánsdóttir (text collection and text cleaning)
  • Guðmundur Örn Leifsson (search interface)
  • Kristján Friðbjörn Sigurðsson (manually checking tags in MIM-GOLD)
  • Jökull Huxley Yngvason (CorpusTagger)
  • Kristín Margrét Jóhannsdóttir (metadata and text cleaning)
  • Steinþór Steingrímsson (import to TEI format, search interface)

Contact

E-mail: clarin@clarin.is

References

  • Brants, Thorsten. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, pp. 224–231. Seattle, Washington, USA.
  • Ingason, Anton K., Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In B. Nordström and A. Ranta (eds.), Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings. Gothenburg, Sweden.
  • Loftsson, Hrafn. 2008. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics, 31(1), 47–72.
  • Loftsson, Hrafn, Jökull H. Yngvason, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. In Proceedings of "Creation and use of basic lexical resources for less-resourced languages", workshop at the 7th International Conference on Language Resources and Evaluation, LREC 2010. Valletta, Malta.
  • Ratnaparkhi, A. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-96), pp. 133–143. Philadelphia, PA.

Further reading