The Icelandic Spoken Language Corpus

The Icelandic Spoken Language Corpus (ISLC) contains four different subcorpora:

Spontaneous conversations, from the project ÍSTAL (An Icelandic Spoken Language Bank) (from 2000, 194,048 tokens)
Group conversations, from the project MIN (Modern loanwords in the Nordic languages) (from 2002, 97,316 tokens)
Parliamentary debates (from 2004-2005, 177,427 tokens)
Conversations of teenagers with other teenagers and adults, from the project "How do young Icelanders speak in the beginning of the 21st century?" (from 2006, 35,527 tokens)
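
The token counts above can be added up to give the overall size of the corpus and the share contributed by each subcorpus. The short Python sketch below does only that arithmetic; the dictionary keys are shortened labels chosen for this illustration, not official names.

```python
# Token counts copied from the list of subcorpora above.
# The keys are abbreviated labels made up for this sketch.
subcorpora = {
    "ÍSTAL spontaneous conversations": 194_048,
    "MIN group conversations": 97_316,
    "Parliamentary debates": 177_427,
    "Teenager conversations": 35_527,
}

total = sum(subcorpora.values())


def share(name):
    """Return the fraction of all tokens that belongs to one subcorpus."""
    return subcorpora[name] / total


for name, tokens in subcorpora.items():
    print(f"{name}: {tokens:,} tokens ({share(name):.1%})")
print(f"Total: {total:,} tokens")
```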

The corpus is tagged, which means that each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in language technology projects.

The corpus is available in two ways.

Search. The corpus can be searched online, and the tags (linguistic annotation) can be used to narrow a query. Results are shown as a KWIC (keyword in context) index together with information about the source of each text example. The search interface is based on Korp, developed by Språkbanken in Gothenburg.
Download. The corpus can be downloaded in its entirety. The texts are in an XML format, TEI P5, which is defined by the TEI (Text Encoding Initiative). All texts are accompanied by metadata (bibliographic information for published works). The texts can be used under a CC BY license. All users are registered with their e-mail address when they accept the user license.
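
As an illustration of working with the downloaded files, the sketch below reads word forms, lemmas and tags from a small TEI fragment using Python's standard library. The TEI namespace is the standard one for TEI P5, but the element and attribute layout (sentences in s, words in w with lemma and type attributes, punctuation in c) is an assumption made for this example, as is the sample sentence; check it against the actual files before relying on it.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: the element/attribute layout is an assumption
# for this sketch and may differ from the distributed files.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s n="1">
      <w lemma="ég" type="fp1en">Ég</w>
      <w lemma="sjá" type="sfg1eþ">sá</w>
      <w lemma="hestur" type="nkeog">hestinn</w>
      <c type="punctuation">.</c>
    </s>
  </p></body></text>
</TEI>"""


def read_sentences(xml_text):
    """Return one list of (form, lemma, tag) tuples per <s> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        tokens = [
            (w.text, w.get("lemma"), w.get("type"))
            for w in s.iterfind("tei:w", TEI_NS)  # words only, punctuation skipped
        ]
        sentences.append(tokens)
    return sentences


for sentence in read_sentences(SAMPLE):
    for form, lemma, tag in sentence:
        print(form, lemma, tag, sep="\t")
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring in the same function.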

When publishing results based on the texts in the Icelandic Spoken Language Corpus, please refer to: Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.

About ISLC

What is a tagged corpus?

A tagged corpus is a collection of electronic texts in a standard format. The texts are analyzed in various ways to make them suitable for linguistic research and language technology projects. Each running word in the text is followed by a tag that shows its part of speech and often also morphosyntactic features, such as case, number and gender for nominals and person, number and tense for verbs. Each running word is also accompanied by a lemma (base form), e.g. the nominative singular for nominals and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).
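
To make this concrete, the following sketch represents a tagged sentence as (form, tag, lemma) triples and builds a small KWIC listing by lemma, so that different inflected forms of the same word are found with one query. The Token type, the kwic function, the sample sentence and its tag strings are all made up for this illustration; the tags follow the general shape of the IFD tagset but should be checked against the published tagset, and this is not how the Korp search is implemented.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # running word as it appears in the text
    tag: str    # morphosyntactic tag (illustrative values)
    lemma: str  # base form


# Hypothetical tagged sentence: "Ég sá hestinn og hestarnir sáu mig"
TOKENS = [
    Token("Ég", "fp1en", "ég"),
    Token("sá", "sfg1eþ", "sjá"),
    Token("hestinn", "nkeog", "hestur"),
    Token("og", "c", "og"),
    Token("hestarnir", "nkfng", "hestur"),
    Token("sáu", "sfg3fþ", "sjá"),
    Token("mig", "fp1eo", "ég"),
]


def kwic(tokens, lemma, width=2):
    """Return (left context, keyword, right context) for every hit on a lemma."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lemma == lemma:
            left = " ".join(t.form for t in tokens[max(0, i - width):i])
            right = " ".join(t.form for t in tokens[i + 1:i + 1 + width])
            lines.append((left, tok.form, right))
    return lines


for left, keyword, right in kwic(TOKENS, "hestur"):
    print(f"{left:>15} [{keyword}] {right}")
```

Searching on the lemma hestur finds both hestinn and hestarnir, which a search on the surface form alone would miss.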

Tagging the corpus

The corpus was tagged automatically; tags and lemmas have not been manually corrected. The texts were first divided into sentences and running words with IceNLP, then tagged with IceStagger (Loftsson and Östling, 2013) and lemmatized with Nefnir. Nefnir is a new lemmatizer by Jón Friðrik Daðason. It has not been described yet, but it gives better results than the previously used lemmatizer, Lemmald (Ingason et al., 2008).

The tagset used for tagging ISLC was developed for the Icelandic Frequency Dictionary (IFD), with a few changes: proper nouns are not subclassified as person names, place names and other names, as was done in the IFD; the tag v is used for URLs and e-mail addresses; abbreviations are not divided into individual words and are given the tag as; all number constants are given the tag ta. IceStagger was trained on a corpus made by concatenating the IFD corpus and the MIM-GOLD corpus. The dictionaries used when tagging were augmented with the dictionary of the Database of Modern Icelandic Inflection (BÍN).
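
The three special tags mentioned above (v, as and ta) can be illustrated with a small rule-based sketch. This is not how IceStagger assigns them; the regular expressions, the short abbreviation list and the function name special_tag are assumptions made for this example only.

```python
import re

# Rough patterns, assumed for this sketch only.
URL_OR_EMAIL = re.compile(r"^(?:https?://\S+|www\.\S+|[^@\s]+@[^@\s]+\.[^@\s]+)$")
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")

# Tiny hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"t.d.", "o.s.frv.", "þ.e."}


def special_tag(token):
    """Return 'v', 'as' or 'ta' for the special token classes, else None."""
    if URL_OR_EMAIL.match(token):
        return "v"   # URLs and e-mail addresses
    if token.lower() in ABBREVIATIONS:
        return "as"  # abbreviations kept as a single token
    if NUMBER.match(token):
        return "ta"  # number constants
    return None      # left to the statistical tagger


for tok in ["https://malfong.is", "t.d.", "2018", "hestur"]:
    print(tok, special_tag(tok))
```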

Tagset for the Icelandic Spoken Language Corpus

Cooperation and financing

The corpus is a subset of the Icelandic Gigaword Corpus which was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson), the Contribution grants fund (Mótframlagasjóður) at the University of Iceland and the Ministry of Education and Culture.

Project group

Software development

Gunnar Thor Örnólfsson
Kristján Rúnarsson
Starkaður Barkarson

Contact

E-mail: malfong[hja]malfong.is

References

Ingason, Anton K., Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In B. Nordström and A. Ranta (eds.), Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings. Gothenburg, Sweden.
Loftsson, Hrafn, and Robert Östling. 2013. Tagging a morphologically complex language using an averaged perceptron tagger: The case of Icelandic. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA-2013), NEALT Proceedings Series 16. Oslo, Norway.
