The Icelandic Spoken Language Corpus (ISLC) contains four different subcorpora:
The corpus is tagged, which means that each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in Language Technology projects.
The corpus is available in two ways.
When publishing results based on the texts in the Icelandic Parliamentary Corpus, please refer to: Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.
A tagged corpus is a collection of electronic texts in a standard format. The texts are analyzed in various ways to make them suitable for linguistic research and Language Technology projects. Each running word in the text is followed by a tag which shows its part of speech and often also morphosyntactic features, such as case, number and gender for nominals and person, number and tense for verbs. Each running word is also accompanied by a lemma, e.g. the nominative singular for nominals and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).
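As a rough illustration of the structure described above, the following Python sketch models a tagged text as metadata plus sentences of (word, tag, lemma) tokens. The class names, field names, metadata keys and the helper function are hypothetical and are not part of the corpus distribution; the tag strings in the example are illustrative only and should be checked against the tagset documentation.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One running word with its morphosyntactic tag and lemma."""
    word: str
    tag: str
    lemma: str


@dataclass
class TaggedText:
    """A text with its metadata and its sentences (lists of Token)."""
    metadata: dict
    sentences: list = field(default_factory=list)


def lemma_frequencies(text):
    """Count how often each lemma occurs in a tagged text."""
    counts = {}
    for sentence in text.sentences:
        for token in sentence:
            counts[token.lemma] = counts.get(token.lemma, 0) + 1
    return counts


# Hypothetical example text; tags are illustrative, not verified corpus output.
example = TaggedText(
    metadata={"title": "Example text", "year": 2017},
    sentences=[[
        Token("Hestar", "nkfn", "hestur"),
        Token("hlaupa", "sfg3fn", "hlaupa"),
        Token(".", ".", "."),
    ]],
)
```

Keeping the lemma next to each word form makes it possible to count or search by dictionary form, for example with lemma_frequencies(example), regardless of how the word is inflected in the text.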
The corpus was tagged by automatic means. The texts were divided into sentences and running words and then tagged and lemmatized. IceNLP was used to divide the text into sentences and running words. Tagging was performed with IceStagger (Loftsson and Östling, 2013). Lemmatization was performed with the lemmatizer Nefnir. Nefnir, a new lemmatizer by Jón Friðrik Daðason, has not yet been described, but it gives better results than the previously used lemmatizer, Lemmald (Ingason et al., 2008). Tags and lemmas have not been manually corrected.
The tagset used for tagging IPC was developed for the making of the Icelandic Frequency Dictionary (IFD), with a few changes: proper nouns are not analyzed separately as person names, place names and other names, as was done in the IFD; the tag "v" is used for URLs and e-mail addresses; abbreviations are not divided into individual words and are tagged with the tag "as"; all number constants are tagged with the tag "ta". A corpus made by concatenating the IFD corpus and the MIM-GOLD corpus was used to train IceStagger. Dictionaries used when tagging were augmented with the dictionary of the Database of Modern Icelandic Inflection (BÍN).
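The special tags mentioned above ("v" for URLs and e-mail addresses, "as" for abbreviations, "ta" for number constants) can be pictured with a small Python sketch. This is not how IceStagger assigns these tags; the regular expressions, the function name and the abbreviation list are assumptions made purely for illustration.

```python
import re

# Assumed, simplified patterns; the real tagger's rules are not documented here.
URL_OR_EMAIL = re.compile(r"^(https?://\S+|www\.\S+|[^@\s]+@[^@\s]+\.[^@\s]+)$")
NUMBER = re.compile(r"^\d+([.,]\d+)*$")


def special_tag(token, abbreviations=frozenset()):
    """Return 'v', 'ta' or 'as' for the special token classes, else None."""
    if URL_OR_EMAIL.match(token):
        return "v"   # URLs and e-mail addresses
    if NUMBER.match(token):
        return "ta"  # number constants
    if token in abbreviations:
        return "as"  # abbreviations kept as a single token
    return None      # ordinary word: left to the statistical tagger
```

For example, with a hypothetical abbreviation list containing "t.d.", the call special_tag("t.d.", {"t.d."}) yields "as", while an ordinary word yields None and would receive a regular IFD-style tag from the tagger.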
Tagset for the Icelandic Spoken Language Corpus
The corpus is a subset of the Icelandic Gigaword Corpus, which was compiled from 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson), the Contribution grants fund (Mótframlagasjóður) at the University of Iceland and the Ministry of Education and Culture.
E-mail: malfong[hja]malfong.is