The Icelandic Gigaword Corpus (IGC) consists of about 1500 million running words of text. It is a tagged corpus: each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in language technology projects.
The corpus is available in two ways.
When publishing results based on the texts in the Icelandic Gigaword Corpus, please refer to: Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.
A tagged corpus is a collection of electronic texts in a standard format. The texts are analyzed in various ways to make them suitable for linguistic research and language technology projects. Each running word in the text is followed by a tag that shows its part of speech and often also morphosyntactic features, such as case, number and gender for nominals and person, number and tense for verbs. Each running word is also accompanied by a lemma, e.g. the nominative singular for nominals and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).
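The structure described above can be pictured as a small data model. The following is only a minimal sketch: the class names, field names and metadata keys are hypothetical, the sample tag strings are illustrative placeholders rather than values taken from the corpus, and the sketch does not reflect the actual file format in which IGC is distributed.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One running word with its annotations."""
    word: str   # the word form as it appears in the text
    tag: str    # morphosyntactic tag (placeholder values below)
    lemma: str  # base form of the word


@dataclass
class TaggedText:
    """One corpus text: metadata plus sentences of tagged tokens."""
    metadata: dict[str, str]
    sentences: list[list[Token]] = field(default_factory=list)

    def running_words(self) -> int:
        # Total number of tokens across all sentences.
        return sum(len(sentence) for sentence in self.sentences)


# Hypothetical example text with one two-word sentence.
text = TaggedText(
    metadata={"title": "Example text", "year": "2016"},
    sentences=[[Token("Hestar", "nkfn", "hestur"),
                Token("hlaupa", "sfg3fn", "hlaupa")]],
)
print(text.running_words())
```

Keeping the metadata on the text and the tag and lemma on each token mirrors the two levels of annotation described above.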
The Icelandic Gigaword Corpus consists of about 1300 million running words of text. Some of the corpus texts are official texts (e.g. parliamentary speeches as far back as 1911, law texts, adjudications). The corpus also contains large text collections from news media and various texts from the text collection of the Árni Magnússon Institute for Icelandic Studies. The Gigaword Corpus is a tagged corpus as described above. The corpus was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies. Only texts available in digital form were collected.
To enable the use of the corpus in language technology projects, it was considered important to secure copyright clearance for the texts to be used. Originally the idea was to secure permission from copyright owners to give access to the texts under Creative Commons licenses. Not all copyright holders could agree to those terms. The corpus is therefore divided into two parts, IGC1 and IGC2. IGC1 contains texts that can be used with a special license developed for the Tagged Icelandic Corpus (MIM). IGC2 contains official texts and texts that can be used with a CC BY license. All copyright holders have agreed that their material may be used free of licensing charges. Copyright owners who did not accept the CC BY license signed a special declaration developed for the Tagged Icelandic Corpus, with the necessary amendments for IGC1.
To give an overview of where the texts in the corpus originate, they have been classified into six categories. The largest portion of text in IGC comes from web media, just over 38%. Printed papers account for just under 30% of the corpus, radio and television for just over 4%, and official texts for 26%. Texts from the text collection of the Árni Magnússon Institute for Icelandic Studies make up less than 1% of the corpus. Other texts (about 0.7%) come from the University of Iceland Science Web and the Icelandic part of Wikipedia. The ratio given for each text category is based on the number of running words in the texts. Texts made available with a special license based on the license of the Tagged Icelandic Corpus (IGC1) make up just under 57% of the corpus, and the remainder are texts made available with the CC BY license (IGC2). Just over 86% of the texts are from the period after the year 2000 and just over 94% are from the period after 1980. The oldest texts in the corpus are law texts from the 13th century. The corpus also contains parliamentary speeches as far back as 1911 and a few texts from old newspapers and magazines from before 1900.
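Because the ratios are based on running words rather than on the number of files, a category's share is its word count divided by the total word count. The sketch below illustrates that arithmetic only; the function name is hypothetical and the counts are toy values chosen to make the calculation easy to follow, not figures from the corpus.

```python
def category_shares(word_counts):
    """Return each category's share of all running words, in percent."""
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("no running words to compute shares from")
    return {name: 100 * count / total for name, count in word_counts.items()}


# Hypothetical running-word counts per category (toy values).
toy_counts = {"web media": 50, "official texts": 30, "other": 20}
shares = category_shares(toy_counts)

for name, percent in shares.items():
    print(f"{name}: {percent:.1f}%")
```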
There are 4,154,058 files with 1,260,235,818 running words in IGC.
It is possible to extract many kinds of useful information from the corpus, such as the frequency of word classes, words, word forms and phrases, as well as information on syntax and semantics. Such data are useful for compiling dictionaries and for building spell checkers and grammar checkers, translation software, tools for speech recognition and speech synthesis, and tools for people who are blind, hard of hearing, motor-impaired or dyslexic.
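As an illustration of the frequency information mentioned above, the following minimal sketch counts word forms, lemmas and word classes in a handful of tagged tokens. The tokens and tag strings are invented for the example, the function name is hypothetical, and the sketch assumes that the first character of a tag identifies the word class.

```python
from collections import Counter

# Invented (word form, tag, lemma) triples; not taken from the corpus.
tokens = [
    ("Hestar", "nkfn", "hestur"),
    ("hlaupa", "sfg3fn", "hlaupa"),
    ("og", "c", "og"),
    ("hestur", "nken", "hestur"),
    ("hleypur", "sfg3en", "hlaupa"),
]


def frequency_tables(tokens):
    """Count lower-cased word forms, lemmas and word classes."""
    word_forms = Counter(word.lower() for word, _, _ in tokens)
    lemmas = Counter(lemma for _, _, lemma in tokens)
    # Assumption: the first character of the tag is the word class.
    word_classes = Counter(tag[0] for _, tag, _ in tokens if tag)
    return word_forms, lemmas, word_classes


word_forms, lemmas, word_classes = frequency_tables(tokens)
print(lemmas.most_common(2))
```

Counting by lemma groups the inflected forms of a word together, which is why the lemma annotation matters for a highly inflected language such as Icelandic.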
The corpus was tagged by automatic means. The texts in IGC were divided into sentences and running words and then tagged and lemmatized. IceNLP was used to divide the text into sentences and running words. Tagging was performed with IceStagger (Loftsson and Östling, 2013). Lemmatization was performed with the lemmatizer Nefnir. Nefnir is a new lemmatizer by Jón Friðrik Daðason; it has not yet been described in print, but it gives better results than the previously used lemmatizer, Lemmald (Ingason et al., 2008). Tags and lemmas have not been manually corrected. The tagset used for tagging IGC was developed for the Icelandic Frequency Dictionary (IFD) and is used here with a few changes: proper nouns are not further classified as person names, place names and other names, as was done in the IFD; the tag v is used for URLs and e-mail addresses; abbreviations are not divided into individual words and are tagged with the tag as; all number constants are tagged with the tag ta. IceStagger was trained on a corpus made by concatenating the IFD corpus and the MIM-GOLD corpus. The dictionaries used when tagging were augmented with the vocabulary of the Database of Modern Icelandic Inflection (BÍN).
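The three special tags named above (v, as and ta) can be picked out of tagged text with a simple lookup. The sketch below only shows how a user of the corpus might count such tokens; it is not how IceStagger assigns tags. The function name and the sample tokens are invented for the example.

```python
from collections import Counter

# Special tags named in the description above and what they mark.
SPECIAL_TAGS = {
    "v": "URL or e-mail address",
    "as": "abbreviation",
    "ta": "number constant",
}


def count_special(tagged_tokens):
    """Count tokens whose tag is exactly one of the special tags."""
    return Counter(
        SPECIAL_TAGS[tag] for _, tag in tagged_tokens if tag in SPECIAL_TAGS
    )


# Invented (word form, tag) pairs; not taken from the corpus.
sample = [
    ("www.example.is", "v"),
    ("t.d.", "as"),
    ("2017", "ta"),
    ("1911", "ta"),
    ("hestur", "nken"),
]
special_counts = count_special(sample)
print(special_counts)
```

The lookup matches the whole tag rather than a prefix, so ordinary tags that merely begin with the same letter are not counted.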
The corpus was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson), the Contribution Grants Fund (Mótframlagasjóður) at the University of Iceland and the Ministry of Education and Culture. The company Creditinfo gave assistance in retrieving texts from radio and television and from some web media and printed media.