The Icelandic Gigaword Corpus

The Icelandic Gigaword Corpus (IGC) consists of about 1300 million running words of text. It is a tagged corpus: each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in language technology projects.

The corpus is available in two ways.

Search. The corpus can be searched, and the tags (linguistic annotation) can be used to narrow a query. A search returns a KWIC (keyword in context) index and information about the source of each text example. A search can be restricted to one of the six text categories. The search interface is based on Korp, developed by Språkbanken in Gothenburg.
Download. The texts of the corpus can be used in language technology projects. The corpus is split into two parts, IGC1 and IGC2. Prospective users accept a special user license for IGC1 and a CC BY license for IGC2. The texts are in an XML format that follows TEI P5, the guidelines defined by the Text Encoding Initiative (TEI). All texts are accompanied by metadata (bibliographic information for published works). All users are registered with their e-mail address when they accept the user license.
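
As a rough illustration of working with the downloaded files, the sketch below reads tokens from a small TEI-like fragment using Python's standard library. The element and attribute names (w, lemma, type), the sample sentence and the function name read_tokens are assumptions made for this example; the corpus files themselves are the authority on the actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment; real IGC files may use different
# element and attribute names.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w lemma="hestur" type="nken">Hestur</w>
    <w lemma="hlaupa" type="sfg3en">hleypur</w>
    <c type="punctuation">.</c>
  </s></p></body></text>
</TEI>"""

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_tokens(xml_text):
    """Return (word form, lemma, tag) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [
        (w.text, w.get("lemma"), w.get("type"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


tokens = read_tokens(SAMPLE)
for form, lemma, tag in tokens:
    print(form, lemma, tag)
```

For a file on disk, ET.parse(path).getroot() can replace ET.fromstring in the same function.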

When publishing results based on the texts in the Icelandic Gigaword Corpus, please refer to: Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.

About IGC

What is a tagged corpus?

A tagged corpus is a collection of electronic texts in a standard format. The texts are analyzed in various ways to make them suitable for linguistic research and language technology projects. Each running word in the text is followed by a tag that shows its part of speech and often also morphosyntactic features, such as case, number and gender for nominals and person, number and tense for verbs. Each running word is also accompanied by a lemma, i.e. its dictionary form: the nominative singular for nominals and the infinitive for verbs. Each text is also accompanied by metadata (bibliographic information for published texts).
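
The sketch below shows one way to picture a tagged running word: a triple of word form, tag and lemma. It also spells out a noun tag letter by letter. The Token type, the function describe_noun_tag and the lookup tables are hypothetical, and the assumed tag layout (word class, gender, number, case) covers plain noun tags only; the published tagset is the authority on the actual codes.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # running word as it appears in the text
    tag: str    # morphosyntactic tag
    lemma: str  # dictionary form


# Illustrative lookup tables for noun tags only (assumed layout:
# word class, gender, number, case).
GENDER = {"k": "masculine", "v": "feminine", "h": "neuter"}
NUMBER = {"e": "singular", "f": "plural"}
CASE = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}


def describe_noun_tag(tag):
    """Spell out a noun tag such as 'nkfo'; return None for other tags."""
    if len(tag) < 4 or tag[0] != "n":
        return None
    try:
        return f"noun, {GENDER[tag[1]]}, {NUMBER[tag[2]]}, {CASE[tag[3]]}"
    except KeyError:
        # A letter outside the illustrative tables above.
        return None


token = Token(form="hesta", tag="nkfo", lemma="hestur")
description = describe_noun_tag(token.tag)
print(token.form, "->", token.lemma, "|", description)
```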

The compilation of the Icelandic Gigaword Corpus

The Icelandic Gigaword Corpus consists of about 1300 million running words of text. Part of the corpus consists of official texts (e.g. parliamentary speeches as far back as 1911, law texts, adjudications). The corpus also contains large text collections from news media and various texts from the text collection of the Árni Magnússon Institute for Icelandic Studies. The Gigaword Corpus is a tagged corpus as described above. It was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies. Only texts available in digital form were collected.

To enable the use of the corpus in language technology projects, it was considered important to secure copyright clearance for the texts. The original idea was to obtain permission from copyright owners to give access to the texts under Creative Commons licenses. Not all copyright holders could agree to those terms, so the corpus is divided into two parts, IGC1 and IGC2. IGC1 contains texts that can be used under a special license developed for the Tagged Icelandic Corpus (MIM). IGC2 contains official texts and texts that can be used under a CC BY license. All copyright holders have agreed that their material may be used free of licensing charges. Copyright owners who did not accept the CC BY license signed a special declaration developed for the Tagged Icelandic Corpus, with the amendments necessary for IGC1.

To give an overview of where the texts in the corpus originate, they have been classified into six categories. The largest share of IGC comes from web media, just over 38%. Printed papers account for just under 30%, radio and television for just over 4%, and official texts for 26%. Texts from the collection of the Árni Magnússon Institute for Icelandic Studies make up less than 1%. Other texts (about 0.7%) come from the University of Iceland Science Web and the Icelandic part of Wikipedia. The share given for each category is based on the number of running words in the texts. Texts made available under the special license based on that of the Tagged Icelandic Corpus (IGC1) make up just under 57% of the corpus; the remainder are made available under the CC BY license (IGC2). Just over 86% of the texts date from after the year 2000 and just over 94% from after 1980. The oldest texts in the corpus are law texts from the 13th century. The corpus also contains parliamentary speeches as far back as 1911 and a few texts from newspapers and magazines from before 1900.

There are 4,154,058 files with 1,260,235,818 running words in IGC.
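
To get a feel for the sizes involved, the category shares above can be turned into rough running-word counts. The sketch below is back-of-the-envelope arithmetic only: the percentages are the rounded figures from the text ("just over", "just under", "less than"), so the results are approximations and the shares do not sum to exactly 100%. The names TOTAL_WORDS, SHARES and approx_words are chosen for this example.

```python
TOTAL_WORDS = 1_260_235_818  # running words reported for IGC

# Approximate shares in percent, read off the rounded figures in the text.
SHARES = {
    "web media": 38.0,
    "printed papers": 30.0,
    "official texts": 26.0,
    "radio and television": 4.0,
    "Árni Magnússon Institute collection": 1.0,  # stated as "less than 1%"
    "other": 0.7,
}


def approx_words(total, shares):
    """Convert percentage shares into approximate running-word counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


estimates = approx_words(TOTAL_WORDS, SHARES)
for name, words in estimates.items():
    print(f"{name}: about {words / 1e6:.0f} million words")
```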

It is possible to extract many kinds of useful information from the corpus, such as information on the frequency of word classes, words and word forms, and on phrases, syntax and semantics. Such data are useful for dictionary compilation and for building spell checkers and grammar checkers, translation software, tools for speech recognition and speech synthesis, and assistive tools for people who are blind, hard of hearing or motor-impaired and for people with dyslexia.
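
As a minimal sketch of this kind of extraction, the example below counts word forms, lemmas and word classes in a handful of tokens. The token triples are invented stand-ins for corpus data, and taking the first letter of the tag as the word class is an assumption about the tagset.

```python
from collections import Counter

# Hypothetical (word form, lemma, tag) triples standing in for corpus data.
tokens = [
    ("hestur", "hestur", "nken"),
    ("hesta", "hestur", "nkfo"),
    ("hleypur", "hlaupa", "sfg3en"),
    ("Hestur", "hestur", "nken"),
]

# Word forms are lower-cased so that sentence-initial capitals are merged.
form_freq = Counter(form.lower() for form, _, _ in tokens)
lemma_freq = Counter(lemma for _, lemma, _ in tokens)
# First tag character taken as the word class (an assumption).
class_freq = Counter(tag[0] for _, _, tag in tokens)

print(form_freq.most_common(3))
print(lemma_freq.most_common(3))
print(class_freq.most_common(3))
```

Counting lemmas rather than word forms groups inflected forms together, which matters for a language as richly inflected as Icelandic.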

Tagging the corpus

The corpus was tagged automatically. The texts in IGC were divided into sentences and running words and then tagged and lemmatized. IceNLP was used to divide the text into sentences and running words. Tagging was performed with IceStagger (Loftsson and Östling, 2013). Lemmatization was performed with Nefnir, a new lemmatizer by Jón Friðrik Daðason. Nefnir has not been described yet, but it gives better results than the previously used lemmatizer, Lemmald (Ingason et al., 2008). Tags and lemmas have not been manually corrected. The tagset used for IGC was developed for the Icelandic Frequency Dictionary (IFD) and is used here with a few changes: proper nouns are not further classified as person names, place names and other names, as was done in the IFD; the tag v is used for URLs and e-mail addresses; abbreviations are not divided into individual words and are tagged with the tag as; all number constants are tagged with the tag ta. IceStagger was trained on a corpus made by concatenating the IFD corpus and the MIM-GOLD corpus. The dictionaries used in tagging were augmented with the dictionary of the Database of Modern Icelandic Inflection (BÍN).
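
To make two of the tagset changes concrete, the sketch below assigns the tag v to URLs and e-mail addresses and the tag ta to number constants. It only illustrates which kinds of tokens receive these tags; it is not how IceStagger works, and the regular expressions and the function name special_tag are simplifying assumptions. Abbreviations (tag as) are left out because recognizing them requires a word list.

```python
import re

# Simplified patterns, assumed for this sketch only.
URL_OR_EMAIL = re.compile(r"^(https?://\S+|www\.\S+|[^@\s]+@[^@\s]+\.[^@\s]+)$")
NUMBER = re.compile(r"^\d+([.,]\d+)*$")


def special_tag(token):
    """Return 'v' for URLs and e-mail addresses, 'ta' for number constants,
    and None for every other token."""
    if URL_OR_EMAIL.match(token):
        return "v"
    if NUMBER.match(token):
        return "ta"
    return None


for tok in ["www.example.is", "notandi@example.is", "2018", "hestur"]:
    print(tok, special_tag(tok))
```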

Tagset for the Icelandic Gigaword corpus.

Cooperation and financing

The corpus was compiled during the years 2015 to 2017 at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson), the Contribution Grants Fund (Mótframlagasjóður) at the University of Iceland and the Ministry of Education and Culture. The company Creditinfo gave assistance in retrieving texts from radio and television and from some web media and printed media.

Project group

Software development

Gunnar Thor Örnólfsson
Kristján Rúnarsson
Starkaður Barkarson

Special thanks

Creditinfo

Contact

E-mail: malfong[hja]malfong.is

References

Ingason, Anton K., Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In B. Nordström and A. Ranta (eds.), Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings. Gothenburg, Sweden.
Loftsson, Hrafn, and Robert Östling. 2013. Tagging a morphologically complex language using an averaged perceptron tagger: The case of Icelandic. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA-2013), NEALT Proceedings Series 16. Oslo, Norway.
