The Saga Corpus

The Saga Corpus contains a number of Old Icelandic narrative texts: the Family Sagas (Íslendingasögur), Sturlunga Saga, the Sagas of the Kings of Norway (Heimskringla) and the Book of Settlement (Landnámabók). With the exception of Landnámabók, the texts are taken from editions published by Svart á hvítu and Mál og menning between 1985 and 1991.

The texts have been normalized to Modern Icelandic spelling, and several inflectional endings were also changed to their Modern Icelandic form. The texts can be searched, and they can also be downloaded for use in linguistic research and language technology (LT) projects.

The texts of the Saga Corpus are available for use in two different ways:

  • Search. The search is available on the corpus website of the Árni Magnússon Institute. Grammatical information can be used to refine the search, and bibliographic information is displayed for the texts that appear in the search results. Here is a list of the texts that can be searched. On the search page, any of the texts can be selected for searching. One of the works is Íslendingaþættir, a collection of tales; any of the individual tales can also be selected.
  • Download. The texts are available in an XML format defined by the TEI (Text Encoding Initiative). Bibliographic information is included with all the texts. Prospective users must register and accept the terms and conditions. The texts are available under a CC BY 4.0 licence.
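Once downloaded, the TEI files can be read with any XML parser. The following is a minimal sketch in Python, not a description of the actual file layout. It assumes, hypothetically, that sentences are `s` elements and tokens are `w` elements in the TEI namespace, with the lemma and tag stored in `lemma` and `type` attributes. The sample string and its tag values are invented for illustration, so the element and attribute names should be checked against the downloaded files.

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI is standard; everything else below is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; the real files may use different elements and attributes.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s n="1">
      <w lemma="maður" type="nken">Maður</w>
      <w lemma="heita" type="sfg3eþ">hét</w>
      <w lemma="Ketill" type="nken-s">Ketill</w>
      <c type="punctuation">.</c>
    </s>
  </p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return a list of sentences, each a list of (word, lemma, tag) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        # Only word elements are collected; punctuation elements are skipped.
        tokens = [(w.text, w.get("lemma"), w.get("type"))
                  for w in s.iterfind("tei:w", TEI_NS)]
        sentences.append(tokens)
    return sentences


sentences = read_tokens(SAMPLE)
```

Skipping punctuation elements matches the convention of counting running words without punctuation, but whether punctuation is marked up separately in the real files is likewise an assumption.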

When publishing results based on the texts in the Saga Corpus, please refer to: Rögnvaldsson, Eiríkur, and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In Sporleder, Caroline, Antal P.J. van den Bosch and Kalliopi A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63–76. Springer, Berlin.

About the Saga Corpus

The Texts

The corpus made available here contains forty-one texts from the Family Sagas, Sturlunga Saga, Heimskringla and the Book of Settlement. The division of the corpus is shown in the table below. The numbers refer to running words, excluding punctuation.

Text                  Words
Family Sagas          982,066
Sturlunga Saga        260,586
Heimskringla          231,502
Book of Settlement    37,120
Total                 1,511,275

The texts of the Family Sagas (Halldórsson et al. (eds.), 1985-1986) and of Sturlunga Saga (Thorsson et al. (eds.), 1988) are taken from the editions published by Svart á hvítu. The text of Heimskringla is from the edition published by Mál og menning in 1991 (Kristjánsdóttir et al. (eds.), 1991). The spelling was normalized to Modern Icelandic spelling and some inflectional endings were changed to Modern Icelandic form. The text of the Book of Settlement is from Jakob Benediktsson's edition of 1968 (Benediktsson, 1968). The book was scanned and the text normalized to Modern Icelandic spelling in the same way as the other texts. A list of the texts can be found here. One of the texts is Íslendingaþættir, a collection of tales called þættir.

What was changed?

The transliteration to modern spelling includes the following changes: the number of vowel symbols is reduced ('æ' is used for both 'æ' and 'œ', and 'ö' is used for both 'ø' and 'ǫ'); the letter u is inserted between a consonant and r at the end of a word (maðr > maður); ss and rr at the end of a word are shortened (íss > ís, herr > her); and t and k at the end of a word in unstressed syllables are changed to ð and g (þat > það, ok > og). Furthermore, some inflectional endings were changed to Modern Icelandic form.
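The spelling rules listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure actually used to prepare the corpus: it works on single lowercase words, applies the rules literally as stated, and handles the final t/k rule with a small, hypothetical and incomplete word list (`UNSTRESSED`) because stress cannot be detected from spelling alone. The names `normalize_word`, `VOWEL_MAP`, `CONSONANTS` and `UNSTRESSED` are invented for this example.

```python
import re

# Vowel symbols merged in the modern spelling: œ -> æ, ø -> ö, ǫ -> ö.
VOWEL_MAP = str.maketrans({"œ": "æ", "ø": "ö", "ǫ": "ö"})

# Consonants that trigger u-insertion before word-final r. The letter r is
# left out so that final "rr" is handled by the shortening rule instead.
CONSONANTS = "bdðfghjklmnpstvxzþ"

# Hypothetical, incomplete list of unstressed words with final t/k -> ð/g.
UNSTRESSED = {"þat": "það", "ok": "og", "at": "að", "mik": "mig"}


def normalize_word(word):
    """Apply the four spelling rules to one lowercase word."""
    word = word.translate(VOWEL_MAP)
    if word in UNSTRESSED:
        return UNSTRESSED[word]
    # Shorten word-final ss and rr (íss > ís, herr > her).
    word = re.sub(r"(ss|rr)$", lambda m: m.group(1)[0], word)
    # Insert u between a consonant and word-final r (maðr > maður).
    word = re.sub(rf"([{CONSONANTS}])r$", r"\1ur", word)
    return word


examples = ["maðr", "íss", "herr", "þat", "ok", "bœr"]
normalized = [normalize_word(w) for w in examples]
# normalized == ["maður", "ís", "her", "það", "og", "bær"]
```

The changes to inflectional endings are not covered here, since the text does not specify which endings were affected.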

Tagging experiments

Three attempts were made at tagging the texts. The first experiments were performed in 2005, when the texts were tagged with a method developed for Modern Icelandic. Methods for tagging Icelandic text have been developed using the tagged texts of the Icelandic Frequency Dictionary (IFD). The data-driven tagger TnT (Brants, 2000) was trained on the tagged texts of the IFD (Helgadóttir, 2004, 2007), and the resulting model was used to tag all the texts in the Saga Corpus. To measure the tagging accuracy, four randomly selected samples of 1,000 words each were used: one from the Family Sagas, one from Heimskringla and two from Sturlunga Saga. The tags in these samples were corrected manually. When the correct tags in these samples were counted, the tagging accuracy was 88%, whereas it was 90.4% for the texts from the IFD. The structure of sentences in Old Icelandic is quite different from that in Modern Icelandic, and the different word order should particularly affect the accuracy of a statistical tagger such as TnT, which is based on trigrams. However, sentences in Old Icelandic texts are generally very short, and short sentences are easier to analyze than long ones.
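The accuracy measure described above amounts to counting how many automatically assigned tags agree with the manually corrected ones. A minimal Python sketch follows. It assumes that the samples are pooled token by token (the text does not say whether accuracy was pooled or averaged per sample), and the function names and the tiny tag sequences are invented for illustration.

```python
def count_correct(predicted, gold):
    """Count positions where the predicted tag equals the corrected tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    return sum(p == g for p, g in zip(predicted, gold))


def pooled_accuracy(samples):
    """Accuracy over a list of (predicted, gold) samples, pooled by token."""
    correct = sum(count_correct(p, g) for p, g in samples)
    total = sum(len(g) for _, g in samples)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Two tiny invented samples standing in for the 1,000-word samples.
samples = [
    (["n", "v", "p", "n"], ["n", "v", "p", "a"]),  # 3 of 4 tags agree
    (["c", "n", "v", "n"], ["c", "n", "v", "n"]),  # 4 of 4 tags agree
]
accuracy = pooled_accuracy(samples)  # 7 / 8 = 0.875
```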

Then seven whole texts (sagas) and two fragments from the Sturlunga collection were selected for manual correction, totalling 95,000 words. The TnT tagger was trained on these texts and the new model was used to tag the whole corpus. Accuracy was again measured on the four samples, which resulted in 91.7% accuracy. Finally, the TnT tagger was trained on the union of the corrected Old Icelandic texts and the Modern Icelandic texts, and the Old Icelandic corpus was then tagged using this model. Accuracy was measured in the same way as before and reached 92.7% (Rögnvaldsson and Helgadóttir, 2011).

In 2013, Hrafn Loftsson and Robert Östling again experimented with tagging the Old Icelandic texts (Loftsson and Östling, 2013). They corrected the training corpus from the Sturlunga collection partly automatically and partly manually, correcting a total of 2,144 tags. They tested three taggers, of which the best performing was Stagger (Östling, 2012). The authors tagged the corrected training corpus (which they call SAGA-GOLD) using 10-fold cross-validation, adding the IFD corpus to each training fold, and obtained a mean accuracy of 91.76%. The authors also combined the output of three taggers (TriTagger, HMM+Ice+HMM (Loftsson et al., 2009) and Stagger) and obtained 92.32% accuracy.
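Cross-validation with an extra corpus added to every training fold can be sketched as follows. This is a generic illustration in Python, not the authors' code: the round-robin assignment of items to folds is an assumption, the function names are invented, and `train_and_score` stands for any hypothetical routine that trains a tagger on one list and returns its accuracy on another.

```python
def folds_with_extra_training(gold, extra, k=10):
    """Yield (train, test) pairs for k-fold cross-validation over `gold`.

    All of `extra` is appended to every training set but never to a test set.
    """
    folds = [gold[i::k] for i in range(k)]  # round-robin split (assumption)
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i
                 for item in fold]
        yield train + list(extra), test


def mean_accuracy(gold, extra, train_and_score, k=10):
    """Average the score of `train_and_score(train, test)` over the k folds."""
    scores = [train_and_score(train, test)
              for train, test in folds_with_extra_training(gold, extra, k)]
    return sum(scores) / len(scores)


# Stand-ins: integers for gold-standard sentences, strings for the extra corpus.
gold = list(range(20))
extra = ["x1", "x2", "x3"]
splits = list(folds_with_extra_training(gold, extra, k=10))
```

Each gold-standard item is tested exactly once, while the extra corpus only ever contributes to training, which is why the resulting figure measures accuracy on the gold-standard corpus alone.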

In January 2018, Starkaður Barkarson retagged all the texts of the Saga Corpus using Stagger. A new training corpus was made by concatenating the IFD corpus (about 500 thousand running words), the texts from the Sturlunga collection (about 95,000 running words, SAGA-GOLD) with the corrections made by Hrafn Loftsson, and the new gold standard for Icelandic, MIM-GOLD (about 1 million running words). Tagging accuracy was estimated with the same method as in the first experiment, i.e. by tagging the three thousand-word samples and comparing the tags to the corrected tags. The accuracy was estimated to be 93.5%. This number is not comparable to the result obtained by Loftsson and Östling, since they did not use the thousand-word samples but performed a ten-fold cross-validation instead.

Tagging and lemmatizing the texts

The texts were tokenized and split into sentences using the IceNLP suite. They were tagged with Stagger as described above and lemmatized with the lemmatizer Nefnir. Nefnir is a new lemmatizer by Jón Friðrik Daðason that has not yet been described, but it gives better results than the previously used lemmatizer, Lemmald (Ingason et al., 2008). After tagging was completed, the tags in the part of the Sturlunga texts that also belongs to the training corpus were restored to their corrected values.
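The final step, restoring the corrected tags after automatic tagging, can be illustrated with a short Python sketch. The data layout is an assumption made for this example: both inputs map a hypothetical text identifier to a list of (word, tag) pairs, and a corrected version replaces the automatic one only when the two have identical tokenization. The identifiers and tags are invented, and the actual restoration may have worked on parts of texts rather than whole texts.

```python
def restore_corrected_tags(tagged, corrected):
    """Replace automatic tags with corrected ones where these exist.

    Both arguments map a text identifier to a list of (word, tag) pairs.
    """
    result = {}
    for text_id, tokens in tagged.items():
        gold = corrected.get(text_id)
        if gold is None:
            # No manual correction available: keep the automatic tags.
            result[text_id] = list(tokens)
            continue
        if [w for w, _ in tokens] != [w for w, _ in gold]:
            raise ValueError(f"token mismatch in {text_id}")
        result[text_id] = list(gold)
    return result


# Invented example: only "saga_b" has a manually corrected version.
tagged = {
    "saga_a": [("maður", "n"), ("hét", "v")],
    "saga_b": [("og", "c"), ("það", "a")],
}
corrected = {"saga_b": [("og", "c"), ("það", "f")]}
restored = restore_corrected_tags(tagged, corrected)
```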

Contact

malfong[hja]malfong.is

References

  • Benediktsson, Jakob (ed.). 1968. Íslenzk fornrit I. Íslendingabók - Landnámabók. Hið íslenzka fornritafélag.
  • Brants, Thorsten. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, pp. 224-231. Seattle, Washington, USA.
  • Halldórsson, Bragi, Jón Torfason and Örnólfur Thorsson (eds.). 1985-1986. Íslendinga sögur. Svart á hvítu. Reykjavík.
  • Helgadóttir, Sigrún. 2004. Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In H. Holmboe (ed.): Nordisk Sprogteknologi. Museum Tusculanums Forlag.
  • Helgadóttir, Sigrún. 2007. Mörkun íslensks texta. Orð og tunga 9:75-107. Reykjavík.
  • Ingason, Anton K., Sigrún Helgadóttir, Hrafn Loftsson and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using Hierarchy of Linguistic Identities (HOLI). In B. Nordström and A. Ranta (eds.), Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings. Gothenburg, Sweden.
  • Kristjánsdóttir, Bergljót, Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.). 1991. Heimskringla. Mál og menning. Reykjavík.
  • Loftsson, Hrafn, Ida Kramarczyk, Sigrún Helgadóttir and Eiríkur Rögnvaldsson. 2009. Improving the PoS tagging accuracy of Icelandic text. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA-2009). Odense, Denmark.
  • Loftsson, Hrafn, and Robert Östling. 2013. Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA-2013), NEALT Proceedings Series 16. Oslo, Norway.
  • Rögnvaldsson, Eiríkur, and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In C. Sporleder, A.P.J. van den Bosch and K.A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63-76. Springer, Berlin.
  • Thorsson, Örnólfur, Bergljót Kristjánsdóttir, Bragi Halldórsson, Gísli Sigurðsson, Guðrún Ása Grímsdóttir, Guðrún Ingólfsdóttir, Jón Torfason and Sverrir Tómasson (eds.). 1988. Sturlunga saga. Svart á hvítu. Reykjavík.
  • Östling, Robert. 2012. Stagger: A modern POS tagger for Swedish. In Proceedings of the 4th Swedish Language Technology Conference, SLTC, Lund, Sweden.