Icelandic Parsed Historical Corpus (IcePaHC)

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus with samples of written Icelandic from every period between the 12th century and modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. In the historical texts, spelling is modernized to account for phonological change.

The current release, version 0.9, contains 1,002,390 words in total, drawn from every century from the 12th to the 21st inclusive. All of the text for version 1.0 is already included, but some minor corrections remain to be made.

The corpus, as well as software developed as part of the IcePaHC project, is released under an LGPL license, to ensure compatibility with other LGPL-licensed NLP tools, notably the IceNLP toolkit, which is used extensively in the development of the corpus. IcePaHC is available for both search and download. The corpus is free and there is no registration wall.

The most up-to-date version and information on the current state of development are available in the project's version control repository on GitHub.

When citing the corpus, please refer to: Wallenberg, Joel, Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. Icelandic Parsed Historical Corpus (IcePaHC). Version 0.9. http://www.linguist.is/icelandic_treebank

About IcePaHC

Detailed information on the development and content of IcePaHC, as well as instructions on how to search the corpus, can be found in the following sources:

Project group