Icelandic Parsed Historical Corpus (IcePaHC)

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus containing samples of written Icelandic from all periods from the 12th century to modern times. The corpus is largely compatible with the parsed corpora of historical English developed at the University of Pennsylvania (UPenn). In the historical texts, the spelling has been modernized to reflect phonological change.

The current release, version 0.9, contains 1,002,390 words in total, with texts from every century from the 12th to the 21st inclusive. All of the text for version 1.0 is already included, but some minor corrections remain to be made.

The corpus, as well as the software developed as part of the IcePaHC project, is released under an LGPL license to ensure compatibility with other LGPL-licensed NLP tools, notably the IceNLP toolkit, which is used extensively in the development of the corpus. IcePaHC is available for both search and download. The corpus is free of charge, and no registration is required.

Search. IcePaHC can be searched in two different ways:
Through a preview version of Treebank Studio, an online tool using the PaCQL query language.
Through INESS.
Download. Three versions with different setups can be downloaded:
IcePaHC. Version 0.9. LGPL. (zip, 11.8 MB)
IcePaHC for Windows. Version 0.9. LGPL. (user friendly visual setup, 12.2 MB)
IcePaHC - Platform Independent User Interface (Java). Version 0.9. LGPL. (manual setup, 12.7 MB)
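
For readers who want to work with the downloaded files directly, the following is a minimal sketch of how a tree in Penn-style labeled bracketing could be read in Python. It assumes that the corpus files follow the bracketing conventions of the Penn historical corpora, with which IcePaHC is mostly compatible. The sample tree, its tags, its ID and the helper names parse_tree and leaves are invented for illustration and are not taken from the corpus or from any IcePaHC tool.

```python
import re

# Hypothetical sample in Penn-style labeled bracketing (not a real corpus tree).
SAMPLE = "( (IP-MAT (NP-SBJ (PRO-N hann-hann)) (VBDI kom-koma) (. .-.)) (ID 0000.SAMPLE,.1))"


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Tokens are opening brackets, closing brackets, or runs of other characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = ""
        # The outermost wrapper has no label, so the label is optional.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf string
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Yield (tag, leaf) pairs in left-to-right order."""
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            yield from leaves(child)
        else:
            yield (label, child)


for tag, leaf in leaves(parse_tree(SAMPLE)):
    print(tag, leaf)
```

This sketch only recovers the tree structure and the leaf strings. For structural queries, the search tools listed above and CorpusSearch are the better choice.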

The most up-to-date version and information on the current state of development can be found in the version control repository on GitHub.

When citing the corpus, please refer to: Wallenberg, Joel, Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. Icelandic Parsed Historical Corpus (IcePaHC). Version 0.9. http://www.linguist.is/icelandic_treebank

About IcePaHC

Detailed information on the development and content of IcePaHC, as well as instructions on how to search the corpus, can be found in the following sources:

IcePaHC home page
IcePaHC download page
Annotation manual for the Penn Historical Corpora and the York-Helsinki Corpus of Early English Correspondence
CorpusSearch User Guide
Rögnvaldsson, Eiríkur, Anton Karl Ingason, Einar Freyr Sigurðsson and Joel Wallenberg. 2012. The Icelandic Parsed Historical Corpus (IcePaHC). Proceedings of LREC 2012, pp. 1978-1984.
