ParIce is a parallel corpus of Icelandic and English texts containing 3,589,052 sentence pairs. The texts are aligned at the sentence level and the word level, and have also been tagged and lemmatized. The Icelandic text contains 46,727,741 tokens.
The corpus is available in two ways.
When publishing results based on the texts in ParIce please refer to: Barkarson, Starkaður, and Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics, pp. 140-145. Turku, Finland.
The texts in ParIce are collected from eleven different sources, mostly from available parallel corpora (Opus, Tilde, ELRC) or retrieved from websites.
Text | Pairs | Description | Origin |
The Bible | 65,241 | Icelandic text of Biblía 21. aldar (2007) and English text of the King James Version. | English text: http://christos-c.com; Icelandic text: http://biblian.is |
EEA documents | 1,701,172 | Regulations and other documents of EEA | PDF-files from http://eur-lex.europa.eu |
Patient information leaflets (EMA) | 404,333 | Package leaflets (EMA) | Tilde (https://tilde-model.s3-eu-west-1.amazonaws.com/Tilde_MODEL_Corpus.html) |
European Southern Observatory (ESO) | 12,633 | Press releases from the European Southern Observatory. | http://www.eso.org/public/ |
Statistics Iceland - from website | 2,288 | Texts from the website of Statistics Iceland | ELRC (http://www.lr-coordination.eu/resources) |
Icelandic Sagas | 17,597 | Texts of twelve Sagas: Brennunjáls saga, Eiríks saga rauða, Eyrbyggjasaga, Grettis saga, Gunnlaugs saga ormstungu, Hænsna-Þóris saga, Hávarðar saga Ísfirðings, Heiðarvíga saga, Hrafnkels saga freysgoða, Laxdæla saga, Þórðar saga hreðu, Víga-Glúms saga | Icelandic texts: Publications of Svart á hvítu from 1985-1986 (http://clarin.is/en/resources/sagacorpus/); English texts: Gutenberg Project (http://www.gutenberg.org) |
KDE4 | 49,909 | Texts from the localization files of KDE | Opus (http://opus.nlpl.eu/KDE4.php) |
Classical literature | 12,416 | Texts of four books from the 19th century: Michael Strogoff by Jules Verne, Rupert of Hentzau by Anthony Hope, The Subjection of Women by John Stuart Mill, The Prisoner of Zenda by Anthony Hope | Icelandic texts: http://rafbokavefur.is; English texts: Gutenberg Project (http://www.gutenberg.org) |
OpenSubtitles | 1,304,628 | Collections of subtitles of films and TV series from 1931 to 2016. | Opus (http://opus.nlpl.eu/OpenSubtitles2018.php) |
Tatoeba | 8,263 | Collections of sentences from Tatoeba | Tatoeba (http://tatoeba.org) |
Ubuntu | 10,572 | Texts from Ubuntu translation files | Opus (http://opus.nlpl.eu/Ubuntu.php) |
Total | 3,589,052 |
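The per-source counts above can be checked against the stated total and turned into shares of the corpus. The following is a minimal sketch, not part of the ParIce distribution: the dictionary `PAIRS` and the helper `source_shares` are hypothetical names introduced here for illustration, and the numbers are copied by hand from the table.

```python
# Pair counts copied from the table above (source name -> sentence pairs).
# PAIRS and source_shares are illustrative names, not part of ParIce itself.
PAIRS = {
    "The Bible": 65_241,
    "EEA documents": 1_701_172,
    "Patient information leaflets (EMA)": 404_333,
    "European Southern Observatory (ESO)": 12_633,
    "Statistics Iceland": 2_288,
    "Icelandic Sagas": 17_597,
    "KDE4": 49_909,
    "Classical literature": 12_416,
    "OpenSubtitles": 1_304_628,
    "Tatoeba": 8_263,
    "Ubuntu": 10_572,
}


def source_shares(pairs):
    """Return (name, count, percent of total) tuples, largest source first."""
    total = sum(pairs.values())
    ranked = sorted(pairs.items(), key=lambda item: item[1], reverse=True)
    return [(name, count, 100.0 * count / total) for name, count in ranked]


if __name__ == "__main__":
    total = sum(PAIRS.values())
    for name, count, percent in source_shares(PAIRS):
        print(f"{name:<40}{count:>12,}{percent:>8.1f}%")
    print(f"{'Total':<40}{total:>12,}")
```

The eleven counts sum to the 3,589,052 pairs given in the Total row. EEA documents and OpenSubtitles together account for well over 80% of the pairs, which is worth keeping in mind when using the corpus for domain-sensitive tasks.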
Information about the processing of the texts, i.e. alignment, filtering, tagging and lemmatization, can be found in an article by Starkaður Barkarson and Steinþór Steingrímsson (2019).
Project manager:
Software development
E-mail: malfong@malfong.is