ParIce

http://hdl.handle.net/20.500.12537/16

 

ParIce is a parallel corpus of Icelandic and English texts, containing 3,589,052 sentence pairs. The texts were aligned at the sentence level and the word level. The texts have also been tagged and lemmatized. The Icelandic text contains 46,727,741 tokens.

 The corpus is available in two ways.

  • Search. The corpus can be searched online, and the tags (linguistic annotation) can be used to narrow the search. The search returns a KWIC (keyword in context) index and information about the source of each text example. The search interface is based on Korp, developed by Språkbanken in Gothenburg.
  • Download. The corpus can be downloaded in its entirety. The texts are in an XML format, TEI P5, which is defined by TEI (Text Encoding Initiative). All texts are accompanied by metadata (bibliographic for published works). The texts can be used under a CC BY 4.0 license. All users are registered with their e-mail address when they accept the user license. The texts from OpenSubtitles have to be downloaded separately from http://opus.nlpl.eu/.
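
As an illustration of working with the downloaded files, the following is a minimal sketch of reading sentences and lemmas from a TEI P5 document in Python. It assumes that sentences are encoded as <s> elements and tokens as <w> (words) and <pc> (punctuation) elements with a lemma attribute, which are standard TEI P5 elements; the exact layout of the ParIce files is not described here, so the element names, the sample document, and the function name read_sentences are assumptions that may need adjusting.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample in the assumed layout; not taken from ParIce itself.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s n="1"><w lemma="hestur">Hesturinn</w> <w lemma="hlaupa">hleypur</w><pc>.</pc></s>
  </p></body></text>
</TEI>"""


def read_sentences(xml_text):
    """Return one list of (token, lemma) tuples per <s> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        tokens = []
        for el in s:
            # Strip the "{namespace}" prefix from the tag name.
            tag = el.tag.split("}")[-1]
            if tag in ("w", "pc"):
                # Punctuation may carry no lemma; el.get then returns None.
                tokens.append((el.text or "", el.get("lemma")))
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        print(sentence)
```

For a downloaded file, the same function can be applied to the file's contents once the actual element and attribute names have been checked against the corpus.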

When publishing results based on the texts in ParIce please refer to: Barkarson, Starkaður, and Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics, pp. 140-145. Turku, Finland.

About ParIce

Texts

The texts in ParIce are collected from eleven different sources, mostly from available parallel corpora (Opus, Tilde, ELRC) or retrieved from websites.

Text | Pairs | Description | Origin
The Bible | 65,241 | Icelandic text of Biblía 21. aldar (2007) and English text of the King James Version. | English text: http://christos-c.com; Icelandic text: http://biblian.is
EEA documents | 1,701,172 | Regulations and other documents of the EEA | PDF files from http://eur-lex.europa.eu
Patient information leaflets (EMA) | 404,333 | Package leaflets (EMA) | Tilde (https://tilde-model.s3-eu-west-1.amazonaws.com/Tilde_MODEL_Corpus.html)
European Southern Observatory (ESO) | 12,633 | Press releases from the European Southern Observatory. | http://www.eso.org/public/
Statistics Iceland - from website | 2,288 | Texts from the website of Statistics Iceland | ELRC (http://www.lr-coordination.eu/resources)
Icelandic Sagas | 17,597 | Texts of twelve Sagas: Brennunjáls saga, Eiríks saga rauða, Eyrbyggjasaga, Grettis saga, Gunnlaugs saga ormstungu, Hænsna-Þóris saga, Hávarðar saga Ísfirðings, Heiðarvíga saga, Hrafnkels saga freysgoða, Laxdæla saga, Þórðar saga hreðu, Víga-Glúms saga | Icelandic texts: Publications of Svart á hvítu from 1985-1986 (http://clarin.is/en/resources/sagacorpus/); English texts: Project Gutenberg (http://www.gutenberg.org)
KDE4 | 49,909 | Texts from the localization files of KDE | Opus (http://opus.nlpl.eu/KDE4.php)
Classical literature | 12,416 | Texts of four books from the 19th century: Michael Strogoff by Jules Verne, Rupert of Hentzau by Anthony Hope, The Subjection of Women by John Stuart Mill, The Prisoner of Zenda by Anthony Hope | Icelandic texts: http://rafbokavefur.is; English texts: Project Gutenberg (http://www.gutenberg.org)
OpenSubtitles | 1,304,628 | Collections of subtitles of films and TV series from 1931 to 2016. | Opus (http://opus.nlpl.eu/OpenSubtitles2018.php)
Tatoeba | 8,263 | Collections of sentences from Tatoeba | Tatoeba (http://tatoeba.org)
Ubuntu | 10,572 | Texts from Ubuntu translation files | Opus (http://opus.nlpl.eu/Ubuntu.php)
Total | 3,589,052 | |

Alignment, filtering, tagging and lemmatization

Information about the processing of the texts, i.e. alignment, filtering, tagging and lemmatization, can be found in an article by Starkaður Barkarson and Steinþór Steingrímsson (2019).

Project manager:

  • Steinþór Steingrímsson

Software development

  • Rose Costa
  • Starkaður Barkarson

Contact:

E-mail: malfong@malfong.is