CLARIN-IS Repository

The CLARIN-IS repository (repository.clarin.is) contains a wide range of resources: tools for language processing, corpora, lexica and language descriptions of various kinds. All the products of the Language Technology Programme for Icelandic (2019–2023) were uploaded to the repository, as were the resources that used to be on www.malfong.is. The repository has a powerful search engine, but to provide a simple overview, most of the resources are listed here.

Corpora

Treebanks

  • The Icelandic Contemporary Treebank (IceConTree) 1.1 | 1.0
  • The Icelandic Parsed Historical Corpus (IcePaHC) 0.9
  • The Faroese Parsed Historical Corpus 1.0
  • NeuralMIcePaHC 20.05 | 20.04
  • GreynirCorpus 21.06 | 20.05 | 20.05

Tagged monolingual corpora

Error corpora

  • Icelandic Error Corpus (IceEC) 1.1 | 1.0 | 0.9
  • The Icelandic Child Language Error Corpus (IceCLEC) 1.1 | 1.0
  • The Icelandic L2 Error Corpus (IceL2EC) 1.2 | 1.1 | 1.0
  • The Icelandic Dyslexia Error Corpus (IceDEC) 1.1 | 1.0
  • Icelandic Taboo Database (iceTaboo) 1.0
  • Icelandic Error Corpus Nonwords 20.09

Parallel corpora

  • ParIce: English-Icelandic parallel corpus 21.10 | 19.10
  • ParIce: Dev/Test/Train Splits 21.10 | 20.05
  • Icelandic-English test set for sentence alignment 21.10
  • Icelandic-English Classification Training Set for Parallel Sentence Alignment Filtering ??
  • Icelandic-English Parallel Sentence Extraction Dataset 21.10
  • En-Is Parallel Named Entity Robustness Corpus - Test data 1.0
  • En-Is Synthetic Parallel Astronomy Corpus with Injected Vocabulary 1.0
  • En-Is Synthetic Parallel Corpus 21.07 | 20.09
  • En-Is Synthetic Parallel Named Entity Robustness Corpus 1.0
  • En-Is Semi-Synthetic Parallel Name Robustness Corpus 1.0
  • cities_is2en 20.09 | 20.05
  • countries_is2iso 20.09 | 20.05
  • isprep4cc 20.09 | 20.05
  • isprep4isloc 20.09 | 20.05

Voice samples and sound files

  • Talrómur Corpus 21.02
  • Talrómur Corpus 2 21.12
  • Samrómur Corpus 21.05
  • Samrómur Queries 21.12
  • Samrómur Children 21.09
  • Spjallrómur Corpus - Icelandic Conversational Speech 22.01
  • Kennslurómur Corpus - Icelandic Lectures 22.01
  • RÚV TV data 20.12
  • RUV TV unknown speakers 22.02
  • Islex Recordings 1.0
  • Test Set for TTS Intelligibility Tests 22.01
  • The Hjal Corpus download
  • Málrómur Corpus download
  • Parliament Speech Corpus download
  • Corpus of Althingi's Parliamentary Speeches for ASR download
  • The Jensson Corpus download
  • The Thor Corpus download
  • The Rúv Corpus download

Other corpora

  • The Icelandic Confusion Set Corpus (ICoSC) 2.0 | 1.0
  • Text Normalization Corpus 21.10
  • NQiI - Natural Questions In Icelandic 1.1 | 1.0
  • Icelandic WinoGrande 1.0

Wordlists and dictionaries

Dictionaries and wordnets

  • Icelandic Pronunciation Dictionary 22.01 | 21.10 | 21.02 | 21.01
  • A Dictionary of Contemporary Icelandic 2020
  • Icelandic Hyphenation Dictionary 2.0 | 1.0
  • Islex - Icelandic-Scandinavian multilingual dictionary 2013
  • English-Icelandic/Icelandic-English glossary 21.09
  • The Icelandic Wordweb 21.06 | 21.02 | 20.09

Other wordlists

  • DMII - Abbreviations 21.10
  • Stop-words for the Icelandic Gigaword Corpus 21.08
  • Gold Alignments for English-Icelandic Word Alignments 21.04
  • IceBATS - The Icelandic Bigger Analogy Test Set 21.06
  • Icelandic Multi-SimLex 21.06
  • Icelandic Search Query Errors (IceSQuEr) 0.1
  • Translations of institutions, companies and titles 1.0

Language descriptions

The Database of Icelandic Morphology

  • The Database of Modern Icelandic Inflection (DMII) 19.10
  • DIM Valency Structures 21.10
  • DMII - The Comprehensive Format 21.10
  • DMII Core download

Other

  • Icegrams 1.1.1 | 20.09
  • Icelandic Pronunciation 20.10
  • Icelandic Language Models with Pronunciations 22.01
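  • Framburðarorðabókin (The Pronunciation Dictionary) download
  • Almenn framburðarorðabók fyrir talgreiningu (General pronunciation dictionary for speech recognition) download
  • Mynstur og setningar (Patterns and sentences) download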

Tools and models

Tokenizers, PoS taggers, lemmatizers and parsers

Named entity recognition

  • Icelandic NER API - Ensemble model 21.09
  • Icelandic NER API - ELECTRA-base model 21.05

Machine translation systems and models

  • GreynirTranslate - mBART25 NMT models for Translations between Icelandic and English 1.0
  • GreynirTranslate - mBART25 NMT (with layer drop) models for Translations between Icelandic and English 1.0
  • GreynirT2T - En--Is NMT with Tensor2Tensor 1.0
  • GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models 1.0
  • MT: Moses-SMT 1.0

Speech synthesis and speech recognition

  • RÚV-DI Speaker Diarization 21.10 | 20.09
  • RÚV-DI Speaker Diarization v5 models 21.05
  • Tiro TTS web service 1.0
  • Tiro Web interface for speech recognition 1.0
  • Prosody feature extraction with speaker information 20.09
  • MOSI: TTS evaluation tool 22.01
  • Samrómur-Children Demonstration Scripts 22.01
  • Webrice extension 22.01
  • Models for automatic g2p for Icelandic 20.10
  • Rule-based g2p for Icelandic 20.10
  • Editor for pronunciation dictionaries 20.10
  • Punctuation model 20.09

Spell and grammar checking

  • Multilabel Error Classifier (Icelandic Error Corpus categories) for Sentences 22.01
  • GreynirCorrect 3.2.1 | 3.2.0 | 1.0.2 | 1.0.0

Other

  • Alexia Lexicon Acquisition Tool for Icelandic 3.0 | 2.0 | 1.0
  • Hunspell-IS. Spell checker, morphological analyzer & thesaurus for Icelandic download
  • BinPackage 0.4.2 | 0.3.1
  • Skiptir 20.10
  • UDConverter 22.01

Other resources

Listed below are a few resources that are not in the CLARIN-IS repository but can be searched or downloaded on other sites.

Dictionaries and word lists

Corpora - text files