CLARIN-IS Repository

The CLARIN-IS repository (repository.clarin.is) contains many resources: tools for language processing, corpora, lexica and language descriptions of various kinds. All products of the Language Technology Programme for Icelandic (2019–2023) have been uploaded to the repository, as have the resources formerly hosted on www.malfong.is. The repository has a powerful search engine, but to provide a simple overview, most of the data are listed here.

Corpora

Treebanks

  • The Icelandic Contemporary Treebank (IceConTree) 1.1 | 1.0
  • The Icelandic Parsed Historical Corpus (IcePaHC) 2024.03 | 0.9
  • The Faroese Parsed Historical Corpus 0.1
  • NeuralMIcePaHC 20.05 | 20.04
  • GreynirCorpus 21.06 | 20.05 | 20.05
  • UD GreynirCorpus 22.06

Tagged monolingual corpora

Error corpora

  • Icelandic Error Corpus (IceEC) 1.1 | 1.0 | 0.9
  • The Icelandic Child Language Error Corpus (IceCLEC) 1.1 | 1.0
  • The Icelandic L2 Error Corpus (IceL2EC) 1.3 | 1.2 | 1.1 | 1.0
  • The Icelandic Dyslexia Error Corpus (IceDEC) 1.2 | 1.1 | 1.0
  • Icelandic Taboo Database (iceTaboo) 1.0
  • Icelandic Error Corpus Nonwords 20.09
  • Spell and grammar checking – Thesis testing 22.10

Parallel corpora

  • ParIce: English-Icelandic parallel corpus 21.10 | 19.10
  • ParIce: Dev/Test/Train Splits 21.10 | 20.05
  • Icelandic-English test set for sentence alignment 21.10
  • Icelandic-English Classification Training Set for Parallel Sentence Alignment Filtering ??
  • Icelandic-English Parallel Sentence Extraction Dataset 21.10
  • En-Is Parallel Named Entity Robustness Corpus - Test data 1.0
  • En-Is Synthetic Parallel Astronomy Corpus with Injected Vocabulary 1.0
  • En-Is Synthetic Parallel Corpus 21.07 | 20.09
  • En-Is Synthetic Parallel Named Entity Robustness Corpus 1.0
  • En-Is Semi-Synthetic Parallel Name Robustness Corpus 1.0
  • cities_is2en 20.09 | 20.05
  • countries_is2iso 20.09 | 20.05
  • isprep4cc 20.09 | 20.05
  • isprep4isloc 20.09 | 20.05
  • Long Context Synthetic Translation Pairs for English and Icelandic 22.09

Voice samples and sound files

  • Talrómur Corpus 21.02
  • Talrómur Corpus 2 22.10 | 21.12
  • Samrómur Corpus 21.05
  • Samrómur Queries 21.12
  • Samrómur Children 21.09
  • Samrómur L2 22.09
  • Samrómur Mimics 22.09
  • Samromur Unverified 22.07
  • Spjallrómur Corpus - Icelandic Conversational Speech 22.01
  • Kennslurómur Corpus - Icelandic Lectures 22.01
  • Raddrómur Icelandic Speech 22.09
  • RÚV TV data 20.12
  • RUV TV unknown speakers 22.02
  • Islex Recordings 1.0
  • Test Set for TTS Intelligibility Tests 22.01
  • The Hjal Corpus download
  • Málrómur Corpus download
  • Parliament Speech Corpus download
  • Corpus of Althingi's Parliamentary Speeches for ASR download
  • The Jensson Corpus download
  • The Thor Corpus download
  • The Rúv Corpus download
  • Ravnursson Faroese Speech and Transcripts download

Other corpora

  • The Icelandic Confusion Set Corpus (ICoSC) 2.0 | 1.0
  • Text Normalization Corpus 21.10
  • NQiI - Natural Questions In Icelandic 1.1 | 1.0
  • Icelandic WinoGrande 1.0
  • The Reykjavik University Question-Answering Dataset (RUQuAD) 22.02

Wordlists and dictionaries

Dictionaries and wordnets

  • Icelandic Pronunciation Dictionary 22.01 | 21.10 | 21.02 | 21.01
  • A Dictionary of Contemporary Icelandic 2020
  • Icelandic Hyphenation Dictionary 1.0 | 2.0
  • Islex - Icelandic-Scandinavian multilingual dictionary 2022 | 2013
  • English-Icelandic/Icelandic-English glossary 21.09
  • The Icelandic Word Web 21.06 | 21.02 | 20.09

Other wordlists

  • DMII - Abbreviations 21.10
  • Stop-words for the Icelandic Gigaword Corpus 21.08
  • Gold Alignments for English-Icelandic Word Alignments 21.04
  • IceBATS - The Icelandic Bigger Analogy Test Set 21.06
  • Icelandic Multi-SimLex 21.06
  • Icelandic Search Query Errors (IceSQuEr) 0.1
  • Translations of institutions, companies and titles 22.01
  • Word frequency list from the Icelandic Corpus for Academic Words (MÍNO) 1.0
  • The Icelandic Academic Word List (LÍNO) 1.0
  • Idiomatic Expressions in Icelandic and English 22.09

Language descriptions

The Database of Icelandic Morphology

Other

  • Icegrams 1.1.1 | 20.09
  • Icelandic Pronunciation 20.10
  • Icelandic Language Models with Pronunciations 22.01
  • Pronunciation Dictionary for Icelandic download
  • General Pronunciation Dictionary for ASR download
  • Patterns and sentences download

Tools and models

Tokenizers, PoS taggers, lemmatizers and parsers

Named entity recognition

  • Icelandic NER API - Ensemble model 21.09
  • Icelandic NER API - ELECTRA-base model 21.05

Translation machines and models

  • GreynirTranslate - mBART25 NMT models for Translations between Icelandic and English 1.0
  • GreynirTranslate - mBART25 NMT (with layer drop) models for Translations between Icelandic and English 1.0
  • GreynirT2T - En--Is NMT with Tensor2Tensor 1.0
  • GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models 1.0
  • MT: Moses-SMT 1.0
  • GreynirSeq Domain Translation Pipeline 22.06
  • Semi-supervised Icelandic-Polish Translation System 22.09
  • Long Context Translation Models for English-Icelandic translations 22.09
  • Optimized Long Context Translation Models for English-Icelandic translations 22.09

Speech Recognition

  • RÚV-DI Speaker Diarization 21.10 | 20.09
  • RÚV-DI Speaker Diarization v5 models 21.05
  • Tiro Web interface for speech recognition 1.0
  • Samrómur-Children Demonstration Scripts 22.01
  • Samrómur-Adolescents Kaldi Recipe 22.06
  • Samrómur-NeMo Recipe 22.06
  • Samrómur-L2 Kaldi Recipe 22.10
  • Samrómur-DeepSpeech Recipe 22.06
  • Punctuation model 20.09
  • 6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06
  • DeepSpeech Scorer for Icelandic 22.06
  • Heyra 1.0
  • Voice control and question answering 22.10

Speech synthesis

  • Tiro TTS web service 22.10 | 22.06 | 1.0
  • Prosody feature extraction with speaker information 20.09
  • MOSI: TTS evaluation tool 22.01
  • Webrice extension 22.09 | 22.01
  • WebRICE - An Open Source Web Reader 21.06
  • TTS Text Processing 22.10
  • TTS Document Reader 22.10
  • Icelandic TTS for Android 22.10
  • Multi-speaker GlowTTS model for Talrómur 2 (prerelease) 22.10
  • GlowTTS models for Talrómur 1 22.10
  • Talrómur TTS- model 22.10

Various tools for speech recognition and synthesis

  • MAFIA (Match-Finder Aligner): A speech/text aligning tool 22.06
  • Speech Corpora Toolkit 22.06
  • Upload2S3 22.06

Grapheme-to-phoneme (g2p)

  • Models for automatic g2p for Icelandic 20.10
  • Rule-based g2p for Icelandic 20.10
  • Editor for pronunciation dictionaries 20.10
  • g2p-service 20.11
  • Grapheme-to-phoneme (g2p) module for Icelandic 22.10

Spell and grammar checking

  • Multilabel Error Classifier (Icelandic Error Corpus categories) for Sentences 22.01
  • GreynirCorrect 3.4.5 | 3.4.4 | 3.2.1 | 3.2.0 | 1.0.2
  • Yfirlestur 1.0.1 | 1.0.0
  • Yfirlestur Docs 22.10
  • Yfirlestur Word 22.10
  • Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur 22.09
  • Error Classifier (Icelandic Error Corpus categories) for Tokens 22.05
  • Hunspell-IS. Spell checker, morphological analyzer & thesaurus for Icelandic download
  • Binary Error Classifier for Icelandic Sentences 22.09
  • Spellchecking app for Android 22.10

Word embeddings

  • Word Embeddings – Word2Vec optimized for IceBATS 22.04
  • Word Embeddings – GloVe optimized for IceBATS 22.04
  • Word Embeddings - FastText optimized for IceBATS 22.04

Other

  • Alexia Lexicon Acquisition Tool for Icelandic 3.0 | 2.0 | 1.0
  • Skiptir 20.10
  • Annotald 1.0.0
  • GreynirSeq - A Natural Language Processing Toolkit for Icelandic 0.2.0
  • OCR Post-Processing Tool for Icelandic 22.10
  • AnySoftKeyboard with custom autocompletion 22.10
  • IceEval - Icelandic Natural Language Processing Benchmark 22.09

Other resources

Listed below are a few resources that are not in the CLARIN-IS repository but can be searched or downloaded on other sites.

Dictionaries and word lists

Corpora - text files