CLARIN-IS Repository

The repository of CLARIN-IS (repository.clarin.is) contains a wide range of resources: tools for language processing, corpora, lexica and language descriptions of various kinds. All the products of the Language Technology Programme for Icelandic (2019–2023) were uploaded to the repository, as were the resources that used to be on www.malfong.is. The repository has a powerful search engine, but in order to provide a simple overview, most of the resources are listed here.


Corpora

A treebank is a parsed corpus that holds information about sentence construction and phrases. The Icelandic treebanks are parsed according to the Penn Parsed Corpora of Historical English (PPCHE) parsing scheme but have in some cases been adapted to Icelandic sentence construction. The Icelandic Parsed Historical Corpus (IcePaHC) (“Sögulegi íslenski trjábankinn”) and the Faroese Parsed Historical Corpus (FarPaHC) (“Sögulegi færeyski trjábankinn”) were both manually annotated. The Icelandic Contemporary Treebank – IceConTree (“Samtímalegi íslenski trjábankinn”) and NeuralMIcePaHC (“Taugavélþáttaði IcePaHC-trjábankinn”) were parsed automatically using IceNeuralParsingPipeline. GreynirCorpus holds 10 million sentences, mostly from news media published from 2015 to 2021, and was tagged using Miðeind’s Greynir language processor, which applies a tagging scheme similar to that of the treebanks. A section of the GreynirCorpus – its gold standard – has been manually annotated. This section was mapped onto the relational treebank UD GreynirCorpus using UDConverter. Read more about relational treebanks at https://universaldependencies.org.

  • The Icelandic Contemporary Treebank (IceConTree) 1.1 | 1.0
  • Icelandic Parsed Historical Corpus (IcePaHC) 2024.03 | 0.9
  • Faroese Parsed Historical Corpus (FarPaHC) 0.1
  • NeuralMIcePaHC 20.05 | 20.04
  • GreynirCorpus 21.06 | 20.05 | 20.05
  • UD GreynirCorpus 22.06
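
Treebanks in the Penn tradition are typically stored as labelled bracketings. The following is a minimal Python sketch of how such a bracketing can be read into a nested structure. The example sentence and its labels are made up for illustration and are not taken from any of the treebanks listed above; the function names are hypothetical as well.

```python
import re

def parse_brackets(text):
    """Parse a labelled bracketing into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]           # phrase or word-class label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())        # nested phrase
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()

def leaves(tree):
    """Return the words of a tree in sentence order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

# Toy sentence "hesturinn hleypur" ("the horse runs"); labels are illustrative only.
example = "(IP-MAT (NP-SBJ (N hesturinn)) (VB hleypur))"
tree = parse_brackets(example)
print(tree)
print(leaves(tree))
```

The sketch assumes well-formed input with exactly one root node; the actual treebank files carry additional information (such as lemmas and identifiers) that it does not handle.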

There are various types of tagged corpora. Often, such corpora consist of a text corpus that has been segmented (split into sentences and tokens), grammatically tagged (with each token being tagged with a string that signifies word class, gender, case etc.) and lemmatised (each token is given a lemma, i.e. a dictionary headword, e.g. hestur for hests). This is the case with the Icelandic Gigaword Corpus, the Tagged Icelandic Corpus, the Icelandic Frequency Dictionary, the Saga Corpus and the Corpus of Icelandic Academic Vocabulary. Of these, the Icelandic Frequency Dictionary (“Orðtíðnibókin” – OTB) is the oldest. It was published in 1991 and contains around 500,000 running words harvested from extracts of 100 texts dated from 1980 to 1989. The Tagged Icelandic Corpus (“Mörkuð íslensk málheild” – MÍM) is much bigger. It contains around 25 million words from a wide variety of texts dated from 2000 to 2010. The Icelandic Gigaword Corpus (“Risamálheildin” – RMH) is the most recent and largest of these corpora and is regularly updated with new texts. Unlike OTB and MÍM, the RMH is not “balanced”, i.e. texts are not selected so that each text type is represented by an approximately equal number of texts. Instead, the corpus contains all the texts that are available to it. Because of this, it contains far more texts from e.g. news media and government documents than literary texts or extracts from academic journals. The Saga Corpus (“Fornritin”) contains tagged texts from the Icelandic sagas Sturlunga, Heimskringla and Landnámabók. The Corpus of Icelandic Academic Vocabulary (“Málheild fyrir íslenskan námsorðaforða” – MÍNO) was created from a selection of texts from MÍM and RMH with the goal of creating a word frequency list for Icelandic academic vocabulary.

MÍM has been used to create gold standards for grammatical tagging (MÍM-GULL), named entity recognition (MÍM-GULL-NER) and entity linking (MÍM-GULL-EL). Gold standards can be used to test and train various tools. To perform a test on a tool, the gold standard must be divided into a training set and a testing set, as has been done with MÍM-GULL. RMH and OTB have also been used to create training and testing sets that can be used to train e.g. taggers and lemmatisers.
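
The division into a training set and a testing set can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the toy sentences, the coarse tags, the 80/20 split and the most-frequent-tag baseline are not the procedure used for MÍM-GULL, which ships with its own predefined split.

```python
import random
from collections import Counter, defaultdict

# Toy gold standard: each sentence is a list of (token, tag) pairs.
# The coarse tags are hypothetical, not the tagset of any corpus listed here.
gold = [
    [("hesturinn", "NOUN"), ("hleypur", "VERB")],
    [("konan", "NOUN"), ("les", "VERB")],
    [("hesturinn", "NOUN"), ("sefur", "VERB")],
    [("konan", "NOUN"), ("hleypur", "VERB")],
    [("drengurinn", "NOUN"), ("les", "VERB")],
]

def split(sentences, test_fraction=0.2, seed=0):
    """Shuffle whole sentences and hold out a fraction of them for testing."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_baseline(train):
    """Learn the most frequent tag for every token seen in the training set."""
    counts = defaultdict(Counter)
    for sentence in train:
        for token, tag in sentence:
            counts[token][tag] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}

def accuracy(model, test, default="NOUN"):
    """Share of test tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sentence in test:
        for token, tag in sentence:
            total += 1
            if model.get(token, default) == tag:
                correct += 1
    return correct / total

train_set, test_set = split(gold)
model = train_baseline(train_set)
print(f"tagging accuracy on held-out data: {accuracy(model, test_set):.2f}")
```

Keeping the test sentences out of training is what makes the reported accuracy meaningful: a tool evaluated on sentences it has already seen would look better than it is.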

Corpora
Gold standards and test sets

An error corpus is a corpus where errors have been tagged, e.g. with regard to spelling, grammar etc. Among other things, error corpora can be used to develop and train grammar checkers. The Icelandic Error Corpus (“Íslenska villumálheildin” – IceEC), the Icelandic Child Language Error Corpus (“Villumálheild íslensk barnamáls” – IceCLEC), the Icelandic L2 Error Corpus (“Villumálheild íslensku sem annars máls” – IceL2EC) and the Icelandic Dyslexia Error Corpus (“Íslenska lesblinduvillumálheildin” – IceDEC) were all created at the University of Iceland and were all developed in a similar manner, using the same error codes. The Icelandic Error Corpus Nonwords (“Óorð íslensku villumálheildarinnar”) and the Icelandic Search Query Errors (“Listi af handleiðréttum atriðum í lokaritgerðum” – IceSQuEr) list error words from texts along with corrections. The Icelandic Taboo Database (“Gagnagrunnur íslenskra bannorða” – IceTaboo) contains a list of Icelandic words that might be considered inappropriate and/or value-laden in one way or another.

Corpora
  • Icelandic Error Corpus (IceEC) 1.1 | 1.0 | 0.9
  • The Icelandic Child Language Error Corpus (IceCLEC) 1.1 | 1.0
  • The Icelandic L2 Error Corpus (IceL2EC) 1.3 | 1.2 | 1.1 | 1.0
  • The Icelandic Dyslexia Error Corpus 1.2 | 1.1 | 1.0
Lists
  • Icelandic Taboo Database (IceTaboo) 1.0
  • Icelandic Error Corpus Nonwords 20.09
  • Icelandic Search Query Errors (IceSQuEr) 0.1
  • Spell and grammar checking – Thesis testing 22.10

A parallel corpus is a collection of texts in two or more languages that have been aligned at least at the sentence level, so that a sentence in one language is matched with a sentence in another language. All of the corpora and lists are in English and Icelandic except the Data for translation between Polish and Icelandic. ParIce is an English-Icelandic parallel corpus intended for training machine translation software. It consists of various subcorpora and includes around 3.5 million paired sentences. Other corpora listed below are so-called synthetic corpora, which can be used when there is an insufficient number of aligned texts available. The En-Is Synthetic Parallel Corpus was made using back-translation, where a machine translation tool is used to translate a text (e.g. Icel.-Eng.) and the resulting translation is then used as training data for a model that translates back into the original language (Eng.-Icel.). The En-Is Synthetic Parallel Astronomy Corpus with Injected Vocabulary, by contrast, was created by taking a parallel corpus and replacing some of its words with less common words. The En-Is Synthetic Parallel Named Entity Robustness Corpus and the En-Is Semi-Synthetic Parallel Name Robustness Corpus are synthetic corpora that focus on giving various proper nouns more weight during training by inserting them into the texts.
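
The back-translation idea can be sketched in a few lines of Python. This is only an illustration of the data flow: the function names are hypothetical, and the word-for-word lookup table stands in for a real Icelandic-English machine translation system, which the sketch does not include.

```python
from typing import Callable, List, Tuple

def back_translate(
    monolingual_is: List[str],
    is_to_en: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each authentic Icelandic sentence with a machine-made English version.

    The result is a synthetic English-Icelandic parallel corpus: the English
    side is machine output, the Icelandic side is original text.
    """
    pairs = []
    for sentence in monolingual_is:
        synthetic_en = is_to_en(sentence)
        pairs.append((synthetic_en, sentence))
    return pairs

# Stand-in for a real MT system (assumption): a toy word-for-word lexicon.
TOY_LEXICON = {
    "hesturinn": "the horse",
    "hleypur": "runs",
    "konan": "the woman",
    "les": "reads",
}

def toy_is_to_en(sentence: str) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.lower().split())

pairs = back_translate(["Hesturinn hleypur", "Konan les"], toy_is_to_en)
for english, icelandic in pairs:
    print(f"{english}\t{icelandic}")
```

The point of the design is that the target side (here Icelandic) is always human-written text, so a model working from English into Icelandic sees fluent Icelandic even when the English side is imperfect.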

Parallel testsets can be used to train and test models that are intended for translating between two languages. ParIce Dev/Test Sets consists of a selection of texts from the ParIce corpus in which the text alignment has been manually annotated; it can be used to train tools that translate between English and Icelandic. The Icelandic-English Test Set for Sentence Alignment is intended for testing automatic sentence alignment tools. The Icelandic-English Classification Training Set for Parallel Sentence Alignment Filtering serves a related purpose but is a training set for a classifier that separates high-quality parallel sentences from less precise ones. The Icelandic-English Parallel Sentence Extraction Dataset can be used to test the accuracy of parallel sentence extraction processes for comparable corpora. En-Is Parallel Named Entity Robustness Corpus – Test Data contains testsets to assess translations of named entity tokens (e.g. people or place names) between Icelandic and English.

The lists cities_is2en (city names), countries_is2iso (country names), isprep4cc (prepositions preceding country names) and isprep4isloc (prepositions preceding city and place names) can be used to ensure that the names of cities and countries are translated correctly and used with the correct prepositions in Icelandic.

Corpora (Icelandic and English)
  • ParIce: English-Icelandic parallel corpus 21.10 | 19.10 
  • En-Is Synthetic Parallel Astronomy Corpus with Injected Vocabulary 1.0
  • En-Is Synthetic Parallel Corpus 21.07 | 20.09
  • Long Context Synthetic Translation Pairs for English and Icelandic 22.09
  • En-Is Synthetic Parallel Named Entity Robustness Corpus 1.0
  • En-Is Semi-Synthetic Parallel Name Robustness Corpus 1.0
Corpora (Icelandic and Polish)
  • Data for translation between Polish and Icelandic 24.09
Testsets
  • ParIce Dev/Test Sets 21.10 | 20.05 
  • Icelandic-English test set for sentence alignment 21.10
  • Icelandic-English Classification Training Set for Parallel Sentence Alignment Filtering download
  • Icelandic-English Parallel Sentence Extraction Dataset 21.10
  • En-Is Parallel Named Entity Robustness Corpus - Test data 1.0
Lists

The corpora all contain audio recordings and transcripts that can be used to develop language technology solutions for speech synthesisers and speech recognisers. Samrómur was made through crowdsourcing and thus includes a multitude of different speakers. Talrómur refers to three different datasets that contain short recordings by different speakers. Spjallrómur is a conversational speech corpus that contains 54 conversations by 102 speakers. Kennslurómur is a collection of audio recordings with transcripts of lectures recorded during courses at Reykjavík University and the University of Iceland. Raddrómur consists of audio samples taken from radio and podcasts (mainly from the Icelandic National Broadcasting Service – RÚV). RÚV TV and RÚV TV Unknown Speakers are databases that contain audio and transcripts from television programmes made by RÚV. All of the above-mentioned corpora are products of the Icelandic government’s Language Technology Programme for Icelandic 2019–2023. The other corpora are older.

The Speech Intelligibility Testing Data was used to assess the effect of spelling errors on speech synthesis comprehension.

Corpora
  • Talrómur Corpus 24.04 | 21.02
  • Talrómur Corpus 2 22.10 | 21.12
  • Talrómur Corpus 3 24.09
  • Samrómur Corpus 21.05
  • Samrómur - Queries 21.12
  • Samrómur - Children 21.09
  • Samrómur - L2 22.09
  • Samrómur - Mimics 22.09
  • Samrómur - Unverified 22.07
  • Spjallrómur - Icelandic Conversational Speech 22.01
  • Kennslurómur - Icelandic Lectures 22.01
  • Raddrómur - Icelandic Speech 22.09
  • RÚV TV data 20.12
  • RÚV TV unknown speakers 22.02
  • The Hjal Corpus download
  • Málrómur Corpus download
  • Parliament Speech Corpus download
  • Corpus of Althingi's Parliamentary Speeches for ASR download
  • The Jensson Corpus download
  • The Þór Corpus download
  • The RÚV Corpus download
  • Ravnursson Faroese Speech and Transcripts download
Testsets
  • Test Set for TTS Intelligibility Tests 22.01
  • Icelandic Standardization Benchmark Set: Spelling and Punctuation 24.09
  • Icelandic Standardization Benchmark Set: Language Usage 24.09
  • Icelandic Culture and History QA Dataset 24.10
  • Icelandic Linguistic Benchmark for LLMs 24.10
  • Labeled Corpus of Icelandic Homographs 24.04
  • The Icelandic Confusion Set Corpus (ICoSC) 2.0 | 1.0
  • Text Normalization Corpus 21.10
  • NQiI - Natural Questions In Icelandic 1.1 | 1.0
  • Icelandic WinoGrande 1.0
  • The Reykjavik University Question-Answering Dataset (RUQuAD) 22.02
  • IceSum - Icelandic Text Summarization Corpus 22.09 | 21.11
  • Icelandic Youth Language 

Wordlists, wordnets and dictionaries

Various types of dictionaries can be found in the repository. It holds snapshots of the data behind the online dictionaries Íslensk nútímamálsorðabók (Dictionary of Contemporary Icelandic) and ISLEX – Icelandic-Scandinavian multilingual dictionary, as the data stood at a certain point in time. The Database of Icelandic Morphology (“Beygingarlýsing íslensks nútímamáls” – BÍN) is a collection of paradigms that is accessible through the website of the Árni Magnússon Institute for Icelandic Studies. Five different BÍN databases can be found in the repository, each as a snapshot from a specific point in time. There is also BinPackage, a Python package with a standardised application programming interface that makes it easier for programmers and academics to use the BÍN data. Other dictionaries include e.g. pronunciation dictionaries. The Pronunciation Dictionary for Icelandic (“Framburðarorðabókin”) is a part of the Hjal project and contains around 50,000 phonetically transcribed word forms. The General Pronunciation Dictionary for ASR (“Almenn framburðarorðabók fyrir talgreiningu”) is based on the Pronunciation Dictionary for Icelandic but contains around 135,000 word forms and can be used to develop speech recognisers. Icelandic Pronunciation Dictionary for Language Technology (“Íslensk framburðarorðabók fyrir máltækni”) contains manually verified transcriptions of four pronunciation variants. The Icelandic Hyphenation Dictionary contains hyphenation patterns and lists of hyphenations that show how to hyphenate Icelandic words at line breaks.

Web dictionaries
  • A Dictionary of Contemporary Icelandic 2020  
  • Islex - Icelandic-Scandinavian multilingual dictionary 2022 | 2013
Database of Icelandic Morphology (DIM)
  • The Database of Modern Icelandic Inflection (DMII) 19.10  
  • DIM Valency Structures 21.10 
  • DMII - The Comprehensive Format 21.10 
  • DMII Core download
  • DMII - Abbreviations 21.10 
  • BinPackage 0.4.4 | 0.4.2 | 0.3.1
Other dictionaries
  • Icelandic Pronunciation Dictionary for Language Technology 22.01 | 21.10 | 21.02 | 21.01
  • Pronunciation Dictionary for Icelandic download
  • General Pronunciation Dictionary for ASR download
  • Icelandic Hyphenation Dictionary 2.0 | 1.0

Wordnets describe the semantic relations of words and phrases. IceWordNet is an Icelandic version of the Princeton Core WordNet, which classifies words into linked synsets. The Icelandic Wordweb (“Íslenskt orðanet”) is based on an alternative analysis of the semantic relations of Icelandic words and phrases (see website).

  • Stop-words for the Icelandic Gigaword Corpus 21.08
  • Word frequency list from the Icelandic Corpus for Academic Words (MÍNO) 1.0
  • The Icelandic Academic Word List (LÍNO) 1.0
  • English-Icelandic/Icelandic-English glossary 21.09
  • Idiomatic Expressions in Icelandic and English 22.09

Language descriptions

Word embeddings represent words as vectors, so that words with similar usage (drengur, strákur) receive similar numerical values and comparable semantic relations (maður – kóngur, kona – drottning) are reflected in comparable differences between vectors. The repository currently contains three sets of word embeddings, each of which has been trained using data from the Icelandic Gigaword Corpus.

  • Word Embeddings – Word2Vec optimized for IceBATS 22.04
  • Word Embeddings – GloVe optimized for IceBATS 22.04
  • Word Embeddings – FastText optimized for IceBATS 22.04
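
How similarity and analogy relations show up in word vectors can be sketched in Python as follows. The three-dimensional vectors are made up purely for illustration (real embeddings have hundreds of dimensions and are learned from text), and the function names are hypothetical rather than part of any of the packages above.

```python
import math

# Hand-made toy vectors (assumption): dimensions loosely mean
# "royal", "male", "female". Real embeddings are learned, not designed.
vectors = {
    "maður":     [0.0, 1.0, 0.0],
    "kona":      [0.0, 0.0, 1.0],
    "kóngur":    [1.0, 1.0, 0.0],
    "drottning": [1.0, 0.0, 1.0],
    "drengur":   [0.1, 0.9, 0.1],
    "strákur":   [0.1, 0.8, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word d for which 'a is to b as c is to d' fits best."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

print(cosine(vectors["drengur"], vectors["strákur"]))  # close to 1: similar usage
print(cosine(vectors["drengur"], vectors["kona"]))     # much lower
print(analogy("maður", "kóngur", "kona"))              # drottning
```

Analogy test sets such as IceBATS evaluate embeddings in roughly this way: given three words, the fourth is predicted from vector arithmetic and compared with the expected answer.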

N-grams are sequences of N consecutive words within a sentence; the most common are bigrams (two words) and trigrams (three words). For example, one might expect the trigram “einu sinni var” (once upon a time there was) to appear repeatedly in Icelandic fairytales. One of the uses of N-grams is to predict the following word within a sentence. Icegrams is a Python 3 package that contains a large collection of Icelandic trigrams.
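
A minimal Python sketch of trigram counting and next-word prediction follows. It uses only the standard library and a three-sentence toy corpus; it does not use Icegrams, and all names in it are hypothetical.

```python
from collections import Counter, defaultdict

def trigrams(tokens):
    """Return all runs of three consecutive tokens."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

# Toy corpus (assumption): three fairytale-style openings.
corpus = [
    "einu sinni var kóngur",
    "einu sinni var drottning",
    "einu sinni var kóngur og drottning",
]

counts = Counter()                 # how often each trigram occurs
following = defaultdict(Counter)   # which words follow each pair of words
for sentence in corpus:
    for w1, w2, w3 in trigrams(sentence.split()):
        counts[(w1, w2, w3)] += 1
        following[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Most frequent word seen after the pair (w1, w2), or None if unseen."""
    options = following.get((w1, w2))
    return options.most_common(1)[0][0] if options else None

print(counts[("einu", "sinni", "var")])   # 3
print(predict_next("einu", "sinni"))      # var
print(predict_next("sinni", "var"))       # kóngur
```

A real trigram collection is built from a far larger corpus and usually applies smoothing, so that word sequences never seen in the data still receive a small probability.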

  • IceBATS - The Icelandic Bigger Analogy Test Set 21.06
  • Icelandic Pronunciation 20.10
  • Patterns and sentences download

Tools and models

The tokenizer Tokenizer divides input texts into sentences and tokens (words and punctuation).
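
As a rough illustration of what this step produces, the following Python sketch splits text into sentences and tokens with a single regular expression. It is not the Tokenizer package and does not reflect its API; a real tokenizer also has to deal with abbreviations, numbers, dates and similar cases, which this sketch ignores.

```python
import re

SENTENCE_END = {".", "!", "?"}
# A token is either a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return a list of sentences, each a list of tokens."""
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        if token in SENTENCE_END:      # naive rule: these marks end a sentence
            sentences.append(current)
            current = []
    if current:                        # text without final punctuation
        sentences.append(current)
    return sentences

print(tokenize("Hesturinn hleypur. Sérðu hann?"))
# [['Hesturinn', 'hleypur', '.'], ['Sérðu', 'hann', '?']]
```

The naive rule breaks on abbreviations such as "t.d.", which is one reason dedicated tokenizers exist.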

The grammatical taggers ABL-tagger and CombiTagger tag each token in a text with a text string that designates word class as well as e.g. case, gender, tense and so on. ABL-tagger is the most commonly used tagger for Icelandic texts and provides the most accurate results. Named entity recognition (NER) means that words such as names for people, places and companies are tagged specifically in the input text. The repository contains two named entity recognition models (Icelandic NER API - Ensemble model and Icelandic NER API - ELECTRA-base model).

The lemmatiser ABL-lemmatiser reads tagged texts and lemmatises them, i.e. designates a lexical entry – a lemma – for each word (e.g. hestur for hests).
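
Why a lemmatiser works on tagged text can be shown with a small Python sketch. The lookup table and the coarse tags are hypothetical and chosen for illustration only; this is not how ABL-lemmatiser is implemented. The word form á is a useful case: as a verb it belongs to the lemma eiga, while as a noun or preposition its lemma is á.

```python
# Toy lookup from (word form, tag) to lemma. The tags are hypothetical
# coarse labels, not the tagset used by the tools listed here.
LEMMAS = {
    ("hests", "NOUN"): "hestur",
    ("á", "VERB"): "eiga",
    ("á", "NOUN"): "á",
    ("á", "ADP"): "á",
}

def lemmatise(tagged_tokens):
    """Map (form, tag) pairs to lemmas, falling back to the lowercased form."""
    return [LEMMAS.get((form.lower(), tag), form.lower()) for form, tag in tagged_tokens]

print(lemmatise([("Á", "VERB"), ("hests", "NOUN")]))   # ['eiga', 'hestur']
print(lemmatise([("á", "ADP")]))                       # ['á']
```

Without the tag, the form alone would not be enough to choose between the two lemmas.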

Parsers process texts and analyse their sentence structure according to a specific grammar. IceParser is a rule-based shallow parser, an improved version of the shallow parser found in the IceNLP package that was developed between 2004 and 2007. Greynir is a rule-based full parser based on a context-free grammar. Miðeind’s Neural Constituency Parser (“Tauganetsþáttari Miðeindar”) is a variant of the Berkeley Neural Parser. IceNeuralParsingPipeline (“Íslenska taugaþáttunarpípan”) is a parsing pipeline that contains all the steps necessary to parse plain Icelandic text, i.e. steps for the pre-processing, parsing and post-processing of a text. It was trained using the IcePaHC treebank.

Biaffine-Based UD Parser and COMBO-Based UD Parser are both UD parsers. UDConverter and UDConverter for GreynirCorpus are not actual parsers but can process parsed data and convert it into a UD structure.

Tokenizers
Taggers
  • ABL-tagger 3.0 | 2.0 | 1.0
  • CombiTagger 1.0 
  • Icelandic NER API - Ensemble model 21.09
  • Icelandic NER API - ELECTRA-base model 21.05
Lemmatizers
Parsers
  • IceParser 1.5.0
  • IceNLP Natural Language Processing toolkit 1.0  
  • GreynirPackage 3.5.2 | 3.5.1 | 3.1.0 | 2.6.1
  • Miðeind's Neural Constituency Parser 1.0
  • IceNeuralParsingPipeline 20.04
  • COMBO-based UD Parser 22.10
  • Biaffine-based UD Parser 22.10
  • UDConverter 22.01
  • UDConverter for GreynirCorpus 22.06

The repository contains several translation models that translate between Icelandic and English and one that translates between Icelandic and Polish. Long Context Translation for English-Icelandic translations (“Víðsamhengislíkan fyrir þýðingar milli ensku og íslensku”) is the newest model and the one that has been most successful with translations between English and Icelandic. Optimised Long Context Translation Models for English-Icelandic translations (“Bestað víðsamhengislíkan fyrir þýðingar milli ensku og íslensku”) is a lighter and faster model based on the aforementioned model. GreynirTranslate – mBart25 NMT models for Translation between Icelandic and English (“GreynirTranslate - mBART25 NMT þýðingarlíkön fyrir íslensku og ensku”) contains common translation models based on a multilingual BART-model.

MT: Moses-SMT is a system for developing and running statistical machine translation. GreynirT2T is a program library for training translation models that translate between Icelandic and English. GreynirT2T Serving contains programs and models for running GreynirT2T transformer machine translation models. GreynirSeq Domain Translation Pipeline is software that retrieves an Icelandic-English translation model and can adjust it for training on parallel data that is labelled according to domain.

Translation models
  • Long Context Translation Models for English-Icelandic translations 22.09
  • Optimized Long Context Translation Models for English-Icelandic translations 22.09
  • GreynirTranslate - mBART25 NMT models for Translations between Icelandic and English 1.0 
  • GreynirTranslate - mBART25 NMT (with layer drop) models for Translations between Icelandic and English 1.0
  • Semi-supervised Icelandic-Polish Translation System 22.09
Support tools
  • MT: Moses-SMT 1.0
  • GreynirSeq Domain Translation Pipeline 22.06
  • GreynirT2T - En--Is NMT with Tensor2Tensor 1.0
  • GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models 1.0

Speech recognition is when spoken language is transformed into text. The repository contains various recipes developed for the software Kaldi and other environments for developing speech recognisers. The scripts present different approaches for training speech recognisers.

Punctuation Model (“Greinarmerkingarlíkanið”) is a Python package for punctuating Icelandic text and is very useful for punctuating the output of speech recognisers. 6-GRAM Language Model in Icelandic for NeMo (“Íslenskt 6-stæðu mállíkan fyrir NeMo”) is an N-gram language model in binary format that is intended for speech recognisers developed within the NVIDIA NeMo environment. Heyra is an Android application for speech recognition.

Recipes
  • Samrómur-Children Demonstration Scripts 22.01
  • Samrómur-Adolescents Kaldi Recipe 22.06
  • Samrómur-L2 Kaldi Recipe 22.10
  • RÚV-DI Speaker Diarization 21.10 | 20.09
  • RÚV-DI Speaker Diarization v5 models 21.05
  • Voice control and question answering 22.10
  • Samrómur NeMo Recipe 22.06
  • Samrómur DeepSpeech Recipe 22.06
Other
  • Punctuation model 20.09
  • 6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06
  • DeepSpeech Scorer for Icelandic 22.06
  • Icelandic Language Models with Pronunciations 22.01
  • Heyra 1.0
  • Tiro Web interface for speech recognition 1.0

Speech synthesis is when text is transformed into spoken language. Several models have been trained using data from Talrómur. WebRICE is a web reader developed at Reykjavík University. WebRICE Extension is an add-on for Chrome users, and WebRICE – Web Reader is meant for users who want to add a web reader to their websites.

TTS Text Processing (“TTS Textavinnsla”) contains a text processing pipeline for Icelandic speech synthesisers. TTS Document Reader (“TTS Skjalalesari”) contains a web application that processes text and returns an audio file. Prosody Feature Extraction with Speaker Information (“FED-tól fyrir einkenni hljóðvistar með mælendaupplýsingum”) is a tool for labelling speakers in recorded conversations.

Models
  • Multi-speaker GlowTTS model for Talrómur 2 (prerelease) 22.10
  • GlowTTS models for Talrómur 1 22.10
  • Talrómur TTS model 22.10
Web reader
  • WebRICE extension 22.09 | 22.01
  • WebRICE - An Open Source Web Reader 21.06
Other
  • Tiro TTS web service 22.10 | 22.06 | 1.0
  • Prosody feature extraction with speaker information 20.09
  • MOSI: TTS evaluation tool 22.01
  • TTS Text Processing 22.10
  • TTS Document Reader 22.10
  • Icelandic TTS for Android 22.10

MAFIA can be used to automatically create speech recognition data from recordings and transcripts by pairing together sound and text. Speech Corpora Toolkit (“Tækjasafn fyrir talmálsheildir”) is a collection of tools for standardising recordings and transcripts in a manner that prepares them for segmentation and alignment.

  • Icelandic Homograph Classifier 24.04
  • MAFIA (Match-Finder Aligner): A speech/text aligning tool 22.06
  • Speech Corpora Toolkit 22.06

The repository contains three tools that can be used to transcribe Icelandic texts. Rule-Based g2p for Icelandic (“Reglubyggða hljóðritunarforritið”) is based on manually input rules while g2p Module for Icelandic (“Hljóðritunarforrit fyrir íslensku”) is based on models. The package Models for Automatic g2p for Icelandic (“Hljóðritunarlíkön fyrir íslensku”) contains models that were trained using an LSTM neural network and a script that utilises the models.

g2p-Service (“g2p-þjónustan”) and Editor for Pronunciation Dictionaries (“Vefviðmót til þess að vinna með framburðarorðabækur”) are both tools (web applications) for developing pronunciation dictionaries.

Grapheme-to-phoneme
  • Rule-based g2p for Icelandic 20.10
  • Grapheme-to-phoneme (g2p) module for Icelandic 22.10
  • Models for automatic g2p for Icelandic 20.10
Editor for pronunciation dictionaries
  • g2p-Service 20.11
  • Editor for pronunciation dictionaries 20.10

Spelling and grammar checking means checking a text for errors and either correcting them or highlighting them within the text. The repository contains several models that either correct words and sentences or sort them based on the errors in question. Byte-Level Neural Error Correction Model for Icelandic (“Leiðréttingarlíkan fyrir íslensku”) is actually a translation model that translates an Icelandic text with errors into an error-free Icelandic text. GreynirCorrect is a Python 3 package and a command line tool that highlights and corrects various types of spelling and grammar errors. Binary Error Classifier for Icelandic Sentences (“ByT5-base Transformer-líkan fyrir flokkun íslenskra setninga”) sorts sentences according to whether they are likely to contain errors or not. Multilabel Error Classifier for Sentences (“Fjölmerkja villuflokkari fyrir setningar”) detects whether a sentence contains a specific type of error (e.g. a spelling or grammar error), and the Error Classifier for Tokens (“Villuflokkari fyrir tóka”) does the same thing for words.

In addition to these models, there are also websites and software available for spelling and grammar checking. Yfirlestur contains code for a spelling and grammar checking website that uses GreynirCorrect for its checking function. Yfirlestur Docs and Yfirlestur Word contain back-end code for add-ons for Google Docs and Microsoft Word. Hunspell-is is software that reads the Icelandic Wiki Dictionary and creates a dictionary for the spell checker Hunspell, which can be used with e.g. LibreOffice, Firefox, Thunderbird and Google Chrome.

Models
  • Icelandic GPT-SW3 for spell and grammar checking 04.24
  • Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur 22.09
  • GreynirCorrect 3.4.5 | 3.4.4 | 3.2.1 | 3.2.0 | 1.0.2  
  • Binary Error Classifier for Icelandic Sentences (ByT5-base Transformer model) 22.09
  • Multilabel Error Classifier (Icelandic Error Corpus categories) for Sentences 22.01
  • Error Classifier (Icelandic Error Corpus categories) for Tokens 22.05
Software / websites
  • Yfirlestur Docs 22.10
  • Yfirlestur Word 22.10
  • Spellchecking app for Android 22.10
  • Yfirlestur 1.0.1 | 1.0.0  
  • Hunspell-IS. Spell checker, morphological analyzer & thesaurus for Icelandic download
  • Alexia Lexicon Acquisition Tool for Icelandic 3.0 | 2.0 | 1.0
  • Skiptir 20.10
  • Annotald 1.0.0
  • GreynirSeq - A Natural Language Processing Toolkit for Icelandic 0.2.0
  • OCR Post-Processing Tool for Icelandic 22.10
  • AnySoftKeyboard with custom autocompletion 22.10
  • IceEval - Icelandic Natural Language Processing Benchmark 22.09

Other resources

Listed below are a few resources that are not in the CLARIN-IS repository but can be searched or downloaded on other sites.

Dictionaries and word lists

Corpora - text files