CLARIN-IS Repository

The repository of CLARIN-IS (repository.clarin.is) contains many resources: tools for language processing, corpora, lexica and language descriptions of various kinds. All the products of the Language Technology Programme for Icelandic (2019–2023) were uploaded to the repository, as were the resources that used to be on www.malfong.is. The repository has a powerful search engine, but in order to provide a simple overview, the main resources are listed here, divided into categories and subcategories. Each category includes a short and simple text discussing the usefulness of various language resources and tools.

  • Each entry is accompanied by one or more links pointing to the CLARIN repository (the first link refers to the most recent entry).
  • If data can also be found on GitHub or HuggingFace, a link is provided.
  • An icon next to an entry links to an information page related to that entry.
  • A second icon links to a web site with a search engine associated with the relevant data.


Corpora and Testsets

A treebank is a parsed corpus that holds information about sentence construction and phrases. The Icelandic treebanks are parsed according to the Penn Parsed Corpora of Historical English (PPCHE) parsing scheme but have in some cases been adapted to Icelandic sentence construction. The Icelandic Parsed Historical Corpus (IcePaHC) (“Sögulegi íslenski trjábankinn”) and the Faroese Parsed Historical Corpus (FarPaHC) (“Sögulegi færeyski trjábankinn”) were both manually annotated. The Icelandic Contemporary Treebank – IceConTree (“Samtímalegi íslenski trjábankinn”) and NeuralMIcePaHC (“Taugavélþáttaði IcePaHC-trjábankinn”) were parsed automatically using IceNeuralParsingPipeline. GreynirCorpus holds 10 million sentences, mostly from news media published from 2015 to 2021, and was tagged using Miðeind’s Greynir language processor, which uses a tagging scheme similar to that of the treebanks. A section of the GreynirCorpus – its gold standard – has been manually annotated. This section was mapped onto the dependency treebank UD GreynirCorpus using UDConverter. Read more about dependency treebanks at https://universaldependencies.org.

  • The Icelandic Contemporary Treebank (IceConTree) 1.1 | 1.0
  • Icelandic Parsed Historical Corpus (IcePaHC) 2024.03 | 0.9
  • Faroese Parsed Historical Corpus (FarPaHC) 0.1
  • NeuralMIcePaHC 20.05 | 20.04
  • GreynirCorpus 21.06 | 20.05 | 20.05 || GitHub
  • UD GreynirCorpus 22.06

There are various types of tagged corpora. Often, such corpora consist of a text corpus that has been segmented (split into sentences and tokens), grammatically tagged (with each token being tagged with a string that signifies word class, gender, case etc.) and lemmatised (each token is given a lemma, i.e. its dictionary headword, e.g. hestur for hests). This is the case with the Icelandic Gigaword Corpus, the Tagged Icelandic Corpus, the Icelandic Frequency Dictionary, the Saga Corpus and the Corpus of Icelandic Academic Vocabulary. Of these, the Icelandic Frequency Dictionary (“Orðtíðnibókin” – OTB) is the oldest. It was published in 1991 and contains around 500,000 running words taken from extracts of 100 texts dated from 1980 to 1989. The Tagged Icelandic Corpus (“Mörkuð íslensk málheild” – MÍM) is much bigger. It contains around 25 million words from a wide variety of texts dated from 2000 to 2010. The Icelandic Gigaword Corpus (“Risamálheild” – RMH) is the most recent and largest of these corpora and is regularly updated with new texts. Unlike OTB and MÍM, RMH is not “balanced”, i.e. no attempt is made to include an approximately equal number of texts of each type. Instead, the corpus contains all the texts that are available to it. Because of this, it contains far more texts from e.g. news media and government documents than literary texts or extracts from academic journals. The Saga Corpus (“Fornritin”) contains tagged texts from the Icelandic sagas Sturlunga, Heimskringla and Landnámabók. The Corpus of Icelandic Academic Vocabulary was created from a selection of texts from MÍM and RMH with the goal of creating a word frequency list for Icelandic academic vocabulary.

MÍM has been used to create gold standards for grammatical tagging (MÍM-GULL), named entity recognition (MÍM-GULL-NER) and entity linking (MÍM-GULL-EL). Gold standards can be used to test and train various tools. To train a tool and then test it fairly, the gold standard must be divided into a training set and a testing set, as has been done with MÍM-GULL. RMH and OTB have also been used to create training and testing sets that can be used to train e.g. taggers and lemmatisers.
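
The division into training and testing sets can be sketched in a few lines of Python. This is only a generic illustration under stated assumptions: the function name `split_gold_standard`, the 20% test fraction and the toy sentences are all hypothetical, and this is not the procedure that was used to split MÍM-GULL, which is distributed with its own split.

```python
import random

def split_gold_standard(sentences, test_fraction=0.1, seed=42):
    """Shuffle the sentences reproducibly and return (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(sentences)
    # A fixed seed makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for an annotated gold standard (hypothetical data).
gold = [f"sentence {i}" for i in range(20)]
train, test = split_gold_standard(gold, test_fraction=0.2)
print(len(train), len(test))
```

Keeping the two sets disjoint matters: a tool that is tested on sentences it was trained on will appear more accurate than it really is.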

Corpora
  • Icelandic Gigaword Corpus 2022 | 2021 || HuggingFace
  • Icelandic Gigaword Corpus in a JSON-format 2022
  • Tagged Icelandic Corpus 1.0  
  • Icelandic Frequency Dictionary  18.10 | 12.11  
  • The Saga Corpus download
  • Corpus of Icelandic Academic Vocabulary 1.0 | 0.9
  • Texts from the Icelandic Web of Science and the European Web 25.02 | HuggingFace
Gold standards and test sets

An error corpus is a corpus in which errors have been tagged, e.g. errors in spelling or grammar. Among other things, error corpora can be used to develop and train grammar checkers. The Icelandic Error Corpus, the Icelandic Child Language Error Corpus, the Icelandic L2 Error Corpus and the Icelandic Dyslexia Error Corpus were all created at the University of Iceland and were all developed in a similar manner, using the same error codes. The Icelandic Confusion Set Corpus contains a list of similar words that people tend to confuse (e.g. ‘kvísl’ and ‘hvísl’), along with information about their frequency and their different POS tags.

The Icelandic Error Corpus Nonwords and Spell and grammar checking – Thesis testing are lists of erroneous words from texts along with corrections. The Icelandic Taboo Database contains a list of Icelandic words that might be considered inappropriate and/or value-laden in one way or another. IceSQuEr contains a list of erroneous search queries submitted to the Database of Modern Icelandic Inflection that did not return any results.

Corpora
  • Icelandic Error Corpus (IceEC)  1.1 | 1.0 | 0.9
  • The Icelandic Child Language Error Corpus (IceCLEC) 1.1 | 1.0 || GitHub
  • The Icelandic L2 Error Corpus (IceL2EC) 1.3 | 1.2 | 1.1 | 1.0 || GitHub
  • The Icelandic Dyslexia Error Corpus 1.2 | 1.1 | 1.0 || GitHub
  • The Icelandic Confusion Set Corpus (ICoSC) 2.0 | 1.0
Lists
  • Icelandic Error Corpus Nonwords 20.09
  • Spell and grammar checking – Thesis testing 22.10
  • Icelandic Taboo Database (iceTaboo) 1.0 || GitHub
  • Icelandic Search Query Errors (IceSQuEr) 0.1

A parallel corpus is a collection of texts in two or more languages that have been aligned at least at the sentence level, so that a sentence in one language is matched with a sentence in another language. All of the corpora and lists are in English and Icelandic except the Data for translation between Polish and Icelandic. ParIce is an English-Icelandic parallel corpus intended for training machine translation software. It consists of various subcorpora and includes around 3.5 million paired sentences. Other corpora listed below are so-called synthetic corpora, which can be used when there is an insufficient number of aligned texts available. The En-Is Synthetic Parallel Corpus was made using back-translation, where a machine translation tool is used to translate a text (e.g. Icel.-Eng.) and the resulting translation is then used as training data for a model that translates back into the original language (Eng.-Icel.). By contrast, the En-Is Synthetic Parallel Astronomy Corpus with Injected Vocabulary was created by taking a parallel corpus and switching out some of its words for more uncommon words. The En-Is Synthetic Parallel Named Entity Robustness Corpus and the En-Is Semi-Synthetic Parallel Name Robustness Corpus are synthetic corpora that give various proper nouns more weight during training by inserting them into the texts.

Parallel testsets can be used to train and test models that are intended for translating between two languages. ParIce Dev/Test Sets consists of a selection of texts from the ParIce corpus where the text alignment has been manually annotated and can be used to train tools that translate between English and Icelandic. The Icelandic-English Test Set for Sentence Alignment is intended for testing automatic sentence alignment tools. The Icelandic-English Classification Training Set for Parallel Sentence Alignment Filtering has a similar function but is rather a training set for a classifier that chooses and separates high quality parallel sentences from less precise sentences. The Icelandic-English Parallel Sentence Extraction Dataset can be used to test the accuracy of parallel sentence extraction processes for comparable corpora. En-Is Parallel Named Entity Robustness Corpus – Test Data contains testsets to assess translations of named entity tokens (e.g. people or place names) between Icelandic and English.

The lists cities_is2en (city names), countries_is2iso (country names), isprep4cc (prepositions preceding country names) and isprep4isloc (prepositions preceding city and place names) can be used to ensure that the names of cities and countries are translated correctly and used with the correct prepositions in Icelandic.

Corpora (Icelandic and English)
  • ParIce: English-Icelandic parallel corpus 21.10 | 19.10 
  • En-Is Synthetic Parallel Astronomy Corpus with Injected Vocabulary 1.0
  • En-Is Synthetic Parallel Corpus 21.07 | 20.09
  • Long Context Synthetic Translation Pairs for English and Icelandic 22.09
  • En-Is Synthetic Parallel Named Entity Robustness Corpus 1.0
  • En-Is Semi-Synthetic Parallel Name Robustness Corpus 1.0
Corpora (Icelandic and Polish)
  • Data for translation between Polish and Icelandic 24.09
Testsets
  • ParIce Dev/Test Sets 21.10 | 20.05 
  • Icelandic-English test set for sentence alignment 21.10
  • Icelandic-English Classification Training Set for Parallel Sentence Alignment Filtering download
  • Icelandic-English Parallel Sentence Extraction Dataset 21.10
  • En-Is Parallel Named Entity Robustness Corpus - Test data 1.0
Lists

Samrómur was made through crowd-sourcing and thus includes a multitude of different speakers. Spjallrómur is a conversational speech corpus that contains 54 conversations by 102 speakers. Kennslurómur is a collection of audio recordings with transcripts of lectures recorded during courses at Reykjavík University and the University of Iceland. Raddrómur consists of audio samples taken from radio and podcasts (mainly from the Icelandic National Broadcasting Service – RÚV). RÚV TV, RÚV TV Unknown Speakers and Icelandic Broadcast Speeches are databases that contain audio and transcripts from television and radio programs made by RÚV. All of the above-mentioned corpora are products of the Icelandic government’s Language Technology Programme for Icelandic 2019–2023. Other corpora for automatic speech recognition (ASR) are older.

Voice samples and sound files
  • Samrómur Corpus 21.05 
  • Samrómur - Queries 21.12 
  • Samrómur -  Children 21.09 
  • Samrómur - L2 22.09
  • Samrómur - Mimics 22.09
  • Samrómur - Unverified 22.07
  • Spjallrómur - Icelandic Conversational Speech 22.01 || GitHub
  • Kennslurómur - Icelandic Lectures 22.01
  • Raddrómur - Icelandic Speech 22.09
  • RÚV TV data 20.12
  • RÚV TV Unknown Speakers 22.02
  • Icelandic Broadcast Speeches 22.02
  • The Hjal Corpus download
  • Málrómur Corpus download
  • Parliament Speech Corpus download
  • Corpus of Althingi's Parliamentary Speeches for ASR download
  • The Jensson Corpus download
  • The Þór Corpus download
  • The Rúv Corpus download
  • Ravnursson Faroese Speech and Transcripts download

Talrómur refers to three different datasets that contain short recordings by different speakers. They are meant to be used to develop text-to-speech (TTS) systems.

Labelled Corpus of Icelandic Homographs contains a list of all labelled homographs and a corpus of sentences containing these homographs, labelled by pronunciation. The corpus can be used e.g. for training a homograph classifier and for linguistic research. Text Normalization Corpus contains several text collections before and after normalization. Normalization, in this case, means that e.g. digits, abbreviations and various symbols are written out in words ('14.6 kg' would become 'fourteen point six kilograms'). The corpora may be used to train a normalization tool that normalizes text before sending it for speech synthesis.
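
As a rough illustration of what normalization involves, the following Python sketch expands numbers below 100 and two unit abbreviations into words. It is a toy rule-based example in English, mirroring the '14.6 kg' example above; the vocabulary tables and the function names are assumptions made for this sketch, and a normalizer for Icelandic would additionally have to inflect number words for case and gender.

```python
import re

# Toy English vocabulary (an assumption made for this sketch).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
UNITS = {"kg": ("kilogram", "kilograms"), "km": ("kilometre", "kilometres")}
# A number, optionally with decimals, optionally followed by a known unit.
NUMBER_RE = re.compile(r"(?P<num>\d+(?:\.\d+)?)(?:\s*(?P<unit>kg|km)\b)?")

def int_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    raise ValueError("this sketch only handles numbers below 100")

def number_to_words(number):
    """Spell out a number string; decimals are read digit by digit."""
    whole, _, fraction = number.partition(".")
    words = [int_to_words(int(whole))]
    if fraction:
        words.append("point")
        words.extend(ONES[int(digit)] for digit in fraction)
    return " ".join(words)

def normalize(text):
    """Replace every number (and attached unit) in text with words."""
    def replace(match):
        words = number_to_words(match.group("num"))
        unit = match.group("unit")
        if unit:
            singular, plural = UNITS[unit]
            words += " " + (singular if match.group("num") == "1" else plural)
        return words
    return NUMBER_RE.sub(replace, text)

print(normalize("The bag weighs 14.6 kg."))
# The bag weighs fourteen point six kilograms.
```

Hand-written rules like these quickly become hard to maintain, which is one reason why before-and-after corpora are useful for training a normalization tool instead.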

The Speech Intelligibility Testing Data contains sentences for intelligibility testing of a TTS system. It is a set of 50 sentences where each occurs twice: once in its correct version and once containing one spelling error.

Voice samples and sound files
Text corpora
  • Labelled Corpus of Icelandic Homographs 24.04 || GitHub
  • Text Normalization Corpus 21.10
Testsets
  • Test Set for TTS Intelligibility Tests 22.01

Benchmark data are data that may be used for benchmarking of various tools. The data below can be used to assess the linguistic proficiency of large language models, their grammatical capabilities and knowledge of Icelandic culture and history.

Icelandic Linguistic Benchmark for LLMs can be used to assess the linguistic and grammatical capabilities of large language models for Icelandic. Icelandic Standardization Benchmark Set: Spelling and Punctuation consists of examples of written text that does not conform to the language standard with regard to spelling and punctuation, together with corrected examples and both short and longer explanations based on the official written rules for Icelandic. Icelandic Standardization Benchmark Set: Language Usage consists of over 300 sentences that do not conform to the language standard and the corresponding corrected sentences. Icelandic Culture and History QA Dataset is intended to measure what language models know about Icelandic culture and history and their ability to answer questions correctly.

  • Icelandic Linguistic Benchmark for LLMs 24.10
  • Icelandic Standardization Benchmark Set: Spelling and Punctuation 24.09
  • Icelandic Standardization Benchmark Set: Language Usage 24.09
  • Icelandic Culture and History QA Dataset 24.10
  • NQiI - Natural Questions In Icelandic 1.1 | 1.0
  • Icelandic WinoGrande 1.0
  • The Reykjavik University Question-Answering Dataset (RUQuAD) 22.02
  • IceSum - Icelandic Text Summarization Corpus 22.09 | 21.11
  • Icelandic Youth Language 

Wordlists, wordnets and dictionaries

Various types of dictionaries can be found in the repository. Data for the online dictionaries Íslensk nútímamálsorðabók (Dictionary of Contemporary Icelandic) and ISLEX – Icelandic-Scandinavian multilingual dictionary can be found there as they stood at a certain point in time. The Database of Icelandic Morphology (“Beygingarlýsing íslensks nútímamáls” – BÍN) is a collection of paradigms that is accessible through the website of the Árni Magnússon Institute for Icelandic Studies. Five different BÍN databases can be found in the repository, each as it stood at a specific point in time. Additionally, there is BinPackage – a Python package with a standardised application programming interface that makes it easier for programmers and academics to use the BÍN data. Other dictionaries include e.g. pronunciation dictionaries. The Pronunciation Dictionary for Icelandic (“Framburðarorðabókin”) is a part of the Hjal-project and contains around 50,000 phonetically transcribed word forms. The General Pronunciation Dictionary for ASR (“Almenn framburðarorðabók fyrir talgreiningu”) is based on the Pronunciation Dictionary for Icelandic but contains around 135,000 word forms and can be used to develop speech recognisers. Icelandic Pronunciation Dictionary for Language Technology (“Íslensk framburðarorðabók fyrir máltækni”) contains manually verified transcriptions of four pronunciation variants. The Icelandic Hyphenation Dictionary contains hyphenation patterns and lists of hyphenations that show how to hyphenate Icelandic words across line breaks.

Web dictionaries
  • A Dictionary of Contemporary Icelandic 2020  
  • Islex - Icelandic-Scandinavian multilingual dictionary 2022 | 2013
Database of Icelandic Morphology (DIM)
  • The Database of Modern Icelandic Inflection (DMII) 19.10  
  • DIM Valency Structures 21.10 
  • DMII - The Comprehensive Format 21.10 
  • DMII Core download
  • DMII - Abbreviations 21.10 
  • BinPackage 0.4.4 | 0.4.2 | 0.3.1
Other dictionaries
  • Icelandic Pronunciation Dictionary for Language Technology 22.01 | 21.10 | 21.02 | 21.01 || GitHub
  • Pronunciation Dictionary for Icelandic download
  • General Pronunciation Dictionary for ASR download
  • Icelandic Hyphenation Dictionary 2.0 | 1.0 || GitHub

Wordnets describe the semantic relations of words and phrases. IceWordNet is an Icelandic version of the Princeton Core WordNet, which classifies words into linked synsets. The Icelandic Wordweb (“Íslenskt orðanet”) is based on an alternative analysis of the semantic relations of Icelandic words and phrases (see website).

Stop-words for the Icelandic Gigaword Corpus contains almost 60 thousand 'stopwords' from the 2019 version of the Icelandic Gigaword Corpus. Stopwords are words that can often be ignored when searching large corpora, such as abbreviations, foreign words or system words. Word frequency list from the Icelandic Corpus for Academic Words contains a frequency list that was compiled from the Icelandic Corpus for Academic Words (MÍNO). Words occurring 100 times or more in the corpus are arranged by frequency; the full word frequency list contains 9,741 words. The Icelandic Academic Word List includes words from MÍNO beyond the most common words, i.e. words that are used across disciplines and play a key role when discussing a wide range of complex issues. English-Icelandic vocabulary list contains almost 233 thousand Icelandic-English pairs. The glossary was automatically compiled and then verified. The package Idiomatic Expressions in Icelandic and English consists of one thousand Icelandic phrases derived from the ISLEX database of Árnastofnun.

  • Stop-words for the Icelandic Gigaword Corpus 21.08
  • Word frequency list from the Icelandic Corpus for Academic Words (MÍNO)  1.0
  • The Icelandic Academic Word List (LÍNO) 1.0
  • English-Icelandic/Icelandic-English glossary 21.09
  • Idiomatic Expressions in Icelandic and English 22.09
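
The ideas behind the stopword and frequency lists described above can be sketched as follows. This is a minimal illustration, not the method used to compile the published lists: the function name `frequency_list`, the example tokens and the two stopwords are hypothetical.

```python
from collections import Counter

def frequency_list(tokens, stopwords=frozenset(), min_count=1):
    """Count lower-cased tokens, skipping stopwords, most frequent first."""
    counts = Counter(
        token.lower() for token in tokens if token.lower() not in stopwords
    )
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count) for word, count in ranked if count >= min_count]

# Hypothetical input: an abbreviation and a foreign word act as stopwords.
tokens = ["rannsókn", "hf.", "greining", "rannsókn", "the", "Rannsókn"]
stopwords = {"hf.", "the"}

print(frequency_list(tokens, stopwords))
# [('rannsókn', 3), ('greining', 1)]
print(frequency_list(tokens, stopwords, min_count=2))
# [('rannsókn', 3)]
```

The `min_count` parameter plays the same role as the threshold of 100 occurrences mentioned above, only on a much smaller scale.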

Language descriptions

Word embeddings represent words as vectors, where words that have a similar usage (drengur, strákur) should receive similar numerical values and the same applies to comparable semantic relations (man – king, woman – queen). The repository currently contains three sets of word embeddings, each of which was trained using data from the Icelandic Gigaword Corpus and optimized to obtain a high average score as measured by IceBATS. IceBATS is an Icelandic adaptation of the Bigger Analogy Test Set (BATS) and is intended to evaluate word embeddings based on word analogy tasks.

  • Word Embeddings – Word2Vec optimized for IceBATS 22.04
  • Word Embeddings – GloVe optimized for IceBATS 22.04
  • Word Embeddings - FastText optimized for IceBATS 22.04
  • IceBATS - The Icelandic Bigger Analogy Test Set 21.06
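
A word analogy task of the kind used in such evaluations can be illustrated with a few hand-made vectors. The three-dimensional vectors below are invented for this sketch and are not taken from any of the published embeddings, which have far more dimensions; the function names are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(vectors[word], target))

# Invented toy vectors; the dimensions loosely stand for person, female, royal.
TOY_VECTORS = {
    "man": (1.0, 0.0, 0.2),
    "king": (1.0, 0.0, 1.0),
    "woman": (1.0, 1.0, 0.2),
    "queen": (1.0, 1.0, 1.0),
    "drengur": (0.9, 0.0, 0.1),
    "strákur": (0.9, 0.0, 0.12),
    "horse": (0.1, 0.2, 0.0),
}

print(analogy("man", "king", "woman", TOY_VECTORS))
# queen
```

With these toy values the near-synonyms drengur and strákur also end up with a cosine similarity close to 1, which is the behaviour the paragraph above describes.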

N-grams are sequences of n consecutive words within a sentence; the most common kinds are bigrams (two words) and trigrams (three words). For example, one might expect the trigram “einu sinni var” (once upon a time) to appear repeatedly in Icelandic fairytales. One of the uses of N-grams is to predict the following word within a sentence. Icegrams is a Python 3 package that contains a large collection of Icelandic trigrams.
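
The sketch below shows how trigram counts can be used to predict the next word. It uses only the Python standard library and three made-up sentences; it does not use the Icegrams package or its API, and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_trigram_model(sentences):
    """Map each pair of words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        for first, second, third in ngrams(sentence.split(), 3):
            model[(first, second)][third] += 1
    return model

def predict_next(model, first, second):
    """Return the most frequent follower of (first, second), or None."""
    followers = model.get((first, second))
    return followers.most_common(1)[0][0] if followers else None

# Made-up training sentences for illustration.
sentences = [
    "einu sinni var kóngur",
    "einu sinni var lítil stúlka",
    "einu sinni fór ég heim",
]
model = build_trigram_model(sentences)

print(ngrams("einu sinni var".split(), 2))
# [('einu', 'sinni'), ('sinni', 'var')]
print(predict_next(model, "einu", "sinni"))
# var
```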

Icelandic Pronunciation contains the file “A Short Overview of the Icelandic Sound System, Pronunciation Variants, and Phonetic Transcription”. Patterns and Sentences is a part of the Hjal-project and contains a list of n-grams that are rare in Icelandic and sentences which contain words where these patterns occur.

  • Icelandic Pronunciation 20.10
  • Patterns and sentences download

Tools and models

Tokenizers

The Tokenizer divides input texts into sentences and tokens (words and punctuation).
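
What this division involves can be shown with a deliberately naive sketch based on a single regular expression. It is not the Tokenizer package from the repository and does not use its API; a real tokenizer also has to handle abbreviations, numbers, dates and similar cases, where a full stop does not end a sentence.

```python
import re

SENTENCE_END = {".", "!", "?"}
# A token is either a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into sentences, each represented as a list of tokens."""
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence without final punctuation
        sentences.append(current)
    return sentences

print(tokenize("Hesturinn hljóp. Hvar er hann?"))
# [['Hesturinn', 'hljóp', '.'], ['Hvar', 'er', 'hann', '?']]
```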

Taggers

The grammatical taggers ABL-tagger and CombiTagger tag each token in a text with a text string that designates word class as well as e.g. case, gender, tense and so on. ABL-tagger is the most commonly used tagger for Icelandic texts and provides the most accurate results. Named entity recognition (NER) means that words such as names for people, places and companies are tagged specifically in the input text. The repository contains two named entity recognition models (Icelandic NER API - Ensemble model and Icelandic NER API - ELECTRA-base model).

Lemmatizers

The lemmatiser ABL-lemmatiser reads tagged texts and lemmatises them, i.e. designates a lexical entry – a lemma – for each word (e.g. hestur for hests).

Parsers

Parsers process texts and analyse their sentence structure according to a specific grammar or annotation scheme. IceParser is a rule-based shallow parser, an improved version of the shallow parser found in the IceNLP-package that was developed between 2004 and 2007. Greynir is a rule-based full parser based on a context-free grammar. Miðeind’s Neural Constituency Parser (“Tauganetsþáttari Miðeindar”) is a variant of the Berkeley Neural Parser. IceNeuralParsingPipeline (“Íslenska taugaþáttunarpípan”) is a parsing pipeline that contains all the steps necessary to parse plain Icelandic text, i.e. steps for the pre-processing, parsing and post-processing of a text. It was trained using the IcePaHC-treebank.

Biaffine-Based UD-Parser and COMBO-Based UD Parser are both UD-parsers. UDConverter and UDConverter for GreynirCorpus are not actual parsers but can process parsed data and convert it into a UD structure.

The repository contains several translation models that translate between Icelandic and English and one that translates between Icelandic and Polish. Long Context Translation for English-Icelandic translations (“Víðsamhengislíkan fyrir þýðingar milli ensku og íslensku”) is the newest model and the one that has been most successful with translations between English and Icelandic. Optimised Long Context Translation Models for English-Icelandic translations (“Bestað víðsamhengislíkan fyrir þýðingar milli ensku og íslensku”) is a lighter and faster model based on the aforementioned model. GreynirTranslate – mBart25 NMT models for Translation between Icelandic and English (“GreynirTranslate - mBART25 NMT þýðingarlíkön fyrir íslensku og ensku”) contains common translation models based on a multilingual BART-model.

MT: Moses-SMT is a system used to develop and run statistical machine translation. GreynirT2T is a program library for training translation models that translate between Icelandic and English. GreynirT2T Serving contains programs and models for running GreynirT2T transformer machine translation models. GreynirSeq Domain Translation Pipeline is software that retrieves an Icelandic-English translation model and can adjust it for training on parallel data that is labelled according to domain.

Translation models
  • Long Context Translation Models for English-Icelandic translations 22.09
  • Optimized Long Context Translation Models for English-Icelandic translations 22.09
  • GreynirTranslate - mBART25 NMT models for Translations between Icelandic and English 1.0 
  • GreynirTranslate - mBART25 NMT (with layer drop) models for Translations between Icelandic and English 1.0
  • Semi-supervised Icelandic-Polish Translation System 22.09
Support tools
  • MT: Moses-SMT 1.0
  • GreynirSeq Domain Translation Pipeline 22.06 || GitHub
  • GreynirT2T - En--Is NMT with Tensor2Tensor 1.0 || GitHub
  • GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models 1.0 || GitHub
  • Web client on top of Google translate compliant API backends 20.05 || GitHub

Speech recognition is the transformation of spoken language into text. Various corpora and language models exist that can be used in the development of speech recognition tools. The corpora can be found under Corpora related to speech recognition, while the language models are listed below. The repository contains various recipes developed for the software Kaldi and other environments for developing speech recognisers. The scripts present different approaches for training speech recognisers by integrating corpora and language models.

Punctuation Model (“Greinarmerkingarlíkanið”) is a Python package for punctuating Icelandic text and is very useful for punctuating the output of speech recognisers. Heyra is an Android application for speech recognition.

Recipes
  • Samrómur-Children Demonstration Scripts 22.01
  • Samrómur-Adolescents Kaldi Recipe 22.06
  • Samrómur-L2 Kaldi Recipe 22.10
  • RÚV-DI Speaker Diarization 21.10 | 20.09
  • RÚV-DI Speaker Diarization v5 models 21.05
  • Voice control and question answering 22.10
  • Samrómur NeMo Recipe 22.06 || GitHub
  • Samrómur DeepSpeech Recipe 22.06 || GitHub
Language models
  • Punctuation model 20.09 || GitHub
  • 6-GRAM Language Model in Icelandic for NeMo (Binary Format) 22.06
  • DeepSpeech Scorer for Icelandic 22.06 || GitHub
  • Icelandic Language Models with Pronunciations 22.01
Other
  • Heyra 1.0
  • Tiro Web interface for speech recognition 1.0 || GitHub

Speech synthesis is the transformation of text into spoken language. Several models have been trained using data from Talrómur. WebRICE is a web reader developed at Reykjavík University. WebRICE Extension is an add-on for Chrome users and WebRICE – Web Reader is meant for users who want to add a web reader to their websites.

TTS Text Processing contains a text processing pipeline for Icelandic speech synthesisers. TTS Document Reader contains a web application that processes text and returns an audio file. Prosody Feature Extraction with Speaker Information is a tool for labelling speakers in recorded conversations.

Models
  • Multi-speaker GlowTTS model for Talrómur 2 (prerelease) 22.10 || GitHub
  • GlowTTS models for Talrómur 1 22.10 || GitHub
  • Talrómur TTS- model 22.10
Web reader
Other

MAFIA can be used to automatically create speech recognition data from recordings and transcripts by pairing together sound and text. Speech Corpora Toolkit (“Tækjasafn fyrir talmálsheildir”) is a collection of tools for standardising recordings and transcripts in a manner that prepares them for segmentation and alignment.

  • Icelandic Homograph Classifier 24.04
  • MAFIA (Match-Finder Aligner): A speech/text aligning tool 22.06 || GitHub
  • Speech Corpora Toolkit 22.06

The repository contains three tools that can be used to transcribe Icelandic texts phonetically (grapheme-to-phoneme conversion, g2p). Rule-Based g2p for Icelandic (“Reglubyggða hljóðritunarforritið”) is based on manually written rules while g2p Module for Icelandic (“Hljóðritunarforrit fyrir íslensku”) is based on models. The package Models for Automatic g2p for Icelandic (“Hljóðritunarlíkön fyrir íslensku”) contains models that were trained using an LSTM neural network and a script that utilises the models.

g2p-Service (“g2p-þjónustan”) and Editor for Pronunciation Dictionaries (“Vefviðmót til þess að vinna með framburðarorðabækur”) are both tools (web applications) for developing pronunciation dictionaries.

Grapheme-to-phoneme
  • Rule-based g2p for Icelandic 20.10
  • Grapheme-to-phoneme (g2p) module for Icelandic 22.10
  • Models for automatic g2p for Icelandic 20.10
Editor for pronunciation dictionaries
  • g2p-þjónusta 20.11
  • Editor for pronunciation dictionaries 20.10

Spelling and grammar checking means checking a text for errors and either correcting them or highlighting them within the text. The repository contains several models that either correct words and sentences or sort them based on the errors in question. Byte-Level Neural Error Correction Model for Icelandic (“Leiðréttingarlíkan fyrir íslensku”) is actually a translation model that translates an Icelandic text with errors into an error-free Icelandic text. GreynirCorrect is a Python 3 package and a command line tool that highlights and corrects various types of spelling and grammar errors. Binary Error Classifier for Icelandic Sentences (“ByT5-base Transformer-líkan fyrir flokkun íslenskra setninga”) sorts sentences according to whether they are likely to contain errors or not. Multilabel Error Classifier for Sentences (“Fjölmerkja villuflokkari fyrir setningar”) detects whether a sentence contains a specific type of error (e.g. a spelling or grammar error), and the Error Classifier for Tokens (“Villuflokkari fyrir tóka”) does the same thing for words.

In addition to these models there are also websites and software available for spelling and grammar checking. Yfirlestur contains code for a spelling and grammar checking website that uses GreynirCorrect for its spelling and grammar checking function. Yfirlestur Docs and Yfirlestur Word contain back-end code for add-ons for Google Docs and Microsoft Word. Hunspell-is is software that reads the Icelandic Wiki Dictionary and creates a dictionary for the spell checker Hunspell, which can be used with e.g. LibreOffice, Firefox, Thunderbird and Google Chrome.

Models
  • Icelandic GPT-SW3 for spell and grammar checking 04.24
  • Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur 22.09
  • GreynirCorrect 3.4.5 | 3.4.4 | 3.2.1 | 3.2.0 | 1.0.2 || GitHub
  • Binary Error Classifier for Icelandic Sentences (ByT5-base Transformer model) 22.09
  • Multilabel Error Classifier (Icelandic Error Corpus categories) for Sentences 22.01
  • Error Classifier (Icelandic Error Corpus categories) for Tokens 22.05
Software / websites
  • Icelandic Gigaword Corpus JSONL Converter 24.09
  • Alexia Lexicon Acquisition Tool for Icelandic 3.0 | 2.0 | 1.0 || GitHub
  • Skiptir (Hyphenation Tool) 20.10 || GitHub
  • Annotald 1.0.0
  • GreynirSeq - A Natural Language Processing Toolkit for Icelandic 0.2.0
  • OCR Post-Processing Tool for Icelandic 22.10 || GitHub
  • AnySoftKeyboard with custom autocompletion 22.10 || GitHub
  • IceEval - Icelandic Natural Language Processing Benchmark 22.09

Other resources

Listed below are a few resources that are not in the CLARIN-IS repository but can be searched or downloaded on other sites.

Dictionaries and word lists

Corpora - text files