CLARIN’s repository contains datasets, models and software. Most are products of the language technology program, but the repository also holds various other data submitted by institutions and individuals. Most of its content is listed in a structured way at www.clarin.is/en/resources, which gives a good overview of what is available.
The Icelandic Gigaword Corpus (IGC) has now been expanded with data from 2022 and 2023. This additional data can be downloaded from the CLARIN repository and searched on the Corpora Website of the Árni Magnússon Institute. In addition, the Corpora Website has been updated and some minor flaws have been fixed.
The first edition of the Icelandic Gigaword Corpus was published in 2018, and new editions appeared every year for the first five years. Each time, new data was added and the tagging methods were improved. The first edition contained about 1,259 million running words, while the second edition contained 2,439 million running words. This time it was not considered necessary to publish the corpus in its entirety, as the methods for tagging and processing the texts have not changed since the last edition was issued. Instead, an addendum with data from 2022 and 2023 was published, containing around 162 million running words. On the Corpora Website, users can search a new version of the corpus in which the new data has been added to the 2022 edition.
The Árni Magnússon Institute's language processing website is up again, enhanced and improved. There you can use the following tools, either by pasting text into a form or through an API: