The Icelandic Confusion Set Corpus (ICoSC) consists of seven categories of confusion sets, selected because their members are homophones that differ orthographically by a single letter. It was compiled over three months in 2019 by Steinunn Rut Friðriksdóttir and Anton Karl Ingason of the language technology department at the University of Iceland.
The ICoSC includes one CSV spreadsheet per category, listing all collected confusion sets of that category and their frequencies. For each set, the spreadsheet gives the total frequency of each candidate along with the frequency of each possible PoS tag for that candidate. The seventh and eighth columns contain binary values indicating whether the confusion set is grammatically disjoint (the two candidates share no PoS tags) or grammatically identical (the two candidates have exactly the same PoS tags). The final column gives the frequency of the less frequent candidate of the set, which can be used to decide which sets are viable in an experiment. Also included are text files containing the word list for each category and text files containing all sentence examples from the Icelandic Gigaword Corpus (IGC) in which the words of each category occur. As the n/nn sets are by far the most frequent, the corpus also includes a separate word list and sentence examples for the 55 most frequent n/nn sets.
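As an illustration of how the final column might be used to pick viable sets, here is a minimal Python sketch. It relies only on the column positions described above (seventh and eighth columns as binary flags, last column as the frequency of the less frequent candidate). The header row, the comma delimiter, the `0`/`1` encoding of the flags, the column names and all values in `SAMPLE` are assumptions made for the example and are not taken from the corpus files.

```python
import csv
import io

# Hypothetical sample in the layout described above. Column names and all
# values are invented for illustration; they do not come from the corpus.
SAMPLE = """cand1,freq1,tags1,cand2,freq2,tags2,disjoint,identical,min_freq
a1,500,x,a2,120,y,1,0,120
b1,40,x,b2,9000,x,0,1,40
c1,300,x;y,c2,700,x;z,0,0,300
"""

def viable_sets(lines, min_freq, has_header=True):
    """Keep rows whose last column (frequency of the less frequent
    candidate) is at least min_freq."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # assumption: the first row is a header
    return [row for row in reader if row and int(row[-1]) >= min_freq]

def is_disjoint(row):
    """Seventh column: assumed to be '1' when the candidates share no PoS tags."""
    return row[6] == "1"

def is_identical(row):
    """Eighth column: assumed to be '1' when the candidates have the same PoS tags."""
    return row[7] == "1"

rows = viable_sets(io.StringIO(SAMPLE), min_freq=100)
for row in rows:
    print(row[0], row[3], "disjoint" if is_disjoint(row) else "not disjoint")
```

To run the same filter on a real spreadsheet, pass an open file instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")`, where `path` is a placeholder for the location of one of the category CSV files.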
All files are UTF-8 encoded. The ICoSC consists of the following categories of confusion sets: