The Icelandic Confusion Set Corpus comprises seven categories of confusion sets. Each category consists of homophonous word pairs whose spellings differ by only a single letter. The corpus was compiled over three months in 2019. Its authors are Steinunn Rut Friðriksdóttir and Anton Karl Ingason of the Language Technology group at the University of Iceland.
The Icelandic Confusion Set Corpus (ICoSC) includes CSV spreadsheets listing all collected confusion sets in each category together with their frequencies. For each set, the spreadsheets give the total frequency of each candidate along with the frequency of each possible PoS tag for that candidate. The seventh and eighth columns of the tables contain binary values indicating whether the confusion set is grammatically disjoint (all PoS tags differ for the two candidates) or grammatically identical (all PoS tags are identical for the two candidates). The final column gives the frequency of the less frequent candidate in the set, which can be used to determine which sets are viable in an experiment. Also included are text files containing the list of words in each category and text files containing all sentence examples from the Icelandic Gigaword Corpus (IGC) that contain the words in each category. As the n/nn sets are by far the most frequent confusion sets, the corpus also includes a separate word list and sentence examples for the 55 most frequent of them.
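As an illustration of how the final column might be used to pick viable sets, here is a minimal Python sketch. It relies only on the column positions described above (seventh and eighth columns for the binary flags, final column for the lower frequency). The header names, the comma delimiter, the presence of a header row, and the sample rows are all assumptions made up for this example and should be checked against the actual CSV files.

```python
import csv
import io

# Invented placeholder rows, NOT real corpus data. The header names, the
# comma delimiter, and the nine-column layout are assumptions for this sketch.
SAMPLE = """cand1,freq1,tags1,cand2,freq2,tags2,disjoint,identical,min_freq
alpha_n,1200,tagA,alpha_nn,300,tagB,1,0,300
beta_i,5000,tagC,beta_y,40,tagC,0,1,40
"""


def viable_sets(csv_file, min_freq, has_header=True):
    """Return rows whose final column (lower candidate frequency) >= min_freq."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the assumed header row
    kept = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if int(row[-1]) >= min_freq:
            kept.append(row)
    return kept


rows = viable_sets(io.StringIO(SAMPLE), min_freq=100)
for row in rows:
    # Seventh and eighth columns (indices 6 and 7) hold the binary flags.
    is_disjoint = row[6] == "1"
    is_identical = row[7] == "1"
    print(row[0], row[3], row[-1], is_disjoint, is_identical)
```

To run this against one of the real spreadsheets, pass a file handle opened with `open(path, encoding="utf-8", newline="")` in place of the in-memory sample, and adjust `has_header` to match the file.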
All files are UTF-8 encoded. The confusion sets in the ICoSC were selected for their linguistic properties: the candidates in each set are homophones that differ orthographically by a single letter. The categories are: