MerkOr

MerkOr is an innovative Icelandic corpus, based on the semantic relations between words and their classification into semantic fields. All the content of MerkOr was automatically created with methods that can be used to create additional vocabularies of this kind, both for Icelandic and other languages.

The goal is for MerkOr to be used in Icelandic language technology, in building software that requires semantic information about Icelandic. MerkOr can also be useful in other ways, such as in linguistic research and Icelandic language education.

  • Download MerkOr from GitHub under the LGPL license.

About MerkOr

MerkOr is a semantic database of Icelandic words. It contains several hundred thousand semantic relations that were all constructed automatically. MerkOr analyzed a huge collection of Icelandic texts, established syntactic patterns, and used various statistical calculations to determine semantic relations.

MerkOr is an innovative lexicon: it gives no explanations for individual words; instead, the words are linked to each other by semantic relations and categorized by semantic fields. The database is first and foremost intended for use in software that processes Icelandic text. However, on the search page it is possible to look up words in the traditional way.

When a word is typed in the search box, the results show which relations it has to other words. There are about a hundred types of semantic relations in the MerkOr database. Here are some of the most common ones. Og 'and' indicates that two words are often used together, for instance mamma og pabbi 'mom and dad'. The relation eiginleiki 'is attribute of' indicates that the first word is an attribute of the second word. Put the other way round, the second word hefur 'has' what the first word denotes: mælaborð eiginleiki bíll 'dashboard is attribute of car', or bíll hefur (alla jafna) mælaborð 'car has (in general) dashboard'. Lýsir 'describes' means that a particular adjective can apply to the following noun, as in háhælaðir lýsir skór 'high-heeled describes shoes'. There are also relations between verbs and the nouns that can be their objects, such as drekka andlag vatn 'drink object water'. All other relations hold between nouns, most of them through prepositions such as á, af, hjá, með 'on, of, at, with'. The words given in the results are always in the nominative case, so the result is kaffi á kanna 'coffee-nom on jug-nom' and not kaffi á könnu 'coffee-nom on jug-dat'.

The words are also classified into semantic categories and we have lists of words that all belong to the same semantic category. For example, the word móðurmál 'mother tongue' belongs to the semantic category TUNGUMÁL 'LANGUAGES'.

The linked words in the results are ordered by the strength of their relations. In the list for a given relation, the word with the strongest relation to the search word comes first. In the list for a semantic category, the word with the strongest relation to the so-called middle of the category comes first.

All the words in the results are links, so it is possible to go from one word to another in MerkOr without typing a new search word. It is, however, important to keep in mind that results obtained automatically will contain some errors.

The fundamentals of the MerkOr database are:

  • Lexical item. Contains an id, a 'lemma' (= word string), a sense number and a word class.
    • Example: [id=109799, lemma=skúr_1, wordclass=noun]
  • Relation. A relation connects two lexical items with a relation type (see next). Each relation has a confidence score associated with it: the higher the score, the better and more representative the relation.
    • Example: [id=893, from_item_id=52069, relation_id=7, to_item_id=34948, confidence_score=366.806]
  • Relation type. Specifies the type of relationship between two lexical items.
    • Example: [id=7, name=coord_noun, description=og]
  • Cluster. A cluster is an ordered list of lexical items belonging to the same semantic domain. Each item in a cluster has a score associated with it, indicating how well the item fits the corresponding cluster. Fewer than 10,000 items belong to a cluster.
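
The four building blocks above can be pictured as plain records. The following Python sketch is only an illustration of that data model, not the MerkOrCore implementation: the class names, the `split_lemma` helper and the cluster member ids and scores are assumptions made for this example, and the reading of `skúr_1` as "word string plus sense number" is inferred from the example above. Only the field names and the sample values for the lexical item, relation and relation type are taken from the examples in the list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalItem:
    id: int
    lemma: str       # word string plus sense number, e.g. "skúr_1"
    wordclass: str


@dataclass(frozen=True)
class RelationType:
    id: int
    name: str
    description: str


@dataclass(frozen=True)
class Relation:
    id: int
    from_item_id: int
    relation_id: int          # refers to RelationType.id
    to_item_id: int
    confidence_score: float   # higher means more representative


@dataclass
class Cluster:
    name: str
    # (item_id, score) pairs; the score says how well the item fits
    members: list = field(default_factory=list)

    def ranked(self):
        """Return members ordered by descending score."""
        return sorted(self.members, key=lambda m: m[1], reverse=True)


def split_lemma(lemma):
    """Split 'skúr_1' into ('skúr', 1); assumes a '_<number>' suffix."""
    word, sep, sense = lemma.rpartition("_")
    if sep and sense.isdigit():
        return word, int(sense)
    return lemma, None


# Values copied from the examples in the list above.
item = LexicalItem(id=109799, lemma="skúr_1", wordclass="noun")
rel_type = RelationType(id=7, name="coord_noun", description="og")
rel = Relation(id=893, from_item_id=52069, relation_id=7,
               to_item_id=34948, confidence_score=366.806)

# Hypothetical cluster: the member ids and scores are made up.
cluster = Cluster("TUNGUMÁL", [(2, 0.4), (1, 0.9)])
```

Keeping the relation type as a separate record, referenced by `relation_id`, mirrors the examples: relation 893 points at type 7, whose description is og.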

The MerkOrCore API and command line interface can be used to query the database. Here are some examples of queries for the API / command line:

  • Does a word belong to more than one lexical item? (For instance, if the query is to fetch lexical items for the word 'skúr', the results would be two lexical items: 'skúr' (shed) as a masculine noun on the one hand and 'skúr' (rain shower) as a feminine noun on the other.)
  • Which relations exist for a certain word?
  • What are the relations with the highest confidence score for a certain word?
  • What are the relations with the highest confidence score for a certain relation type? (the best examples of certain relations)
  • To which cluster(s) does a word belong?
  • Are there clusters representing a particular semantic domain (like ÍÞRÓTTIR*)?
  • Which lexical items are connected to a certain domain?
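
To make these questions concrete, here is a small Python sketch that answers several of them over a toy in-memory dataset. It does not use the MerkOrCore API or its command line: every function name, item id and score below is hypothetical, and the relations are identified by their descriptions (og, eiginleiki) simply because those appear in the text above.

```python
from collections import namedtuple

Item = namedtuple("Item", "id lemma wordclass")
Relation = namedtuple("Relation", "from_id description to_id score")

# Toy data; ids and scores are invented for illustration only.
ITEMS = [
    Item(1, "skúr_1", "noun"), Item(2, "skúr_2", "noun"),
    Item(3, "mamma_1", "noun"), Item(4, "pabbi_1", "noun"),
    Item(5, "amma_1", "noun"), Item(6, "bíll_1", "noun"),
    Item(7, "mælaborð_1", "noun"), Item(8, "móðurmál_1", "noun"),
]
RELATIONS = [
    Relation(3, "og", 5, 4.0),           # mamma og amma
    Relation(3, "og", 4, 12.0),          # mamma og pabbi
    Relation(7, "eiginleiki", 6, 8.5),   # mælaborð eiginleiki bíll
]
CLUSTERS = {"TUNGUMÁL": [(8, 0.9)]}      # cluster name -> (item_id, score)


def word_of(lemma):
    """Strip an assumed '_<sense>' suffix from a lemma."""
    head, sep, _ = lemma.rpartition("_")
    return head if sep else lemma


def items_for_word(word):
    """All lexical items (senses) that share the given word string."""
    return [it for it in ITEMS if word_of(it.lemma) == word]


def relations_for_word(word):
    """Relations touching any sense of the word, strongest first."""
    ids = {it.id for it in items_for_word(word)}
    hits = [r for r in RELATIONS if r.from_id in ids or r.to_id in ids]
    return sorted(hits, key=lambda r: r.score, reverse=True)


def top_relations_of_type(description, n=1):
    """The n highest-scoring relations of one relation type."""
    hits = [r for r in RELATIONS if r.description == description]
    return sorted(hits, key=lambda r: r.score, reverse=True)[:n]


def clusters_for_word(word):
    """Names of the clusters containing any sense of the word."""
    ids = {it.id for it in items_for_word(word)}
    return [name for name, members in CLUSTERS.items()
            if any(item_id in ids for item_id, _ in members)]
```

With this toy data, `items_for_word("skúr")` returns both senses of the word, and `relations_for_word("mamma")` lists the relation to pabbi before the relation to amma because its score is higher.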

The project was financed by the research project Viable Language Technology Beyond English, which received a Grant of Excellence from the Icelandic Research Fund during the years 2009-2011.

DISCLAIMER: All the material in the MerkOr database is created with automatic analytical methods. Nothing in the results reflects the knowledge and opinion of the creator of MerkOr.

Contact