The Hjal Corpus

The Hjal project was part of the language technology (LT) initiative of the Ministry of Education, Science and Culture, which aimed to strengthen support for Icelandic in computer systems. The goal of the Hjal project was to build a speech recognizer, so several software and telecommunications companies joined forces with the University of Iceland to create the data needed to train one. The data that can be downloaded from this page is the result of this work.

  • Download the Hjal Corpus. Distributed with a CC BY 3.0 License

About the Hjal Corpus

The project

At the end of 2002, the University of Iceland and four leading companies in the telecommunications and software industry decided to join forces to build the first Icelandic speech recognizer. This project, which was called Hjal ('babble'), was sponsored by the Icelandic Language Technology Fund. The goal of the project was to collect sufficient material to train a speaker-independent isolated word recognition system. Since the Language Technology Fund is government funded, the products of the projects that it supports are supposed to be in the public domain. This means that anyone who wants to develop a speech recognizer for Icelandic can get access to this material.

The project partners established a steering group with one member from each participant. Sæmundur Þorsteinsson from Icelandic Telecom served as Chairman of the steering group. The project leader was Helga Waage, MS, of Hex Software. Professor Eiríkur Rögnvaldsson at the University of Iceland was responsible for the linguistic preparations. The project was carried out in cooperation with ScanSoft, Inc., whose role was to train the speech recognizer on the basis of the material prepared in the project. ScanSoft is a well-established company in the automatic speech recognition (ASR) industry and has already developed speech recognizers for almost 50 languages.

Linguistic preparations

ScanSoft used the SAMPA phonetic alphabet for phonemic transcription, so the first task in the linguistic preparations was to develop a SAMPA transcription standard for Icelandic. The next task was to make a detailed description of the phoneme inventory of Icelandic, including an exhaustive list of all possible diphones and a list of the most common triphones. No such list was available, so this took considerable effort. There turned out to be almost 800 different diphones in Icelandic.

The main task in the preparatory phase was to design caller sheets containing words, phrases and sentences for the participants to read. ScanSoft sent rough guidelines as to the structure and content of these sheets. They were to include words and phrases that are likely to be used in ASR applications: a certain number of person names, place names, company names, numerals, numbers (money amounts etc.), commands, and meaningful fillers (OK, please, etc.). Furthermore, each sheet was to contain five phonetically rich sentences and three strings of isolated letters.

In designing the sheets, it was necessary to take into account the inflectional nature of Icelandic. Names, like other nouns, inflect for four cases, and some numbers (including all ordinal numbers) inflect for both case and gender (a few numbers even inflect for number as well). Hence, it was necessary to include more examples of these categories than proposed in the guidelines from ScanSoft.

The most difficult part was to construct the complete sentences. They were to be composed in such a way as to get enough samples of all occurring diphones and common triphones in Icelandic. The largest publishing house in Iceland, Edda Publishing, gave access to the text of more than 100 recent novels (approximately 64 megabytes of text). From this corpus, all sentences containing 5-12 words were extracted automatically. This gave almost 90,000 sentences. Then a frequency list was made of all the diphones occurring in these sentences. This list was used to select 3,000 sentences containing a sufficient number of all occurring diphones and common triphones.
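The extraction and selection steps described above can be sketched roughly as follows. This is a minimal illustration, not the procedure the project actually used: it assumes every sentence has already been converted to a list of phoneme symbols, the greedy selection strategy and the `min_count` threshold are assumptions made for the example, and the function names are hypothetical.

```python
from collections import Counter


def filter_by_length(sentences, min_words=5, max_words=12):
    """Keep only sentences of 5-12 words, as described in the text."""
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]


def diphones(phonemes):
    """Return the adjacent phoneme pairs (diphones) of one utterance."""
    return list(zip(phonemes, phonemes[1:]))


def diphone_frequencies(transcribed):
    """Count every diphone over all sentences (sentence id -> phoneme list)."""
    counts = Counter()
    for phonemes in transcribed.values():
        counts.update(diphones(phonemes))
    return counts


def select_sentences(transcribed, min_count=1):
    """Greedily pick sentences until each diphone has min_count samples.

    Assumption: a greedy set-cover heuristic; the source does not say
    which selection method was used.
    """
    totals = diphone_frequencies(transcribed)
    # Never ask for more samples of a diphone than the corpus contains.
    needed = {d: min(min_count, n) for d, n in totals.items()}
    remaining = dict(transcribed)
    selected = []

    def gain(item):
        counts = Counter(diphones(item[1]))
        return sum(min(c, needed[d]) for d, c in counts.items())

    while remaining and any(v > 0 for v in needed.values()):
        best_id, best_phonemes = max(remaining.items(), key=gain)
        if gain((best_id, best_phonemes)) == 0:
            break
        selected.append(best_id)
        for d, c in Counter(diphones(best_phonemes)).items():
            needed[d] = max(0, needed[d] - c)
        del remaining[best_id]
    return selected
```

The phoneme lists passed to `select_sentences` would come from a SAMPA transcription of each sentence; producing that transcription is outside the scope of this sketch.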

After going through all these sentences and removing those that contained foreign words (especially names) or potentially offensive material, 1,433 different sentences remained and were used in the caller sheets. 1,000 different sheets were then generated by randomly extracting a fixed number of items from each of the different lists (names, numbers, sentences, etc.).
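Sheet generation of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the category names and the per-category counts in the usage comment are hypothetical (the text above gives counts only for sentences and letter strings), and `generate_sheets` is not a name from the project.

```python
import random


def generate_sheets(item_lists, items_per_list, n_sheets, seed=0):
    """Build n_sheets caller sheets by sampling a fixed number of items
    from each category list, without repeats within a single sheet."""
    rng = random.Random(seed)  # fixed seed so the sheets are reproducible
    sheets = []
    for _ in range(n_sheets):
        sheet = {
            category: rng.sample(items, items_per_list[category])
            for category, items in item_lists.items()
        }
        sheets.append(sheet)
    return sheets


# Hypothetical usage: five sentences and three letter strings per sheet,
# as stated in the guidelines; the other categories would be added likewise.
# sheets = generate_sheets(
#     {"sentences": sentence_list, "letters": letter_strings},
#     {"sentences": 5, "letters": 3},
#     n_sheets=1000,
# )
```

Because each sheet is drawn independently, two sheets may share items; only items within one sheet are guaranteed to be distinct.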

The final task of the linguistic preparations was to make a word frequency list for Icelandic. This list was compiled from various sources: the newspaper Morgunblaðið, recent novels, and the Icelandic spoken language corpus Ístal. ScanSoft set the minimum size of the list to 30,000 word forms, but due to the inflectional character of Icelandic, it was concluded that a considerably larger list would be feasible, and so a list of almost 50,000 word forms was the end product.
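A frequency list of word forms can be compiled along the following lines. This is a rough sketch, not the project's actual tooling: the tokenization rule and the lowercasing are assumptions, and `word_form_frequencies` is a hypothetical name. Note that inflected forms are counted separately, which is why an inflectional language yields a long list.

```python
import re
from collections import Counter

# \w matches Unicode letters in Python 3 str patterns, so Icelandic
# characters such as þ, ð and æ are kept inside tokens.
WORD_RE = re.compile(r"\w+")


def word_form_frequencies(texts, top_n=50000):
    """Count word forms (not lemmas) over several source texts and
    return the top_n most frequent as (word_form, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return counts.most_common(top_n)
```

For example, `hestur` and `hestar` end up as two separate entries, since no lemmatization is applied.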

Recordings and transcriptions

To train a speech recognizer for Icelandic, ScanSoft needed speech data from at least 2,000 native speakers. To collect these data, people were first asked to register as participants. These volunteers were then contacted and asked to call a toll-free number. When they called in, they first had to answer a few questions and were then asked to read the caller sheets that had been sent to them. As mentioned above, 1,000 different sheets were generated, so on average, each sheet was read by two callers.

When the project was officially launched, it received good media coverage, which created an opportunity to ask people to volunteer to call in and participate. Gallup Iceland was also hired to assist in recruiting volunteers. By the end of the data collection phase, almost 3,000 people had volunteered to call in. Since the population of Iceland is about 285,000, this amounts to about 1% of the whole population. When the goal of 2,000 valid recordings, sufficiently well distributed with respect to gender, age groups, regional dialects, and type of telephone (mobile vs. fixed line), had been reached, no more volunteers were contacted.

The recordings were distributed over 90,000 sound files. They were transcribed using normal Icelandic orthographic conventions. The wordlist, on the other hand, was transcribed using the SAMPA phonetic alphabet. The transcriptions were done by students in Language Technology at the University of Iceland, under the auspices of Professor Eiríkur Rögnvaldsson.

Results

Linguistic preparations for the project started in February 2003, but most of the work was carried out in April and May. The recordings started at the end of May and were completed by the middle of August. The transcribers began their work in early June and finished at the end of August. By the beginning of September, all the recordings and transcriptions had been sent to ScanSoft. The training of the Icelandic language model was completed by the end of October. The project was finished on budget and on schedule.

The speech recognizer was tested and the recognition rate appeared to be at least 97%. The system was able to tell apart very similar words, for instance, different inflectional forms of the same lexeme where the only difference lies in a vowel in an unstressed syllable (hestur 'horse' vs. hestar 'horses').

Using the Hjal Corpus

The zip file that can be downloaded from this page contains 883 folders. Each folder contains material from one speaker, usually 47 sound files and one text file. Each sound file contains one utterance. The speaker's text file contains the text of all the sound files of that particular speaker. The SAMPA standard for phonetics was used. Sound files and text files are synchronized. The sound files are in the wav format. The text is encoded in UTF-8. Prospective users need to register and agree to the terms of a user license.
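Once the zip file has been unpacked, the folder layout described above could be traversed roughly like this. The sketch assumes that sound files carry a `.wav` extension and text files a `.txt` extension; the actual file and folder names in the corpus are not specified here, and the function names are hypothetical.

```python
from pathlib import Path


def list_speakers(corpus_root):
    """Map each speaker folder name to its sound files and text files.

    Assumption: sound files end in .wav and text files in .txt.
    """
    speakers = {}
    for folder in sorted(Path(corpus_root).iterdir()):
        if not folder.is_dir():
            continue
        files = list(folder.iterdir())
        speakers[folder.name] = {
            "wav": sorted(p.name for p in files if p.suffix.lower() == ".wav"),
            "text": sorted(p.name for p in files if p.suffix.lower() == ".txt"),
        }
    return speakers


def read_transcript(path):
    """Read one speaker's text file; the corpus text is UTF-8 encoded."""
    return Path(path).read_text(encoding="utf-8")
```

A quick sanity check after unpacking is to confirm that `len(list_speakers(root))` matches the 883 folders mentioned above and that most speakers have 47 sound files.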

Contact

  • Eiríkur Rögnvaldsson
  • CLARIN Coordinator
  • The Árni Magnússon Institute for Icelandic Studies
  • Þingholtsstræti 29
  • 101 Reykjavík
  • Phone: +354-525-4037
  • E-mail: eirikur.rognvaldsson@arnastofnun.is

References

  • Rögnvaldsson, Eiríkur. 2004. The Icelandic Speech Recognition Project Hjal. In Holmboe, Henrik (ed.): Nordisk Sprogteknologi. Nordic Language Technology. Årbog 2003, pp. 239-242. Museum Tusculanums Forlag, Copenhagen.
  • Waage, Helga: Hjal - gerð íslensks stakorðagreinis. Samspil tungu og tækni. pp. 49-53. Ministry of Education, Science and Culture, Reykjavík.