The Thor Corpus

The Thor Corpus is an Icelandic speech corpus based on read, bi-phonetically balanced text. It is 2 hours long and contains 4000 utterances (WAV, 44.1 kHz, 16-bit) from 20 speakers (10M/10F).

Download the corpus here

About the Thor Corpus

General Information

The Thor Corpus is an Icelandic speech corpus based on read, bi-phonetically balanced text. It is 2 hours long and contains 4000 utterances (WAV, 44.1 kHz, 16-bit) from 20 speakers (10M/10F).
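The stated audio format can be checked per file with Python's standard wave module. The following is only a minimal sketch: the function names describe_wav and matches_corpus_format are illustrative and not part of the corpus distribution.

```python
import wave


def describe_wav(source):
    """Return sample rate, bit depth, channel count and duration of a WAV file.

    `source` may be a file name (str) or an open binary file object.
    """
    with wave.open(source, "rb") as wav_file:
        rate = wav_file.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "channels": wav_file.getnchannels(),
            "duration_sec": wav_file.getnframes() / rate,
        }


def matches_corpus_format(info):
    """Check a description against the stated format: 44.1 kHz, 16-bit."""
    return info["sample_rate_hz"] == 44100 and info["bits_per_sample"] == 16
```

For example, matches_corpus_format(describe_wav("m7/16.wav")) would be expected to return True for a file in the stated format. Summing duration_sec over all files gives the total length of the corpus; 2 hours spread over 4000 utterances is an average of 1.8 seconds per utterance.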

The database contains recordings of 20 readers, 10 female and 10 male. Each reader's sound files are stored in a separate subfolder; folder 'm7', for example, holds the files for male reader number 7. Each reader reads roughly 200 sentences, all of which are weather information queries.
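The folder naming can be decoded programmatically. The sketch below assumes that every speaker folder is named 'f' or 'm' followed by the reader number, as in 'm7' and the speaker IDs listed further down; the function names are illustrative.

```python
import re
from pathlib import Path

# Assumed naming pattern: 'f' or 'm' followed by the reader number, e.g. 'm7'.
SPEAKER_DIR = re.compile(r"^([fm])(\d+)$")


def parse_speaker_id(name):
    """Return (gender, number) for a folder name such as 'm7', else None."""
    match = SPEAKER_DIR.match(name)
    if match is None:
        return None
    gender = "F" if match.group(1) == "f" else "M"
    return gender, int(match.group(2))


def list_speakers(corpus_root):
    """Map each speaker folder found under corpus_root to (gender, number)."""
    speakers = {}
    for entry in sorted(Path(corpus_root).iterdir()):
        parsed = parse_speaker_id(entry.name) if entry.is_dir() else None
        if parsed is not None:
            speakers[entry.name] = parsed
    return speakers
```

Here parse_speaker_id("m7") yields ("M", 7), and any name that does not fit the assumed pattern is skipped.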

The text was translated from MIT's JUPITER corpus. 1000 unique sentences were chosen at random from that corpus and translated. Foreign place names were marked and replaced with Icelandic place names chosen at random, although a few foreign place names were left unchanged.

The text collection consists of questions about the weather (a medium-sized vocabulary). The total vocabulary for this topic is about 2000 word forms. Each speaker reads about 200 utterances, and the set of utterances varies between speakers.

Recordings were conducted from April 2005 to October 2005 using the following equipment:

  • Recorder: SONY Digital Audio Tape Corder "TCD-D100" using 48 kHz
  • Tape: SONY DAT Digital Audio Tape, "10DT-120RA J"
  • Mic: Sennheiser HMD 25-1

The DAT tapes were transferred to computer files using:

  • Sony Digital Audio Tape Deck "DTC-2000ES"
  • "DAT-Link+" from Townshend Computer Tools

The file transcriptions.rtf contains transcriptions of the spoken utterances in Icelandic.

Each subfolder also includes a text file, "text.xml", in which each line describes the corresponding sound file in that folder. Line 16, for example, contains the transcription of "16.wav". It is, however, preferable to use the file transcription.rtf.
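The line-to-file correspondence can be sketched as follows. This is only an illustration under stated assumptions: it treats "text.xml" as a plain line-based file, as described above, and the function name pair_lines_with_wavs is hypothetical.

```python
def pair_lines_with_wavs(lines, wav_names):
    """Pair line n (1-based) with 'n.wav' when that file name is present.

    `lines` holds the lines of text.xml in order; `wav_names` holds the
    file names found in the same speaker folder.
    """
    available = set(wav_names)
    pairs = {}
    for number, text in enumerate(lines, start=1):
        wav_name = f"{number}.wav"
        if wav_name in available:
            pairs[wav_name] = text.strip()
    return pairs
```

In practice the lines would come from something like open("m7/text.xml", encoding=...).read().splitlines() and the names from os.listdir("m7"). The character encoding of "text.xml" is not stated here, so it has to be determined before reading the file.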

A sound file with the extension ".wav.notused" was deemed not good enough to be included in the database.
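When listing a speaker folder, rejected recordings can be separated from usable ones by their extension. A minimal sketch, with an illustrative function name:

```python
def split_recordings(file_names):
    """Split file names into usable '.wav' files and rejected '.wav.notused' files."""
    used, rejected = [], []
    for name in file_names:
        if name.endswith(".wav.notused"):
            rejected.append(name)
        elif name.endswith(".wav"):
            used.append(name)
    return used, rejected
```

Files that are neither, such as "text.xml", are ignored.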

Speaker information

The table below lists each reader's gender and age, the position of the recording on the original DAT tape, the lines read, and the segmented duration in seconds.

SpeakerID  Gender  Age  DAT tape  Position on tape     Lines read  Segmented (sec)
f1         F       NA   E2        NA                   1 - 210     374
f2         F       NA   E2        NA                   1 - 210     454
f3         F       21   E3        0:00:00 - 0:11:41    111 - 330   324
f4         F       22   E3        0:13:00 - 0:23:04    111 - 330   304
f5         F       22   E3        0:24:00 - 0:35:55    221 - 440   396

None of the speakers in the Thor corpus participated in the Jensson corpus or the RÚV corpus.


Contact

Arnar Þór Jensson
e-mail: arnarjensson@gmail.com