The Jensson Corpus

The Jensson Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text.

Download the corpus here

About the Jensson Corpus

The Jensson Corpus contains 3.8 hours of speech in 5,612 utterances (44.1 kHz, 16-bit) from 20 speakers (13 male, 7 female).

The words in the read text were chosen to keep the text as short as possible while still covering most of the bi-phonetic combinations that exist in Icelandic. The text is in the form of questions. All the speakers read the same text, which amounts to about 11 minutes of read speech per speaker.

1. Speaker information

SpeakerID Gender Age
1-02-m03 M 30
2-03-m01 M 24
2-03-m02 M 25
2-03-m03 M 22
2-03-m04 M 22
2-04-f01 F 25
2-04-m05 M 29
2-04-m06 M 23
2-04-m07 M 27
2-05-f02 F 32
2-05-m08 M 27
2-05-m09 M 33
2-06-f04 F 50
2-06-f05 F 49
2-06-m10 M 24
2-07-f06 F 30
2-07-f07 F 26
2-07-f08 F 25
2-07-m11 M 33
2-08-m12 M 29

None of the speakers in the Jensson corpus participated in the Thor corpus or the RÚV corpus.
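
The speaker table can be cross-checked against the stated totals of 20 speakers (13 male, 7 female). The following is a minimal sketch: the table rows are copied from the list above, while the names `SPEAKER_TABLE` and `parse_speakers` are illustrative and not part of the corpus distribution.

```python
from collections import Counter

# Rows copied from the speaker table above: SpeakerID, gender, age.
SPEAKER_TABLE = """\
1-02-m03 M 30
2-03-m01 M 24
2-03-m02 M 25
2-03-m03 M 22
2-03-m04 M 22
2-04-f01 F 25
2-04-m05 M 29
2-04-m06 M 23
2-04-m07 M 27
2-05-f02 F 32
2-05-m08 M 27
2-05-m09 M 33
2-06-f04 F 50
2-06-f05 F 49
2-06-m10 M 24
2-07-f06 F 30
2-07-f07 F 26
2-07-f08 F 25
2-07-m11 M 33
2-08-m12 M 29
"""


def parse_speakers(table):
    """Return a list of (speaker_id, gender, age) tuples."""
    speakers = []
    for line in table.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        speaker_id, gender, age = line.split()
        speakers.append((speaker_id, gender, int(age)))
    return speakers


speakers = parse_speakers(SPEAKER_TABLE)
gender_counts = Counter(gender for _, gender, _ in speakers)
print(len(speakers), dict(gender_counts))  # 20 {'M': 13, 'F': 7}
```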

2. Data structure

The_Jensson_Corpus/SpeakerID/*.wav - Segmented wave files
intro*.wav - the speaker introduces himself/herself (not reading)
text*.wav - the actual bi-phonetically balanced utterances (read text)
woz*.wav - the speaker speaking naturally (not reading)
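
The file naming scheme above lends itself to a small helper that groups one speaker's recordings by type and reads the audio format from each WAV header. This is only a sketch using the Python standard library. The function names and the example unpack location are assumptions, and it presumes the files are plain PCM WAV, which the `wave` module requires.

```python
import wave
from collections import defaultdict
from pathlib import Path

# Filename prefixes described in the data structure above.
PREFIXES = ("intro", "text", "woz")


def group_wav_files(speaker_dir):
    """Group a speaker's .wav files by their intro/text/woz filename prefix."""
    groups = defaultdict(list)
    for path in sorted(Path(speaker_dir).glob("*.wav")):
        for prefix in PREFIXES:
            if path.name.startswith(prefix):
                groups[prefix].append(path)
                break
    return dict(groups)


def wav_format(path):
    """Return (sample_rate_hz, bits_per_sample, duration_seconds) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        duration = wav.getnframes() / rate
    return rate, bits, duration


if __name__ == "__main__":
    # Assumed unpack location; adjust to wherever the corpus was extracted.
    speaker_dir = Path("The_Jensson_Corpus") / "2-04-f01"
    if speaker_dir.is_dir():
        for prefix, paths in group_wav_files(speaker_dir).items():
            total = sum(wav_format(p)[2] for p in paths)
            print(f"{prefix}: {len(paths)} files, {total / 60:.1f} minutes")
```

Summing the durations of the text*.wav files for one speaker should give roughly the 11 minutes of read text mentioned earlier.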

The_Jensson_Corpus/SpeakerID/transcription.xml - Transcriptions of all the spoken utterances, in Icelandic.
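
The element names used in transcription.xml are not documented here, so a first step when working with the file is to inspect its structure rather than assume a schema. The sketch below counts how often each tag occurs, using only the standard library. The function name `tag_counts` is illustrative.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_counts(xml_source):
    """Count how often each element tag occurs in an XML document.

    xml_source may be a filename or a binary file object.
    """
    root = ET.parse(xml_source).getroot()
    return Counter(element.tag for element in root.iter())


# Example (assumed local path):
# print(tag_counts("The_Jensson_Corpus/2-04-f01/transcription.xml"))
```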

In addition, the following files are provided:

The_Jensson_Corpus/fileToPhonemeMapText.mlf - a phoneme transcription covering all the bi-phonetically balanced utterances, i.e. all the SpeakerID/text*.wav files.

The_Jensson_Corpus/fileToTriPhonemeMapText.mlf - a tri-phoneme transcription covering all the bi-phonetically balanced utterances, i.e. all the SpeakerID/text*.wav files.

The_Jensson_Corpus/fileToPhonemeMapWoz.mlf - a phoneme transcription covering all the woz evaluation files, i.e. all the SpeakerID/woz*.wav files.

The_Jensson_Corpus/jensson.phoneme.dictionary - all defined phonemes in Icelandic used in the corpus.
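
The .mlf extension suggests that the transcription files use the HTK Master Label File layout, although this is not stated above and should be verified against the actual files. Under that assumption, a minimal parser could look as follows. The sample data and the function name `parse_mlf` are made up for illustration, and the sketch ignores HTK features such as alternative label levels.

```python
def parse_mlf(lines):
    """Parse HTK-style MLF lines into {file_pattern: [label, ...]}.

    Assumed layout: a '#!MLF!#' header, then for each utterance a quoted
    file pattern, one label per line, and a terminating '.' line.
    """
    entries = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"') and line.endswith('"'):
            current = line.strip('"')
            entries[current] = []
        elif line == ".":
            current = None  # end of this utterance's labels
        elif current is not None:
            fields = line.split()
            # Drop optional leading start/end times (integers in HTK labels).
            while len(fields) > 1 and fields[0].lstrip("-").isdigit():
                fields.pop(0)
            entries[current].append(fields[0])
    return entries


# Made-up sample; it does not reflect the real corpus contents.
SAMPLE = '''#!MLF!#
"*/text001.lab"
sil
h
a
.
"*/text002.lab"
0 300000 sil
300000 900000 j
.
'''

labels = parse_mlf(SAMPLE.splitlines())
print(labels["*/text001.lab"])  # ['sil', 'h', 'a']
```

To parse a real file, pass an open file object to `parse_mlf`. The text encoding of the corpus files is not specified here and may need to be set explicitly when opening them.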


Contact

Arnar Þór Jensson
e-mail: arnarjensson@gmail.com


References

Arnar Thor Jensson, Koji Iwano, and Sadaoki Furui. Language model adaptation using machine-translated text for resource-deficient languages. EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, 2008.