The Jensson Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text.
Download the corpus here
The Jensson Corpus is 3.8 hours in length, with 5,612 utterances (44.1 kHz, 16-bit) from 20 speakers (13 male, 7 female).
The words in the read text were chosen to keep the text as short as possible while still covering most of the bi-phone combinations that exist in Icelandic. The text is in the form of questions. All speakers read the same text, which takes about 11 minutes to read.
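As a rough sanity check on the figures above, the following sketch derives a few averages from the quoted totals. The variable names are illustrative, and the 11-minute figure is approximate, so the results are estimates only. The read text comes to roughly 3.7 hours of the 3.8-hour total, which suggests (an inference, not stated in this description) that the intro and woz recordings make up only a small remainder.

```python
# Figures quoted in the corpus description.
total_hours = 3.8
utterances = 5612
speakers = 20
read_minutes_per_speaker = 11  # "about 11 minutes", so approximate

# Average length of one segmented utterance, in seconds.
mean_utterance_seconds = total_hours * 3600 / utterances

# Total amount of read text if every speaker reads for about 11 minutes.
read_hours = speakers * read_minutes_per_speaker / 60

# Average number of utterances per speaker.
mean_utterances_per_speaker = utterances / speakers

print(f"mean utterance length: {mean_utterance_seconds:.2f} s")          # 2.44 s
print(f"read text, all speakers: {read_hours:.2f} h")                    # 3.67 h
print(f"mean utterances per speaker: {mean_utterances_per_speaker:.1f}")  # 280.6
```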
1. Speaker information
2. Data structure
The_Jensson_Corpus/SpeakerID/*.wav - Segmented wave files
intro*.wav - the speaker introduces himself/herself (not reading)
text*.wav - the actual bi-phonetically balanced utterances (read text)
woz*.wav - the speaker speaking naturally (not reading)
Transcriptions - The_Jensson_Corpus/SpeakerID/transcription.xml - Transcript of all the spoken utterances in Icelandic.
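The per-speaker layout above can be walked with a short script. This is a minimal sketch, assuming the corpus has been unpacked so that each speaker has a directory directly under the corpus root and that the wave files use the intro, text, and woz prefixes exactly as listed. The function name group_wavs is hypothetical and not part of the corpus distribution.

```python
from collections import defaultdict
from pathlib import Path

# File-name prefixes described in the data structure section.
KINDS = ("intro", "text", "woz")


def group_wavs(corpus_root):
    """Return {speaker_id: {kind: [wav paths]}} for one corpus directory."""
    grouped = {}
    for speaker_dir in sorted(Path(corpus_root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # skip top-level files such as the .mlf files
        by_kind = defaultdict(list)
        for wav in sorted(speaker_dir.glob("*.wav")):
            for kind in KINDS:
                if wav.name.startswith(kind):
                    by_kind[kind].append(wav)
                    break
        grouped[speaker_dir.name] = dict(by_kind)
    return grouped
```

Calling group_wavs("The_Jensson_Corpus") would then give, for each speaker, separate lists for the read utterances (text) and the non-read recordings (intro, woz), which is convenient when the read and spontaneous material are used for different purposes.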
In addition, these files are provided:
The_Jensson_Corpus/fileToPhonemeMapText.mlf - a phoneme transcription with reference to all the bi-phonetically balanced utterances, i.e. all the SpeakerID/text*.wav files.
The_Jensson_Corpus/fileToTriPhonemeMapText.mlf - a tri-phoneme transcription with reference to all the bi-phonetically balanced utterances, i.e. all the SpeakerID/text*.wav files.
The_Jensson_Corpus/fileToPhonemeMapWoz.mlf - a phoneme transcription with reference to all the woz evaluation files, i.e. all the SpeakerID/woz*.wav files.
The_Jensson_Corpus/jensson.phoneme.dictionary - all defined phonemes in Icelandic used in the corpus.
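The .mlf extension suggests that the transcription files follow the HTK Master Label File layout, although this description does not confirm it. Under that assumption, the sketch below reads such a file into a dictionary that maps each quoted file pattern to its list of labels. The function name parse_mlf is hypothetical, and the handling of optional start and end times is a guess at what the files may contain.

```python
def parse_mlf(lines):
    """Parse HTK-style MLF lines into {file pattern: [labels]}.

    Assumed layout: a '#!MLF!#' header, then for each utterance a quoted
    file pattern, one label per line, and a single '.' as terminator.
    """
    entries = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"') and line.endswith('"'):
            # Start of a new utterance entry.
            current = line.strip('"')
            entries[current] = []
        elif line == ".":
            # End of the current utterance entry.
            current = None
        elif current is not None:
            tokens = line.split()
            # Drop optional leading start/end times, keep the label.
            while len(tokens) > 1 and tokens[0].isdigit():
                tokens.pop(0)
            entries[current].append(tokens[0])
    return entries
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example parse_mlf(open(path, encoding="utf-8")). The encoding is also an assumption and may need adjusting for the actual files.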
Arnar Þór Jensson
Arnar Thor Jensson, Koji Iwano, and Sadaoki Furui. Language model adaptation using machine-translated text for resource-deficient languages. EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, 2008.