The Jensson Corpus is an Icelandic speech corpus based on a read bi-phonetically balanced text.
The Jensson Corpus is 3.8 hours long and contains 5,612 utterances (44.1 kHz, 16-bit) from 20 speakers (13 male, 7 female).
The words in the read text were chosen to keep the text as short as possible while still covering most of the bi-phonetic combinations that exist in Icelandic. The text is in the form of questions. All speakers read the same text, which takes about 11 minutes to read.
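The utterance count and total duration quoted above can be checked against a local copy of the corpus. The following is a minimal sketch using only the Python standard library. It assumes the corpus has been unpacked so that the WAV files sit one directory below the corpus root (one directory per speaker); the function names are hypothetical. The source does not say whether the quoted figures cover all WAV files or only the read text, so the totals may differ depending on which files are included.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def corpus_summary(corpus_root):
    """Count the WAV files one level below corpus_root and sum their length.

    Returns a (file_count, total_hours) tuple.
    """
    wav_paths = sorted(Path(corpus_root).glob("*/*.wav"))
    total_seconds = sum(wav_duration_seconds(p) for p in wav_paths)
    return len(wav_paths), total_seconds / 3600.0
```

Calling `corpus_summary("The_Jensson_Corpus")` returns the number of WAV files found and their combined length in hours.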
1. Speaker information
| SpeakerID | Gender | Age |
|---|---|---|
| 1-02-m03 | M | 30 |
| 2-03-m01 | M | 24 |
| 2-03-m02 | M | 25 |
| 2-03-m03 | M | 22 |
| 2-03-m04 | M | 22 |
| 2-04-f01 | F | 25 |
| 2-04-m05 | M | 29 |
| 2-04-m06 | M | 23 |
| 2-04-m07 | M | 27 |
| 2-05-f02 | F | 32 |
| 2-05-m08 | M | 27 |
| 2-05-m09 | M | 33 |
| 2-06-f04 | F | 50 |
| 2-06-f05 | F | 49 |
| 2-06-m10 | M | 24 |
| 2-07-f06 | F | 30 |
| 2-07-f07 | F | 26 |
| 2-07-f08 | F | 25 |
| 2-07-m11 | M | 33 |
| 2-08-m12 | M | 29 |
None of the speakers in the Jensson Corpus participated in the Thor corpus or the RÚV corpus.
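In the table above, the last field of each SpeakerID starts with `m` or `f`, and this letter matches the Gender column in every row. The sketch below relies on that observation to derive the gender from an ID and to recount the 13 male and 7 female speakers. The ID pattern is inferred from the table rather than documented, and the function names are hypothetical.

```python
from collections import Counter

# Speaker IDs copied from the speaker table.
SPEAKER_IDS = [
    "1-02-m03", "2-03-m01", "2-03-m02", "2-03-m03", "2-03-m04",
    "2-04-f01", "2-04-m05", "2-04-m06", "2-04-m07", "2-05-f02",
    "2-05-m08", "2-05-m09", "2-06-f04", "2-06-f05", "2-06-m10",
    "2-07-f06", "2-07-f07", "2-07-f08", "2-07-m11", "2-08-m12",
]


def gender_from_speaker_id(speaker_id):
    """Return 'M' or 'F' based on the first letter of the last ID field.

    Assumption: the gender letter convention is inferred from the table.
    """
    letter = speaker_id.split("-")[-1][0].upper()
    if letter not in ("M", "F"):
        raise ValueError(f"unexpected speaker ID: {speaker_id}")
    return letter


def gender_counts(speaker_ids):
    """Count speakers per gender letter."""
    return Counter(gender_from_speaker_id(s) for s in speaker_ids)
```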
2. Data structure
The_Jensson_Corpus/SpeakerID/*.wav - Segmented wave files
intro*.wav - the speaker introduces himself/herself (not reading)
text*.wav - the bi-phonetically balanced utterances (read text)
woz*.wav - the speaker speaking naturally (not reading)
Transcriptions - The_Jensson_Corpus/SpeakerID/transcription.xml - Transcripts of all spoken utterances in Icelandic.
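The file naming scheme above makes it straightforward to separate read speech from non-read speech for a given speaker. The following is a minimal sketch, assuming the three filename prefixes listed above are the only ones present; files with any other prefix are ignored. The function name is hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Filename prefixes described in the data structure listing.
KINDS = ("intro", "text", "woz")


def group_wavs_by_kind(speaker_dir):
    """Group the WAV files of one speaker directory by filename prefix.

    Returns a dict mapping "intro", "text" or "woz" to a sorted list of paths.
    """
    groups = defaultdict(list)
    for wav_path in sorted(Path(speaker_dir).glob("*.wav")):
        for kind in KINDS:
            if wav_path.name.startswith(kind):
                groups[kind].append(wav_path)
                break
    return dict(groups)
```

For example, `group_wavs_by_kind("The_Jensson_Corpus/2-03-m01")["text"]` would list the read utterances of that speaker.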
In addition, these files are provided:
The_Jensson_Corpus/fileToPhonemeMapText.mlf - a phoneme transcription of all the bi-phonetically balanced utterances, i.e. all the SpeakerID/text*.wav files.
The_Jensson_Corpus/fileToTriPhonemeMapText.mlf - a tri-phoneme transcription of all the bi-phonetically balanced utterances, i.e. all the SpeakerID/text*.wav files.
The_Jensson_Corpus/fileToPhonemeMapWoz.mlf - a phoneme transcription of all the woz evaluation files, i.e. all the SpeakerID/woz*.wav files.
The_Jensson_Corpus/jensson.phoneme.dictionary - all defined phonemes in Icelandic used in the corpus.
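The `.mlf` extension suggests that the phoneme transcriptions are HTK Master Label Files, although the source does not state this. Under that assumption, each entry starts with a quoted file pattern, is followed by one label per line (optionally preceded by start and end times), and ends with a line containing a single period. The sketch below reads such a file into a dictionary; the function name is hypothetical, and the text encoding of the files is not documented, so the caller is left to open the file.

```python
def parse_mlf(lines):
    """Parse HTK-style MLF lines into {file_pattern: [label, ...]}.

    Assumption: the corpus .mlf files follow the HTK Master Label File
    layout; this is not confirmed by the corpus description.
    """
    entries = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the MLF header
        if line.startswith('"') and line.endswith('"'):
            current = line.strip('"')  # start of a new entry
            entries[current] = []
        elif line == ".":
            current = None  # end of the current entry
        elif current is not None:
            fields = line.split()
            # Skip up to two leading integer fields (start and end times).
            idx = 0
            while idx < len(fields) - 1 and idx < 2 and fields[idx].isdigit():
                idx += 1
            entries[current].append(fields[idx])
    return entries
```

A file can then be read with, for example, `with open(path, encoding="utf-8") as f: labels = parse_mlf(f)`, where the encoding is a guess that may need adjusting.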
3. Contact
Arnar Þór Jensson
e-mail: arnarjensson@gmail.com
4. Reference
Arnar Thor Jensson, Koji Iwano, and Sadaoki Furui. Language model adaptation using machine-translated text for resource-deficient languages. EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, 2008.