The Parliament Speech Corpus is an Icelandic spoken language corpus that contains about twenty hours of speeches from the Icelandic Parliament, in synchronized text and sound files.

About Parliament Speech Corpus

Overview

The corpus contains recordings from discussion periods at the Icelandic Parliament during the winter of 2004-2005. The recordings total nearly 21 hours and come with detailed transcriptions in text files. Information about the recordings and the speakers, such as their age and gender, is provided as well. The data is intended to reflect natural spoken Icelandic under formal conditions. The discussion periods were chosen because they consist primarily of unprepared speeches that are unlikely to have been written in advance and read aloud. In addition, the selection aimed for diversity of topics and of speakers (with respect to their origin, age and gender).

Collecting and processing the material

The recordings were obtained directly from the Parliament. In addition to the audio files, the Parliament provided text files with a preliminary transcription, which became the basis for further processing of the material. The recordings were then listened to again and the transcriptions revised according to methods developed for transcribing spoken Icelandic. The transcriptions give a word-for-word rendering of the speeches in normal standardized orthography. Turns are clearly distinguished and linked to the individual speakers, and silences, interruptions, overlaps and certain background noises (laughter, clearing of the throat, etc.) are registered in the transcriptions.

The next step was to transfer the text to Transcriber, a software tool for segmenting, labeling and transcribing speech. At the same time, the transcriptions were further revised and the sound and text files were synchronized. Transcriber produces so-called .trs files, which are text files in XML format that can easily be converted to other formats. The corpus consists of these XML files along with the sound files.
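As an illustration of how such a .trs file can be read, here is a minimal Python sketch. It assumes the usual Transcriber layout (a `Speakers` list, `Turn` elements with `speaker` and `startTime` attributes, and `Sync` elements marking segment start times); the sample document, the speaker names and the function name `read_segments` are invented for the example and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature .trs document in the usual Transcriber layout.
# Speaker names and text are invented for illustration only.
SAMPLE_TRS = """<Trans audio_filename="example" version="1">
  <Speakers>
    <Speaker id="spk1" name="Speaker A" type="female"/>
    <Speaker id="spk2" name="Speaker B" type="male"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="9.5">
      <Turn speaker="spk1" startTime="0" endTime="4.2">
        <Sync time="0"/>
        first segment
        <Sync time="2.1"/>
        second segment
      </Turn>
      <Turn speaker="spk2" startTime="4.2" endTime="9.5">
        <Sync time="4.2"/>
        a reply
      </Turn>
    </Section>
  </Episode>
</Trans>
"""


def read_segments(trs_text):
    """Return a list of (start_time, speaker_names, text) tuples."""
    root = ET.fromstring(trs_text)
    # Map speaker ids to names so turns can be labelled readably.
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    segments = []

    for turn in root.iter("Turn"):
        # A turn may list several speaker ids (overlapping speech).
        speaker_ids = (turn.get("speaker") or "").split()
        speakers = [names.get(i, i) for i in speaker_ids]
        current_time = float(turn.get("startTime"))
        parts = [turn.text or ""]

        def flush():
            # Collapse whitespace and store the segment if it has text.
            text = " ".join("".join(parts).split())
            if text:
                segments.append((current_time, speakers, text))

        for child in turn:
            if child.tag == "Sync":
                # A Sync element closes the previous segment
                # and starts a new one at its time stamp.
                flush()
                current_time = float(child.get("time"))
                parts = []
            # Text following any child element is stored in its tail.
            parts.append(child.tail or "")
        flush()

    return segments


for start, speakers, text in read_segments(SAMPLE_TRS):
    print(f"{start:7.2f}  {', '.join(speakers)}: {text}")
```

For a real file, the same function could be applied to the decoded file contents; the character encoding of the corpus files is not stated here and would need to be checked first.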

In addition, metadata was compiled with information about the recordings (date, length, subject, number of participants, etc.) and the speakers (age, gender, origin, etc.).

In total, the transcription files consist of over 180 thousand running words.

Description of content

The corpus consists of twelve sequences recorded between October 2004 and May 2005. The recordings vary in length, ranging from a few minutes to a few hours. In total they are more than 20 hours in length. The audio files are in MP3 format.

Among the topics of the discussion sessions are the government budget, taxation, water laws, energy, schools and transportation.

Numeric summary

Total length of recordings (hours:min:sec): 20:52:23
Total length of transcription (number of running words): 182,562
Number of components (sound file + text file): 12
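
As a rough illustration of what these figures imply, the following Python sketch derives the average number of running words per minute of recording. It assumes that the total duration includes pauses and other non-speech stretches, so the result is only a coarse average and not a measured speaking rate; the variable names are illustrative.

```python
# Figures taken from the numeric summary of the corpus.
TOTAL_DURATION = "20:52:23"   # hours:min:sec
RUNNING_WORDS = 182_562


def to_seconds(hms):
    """Convert an 'hours:min:sec' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = to_seconds(TOTAL_DURATION)
words_per_minute = RUNNING_WORDS / (total_seconds / 60)

print(f"{total_seconds} s, {words_per_minute:.1f} words per minute")
```

Under these assumptions the average comes to roughly 146 running words per minute of recording.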

Database

The transcribed files from the Parliament Speech Corpus, along with other spoken language material, are open for search in Íslenskt textasafn and also form part of the Tagged Icelandic Corpus (MIM). A database and a web interface have been developed for MIM, and the interface has been adjusted to accommodate the needs of spoken language corpora, so that a search not only returns examples from the transcribed text but also gives access to the corresponding passages in the sound files. Spoken language material differs from typical written texts in that a recording does not contain the contribution of a single "author": there are usually several participants, even in material like the Parliament discussions, where normally only one person speaks at a time. Links to the metadata are therefore in many ways more complicated than they are for written texts.

Organizers and financing

The material was obtained and processed within the project Tilbrigði í setningagerð (funded with a Grant of Excellence from the Icelandic Research Fund 2005-2007) and later as part of a separate project aimed at encoding and completing Icelandic spoken language material (funded by the University of Iceland Research Fund 2008-2009). Ásta Svavarsdóttir was in charge of collecting the material from the Parliament and of overseeing the transcription and processing of the material, which was mainly carried out by students. Helga Birgisdóttir revised the transcription files that were provided by the Parliament. Gunnar Hrafn Hrafnbjargarson laid the groundwork for transferring the material into the Transcriber software and for further processing there, including synchronization of sound and text with XML coding, as well as registering the metadata. Gunnar processed several files in that format; Sigrún Steingrímsdóttir, Sigrún Ammendrup and Hjördís Stefánsdóttir later took over and completed the work.

Sigrún Helgadóttir was in charge of transferring the material to the database, and Guðmundur Örn Leifsson and Steinþór Steingrímsson prepared the data and adapted the search interface for spoken language.

For the time being, it is not possible to hear the recorded speech when a search in MIM returns text from the Parliament Speech Corpus (27.10.2014).


Contact

Ásta Svavarsdóttir
Senior Research Lecturer
Árni Magnússon Institute for Icelandic Studies
E-mail: asta.svavarsdottir [at] arnastofnun.is


References

Helgadóttir, Sigrún, Ásta Svavarsdóttir, Eiríkur Rögnvaldsson, Kristín Bjarnadóttir and Hrafn Loftsson. 2012. The Tagged Icelandic Corpus (MÍM). Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages - SaLTMiL 8 - AfLaT2012. Istanbul, Turkey.

Svavarsdóttir, Ásta. 2007. Talmál og málheildir - talmál og orðabækur. [Spoken language and corpora - spoken language and dictionaries.] Orð og tunga 9: 25-50.

Thráinsson, Höskuldur, Ásgrímur Angantýsson, Ásta Svavarsdóttir, Thórhallur Eythórsson, Jóhannes Gísli Jónsson. 2007. The Icelandic (Pilot) Project in ScanDiaSyn. In Bentzen and Vangsnes (eds), Scandinavian Dialect Syntax 2005, special issue of Nordlyd - Tromsø University Working Papers in Language & Linguistics, pp. 87-124. Tromsø: The University Library of Tromsø.