##An Amharic Syllable-based Speech Corpus for Continuous Speech Recognition

##Prepared by: 
	Nirayo Hailu Gebreegziabher and Andreas Nürnberger
	Data & Knowledge Engineering Group
	Faculty of Computer Science
	Otto von Guericke University Magdeburg, Germany

##Overview

This is an Amharic speech corpus suitable for the development and evaluation 
of speech recognition and retrieval systems. The corpus contains 110 hours of speech data
with syllable- and grapheme-based transcriptions, collected from public domain audiobooks,
news domain read speech, and multi-genre radio programs. Experiments on a subset of the corpus show
that a syllable-based triphone speech recognition system achieves a word error rate of 16.82%, 
which is lower than that of a morpheme-based system. Moreover, a hybrid syllable-based model that combines 
a hidden Markov model with a deep neural network achieves a word error rate of 14.36%.

##License

The corpus is prepared from audiobooks, news domain read speech, and multi-genre radio programs that are in 
the public domain with permissive licenses. We have also used publicly available existing datasets. 
By downloading this corpus, you agree to use it for research purposes only.

##Description

The corpus is partitioned into training and validation sets, each containing short audio segments no longer than 28 seconds. 
Utterances in each partition are resampled to a sampling frequency of 16 kHz with a sample size of 16 bits and a single (mono) 
channel, which gives a bitrate of 256 kbit/s, and are stored as WAV files. The syllable- and grapheme-based transcriptions are 
provided as plain text, and the audio details are provided as JSON and CSV files. For more details about the corpus, refer to the original publication.
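
The audio format described above can be checked with a short script. The following is a minimal sketch, not part of the corpus tooling: it uses only Python's standard `wave` module, and the function names `describe_wav` and `matches_corpus_format` are hypothetical helpers introduced here for illustration. It also shows where the 256 kbit/s figure comes from (16,000 samples/s × 16 bits × 1 channel).

```python
import wave

# Format values taken from the description above.
EXPECTED_RATE_HZ = 16_000
EXPECTED_BITS_PER_SAMPLE = 16
EXPECTED_CHANNELS = 1
MAX_DURATION_S = 28.0


def describe_wav(path):
    """Read the header of a PCM WAV file and return its format details."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate_hz": rate,
        "bits_per_sample": bits,
        "channels": channels,
        "duration_s": frames / rate,
        # Uncompressed PCM bitrate: samples/s * bits/sample * channels
        "bitrate_bps": rate * bits * channels,
    }


def matches_corpus_format(info):
    """Return True if the file matches the format stated in the description."""
    return (
        info["sample_rate_hz"] == EXPECTED_RATE_HZ
        and info["bits_per_sample"] == EXPECTED_BITS_PER_SAMPLE
        and info["channels"] == EXPECTED_CHANNELS
        and info["duration_s"] <= MAX_DURATION_S
    )
```

Calling `matches_corpus_format(describe_wav(path))` on a segment from either partition should return `True` if the file follows the description; the path to pass in depends on how the downloaded corpus is laid out locally.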

##Citation

If you use this corpus in your research, please cite the original paper.