## An Amharic Syllable-based Speech Corpus for Continuous Speech Recognition

## Prepared by:

Nirayo Hailu Gebreegziabher and Andreas Nürnberger

Data & Knowledge Engineering Group, Faculty of Computer Science, Otto von Guericke University Magdeburg, Germany

## Overview

This is an Amharic speech corpus suitable for developing and evaluating speech recognition and retrieval systems. The corpus contains 110 hours of speech data with syllable- and grapheme-based transcriptions, collected from public domain audio-books, news domain read-speech, and multi-genre radio programs. Experiments on a subset of the corpus show that a syllable-based triphone speech recognition system achieves a word error rate of 16.82%, lower than that of a morpheme-based system. Moreover, a hybrid hidden Markov model and deep neural network (HMM-DNN) syllable-based model achieves a word error rate of 14.36%.

## License

The corpus is prepared from audio-books, news domain read-speech, and multi-genre radio programs that are in the public domain with permissive licenses. We have also used publicly available existing datasets. By downloading this corpus, you agree to use it for research purposes only.

## Description

The corpus is partitioned into training and validation sets, each containing short audio segments no longer than 28 seconds. Utterances in each partition are resampled to a sampling frequency of 16 kHz with a sample size of 16 bits and a single (mono) channel, which gives a bitrate of 256 kbit/s, and are stored as WAV files. The syllable- and grapheme-based transcriptions are provided in plain text and, together with the audio details, as JSON and CSV files. For more details about the corpus, refer to the original publication.

## Citation

If you use this corpus in your research, please cite the original paper.
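
The audio format given in the Description (16 kHz, 16 bits, mono, at most 28 seconds, WAV) can be checked on any downloaded file. The following is a minimal sketch using only Python's standard `wave` module; the function names `describe_wav` and `matches_corpus_format` are illustrative and not part of the corpus distribution. It also shows where the stated bitrate comes from: 16,000 samples/s × 16 bits × 1 channel = 256,000 bit/s.

```python
import wave

# Values taken from the Description section of this README.
EXPECTED_RATE_HZ = 16_000
EXPECTED_BITS_PER_SAMPLE = 16
EXPECTED_CHANNELS = 1        # mono
MAX_DURATION_S = 28.0


def describe_wav(path):
    """Read the basic format parameters of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8   # sample width is reported in bytes
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "rate_hz": rate,
        "bits_per_sample": bits,
        "channels": channels,
        "duration_s": frames / rate,
        # Uncompressed PCM bitrate: samples/s * bits/sample * channels.
        "bitrate_bps": rate * bits * channels,
    }


def matches_corpus_format(info):
    """Return True if the parameters agree with the format described above."""
    return (
        info["rate_hz"] == EXPECTED_RATE_HZ
        and info["bits_per_sample"] == EXPECTED_BITS_PER_SAMPLE
        and info["channels"] == EXPECTED_CHANNELS
        and info["duration_s"] <= MAX_DURATION_S
    )
```

For example, `matches_corpus_format(describe_wav("utterance.wav"))` should return `True` for a conforming segment, where `utterance.wav` is a placeholder file name rather than an actual file in the corpus.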