AIR Lab

Building NLP Text and Speech Datasets for Low Resourced Languages in East Africa

Andrew Katumba (PI), Joyce Nakatumba-Nabende (Co-PI), Jenifer Winfred Namuyanja, Jeremy Tusubira Francis, Claire Babirye, Jonathan Mukiibi, Chodrine Mutebi, Hewiit Tusiime

Funder: Lacuna Fund – A Collaborative Fund between the Rockefeller Foundation, Google.org and Canada’s International Development Research Center

2022
ongoing

The project will deliver open, accessible, and high-quality text and speech datasets for low-resource East African languages from Uganda, Tanzania, and Kenya. Taking advantage of advances in NLP and voice technology requires large corpora of high-quality text and speech data. The project aims to provide such data for five languages: Luganda, Runyankore-Rukiga, Acholi, Swahili, and Lumasaaba.

The speech data for Luganda and Swahili will be geared towards training a speech-to-text engine for an SDG-relevant use case, as well as general-purpose ASR models that could be used in tasks such as driving aids for people with disabilities and developing AI tutors to support early education. The monolingual and parallel text corpora will support several NLP applications, including natural language classification, topic classification, sentiment analysis, spell checking and correction, and machine translation.

Datasets

Library

Eric Peter Wairagala, Jonathan Mukiibi, Jeremy Francis Tusubira, Claire Babirye, Joyce Nakatumba-Nabende, Andrew Katumba, Ivan Ssenkungu

Models

Get in touch

joyce.nabende@mak.ac.ug
@AIR_lab_MUK
+256701726338

Find Us

Makerere University College of Computing and Information Sciences Block B, Level 6

Open from Monday - Friday

9AM - 5PM
