Building NLP Text and Speech Datasets for Low Resourced Languages in East Africa

Andrew Katumba (PI), Joyce Nakatumba-Nabende (Co-PI), Jenifer Winfred Namuyanja, Jeremy Tusubira Francis, Claire Babirye, Jonathan Mukiibi, Chodrine Mutebi, Hewiit Tusiime

Funder: Lacuna Fund – A Collaborative Fund between the Rockefeller Foundation, Google.org and Canada’s International Development Research Centre

  • 2022
  • ongoing

The project will deliver open, accessible, and high-quality text and speech datasets for low-resource East African languages from Uganda, Tanzania, and Kenya. Taking advantage of advances in NLP and voice technology requires large corpora of high-quality text and speech data. The project aims to provide this data for five languages: Luganda, Runyankore-Rukiga, Acholi, Swahili, and Lumasaaba.

The speech data for Luganda and Swahili will be geared towards training a speech-to-text engine for an SDG-relevant use case, as well as general-purpose ASR models that could be used in tasks such as driving aids for people with disabilities and developing AI tutors to support early education. The monolingual and parallel text corpora will support several NLP applications, including natural language classification, topic classification, sentiment analysis, spell checking and correction, and machine translation.
