Data generation and language technology for low-resourced African languages

Developing natural language processing (NLP) techniques for tasks such as Machine Translation (MT) requires monolingual and cross-lingual resources. Currently, exploring advances in NLP techniques for low-resourced languages and language pairs in the developing world is hampered by the lack of data resources. For example, in Uganda, where there are over 40 independent languages, there are neither monolingual nor bilingual or multilingual resources for developing NLP systems of the kind that significantly benefit well-resourced languages. We are now using both manual and existing automated methods to build bilingual corpora for several language pairs, each involving a low-resourced African language. We plan to use the corpora to explore several NLP applications involving the respective low-resourced African languages.

Work supported by a Google Research Award.