Researchers at the University of Cape Town (UCT) have developed a new artificial intelligence language model trained specifically on South Africa’s 11 official written languages. The model, called MzansiLM, aims to help close a major gap: AI tools still struggle with many local languages.

The research also includes MzansiText, a curated multilingual dataset covering the same 11 languages. UCT says the work will be presented at the Language Resources and Evaluation Conference in Mallorca, Spain, this month.

Why This Matters for South Africans

Many popular AI tools work far better in English than in underrepresented South African languages such as isiNdebele and Sepedi. UCT researchers say the problem comes down to limited training data.

Dr Jan Buys, a senior lecturer in UCT’s Department of Computer Science, said languages are considered “low resource” when there are fewer and smaller text datasets available to train language models. He said MzansiText is still small when compared with English datasets, but is larger than previous datasets for South African languages.

Nine of South Africa’s 11 official written languages fall into this low-resource category. UCT says MzansiLM is believed to be the first publicly available decoder-only language model built to target all 11 official written languages.

Not a Chatbot, but a Foundation

MzansiLM is not designed to work like ChatGPT or Claude. UCT says it is a base model that developers and researchers can fine-tune for specific tasks.

Dr Francois Meyer, a lecturer in UCT’s Department of Computer Science, said developers could use the model to build tools that summarise information or annotate raw data in South African languages.

The model has 125 million parameters. That is modest compared with major commercial systems, but UCT says testing showed it performed competitively on specific tasks and outperformed much larger open-source models on benchmarks in several South African languages.
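
To put the 125 million figure in perspective, the rough sketch below estimates the memory needed just to store a model's weights. The bytes-per-parameter values (4 for 32-bit floats, 2 for 16-bit floats) are standard, but the 7-billion-parameter comparison model is a hypothetical illustration rather than one named by UCT, and the estimate ignores activations, optimizer state and other runtime overhead.

```python
# Back-of-the-envelope estimate of memory needed to store model weights only.
# Assumption: ignores activations, optimizer state and framework overhead.

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_mb(num_params: int, dtype: str = "float32") -> float:
    """Return approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


MZANSILM_PARAMS = 125_000_000        # parameter count reported by UCT
HYPOTHETICAL_LARGE = 7_000_000_000   # illustrative larger model, not from the article

for name, n in [("MzansiLM", MZANSILM_PARAMS),
                ("hypothetical 7B model", HYPOTHETICAL_LARGE)]:
    fp32 = weight_memory_mb(n, "float32")
    fp16 = weight_memory_mb(n, "float16")
    print(f"{name}: {fp32:,.0f} MB in float32, {fp16:,.0f} MB in float16")
```

Under these assumptions, MzansiLM's weights come to roughly half a gigabyte at 32-bit precision, which suggests the model could be loaded on fairly modest hardware.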

Open Research for Local AI

The project was led by Anri Lombard and Dr Buys, with Dr Meyer and other collaborators. UCT says both MzansiText and MzansiLM have been made publicly available, along with the research paper on arXiv.

For now, MzansiLM is a starting point. But it gives researchers and developers a clearer path to build AI tools that better serve South Africans in the languages they actually use.