

LETTER FROM AFRICA
Ask ChatGPT, the chatbot designed by OpenAI, to list the names of the African countries in English. So far, so good. Complicate things a little by asking it in Tigrinya, a language spoken in Eritrea and northern Ethiopia. "The result is gibberish: a mix of Amharic [another Ethiopian language], Tigrinya and made-up words that make no sense in either language," observed Ethiopian computer scientist Asmelash Teka Hadgu, after having given the chatbot this challenge.
The same experiment could just as easily have been carried out with Ewe (Ghana, Togo), Yoruba (Nigeria, Benin) or Tsonga (South Africa, Mozambique). The overwhelming majority of the 2,000 or so languages spoken on the continent are virtually non-existent on the internet, and therefore poorly recognized – or not at all – by artificial intelligence (AI) systems such as ChatGPT, Google Translate or Siri. These are known as "low-resource" languages, in contrast to the handful of "high-resource" languages, led by English, which currently dominate the global internet.
Like Hadgu, a growing number of African entrepreneurs and researchers have set to work to fill these gaps. In 2019, Hadgu, who is based in Berlin, co-founded a start-up called Lesan, which is dedicated to the languages of his native country. Lesan has developed a tool that translates automatically between Tigrinya, Amharic and English, with plans to add Oromo and Somali soon. Because online resources are scarce (there are only 15,000 Wikipedia articles in Amharic, for example, a language spoken by 30 to 50 million people), the team has to be creative in collecting its data.
Much of it comes from books, magazines and documents, with the help of local contributors who identify the most relevant content, then digitize and translate it, assisted by an optical character recognition system. "It takes a lot of work, especially manual work," said Hadgu. "But we're finding that it's possible to build a qualitative model based on small, carefully selected data sets."
The tech giants also claim to want to play their part in promoting these under-represented languages at a time when, according to specialists, some 7,000 languages worldwide are threatened with invisibility or even digital death. ChatGPT version 4 includes some of these languages, such as Icelandic. Google Translate, for its part, added some 15 African languages in updates in 2020 and 2022. But the quality of the translation on offer is often inadequate, and African researchers are questioning the relevance of a methodology that does not address the specificities of African languages.