Tuesday, January 31, 2023

How Microsoft's Project ELLORA is assisting small languages like Gondi and Mundari in becoming digitally literate


The goal is eventually to have an entire system in place so that speakers can use technology or access information in their native tongue by speaking, listening, or typing into their phones.

If a Hindi speaker needs to find something on the Internet, they can now type a query into their phone in Devanagari script or simply speak the command. But what about people whose languages are spoken by only a small number of people, or have little or no online presence? With its Project ELLORA (Enabling Low Resource Languages) in India, Microsoft Research is working to support these languages.

“We use technology to work on low-resource languages, but we believe these communities, which are also marginalized in other ways, know what they want and need. We work with them to understand their pain points and see how technology can help,” Kalika Bali of Microsoft Research told indianexpress.com. Bali is an expert in Natural Language Processing, a field in which linguistics and artificial intelligence work together to teach computers to understand spoken and written languages.

Bali explained that ELLORA's primary objective is to ensure that these languages, which have no digital presence and no written resources, are not left behind by the advances in language technology now being driven by artificial intelligence (AI) and advanced natural language models. More importantly, establishing a digital presence may help some of these languages avoid extinction.

For the time being, Microsoft Research (MSR) has decided to focus on three of these languages: Gondi, which is spoken by close to three million people in Madhya Pradesh, Maharashtra, Chhattisgarh, Andhra Pradesh, and Telangana; Mundari, which is spoken in Jharkhand, Odisha, and West Bengal; and Idu Mishmi, which is spoken in Arunachal Pradesh.

Bali said Gondi is the language on which the company has run some of its longest projects, working with CGNet Swara as its partner in Chhattisgarh. CGNet Swara is a platform that lets Gondi speakers call in and report local news in their own language.

“We have helped with things like Adivasi radio, which served as a central place for getting information in Gondi over the phone. Because one of the biggest issues is access to information in their own language, we have also been collaborating with them to develop a machine translation system,” Bali said.

If this machine translation system performs well, Gondi speakers will be able to access any Hindi-language information in their own language. MSR plans to test it in the field soon. For the Idu Mishmi language in Arunachal Pradesh, MSR is collaborating with Pratham Books to develop a digital dictionary.

On Mundari, MSR has collaborated with IIT-Kharagpur and the German development agency GIZ. “In the case of Mundari, the job is specific: because of the lack of resources, create educational materials for children. The entire pipeline has to be built. We are developing a text-to-speech model that would enable the system to speak Mundari. Additionally, we are developing a machine translation model. In fact, we already have a small machine translation model ready,” Bali said, adding that the model is currently being tested and that speech recognition will also be worked on.

The eventual goal for Mundari is a complete system that lets Mundari speakers use technology or access information in their own language by speaking, listening, or typing into their phones. Bali also emphasized that their models do not rely on word-for-word translations for languages like Mundari. Instead, they ask native speakers to translate Hindi sentences into their own language, creating the resource and data set that the computer model will use.

Interactive Neural Machine Translation (INMT), a tool they developed as part of their efforts, can help predict the next word when translating between these languages, such as from Hindi to Mundari. “It gives me predictions in Mundari. It is like the predictive text on smartphone keyboards, but it works across two languages,” Bali explained, adding that such tools will also make human translators more effective.
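The idea of predictive text that works across two languages can be illustrated with a toy sketch. This is not how INMT itself is built (it is a neural system); the example below swaps in simple co-occurrence counts, and every name in it (`PARALLEL`, `train`, `suggest`) as well as the placeholder tokens are assumptions made purely for illustration, not real Hindi or Mundari data.

```python
from collections import Counter, defaultdict

# Toy parallel data: (source sentence, target sentence).
# Tokens are placeholders, NOT real Hindi or Mundari words.
PARALLEL = [
    ("src_i src_drink src_water", "tgt_i tgt_water tgt_drink"),
    ("src_i src_drink src_milk", "tgt_i tgt_milk tgt_drink"),
    ("src_you src_drink src_water", "tgt_you tgt_water tgt_drink"),
]


def train(pairs):
    """Count which target word follows a previous target word,
    given each word that appears in the source sentence."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        tgt_tokens = ["<s>"] + tgt.split()
        for prev, nxt in zip(tgt_tokens, tgt_tokens[1:]):
            for s in src.split():
                counts[(s, prev)][nxt] += 1
    return counts


def suggest(counts, source, prefix, k=3):
    """Suggest up to k next target words for a partial translation."""
    typed = prefix.split()
    prev = typed[-1] if typed else "<s>"
    scores = Counter()
    for s in source.split():
        # Sum the evidence contributed by every source word.
        scores.update(counts.get((s, prev), Counter()))
    return [word for word, _ in scores.most_common(k)]


counts = train(PARALLEL)
suggestions = suggest(counts, "src_i src_drink src_milk", "tgt_i")
print(suggestions)
```

Unlike an ordinary keyboard predictor, the suggestions here depend on the source sentence as well as on what the translator has already typed, which is the cross-language aspect Bali describes.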

Ensuring that the models run on low-end phones is also a challenge. Because marginalized communities typically have access to lower-end phones, the models will need to be optimized with this constraint in mind. “One of the major issues is that we want these models to work on phones. We have spent a lot of time working on how to make these models smaller, distilling and quantizing them so that they actually work on the phone,” Bali explained.
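To give a sense of what quantizing a model means, here is a minimal sketch of symmetric 8-bit quantization of a handful of weights. It is a generic illustration under stated assumptions, not MSR's actual procedure; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and distillation (training a smaller model to imitate a larger one) is not shown.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in quantized]


# Hypothetical weights standing in for one small slice of a model.
weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a small integer instead of a 32-bit float, roughly a fourfold saving in memory at the cost of a small rounding error, which is the kind of trade-off that matters on low-end phones.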

Responding to the current flurry of interest in Large Language Models (LLMs) and the role they play in translation tools, Bali said they had also tested some publicly available LLMs for some of their research. However, more work will be needed to get these models to work with languages that have few or no data sets. “How we can adapt these LLMs to work with some of the smaller languages is an open research question. Maybe the solution is to build a separate layer on top of this technology. Or it could be as simple as having enough data to feed the base models. I think we are not entirely certain. It is open research to see how we do this,” she said.

For now, Project ELLORA's ultimate goal remains clear: to ensure that the linguistic gap between the haves and the have-nots does not widen further.
