Preserving indigenous languages with AI
- Māori people’s indigenous language, known as te reo, is being preserved with AI.
- Te Hiku Media is developing automatic speech recognition (ASR) models for te reo, which is a Polynesian language.
- The efforts have spurred similar ASR projects now underway by Native Hawaiians and the Mohawk people in southeastern Canada.
The United Nations says that 96% of the world’s approximately 6,700 languages are spoken by only 3% of the world’s population. Although indigenous peoples make up less than 6% of the global population, they speak more than 4,000 of the world’s languages.
More concerning still, some 3,000 indigenous languages could disappear before the end of the century. Put simply, indigenous languages are on a path to endangerment or extinction unless proper preservation efforts are made.
Traditionally, these languages are passed down from generation to generation. However, as society evolves and dominant languages like English, Mandarin and Spanish come to monopolize communication, people shift their effort toward mastering those languages instead.
When that happens, indigenous languages can end up being forgotten. Today, some indigenous languages are down to a few hundred native speakers or fewer. While there is work to preserve and safeguard these languages, a language will eventually become a memory – and then a historical curiosity – if no one uses it in conversation.
How AI can save indigenous languages
In New Zealand, a local broadcaster is now focused on preserving the Māori people’s indigenous language, known as te reo, with AI. Specifically, the broadcaster, Te Hiku Media, is developing automatic speech recognition (ASR) models for te reo, which is a Polynesian language. The project takes an ethical, transparent approach to speech data collection and analysis in order to maintain data sovereignty for the Māori people.
Built using the open source Nvidia NeMo toolkit for ASR and Nvidia A100 Tensor Core GPUs, the speech-to-text models transcribe te reo with 92% accuracy. They can also transcribe bilingual speech using English and te reo with 82% accuracy. They’re pivotal tools, made by and for the Māori people, that are helping preserve and amplify their stories.
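Figures like the 92% and 82% above are typically derived from word error rate (WER), the standard metric for ASR quality. The article does not say exactly how Te Hiku Media measures accuracy, so purely as an illustration, here is a minimal WER sketch: the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,              # deletion
                dp[i][j - 1] + 1,              # insertion
                dp[i - 1][j - 1] + substitution,
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: one substituted word out of two gives a WER of 0.5,
# i.e. roughly "50% accuracy" under a simple 1 - WER reading.
print(word_error_rate("tēnā koe", "tēnā koutou"))
```

An accuracy figure such as 92% would then correspond to a WER of about 0.08 under this reading, though production evaluations usually also normalize case, punctuation and diacritics before scoring.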
“There’s immense value in using Nvidia’s open source technologies to build the tools we need to ultimately achieve our mission, which is the preservation, promotion and revitalization of te reo Māori,” said Keoni Mahelona, chief technology officer at Te Hiku Media, who leads a team of data scientists and developers, as well as Māori language experts and data curators, working on the project.
“We’re also helping guide the industry on ethical ways of using data and technologies to ensure they’re used for the empowerment of marginalized communities,” added Mahelona, a Native Hawaiian now living in New Zealand.
Given the concerns of privacy, copyrights and other issues of uploading video and audio onto popular global platforms, Te Hiku Media decided to build its own content distribution platform. Called Whare Kōrero, the platform now holds more than 30 years’ worth of digitized, archival material featuring about 1,000 hours of te reo native speakers.
“It’s an invaluable resource of acoustic data,” Mahelona said.
Despite the platform’s success, the team soon realized that manual transcription demanded more time and effort than its limited resources allowed. The organization therefore turned to trustworthy AI, using ASR to accelerate its work.
“No one would have a clue that there are eight Nvidia A100 GPUs in our derelict, rundown, musty-smelling building in the far north of New Zealand — training and building Māori language models,” Mahelona said. “But the work has been game-changing for us.”
A clear and transparent process
To collect speech data in a transparent, ethically compliant, community-oriented way, Te Hiku Media began by explaining its cause to elders, garnering their support and asking them to come to the station to read phrases aloud.
“It was really important that we had the support of the elders and that we recorded their voices because that’s the sort of content we want to transcribe,” Mahelona said. “But eventually these efforts didn’t scale — we needed second-language learners, kids, middle-aged people and a lot more speech data in general.”
In addition to other open source AI tools, Te Hiku Media now uses the Nvidia NeMo toolkit’s ASR module for speech AI throughout its entire pipeline. The NeMo toolkit comprises building blocks called neural modules and includes pre-trained models for language model development.
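NeMo’s ASR collection includes CTC-based speech models, which emit a sequence of per-frame token predictions that must be collapsed into text. The article doesn’t detail Te Hiku Media’s decoding setup, so as a generic illustration of that final step, here is a minimal sketch of CTC greedy decoding: merge consecutive repeats, then drop the blank token. The toy vocabulary below is invented for the example.

```python
def ctc_greedy_decode(token_ids, id_to_token, blank_id=0):
    """Collapse a frame-level CTC prediction into a transcript:
    skip repeated tokens, then remove blanks."""
    out, prev = [], None
    for t in token_ids:
        if t != prev and t != blank_id:
            out.append(id_to_token[t])
        prev = t
    return "".join(out)


# Toy vocabulary (hypothetical); 0 is the CTC blank symbol.
vocab = {1: "k", 2: "a", 3: "o"}
frames = [1, 1, 0, 2, 2, 3]  # per-frame argmax ids from a model
print(ctc_greedy_decode(frames, vocab))  # → "kao"
```

In practice a NeMo model handles this internally (e.g. via its `transcribe()` method), along with beam search and language-model rescoring; the sketch only shows the core collapse rule.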
The efforts have spurred similar ASR projects now underway by Native Hawaiians and the Mohawk people in southeastern Canada.
The success of preserving these languages with AI showcases one of the many benefits the technology can bring to society. While the initial process can be challenging, once an ASR model has been trained on the data, it can transcribe speech far faster than manual efforts.
Just as AI now powers a growing range of translation capabilities, its ability to preserve indigenous languages could be the solution needed to ensure these languages are not forgotten.
“It’s indigenous-led work in trustworthy AI that’s inspiring other indigenous groups to think: ‘If they can do it, we can do it, too,’” Mahelona concluded.