Finns pioneer small language processing

Computers are fast learners in natural language processing (NLP), but so far the attention has focused on the world’s largest languages, primarily English and Chinese. With assistance from the LUMI supercomputer, operated by CSC, the national research and education network (NREN) of Finland, a Finnish research group has changed the scene. The group has published the first comprehensive Finnish language model.

“Language technology is essential to the survival of small languages,” says Sampo Pyysalo, Assistant Professor at the University of Turku, continuing:

“As Finnish is a relatively small language area, there is very little interest in it from the large, international commercial operators such as Google, Facebook and Baidu, who have developed the most advanced English and Chinese language models in the world.”

13 billion parameters

Nowadays, language models underpin all artificial intelligence (AI) systems for language processing. In recent years, generative language models have been a particular focus, with the Generative Pre-trained Transformer 3 (GPT-3) developed by OpenAI breaking new ground. Given some input text, the model predicts the words that follow. Such models can help with machine translation and document classification, and the texts they produce are very difficult to distinguish from texts written by humans. The aim of the Finnish research group is to develop Finnish language models towards the GPT-3 level.
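The prediction task described above can be illustrated with a deliberately tiny sketch. The toy bigram model below is not how GPT-3 works internally (GPT-3 is a deep transformer network with billions of parameters), but the objective is the same: given preceding words, rank candidates for the next word.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a tiny corpus.
# Real GPT-style models learn this distribution with a deep neural network
# trained on huge data sets, but the prediction task is identical in spirit.
corpus = "the model reads text and the model predicts the next word".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word`, or None."""
    candidates = following[word]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # "model", which follows "the" twice in this corpus
```

Generating longer text is then just a matter of feeding each predicted word back in as the new context, which is exactly how generative models produce whole passages.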

“It is likely that the most important AI applications of this decade will be built on these kinds of language models. We are undergoing a pretty big and quick transition right now. The most significant applications have not yet been made,” says Sampo Pyysalo.

The NLP project was one of almost 30 pilot projects run in the GPU partition – the section using graphics processing units (GPUs) – of the new LUMI supercomputer. During the project, the group created a GPT-3 model with 13 billion parameters based entirely on Finnish. This is the largest Finnish language model ever.
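To get a feel for why a 13-billion-parameter model needs a supercomputer, some back-of-envelope arithmetic helps. The byte counts below are common rules of thumb for large-model training, not figures from the article:

```python
# Rough memory arithmetic for a 13-billion-parameter model.
# Assumptions (not from the article): 16-bit weights for inference, and
# ~16 bytes per parameter for mixed-precision training with an Adam-style
# optimizer (weights + gradients + optimizer states).
params = 13_000_000_000

inference_gb = params * 2 / 1e9    # fp16/bf16 weights only
training_gb = params * 16 / 1e9    # weights + gradients + optimizer states

print(f"Weights (fp16):  ~{inference_gb:.0f} GB")  # ~26 GB
print(f"Training state: ~{training_gb:.0f} GB")    # ~208 GB
```

Even under these rough assumptions, the training state alone far exceeds the memory of any single GPU, which is why the work has to be sharded across many accelerators in a partition like LUMI’s.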

Swearing cut in half

Development of language models is based on huge data sets, which are used to train deep neural networks. In the project, the research group also created an identification system which filtered out the most problematic text segments from the data fed into the language model.

“We trained our language model with very high-quality data that meets EU requirements. By classifying different text types, we have a better-than-average understanding of what kind of data the model has read, and we were able to eliminate the most toxic and problematic texts from the model. For example, compared to previous models, we were able to cut the model’s spontaneous swearing in half,” Pyysalo illustrates.
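The article does not describe the group’s identification system in detail; it is a trained classifier over text types. Purely as a sketch of where such filtering sits in a data pipeline, the toy version below drops segments containing terms from a hypothetical blocklist:

```python
# Minimal sketch of training-data filtering, assuming a simple blocklist.
# The research group's actual system is a trained classifier of text types;
# this toy stand-in only shows the filtering step in the pipeline.
BLOCKLIST = {"badword", "slur"}  # hypothetical placeholder terms

def keep_segment(text: str) -> bool:
    """Return True if the text segment passes the filter."""
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    return not (words & BLOCKLIST)

segments = ["A clean sentence.", "This contains a badword!", "Another clean one."]
cleaned = [s for s in segments if keep_segment(s)]
print(len(cleaned))  # 2: the segment with the blocked term is removed
```

A real filter would score each segment with a classifier and threshold on that score rather than match exact words, but the effect on the training corpus is the same: problematic segments never reach the model.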

The computing power of the LUMI supercomputer accelerated the work greatly, Pyysalo notes:

“This would never have been possible without a system like LUMI. With smaller systems, we would still be computing this model in 2025.”

Published as open source

The resulting model is open source, meaning it is available to everyone.

“These models and the technology based on them are causing major changes in many sectors, and so far such models have been owned exclusively by multinational companies. Our model is genuinely open and enables things that could not be built on the models developed by these large multinational companies,” says Sampo Pyysalo.

Having completed the pilot project, the group will continue its work in a new project with 2 million GPU hours on LUMI.

“In this project, we focus on how multilingual and translation data can support the development of the largest Finnish-language models,” Pyysalo explains.


The text is inspired by the article “Research group created the largest Finnish language model ever with the LUMI supercomputer” by Anni Jakobsson, published on the CSC website.

Published: 06/2023

For more information please contact our contributor(s):