Adapting an LLM to the Swiss languages using the Cerebras Wafer-Scale Cluster

Authors: David Guzman Piedrahita and Jannis Vamvas, Department of Computational Linguistics, University of Zurich

Open-source large language models (LLMs) are useful tools for natural language processing, both in research and industry. In this blog post, we describe how we adapted an open-source LLM to the four languages of Switzerland by training it on a large corpus of Swiss news articles.

To perform this work, we were kindly given access to the Cerebras Systems CS-2 Wafer-Scale Cluster, part of the Edinburgh International Data Facility operated by EPCC. This blog post summarizes our experiences with this novel cluster, to help future researchers judge whether it is the right infrastructure for training a language model of this size.

Why adapt a language model?

At the start of our project, Meta’s Llama 2 was one of the best open-source language models available. However, Llama 2 was mostly trained on English text, which limits its capabilities in other languages. Our goal was to improve Llama 2’s capabilities on the four national languages of Switzerland (German, French, Italian, and Romansh), to make this LLM more useful for natural language processing tasks in the context of Switzerland.

Our earlier work on adapting a masked language model (SwissBERT) had shown us that continued pre-training on a large corpus of news articles in the four languages – Swissdox@LiRI – is an effective adaptation method. Another inspiration was the Tower project, in which Llama 2 was adapted to 10 languages through continued pre-training, with the goal of making the model more helpful for translation-related tasks. Since German, French, and Italian were among the 10 languages of Tower, we decided to start the training from the TowerBase-7B checkpoint, hoping that this would jump-start the adaptation process.

The scale of the model (7 billion parameters) and of the available training data (17 billion tokens) required us to choose the training setup carefully. Like the Tower team, we did not want to adapt just a small subset of the parameters; instead, we wanted to train the complete model. To take on this challenge, we would need access to high-performance chips that could muster the compute needed for the task, and thanks to the generosity of the University of Edinburgh’s EPCC, we were granted beta access to their Cerebras CS-2 Wafer-Scale Cluster.

Training a language model on CS-2

What distinguishes the CS-2 from a traditional GPU cluster is that, instead of partitioning large models across many small processors, its Wafer Scale Engine 2 (WSE-2) can process an entire model on a single chip. The large on-chip memory and compute resources eliminate the need for traditional blocking and partitioning, which simplifies model scaling.

The Cerebras chip did indeed prove a worthy contender to traditional approaches. We conducted experiments with various batch sizes and ultimately chose a batch size of 1,024 articles, which corresponds to approximately 780,000 tokens. This setup allowed us to achieve processing speeds of 15,936 tokens per second.
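
For readers who want to sanity-check these figures, here is a quick back-of-the-envelope calculation in Python. The constants are simply the numbers quoted above; no Cerebras-specific code is involved:

```python
# Back-of-the-envelope estimate based on the figures reported above.
batch_size_articles = 1_024
tokens_per_batch = 780_000          # approx. tokens in one batch of 1,024 articles
throughput_tok_per_s = 15_936       # measured training speed on the CS-2
corpus_tokens = 17_000_000_000      # size of our training corpus

avg_tokens_per_article = tokens_per_batch / batch_size_articles
epoch_days = corpus_tokens / throughput_tok_per_s / 86_400

print(f"~{avg_tokens_per_article:.0f} tokens per article on average")
print(f"one epoch over the corpus takes ~{epoch_days:.1f} days of pure training time")
```

The resulting estimate of roughly 12 days of pure compute per epoch is consistent with the 2–3 weeks of wall-clock time per run mentioned below, once cluster load is factored in.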

We trained two variants of our model, each for one epoch. One variant upsampled the underrepresented languages in the corpus (for this variant, we executed two different runs); the other variant did not use upsampling. On average, each run took around 2–3 weeks to train, depending on the load on the cluster from other users.

Figure: Training loss for the upsampled variant of the SwissLlama model. The y-axis shows the loss value and the x-axis the training steps; the overall downward trend indicates successful convergence of the model.

Setting up the training was not without its challenges. Integrating our existing workflows with the Cerebras hardware posed difficulties, since certain software tools and frameworks were not immediately compatible. For instance, if your experiment relies on the Hugging Face Transformers library for training, be ready to rewrite your pipeline directly in Cerebras PyTorch.

Cerebras provides a code repository called “Model Zoo”, which implements popular machine learning architectures within their framework. Fortunately, Model Zoo includes the Llama architecture, allowing us to use it for continued pre-training. To continue training the TowerBase checkpoint, we simply converted the checkpoint into the format expected by the Model Zoo implementation. This conversion was an easy procedure: we called a single function and waited around two minutes for the process to finish.
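
As an illustration of the prerequisite step, the snippet below fetches the TowerBase-7B checkpoint from the Hugging Face Hub so that it can then be handed to the Model Zoo conversion utility; the repository ID and local path are assumptions, and we refer to the Cerebras documentation for the exact conversion call:

```python
# Illustrative sketch: download the TowerBase-7B checkpoint locally before
# converting it to the Model Zoo format. Repo ID and target path are assumptions.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Unbabel/TowerBase-7B-v0.1",
    local_dir="./towerbase-7b-hf",
)
print("Checkpoint downloaded to", local_dir)
# The downloaded directory can then be passed to Model Zoo's checkpoint
# conversion utility (see the Cerebras Model Zoo documentation for the command).
```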

Be advised that the training dataset also needs to be transformed into a format compatible with Model Zoo, and this transformation may require non-negligible processing time depending on the scale of your data: for our 17-billion-token dataset, it took around 12 hours on the CPU nodes provided in EPCC’s Cerebras cluster.
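
Model Zoo ships its own preprocessing scripts, so the exact command depends on the version you use. As a rough illustration of what the transformation involves (tokenizing the articles and packing them into fixed-length sequences stored on disk), here is a minimal sketch; the tokenizer, sequence length, and file names are assumptions for illustration only:

```python
# Illustrative sketch of the data transformation: tokenize articles and pack
# them into fixed-length sequences stored in an HDF5 shard.
# Tokenizer name, sequence length, and output path are assumptions.
import h5py
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Unbabel/TowerBase-7B-v0.1")
MAX_SEQ_LEN = 2048  # hypothetical sequence length

def pack(articles):
    """Concatenate tokenized articles and cut them into fixed-length chunks."""
    ids = []
    for text in articles:
        ids.extend(tokenizer(text).input_ids + [tokenizer.eos_token_id])
    n_full = len(ids) // MAX_SEQ_LEN
    return np.asarray(ids[: n_full * MAX_SEQ_LEN], dtype=np.int32).reshape(n_full, MAX_SEQ_LEN)

articles = ["Erster Artikel ...", "Deuxième article ...", "Terzo articolo ..."]
with h5py.File("train_shard_0.h5", "w") as f:
    f.create_dataset("input_ids", data=pack(articles))
```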

Once training is finished, the model can be converted back to other formats (e.g., the one required by Hugging Face Transformers) so that the weights can be used on CPUs and GPUs and not only on Cerebras chips. This conversion is fairly fast and painless in principle, but in our case it slightly altered some configuration files, which made our tokenizer temporarily malfunction. We were able to fix this by manually editing the configuration files.
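
If you run into a similar issue, a quick sanity check right after the conversion can save time. The sketch below loads the tokenizer from a hypothetical output directory of the conversion and checks that it round-trips a probe sentence:

```python
# Sanity check after converting back to the Hugging Face format.
# The model directory is a hypothetical output path of the conversion.
from transformers import AutoTokenizer

model_dir = "./swiss-llama-hf"
tokenizer = AutoTokenizer.from_pretrained(model_dir)

probe = "Die Schweiz hat vier Landessprachen."
decoded = tokenizer.decode(tokenizer(probe).input_ids, skip_special_tokens=True)
print("Tokenizer round-trip OK:", decoded == probe)
```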

Suffice it to say that transitioning to Cerebras required us to acquire quite a bit of knowledge specific to this hardware. The learning curve was steep, and becoming proficient with the Cerebras-specific tools was time-consuming; while EPCC’s support team mitigated some of these challenges, the initial period was marked by a slower pace of progress.

Despite these challenges, the support we received from the EPCC was excellent. The same can be said of Cerebras, the company: whenever there was an issue pertaining to their technology, they always answered emails promptly and were available to help in video calls as well.

Debugging, however, was another complex aspect. Due to the specialized nature of the hardware, our prior debugging experience with Hugging Face Transformers and standard PyTorch did not always transfer to Cerebras’ codebase. We often needed to rely on Cerebras-specific diagnostics, with which we were less familiar. This added another layer of complexity to the training process.

A quick comparison to adaptation on GPU

TowerBase-7B, the adapted version of Llama 2 that we used as our starting point, was trained using traditional Nvidia GPUs. The speeds reported in the Tower paper therefore serve as a point of comparison. (The training and release of the Tower model is a joint effort of Unbabel, the SARDINE Lab at Instituto Superior Técnico, and the MICS lab at CentraleSupélec, University of Paris-Saclay.)

More specifically, the creators of Tower trained their model on 8 A100-80GB GPUs using Megatron-LM, a framework optimized by Nvidia for training LLMs with billions of parameters on GPUs. With this setup, it took them 10 days to go through 20 billion tokens, which equates to a speed of 23,148 tokens per second.

Compared to our training speed of 15,936 tokens per second, their approach is faster, but it does require the use of eight separate chips as opposed to just one in the case of Cerebras.
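
Expressed in numbers (all figures quoted above), the comparison looks as follows; the per-device figure is only a rough way of normalizing for the number of chips:

```python
# Throughput comparison based on the figures quoted above.
tower_tokens = 20_000_000_000            # 20 billion tokens
tower_seconds = 10 * 86_400              # 10 days on 8x A100-80GB
tower_tok_per_s = tower_tokens / tower_seconds

cs2_tok_per_s = 15_936                   # our speed on a single CS-2

print(f"Tower (Megatron-LM): {tower_tok_per_s:,.0f} tokens/s in total, "
      f"{tower_tok_per_s / 8:,.0f} tokens/s per GPU")
print(f"CS-2:                {cs2_tok_per_s:,} tokens/s on a single chip")
```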

Did the adaptation work?

Simply looking at the validation perplexity was the most straightforward way to check whether the adaptation had an effect. But we also wanted to make sure that continued pre-training had improved the performance of the model on practical tasks.

Without instruction-tuning, one way to evaluate a language model is to have it generate text documents that are typical of the training corpus. Since we had adapted our model on news articles, our evaluation approach was to provide the model with a headline and have it generate the corresponding news article.
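
Concretely, once the model has been converted back to the Hugging Face format, such a probe can be run with a few lines of standard Transformers code. This is only a sketch: the model path, the headline, the prompt format, and the sampling settings are assumptions, not our exact setup:

```python
# Minimal sketch of the headline-to-article probe (paths, prompt format,
# and sampling parameters are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./swiss-llama-hf"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="auto"
)

headline = "Der Bundesrat beschliesst neue Massnahmen"  # hypothetical headline
inputs = tokenizer(headline + "\n", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```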

Here’s how a generated article looks for a German headline:

Looking at a handful of such articles, our impression was that the model clearly imitated the adaptation corpus, incorporating common elements of news articles. However, coherence was not quite as good as what we knew from the Grover model by Zellers et al. (2019).

In addition to generating news articles, we performed a quick instruction-tuning run on the Guanaco dataset using LoRA adapters. For this, we got help from our colleague Tannon Kew, who is the author of the paper “Turning English-centric LLMs Into Polyglots”.
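
For context, a LoRA run of this kind can be set up with the PEFT library roughly as follows; the hyperparameters and the dataset identifier are assumptions for illustration and not necessarily the exact configuration used:

```python
# Rough sketch of LoRA instruction-tuning with PEFT (hyperparameters and
# dataset identifier are illustrative assumptions).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./swiss-llama-hf"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in Llama
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

guanaco = load_dataset("timdettmers/openassistant-guanaco")  # assumed dataset ID
# A standard supervised fine-tuning loop (e.g., with the Hugging Face Trainer)
# over the instruction-response pairs would follow here.
```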

We then asked the instruction-tuned model a handful of questions that would be typical of a Swiss citizenship test. For example:

Welches Ereignis wird am 1. August in der Schweiz gefeiert? [Which event is celebrated on 1 August in Switzerland?]

We used Llama 2 7B Chat as the baseline; this is a version of Llama that has been extensively instruction-tuned by Meta, but is not adapted to our Swiss corpus. The model gave the following answer in English:

“In Switzerland, August 1st is a significant date as it marks the anniversary of two important events in the country's history: 1. Swiss National Day (Schweizer Nationaltag): On August 1, 1291, the Old Swiss Confederation was founded in the town of St. Gallen. This event marked the beginning of Switzerland's long history of independence and self-governance. Therefore, August 1 is celebrated as Swiss National …”

While the gist of the answer is correct, it contains several factual errors, such as the claim that the Confederation was founded in the town of St. Gallen.

In contrast, the adapted model generated the following answer:

Am 1. August wird in der Schweiz das traditionelle "Fête nationale" gefeiert. Dieses Ereignis wird mit diversen Feierlichkeiten und Aktivitäten begangen, die auf die Gründung der Schweiz im Jahr 1291 zurückgehen. Die Feierlichkeiten finden in den verschiedenen Kantonen und Gemeinden statt und umfassen eine Vielzahl von Veranstaltungen, wie Reden, Konzerte, Feuer … [The traditional ‘Fête nationale’ is celebrated in Switzerland on 1 August. This event is celebrated with various festivities and activities that date back to the founding of Switzerland in 1291. The festivities take place in the various cantons and municipalities and include a variety of events such as speeches, concerts, bonfires, …]

This answer is in German (which is more appropriate given the German question, and might be explained by the multilingual instruction-tuning data we used). Unlike with the original Llama’s answer, we did not spot any factual errors.

We repeated this experiment for 11 other questions, and we found that Llama Chat generated a superior answer in 1/12 cases, our model generated a superior answer in 7/12 cases, and there was a tie in 4/12 cases.

While these quick experiments allowed us to get a feeling for the effect of the adaptation, they of course do not constitute a systematic and methodical model evaluation. It will be important to (1) include the other Swiss languages, (2) increase the number of samples, and (3) perform a blind annotation of the model output.

Conclusion

In conclusion, our training experiment showed us that EPCC’s Cerebras CS-2 Wafer-Scale Cluster is a compelling option for adapting large language models of this scale. However, potential users must be prepared for a steep learning curve and should set aside some time for integrating their existing tools with the Cerebras codebase.

Future work could evaluate the performance of our adapted model on the other Swiss languages. Furthermore, it would be interesting to see whether the adapted model has become better at processing Swiss German dialects, even though these were not the focus of the adaptation.

Researchers who are interested in experimenting with our adapted versions of Llama can send us an email (vamvas@cl.uzh.ch), and we will give them access to the Hugging Face model repositories.

(Cover photo by Xavier von Erlach on Unsplash)
