Accelerating text generation with Rust

Over the past few months, text generation capabilities using Transformer-based models have been democratized by open-source efforts such as Hugging Face's Transformers library [1]. A broad range of models and applications has been made available, including text generation tasks such as summarization, translation and free text generation.

These models offer state-of-the-art performance and can be set up with just a few lines of code using the ready-to-use pipelines of the Transformers library. This has allowed recent advances in NLP to spread widely into industry, where they can be used effectively to solve business-driven use cases. As these powerful models gain broader adoption, their computational efficiency becomes ever more critical. A lot of research has been done on reducing the size of these models with minimal impact on their accuracy, including, for example, distillation, quantization and pruning. This article focuses on another avenue for improving the runtime of text generation models: an implementation in the Rust programming language.

The structure of the article is as follows:

  1. Overview of text generation tasks and Python baseline code examples from the Transformers [1] library
  2. Description of the text generation architecture used for summarization, translation and free text generation tasks
  3. Introduction to the rust-bert library, a project offering a Rust-native implementation of these models
  4. Benchmarks between the baseline Python implementation from Transformers and the proposed Rust implementation for these text generation tasks

Readers familiar with text generation tasks and their implementation in modern language models may want to skip directly to section 3. The first two parts provide a high-level overview of the technology to better understand the scope of the proposed performance improvements and give a few references in order to dive deeper into the topic.

1. Overview of text generation tasks

To illustrate both translation and summarization capabilities, we'll use a news article from WikiNews titled Astronomers find water vapour in atmosphere of exoplanet K2-18b, shared under the CC BY 2.5 license [8].

Summarization

In Python, this article can be summarized by calling the following snippet from the Transformers Python library [1], which defaults to a BART model trained on the CNN-DailyMail dataset:

from transformers import pipeline
summarization_pipeline = pipeline("summarization")
# input_article holds the full text of the WikiNews article introduced above
summarization_pipeline(input_article)

returning:

K2-18b is the first such discovery in a planet in its star’s habitable zone. It is not too hot and not too cold for liquid water to exist on a planet that orbits a star 110 light years from Earth. Scientists from the University of Montreal and a team from UCL found water in the atmosphere of the planet. The Montreal team used data from the NASA’s Hubble telescope to assess changes in the light coming from the star as the planet passed between it and Earth.

Translation

Similarly, a translation model can easily be created to translate a sentence taken from this article into Spanish:

from transformers import pipeline
translation_pipeline = pipeline("translation_en_to_es")
translation_pipeline("They found that certain wavelengths of light, which are usually \
 absorbed by water, weakened when the planet was in the way, indicating not only \
 does K2-18b have an atmosphere, but the atmosphere contains water in vapour form.")

returning:

Encontraron que ciertas longitudes de onda de la luz, que generalmente son absorbidas por el agua, se debilitaban cuando el planeta estaba en el camino, lo que indica que no sólo K2-18b tiene una atmósfera, sino que la atmósfera contiene agua en forma de vapor.

Text generation

Text can be generated using a text generation pipeline:

from transformers import pipeline
text_generator = pipeline("text-generation")
print(text_generator("The majority of crustaceans are aquatic,", 
    max_length=64, 
    do_sample=False))

returning:

The majority of crustaceans are aquatic, meaning they live on land, rivers, and lakes. Carnivorous crustacean species, such as those found in the Pacific Northwest, are found in all parts of the world, including the United States, Canada, Australia, New Zealand, and Japan.

The next section briefly describes the architecture powering these models before diving into a comparison of the baseline Python Transformers library with the proposed Rust-based implementation: rust-bert [9].

2. Overview of models and system architecture

Summarization and Translation

Translation and summarization both rely on a similar architecture, although the model weights naturally vary from application to application. They are essentially made of:

  1. A pre-processing pipeline, mostly comprising a tokenizer (such as Byte Pair Encoding or SentencePiece/Unigram-based) and an input encoder mapping individual tokens to vocabulary indices, along with other optional inputs (such as position indices).

  2. A transformer-based model, based on an encoder-decoder architecture. If you are not familiar with Transformer-based encoder-decoder architectures, I highly recommend the blog post The Illustrated Transformer [10]. The encoder consists of a stack of self-attention and fully connected layers and encodes the input sequence (i.e. the text to be translated or summarized) into a latent space. The decoder is made of a similar stack of self-attention layers, completed with cross-attention to the encoder hidden states, allowing it to leverage the representations generated during encoding. The decoder takes as input the sequence generated so far, together with the encoder output, to generate the next token. The decoder therefore generates output tokens one at a time.

  3. A generation routine, which in its simplest form keeps calling the transformer-based model to generate tokens until the sequence is complete (output of an End Of Sequence token). Note that the encoder only needs to be run once in this iterative process: its output is cached and re-used at each decoder generation step (a minimal code sketch of this loop follows the figure below). In practice, more advanced algorithms are used to improve the quality of the generation, including beam search, sampling, and length and repetition penalties. These methods are summarized in an excellent article from Hugging Face [11]. Careful design of the decoder allows not only caching of the encoder states, but also of parts of the keys and values in the decoder, avoiding unnecessary re-computation and speeding up the decoding process.

This iterative process is illustrated at a high level in the figure below (with slight simplifications, especially for the end of generation condition):

Encoder decoder generation architecture
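
To make the control flow of this loop concrete, the snippet below is a minimal, deliberately simplified sketch in Rust. The toy_encoder and toy_decoder functions are hypothetical stand-ins for the real transformer forward passes (they are not part of rust-bert), but the structure mirrors the process above: the encoder runs once and its cached output is re-used at every decoding step.

const EOS_TOKEN: i64 = 0;
const MAX_LENGTH: usize = 32;

// Hypothetical encoder stand-in: maps input token ids to a cached representation.
fn toy_encoder(input_ids: &[i64]) -> Vec<f32> {
    input_ids.iter().map(|&id| id as f32).collect()
}

// Hypothetical decoder stand-in: given the cached encoder output and the tokens
// generated so far, returns the next token id.
fn toy_decoder(encoder_output: &[f32], generated: &[i64]) -> i64 {
    let score = encoder_output.iter().sum::<f32>() - generated.len() as f32;
    if score <= 0.0 { EOS_TOKEN } else { generated.len() as i64 }
}

// Greedy generation loop: the encoder runs once, the decoder runs once per token.
fn generate(input_ids: &[i64], bos_token: i64) -> Vec<i64> {
    let encoder_output = toy_encoder(input_ids); // computed once and cached
    let mut generated = vec![bos_token];
    while generated.len() < MAX_LENGTH {
        let next_token = toy_decoder(&encoder_output, &generated);
        generated.push(next_token);
        if next_token == EOS_TOKEN {
            break; // End Of Sequence token completes the generation
        }
    }
    generated
}

fn main() {
    println!("{:?}", generate(&[5, 3, 2], 1));
}

In rust-bert, a similar (but far more complete) routine, including beam search, sampling and the various penalties, is shared across the summarization, translation and text generation pipelines.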

This process (and, in the case of BART and Marian, the model architecture itself) is identical between translation and summarization. Only the tokenization process and the model parameters differ between the two applications, illustrating the versatility of this system.

Text generation

The process for text generation using GPT2 is very similar. However, GPT2 is a decoder-only model and does not contain the encoder part of the transformer architecture. The model uses the starting prompt (and the sequence generated so far) as its only input. While it therefore does not need to compute and cache encoder states, it still relies on an efficient caching mechanism to avoid re-computing activations that were already calculated earlier in the generation process.

Decoder generation architecture
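
As an illustration of this caching idea, the sketch below (again using toy stand-ins rather than the actual rust-bert or GPT2 code) shows a decoder-only loop in which each step only processes the newly generated token and appends its keys and values to a growing cache.

// Toy key/value cache: one entry per processed token (a real model keeps
// per-layer, per-head tensors instead of flat vectors).
struct KvCache {
    keys: Vec<f32>,
    values: Vec<f32>,
}

// Hypothetical per-token forward pass: only the new token is processed,
// the cache provides the context from all previous tokens.
fn forward_one_token(token: i64, cache: &mut KvCache) -> i64 {
    cache.keys.push(token as f32);
    cache.values.push(token as f32 * 0.5);
    // Toy next-token rule based on the cached context length.
    (cache.keys.len() % 50) as i64
}

fn generate(prompt: &[i64], max_new_tokens: usize) -> Vec<i64> {
    let mut cache = KvCache { keys: Vec::new(), values: Vec::new() };
    // Prime the cache with the prompt tokens.
    let mut next = 0;
    for &token in prompt {
        next = forward_one_token(token, &mut cache);
    }
    let mut generated = prompt.to_vec();
    for _ in 0..max_new_tokens {
        generated.push(next);
        next = forward_one_token(next, &mut cache);
    }
    generated
}

fn main() {
    println!("{:?}", generate(&[10, 11, 12], 8));
}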

On the complexity of the generation routine

For both architectures, one may note the complexity of the generation routine, which involves a significant number of operations beyond the model forward pass to improve the quality of the generated text. These improvements do not come for free: they incur an additional computational cost both in the model forward pass (beam search increases the effective batch size) and in the post-processing operations.
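
The multiplicative effect of these settings on the effective batch size is worth spelling out; the helper below is purely illustrative, but its two checks match the 6x and 25x figures quoted in the benchmarks of section 4.

// Effective batch size processed by the model during generation:
// beam search multiplies the batch by the number of beams, and generating
// several output sequences per prompt multiplies it again.
fn effective_batch_size(num_prompts: usize, num_beams: usize, num_return_sequences: usize) -> usize {
    num_prompts * num_beams * num_return_sequences
}

fn main() {
    // Translation benchmark: 6 beams, 1 output sequence -> 6x the input sentences.
    assert_eq!(effective_batch_size(10, 6, 1), 60);
    // Text generation benchmark: 5 beams, 5 output sequences -> 25x the prompts.
    assert_eq!(effective_batch_size(3, 5, 5), 75);
    println!("effective batch sizes check out");
}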

The Python Transformers library already leverages Rust-based tokenizers for all its ready-to-use pipelines, accelerating the pre-processing part of the system. Some benchmarks on question answering [9] indicate that pre-processing can amount to 20% to 30% of the processing time for simple pipelines. The post-processing, which also involves operations beyond tensor computations, is however implemented in Python in the Transformers library. The rest of this article assesses the impact of a high-performance implementation of the entire system in Rust (therefore covering the post-processing pipeline) using the rust-bert library [9].

3. A brief introduction to rust-bert

Rust-bert is essentially a Rust-native port of Hugging Face's Transformers library [1]. Leveraging the rust_tokenizers library [12] for pre-processing, it provides implementations of state-of-the-art transformer-based models and ready-to-use pipelines. De-noising auto-encoder (BERT, Electra), autoregressive (XLNet, GPT, GPT2) and encoder-decoder (BART, T5) models have been implemented, with pre-trained sets of weights available on Hugging Face's model hub [13]. Any PyTorch model trained with the Transformers library can be converted to a C-array format and used by the rust-bert library.

These models can be used in ready-to-use pipelines, including question answering, classification, token classification, summarization, translation and free text generation.

The last three (text generation) pipelines allow for a side-by-side comparison of the Python implementation (with Rust tokenizers) and the end-to-end Rust version. These three pipelines can also be instantiated in Rust in a few lines of code:

Summarization

use rust_bert::pipelines::summarization::SummarizationModel;
let summarization_model = SummarizationModel::new(Default::default())?;
// `input` holds the text(s) to summarize, here the WikiNews article
summarization_model.summarize(&input);

Translation

use rust_bert::pipelines::translation::{Language, TranslationConfig, TranslationModel};
use tch::Device;

let translation_config =
    TranslationConfig::new(Language::EnglishToSpanish, Device::cuda_if_available());
let model = TranslationModel::new(translation_config)?;

model.translate(&["They found that certain wavelengths of light, which are usually 
 absorbed by water, weakened when the planet was in the way, indicating not only 
 does K2-18b have an atmosphere, but the atmosphere contains water in vapour form."]);

Generation

use rust_bert::pipelines::text_generation::TextGenerationModel;
let model = TextGenerationModel::new(Default::default())?;
let input_context = "The majority of crustaceans are aquatic,";
model.generate(&[input_context], None);

These pipelines bring state-of-the-art NLP capabilities to the Rust community. Please check rust-bert’s repository, the associated paper [9], or reach out to me if you are interested in learning more about the capabilities of the library. The rest of this article will focus on the performance comparison of the original Python-based text generation pipelines (using the Transformers library [1]) and the proposed Rust-based implementation.

4. Benchmarks

The performance benchmarks proposed here focus on the text generation tasks. Benchmarks have also been performed for simpler pipelines (for example classification) and are available in [9]. For simple pipelines with little to no post-processing, there is little to gain from a Rust implementation: the forward pass through the neural network leverages the same backend (Rust bindings to the C++ Libtorch library [14]). Potential benefits could be gained from the pre-processing and tokenization step, but the Transformers library uses Rust-based tokenizers [15] for all its models starting from v4.0.0. The outcome is virtually identical performance between the Rust and Python implementations for tasks such as classification, token classification or question answering.

The text generation pipelines, however, do include a complex post-processing pipeline that is implemented natively in Python. Because of the iterative process alternating between a model forward pass and post-processing steps, migrating the post-processing operations to Rust and exposing them through Python bindings (as is done for the tokenizers) is more difficult. This is an area where an end-to-end Rust implementation may offer benefits. This section describes a few experiments aiming at quantifying how significant these benefits may be.

Experimental setup

The experimental setup for all experiments is unchanged and described below:

Hardware                        Software
CPU     AMD 2700X               OS       Windows 10 (Marian: Ubuntu 20.04)
GPU     Nvidia RTX 2070         CUDA     10.2
RAM     32GB                    Python   Python 3.7, Transformers v4.2.2
Drive   NVMe 970 EVO            Rust     rust-bert v0.12.1
                                C++      Opus-MT Marian Docker image


By default, experiments are run on Windows 10, with the exception of Marian, which is executed natively on Ubuntu 20.04 on the same hardware. All experiments are repeated for at least 10 iterations and the mean is reported. In all benchmarks, a warm-up run is executed first (loading the model into the GPU buffer and running a forward pass), as the first GPU buffer allocation can be significantly slower. The source code for all benchmarks is available in the references section [18].
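
The snippet below is a simplified sketch of this timing protocol (one warm-up run, then repeated timed iterations, reporting the mean); the actual benchmark code driving the rust-bert and Transformers pipelines is linked in the references [18]. The run_pipeline closure is a placeholder for the pipeline call being measured.

use std::time::Instant;

// Runs a warm-up pass, then `iterations` timed runs, and returns the mean runtime.
fn benchmark<F: FnMut()>(mut run_pipeline: F, iterations: usize) -> f64 {
    // Warm-up: the first GPU buffer allocation and forward pass can be much slower.
    run_pipeline();
    let mut total = 0.0;
    for _ in 0..iterations {
        let start = Instant::now();
        run_pipeline();
        total += start.elapsed().as_secs_f64();
    }
    total / iterations as f64
}

fn main() {
    // Placeholder workload standing in for a call such as `model.translate(...)`.
    let mean_seconds = benchmark(|| { let _sum: u64 = (1..1_000_000u64).sum(); }, 10);
    println!("mean runtime: {:.4} s", mean_seconds);
}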

Translation

Two models are used for the translation benchmark: an English to Spanish model trained by the Opus-MT team [6], and the T5-base model, which supports translation (in this case, English to French) as part of its text-to-text capabilities. For all translation tasks, the source sentences are extracted from the article introduced at the beginning of this post [8] (and are of course identical between Python and Rust).

Setting             Value
# beams             6
sampling            false
early stopping      true
output sequences    1


All sentences are processed in a single batch. To illustrate the impact of batch size and padding, both a sample of 10 sentences of varying lengths and a single sentence are passed to the models. Note that since translation is done with 6 beams, the effective batch size is 6x the number of input sequences.

The figure below shows the results of the translation benchmark with the Marian English to Spanish model. The Rust-based translation executes approximately 60% faster than its Python counterpart, regardless of the number of input sentences provided. Interestingly, the Rust and C++ (Marian) translations have the same performance, even though they do not share the same tensor operations backend (Marian uses its own optimized auto-differentiation engine [16], while the Rust version relies on bindings to the Libtorch library).

Marian Translation

The next figure illustrates the same experiment with the T5-base model (the only difference lies in the neural network architecture; the rest of the pipeline and settings remain identical). For a small effective batch size (3 input sentences), the benefits are in line with the Marian-based translation: approximately 50% reduced execution time for the Rust version. Interestingly, these benefits decrease significantly for larger effective batch sizes. This is still being investigated and may be caused by the handling of padding for sequences of varying length, or by the fact that T5 is a significantly larger model than Marian.

T5 Translation

Summarization

This section investigates the performance of summarization using models based on the BART architecture. The news article introduced at the beginning of this post is used for all experiments. Because of the higher computational cost of this pipeline, batch-wise summarization is not investigated. Note that since beam search is used, the effective batch size is still equal to the number of beams. To illustrate the impact of model size, 3 different models (one baseline model and two distilled models) are used for the benchmarks.

The experimental set-up remains similar to that of the translation benchmark, using beam search and no sampling:

Setting             Value
# beams             3
sampling            false
early stopping      true
output sequences    1


The figure below shows the execution time of each model for both Python and Rust. The Rust-based summarization consistently runs approximately 2x faster than its Python counterpart. The Rust-based version of BART-large (406M parameters) runs faster than the smallest distilled model (230M parameters) in Python, showing that while research efforts aimed at reducing model size are critical to ensure a sustainable use of NLP technologies, engineering-driven improvements can reach comparable speed-ups. More interestingly, this consistent 50%+ reduction in execution time shows that the two approaches are complementary: combining distillation with an implementation in Rust, the summarization of a document can be accelerated by a factor of close to 5, from 2.57s down to less than 600ms.

Summarization

Text generation

The last experiment investigates the performance benefits of a Rust implementation for free text generation using a GPT2 language model. The architecture of this pipeline is slightly different (it relies on a decoder-only model) and the model is significantly smaller than for the previous pipelines. Sampling is turned on for this pipeline. To ensure a fair comparison (sampling leads to non-deterministic output), the output sequence size is fixed to 64 tokens by setting both the minimum and maximum sequence lengths to this value. 5 output sequences with 5 beams are generated for each prompt, leading to an effective batch size equal to 25x the number of prompts. The text generation benchmark is illustrated for both a single prompt and 3 prompts processed in a single batch.

Setting             Value
# beams             5
sampling            true
early stopping      true
output sequences    5
sequence length     64


The benefits of the Rust implementation for text generation are significantly higher than in the previous experiments, with the Rust pipeline running roughly 4x faster than its Python equivalent. Note that these experiments do not use the ready-to-use pipeline from the Transformers library, as it does not yet support batched inputs. Instead, the inputs for the Python experiment have been manually encoded and padded to the left with <eos> tokens. The inputs are then processed in a single batch in both the Python and Rust pipelines to allow for a fair comparison.

The text generation pipeline uses sampling, while the previous two experiments covering summarization and translation did not (a deterministic behaviour is expected in those cases). This additional post-processing operation is likely to be the cause of the significantly larger performance difference between Python and Rust. For validation purposes, sampling was turned off for both frameworks and the experiment was repeated. The results without sampling confirm this assumption: Rust is approximately 2x faster than Python for deterministic generation, in line with the previous benchmarks.

This last experiment provides two insights: first, the Rust implementation maintains its roughly 2x advantage for deterministic generation; second, the gap widens further (to roughly 4x here) when post-processing operations such as sampling represent a larger share of the pipeline.

Text generation

Final thoughts

These results highlight the potential of high-performance languages, including Rust, for serving text generation models at low latency. With performance benefits in line with C++, together with its memory safety, safe concurrency and accessibility to machine learning engineers, Rust is a powerful additional choice for the deployment of performant, machine-learning powered applications.

Research efforts aimed at reducing the computational cost of deep learning models have translated into significant gains in execution speed, at only a marginal cost in model accuracy. The proposed Rust implementation synergizes very well with this work: while techniques such as distillation or quantization are effective at reducing the cost of the forward pass through the neural network, a Rust implementation can significantly speed up the auxiliary operations (whose relative cost increases as the neural network gets optimized). Together with Rust's safe concurrency capabilities, the combination of these techniques enables a significant acceleration of text generation pipelines using state-of-the-art models.

last revision: 2021/01/31, updated benchmark with Transformers v4.2.2

References