Byte Pair Encoding and Data Structures

Tokenization of input strings into sequences of words or sub-tokens is a central concept in modern Natural Language Processing (NLP). This article focuses on a classic tokenization algorithm: Byte Pair Encoding...

Accelerating text generation with Rust

Over the past few months, text generation capabilities using Transformer-based models have been democratized by open-source efforts such as Hugging Face’s Transformers [1] library. A broad range of models and...

A Rust SentencePiece implementation

Abstract: Tokenization into words or sub-word units is a key component of the Natural Language Processing pipeline. Modern approaches such as Byte Pair Encoding (Sennrich et al., 2015), WordPiece, or SentencePiece...