Byte Pair Encoding and Data Structures
Tokenization of input strings into sequences of words or sub-tokens is a central concept for modern Natural Language Processing techniques (NLP). This article focuses on a classic tokenization algorithm: Byte...
Accelerating text generation with Rust
Over the past few months, text generation capabilities using Transformer-based models have been democratized by open-source efforts such as Hugging Face’s Transformers [1] library. A broad range of models and...
A Rust SentencePiece implementation
Abstract Tokenization into words or sub-word units is a key component of Natural Language Processing pipeline. Modern approaches such as Byte Pair Encoding (Sennrich et al., 2015), WordPiece or SentencePiece...
1