Byte Pair Encoding and Data Structures

Tokenization of input strings into sequences of words or sub-tokens is a central concept in modern Natural Language Processing (NLP). This article focuses on a classic tokenization algorithm: Byte Pair Encoding...

Accelerating text generation with Rust

Over the past few months, text generation capabilities using Transformer-based models have been democratized by open-source efforts such as Hugging Face’s Transformers [1] library. A broad range of models and...

A Rust SentencePiece implementation

Abstract: Tokenization into words or sub-word units is a key component of the Natural Language Processing pipeline. Modern approaches such as Byte Pair Encoding (Sennrich et al., 2015), WordPiece, or SentencePiece...