A team of researchers from Meta, Stanford University, and the University of Washington has introduced three new methods that significantly speed up generation in the Byte Latent Transformer (BLT) – a language model architecture that operates directly on raw bytes rather than tokens.
Byte-Level Models Are Slow at Generation
To understand what this new research solves, you need to understand the tradeoffs at the heart of byte-level language modeling.
Most language models today work on tokens – fragments of text produced by subword tokenizers such as byte-pair encoding (BPE). A token typically represents several characters or even an entire word. Although efficient, tokenization comes with known drawbacks: sensitivity to input noise, poor handling of multilingual text, weak character-level understanding, and fragility on structured inputs such as code and numbers.
Byte-level models sidestep all this by working directly on raw bytes – the lowest-level representation of text. The Byte Latent Transformer (BLT) was a major step forward: it matched the performance of token-based models by dynamically grouping bytes into variable-length chunks called patches using an entropy-based segmentation strategy. High-entropy (hard-to-predict) regions get smaller patches; more predictable spans get longer ones. Most computation happens on latent patch representations rather than raw bytes, using three components – a local encoder, a large global transformer, and a local decoder – with an average patch size of 4 bytes and a maximum of 8.
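To make the patching idea concrete, here is a minimal sketch of entropy-based segmentation. It assumes a hypothetical `next_byte_entropy` oracle (in BLT this role is played by a small byte-level language model) and an illustrative threshold; it is not the paper's implementation.

```python
from typing import Callable, List

def entropy_patch_boundaries(
    byte_seq: bytes,
    next_byte_entropy: Callable[[bytes, int], float],  # hypothetical entropy oracle
    threshold: float = 1.5,                             # illustrative value, not from the paper
    max_patch_len: int = 8,
) -> List[int]:
    """Return the start index of each patch: open a new patch whenever the
    next byte is hard to predict (entropy above the threshold) or the current
    patch reaches the maximum length."""
    boundaries = [0]
    patch_len = 0
    for i in range(1, len(byte_seq)):
        patch_len += 1
        if next_byte_entropy(byte_seq, i) > threshold or patch_len >= max_patch_len:
            boundaries.append(i)
            patch_len = 0
    return boundaries
```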
The remaining problem is inference speed. Even with BLT's hierarchical design, the local decoder still generates one byte at a time autoregressively. Because a typical subword token corresponds to multiple bytes, BLT needs multiple decoder forward passes to produce the same amount of text a token-level model emits in a single step. In modern LLM serving, the bottleneck is often not compute but memory bandwidth – the repeated loading of model weights and key-value caches from memory. More decoder forward passes mean more memory traffic, which translates directly into slower generation.
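As a rough, purely illustrative back-of-envelope calculation (the decoder size below is a hypothetical placeholder, not a figure from the paper): if every decoder forward pass has to reload the decoder's weights, generating a 4-byte patch byte by byte moves roughly four times as much weight data as emitting the whole patch in a single pass.

```python
# Rough illustration: memory traffic from reloading the local decoder's weights.
decoder_params = 0.4e9      # hypothetical local-decoder size (not from the paper)
bytes_per_param = 2         # 16-bit precision
avg_patch_bytes = 4         # BLT's reported average patch size

# Standard BLT: one decoder forward pass (one weight reload) per generated byte.
per_patch_traffic_gb = avg_patch_bytes * decoder_params * bytes_per_param / 1e9

# If the decoder could emit a whole patch in one pass, the weight traffic shrinks
# by the same factor.
single_pass_traffic_gb = decoder_params * bytes_per_param / 1e9

print(per_patch_traffic_gb, single_pass_traffic_gb)   # 3.2 GB vs 0.8 GB per patch
```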

Three Methods, One Goal: Fewer Forward Passes
The research team introduces three techniques that attack this bottleneck, each trading generation speed against output quality in a different way.
BLT Diffusion (BLT-D)
This is the main contribution and the fastest variant. The core idea is to replace autoregressive byte-by-byte decoding with block-wise discrete diffusion in the local decoder.
During training, the decoder receives two inputs: a clean byte sequence (the original text) and a corrupted sequence divided into fixed-length byte blocks. For each block, a continuous diffusion timestep t is sampled from U(0,1), and each byte in the block is independently replaced with a [MASK] token with probability t. The degree of masking therefore varies across training examples – a low t leaves most bytes visible; a high t hides most of them. The block size B (set to 4, 8, or 16 bytes in the experiments) typically exceeds BLT's average patch size of 4 bytes, which teaches the decoder to predict further into the future than usual. The total training loss combines the standard autoregressive next-byte prediction loss on clean sequences with a masked-byte prediction loss on corrupted blocks – conceptually similar to masked language modeling in BERT, but applied at the byte level within BLT's hierarchical architecture.
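A minimal sketch of this block-wise corruption, assuming byte ids in a 1-D PyTorch tensor and a hypothetical `mask_id`; the actual training pipeline may differ from this simplified version.

```python
import torch

def corrupt_blocks(byte_ids: torch.Tensor, block_size: int, mask_id: int) -> torch.Tensor:
    """Block-wise corruption for the diffusion objective (illustrative sketch).

    For each fixed-length block, sample a timestep t ~ U(0, 1) and replace each
    byte in that block with the [MASK] id independently with probability t.
    `byte_ids` is assumed to be a 1-D tensor of byte token ids.
    """
    corrupted = byte_ids.clone()
    for start in range(0, byte_ids.numel(), block_size):
        end = min(start + block_size, byte_ids.numel())
        t = torch.rand(()).item()                  # block-level diffusion timestep
        mask = torch.rand(end - start) < t         # per-byte masking decisions
        corrupted[start:end][mask] = mask_id       # assignment through the view updates `corrupted`
        # (a low t leaves most bytes visible; a high t hides most of them)
    return corrupted
```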
At inference time, BLT-D starts each block with every position set to [MASK] and iteratively unmasks multiple byte positions per decoder step using one of two strategies: confidence-based unmasking (unmask positions whose predicted probability exceeds a threshold α) or entropy-bounded (EB) sampling (choose the largest subset of positions whose cumulative entropy stays below a threshold γ). Both strategies produce multiple bytes per forward pass instead of one. The encoder and global model – the expensive components of BLT – are invoked once per block rather than once per patch, reducing total model calls. BLT-D also remains compatible with KV caching, which further trims the memory-bandwidth cost.
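The two unmasking rules can be sketched as follows, given the decoder's per-position byte distributions for the still-masked positions of a block (the tensor shapes and thresholds here are illustrative, not the paper's implementation).

```python
import torch

def confidence_unmask(probs: torch.Tensor, alpha: float) -> torch.Tensor:
    """Confidence-based unmasking (sketch): reveal every still-masked position
    whose top predicted byte probability exceeds the threshold alpha.
    `probs` has shape [block_size, 256] (per-position byte distributions)."""
    top_p, _ = probs.max(dim=-1)
    return top_p > alpha                       # boolean mask of positions to unmask

def entropy_bounded_unmask(probs: torch.Tensor, gamma: float) -> torch.Tensor:
    """Entropy-bounded (EB) unmasking (sketch): pick the largest set of positions,
    lowest-entropy first, whose cumulative entropy stays below gamma."""
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    order = entropy.argsort()                  # most confident positions first
    cumulative = entropy[order].cumsum(dim=0)
    keep = torch.zeros_like(entropy, dtype=torch.bool)
    keep[order[cumulative <= gamma]] = True
    return keep
```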
At 3B parameters, BLT-D-4 (block size 4) nearly matches BLT's task scores while requiring less than half the memory bandwidth. BLT-D-16 (block size 16) achieves an 87–92% reduction in estimated memory-bandwidth cost compared to BLT, making it the fastest configuration evaluated – albeit with lower pass@1 scores on coding benchmarks (HumanEval, MBPP).
BLT Self-Speculation (BLT-S)
BLT-S takes a different path: speculative decoding – a technique in which a cheap draft model proposes tokens and a larger model verifies them in parallel. What makes BLT-S unusual is that it requires no separate draft model, no architectural changes, and no additional training. It reuses BLT's existing lightweight local decoder as the drafter.
In standard BLT inference, the decoder stops generating whenever the entropy-based patcher decides that a new patch boundary has been reached – typically every four bytes. BLT-S instead lets the decoder keep drafting bytes past these entropy spikes, conditioning on the last available latent patch representation, up to a fixed window size k (8 or 16 bytes in the experiments). After a draft of k bytes is generated, the full model re-encodes the candidate sequence through the encoder, global model, and decoder and produces next-byte predictions in parallel. Drafted bytes are accepted until the first mismatch; the first mismatched byte is replaced with the verified prediction.
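The acceptance rule is the standard speculative-decoding one; here is a minimal sketch over raw byte strings (the surrounding drafting and re-encoding logic is omitted).

```python
def accept_draft(draft: bytes, verified: bytes) -> bytes:
    """Speculative acceptance rule (sketch): keep drafted bytes until the first
    position where the full model's verified prediction disagrees, then emit the
    verified byte at that position and stop."""
    out = bytearray()
    for d, v in zip(draft, verified):
        if d == v:
            out.append(d)
        else:
            out.append(v)     # replace the first mismatch with the verified byte
            break
    return bytes(out)
```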
Under greedy decoding, this process guarantees that the verified output is identical to standard autoregressive BLT decoding – no quality loss. BLT-S slightly increases the number of decoder forward passes but substantially reduces encoder and global-model calls. At 3B parameters with k=16, BLT-S achieves up to a 77% memory-bandwidth reduction with no loss in performance.
BLT Diffusion+Verification (BLT-DV)
BLT-DV sits in the middle. Because BLT-D is trained with both a diffusion objective and a standard next-byte prediction objective, the same model can be run autoregressively by applying a causal decoder mask to the same weights – no separate model and no additional training are required. BLT-DV exploits this: diffusion first drafts a block of bytes, then an autoregressive forward pass verifies the draft, accepting bytes until the first mismatch. Empirically, one-step diffusion drafting combined with verification yields the fastest BLT-DV configuration. One-step diffusion alone typically degrades generation quality quickly, but the verification step effectively prevents this. At 3B parameters, BLT-DV achieves up to an 81% memory-bandwidth reduction compared to BLT.
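Conceptually, one BLT-DV generation step looks like the sketch below. `model.diffusion_denoise` and `model.causal_predict` are hypothetical method names standing in for the diffusion pass and the causal (autoregressive) pass over the same weights, and `accept_draft` is the acceptance rule sketched above.

```python
def blt_dv_step(model, context: bytes, block_size: int, mask_id: int) -> bytes:
    """One BLT-DV step (sketch, hypothetical `model` API):
    1) draft a whole block with a single diffusion denoising pass,
    2) verify the draft with one causal forward pass of the same weights,
    3) accept drafted bytes up to the first mismatch."""
    draft = model.diffusion_denoise(context, num_bytes=block_size, steps=1)
    verified = model.causal_predict(context, draft)   # parallel next-byte predictions
    return accept_draft(draft, verified)
```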
Understanding the Numbers
All models were trained on the BLT-1T dataset (1 trillion tokens from public sources, including a subset of Datacomp-LM), with the 1B-parameter model trained for 240,000 steps and the 3B-parameter model trained for 480,000 steps. The evaluation included four generation tasks: French-to-English and German-to-English translation using the FLORES-101 benchmark (4-shot, SentencePiece BLEU) and two coding benchmarks – HumanEval (0-shot, pass@1) and MBPP (3-shot, pass@1).
Beyond generation tasks, the research team also evaluated BLT-D on five likelihood-based benchmarks: ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU. Because BLT-D is trained with the diffusion objective alongside the next-byte prediction objective, it can compute autoregressive likelihoods by applying a causal mask to the decoder – the same mechanism the verification step of BLT-DV relies on. The results show that the BLT-D variants score close to the BLT baseline on all five benchmarks, confirming that adding block diffusion does not compromise the model's autoregressive modeling ability.
Efficiency is reported through three proxy metrics: decoder network function evaluations (NFEs), encoder/global-model NFEs, and an estimated memory-bandwidth cost in gigabytes derived from parameter counts and forward-pass counts under 16-bit precision. The research team is clear that these are proxy metrics – converting NFE reductions into real wall-clock gains requires highly optimized inference implementations, which they identify as the most important direction for future work.
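Under those assumptions, the memory-bandwidth proxy can be computed as below; the component sizes and pass counts in the usage example are placeholders, not the paper's numbers.

```python
def estimated_memory_bandwidth_gb(param_counts: dict, forward_passes: dict,
                                  bytes_per_param: int = 2) -> float:
    """Proxy memory-bandwidth estimate (sketch): each forward pass of a component
    reloads that component's weights, so cost ~ sum(params * passes * 2 bytes)
    under 16-bit precision."""
    total_bytes = sum(param_counts[name] * forward_passes[name] * bytes_per_param
                      for name in param_counts)
    return total_bytes / 1e9

# Illustrative usage with made-up component sizes and pass counts:
params = {"encoder": 0.1e9, "global": 2.5e9, "decoder": 0.4e9}
passes = {"encoder": 32, "global": 32, "decoder": 128}
print(estimated_memory_bandwidth_gb(params, passes))
```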
Translation tasks benefit most from BLT-D across all block sizes. Coding tasks are more sensitive to block size: BLT-D-16 delivers the largest efficiency gains but shows a drop in pass@1 scores on HumanEval and MBPP. A notable additional finding comes from the generation-diversity analysis: when using entropy-bounded sampling with top-p sampling at inference, more decoder NFEs correlate with higher type-token ratios (a measure of lexical diversity). In other words, the efficiency-diversity tradeoff can be tuned at inference time without any retraining.
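For reference, the type-token ratio is simply the number of unique tokens divided by the total number of tokens; a whitespace-tokenized sketch:

```python
def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (higher = more lexically diverse).
    Whitespace tokenization here is a simplification."""
    tokens = text.split()
    return len(set(tokens)) / max(len(tokens), 1)
```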

Key Takeaways
- BLT-D introduces block-wise discrete diffusion into BLT’s local decoder, training with a combined next-byte prediction and masked-byte prediction loss so it can generate multiple bytes per forward pass instead of one at a time.
- BLT-S uses BLT’s own lightweight decoder as a speculative drafter – no separate model, no architectural changes, no additional training – and produces the same output as standard BLT under greedy decoding.
- BLT-DV combines diffusion drafting with an autoregressive verification step using the same BLT-D model weights, recovering the quality lost by diffusion decoding alone without any additional training.
- All three methods reduce the estimated memory-bandwidth cost by at least 50% relative to BLT on generation tasks; BLT-D-16 reaches an 87–92% reduction.
- BLT-D’s autoregressive capability remains robust on likelihood-based benchmarks (ARC-Easy, ARC-Challenge, PIQA, HellaSwag, MMLU), and its generation diversity is tunable at inference time via the entropy-bounded sampling threshold.
Check out the paper for full details.