Can we represent long texts as images and use a VLM to achieve 3-4× token compression while preserving accuracy, scaling a 128K context toward 1M-token workloads? A team of researchers from Zhipu AI released Glyph, an AI framework for scaling context length through visual-text compression. It renders long text sequences into images and processes them with vision-language models. The system renders ultra-long text into page images, then a vision-language model (VLM) processes those pages end to end. Each visual token encodes multiple characters, so the effective token sequence becomes shorter while the semantics are preserved. Glyph achieves 3-4× token compression on long text sequences without performance degradation, yielding significant gains in memory efficiency, training throughput, and inference speed.
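The core idea can be sketched in a few lines: render the text onto a page image, then count the visual patches a VLM would consume instead of text tokens. This is a minimal illustration, not the paper's pipeline; the page size, font, and 28-pixel patch size are assumptions for the sketch.

```python
# Minimal sketch of Glyph's core move: render text into a page image, which a
# VLM would then patchify into visual tokens. All rendering parameters here
# (page size, default bitmap font, 28px patches) are illustrative assumptions.
from PIL import Image, ImageDraw

def render_page(text: str, width: int = 800, height: int = 1000,
                line_height: int = 14, chars_per_line: int = 100) -> Image.Image:
    """Render text onto a single white page image with PIL's default font."""
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    y = 10
    for i in range(0, len(text), chars_per_line):
        draw.text((10, y), text[i:i + chars_per_line], fill="black")
        y += line_height
        if y > height - line_height:
            break  # this sketch stops after one page
    return page

page = render_page("long context " * 500)

# A VLM turns the page into visual tokens by patchifying it,
# e.g. one token per 28x28-pixel patch.
patch = 28
visual_tokens = (page.width // patch) * (page.height // patch)
```

Because one rendered line packs ~100 characters into a row of patches, the page's visual token count is largely independent of how many characters each line holds, which is where the compression comes from.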

Why Glyph?
Traditional methods extend positional encodings or modify attention, but compute and memory still scale with token count. Retrieval shrinks the input but risks dropping evidence and adds latency. Glyph changes the representation: it converts text to images and shifts the burden to the VLM, which already learns OCR, layout, and reasoning. This raises the information density per token, so a fixed token budget covers more of the original context. Under extreme compression, the research team shows that a 128K-context VLM can address tasks that originate from 1M-token-level text.
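The token-budget arithmetic behind this claim is simple. As a hedged back-of-envelope (the characters-per-token figures below are hypothetical, not the paper's measurements):

```python
# Back-of-envelope arithmetic for "more characters per token". Assumes a
# hypothetical ~4 characters per text token and ~12 characters captured per
# visual token after rendering; the real ratios are task- and DPI-dependent.
def effective_compression(n_chars: int,
                          chars_per_text_token: float = 4.0,
                          chars_per_visual_token: float = 12.0) -> float:
    text_tokens = n_chars / chars_per_text_token
    visual_tokens = n_chars / chars_per_visual_token
    return text_tokens / visual_tokens

ratio = effective_compression(1_000_000)   # 3.0x under these assumptions
coverage = int(128_000 * ratio)            # a 128K budget covers ~384K tokens
```

At the paper's reported 3-4× ratios, the same 128K visual-token budget would cover roughly 384K-512K text tokens' worth of content, which is how extreme compression pushes toward 1M-token workloads.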

System Design and Training
The method has three stages: continual pre-training, LLM-driven rendering search, and post-training. Continual pre-training exposes the VLM to large corpora of long text rendered with diverse typography and styles; the goal is to align visual and textual representations and transfer long-context skills from text tokens to visual tokens. The rendering search is an LLM-driven genetic loop that varies page size, DPI, font family, font size, line height, alignment, indentation, and spacing, evaluating candidates on a validation set to jointly optimize accuracy and compression. Post-training applies supervised fine-tuning and reinforcement learning with Group Relative Policy Optimization (GRPO), along with an auxiliary OCR alignment task. The OCR loss improves character fidelity when fonts are small and spacing is tight.
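The rendering search stage can be sketched as a plain genetic loop over typography configs. Everything below is a stand-in: the parameter ranges, the mutation policy, and especially the scoring function (in Glyph, an LLM proposes candidates and each one is scored on a real validation set for accuracy and compression jointly; here a toy proxy replaces that evaluation).

```python
# Sketch of the LLM-driven rendering search as a simple genetic loop.
# PARAMS, mutate(), and score() are illustrative stand-ins, not Glyph's actual
# search space or objective.
import random

PARAMS = {
    "dpi": [72, 96, 120],
    "font_size": [8, 10, 12, 14],
    "line_height": [1.0, 1.2, 1.5],
}

def random_config(rng):
    return {k: rng.choice(v) for k, v in PARAMS.items()}

def mutate(cfg, rng):
    """Resample one rendering parameter (an LLM proposes edits in Glyph)."""
    child = dict(cfg)
    key = rng.choice(list(PARAMS))
    child[key] = rng.choice(PARAMS[key])
    return child

def score(cfg):
    # Toy proxy for the real validation-set evaluation: smaller fonts and
    # lower DPI compress more but cost accuracy.
    compression = (14.0 / cfg["font_size"]) * (96.0 / cfg["dpi"])
    accuracy = 1.0 - 0.03 * (14 - cfg["font_size"]) - 0.001 * (120 - cfg["dpi"])
    return 0.5 * accuracy + 0.5 * min(compression, 4.0) / 4.0

def search(generations=20, pop_size=8, seed=0):
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the best half
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in survivors]    # refill via mutation
    return max(pop, key=score)

best = search()
```

The design choice worth noting is the joint objective: a config that maximizes compression alone would pick the smallest font and lowest DPI, so accuracy must sit in the same fitness function to keep the search honest.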

Results: Performance and Efficiency
LongBench and MRCR establish accuracy and compression under long dialogue histories and document tasks. The model achieves an average effective compression ratio of about 3.3 on LongBench, with some tasks closer to 5, and about 3.0 on MRCR. These gains grow with longer inputs, since each visual token carries more characters. Relative to the text backbone at 128K input, the reported speedups are about 4.8× for prefill, about 4.4× for decoding, and about 2× for supervised fine-tuning throughput. The Ruler benchmark confirms that higher DPI at inference time improves scores, since crisper glyphs aid OCR and layout parsing. The research team reports DPI 72 with average compression 4.0 and maximum 7.7, DPI 96 with average 2.2 and maximum 4.4, and DPI 120 with average 1.2 and maximum 2.8 on specific subtasks. The 7.7 maximum comes from Ruler, not MRCR.
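The DPI-versus-compression trade-off follows directly from pixel counting: at a fixed physical page size, higher DPI means more pixels, hence more patches, hence more visual tokens for the same text. The page size (8.5×11 in) and 28-pixel patch below are assumptions for illustration, not values from the paper.

```python
# Why higher DPI lowers compression: more pixels per page -> more patches ->
# more visual tokens for the same text. Page size and patch size are assumed.
def visual_tokens(dpi: int, page_inches=(8.5, 11.0), patch: int = 28) -> int:
    w = int(dpi * page_inches[0])
    h = int(dpi * page_inches[1])
    return (w // patch) * (h // patch)

tokens_72 = visual_tokens(72)    # coarser render, fewer tokens
tokens_120 = visual_tokens(120)  # crisper render, ~3x more tokens
```

This mirrors the reported pattern: DPI 72 maximizes compression (average 4.0) while DPI 120 trades compression (average 1.2) for the crisper glyphs that lift Ruler scores.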

So What? Applications
Glyph also benefits multimodal document understanding. Training on rendered pages improves performance on MMLongBench-Doc relative to the base visual model, indicating that rendered pre-training is a useful proxy for real document tasks involving figures and layout. The main failure mode is sensitivity to aggressive typography: fonts that are too small and character spacing that is too tight degrade accuracy, especially on rare alphanumeric strings, which is why the research team excluded the UUID subtask on Ruler. The approach also assumes server-side rendering and a VLM with strong OCR and layout priors.
Key Takeaways
- Glyph renders long text into images, then a vision-language model processes those pages. It reframes long-context modeling as a multimodal problem, preserving semantics while reducing tokens.
- The research team reports 3-4× token compression with accuracy comparable to a strong 8B text baseline on long-context benchmarks.
- The prefill speedup is approximately 4.8×, the decoding speedup approximately 4.4×, and supervised fine-tuning throughput approximately 2× higher, measured on 128K inputs.
- The system uses continual pre-training on rendered pages, an LLM-driven genetic search over rendering parameters, then supervised fine-tuning and reinforcement learning with GRPO, plus an OCR alignment objective.
- The evaluation includes LongBench, MRCR, and Ruler, with, in an extreme case, a 128K-context VLM addressing 1M-token-level tasks. The code and model cards are public on GitHub and Hugging Face.
Glyph treats long-context scaling as visual-text compression: it renders long sequences into images and lets the VLM process them, reducing tokens while preserving semantics. The research team claims 3-4× token compression with accuracy comparable to a Qwen3-8B baseline, approximately 4× faster prefill and decoding, and approximately 2× faster SFT throughput. The pipeline is disciplined: continual pre-training on rendered pages, an LLM-driven genetic rendering search over typography, then post-training. The approach is practical for million-token workloads under extreme compression, though it depends on OCR fidelity and typography choices, which remain knobs to tune. Overall, visual-text compression offers a solid way to scale long context while controlling compute and memory.
Check out the paper, weights, and repo on GitHub and Hugging Face.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understood by a wide audience. The platform boasts over 2 million monthly views.