NVIDIA has just released Dynamo v0.9.0, the most significant infrastructure upgrade to date for the distributed inference framework. The update simplifies how models are deployed and managed at large scale, with a focus on removing heavy dependencies and improving how GPUs handle multi-modal data.
The Big Simplification: Removing NATS and etcd
The biggest change in v0.9.0 is the removal of NATS and etcd. In previous versions, these tools handled service discovery and messaging, but they added an ‘operational tax’ by requiring developers to manage additional clusters.
NVIDIA replaced them with a new event plane and discovery plane. The system now uses ZMQ (ZeroMQ) for high-performance transport and MessagePack for data serialization. For teams running on Kubernetes, Dynamo now supports Kubernetes-native service discovery. This change makes the infrastructure more flexible and easier to maintain in production.
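As a rough illustration of that pattern (not Dynamo's actual internal API; the topic name and event schema below are invented), here is what a ZMQ publish/subscribe channel with MessagePack serialization looks like in practice:

```python
# Minimal ZMQ + MessagePack pub/sub sketch of an "event plane".
# Illustrative only: the topic name and event payload are hypothetical.
import time

import msgpack
import zmq

ctx = zmq.Context()

# Publisher: e.g., a worker announcing its state on the event plane.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5555")

# Subscriber: e.g., a router listening for worker events.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5555")
sub.setsockopt(zmq.SUBSCRIBE, b"worker.events")

time.sleep(0.2)  # let the subscription propagate (ZMQ "slow joiner")

event = {"worker_id": "gpu-3", "kv_blocks_free": 1024, "status": "ready"}
pub.send_multipart([b"worker.events", msgpack.packb(event)])

topic, payload = sub.recv_multipart()
print(topic.decode(), msgpack.unpackb(payload))
```

Unlike a NATS or etcd deployment, this kind of channel is just a library inside the processes that need it, which is exactly the operational simplification the release is aiming for.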
Multi-Modal Support and the E/P/D Split
Dynamo v0.9.0 extends multi-modal support across all three main backends: vLLM, SGLang, and TensorRT-LLM. This allows models to process text, images, and videos more efficiently.
A major feature of this update is the E/P/D (Encode/Prefill/Decode) split. In a standard setup, a single GPU often handles all three stages, which can cause stalls during heavy video or image processing. v0.9.0 introduces encoder separation: you can now run the encoder on a different set of GPUs than the prefill and decode workers, scaling each tier of hardware to the specific needs of your model, as the sketch below illustrates.
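Here is a toy sketch of that idea: three decoupled worker tiers connected by queues. The worker names and request fields are invented for illustration, and in Dynamo each tier would be a separate pool of GPU workers rather than threads in one process.

```python
# Toy E/P/D pipeline: encode, prefill, and decode run as independent workers,
# so a heavy encode job never stalls token generation. Names are hypothetical.
import queue
import threading
import time

encode_q, prefill_q, decode_q = queue.Queue(), queue.Queue(), queue.Queue()

def encode_worker():
    # Own GPU pool: turns images/video into embeddings.
    while (req := encode_q.get()) is not None:
        req["embeddings"] = f"emb({req['image']})"  # placeholder compute
        prefill_q.put(req)

def prefill_worker():
    # Builds the KV cache from the prompt plus embeddings.
    while (req := prefill_q.get()) is not None:
        req["kv_cache"] = f"kv({req['prompt']})"  # placeholder compute
        decode_q.put(req)

def decode_worker():
    # Streams output tokens from the prefilled KV cache.
    while (req := decode_q.get()) is not None:
        print(f"request {req['id']}: decoding with {req['kv_cache']}")

for fn in (encode_worker, prefill_worker, decode_worker):
    threading.Thread(target=fn, daemon=True).start()

encode_q.put({"id": 1, "image": "frame.png", "prompt": "Describe this image."})
time.sleep(0.5)  # give the pipeline time to drain before exiting
```

Because each stage pulls from its own queue, you can provision, say, two encode GPUs per eight decode GPUs instead of forcing every GPU to do all three jobs.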
Sneak Preview: FlashIndexer
This release includes a sneak preview of FlashIndexer, a component designed to solve latency issues in distributed KV cache management.
When working with large context windows, transferring key-value (KV) data between GPUs is slow. FlashIndexer improves how the system indexes and retrieves cached tokens, lowering Time to First Token (TTFT). Although still a preview, it represents a major step toward making distributed inference as fast as local inference.
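FlashIndexer's internals are not described in the release, so the following is only a generic sketch of the kind of prefix-hash index a KV-cache-aware router uses to find which worker already holds the longest cached prefix of an incoming request:

```python
# Illustrative prefix-hash KV index; not FlashIndexer's actual design.
# Each block of tokens is hashed (chained with its prefix) and mapped to the
# worker that holds its KV cache, so routing can reuse the longest match.
import hashlib

class KVPrefixIndex:
    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.blocks: dict[str, str] = {}  # chained block hash -> worker id

    def _hashes(self, tokens: list[int]):
        h = hashlib.sha256()
        for i in range(0, len(tokens) - self.block_size + 1, self.block_size):
            h.update(str(tokens[i:i + self.block_size]).encode())
            yield h.hexdigest()

    def register(self, tokens: list[int], worker: str) -> None:
        for digest in self._hashes(tokens):
            self.blocks[digest] = worker

    def longest_match(self, tokens: list[int]):
        # Returns (number of cached blocks, worker) for the longest prefix.
        best = (0, None)
        for n, digest in enumerate(self._hashes(tokens), start=1):
            if digest not in self.blocks:
                break
            best = (n, self.blocks[digest])
        return best

index = KVPrefixIndex()
index.register(list(range(64)), worker="gpu-0")
print(index.longest_match(list(range(48))))  # -> (3, 'gpu-0')
```

The faster and more precisely this lookup runs, the fewer tokens need to be re-prefilled or shipped between GPUs, which is where the TTFT savings come from.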
Smart Routing and Load Estimation
Managing traffic across hundreds of GPUs is difficult. Dynamo v0.9.0 introduces a smarter planner that uses predictive load estimation.
The system uses a Kalman filter to predict the future load of a request based on past performance (see the sketch below). It also supports routing signals from the Kubernetes Gateway API Inference Extension (GAIE), allowing the network layer to communicate directly with the inference engine. If a specific GPU cluster is overloaded, the system can route new requests to idle workers with high precision.
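For intuition, here is a minimal one-dimensional Kalman filter smoothing noisy utilization samples into a load estimate. The constants and the load signal are invented; this is the textbook technique, not Dynamo's published planner code.

```python
# 1-D Kalman filter sketch for smoothing/predicting per-worker load.
# Variances and the sample stream below are made up for illustration.
class LoadKalman:
    def __init__(self, process_var: float = 1e-2, measure_var: float = 0.25):
        self.x = 0.0          # estimated load (e.g., fraction of KV blocks used)
        self.p = 1.0          # estimate uncertainty
        self.q = process_var  # how fast true load drifts between samples
        self.r = measure_var  # how noisy each measurement is

    def update(self, measured_load: float) -> float:
        # Predict step: uncertainty grows between measurements.
        self.p += self.q
        # Update step: blend the prediction with the new noisy measurement.
        k = self.p / (self.p + self.r)       # Kalman gain
        self.x += k * (measured_load - self.x)
        self.p *= (1 - k)
        return self.x                        # smoothed load estimate

kf = LoadKalman()
for raw in [0.30, 0.42, 0.38, 0.55, 0.90, 0.85]:  # noisy utilization samples
    print(f"raw={raw:.2f}  est={kf.update(raw):.2f}")
```

A scheduler that routes on the filtered estimate reacts to genuine trends rather than chasing every momentary spike in a raw utilization metric.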
Tech Stack at a Glance
The v0.9.0 release updates many core components to their latest stable versions. Here are the details of the supported backends and libraries:
| Component | Version |
| --- | --- |
| vLLM | v0.14.1 |
| SGLang | v0.5.8 |
| TensorRT-LLM | v1.3.0rc1 |
| NIXL | v0.9.0 |
| Rust core | dynamo-token crate |
The inclusion of the dynamo-token crate, written in Rust, ensures that token handling remains fast. For data transfer between GPUs, Dynamo continues to leverage NIXL (NVIDIA Inference Xfer Library) for RDMA-based communication.
Key Takeaways
- Infrastructure Decoupling (Goodbye NATS and etcd): The release completes the modernization of the communications architecture. By replacing NATS and etcd with a new event plane (built on ZMQ and MessagePack) and Kubernetes-native service discovery, the system removes the ‘operational tax’ of managing external clusters.
- Full Multi-Modal Separation (E/P/D Split): Dynamo now supports the full Encode/Prefill/Decode (E/P/D) split across all three backends (vLLM, SGLang, and TensorRT-LLM). This lets you run vision or video encoders on separate GPUs, preventing compute-heavy encoding tasks from interrupting text generation.
- FlashIndexer Preview for Low Latency: A ‘sneak preview’ of FlashIndexer, a dedicated component for optimizing distributed KV cache management. It is designed to make indexing and retrieval of conversational ‘memory’ significantly faster, with the aim of further reducing Time to First Token (TTFT).
- Better Scheduling with a Kalman Filter: The system now uses predictive load estimation powered by a Kalman filter. This allows the planner to forecast GPU load more accurately and proactively handle traffic spikes, supported by routing signals from the Kubernetes Gateway API Inference Extension (GAIE).
Check out the full details in the GitHub release.