There is a new entry worth paying attention to in the open-source AI landscape. Alibaba’s Qwen team has released Qwen3.6-35B-A3B, the first open-weight model of the Qwen3.6 generation, and it makes a compelling argument that parameter efficiency matters more than raw model size. With 35 billion total parameters but only 3 billion active during inference, the model delivers agentic coding performance competitive with dense models roughly ten times its active size.
What is a sparse MoE model, and why does it matter here?
A mixture-of-experts (MoE) model does not run all of its parameters on every forward pass. Instead, it routes each input token through a small subset of specialized sub-networks called ‘experts’, while the rest of the parameters sit idle. This means a model can have a huge total parameter count while its inference compute – and hence inference cost and latency – scales only with the active parameter count.
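The routing idea can be sketched in a few lines. This is not Qwen’s actual router, just a minimal top-k gating example: the router scores all experts, but only the k highest-scoring ones actually run.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=8):
    """Illustrative sparse-MoE forward pass for a single token.

    x       : (d,) token hidden state
    gate_w  : (d, n_experts) router weights
    experts : list of callables, each a small feed-forward network
    Only the top_k highest-scoring experts execute; the rest stay idle.
    """
    logits = x @ gate_w                    # one router score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k selected experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Weighted sum of the chosen experts' outputs; all other experts are skipped.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 64
experts = [(lambda W: (lambda x: np.tanh(x @ W)))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=d), gate_w, experts, top_k=8)
print(y.shape)  # (16,)
```

Compute here scales with `top_k` (8 experts), not `n_experts` (64) – which is exactly why a 35B-total model can run with 3B-active cost.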
Qwen3.6-35B-A3B is a causal language model with a vision encoder, trained through both pre-training and post-training stages, with 35 billion total parameters and 3 billion active ones. Each MoE layer consists of 256 experts, with 8 routed experts and 1 shared expert active per token.
The architecture introduces an unusual hybrid layout worth understanding: the model repeats a pattern of 10 blocks, each consisting of 3 (Gated DeltaNet → MoE) layers followed by 1 (Gated Attention → MoE) layer, for 40 layers in total. The Gated DeltaNet sublayers handle linear attention – a computationally cheaper alternative to standard self-attention – while the gated attention sublayers use Grouped Query Attention (GQA) with 16 query heads and only 2 KV heads, significantly reducing KV-cache memory pressure during inference. The model supports a native context length of 262,144 tokens, which can be extended to 1,010,000 tokens using YaRN (a RoPE extension) scaling.
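The 3:1 interleaving is easier to see laid out explicitly. The sketch below reconstructs the layer schedule from the description above; the layer names are descriptive labels, not Qwen’s actual module names.

```python
def build_layer_schedule(n_blocks=10, linear_per_block=3):
    """Reconstruct the hybrid schedule: per block, 3 linear-attention layers
    followed by 1 full-attention layer, every layer ending in an MoE FFN."""
    layers = []
    for _ in range(n_blocks):
        layers += ["gated_deltanet+moe"] * linear_per_block  # cheap linear attention
        layers += ["gated_attention+moe"]                    # full GQA attention
    return layers

schedule = build_layer_schedule()
print(len(schedule))                          # 40 layers in total
print(schedule.count("gated_attention+moe"))  # only 10 full-attention layers
```

Only a quarter of the layers pay the full (quadratic) attention cost, which is how the model keeps long-context inference affordable alongside the small KV cache from GQA.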
Agentic coding is where this model gets serious
On SWE-Bench Verified – the canonical benchmark for real-world GitHub issue resolution – Qwen3.6-35B-A3B scores 73.4, compared to 70.0 for Qwen3.5-35B-A3B and 52.0 for Gemma4-31B. On Terminal-Bench 2.0, which evaluates an agent’s ability to complete tasks inside a real terminal environment with a three-hour timeout, Qwen3.6-35B-A3B scores 51.5 – the highest among all compared models, including Qwen3.5-27B (41.6), Gemma4-31B (42.9), and Qwen3.5-35B-A3B (40.5).
Frontend code generation shows the most dramatic improvement. On QwenWebBench, an internal bilingual frontend code-generation benchmark covering seven categories – web design, web apps, games, SVG, data visualization, animation, and 3D – Qwen3.6-35B-A3B achieves a score of 1,397, well ahead of Qwen3.5-27B (1,068) and Qwen3.5-35B-A3B (978).
On STEM and reasoning benchmarks, the numbers are equally strong. Qwen3.6-35B-A3B scores 92.7 on AIME 2026 (full AIME I and II) and 86.0 on GPQA Diamond – a graduate-level scientific reasoning benchmark – both competitive with much larger models.
Multimodal vision capabilities
Qwen3.6-35B-A3B is not a text-only model. It ships with a vision encoder and handles image, document, video, and spatial reasoning tasks.
On MMMU (Massive Multi-discipline Multimodal Understanding), a benchmark of university-level reasoning over images, Qwen3.6-35B-A3B scores 81.7, outperforming Claude-Sonnet-4.5 (79.6) and Gemma4-31B (80.4). On RealWorldQA, which tests scene understanding in real-world photographic contexts, the model scores 85.3 – ahead of Qwen3.5-27B (83.7) and well above Claude-Sonnet-4.5 (70.3) and Gemma4-31B (72.3).
Spatial intelligence shows measurable gains as well. On ODinW13, an object-detection benchmark, Qwen3.6-35B-A3B scores 50.8, up from Qwen3.5-35B-A3B’s 42.6. For video understanding, it scores 83.7 on VideoMMMU, outperforming Claude-Sonnet-4.5 (77.6) and Gemma4-31B (81.6).

Thinking mode, non-thinking mode, and a major behavioral change
One of the more practically useful design decisions in Qwen3.6 is explicit control over the model’s reasoning behavior. Qwen3.6 models operate in thinking mode by default, generating reasoning content enclosed in <think> tags before the final response. Developers who need fast, direct responses can disable it by passing "enable_thinking": False in the chat-template kwargs via the API. However, practitioners migrating from Qwen3 should note an important behavioral change: Qwen3.6 does not officially support Qwen3’s /think and /no_think soft switches. Mode switching must be done via the API parameter rather than inline prompt tokens.
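With a real tokenizer this is a single kwarg to `apply_chat_template`. The self-contained sketch below mimics how Qwen-style chat templates typically implement the flag – the template text and mechanism here are an illustration based on the Qwen3 family, not the official Qwen3.6 template.

```python
def apply_chat_template(messages, enable_thinking=True):
    """Toy stand-in for a Qwen-style chat template (illustrative only)."""
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    if not enable_thinking:
        # Pre-filling an already-closed, empty think block steers the model
        # to answer directly instead of generating reasoning first.
        prompt += "<think>\n\n</think>\n\n"
    return prompt

msgs = [{"role": "user", "content": "What is 2+2?"}]
thinking_prompt = apply_chat_template(msgs)                       # model emits its own <think>...</think>
direct_prompt = apply_chat_template(msgs, enable_thinking=False)  # empty think block pre-closed
print(direct_prompt.endswith("<think>\n\n</think>\n\n"))  # True
```

Because the switch lives in the template rather than the prompt text, it cannot be toggled mid-conversation by inline tokens – consistent with the removal of the soft switches noted above.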
A newer addition is a feature called thinking preservation. By default, only the thinking blocks generated for the latest user message are retained; Qwen3.6 has additionally been trained to preserve and exploit thinking traces from historical messages, which can be enabled via the preserve_thinking option. This capability is particularly valuable in agentic scenarios, where keeping the full reasoning context can increase decision consistency, reduce redundant reasoning, and improve KV-cache utilization in both thinking and non-thinking modes.
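A client-side view of the difference between the two behaviors can be sketched as follows. The `reasoning` field and pruning logic here are assumptions for illustration, not the official API schema.

```python
def prune_history(messages, preserve_thinking=False):
    """Default: drop reasoning traces from all but the latest assistant turn.
    With preserve_thinking=True, all historical traces stay in context."""
    if preserve_thinking:
        return messages
    last_assistant = max((i for i, m in enumerate(messages)
                          if m["role"] == "assistant"), default=-1)
    pruned = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != last_assistant:
            m = {**m, "reasoning": None}  # strip the older thinking trace
        pruned.append(m)
    return pruned

history = [
    {"role": "user", "content": "Plan step 1"},
    {"role": "assistant", "content": "Done", "reasoning": "trace A"},
    {"role": "user", "content": "Plan step 2"},
    {"role": "assistant", "content": "Done", "reasoning": "trace B"},
]
default = prune_history(history)
print(default[1]["reasoning"], default[3]["reasoning"])  # None trace B
```

In the preserved mode the earlier traces remain verbatim in context, which is also what lets the server reuse their KV-cache entries instead of recomputing a pruned prompt.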
Key takeaways
- Qwen3.6-35B-A3B is a sparse mixture-of-experts model with 35 billion total parameters but only 3 billion active at inference time, making it significantly cheaper to run than its total parameter count suggests, without sacrificing performance on complex tasks.
- The model’s agentic coding capabilities are its strongest suit, with a score of 51.5 on Terminal-Bench 2.0 (the highest among all compared models), 73.4 on SWE-Bench Verified, and a leading 1,397 on QwenWebBench, which covers frontend code generation across seven categories including web apps, games, and data visualization.
- Qwen3.6-35B-A3B is a natively multimodal model, supporting image, video, and document understanding with scores of 81.7 on MMMU, 85.3 on RealWorldQA, and 83.7 on VideoMMMU – outperforming Claude-Sonnet-4.5 and Gemma4-31B on each of these.
- The model introduces a new thinking preservation feature that allows reasoning traces from previous interactions to be retained and reused in multi-step agent workflows, reducing redundant reasoning and improving KV-cache efficiency in both thinking and non-thinking modes.
- Released under Apache 2.0, the model is fully open for commercial use and compatible with the leading open-source inference frameworks – SGLang, vLLM, KTransformers, and Hugging Face Transformers – with KTransformers enabling CPU-GPU heterogeneous deployment for resource-constrained environments.
Check out the technical details and model weights.