For the past few years, the AI world has followed a simple rule: if you want a large language model (LLM) to solve a hard problem, make it ‘think’ longer with a Chain of Thought (CoT). But new research from the University of Virginia and Google argues that ‘thinking long’ is not the same as ‘thinking hard’.
The research team shows that adding more tokens to a response can actually make the AI less accurate. Instead of counting words, the researchers introduce a new metric: the Deep-Thinking Ratio (DTR).

The ‘token maxing’ failure
Engineers often use token count as a proxy for the effort an AI puts into a task. However, the researchers found that raw token counts have an average correlation of r = -0.59 with accuracy.
This negative number means that as the model generates more text, it is more likely to be wrong. The cause is ‘overthinking’: the model gets stuck in loops, repeats unnecessary steps, or compounds its mistakes. Relying on length alone wastes expensive computation on tokens that carry no information.
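The negative length–accuracy relationship can be checked with an ordinary Pearson correlation. A minimal sketch, using synthetic toy data (not the paper's measurements) purely to illustrate the computation:

```python
import numpy as np

# Toy illustration (synthetic data, not the paper's measurements):
# longer responses are wrong more often here, giving a negative
# Pearson correlation between token count and correctness.
token_counts = np.array([120, 450, 800, 1500, 2600, 4000], dtype=float)
correct      = np.array([  1,   1,   1,    0,    0,    0], dtype=float)

r = np.corrcoef(token_counts, correct)[0, 1]
print(f"Pearson r = {r:.2f}")  # negative, echoing the paper's r = -0.59 trend
```

The paper's r = -0.59 is an average of exactly this kind of correlation computed over real model outputs.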
What are deep-thinking tokens?
The research team argues that real ‘thinking’ happens inside the layers of the model, not just in the final output. When a model predicts a token, it processes the data through a stack of L transformer layers.
- Shallow tokens: for easy tokens, the model’s prediction stabilizes quickly; the ‘guess’ barely changes from layer 5 to layer 36.
- Deep-thinking tokens: for difficult logic or math tokens, the prediction keeps changing significantly in the deeper layers.
How to measure depth
To identify these tokens, the research team uses a technique that reads off the model’s internal ‘draft’ at each layer. They project the intermediate hidden states h_{t,l} into vocabulary space using the model’s unembedding matrix W_U. This produces a probability distribution p_{t,l} for each layer l.
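This projection step (often called a ‘logit lens’) can be sketched in a few lines. The sizes and matrices below are toy placeholders, not a real model's weights:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 8, 16, 4             # toy sizes, not a real model

W_U = rng.normal(size=(d_model, vocab))         # unembedding matrix W_U
hiddens = rng.normal(size=(n_layers, d_model))  # h_{t,l}: one token's hidden state at each layer l

# Project every intermediate hidden state into vocabulary space:
# p_{t,l} = softmax(h_{t,l} @ W_U)
p = softmax(hiddens @ W_U)                      # shape: (n_layers, vocab)
print(p.shape)                                  # each row is a distribution over the vocabulary
```

Each row of `p` is the model's intermediate ‘draft’ prediction for the token at that layer.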
Then they compute the Jensen–Shannon divergence (JSD) between each intermediate layer’s distribution and the final layer’s distribution p_{t,L}:

D_{t,l} := JSD(p_{t,l} || p_{t,L})
A token counts as a deep-thinking token if its prediction settles only in the ‘late regime’, defined by a depth threshold ρ. In their tests they set ρ = 0.85, meaning the token stabilizes only in the last 15% of layers.
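The divergence and the late-regime test can be sketched together. This is my reading of the criterion, not the paper's exact code: a token is ‘deep thinking’ if its per-layer distribution only converges to the final layer's after depth ρ·L.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two probability distributions.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_deep_thinking(layer_dists, rho=0.85, eps_conv=1e-3):
    """Toy classifier (an assumption-laden sketch, not the paper's code):
    deep if the prediction only converges after depth rho * L."""
    L = len(layer_dists)
    final = layer_dists[-1]
    divs = [jsd(d, final) for d in layer_dists]
    # First layer after which divergence stays below the threshold:
    conv = next(l for l in range(L) if all(v < eps_conv for v in divs[l:]))
    return conv / L > rho

# A token fixed early (shallow) vs. one that flips in the last layer (deep):
shallow = [[0.9, 0.1]] * 10
deep    = [[0.9, 0.1]] * 9 + [[0.1, 0.9]]
print(is_deep_thinking(shallow), is_deep_thinking(deep))  # → False True
```

With ρ = 0.85 and 10 layers, only the token that converges at layer 9 (depth 0.9) is classified as deep thinking.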
The Deep-Thinking Ratio (DTR) is the percentage of these ‘hard’ tokens in the complete sequence. Across models such as DeepSeek-R1-70B, Qwen3-30B-Thinking, and GPT-OSS-120B, DTR showed a strong average positive correlation of r = 0.683 with accuracy.
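Given a per-token convergence depth (the fraction of layers traversed before the prediction stabilizes), DTR itself is just a proportion. A minimal sketch with hypothetical depths for a 10-token response:

```python
# DTR sketch: fraction of tokens whose prediction converges only late.
rho = 0.85
# Hypothetical per-token convergence depths (fraction of layers traversed
# before the prediction stabilizes) for a 10-token response:
conv_depth = [0.2, 0.1, 0.9, 0.95, 0.3, 0.88, 0.1, 0.4, 0.92, 0.2]
dtr = sum(d > rho for d in conv_depth) / len(conv_depth)
print(f"DTR = {dtr:.2f}")  # → DTR = 0.40
```

Here 4 of 10 tokens cross the ρ = 0.85 depth threshold, so DTR = 0.40.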

Think@N: Better Accuracy at 50% Cost
The research team used this insight to create Think@N, a new way to select answers at inference time.
Most developers use self-consistency (Cons@N): sample N full answers and use majority voting to choose the most common one. This is very expensive because every candidate must be generated to completion.
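The Cons@N baseline is a plain majority vote over sampled answers. A minimal sketch (the answers here are hypothetical):

```python
from collections import Counter

def cons_at_n(answers):
    # Self-consistency: return the most common of N sampled final answers.
    return Counter(answers).most_common(1)[0][0]

samples = ["42", "42", "17", "42", "23"]   # five hypothetical sampled answers
print(cons_at_n(samples))  # → 42
```

The cost problem is visible here: all five answers had to be fully generated before the vote could be taken.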
Think@N changes the game by using ‘early halting’:
- The model starts generating multiple candidate answers.
- After just 50 prefix tokens, the system computes the DTR of each candidate.
- It immediately stops generating ‘shallow’ candidates with low DTR.
- It completes only the candidates with high deep-thinking scores.
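The steps above can be sketched as a selection loop. This is a simplified, assumption-laden sketch (the function names are mine, and `dtr_of_prefix` stands in for the layer-probing DTR estimator described earlier; the real system streams generation):

```python
def think_at_n(candidates, dtr_of_prefix, prefix_len=50, keep=1):
    """Sketch of Think@N early halting. Each candidate is a token
    sequence; only the first `prefix_len` tokens are scored, so
    low-DTR candidates are halted before full generation."""
    scored = [(dtr_of_prefix(c[:prefix_len]), c) for c in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:keep]]   # only these run to completion

# Toy DTR stand-in: tokens are tagged 1 if 'deep thinking', 0 otherwise,
# and the score is simply the fraction of deep tokens in the prefix.
toy_dtr = lambda prefix: sum(prefix) / len(prefix)

cand_a = [1, 1, 0, 1] * 20   # high-DTR candidate
cand_b = [0, 0, 0, 1] * 20   # low-DTR candidate
print(think_at_n([cand_b, cand_a], toy_dtr) == [cand_a])  # → True
```

Only the surviving high-DTR candidates pay the full generation cost, which is where the savings come from.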
Results on AIME 2025

| Method | Accuracy | Average cost (k tokens) |
|---|---|---|
| Cons@N (majority vote) | 92.7% | 307.6 |
| Think@N (DTR-based selection) | 94.7% | 155.4 |
On the AIME 2025 math benchmark, Think@N achieved higher accuracy while cutting inference costs by 49% compared to standard voting.
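The 49% figure follows directly from the two cost columns in the table above:

```python
# Sanity check of the cost saving implied by the AIME 2025 numbers:
cons_cost  = 307.6   # avg k tokens, Cons@N
think_cost = 155.4   # avg k tokens, Think@N
saving = (1 - think_cost / cons_cost) * 100
print(f"cost reduction ≈ {saving:.0f}%")  # → cost reduction ≈ 49%
```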
key takeaways
- Token count is a poor predictor of accuracy: The length of raw output has an average negative correlation (r = -0.59) with performance, meaning that longer reasoning traces often indicate ‘overthinking’ rather than higher quality.
- Deep thinking tokens define true effort: Unlike simple tokens that are stable in early layers, deep thinking tokens are those whose internal predictions undergo significant modification in deeper model layers before convergence.
- A better metric is the Deep-Thinking Ratio (DTR): DTR measures the proportion of deeply thought tokens in a sequence and exhibits a strong positive correlation with accuracy (average r = 0.683), consistently outperforming length-based or confidence-based baselines.
- Think@N enables efficient test-time scaling: by completing only the samples with high deep-thinking ratios and halting the rest early, the Think@N strategy matches or exceeds the performance of standard majority voting (Cons@N).
- Drastic cost reduction through early stopping: because DTR can be estimated from a prefix of just 50 tokens, unpromising generations can be discarded early, cutting total inference cost by roughly 50%.
Check out the paper for more details.