Reasoning Models Can Be Effective Without Thinking

Plus more about BitNet b1.58 2B4T Technical Report and ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Apr 14th ~ Apr 20th
#52 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 10k OpenAI announced o4-mini and o3, their smartest and most capable models yet, with o4-mini becoming the new state of the art and o3 (high) topping the Aider LLM leaderboard. Both are multimodal models capable of vision understanding, with API prices starting at $10 in/$40 out for o3 and $1.10 in/$4.40 out for o4-mini.

    via Artificial Analysis

bycloud’s new project: search AI papers semantically!

Hey guys! It must have been a long wait for some of y’all, and I am very excited to share with you my latest project that I just shipped, called:

A semantic search engine for 300k+ AI research papers!

It outcompetes Deep Research apps like Grok, OpenAI, Perplexity, and Gemini at finding relevant papers. Check out our demo on X.

Specifically, there are ~300,000 AI/ML research papers currently indexed in my engine; that’s about 1.19TB worth of PDFs as a knowledge base.

By next month, we are planning to increase this by 4x by indexing the entirety of arXiv.org. Findmypapers.ai is now available as a Patreon benefit, or you can access it directly on the website.

But why ANOTHER search engine? There are currently two problems with existing solutions:

  1. Generative AI models trained on papers are prone to serving up hallucinations

  2. Deep Research agents are good, but they waste compute browsing content that is 80% SEO-optimized slop

findmypapers.ai addresses both of these problems, and takes the best of both worlds.

I believe that surveying research shouldn’t be that hard. You can be as specific and technical as you want with your search query, and it won’t give you made-up, useless BS.

snippet of the output results

Before you try:

  • Search time is long (est. 1~3min depending on search range)

  • Limited to AI research papers for now, but this will be expanded soon once we have the money to upgrade our storage

  • Broad/wide search is really REALLY useful if you need a big compilation of information, like in my own use cases

To celebrate our launch, use code BETAN50 for 50% off for the next 2 months (limited to 50 redemptions)! You can follow our official X account or join our Discord for any updates.

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Feng et al. [ByteDance Seed]

♥ 400   LLM RL

How Reinforcement Learning Teaches LLMs to Think with Tools

LLMs have remarkable reasoning abilities and can handle tasks ranging from solving logic puzzles to generating step-by-step explanations. However, they still struggle with mathematical tasks like Olympiad-level equations or geometric proofs.

This is primarily because pure text-based reasoning struggles with exact calculations and symbolic manipulation. For instance, solving a combinatorics problem might require them to calculate permutations, a task where a single arithmetic error derails the entire solution. Code interpreters (CIs) offer a workaround by enabling precise, executable steps, but existing methods to combine LLMs with tools rely on imitation learning. 

This paper introduces ReTool, a framework that reimagines how models learn to integrate computational tools like code interpreters into their reasoning. ReTool tackles this problem by treating tool use as a skill to be learned, not just mimicked, using outcome-driven RL.

AIME 2024 & 2025 scores of ReTool and text-based RL baseline on the Qwen2.5-32B-Instruct model. The x-axis represents the training steps.

How ReTool Trains Models to "Think with Code"

ReTool operates in two phases. First, it builds a foundation through cold-start training: synthetic data teaches the model basic tool invocation by replacing manual calculations in existing reasoning traces with code snippets and their execution results. This dataset, refined through format and answer verification, primes the model to recognize when and how to insert code blocks during problem-solving.
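To make the cold-start idea concrete, here is a minimal sketch of what one step of that trace rewriting could look like: a hand-written arithmetic step is swapped for an executable snippet plus its execution output. The regex, helper names, and tag format are illustrative assumptions, not ReTool’s actual data pipeline.

```python
import re
import subprocess
import sys

def run_snippet(code: str) -> str:
    """Execute a Python snippet in a subprocess (stand-in for a real sandbox)."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=10
    )
    return (result.stdout or result.stderr).strip()

def rewrite_step(step: str) -> str:
    """Replace a manual calculation like '12 * 11 * 10 = 1320' with a code block
    plus its interpreter output, in the <code>/<interpreter> format."""
    match = re.search(r"([\d\s\*\+\-/\(\)]+)=\s*[\d\.]+", step)
    if match is None:
        return step  # nothing numeric to rewrite, keep the original prose
    expr = match.group(1).strip()
    code = f"print({expr})"
    output = run_snippet(code)
    return f"<code>\n{code}\n</code>\n<interpreter>\n{output}\n</interpreter>"

# One step of a combinatorics trace becomes an executable, verified block.
print(rewrite_step("The number of arrangements is 12 * 11 * 10 = 1320"))
```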

The second phase uses tool-integrated RL in which the model interacts with a code sandbox in real time (unlike standard RL, which generates text-only reasoning chains). As the model writes a reasoning step, it can pause to generate a code block (marked by <code> tags), execute it, and receive feedback (success results or errors) within <interpreter> tags. This dynamic loop allows the model to adjust its strategy based on whether the code worked and whether the final answer was correct. For example, if a generated Python snippet throws a NameError, the model might revise its code in the next step, learning to define missing variables.
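Here is a rough sketch of that interleaved generate-execute loop, assuming a hypothetical `generate_until` function that stops at a closing </code> tag (without including it in the returned chunk) and an `execute_in_sandbox` helper that returns stdout or a traceback; ReTool’s real rollout infrastructure is more involved.

```python
# Minimal sketch of a tool-integrated rollout, not ReTool's actual code.
CODE_END = "</code>"

def rollout(prompt: str, generate_until, execute_in_sandbox, max_rounds: int = 8) -> str:
    """Interleave text generation with sandboxed code execution."""
    trajectory = prompt
    for _ in range(max_rounds):
        # The policy generates until it finishes or opens-and-closes a code block.
        chunk, stopped_on_code = generate_until(trajectory, stop=CODE_END)
        trajectory += chunk
        if not stopped_on_code:
            break  # the model produced a final answer with no further tool calls
        code = chunk.split("<code>")[-1]      # text inside the last code block
        feedback = execute_in_sandbox(code)    # stdout on success, traceback on error
        trajectory += f"{CODE_END}\n<interpreter>\n{feedback}\n</interpreter>\n"
    return trajectory
```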

To streamline training, ReTool also uses asynchronous code sandboxes for parallel execution and masks interpreter feedback tokens during loss calculation, so the model is not trained to imitate text it did not generate itself. The learning signal is a sparse, outcome-based reward on the final answer, which drives the model to explore strategies that reliably reach solutions, prioritizing not just correctness but efficiency. Over time, the model discovers patterns like early tool invocation (to validate assumptions quickly) or chaining code snippets for multi-step proofs.
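The masking step can be illustrated with a short sketch in which sandbox-produced tokens stay in the context but contribute nothing to the loss. The tensors and the 0/1 segment labels below are made up for illustration and are not ReTool’s actual implementation.

```python
import torch

def mask_interpreter_tokens(token_loss: torch.Tensor,
                            segment_ids: torch.Tensor) -> torch.Tensor:
    """Zero out the loss on sandbox-produced tokens.

    token_loss:  per-token loss from the RL objective, shape (seq_len,)
    segment_ids: 0 = model-generated token, 1 = token inside <interpreter>...</interpreter>
    """
    mask = (segment_ids == 0).float()          # train only on model-generated tokens
    return (token_loss * mask).sum() / mask.sum().clamp(min=1.0)

# Toy example: the last two tokens came from the interpreter and are ignored.
loss = mask_interpreter_tokens(
    token_loss=torch.tensor([0.8, 1.2, 0.5, 2.0, 3.0]),
    segment_ids=torch.tensor([0, 0, 0, 1, 1]),
)
print(loss)  # averages only the first three values
```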

Results: Shorter Reasoning, Smarter Code, Better Accuracy

ReTool shows impressive performance on the challenging AIME competition benchmarks. A 32B model trained with ReTool achieved 67% accuracy on AIME 2024 in just 400 training steps, outperforming text-only RL baselines (40% accuracy, 1,080 steps). With a stronger backbone (DeepSeek-R1), accuracy rose to 72.5%, surpassing OpenAI’s o1-preview by 27.9%. Three findings stand out:

  1. Efficiency Gains: Responses became 40% shorter post-training, as models replaced verbose calculations with concise code.

  2. Strategic Tool Use: Code invocation shifted earlier in reasoning chains, and the variety of code purposes (e.g., verification, enumeration) expanded.

  3. Emergent Self-Correction: Models began debugging their own code. In one case, after a NameError, a model added missing function definitions and reran the code, a behavior never explicitly taught.

However, it still has a few limitations. The reliance on rule-based answer verification assumes problems have unambiguous solutions, which may not hold in open-ended domains. Additionally, while ReTool reduces hallucination, errors in code logic (e.g., off-by-one bugs) can still propagate if not caught by the interpreter.

BitNet b1.58 2B4T Technical Report

Ma et al. [Microsoft Research]

♥ 518   LLM Compression  

How 1-Bit LLMs Are Redefining Efficient AI

The race to build leaner, faster language models often feels like squeezing a mountain into a shoebox. Advanced LLMs excel at complex tasks, but they require hefty computational resources, which makes them impractical for edge devices or real-time applications.

But to everyone’s surprise, BitNet b1.58 2B4T, a 2-billion-parameter model that challenges the status quo by operating almost entirely with 1.58-bit ternary weights, has proven that 1-bit LLMs might just scale very well.

BitNet b1.58 2B4T advances the Pareto frontier

The Efficiency Revolution: Ternary Weights Meet Smarter Training

Many LLMs use post-training quantization tricks to save resources, but this model uses a completely new architecture trained from scratch to deliver performance at low cost. Instead of compressing weights after training, it restricts every layer to ternary weight values (-1, 0, +1) from day one. This “native” 1-bit design reduces memory use while sidestepping the performance drops seen in retrofitted models.

It uses BitLinear layers, which replace the standard linear layers in Transformers. These layers quantize weights to ternary values during the forward pass using an absolute-mean (absmean) scheme, which ensures numerical stability, while activations are quantized to 8-bit integers, further cutting compute costs. To keep training on track, the team borrowed techniques from high-performance LLMs: rotary positional embeddings, ReLU-squared activations for sparsity, and a bias-free design.
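Here is a rough sketch of what a BitLinear forward pass could look like based on that description (absmean ternary weights, absmax 8-bit activations). It is a simplified paraphrase rather than Microsoft’s released code, and it omits details such as the normalization applied before quantization and the straight-through estimator used to train the latent full-precision weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    """Simplified BitLinear: ternary weights via absmean, int8 activations via absmax."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Latent full-precision weights; the quantized view is computed on the fly.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Weight quantization: scale by the mean absolute value, round to {-1, 0, +1}.
        w_scale = self.weight.abs().mean().clamp(min=1e-5)
        w_q = (self.weight / w_scale).round().clamp(-1, 1)

        # Activation quantization: per-token absmax scaling into the int8 range.
        x_scale = x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5) / 127.0
        x_q = (x / x_scale).round().clamp(-128, 127)

        # Matmul in the quantized domain, then undo both scales.
        return F.linear(x_q, w_q) * x_scale * w_scale

x = torch.randn(2, 16)
layer = BitLinearSketch(16, 32)
print(layer(x).shape)  # torch.Size([2, 32])
```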

To train this model, the researchers used a two-phase learning rate schedule that started aggressively (1-bit weights tolerate a larger learning rate), then cooled down while the model was refined on high-quality data. They also used weight decay as a temporary guardrail early on before disabling it, letting parameters settle into precise configurations. The dataset contained a mix of 4 trillion tokens of web crawls, educational content, and synthetic math data, with later stages emphasizing curated examples.
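As a toy illustration of that two-phase recipe, a schedule can be expressed as a simple function of the training step; every number below is a placeholder, not the technical report’s actual hyperparameters.

```python
def two_phase_lr(step: int, total_steps: int,
                 peak_lr: float = 1.5e-3, low_lr: float = 1.5e-4) -> float:
    """Phase 1: aggressive, slowly decaying LR. Phase 2: low LR over curated data."""
    phase_switch = int(0.6 * total_steps)  # placeholder split point
    if step < phase_switch:
        frac = step / max(phase_switch, 1)
        return peak_lr - frac * (peak_lr - low_lr)  # decay from peak to the low level
    return low_lr  # cooldown phase while training emphasizes high-quality data

def weight_decay(step: int, total_steps: int) -> float:
    """Weight decay as an early guardrail, switched off later in training."""
    return 0.1 if step < int(0.6 * total_steps) else 0.0
```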

Comparison of BitNet b1.58 (2B) against Qwen 2.5 1.5B in its original bf16 precision and after INT4 post-training quantization.

BitNet b1.58 Performance Benchmarks

BitNet b1.58 performs well on benchmarks against similarly sized models like LLaMA 3.2 1B and Gemma-3 1B. It matches or exceeds their performance in language understanding (MMLU), reasoning (ARC-Challenge), and math (GSM8K) while using roughly 6x less memory (0.4GB vs. 2.6GB). Furthermore, it outperforms the INT4-quantized version of Qwen2.5-1.5B, suggesting that native 1-bit training beats post-hoc compression.

Performance comparison of BitNet b1.58 2B4T against other open-weight 1-bit models.

Reasoning Models Can Be Effective Without Thinking

Ma et al. [University of California, Allen Institute for AI]

♥ 385   LLM Reasoning   bycloud’s pick  

Can Language Models Skip the "Thinking" Step?

Modern LLMs often approach complex tasks by generating elaborate, step-by-step reasoning traces. These “chain-of-thought” processes are widely considered essential for solving challenging problems in mathematics, coding, and logic. While effective, this process significantly increases computational costs, latency, and token usage. Researchers have tried optimizing these thinking steps by shortening them or training models to prioritize concise reasoning, but all still assume explicit reasoning is indispensable.

This paper aims to reduce this computing cost by cutting out the thinking steps entirely and replacing them with a minimal, prefilled placeholder. The researchers call this method NoThinking; it forces the model to skip its usual reflective process and jump directly to generating solutions.

How the NoThinking Approach Works

The NoThinking method uses a simple prompting tweak: instead of letting the model generate a verbose thinking block, the researchers prefill a short placeholder thinking block (e.g., “Okay, I think I have finished thinking.”) and prompt the model to continue from there. This bypasses the model’s default behavior of generating extended self-dialogue and effectively truncates the reasoning process.
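Concretely, for a DeepSeek-R1-style model that wraps its reasoning in <think> tags, a NoThinking prompt could be assembled like this; the chat-template tokens and helper names are assumptions for illustration, not the paper’s exact code.

```python
THINKING_PLACEHOLDER = "Okay, I think I have finished thinking."

def build_nothinking_prompt(question: str) -> str:
    """Prefill a dummy thinking block so decoding starts at the final solution."""
    return (
        f"<|User|>{question}<|Assistant|>"
        f"<think>\n{THINKING_PLACEHOLDER}\n</think>\n"
    )

def build_thinking_prompt(question: str) -> str:
    """Baseline: leave the thinking block open for the model to fill in."""
    return f"<|User|>{question}<|Assistant|><think>\n"

prompt = build_nothinking_prompt("How many positive divisors does 2024 have?")
# The model then continues directly with the final solution, e.g.
# completion = llm.generate(prompt, max_tokens=700)
```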

By capping the total tokens available for generation, both traditional “Thinking” and NoThinking methods are forced to prioritize efficiency. In low-budget settings (e.g., 700 tokens), NoThinking outperforms Thinking by wide margins, 51.3% vs. 28.9% accuracy on the AMC 2023 math benchmark. Additionally, NoThinking’s performance scales better as the number of sampled outputs (k) increases, suggesting its outputs are more diverse or complementary when aggregated.
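For reference, pass@k curves like these are typically computed with the standard unbiased estimator from the code-generation literature: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k randomly chosen samples is correct. A minimal sketch, assuming the paper follows this convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k sampled
    outputs is correct, given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples per problem, 6 correct: pass@1 vs. pass@8
print(pass_at_k(16, 6, 1))  # 0.375
print(pass_at_k(16, 6, 8))  # ≈ 0.997
```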

Additionally, the authors paired NoThinking with a parallel scaling approach. In this method, they generated multiple independent solutions in parallel and selected the best one using task-specific verifiers (for theorem proving) or confidence-based rankings (for tasks without verifiers). 
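A minimal sketch of the verifier-free selection step: generate N answers in parallel, then pick one by majority vote, by the single highest-confidence answer, or by confidence-weighted voting (matching the strategies compared in the figure further below). The inputs and scoring rule here are illustrative assumptions.

```python
from collections import Counter
from typing import List

def select_best(answers: List[str], confidences: List[float],
                strategy: str = "confidence+voting") -> str:
    """Pick one answer from N parallel NoThinking samples without a verifier."""
    if strategy == "majority":
        return Counter(answers).most_common(1)[0][0]
    if strategy == "confidence+highest":
        return max(zip(answers, confidences), key=lambda pair: pair[1])[0]
    # confidence+voting: weight each vote by the sample's confidence score
    scores: Counter = Counter()
    for answer, conf in zip(answers, confidences):
        scores[answer] += conf
    return scores.most_common(1)[0][0]

print(select_best(["42", "41", "42"], [0.9, 0.95, 0.7]))  # "42"
```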

Results and Implications of NoThinking Approach

The researchers tested the NoThinking approach across seven reasoning benchmarks, including mathematical problem solving (AIME, AMC), coding (LiveCodeBench), and formal theorem proving (MiniF2F):

  1. Token Efficiency: NoThinking consistently matches or surpasses Thinking when token usage is controlled. For example, in the low-budget setting it achieves 51.3% accuracy with 700 tokens while Thinking manages only 28.9% (the AMC 2023 result noted above). At higher token budgets, Thinking catches up in pass@1 metrics, but NoThinking still dominates in pass@k scenarios as k increases.

  2. Latency Reduction: Parallel scaling with NoThinking reduces inference latency by up to 9x compared to sequential Thinking. On theorem-proving tasks, it achieves similar accuracy with 4x fewer tokens and 7x lower latency.

Comparison of Best-of-N selection methods (majority voting, confidence+highest, and confidence+voting) on selected experiments.

However, the approach isn’t universally optimal. For coding tasks like LiveCodeBench, NoThinking lags behind Thinking in pass@1 accuracy, likely because code solutions require precise, verifiable outputs that benefit from iterative refinement.
