Tongyi DeepResearch – open-source 30B MoE Model that rivals OpenAI DeepResearch

It makes me wonder if we'll see an explosion of purpose trained LLMs because we hit diminishing returns on invest with pre training or if it takes a couple of months to fold these advantages back into the frontier models.

Given the size of frontier models I would assume that they can incorporate many specializations and the most lasting thing here is the training environment.

But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.

I made a 4B Qwen3 distill of this model (and a synthetic dataset created with it) a while back. Both can be found here: https://huggingface.co/flashresearch

Has anyone found these deep research tools useful? In my experience, they generate really bland reports don't go much further than summarization of what a search engine would return.

Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like.

Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)

This whole series of work is quite cool. The use of `word-break: break-word;` makes this really hard to read though.

I hope the translation for this is actually "Agree" Deep research. Just a dig at "You are absolutely right!" sycophancy.

In my experience using these supposed expert models, they are all more or less the same given they all are trained on the same internet data. The differentiation and value is in the context window management and how relevant info from your session is pulled in. So it’s the interface to the model that makes all the difference. Even there the differences are quite minimal. That is because all these companies want to toe the line between providing functionality to keep the users engaged and pushing them to sign up for the subscription.

All this to ask the question, if I host these open source models locally, how is the user interface layer that remembers and picks the right data from my previous session and the agentic automation and others implemented? Do I have to do it myself or are the free options for that?

I recently got a 5090 with 64 GB of RAM (intel cpu). Was just looking for a strong model I can host locally. If I had performance of GPT4-o, I'd be content. Are there any suggestions or cases where people got disappointed?

Isn't OpenIA "Deep research" (not "DeepResearch") a methodology/tooling thing, and you'll get different responses depending on what specific model you use with it? As far as the UI allows you to, you could use Deep research with GPT-5, GPT-4o, o3 and so on, and that'll have an impact on the responses. Skimming the paper and searching for some simple terms makes it seem like they never expand on what exact models they've used, just that they've used a specific feature from ChatGPT?

This is over a month old, they released the weights a long time ago.

It's a Qwen 3 MoE fine tune...

Has anyone tried running this on a 5090 or 6000 pro? What throughput do you see?

It still feels to me like OpenAI has zero moat. There are like 5 paid competitors + open source models.

I switch between gemini and ChatGpt whenever I feel one fails to fully grasp what I want, I do coding in claude.

How are they supposed to become the 1 trillion dollar company they want to be, with strong competition and open source disruptions every few months?

Isnt it huge deal, that this 30B model can compare and surpass huge closed models?

Is China dominating the US in terms of AI? Given that they currently have a model that beats the best models at all formal quantitative benchmarks?

What is the state of AI in China? My personal feeling is that it doesn't dominate the zeitgeist in China as it does in the US and despite this because of the massive amount of intellectual capital they have just a small portion of their software engineering talent working on this is enough to go head to head with us even though it only takes a fraction of their attention.

Unfortunately soon China will take lead in AI.

Isnt it huge deal, that this 30B model can compare and surpass huge closed models?

Given the size of frontier models I would assume that they can incorporate many specializations and the most lasting thing here is the training environment.

But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.

-> GPT 3.5 was awesome at chess I don't agree with this. I did try to play chess with GPT3.5 and it was horrible. Full of hallucinations.

Isn't the whole point of the MOE architecture exactly this?

That you can individually train and improve smaller segments as necessary

> if we'll see an explosion of purpose trained LLMs...

Domain specific models have been on the roadmap for most companies for years now for both competitive (why give up your moat to OpenAI or Anthropic) and financial (why finance OpenAI's margins) perspective.

Has anyone found these deep research tools useful? In my experience, they generate really bland reports don't go much further than summarization of what a search engine would return.

My experience is the same as yours. It feels to me (similar to most LLM writing) like they write for someone who’s not going to read it or use it but is going to glance at it and judge the quality that way and assume it’s good.

Not to different from a lot of consulting reports, in fact, and pretty much of no value if if you’re actually trying to learn something.

The reports are definitely bland, but I find them very helpful for discovering sources. For example, if I'm trying to ask an academic question like "has X been done before," sending something to scour the internet and find me examples to dig into is really helpful - especially since LLMs have some base knowledge which can help with finding the right search terms. It's not doing all the thinking, but those kind of broad overviews are quite helpful, especially since they can just run in the background.

Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)

If you just want to get something running locally as fast as possible to play with (the 2080ti typically had 11gb of VRAM which will be one of the main limiting factors), the ollama app will run most of these models locally with minimum user effort:

https://ollama.com/

If you really do have a 2080ti with 128gb of VRAM, we'd love to hear more about how you did it!

I use a Macbook Pro with 128GB RAM "unified memory" that's available to both CPU and GPU.

It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and works well in a coffee shop on battery and with no internet connection.

I use Ollama to run the models, so can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.

I've recently put together a setup that seemed reasonable for my limited budget. Mind you, most of the components were second-hand, open box deals, or deep discount of the moment.

This comfortably fits FP8 quantized 30B models that seem to be "top of the line for hobbyists" grade across the board.

- Ryzen 9 9950X

- MSI MPG X670E Carbon

- 96GB RAM

- 2x RTX 3090 (24GB VRAM each)

- 1600W PSU

I just use my laptop. A modern MacBook Pro will run ~30B models very well. I normally stick to "Max" CPUs (initially for more performance cores, recently also for the GPU power) with 64GB of RAM. My next update will probably be to 128GB of RAM, because 64GB doesn't quite cut it if you want to run large Docker containers and LLMs.

Im sure this guy has some helpful hints on that: https://youtube.com/@azisk

llama.cpp gives you the most control to tune it for your machine.

llama.cpp + quantized: https://huggingface.co/bartowski/Alibaba-NLP_Tongyi-DeepRese...

get the biggest one that will fit in your vram.

I hope the translation for this is actually "Agree" Deep research. Just a dig at "You are absolutely right!" sycophancy.

TIL the "full" name of Alibaba Qwen is 通義千問(romanized as "Tongyi Qianwen", something along "knows all thousand questions"), of which the first half without the Chinese accent flags is romanized identically to "同意", meaning "same intents" or "agreed".

The Chinese version of the link says "通义 DeepResearch" in the title, so doesn't look like the "agree" to be the case. Completely agreed that it would be hilarious.

1: https://www.alibabacloud.com/en/solutions/generative-ai/qwen...

this is a great question. what are the main use cases that you have for this? i’ve been working on a library for something similar and exposing it via an mcp interface. would love to pick your brain on this (@viksit on twitter)

Has anyone tried running this on a 5090 or 6000 pro? What throughput do you see?

It's a Qwen 3 MoE fine tune...

Not to different from a lot of consulting reports, in fact, and pretty much of no value if if you’re actually trying to learn something.

https://ollama.com/

If you really do have a 2080ti with 128gb of VRAM, we'd love to hear more about how you did it!

Unfortunately soon China will take lead in AI.

I have been very impressed with the Qwen3 series. I'm still evaluating them, and I generally take LLM benchmarks with a huge grain of salt, but their MoE models in particular seem to offer a lot of bang for the compute. But what makes you so sure they will take the lead?

Isn't this an indication they are already in the lead? They currently have the best model that beats everyone on all quantitative metrics? Are you implying that the US has a better model somewhere?

Unfortunately? May I ask why? What country would you like to be the lead in AI?

It still feels to me like OpenAI has zero moat. There are like 5 paid competitors + open source models.

I switch between gemini and ChatGpt whenever I feel one fails to fully grasp what I want, I do coding in claude.

How are they supposed to become the 1 trillion dollar company they want to be, with strong competition and open source disruptions every few months?

Yea, I agree.

Arguably LLMs are both (1) far easier to switch between models than it is today to switch from AWS / GCP / Azure systems, and (2) will be rapidly decreasing switching costs for your legacy systems to port to new ones - ie Oracle's, etc. whole business model.

Meanwhile, the whole world is building more chip fabs, data centers, AI software/hardware architectures, etc.

Feels more like we're headed to commodification of the compute layer more than a few giant AI monopolies.

And if true, that's actually even more exciting for our industry and "letting 100 flowers bloom".

Isn’t the moat in the product/UI/UX? I use Claude daily and love the “scratch notebook” feel of it. The barebone model does not get you any of this.

I don’t know if they can pull it off but a lot of companies are built on strong enterprise sales being able to sell free stuff with a bow on it to someone who doesn’t know better or doesn’t care.

Premium grade deals with Oracle. They will bullshit their way into government and enterprise environments where all the key decision makers are clueless and/or easily manipulated.

Is China dominating the US in terms of AI? Given that they currently have a model that beats the best models at all formal quantitative benchmarks?

I think the lesson of the Chinese catchup in AI is that there is a massive disadvantage in being first, in this domain. You can do all the hard work and your competitors can distill that work out of your model for pennies on the dollar. Why should anyone want to do the work?

-> GPT 3.5 was awesome at chess I don't agree with this. I did try to play chess with GPT3.5 and it was horrible. Full of hallucinations.

It was GPT-3 I think.

As far as I remember, it's post-training that kills chess ability for some reason (GPT-3 wasn't post-trained).

Isn't the whole point of the MOE architecture exactly this?

That you can individually train and improve smaller segments as necessary

Generally you train each expert simultaneously. The benefit of MoEs is that you get cheap inference because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example Deepseek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.

I think it's the exact opposite - you don't specifically train each 'expert' to be a SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.

I use a Macbook Pro with 128GB RAM "unified memory" that's available to both CPU and GPU.

It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and works well in a coffee shop on battery and with no internet connection.

I use Ollama to run the models, so can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.

I’d strongly advise ditching Ollama for LM Studio, and using MLX versions of the models. They run quite a bit faster on Apple Silicon. Also, LM Studio is much more polished and feature rich than Ollama.

How's the battery holding up during vibe coding sessions or occasional LLM usage? I've been thinking about getting a MacBook or a laptop with a similar Ryzen chip specifically for that reason.

Isn't this an indication they are already in the lead? They currently have the best model that beats everyone on all quantitative metrics? Are you implying that the US has a better model somewhere?

Yea, I agree.

Meanwhile, the whole world is building more chip fabs, data centers, AI software/hardware architectures, etc.

Feels more like we're headed to commodification of the compute layer more than a few giant AI monopolies.

And if true, that's actually even more exciting for our industry and "letting 100 flowers bloom".

Premium grade deals with Oracle. They will bullshit their way into government and enterprise environments where all the key decision makers are clueless and/or easily manipulated.

How's the battery holding up during vibe coding sessions or occasional LLM usage? I've been thinking about getting a MacBook or a laptop with a similar Ryzen chip specifically for that reason.

llama.cpp gives you the most control to tune it for your machine.

Isn’t the moat in the product/UI/UX? I use Claude daily and love the “scratch notebook” feel of it. The barebone model does not get you any of this.

I agree that the scaffolding around the model contributes greatly to the experience. But it doesn't take billions of dollars in GPUs to do that part.

That's not entirely correct. Most of moe right now are fully balanced, but there is an idea of a domain expert moe where the training benefits fewer switches. https://arxiv.org/abs/2410.07490

llama.cpp + quantized: https://huggingface.co/bartowski/Alibaba-NLP_Tongyi-DeepRese...

get the biggest one that will fit in your vram.

The Chinese version of the link says "通义 DeepResearch" in the title, so doesn't look like the "agree" to be the case. Completely agreed that it would be hilarious.

1: https://www.alibabacloud.com/en/solutions/generative-ai/qwen...

This is the way. I managed to run (super) tiny models on CPU only with this approach.

For people who don't read Chinese: the two 'yi' characters numpad0 mentioned (义 and 義) are the same, but written in different variants of Chinese script (Simplified/Traditional).

That's not entirely correct. Most of moe right now are fully balanced, but there is an idea of a domain expert moe where the training benefits fewer switches. https://arxiv.org/abs/2410.07490

This is the way. I managed to run (super) tiny models on CPU only with this approach.

For people who don't read Chinese: the two 'yi' characters numpad0 mentioned (义 and 義) are the same, but written in different variants of Chinese script (Simplified/Traditional).

> if we'll see an explosion of purpose trained LLMs...

I made a 4B Qwen3 distill of this model (and a synthetic dataset created with it) a while back. Both can be found here: https://huggingface.co/flashresearch

Just tried this out with my web search mcp, extremely impressed with it. Never seen deep research this good from a model so small.

At this point "deep research" is more of a pattern - OpenAI and Perplexity and Google Gemini all offer products with that name which work essentially the same way, and Anthropic and Grok have similar products with a slightly different name attached.

The pattern is effectively long-running research tasks that drive a search tool. You give them a prompt, they churn away for 5-10 minutes running searches and they output a report (with "citations") at the end.

This Tongyi model has been fine-tuned to be really good at using its search tool in a loop to produce a report.

This whole series of work is quite cool. The use of `word-break: break-word;` makes this really hard to read though.

I actually can’t read it for some reason? My brain just can’t connect the words

GPT-OSS-20B at 4- or 8-bits is probably your best bet? Qwen3-30b-a3b probably the next best option. Maybe there exists some 1.7 or 2 bit version of GPT-OSS-120B

5090 has 32GB of RAM. Not sure if that’s enough to fit this model.

This is over a month old, they released the weights a long time ago.

That's OK — not all of us follow all the progress on a daily basis, and a model that is a month old doesn't become useless just by being a month old!

And for those not so tightly in the loop: how does it compare?

I've recently put together a setup that seemed reasonable for my limited budget. Mind you, most of the components were second-hand, open box deals, or deep discount of the moment.

This comfortably fits FP8 quantized 30B models that seem to be "top of the line for hobbyists" grade across the board.

- Ryzen 9 9950X

- MSI MPG X670E Carbon

- 96GB RAM

- 2x RTX 3090 (24GB VRAM each)

- 1600W PSU

It was GPT-3 I think.

As far as I remember, it's post-training that kills chess ability for some reason (GPT-3 wasn't post-trained).

Im sure this guy has some helpful hints on that: https://youtube.com/@azisk

That's OK — not all of us follow all the progress on a daily basis, and a model that is a month old doesn't become useless just by being a month old!

And for those not so tightly in the loop: how does it compare?

I caught myself that most of my LLM usage is like this:

ask a loaded, "filter question" I more or less know the answer for, and mostly skip the prose and get to the links to its sources.

Does it offer more performance than a Macbook Pro that could be had for a comparable sum? Your build can be had for under $3k; a used MBP M3 with 64 GB RAM can be had for approximately $3.5k.

Unfortunately the RTX 3090 has no native FP8 support.

That's basically what I imagined would be my rig if I were to pull the trigger. Do you have an NVLink adapter as well?

Unfortunately? May I ask why? What country would you like to be the lead in AI?

The USA of course. Isn't it obvious? What other country is more Free and great? None. Why does this even need to be asked?

China is full of people who want communism to dominate the world with totalitarian control so no one wants China to dominate anything at all because they are bad...

This sounds like copium . If it was just about distillation,we'd be seeing many awesome models from Europe ,Japan and even India.

Fully agree to this. LM Studio is much nicer to use and with MLX faster on Apple Silicon

GPT-OSS-20B at 4- or 8-bits is probably your best bet? Qwen3-30b-a3b probably the next best option. Maybe there exists some 1.7 or 2 bit version of GPT-OSS-120B

I agree that the scaffolding around the model contributes greatly to the experience. But it doesn't take billions of dollars in GPUs to do that part.

I caught myself that most of my LLM usage is like this:

ask a loaded, "filter question" I more or less know the answer for, and mostly skip the prose and get to the links to its sources.

Just tried this out with my web search mcp, extremely impressed with it. Never seen deep research this good from a model so small.

This sounds like copium . If it was just about distillation,we'd be seeing many awesome models from Europe ,Japan and even India.

This Tongyi model has been fine-tuned to be really good at using its search tool in a loop to produce a report.

Yes, but I think my previous point still matter, namely what exact model is being used greatly affects the results.

So without specifying which model is being used, it's really hard to know what is better than something else, because we don't understand what the underlying model is, and if it's better because of the model itself, or the tooling, which feels like an important distinction.

5090 has 32GB of RAM. Not sure if that’s enough to fit this model.

LlamaCPP supports offloading some experts in a MoE model to CPU. The results are very good and even weaker GPUs can run larger models at reasonable speeds.

n-cpu-moe in https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...

It should fit enough of the layers to make it reasonably performant.

Does it offer more performance than a Macbook Pro that could be had for a comparable sum? Your build can be had for under $3k; a used MBP M3 with 64 GB RAM can be had for approximately $3.5k.

That's basically what I imagined would be my rig if I were to pull the trigger. Do you have an NVLink adapter as well?

I actually can’t read it for some reason? My brain just can’t connect the words

The USA of course. Isn't it obvious? What other country is more Free and great? None. Why does this even need to be asked?

China is full of people who want communism to dominate the world with totalitarian control so no one wants China to dominate anything at all because they are bad...

Fully agree to this. LM Studio is much nicer to use and with MLX faster on Apple Silicon

Unfortunately the RTX 3090 has no native FP8 support.

Yes, but I think my previous point still matter, namely what exact model is being used greatly affects the results.

LlamaCPP supports offloading some experts in a MoE model to CPU. The results are very good and even weaker GPUs can run larger models at reasonable speeds.

n-cpu-moe in https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...

I'm not sure, I did not run any benchmarks. As a ballpark figure -- with both cards throttled down to 250W, running a Qwen-30B FP8 model (variant depending on task), I get upwards of 60 tok/sec. It feels on par with the premium models, tbh.

Of course this is in a single-user environment, with vLLM keeping the model warm.

No NVLink; it took me a long time to compose the exact hardware specs, because I wanted to optimize performance. Both cards are on x8 PCIe direct CPU channels, close to their max throughput anyway. It runs hot with the CPU engaged, but it runs fast.

so it appears the entire text has been Translated with non-breaking space unicode x00a0 instead of normal spaces x0020, so the web layout is considering all paragraph text as a super-long single word ('the\00a0quick\00a0\brown\00a0fox' instead of 'the quick brown fox') - the non-breaking space character appears identically to breaking-space when rendered but underlying coding breaks the concept of "break at end of word" because there is no end as 00a0 literally means "non-breaking"). per Copilot spending a half hour explaining this to me, apparently this can be fixed by opening web browser developer view, and copy/pasting this code into the console.

function replaceInTextNodes(node) { if (node.nodeType === Node.TEXT_NODE) { node.nodeValue = node.nodeValue .replace(/\u00A0/g, ' '); } else { node.childNodes.forEach(replaceInTextNodes); } }

replaceInTextNodes(document.body);

That’s why typography matters. You can’t read it because a very basic convention has been broken here and that throws everything off.

USA is threatening to invade Europe so not sure it can be considered great.

The USA is being led by a criminal pedo atm. There is military in the streets and SA-like, masked thugs are kidnapping people. Billionaires sit behind the wheels to profit from all those developments. Many of them are somehow related to AI. You can image what that will be/is used (see Palantir).

The whole country is going down the drain right now. There is nothing about it, sane people outside the Republican bubble would consider "freedom".

Of course this is in a single-user environment, with vLLM keeping the model warm.

function replaceInTextNodes(node) { if (node.nodeType === Node.TEXT_NODE) { node.nodeValue = node.nodeValue .replace(/\u00A0/g, ' '); } else { node.childNodes.forEach(replaceInTextNodes); } }

replaceInTextNodes(document.body);

It should fit enough of the layers to make it reasonably performant.

The whole country is going down the drain right now. There is nothing about it, sane people outside the Republican bubble would consider "freedom".

That’s why typography matters. You can’t read it because a very basic convention has been broken here and that throws everything off.

USA is threatening to invade Europe so not sure it can be considered great.

GITHUB HUGGINGFACE MODELSCOPE SHOWCASE

From Chatbot to Autonomous Agent

We are proud to present Tongyi DeepResearch, the first fully open‑source Web Agent to achieve performance on par with OpenAI’s DeepResearch across a comprehensive suite of benchmarks. Tongyi DeepResearch demonstrates state‑of‑the‑art results, scoring 32.9 on the academic reasoning task Humanity’s Last Exam (HLE), 43.4 on BrowseComp and 46.7 on BrowseComp‑ZH in extremely complex information‑seeking tasks, and achieving a score of 75 on the user‑centric xbench‑DeepSearch benchmark, systematically outperforming all existing proprietary and open‑source Deep Research agents.

Beyond the model, we share a complete and battle‑tested methodology for creating such advanced agents. Our contribution details a novel data synthesis solution applied across the entire training pipeline, from Agentic Continual Pre‑training (CPT) and Supervised Fine‑Tuning (SFT) for cold‑starting, to the final Reinforcement Learning (RL) stage. For RL, we provide a full‑stack solution, including algorithmic innovations, automated data curation, and robust infrastructure. For inference, the vanilla ReAct framework showcases the model’s powerful intrinsic capabilities without any prompt engineering, while the advanced Heavy Mode (test‑time‑scaling) demonstrates the upper limits of its complex reasoning and planning potential.

Continual Pre‑training and Post‑training Empowered by Fully Synthetic Data

Continual Pre‑training Data

We introduce Agentic CPT to deep research agent training, creating powerful agentic foundation models for post‑training. We propose AgentFounder, a systematic and scalable solution for large‑scale data synthesis that creates a data flywheel with data from the post‑training pipeline.

Data Reorganization and Question Construction. We continuously collect data from various sources, including documents, publicly available crawled data, knowledge graphs, and historical trajectories and tool invocation records (e.g., search results with links). As shown in the figure, these diverse data sources are restructured into an entity‑anchored open‑world knowledge memory. Based on randomly sampled entities and their corresponding knowledge, we generate multi‑style (question,answer) pairs.

Action Synthesis. Based on diverse problems and historical trajectories, we construct first‑order action synthesis data and higher‑order action synthesis data. Our method enables large‑scale and comprehensive exploration of the potential reasoning‑action space within offline environments, thereby thereby eliminating the need for additional commercial tool API calls. Specifically, for the higher‑order action synthesis, we remodel trajectories as multi‑step decision‑making processes to enhance the model’s decision‑making capabilities.

Post-training Data

High-quality synthetic QA pairs

We develop an end‑to‑end solution for synthetic data generation. This fully automated process requires no human intervention to construct super‑human quality datasets, designed to push the boundaries of AI agent performance. Through long‑term exploration and iteration‑from early methods like reverse‑engineering QA pairs from clickstreams (WebWalker) to the more systematic graph‑based synthesis (WebSailor and WebSailor‑V2), then the formalized task modeling (WebShaper)‑our approach ensures both exceptional data quality and massive scalability, breaking through the upper limits of model capabilities.

To address complex, high‑uncertainty questions, we synthesize web‑based QA data through a novel pipeline. The process begins by constructing a highly interconnected knowledge graph via random walks and isomorphic tables towards tabular data fusion from real‑world websites , ensuring a realistic information structure. We then sample subgraphs and subtables to generate initial questions and answers. The crucial step involves intentionally increasing difficulty by strategically obfuscating or blurring information within the question. This practical approach is grounded in a complete theoretical framework, where we formally model QA difficulty as a series of controllable “atomic operations” (e.g., merging entities with similar attributes) on entity relationships, allowing us to systematically increase complexity.

To further reduce inconsistencies between the organized information structure and the reasoning structure of QA, enable more controllable difficulty and structure scaling of reasoning, we proposed a formal modeling of the information‑seeking problem based on set theory. With this formalization, we developed agents that expands the problem in a controlled manner, and minimizes reasoning shortcuts and structural redundancy, leading to further improved QA quality. Moreover, this formal modeling also allows for efficient verification of QA correctness, effectively addressing the challenge of validating synthetic information‑seeking data for post‑training.

Furthermore, we have developed an automated data engine to scale up the creation of PhD‑level research questions. This engine begins with a multi‑disciplinary knowledge base, generating “seed” QA pairs that require multi‑source reasoning. Each seed then enters a self‑guided loop of “iterative complexity upgrades”, where a question‑crafting agent is equipped with a powerful toolset including web search, academic retrieval, and a Python execution environment. In each iteration, the agent expands knowledge boundaries, deepens conceptual abstraction, and even constructs computational tasks, creating a virtuous cycle where the output of one round becomes the more complex input for the next, ensuring a controllable and systematic escalation of task difficulty.

Unleashing Agent Capabilities with Diverse Reasoning Pattern

To bootstrap the model’s initial capabilities, we constructed a set of trajectories via rejection sampling, based on the ReAct and IterResearch frameworks (for details, see below). On one hand, ReAct, as a classic and foundational multi-turn reasoning format, instills rich reasoning behaviors and reinforces the model’s ability to adhere to structured formats.

On the other hand, we introduce IterResearch, an innovative agent paradigm (detailed below). It unleashes the model’s full reasoning potential by dynamically reconstructing a streamlined workspace in each turn, ensuring that every decision is deliberate and well-considered. Leveraging IterResearch, we constructed a set of trajectories that integrate reasoning, planning, and tool-use, thereby strengthening the model’s capacity for sustained planning when confronted with Long-Horizon tasks.

Rollout Mode

We have conducted extensive exploration into the rollout paradigms for DeepResearch‑type agents. As a result, our final model supports multiple rollout formats, including the native ReAct Mode and the context‑managing Heavy Mode.

Native ReAct Mode

Our model demonstrates excellent performance using the native ReAct reasoning paradigm without any prompt engineering. It strictly adheres to the Thought‑Action‑Observation cycle, performing multiple iterations to solve problems. With a model context length of 128K, it can handle a large number of interaction rounds, fully achieving scaling in its interaction with the environment. ReAct’s simplicity and universality provide the clearest benchmark for a model’s intrinsic capabilities and the efficacy of our training pipeline.

Our choice of ReAct is heavily informed by “The Bitter Lesson”, which posits that general methods leveraging scalable computation ultimately outperform approaches that rely on complex, human‑engineered knowledge and intricate designs.

Heavy Mode

In addition to the native ReAct mode, we have developed a “Heavy Mode” for complex, multi‑step research tasks. This mode is built on our new IterResearch paradigm, designed to push the agent’s capabilities to their limit.

The IterResearch paradigm was created to solve the “cognitive suffocation” and “noise pollution” that occurs when agents accumulate all information into a single, ever‑expanding context. Instead, IterResearch deconstructs a task into a series of “research rounds”.

In each round, the agent reconstructs a streamlined workspace using only the most essential outputs from the previous round. Within this focused workspace, the agent analyzes the problem, integrates key findings into a continuously evolving central report, and then decides its next action‑either gathering more information or providing a final answer. This iterative process of “synthesis and reconstruction” allows the agent to maintain a clear “cognitive focus” and high reasoning quality throughout long tasks.

Building on this, we propose the Research‑Synthesis framework. In this model, multiple Research Agents use the IterResearch process to explore a problem in parallel. A final Synthesis Agent then integrates their refined reports and conclusions to produce a more comprehensive final answer. This parallel structure enables the model to consider a wider range of research paths within a limited context window, pushing its performance to the limit.

End-to‑End Agent Training Pipeline

Training an agentic model like this required us to rethink the entire model training pipeline, from pre‑training to fine‑tuning to reinforcement learning. We established a new paradigm for agent model training that connects Agentic CPT → Agentic SFT → Agentic RL, creating a seamless end‑to‑end training loop for an AI agent. Here’s how we tackled the final stage with reinforcement learning, which was crucial for aligning the agent’s behavior with high‑level goals:

On‑Policy Agent Reinforcement Learning (RL)

Constructing a high‑quality agent through RL is a complex system engineering challenge; if this entire development process is viewed as a “reinforcement learning” loop, any instability or lack of robustness in its components can lead to erroneous “reward” signals. We will now share our practices in RL, covering both the algorithmic and infrastructure sides.

For RL algorithm, we made several algorithmic breakthroughs, using a customized on‑policy Group Relative Policy Optimization (GRPO). We employ a strictly on‑policy training regimen, ensuring that the learning signal is always relevant to the model’s current capabilities. The training objective is optimized using a token‑level policy gradient loss. Second, to further reduce variance in the advantage estimation, we adopt a leave‑one‑out strategy. Furthermore, we employ a conservative strategy for negative samples, having observed that an unfiltered set of negative trajectories significantly degrades training stability. This can manifest as a “format collapse” phenomenon after extended training. To mitigate this, we selectively exclude certain negative samples from the loss calculation, for instance, those that do not yield a final answer because they exceed a length limit. For the sake of efficiency, we do not employ dynamic sampling. We instead leverage larger batch and group sizes, which serve to maintain smaller variance and provide adequate supervision.

The training dynamics demonstrate effective learning, with a consistent upward trend in reward. Meanwhile, policy entropy remains consistently high, indicating sustained exploration and preventing premature convergence. We attribute this to the non‑stationary nature of the web environment, which naturally fosters a robust, adaptive policy and obviates the need for explicit entropy regularization.

We consider that the algorithm is important but not the only decisive factor in the success of Agentic RL. We have experimented with many different algorithms and tricks, and find that data and stability of the training environment are likely the more critical components in determining whether the RL works. Interestingly, we have tested to train the model directly on the BrowseComp testing set, but the results are substantially poorer than when using our synthetic data. We hypothesize that this disparity arises because the synthetic data offers a more consistent distribution, which allows the model to be more effectively tailored. Conversely, the human‑annotated data (such as BrowseComp) is inherently noisier. Given its limited scale, it is difficult to approximate a learnable underlying distribution, which consequently hinders the model to learn and generalize from it.

On the infrastructure side, training an agent with tools required us to develop a highly stable and efficient environment:

Synthetic Training Environment: Relying on live web APIs for development is expensive, slow, and inconsistent. We addressed this by creating a simulated training environment using an offline Wikipedia database and a custom tool suite. By adapting our data pipeline to generate high‑quality, complex tasks for this environment, we created a cost‑effective, fast, and controllable platform that dramatically accelerates our research and iteration.
Stable & Efficient Tool Sandbox: To ensure reliable tool use during agent training and evaluation, we developed a unified sandbox. The sandbox handles concurrency and failure gracefully by caching results, retrying failed calls, and using redundant providers as fallbacks (e.g., a backup search API). This provides the agent with a fast and deterministic experience, which is crucial for preventing tool errors from corrupting its learning trajectory.
Automatic Data Curation: Data is the core driver of model capability enhancement; its importance even surpasses that of the algorithm. The quality of the data directly determines the upper bound on the model’s ability to generalize to out‑of‑distribution scenarios through self‑exploration. To address this challenge, we optimize data in real time, guided by training dynamics. This optimization is achieved through a fully automated data synthesis and filtering pipeline that dynamically adjusts the training set. By closing the loop between data generation and model training, this approach not only ensures training stability but also delivers substantial performance gains.
On‑Policy Asynchronous Framework: We implemented a custom step‑level asynchronous RL training loop on top of rLLM. Multiple agent instances interact with the (simulated or real) environment in parallel, each producing trajectories.

Through these measures, we “closed the loop” on agent training. Starting from a raw model, we did Agentic pre‑training to initialize tool‑use skills, then supervised finetuning on expert‑like data to cold start, and finally on‑policy RL to let the model conduct self‑evolution. This full‑stack approach ‑ now proven with Tongyi DeepResearch ‑ presents a new paradigm for training AI agents that can robustly solve complex tasks in dynamic environments.

(Our RL approach is inspired by several past work from Agentica. We adapt their rLLM framework and extend it to train our web agents.)

Real‑World Applications and Impact

Tongyi DeepResearch is not just a research showcase; it’s already powering real applications within Alibaba and beyond, demonstrating its value in practical scenarios:

Gaode Mate (Map & Navigation Agent): Collaborating with Amap (Gaode) Team, we co‑developed “Xiao Gao,” an AI copilot that leverages the app’s rich toolset. It can execute complex travel planning commands, like creating a multi‑day driving tour that includes specific scenic spots and pet‑friendly hotels. Through multi‑step reasoning, Xiao Gao autonomously researches and integrates information to produce a detailed, personalized itinerary, offering an intelligent planning experience that far surpasses standard navigation.

Tongyi FaRui (Legal Research Agent): Empowered by our DeepResearch architecture, FaRui now functions as a true legal agent. It autonomously executes complex, multi‑step research tasks that mirror a junior attorney’s workflow‑systematically retrieving case law, cross‑referencing statutes, and synthesizing analysis. Crucially, all conclusions are grounded in verifiable judicial sources and delivered with precise case and statute citations, ensuring professional‑grade accuracy and credibility.

Limitations

Our future work will address three key limitations. First, the current 128k context length is still insufficient for the most complex long‑horizon tasks, requiring us to explore expanded context windows and more sophisticated information management. Second, our training pipeline’s scalability remains unproven on foundation models significantly larger than our 30B‑scale MoE, and we plan to validate our methods on larger‑scale models. Lastly, we aim to improve the efficiency of our reinforcement learning framework by investigating techniques like partial rollouts, which will necessitate solving the challenges of off‑policy training, such as distributional shift.

Series Work

Tongyi DeepResearch also has an extensive deep research agent family. You can find more information in the following papers:

[1] WebWalker: Benchmarking LLMs in Web Traversal

[2] WebDancer: Towards Autonomous Information Seeking Agency

[3] WebSailor: Navigating Super‑human Reasoning for Web Agent

[4] WebShaper: Agentically Data Synthesizing via Information‑Seeking Formalization

[5] WebWatcher: Breaking New Frontier of Vision‑Language Deep Research Agent

[6] WebResearch: Unleashing reasoning capability in Long‑Horizon Agents

[7] ReSum: Unlocking Long‑Horizon Search Intelligence via Context Summarization

[8] WebWeaver: Structuring Web‑Scale Evidence with Dynamic Outlines for Open‑Ended Deep Research

[9] WebSailor‑V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

[10] Scaling Agents via Continual Pre‑training

[11] Towards General Agentic Intelligence via Environment Scaling

Our team has a long‑standing commitment to the research and development of deep research agents. Over the past six months, we have consistently published one technical report per month, totaling five to date. Today, we are excited to simultaneously release six new reports and share our Tongyi DeepResearch‑30B‑A3B model with the community.

Stay tuned for our next generation of agentic models.

@misc{tongyidr,
  author={Tongyi DeepResearch Team},
  title={Tongyi DeepResearch: A New Era of Open-Source AI Researchers},
  year={2025},
  howpublished={\url{https://github.com/Alibaba-NLP/DeepResearch}}
}

Hacker Times