Nvidia Cosmos 3

SOTA open source model for image and vid generation. Beats all others but is too big to run on most people’s computers at 64b params.

Still impressive nonetheless given its artificially generated training sets.

Beats nano banana 1 but not yet competitive with 2 or seedance2, grok imagine,etc.

  This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers. 
  Reasoner tower: A vision-language model (VLM) ... This serves as the ‘brain’ that reasons about the world before any generation happens.
  Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding.

This sort of approach (and others i've seen like it) always appeal to my inner engineer, trying to optimize and balance tradeoffs between model architectures and combine two things to yield the best of both worlds

But based on my understanding of the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), this is precisely the wrong approach in the long term. I'm linking the actual text of the bitter lesson because I think it's misunderstood (or I just don't agree with how i've seen it used in discourse). Specifically:

  The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

This architecture feels specifically like "trying to build knowlege into the agent that will help in the short term" but will plateau long term. That's not to say that there won't be some interesting learnings or things built on top of it, but I doubt that there's a lot of juice to squeeze with this kind of approach IMO.

> Cosmos 3 Nano is the compact version with 16B parameters and optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.

Looking forward to trying this out on my $10000+ workstation grade GPU that I need an equally expensive set up to run.

The warehouse safety video example is really funny, because the people don't react at all.

It is funny that after all their tech advancements, the site is struggling under heavy load.

I'm struggling to understand what this does.

> Generates future observations and action sequences.

Is that just a complicated way of saying video gen?

Most of the examples they've chosen seem.. not good? What an odd mix of bad game engine and AI slop. I can't imagine that this stuff makes good training data for real-world applications.

SOTA open source model for image and vid generation. Beats all others but is too big to run on most people’s computers at 64b params.

Still impressive nonetheless given its artificially generated training sets.

Beats nano banana 1 but not yet competitive with 2 or seedance2, grok imagine,etc.

It's sadly ironic I no longer even bother clicking on HN posts that are obvious product announcements from large corporations and instead just go to the replies. Corporate product announcements somehow fail to even clearly communicate the basic facts you did in your first nine words.

Great summary. I find image and video generation models are a more understandable reality check for how close local models are to frontier models.

It is funny that after all their tech advancements, the site is struggling under heavy load.

Most of the examples they've chosen seem.. not good? What an odd mix of bad game engine and AI slop. I can't imagine that this stuff makes good training data for real-world applications.

The warehouse safety video example is really funny, because the people don't react at all.

The car video is silly as well, the crossing van clearly runs a red light. The big shadow of the light pole in the intersection also makes no sense...

Looking forward to trying this out on my $10000+ workstation grade GPU that I need an equally expensive set up to run.

I have the GPU but no robot. What’s the minimum viable robot needed to play with this?

Good news, Nvidia will happily sell you one of their new RTX Spark laptops to run this.

  This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers. 
  Reasoner tower: A vision-language model (VLM) ... This serves as the ‘brain’ that reasons about the world before any generation happens.
  Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding.

  The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

This feels like the opposite to me? The MoT architecture looks like the ideal that the Bitter Lesson alludes to - just take all of your data in all of your formats (audio, image, text, action, video) and dump it all into a single shared latent space. Then let the model sort things out, with just enough structure to handle the different requirements/output formats needed (e.g. autoregressive stuff for sequence modeling/prediction, diffusion stuff for generation).

This is mostly a decompression, it’s fairly standard nowadays. The point is to get the data from the internal compressed version into the human usable version.

We can technically reason at pixel or char level encodings but it’s going to be much more expensive generally. Think of the overall technique as a way to get computer go faster.

You see it with Qwen talker, most multimodal projectors, etc

Except this model has a broader domain than text-LLM models. More than the old omni models too since it takes video input. The architecture is exotic but I don't see tuning here that is more extreme than open models released every day.

I'm struggling to understand what this does.

> Generates future observations and action sequences.

Is that just a complicated way of saying video gen?

As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples is purely analysing an existing video, the other is predicting (i.e. video gen) from a static image to a video

Look at the table of supported modalities. It can take in input of image/video/text/actions and output image/video/text/actions.

It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from first person perspective to generate training data without sending robots to peoples homes.

You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.

Great summary. I find image and video generation models are a more understandable reality check for how close local models are to frontier models.

Good news, Nvidia will happily sell you one of their new RTX Spark laptops to run this.

I have the GPU but no robot. What’s the minimum viable robot needed to play with this?

The car video is silly as well, the crossing van clearly runs a red light. The big shadow of the light pole in the intersection also makes no sense...

Cars run red lights in real life. Driving defensively requires anticipating it. Anyone expecting them not to is more likely to get in a crash.

The rest I can't speak to.

Look at the table of supported modalities. It can take in input of image/video/text/actions and output image/video/text/actions.

That just raises more questions. What kind "observation or action" image does input generate? What is an action output if it's not text?

You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.

This is mostly a decompression, it’s fairly standard nowadays. The point is to get the data from the internal compressed version into the human usable version.

We can technically reason at pixel or char level encodings but it’s going to be much more expensive generally. Think of the overall technique as a way to get computer go faster.

You see it with Qwen talker, most multimodal projectors, etc

Cars run red lights in real life. Driving defensively requires anticipating it. Anyone expecting them not to is more likely to get in a crash.

The rest I can't speak to.

That just raises more questions. What kind "observation or action" image does input generate? What is an action output if it's not text?

Physical AI systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what’s happening in their world, predict what’s likely to happen next, and generate actions for specific environments, embodiments, and tasks.

NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model.

NVIDIA is open sourcing Cosmos 3 models, training scripts, deployment tools, and datasets to make physical AI development more open and reproducible. This blog post covers the fundamentals of Cosmos 3, highlights key concepts from the technical report, guides through technical workflows, and shows how teams robotic manipulation systems, autonomous vehicles, and warehouse monitoring solutions can get started.

A video clip generated by Cosmos 3 for the autonomous driving domain. The video is from a vehicle’s point-of-view at an intersection. Another car crosses the intersection in front of this vehicle, and then the vehicle takes a left turn. The video looks realistic and shows houses, trees, and cars in the surroundings.

Figure 1. A clip of a video generated by Cosmos 3 for the autonomous driving domain

A video shows a corridor with shelves of boxes on either side and a pile of boxes on the ground. Three people are standing next to the pile of boxes. There’s a small explosion from one of the boxes on the floor, and it starts smoking.

Figure 2. A video generated using Cosmos 3 for warehouse safety data.

Key highlights of this release include:

NVIDIA Cosmos 3 Nano and NVIDIA Cosmos 3 Super model checkpoints on Hugging Face with code on GitHub.
Open datasets for physical AI applications like robotics and autonomous driving.
Open post-training scripts for adapting Cosmos 3 to your domain.
Cosmos NIM microservices for easy, optimized deployment on NVIDIA GPUs.

What’s new in Cosmos 3

Previous Cosmos releases separated world generation, physical understanding, and controlled scene generation into different models and workflows. This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers.

Reasoner tower: A vision-language model (VLM) that interprets multimodal observations like images, videos, and text. This tower uses an autoregressive architecture to interpret the input and understand motion, object interactions, and other physical context. This serves as the ‘brain’ that reasons about the world before any generation happens.
Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding. The reasoner can be called independently, but the generator always activates both towers for guided generation.

Cosmos 3 architecture diagram: an autoregressive reasoner tower that takes in text, image, video, audio, and action inputs is connected to a diffusion-based generator tower that outputs text, image, video, audio, and action. Information from the reasoner tower feeds unidirectionally into the generator tower, which enables coherent generation.

Figure 3. Cosmos 3 architecture

This architecture enables a single model to do reasoning and generation tasks, simplifying development by eliminating orchestration between multiple models and inference pipelines.

Choose the right model size

Two Cosmos 3 models are currently available:

Cosmos 3 Nano is the compact version with 16B parameters and optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.
Cosmos 3 Super is a 64B parameter model designed for maximum quality and capability. It delivers the highest benchmark scores and targets datacenter deployment on NVIDIA Hopper and NVIDIA Blackwell GPUs, making it suitable for large-scale synthetic data generation and advanced physical reasoning workloads.

Supported modalities

Cosmos 3 supports the following input and output modalities through its unified architecture:

Action-conditioned world model	Output	Application
Text	Image	Physically-plausible Image generation
Text \| Video	Video	World model for rare edge case video data generation
Text \| Image	Video	World model for prediction
Text \| Image \| Video	Text	VLM for reasoning
Action \| Video \| Text	Video	Action-conditioned world model
Video \| Text	Video \| Action	World action model, video action model, vision language action model, policy model for robot learning

Table 1. Input and output modalities supported by Cosmos 3 for different applications

Open datasets for physical AI

With the Cosmos 3 release, NVIDIA is open-sourcing six synthetic data generation (SDG) datasets on Hugging Face. These cover robotics, physics simulation, spatial reasoning, human motion, driving, and warehouse environments, and can be used for post-training Cosmos 3 and other models:

Physical AI World Model Synthetic Datasets include:

A collection of videos in the Embodied Robot Scenes dataset. The videos show different humanoid robots doing manipulation tasks in different environments.

Figure 4. Manipulation examples from the Embodied Robot Scenes dataset

A collection of videos in the Physical Interaction Scenes dataset. The videos show simulated scenes like a wrecking ball hitting objects, a toy tower collapsing, and dominoes falling. For each scene, the dataset has corresponding ground-truth physics annotations like per-object velocity, center-of-mass displacement, and per-frame semantic segmentation.

Figure 5. Examples from the Physical Interaction Scenes dataset

A collection of images showing the Spatial Reasoning dataset, including scenes like kitchens, corridors, offices, and utility rooms. It also includes question-answer pairs like, “How far is the coffee table from the sofa?” and “What is the best route for the robot to reach the study room?”

Figure 6. Examples from the Spatial Reasoning dataset

A collection of videos in the Digital Human Scenes dataset. The videos show some simulated indoor and outdoor environments with digital people standing and moving. These videos provide diverse human appearance, motion, scene context, lighting, and camera motion.

Figure 7. Examples from the Digital Human Scenes dataset

A collection of videos from the Autonomous Driving Scenarios dataset. The videos are from the ego point of view of an autonomous vehicle and show the vehicle driving on roads in different scenarios. The videos show diverse weather and lighting conditions and driving behaviors like lane changing and pedestrian interactions.

Figure 8. Examples from the Autonomous Driving Scenarios dataset

A collection of videos from the Warehouse Operations Scenes dataset. The videos show simulated warehouse scenes from different camera angles. Some videos show a forklift moving and colliding with people or objects. In another video, a person drops a cardboard box on the floor.

Figure 9. Examples from the Warehouse Operations Scenes dataset

NVIDIA Cosmos Human Evaluation benchmark

The NVIDIA Cosmos Human Evaluation (HUE) framework assesses Cosmos 3 generator quality across representative domain tasks.

As SOTA video generation models saturate existing automated leaderboards, score differences between releases are often too narrow for meaningful comparison. HUE shifts evaluation from subjective grading to objective fact verification, enabling fine-grained comparison between top-tier models. The result is a more reliable quality signal for both rapid iteration and rigorous release decisions backed by full human evaluation.

HUE evaluates video generation quality using atomic binary verification. Each generated video is decomposed into single-fact yes/no questions across four dimensions—semantic alignment, physical laws, geometric reasoning, and visual integrity—spanning seven Physical AI domains, including robotics, autonomous vehicles, and physics. These questions are generated by a VLM pipeline, refined by human experts, and released as open source on Hugging Face.

Benchmark results

Cosmos 3 has been evaluated across multiple benchmark suites covering physical AI reasoning, generation quality, and domain-specific performance.

Reasoning benchmarks

Cosmos 3 Super and Cosmos 3 Nano lead on VANTAGE-Bench at the 32B tier and the 8B tier, respectively:

VANTAGE-Bench: First public benchmark for evaluating vision-language models on real-world fixed-camera footage across warehouses, transportation, and smart spaces.
Traffic Anomaly Reasoning (TAR): A new leaderboard for detecting and reasoning anomalous events in transportation footage and the official leaderboard for AI City Challenge 2026 Track 3.

Generator benchmarks

Cosmos 3 is the open-source SOTA and currently leads on PAI-Bench, R-Bench Physics-IQ, and RoboLab across public leaderboards:

Artificial Analysis: A benchmarking platform that ranks AI models for text, image, and video generation. Cosmos 3 is the leading open source model on the Text to Image leaderboard and Image to Video (no audio) leaderboard.
R-Bench: A benchmark for evaluating video-based world models in robotic video generation. It assesses task completion and visual quality through sub-metrics like structural consistency, physical plausibility, and execution completeness.
PAI-Bench: A unified benchmark evaluating physical AI across video understanding and video generation, spanning domains like robotics, autonomous vehicles, and physics common sense.
Physics-IQ: A benchmark of real-world videos that tests whether generative video models truly understand physical principles, rather than just achieving visual realism.
RoboLab: A simulation benchmark for evaluating task-generalist robot policies.

Training recipes

A central component of the Cosmos 3 release is a fully open set of training recipes. Beyond model checkpoints, this release provides code, configs, and workflows for adapting Cosmos 3 to new domains, embodiments, and datasets.

Supervised Fine-Tuning post-training

Supervised Fine-Tuning (SFT) enables developers to adapt a Cosmos 3 model to their own data. The released recipes include vision generation post-training for custom video datasets, as well as action-oriented recipes for robotics and physical AI workflows. Developers can customize Cosmos 3 for their target domains across robotics, autonomous driving, and warehouse automation.

The post-training code and configs are available on GitHub.

Action post-training

Action post-training adapts Cosmos 3 for action-aware Physical AI applications, including forward dynamics, inverse dynamics, and policy generation. Developers can post-train Cosmos 3 on action-labeled data. For robotics applications, this includes several important workflows: generating future observations conditioned on robot actions, inferring the actions behind observed demonstrations, and predicting action sequences from current observations and task prompts. This makes Cosmos 3 a strong foundation for world action modeling and policy learning.

Video 1. Tutorial video showing how to post-train Cosmos 3

Deploy with NVIDIA NIM Microservices

Cosmos 3 models are also available as NVIDIA NIM microservices for optimized, production-ready deployment. NIM microservices package the model with optimized inference runtimes, delivering high performance without the need to manually tune serving infrastructure. NIM microservices are easier to use for inference workflows compared to the Cosmos 3 repo on GitHub, which is preferred for post-training workflows.

The Cosmos 3 Reasoner NIM is available today, delivering the reasoning capabilities of the Cosmos 3 model. Keep posted for the Cosmos 3 Generator NIM, which provides full generation capabilities of the Cosmos 3 model.

Optimizations made to accelerate inference

Quantization: Cosmos 3 NIM supports selecting BF16, FP8, or NVFP4 quantized checkpoints. The NVFP4 quantization reduces the model’s numerical precision from BF16 to 4-bit floating point, achieving up to 2x inference speedup.
vLLM: Is an open source inference engine that uses techniques like continuous batching, paged attention, and tensor parallelism to serve LLMs efficiently. The Cosmos 3 Reasoner NIM serving stack is built on vLLM for higher throughput compared to conventional serving approaches. Cosmos 3 Nano is ready to run with vLLM-omni and NVIDIA Dynamo for top performance.
Efficient Video Sampling (EVS): This technique reduces the number of video tokens fed into the VLM during inference, speeding up the Cosmos Reason NIM. EVS works at the chunk level, keeping the most unique chunks of each frame and pruning the rest. Smaller GPUs tend to benefit more from this technique.

How to run the NIM

An NVIDIA NGC API key is required to pull the containers and download the Cosmos 3 models from NGC.

To pull and run the Cosmos 3 Nano Reasoner NIM. For the Cosmos 3 Super Reasoner NIM, specify NIM_MODEL_SIZE=super.

docker run --gpus=all \ -e NGC_API_KEY=$NGC_API_KEY \ -e NIM_MODEL_SIZE=nano \ -p 8000:8000 \ nvcr.io/nim/nvidia/cosmos3-reasoner:latest

Find details on API usage and more in the documentation.

Video 2. Tutorial video showing how to use the Cosmos Reasoner NIM

Get started

Download the Cosmos 3 Nano and Super checkpoints on Hugging Face.
Find examples and code on the Cosmos 3 GitHub.
Try the Cosmos 3 Nano Reasoner model experience and the Cosmos 3 Nano model experience.
Join the community, open issues, and contribute to the Cosmos ecosystem on GitHub and Discord.

Acknowledgments

_Cosmos 3 is the result of amazing collaboration between many teams and people across NVIDIA, including Adeline Aubame, Aditya Mahajan, Aigul Dzhumamuratova, Akash Gokul, Akul Santhosh, Aleksandr Efitorov, Alex Sotelo, Alexander Schwarz, Alperen Degirmenci, Amol Fasale, Andrew Tham, Ankur Handa, Arihant Jain, Arslan Ali, Artur Zolkowski, Aryaman Gupta, Asawaree Bhide, Ashkan Mirzaei, Ashley Chow, Ashna Khetan, Atharva Joshi, Barnaby Simkin, Benedikt Falk, Brett Hamilton, Carlos Casanova, Chaeyeon Chung, Charles Zhou, Chen-Hsan Lin, Chen-Hsuan Lin, Chhavi Nijhawan, Chieh-Yun Chen, Chintan Shah, Chris Helvig, Chris Pruett, Cindy Zha, Cyrus Hogg, Dahjung Chung, Dan Blick, David Wehr, Dawid Majchrowski, DeLesley Hutchins, Delin Qu, Dennis Lynch, Diego Garzon, Dima Zhylko, Durra Mohsin, Egor Krivov, Ekram Mukbil, Eric Cameracci, Fangyin Wei, Fengzhe Zhou, Francesco Ferroni, Freya Li, George Kurian, Gwanghyun Kim, Haaland Hao Liang, Hai Loc Lu, Hans Yang, Hao Liang, Hao Wang, Hesam Rabeti, Hugo Hadfield, Hyejin Moon, Itai Zadok, Jayjun Lee, Jeana Choi, JF Lafleche, Jiangran Lyu, Jiaojiao Fan, Jiaxiang Tang, Jibin Varghese, Jim Fan, Jingyi Jin, Jinwei Gu, Jon Allen, Joshua Bapst, Joyjit Daw, Julia Kiczka, Julian Ouyang, Kaichun Mo, Kayley Ting, Ke Ding, Kedi Wu, Kevin Brady, Kirill Motkov, Kristen Rumley, Krzysztof Tomala, Liang Feng, Liangkai Zhang, Ling Li, Louis Marcoux, Maciej Bala, Madison Huang, Magdalena Dadela, Mahesh Patekar, Marco Di Lucca, Marilyn Reeb, Mark Carlson, Martin Antolini, Mateusz Sieniawski, Matt Cragun, Meredith Price, Michael Huang, Miguel Guerrero, Miguel Martin, Min Shi, Ming-Yu Liu, Mohammad Harrim, Morteza Ramezanali, Mukesh Beladiya, Nalin Dadhich, Naomi Eigbe, Nathan Hayes-Roth, Nicole Drumheller, Nikhilesh Joshi, Omar Laymoun, Paris Zhang, Paula Ramos, Pawel Morkisz, Peter Gambrill, Pooya Jannaty, Pooya Khaloo, Pranjali Joshi, Qi Wang, Qianli Ma, Qiao Wang, Qing Miao, Qizhi Chen, Rahul Heinrich Steiger, Raju Wagwani, Robert Denomme, Rodrigo Vieira Del Monte, Roy Anthony, Ruqing Xu, Ryan Bernard, Ryan Ji, Saeid Motiian, Sandip Bhaskar, Sandra Skaff, Santanu Dutta, Saurav Kumar, Sehwi Park, Sergiy Fefilatyev, Shangkun Sun, Shangru Li, Shilin Zhu, Shreyas Misra, Shun Zhang, Shuran Song, Simon Yuen, Simon Zhang, Slawek Kierat, Smita Ithape, Soha Pouya, Sophia Huang, Stefanie Manzinger, Steven Baughman, Suneel Indupuru, Sunil Srinivasa, Sunny Kim, Tavish Chen, Thabang Ngazimbi, Thomas Volk, Tianwei She, Tiffany Cai, Ting-Chun Wang, TJ Galda, Tolou Tavakkoli, Tomasz Kornuta, Trung Pham, Tsung-Yi Lin, Vanni Brighella, Varun Praveen, Wei-Cheng Tseng, Wenjie Luo, Wesley Li, Wojciech Kutak, Wojciech Rymer, Xiangyu Lu, Xiaodong Yang, Xiaotong Chen, Xin Kong, Xinquan Xu, Xiu Chia, Xuning Yang, Yan Chang, Yan Wang, Yanan Jian, Yao Xu, Yashraj Narang, Yeongho Seol, Yichu Yang, Yifan Ding, Yihuai Gao, Yilin Zhao, Yin Cui, Yogesh Balaji, Yu Wang, Yu-Wei Chao, Yue Tang, Yufan Huang, Yuke Zhu, Yuliya Zhautouskaya, Yurong You, Yuzhu Dong, Zaid Pervaiz Bhat, Zekun Hao, Zhaoshuo Li, Zhizheng Zhang.
_

Hacker Times