💫 Project Page | Models & Bench 🤗 🤖 | 🚀 Demo | 📚 Cookbooks
We present RynnBrain, an embodied foundation model grounded in physical reality. RynnBrain is available in two dense variants (2B and 8B) and one mixture-of-experts (MoE) model (30B-A3B). In addition, we release three post-trained models: RynnBrain-Plan (robot task planning), RynnBrain-Nav (vision-language navigation), and RynnBrain-CoP (chain-of-point reasoning).
RynnBrain employs a unified encoder-decoder architecture (supporting both Dense and MoE variants) to transform omni-vision inputs and textual instructions into multi-modal outputs, including spatial trajectories, physical pointing, and action planning. Through large-scale training on rich spatio-temporal, physical-space, and general-knowledge data, RynnBrain maintains robust general-purpose capabilities while specializing in diverse, fine-grained embodied reasoning and complex planning tasks.
| Model | Base Model | HuggingFace | ModelScope |
|---|---|---|---|
| RynnBrain-2B | Qwen3-VL-2B-Instruct | Link | Link |
| RynnBrain-8B | Qwen3-VL-8B-Instruct | Link | Link |
| RynnBrain-30B-A3B | Qwen3-VL-30B-A3B-Instruct | Link | Link |
| RynnBrain-CoP-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-30B-A3B | RynnBrain-30B-A3B | Link | Link |
| RynnBrain-Nav-8B | RynnBrain-8B | Link | Link |
Minimal dependencies:
pip install transformers==4.57.1
Run text generation:
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained("")  # fill in a RynnBrain repo ID from the model table above
...
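For a fuller picture, here is a minimal end-to-end inference sketch. It assumes the standard `transformers` image-text-to-text chat interface used by Qwen3-VL-style checkpoints (which RynnBrain builds on); the repo ID, image URL, prompt, and generation settings are placeholders rather than official values, so adjust them to your setup and see the cookbooks for vetted prompts.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder: substitute a RynnBrain checkpoint from the model table above.
model_id = "<RynnBrain-checkpoint>"

model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat-style request with one image and one instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scene.jpg"},  # placeholder image
            {"type": "text", "text": "Describe the spatial layout of the objects on the table."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```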
Check out the cookbooks that showcase RynnBrain's capabilities in cognition, localization, reasoning, and planning; a minimal localization-style example follows the table below.
| Category | Cookbook name | Description |
|---|---|---|
| Cognition | 1_spatial_understanding.ipynb | Demonstrates the model's spatial understanding in video scenes. |
| Cognition | 2_object_understanding.ipynb | Shows how the model understands object categories, attributes, and relations, as well as its counting ability. |
| Cognition | 3_ocr.ipynb | Examples of optical character recognition and text understanding in videos. |
| Location | 4_object_location.ipynb | Locates specific objects with bounding boxes in an image or video based on instructions. |
| Location | 5_area_location.ipynb | Identifies and marks specified regions by points in an image or video. |
| Location | 6_affordance_location.ipynb | Finds areas or objects with specific affordances in an image or video. |
| Location | 7_trajectory_location.ipynb | Infers and annotates trajectories or motion paths in an image or video. |
| Location | 8_grasp_pose.ipynb | Presents the model's ability to predict robotic grasp poses from images. |
| Reasoning | 9_thinking_with_time_space.ipynb | Explores an interleaved reasoning mechanism that alternates between textual reasoning and spatial grounding. |
| Planning | 10_manipulate_planning.ipynb | Performs multi-step task decomposition and action planning from goals and scenes. |
| Planning | 11_visual_language_navigation.ipynb | Combines vision and language instructions to perform navigation and path planning. |
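To make the localization entries above concrete, the sketch below reuses the `model` and `processor` from the quick-start example and sends a grounding-style instruction through the same chat interface. The prompt wording and the assumption that coordinates come back as text are guesses based on the cookbook descriptions, not a documented RynnBrain prompt format; refer to 4_object_location.ipynb and its neighbors for the exact prompts.

```python
# Reuses `model` and `processor` from the quick-start example above.
# The instruction below is a hypothetical grounding prompt; see the
# localization cookbooks for the prompts that are actually used.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/kitchen.jpg"},  # placeholder image
            {"type": "text", "text": "Locate the red mug on the counter and return its bounding box."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # expected to contain the box or point coordinates as text
```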
Pretraining & Evaluation
Please refer to RynnScale for details of pretraining and evaluation.
Finetuning
Reasoning: RynnBrain introduces an interleaved reasoning approach that alternates between textual reasoning and spatial grounding directly within egocentric video streams. This paradigm bridges the cognitive gap between language and the physical world, keeping the reasoning process firmly anchored in reality.
Navigation: We fine-tuned a vision-language navigation model from the RynnBrain base model. Empirical evaluation shows that fine-tuning on RynnBrain yields better performance than fine-tuning on other foundation models.
Planning: RynnBrain integrates location information for affordances, areas, and objects directly into its planning outputs. As a result, even highly intricate, fine-grained tasks can be handled effectively within our hierarchical RynnBrain-VLA system architecture.
We introduce RynnBrain-Bench, a benchmark for embodied understanding that evaluates models across four key dimensions: object cognition, spatial cognition, grounding, and pointing, with an emphasis on fine-grained understanding and spatio-temporal localization across episodic video sequences.
For details, please refer to RynnBrain-Bench.
RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li
RynnScale
RynnScale Team
RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
RynnVLA-002: A Unified Vision-Language-Action and World Model
Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
RynnRCP: Open Robotics Context Protocol and RobotMotion
RynnBot Team
RynnMotion: All-In-One Toolkit for Fast Robot Prototyping and Heterogeneous Teleoperation
RynnBot Team
Our RynnBrain is built on top of Qwen3-VL. We also learned a lot from the implementation of RynnEC and VideoRefer. If your work is used in RynnBrain but not mentioned in either this repo or the technical report, feel free to let us know :heart:.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.