# DreamWorld

DreamWorld combines EEG brain signals with world models to generate immersive video worlds from neural activity.
This project bridges two exciting research areas:

- **DreamDiffusion**: encoding EEG signals and aligning them with CLIP features to generate images from brain activity
- **World models**: recent advances in open-source world generation (e.g., NVIDIA's Cosmos-Predict2.5)

## Pipeline

1. **EEG to image**: use DreamDiffusion to generate an image from EEG data
2. **Image to caption**: apply BLIP-2 to create a text description of the generated image
3. **World generation**: feed the image and caption into a world model to generate a dynamic video world
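The three stages above can be sketched as a simple function composition. This is an illustrative outline only: every function below is a hypothetical stand-in for the real model (DreamDiffusion, BLIP-2, Cosmos-Predict2.5), not an actual API from any of those projects.

```python
# Hedged sketch of the EEG -> image -> caption -> video pipeline.
# All stage functions are placeholders for the real models.

GUIDANCE_PROMPT = "Dynamic camera movement sweeping through the scene with fluid motion."

def eeg_to_image(eeg_signal):
    """Stage 1 (DreamDiffusion stand-in): EEG encoder + diffusion decoder."""
    return {"kind": "image", "source": eeg_signal}

def image_to_caption(image):
    """Stage 2 (BLIP-2 stand-in): caption the generated image."""
    return f"a scene decoded from {image['source']}"

def image_and_caption_to_video(image, caption):
    """Stage 3 (world-model stand-in): condition on image + caption + guidance prompt."""
    prompt = f"{caption}. {GUIDANCE_PROMPT}"
    return {"kind": "video", "prompt": prompt, "init_image": image}

def run_pipeline(eeg_signal):
    image = eeg_to_image(eeg_signal)
    caption = image_to_caption(image)
    return image_and_caption_to_video(image, caption)

video = run_pipeline("subject-4/trial-12")
print(video["prompt"])
```

In the real pipeline, each stage hands off a concrete artifact (a PNG, a caption string, an MP4), which is what makes the stages easy to swap or inspect independently.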
## Results

### Example 1: success

In this example the EEG encoder captures the correct semantic information, and the world model generates a sensible video.
| Ground Truth | EEG -> Image | Image + prompt -> Video |
|---|---|---|
| ![]() | ![]() | test3-3.mp4 |
**BLIP-2 caption + guidance prompt:**

> a computer monitor with a picture of a city on it. Dynamic camera movement sweeping through the scene with fluid motion. Start with a smooth dolly forward, then transition into an energetic orbital pan that circles completely around the subject. The camera glides and flows continuously, capturing multiple angles. Bright, warm illumination bathes everything in golden light as the camera maintains constant, purposeful movement throughout the entire shot.

### Example 2: correct semantics, nightmarish video

In this example the EEG encoder captures the semantics of the EEG signal correctly, but the world model creates a nightmare instead of a dream.
| Ground Truth | EEG -> Image | Image + prompt -> Video |
|---|---|---|
| ![]() | ![]() | test8-5.6.mp4 |
**BLIP-2 caption + guidance prompt:**

> a group of chairs and tables outside a store. Dynamic camera movement sweeping through the scene with fluid motion. Start with a smooth dolly forward, then transition into an energetic orbital pan that circles completely around the subject. The camera glides and flows continuously, capturing multiple angles. Bright, warm illumination bathes everything in golden light as the camera maintains constant, purposeful movement throughout the entire shot.

### Example 3: encoder failure

In this example the EEG encoder fails and captures only the "silver color", possibly because the object is almost unrecognizable. The world model does a fantastic job, though.
| Ground Truth | EEG -> Image | Image + prompt -> Video |
|---|---|---|
| ![]() | ![]() | test7-3.4.mp4 |
**BLIP-2 caption + guidance prompt:**

> a pair of silver shoes sitting on a marble floor. Dynamic camera movement sweeping through the scene with fluid motion. Start with a smooth dolly forward, then transition into an energetic orbital pan that circles completely around the subject. The camera glides and flows continuously, capturing multiple angles. Bright, warm illumination bathes everything in golden light as the camera maintains constant, purposeful movement throughout the entire shot.

## Next Steps

A promising next step is to bypass the intermediate image-generation step and condition world generation directly on the EEG encoder embeddings, creating an end-to-end EEG-to-world pipeline.
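One plausible shape for this end-to-end idea is a learned adapter that projects EEG encoder embeddings into the world model's conditioning space, replacing the image + caption hand-off. The sketch below is purely illustrative: the dimensions, the adapter, and the training story are all assumptions, not anything implemented in this repository.

```python
# Hedged sketch: map an EEG embedding directly to a world-model
# conditioning vector via a linear adapter, skipping image generation
# and captioning. Sizes are toy values; a real adapter would be trained.
import random

EEG_DIM, COND_DIM = 16, 8  # illustrative; real models use far larger dims

# A trained adapter would come from fine-tuning; random here for illustration.
adapter = [[random.gauss(0.0, 0.1) for _ in range(EEG_DIM)] for _ in range(COND_DIM)]

def eeg_to_conditioning(eeg_embedding):
    """Project an EEG encoder embedding into the conditioning space."""
    return [sum(w * x for w, x in zip(row, eeg_embedding)) for row in adapter]

eeg_embedding = [random.random() for _ in range(EEG_DIM)]
cond = eeg_to_conditioning(eeg_embedding)
print(len(cond))  # conditioning vector of size COND_DIM
```

The appeal of this design is that it removes two lossy compression steps (pixels, then text) between the brain signal and the generated world.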
In a distant future where EEG signals can be picked up more easily, this demo hints at where we might be headed: creative professionals such as filmmakers, architects, and designers may no longer need to compress their imagination into low-dimensional data such as text or speech.
## Requirements

- NVIDIA GPU with Ampere architecture (RTX 30 series, A100) or newer
- NVIDIA driver >= 570.124.06, compatible with CUDA 12.8.1
- Linux x86-64
- glibc >= 2.31 (e.g., Ubuntu >= 22.04)
- Python 3.10
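A quick way to sanity-check the requirements above before installing (a convenience sketch, not part of the repository; the driver query relies on `nvidia-smi` being on `PATH`):

```python
# Report Python/OS/arch and, if available, the NVIDIA driver version,
# so you can compare against the requirements list above.
import platform
import shutil
import subprocess
import sys

report = {
    "python": f"{sys.version_info.major}.{sys.version_info.minor}",
    "os": platform.system(),
    "arch": platform.machine(),
}
print(report)

if report["python"] != "3.10":
    print("warning: this project targets Python 3.10")
if (report["os"], report["arch"]) != ("Linux", "x86_64"):
    print("warning: this project targets Linux x86-64")

if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print("driver:", out.stdout.strip())
else:
    print("warning: nvidia-smi not found; cannot verify driver >= 570.124.06")
```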
## Installation

Clone the repository:

```shell
git clone git@github.com:nvidia-cosmos/cosmos-predict2.5.git
cd cosmos-predict2.5
```

Install system dependencies:

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Install the package into a new environment:

```shell
uv sync
source .venv/bin/activate
```

Or, install the package into the active environment (e.g., conda):

```shell
uv sync --active --inexact
```

## Checkpoints

- Get a Hugging Face access token with `Read` permission.
- Install the Hugging Face CLI:

  ```shell
  uv tool install -U "huggingface_hub[cli]"
  ```

- Log in:

  ```shell
  hf auth login
  ```

- Accept the NVIDIA Open Model License Agreement.
Checkpoints for cosmos-predict2.5 are downloaded automatically during inference and post-training. To change the checkpoint cache location, set the `HF_HOME` environment variable.
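For example, to keep the cache on a larger data volume (the path below is a placeholder; use whatever location suits your machine):

```shell
# Cache Hugging Face checkpoints under /data instead of ~/.cache/huggingface
export HF_HOME=/data/hf-cache
```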
## Data and Pretrained Models

The `datasets` and `pretrains` folders for DreamDiffusion are not included in this repository. Please download the EEG data from eeg and place it in the root directory of this repository as shown below. We also provide a copy of the ImageNet subset, which may be used for ImageNet evaluation.

The fine-tuned DreamDiffusion checkpoint: ckpt
```
DreamDiffusion/pretrains
├── models
│   ├── config15.yaml
│   └── checkpoint.pth  (pre-trained EEG encoder)

DreamDiffusion/datasets
├── imageNet_images  (subset of ImageNet)
├── block_splits_by_image_all.pth
├── block_splits_by_image_single.pth
└── eeg_5_95_std.pth

DreamDiffusion/code
├── sc_mbm
│   ├── mae_for_eeg.py
│   ├── trainer.py
│   └── utils.py
├── dc_ldm
│   ├── ldm_for_eeg.py
│   ├── utils.py
│   ├── models
│   │   └── (adopted from LDM)
│   └── modules
│       └── (adopted from LDM)
├── stageA1_eeg_pretrain.py  (main script for EEG pre-training)
├── eeg_ldm.py  (main script for fine-tuning Stable Diffusion)
├── gen_eval_eeg.py  (main script for generating images)
├── dataset.py  (functions for loading datasets)
├── eval_metrics.py  (functions for evaluation metrics)
└── config.py  (configurations for the main scripts)
```
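A small convenience check (not part of the repository) that the downloaded data and checkpoints sit where the layout above expects, run from the repository root:

```python
# Verify the expected DreamDiffusion data/checkpoint layout.
from pathlib import Path

expected = [
    "DreamDiffusion/pretrains/models/config15.yaml",
    "DreamDiffusion/pretrains/models/checkpoint.pth",
    "DreamDiffusion/datasets/block_splits_by_image_all.pth",
    "DreamDiffusion/datasets/block_splits_by_image_single.pth",
    "DreamDiffusion/datasets/eeg_5_95_std.pth",
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("missing files:")
    for p in missing:
        print(f"  {p}")
else:
    print("all expected files found")
```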
## Acknowledgements

This project builds upon several open-source works:
- DreamDiffusion for EEG-to-image generation
- NVIDIA Cosmos for world model generation
- BLIP-2 for image captioning
## Citations

```bibtex
@article{bai2023dreamdiffusion,
  title={DreamDiffusion: Generating High-Quality Images from Brain EEG Signals},
  author={Bai, Yunpeng and Wang, Xintao and Cao, Yanpei and Ge, Yixiao and Yuan, Chun and Shan, Ying},
  journal={arXiv preprint arXiv:2306.16934},
  year={2023}
}
```

```bibtex
@misc{nvidia2025cosmosworldfoundationmodel,
  title={Cosmos World Foundation Model Platform for Physical AI},
  author={NVIDIA and : and Niket Agarwal and Arslan Ali and Maciej Bala and Yogesh Balaji and Erik Barker and Tiffany Cai and Prithvijit Chattopadhyay and Yongxin Chen and Yin Cui and Yifan Ding and Daniel Dworakowski and Jiaojiao Fan and Michele Fenzi and Francesco Ferroni and Sanja Fidler and Dieter Fox and Songwei Ge and Yunhao Ge and Jinwei Gu and Siddharth Gururani and Ethan He and Jiahui Huang and Jacob Huffman and Pooya Jannaty and Jingyi Jin and Seung Wook Kim and Gergely Klár and Grace Lam and Shiyi Lan and Laura Leal-Taixe and Anqi Li and Zhaoshuo Li and Chen-Hsuan Lin and Tsung-Yi Lin and Huan Ling and Ming-Yu Liu and Xian Liu and Alice Luo and Qianli Ma and Hanzi Mao and Kaichun Mo and Arsalan Mousavian and Seungjun Nah and Sriharsha Niverty and David Page and Despoina Paschalidou and Zeeshan Patel and Lindsey Pavao and Morteza Ramezanali and Fitsum Reda and Xiaowei Ren and Vasanth Rao Naik Sabavat and Ed Schmerling and Stella Shi and Bartosz Stefaniak and Shitao Tang and Lyne Tchapmi and Przemek Tredak and Wei-Cheng Tseng and Jibin Varghese and Hao Wang and Haoxiang Wang and Heng Wang and Ting-Chun Wang and Fangyin Wei and Xinyue Wei and Jay Zhangjie Wu and Jiashu Xu and Wei Yang and Lin Yen-Chen and Xiaohui Zeng and Yu Zeng and Jing Zhang and Qinsheng Zhang and Yuxuan Zhang and Qingqing Zhao and Artur Zolkowski},
  year={2025},
  eprint={2501.03575},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.03575},
}
```

```bibtex
@misc{li2023blip2bootstrappinglanguageimagepretraining,
  title={BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models},
  author={Junnan Li and Dongxu Li and Silvio Savarese and Steven Hoi},
  year={2023},
  eprint={2301.12597},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2301.12597},
}
```




