
Conversation

fracapuano (Contributor)

Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content in https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.

Here is an example of a complete PR: #2382

Getting a Review

Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.

Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.

@fracapuano fracapuano requested a review from lhoestq September 18, 2025 14:09
@Vaibhavs10 Vaibhavs10 (Member) left a comment

Didn't complete the review, but please do run a spell check on this, and overall convert the tense from present continuous to present.

We can make this a bit simpler and just add it as part of the original v3 blogpost.

- user: aractingi
---

**TL;DR** We introduce streaming mode for `LeRobotDataset`, allowing users to iterate over massive robotics datasets without ever downloading them. `StreamingLeRobotDataset` is a new dataset class, fully integrated with `lerobot`, that combines fast random sampling with on-the-fly video decoding to deliver high throughput with a small memory footprint. We also add native support for time-window queries via `delta_timestamps`, powered by a custom backtrackable iterator that steps both backward and forward efficiently. All datasets currently released in `LeRobotDataset:v3.0` can be used in streaming mode, simply by instantiating `StreamingLeRobotDataset`.
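The backtrackable iterator mentioned above can be sketched in plain Python: wrap any iterator, keep a bounded history of consumed items, and replay from that history when stepping backward. This is an illustration of the idea under a simplified, hypothetical interface (`forward`/`backward` are our names), not lerobot's actual implementation:

```python
from collections import deque


class BacktrackableIterator:
    """Sketch of a backtrackable iterator: wraps any iterator and keeps
    the last `maxlen` consumed items so callers can step backward as
    well as forward without re-reading the source."""

    def __init__(self, iterable, maxlen=16):
        self._it = iter(iterable)
        self._history = deque(maxlen=maxlen)  # items already consumed
        self._pos = 0  # how many steps back from the frontier we are

    def forward(self):
        if self._pos > 0:
            # We previously stepped back: replay from history instead
            # of consuming (and e.g. re-decoding) the source.
            self._pos -= 1
            return self._history[len(self._history) - 1 - self._pos]
        item = next(self._it)
        self._history.append(item)
        return item

    def backward(self):
        if self._pos + 1 >= len(self._history):
            raise IndexError("history exhausted")
        self._pos += 1
        return self._history[len(self._history) - 1 - self._pos]
```

A time-window query in the spirit of `delta_timestamps` can then be served by stepping back over recently seen frames instead of decoding them again.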
Member left a comment

this could be a list instead; a large chunk of paragraphs can be distracting (especially as a TL;DR)

## Installing `lerobot`

[`lerobot`](https://github.com/huggingface/lerobot) is the end-to-end robotics library developed at Hugging Face, supporting real-world robotics as well as state-of-the-art robot learning algorithms.
The library allows you to record datasets locally, directly on real-world robots, and to store datasets on the Hugging Face Hub.
Member left a comment

Suggested change
The library allows to record datasets locally directly on real-world robots, and to store datasets on the Hugging Face Hub.
The library allows you to record datasets locally directly on real-world robots, and to store datasets on the Hugging Face Hub.


You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here 🤗](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending).
Member left a comment

Suggested change
You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here 🤗](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending).
You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending).


We [recently introduced](https://huggingface.co/blog/lerobot-datasets-v3) a new dataset format enabling streaming mode. Both functionalities will ship with `lerobot-v0.4.0`, and you can access them right now by building the library from source! You can find the installation instructions for lerobot [here](https://huggingface.co/docs/lerobot/en/installation).
Member left a comment

It'd be better if you put this all in one blogpost v3: https://huggingface.co/blog/lerobot-datasets-v3

## Why Streaming Datasets

Training robot learning algorithms using large-scale robotics datasets can mean having to process terabytes of multi-modal data.
For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main), containing 130K+ episodes and a total of 26M+ frames, takes up 4 TB of space: a disk and memory requirement that is simply unattainable for most institutions.
Member left a comment

Suggested change
For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main), containing 130K+ episodes amounting to a total of 26M+ frames results in 4TB of space: a disk and memory requirement which is simply unattainable for most institutions.
For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main), contains 130K+ episodes amounting to a total of 26M+ frames results in 4TB of space: a disk and memory requirement which is simply unattainable for most institutions.

- On-the-fly video decoding using the [`torchcodec`](https://docs.pytorch.org/torchcodec/stable/generated_examples/decoding/file_like.html) library


These two factors allow us to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory.
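In plain Python, the pattern is a generator that materializes one frame per `next()` call, so at no point does the full dataset sit in memory. Here `frame_refs` and `decode_frame` are hypothetical stand-ins for, say, byte ranges in remote video files and a torchcodec-backed decoder:

```python
def stream_frames(frame_refs, decode_frame):
    """Yield frames one at a time; nothing is held beyond the current
    item. `frame_refs` and `decode_frame` are illustrative stand-ins,
    not lerobot APIs."""
    for ref in frame_refs:
        yield decode_frame(ref)


# Toy usage: "decoding" is just squaring, to show the iteration pattern.
frames = stream_frames(range(5), decode_frame=lambda r: r * r)
first = next(frames)  # frames are produced on demand, one call at a time
rest = list(frames)
```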
Member left a comment

Suggested change
These two factors allow to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory.
These two factors allow us to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory.


Indeed, we can measure the randomness of the streaming procedure via the correlation coefficient between the streamed `index` (the frame's position in the dataset) and the `iteration_index` (the order in which frames are served): high randomness corresponds to a low (absolute) correlation coefficient, while low randomness results in a high correlation, either positive or negative.
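As a concrete, self-contained illustration (using stand-in index sequences, not lerobot code), the Pearson correlation between the storage index and the iteration index is 1 for an unshuffled stream and close to 0 for a well-shuffled one:

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


n = 10_000
iteration_index = list(range(n))  # order in which frames were served

in_order = list(range(n))         # storage indices of an unshuffled stream
shuffled = list(range(n))         # storage indices of a shuffled stream
random.Random(0).shuffle(shuffled)

r_in_order = pearson(iteration_index, in_order)  # ~1.0: no randomness
r_shuffled = pearson(iteration_index, shuffled)  # ~0.0: well shuffled
```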
Member left a comment

maybe flesh this out a bit more; this would be incomprehensible to someone who isn't as initiated in robotics


Low randomness when streaming frames is particularly problematic when datasets are processed for training purposes.
In that context, items typically need to be shuffled so as to mitigate the inherent inter-dependency between successive frames recorded via demonstrations.
Similar to the 🤗 `datasets` library, we solve this issue by maintaining a buffer of frames in memory, typically much smaller than the original dataset (thousands of frames versus hundreds of millions or billions).
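The buffer idea can be sketched as: fill a fixed-size buffer from the stream, then on each draw emit a random slot and refill it with the next incoming item. This is a minimal sketch of the technique, not the actual `datasets` or `lerobot` internals:

```python
import random


def shuffle_buffer(stream, buffer_size, seed=0):
    """Approximately shuffle an iterator with O(buffer_size) memory."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)  # initialization phase: fill the buffer first
            continue
        i = rng.randrange(buffer_size)
        yield buf[i]          # emit a random slot ...
        buf[i] = item         # ... and replace it with the fresh item
    rng.shuffle(buf)          # drain whatever is left at the end
    yield from buf


out = list(shuffle_buffer(range(100), buffer_size=10))
```

Note the initialization phase: no item is emitted until the buffer is full, which is also why streaming introduces a start-up overhead.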
Member left a comment

Suggested change
Similarily to the `datasets 🤗` library, we solve this issue maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs).
Similar to the `datasets 🤗` library, we solve this issue maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs).


![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobotdataset-v3/random-from-iterable.png)

Because the `.next()` call for the dataset is now stacked on top of a process that fills an intermediate buffer, an initialization overhead is introduced while the buffer fills.
Member left a comment

Suggested change
Because the `.next()` call for the dataset is now stacked on top of a process to fill in an intermediate buffer, an initialization overhead is introduced, to allow the buffer to be filled.
Since the `.next()` call for the dataset is now stacked on top of a process to fill in an intermediate buffer, an initialization overhead is introduced, to allow the buffer to be filled.

While we expect our randomness measurements to be robust across deployment scenarios, sample throughput will likely vary with connection speed.

## Starting simple: Streaming Single Frames
Member left a comment

for both variants of streaming, it'd be better to explain them visually a bit as well


Contributor left a comment

We updated L2D yesterday with our next release, R3, of 100K episodes in the dataset v3 format. A rough size estimate is 20M × 6 frames (6 cameras) and 4.8 TB. R3 works with StreamingLeRobotDataset. Shall we add a usage example here? @fracapuano

https://huggingface.co/datasets/yaak-ai/L2D

Contributor (author) left a comment

Actually, this is going to be merged with the other datasets blogpost, where we are already mentioning you guys, so I guess we should be fine :)) Wdyt?

Contributor left a comment

Ah righto. Yes ofc :))
