Use torchcodec for loading #3964

Status: Open. Wants to merge 32 commits into base: main.

Commits (32):
cd3d440
Use torchcodec for loading
Jul 8, 2025
74135c8
Add torchcodec to CI installer
Jul 9, 2025
a4576a7
Use torchcodec in examples and integration tests too
Jul 9, 2025
62c7fe6
Test torchcodec installation
NicolasHug Jul 10, 2025
e7b9da6
empty
NicolasHug Jul 10, 2025
ae9baff
dont even build audio
NicolasHug Jul 10, 2025
758ff52
Try ffmpeg 4.4.2
NicolasHug Jul 10, 2025
f7a2654
force ffmpeg<5
NicolasHug Jul 10, 2025
e929d65
UGH
NicolasHug Jul 10, 2025
b95e3c8
Put back building torchaudio
NicolasHug Jul 10, 2025
a1c086f
Put back rest of dependencies, and run tests
NicolasHug Jul 10, 2025
c3690ff
Merge branch 'installation' into codec_use
Jul 10, 2025
6ec7718
Ignore tests with ffmpeg bugs
Jul 10, 2025
1255bd1
Move pytest import
Jul 10, 2025
9e0e89a
Load torchcodec lazily
Jul 11, 2025
ea37fcd
Remove hack
Jul 11, 2025
01dda4a
Skip ffmpeg failing tests
Jul 11, 2025
1194ff8
Move failing test ids file to same directory
Jul 11, 2025
3ef7c55
Add torchcodec to some requirements
Jul 11, 2025
02d11af
Try requirements index url option
Jul 11, 2025
f853397
Add more ffmpeg failing tests
Jul 11, 2025
86c40b8
Install torchcodec at same time as torch for docs
Jul 11, 2025
78bbf70
Add options from old loader
Jul 11, 2025
1c38f95
Give installation error message if torchcodec not installed
Jul 11, 2025
98fbd03
Remove hide_seek wrapping for torchcodec
Jul 11, 2025
b2b5f40
Wrap boto3 response in bytesio
Jul 11, 2025
f3a1f82
Use torchcodec url streaming
Jul 12, 2025
9a00ebb
Use urls for load_codec
Jul 12, 2025
4e83a7a
Allow keyword arguments to load_torchcodec
Jul 12, 2025
380eaa7
Remove frame_offset arguments from load_torchcodec
Jul 13, 2025
500ad06
Fix typo
Jul 13, 2025
7cf43b3
Remove use of num_frames for load_torchcodec
Jul 14, 2025
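
Two of the commits above ("Load torchcodec lazily" and "Give installation error message if torchcodec not installed") describe a common pattern: defer the optional import until the loader is actually called, and turn a missing package into an actionable message. The diff for that change is not shown on this page, so the following is only a minimal sketch of the pattern. `lazy_import` and `load_with_optional_backend` are hypothetical names, not functions from this PR.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, install_hint: str) -> ModuleType:
    """Import ``module_name`` on demand; raise a readable error if it is missing.

    Hypothetical helper for illustration only, not the code added in this PR.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is required for this operation but is not installed. "
            f"Try: {install_hint}"
        ) from err


def load_with_optional_backend(path: str, backend: str = "torchcodec") -> str:
    # The import happens at call time, not at module import time, so importing
    # this file does not fail just because the optional backend is absent.
    module = lazy_import(backend, f"pip install {backend}")
    return f"would decode {path} with {module.__name__}"
```

With this shape, users who never call the loader are unaffected by a missing dependency, and users who do call it see an install hint instead of a bare `ModuleNotFoundError`.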
7 changes: 5 additions & 2 deletions .github/scripts/unittest-linux/install.sh
@@ -74,7 +74,7 @@ case $GPU_ARCH_TYPE in
;;
esac
PYTORCH_WHEEL_INDEX="https://download.pytorch.org/whl/${UPLOAD_CHANNEL}/${GPU_ARCH_ID}"
- pip install --progress-bar=off --pre torch --index-url="${PYTORCH_WHEEL_INDEX}"
+ pip install --progress-bar=off --pre torch torchcodec --index-url="${PYTORCH_WHEEL_INDEX}"


# 2. Install torchaudio
@@ -85,6 +85,9 @@ export BUILD_CPP_TEST=1
python setup.py install

# 3. Install Test tools
+ conda install -y "ffmpeg<5"
+ python -c "import torch; import torchaudio; import torchcodec; print(torch.__version__, torchaudio.__version__, torchcodec.__version__)"

printf "* Installing test tools\n"
NUMBA_DEV_CHANNEL=""
if [[ "$(python --version)" = *3.9* || "$(python --version)" = *3.10* ]]; then
@@ -94,7 +97,7 @@ if [[ "$(python --version)" = *3.9* || "$(python --version)" = *3.10* ]]; then
fi
(
set -x
- conda install -y -c conda-forge ${NUMBA_DEV_CHANNEL} sox libvorbis parameterized 'requests>=2.20' 'ffmpeg>=6,<7'
+ conda install -y -c conda-forge ${NUMBA_DEV_CHANNEL} sox libvorbis parameterized 'requests>=2.20'
pip install kaldi-io SoundFile librosa coverage pytest pytest-cov scipy expecttest unidecode inflect Pillow sentencepiece pytorch-lightning 'protobuf<4.21.0' demucs tinytag pyroomacoustics flashlight-text git+https://github.com/kpu/kenlm

# TODO: might be better to fix the single call to `pip install` above
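
The `python -c` line added in this file acts as a smoke test: the CI step fails early if any of the three packages cannot be imported, and prints their versions otherwise. A lighter variant that may help when debugging an environment locally is sketched below. It only reads installed distribution metadata through the standard-library `importlib.metadata`, so unlike the CI line it does not prove that a package (or its FFmpeg linkage) actually loads. `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper: it inspects package metadata only and never imports
    the packages themselves.
    """
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "torchaudio", "torchcodec"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```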
2 changes: 1 addition & 1 deletion .github/workflows/build_docs.yml
@@ -68,7 +68,7 @@ jobs:

GPU_ARCH_ID=cu126 # This is hard-coded and must be consistent with gpu-arch-version.
PYTORCH_WHEEL_INDEX="https://download.pytorch.org/whl/${CHANNEL}/${GPU_ARCH_ID}"
- pip install --progress-bar=off --pre torch --index-url="${PYTORCH_WHEEL_INDEX}"
+ pip install --progress-bar=off --pre torch torchcodec --index-url="${PYTORCH_WHEEL_INDEX}"

echo "::endgroup::"
echo "::group::Install TorchAudio"
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -1,6 +1,7 @@
Jinja2<3.1.0
matplotlib<=3.8
pyparsing<3,>=2.0.2
+ torchcodec

# C++ docs
breathe==4.34.0
4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -182,7 +182,7 @@ Tutorials

.. customcarditem::
:header: Loading waveform Tensors from files and saving them
- :card_description: Learn how to query/load audio files and save waveform tensors to files, using <code>torchaudio.info</code>, <code>torchaudio.load</code> and <code>torchaudio.save</code> functions.
+ :card_description: Learn how to query/load audio files and save waveform tensors to files, using <code>torchaudio.info</code>, <code>torchaudio.utils.load_torchcodec</code> and <code>torchaudio.save</code> functions.
:image: https://download.pytorch.org/torchaudio/tutorial-assets/thumbnails/audio_io_tutorial.png
:link: tutorials/audio_io_tutorial.html
:tags: I/O
@@ -399,7 +399,7 @@ In BibTeX format:
.. code-block:: bibtex

@misc{hwang2023torchaudio,
- title={TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch},
+ title={TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch},
author={Jeff Hwang and Moto Hira and Caroline Chen and Xiaohui Zhang and Zhaoheng Ni and Guangzhi Sun and Pingchuan Ma and Ruizhe Huang and Vineel Pratap and Yuekai Zhang and Anurag Kumar and Chin-Yun Yu and Chuang Zhu and Chunxi Liu and Jacob Kahn and Mirco Ravanelli and Peng Sun and Shinji Watanabe and Yangyang Shi and Yumeng Tao and Robin Scheibler and Samuele Cornell and Sean Kim and Stavros Petridis},
year={2023},
eprint={2310.17864},
7 changes: 4 additions & 3 deletions examples/asr/emformer_rnnt/mustc/dataset.py
@@ -4,6 +4,7 @@
import torch
import torchaudio
import yaml
+ from torchaudio.utils import load_torchcodec


FOLDER_IN_ARCHIVE = "en-de"
@@ -31,15 +32,15 @@ def __init__(
self.idx_target_lengths = []
self.wav_list = []
for idx, item in enumerate(file_list):
- offset = int(item["offset"] * SAMPLE_RATE)
- duration = int(item["duration"] * SAMPLE_RATE)
+ offset = item["offset"]
+ duration = item["duration"]
self.idx_target_lengths.append((idx, item["duration"]))
file_path = wav_dir / item["wav"]
self.wav_list.append((file_path, offset, duration))

def _get_mustc_item(self, idx):
file_path, offset, duration = self.wav_list[idx]
- waveform, sr = torchaudio.load(file_path, frame_offset=offset, num_frames=duration)
+ waveform, sr = load_torchcodec(file_path, start_seconds=offset, stop_seconds=offset + duration)
assert sr == SAMPLE_RATE
transcript = self.trans_list[idx].replace("\n", "")
return (waveform, transcript)
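
The change to this file replaces a frame-based window (`frame_offset`, `num_frames`) with a time-based one (`start_seconds`, `stop_seconds`). The old code multiplied `item["offset"]` and `item["duration"]` by `SAMPLE_RATE`, which implies the manifest already stores both in seconds, so the new code passes them through unchanged. Callers that still hold frame counts would need to convert them first. A minimal sketch of that arithmetic follows; `frames_to_seconds` is a hypothetical helper, and mapping `num_frames=-1` to `stop_seconds=None` is an assumption of this sketch rather than documented `load_torchcodec` behavior.

```python
from typing import Optional, Tuple


def frames_to_seconds(
    frame_offset: int, num_frames: int, sample_rate: int
) -> Tuple[float, Optional[float]]:
    """Convert a (frame_offset, num_frames) window to (start_seconds, stop_seconds).

    Hypothetical helper. A negative num_frames mirrors the old "read to the
    end" convention and is mapped to stop_seconds=None (sketch assumption).
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    start_seconds = frame_offset / sample_rate
    if num_frames < 0:
        return start_seconds, None
    # The stop time is the end of the window, not its length.
    stop_seconds = (frame_offset + num_frames) / sample_rate
    return start_seconds, stop_seconds
```

For example, `frames_to_seconds(16000, 32000, 16000)` gives `(1.0, 3.0)`: the second value is the absolute stop time, which matches the `stop_seconds=offset + duration` expression in the diff.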
4 changes: 2 additions & 2 deletions examples/avsr/data_prep/data/data_module.py
@@ -7,7 +7,7 @@
import torch
import torchaudio
import torchvision

+ from torchaudio.utils import load_torchcodec

class AVSRDataLoader:
def __init__(self, modality, detector="retinaface", resize=None):
@@ -39,7 +39,7 @@ def load_data(self, data_filename, transform=True):
return video

def load_audio(self, data_filename):
- waveform, sample_rate = torchaudio.load(data_filename, normalize=True)
+ waveform, sample_rate = load_torchcodec(data_filename, normalize=True)
return waveform, sample_rate

def load_video(self, data_filename):
3 changes: 2 additions & 1 deletion examples/avsr/lrs3.py
@@ -3,6 +3,7 @@
import torchaudio
import torchvision
from torch.utils.data import Dataset
+ from torchaudio.utils import load_torchcodec


def _load_list(args, *filenames):
@@ -31,7 +32,7 @@ def load_audio(path):
"""
rtype: torch, T x 1
"""
- waveform, sample_rate = torchaudio.load(path, normalize=True)
+ waveform, sample_rate = load_torchcodec(path, normalize=True)
return waveform.transpose(1, 0)


7 changes: 4 additions & 3 deletions examples/dnn_beamformer/datamodule.py
@@ -8,6 +8,7 @@
from torch import Tensor
from torch.utils.data import Dataset
from utils import CollateFnL3DAS22
+ from torchaudio.utils import load_torchcodec

_PREFIX = "L3DAS22_Task1_"
_SUBSETS = {
@@ -46,10 +47,10 @@ def __getitem__(self, n: int) -> Tuple[Tensor, Tensor, int, str]:
noisy_path_B = str(noisy_path_A).replace("_A.wav", "_B.wav")
clean_path = noisy_path_A.parent.parent / "labels" / noisy_path_A.name.replace("_A.wav", ".wav")
transcript_path = str(clean_path).replace("wav", "txt")
- waveform_noisy_A, sample_rate1 = torchaudio.load(noisy_path_A)
- waveform_noisy_B, sample_rate2 = torchaudio.load(noisy_path_B)
+ waveform_noisy_A, sample_rate1 = load_torchcodec(noisy_path_A)
+ waveform_noisy_B, sample_rate2 = load_torchcodec(noisy_path_B)
waveform_noisy = torch.cat((waveform_noisy_A, waveform_noisy_B), dim=0)
- waveform_clean, sample_rate3 = torchaudio.load(clean_path)
+ waveform_clean, sample_rate3 = load_torchcodec(clean_path)
assert sample_rate1 == _SAMPLE_RATE and sample_rate2 == _SAMPLE_RATE and sample_rate3 == _SAMPLE_RATE
with open(transcript_path, "r") as f:
transcript = f.readline()
5 changes: 4 additions & 1 deletion examples/hubert/dataset/hubert_dataset.py
@@ -12,6 +12,9 @@
from torch import Tensor
from torch.utils.data import BatchSampler, Dataset, DistributedSampler

+ from torchaudio.utils import load_torchcodec


sys.path.append("..")
from utils import _get_label2id

@@ -299,7 +302,7 @@ def _load_audio(self, index: int) -> Tensor:
(Tensor): The corresponding waveform Tensor.
"""
wav_path = self.f_list[index]
- waveform, sample_rate = torchaudio.load(wav_path)
+ waveform, sample_rate = load_torchcodec(wav_path)
assert waveform.shape[1] == self.len_list[index]
return waveform

5 changes: 3 additions & 2 deletions examples/hubert/utils/feature_utils.py
@@ -13,6 +13,7 @@
from torch.nn import Module

from .common_utils import _get_feat_lens_paths
+ from torchaudio.utils import load_torchcodec

_LG = logging.getLogger(__name__)
_DEFAULT_DEVICE = torch.device("cpu")
@@ -53,7 +54,7 @@ def extract_feature_mfcc(
Returns:
Tensor: The desired feature tensor of the given audio file.
"""
- waveform, sr = torchaudio.load(path)
+ waveform, sr = load_torchcodec(path)
assert sr == sample_rate
feature_extractor = torchaudio.transforms.MFCC(
sample_rate=sample_rate, n_mfcc=13, melkwargs={"n_fft": 400, "hop_length": 160, "center": False}
@@ -88,7 +89,7 @@ def extract_feature_hubert(
Returns:
Tensor: The desired feature tensor of the given audio file.
"""
- waveform, sr = torchaudio.load(path)
+ waveform, sr = load_torchcodec(path)
assert sr == sample_rate
waveform = waveform.to(device)
with torch.inference_mode():
@@ -7,7 +7,7 @@

import torch
import torchaudio

+ from torchaudio.utils import load_torchcodec

class Pipeline(torch.nn.Module):
"""Example audio process pipeline.
@@ -17,15 +17,15 @@ class Pipeline(torch.nn.Module):

def __init__(self, rir_path: str):
super().__init__()
- rir, sample_rate = torchaudio.load(rir_path)
+ rir, sample_rate = load_torchcodec(rir_path)
self.register_buffer("rir", rir)
self.rir_sample_rate: int = sample_rate

def forward(self, input_path: str, output_path: str):
torchaudio.sox_effects.init_sox_effects()

# 1. load audio
- waveform, sample_rate = torchaudio.load(input_path)
+ waveform, sample_rate = load_torchcodec(input_path)

# 2. Add background noise
alpha = 0.01
@@ -14,6 +14,7 @@
from greedy_decoder import Decoder
from torch.utils.mobile_optimizer import optimize_for_mobile
from torchaudio.models.wav2vec2.utils.import_fairseq import import_fairseq_model
+ from torchaudio.utils import load_torchcodec

TORCH_VERSION: Tuple[int, ...] = tuple(int(x) for x in torch.__version__.split(".")[:2])
if TORCH_VERSION >= (1, 10):
@@ -58,7 +59,7 @@ def _parse_args():

class Loader(torch.nn.Module):
def forward(self, audio_path: str) -> torch.Tensor:
- waveform, sample_rate = torchaudio.load(audio_path)
+ waveform, sample_rate = load_torchcodec(audio_path)
if sample_rate != 16000:
waveform = torchaudio.functional.resample(waveform, float(sample_rate), 16000.0)
return waveform
@@ -8,6 +8,7 @@
import torchaudio
from greedy_decoder import Decoder
from torchaudio.models.wav2vec2.utils.import_huggingface import import_huggingface_model
+ from torchaudio.utils import load_torchcodec

TORCH_VERSION: Tuple[int, ...] = tuple(int(x) for x in torch.__version__.split(".")[:2])
if TORCH_VERSION >= (1, 10):
@@ -49,7 +50,7 @@ def _parse_args():

class Loader(torch.nn.Module):
def forward(self, audio_path: str) -> torch.Tensor:
- waveform, sample_rate = torchaudio.load(audio_path)
+ waveform, sample_rate = load_torchcodec(audio_path)
if sample_rate != 16000:
waveform = torchaudio.functional.resample(waveform, float(sample_rate), 16000.0)
return waveform
3 changes: 2 additions & 1 deletion examples/self_supervised_learning/data_modules/_utils.py
@@ -8,6 +8,7 @@
import torchaudio
from torch import Tensor
from torch.utils.data import BatchSampler, Dataset, DistributedSampler
+ from torchaudio.utils import load_torchcodec

from ..lightning_modules import Batch

@@ -295,7 +296,7 @@ def _load_audio(self, index: int) -> Tensor:
(Tensor): The corresponding waveform Tensor.
"""
wav_path = self.f_list[index]
- waveform, sample_rate = torchaudio.load(wav_path)
+ waveform, sample_rate = load_torchcodec(wav_path)
assert waveform.shape[1] == self.len_list[index]
return waveform

3 changes: 2 additions & 1 deletion examples/source_separation/utils/dataset/wsj0mix.py
@@ -4,6 +4,7 @@
import torch
import torchaudio
from torch.utils.data import Dataset
+ from torchaudio.utils import load_torchcodec

SampleType = Tuple[int, torch.Tensor, List[torch.Tensor]]

@@ -37,7 +38,7 @@ def __init__(
self.files.sort()

def _load_audio(self, path) -> torch.Tensor:
- waveform, sample_rate = torchaudio.load(path)
+ waveform, sample_rate = load_torchcodec(path)
if sample_rate != self.sample_rate:
raise ValueError(
f"The dataset contains audio file of sample rate {sample_rate}, "
@@ -65,6 +65,7 @@
import matplotlib.pyplot as plt
from torchaudio.models.decoder import ctc_decoder
from torchaudio.utils import download_asset
+ from torchaudio.utils import load_torchcodec

######################################################################
#
@@ -98,7 +99,7 @@
# i really was very much afraid of showing him how much shocked i was at some parts of what he said
#

- waveform, sample_rate = torchaudio.load(speech_file)
+ waveform, sample_rate = load_torchcodec(speech_file)

if sample_rate != bundle.sample_rate:
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
@@ -54,6 +54,7 @@

import torch
import torchaudio
+ from torchaudio.utils import load_torchcodec

print(torch.__version__)
print(torchaudio.__version__)
@@ -96,7 +97,7 @@ def download_asset_external(url, key):
#

speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
- waveform, sample_rate = torchaudio.load(speech_file)
+ waveform, sample_rate = load_torchcodec(speech_file)
assert sample_rate == 16000
IPython.display.Audio(speech_file)

17 changes: 9 additions & 8 deletions examples/tutorials/audio_data_augmentation_tutorial.py
@@ -15,6 +15,7 @@

import torch
import torchaudio
+ from torchaudio.utils import load_torchcodec
import torchaudio.functional as F

print(torch.__version__)
@@ -52,7 +53,7 @@
#

# Load the data
- waveform1, sample_rate = torchaudio.load(SAMPLE_WAV, channels_first=False)
+ waveform1, sample_rate = load_torchcodec(SAMPLE_WAV, channels_first=False)

# Define effects
effect = ",".join(
@@ -159,7 +160,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
# and clap your hands.
#

- rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)
+ rir_raw, sample_rate = load_torchcodec(SAMPLE_RIR)
plot_waveform(rir_raw, sample_rate, title="Room Impulse Response (raw)")
plot_specgram(rir_raw, sample_rate, title="Room Impulse Response (raw)")
Audio(rir_raw, rate=sample_rate)
@@ -179,7 +180,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
# we convolve the speech signal with the RIR.
#

- speech, _ = torchaudio.load(SAMPLE_SPEECH)
+ speech, _ = load_torchcodec(SAMPLE_SPEECH)
augmented = F.fftconvolve(speech, rir)

######################################################################
@@ -219,8 +220,8 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
# To add noise to audio data per SNRs, we
# use :py:func:`torchaudio.functional.add_noise`.

- speech, _ = torchaudio.load(SAMPLE_SPEECH)
- noise, _ = torchaudio.load(SAMPLE_NOISE)
+ speech, _ = load_torchcodec(SAMPLE_SPEECH)
+ noise, _ = load_torchcodec(SAMPLE_NOISE)
noise = noise[:, : speech.shape[1]]

snr_dbs = torch.tensor([20, 10, 3])
@@ -275,7 +276,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
# a Tensor object.
#

- waveform, sample_rate = torchaudio.load(SAMPLE_SPEECH, channels_first=False)
+ waveform, sample_rate = load_torchcodec(SAMPLE_SPEECH, channels_first=False)


def apply_codec(waveform, sample_rate, format, encoder=None):
@@ -332,7 +333,7 @@ def apply_codec(waveform, sample_rate, format, encoder=None):
#

sample_rate = 16000
- original_speech, sample_rate = torchaudio.load(SAMPLE_SPEECH)
+ original_speech, sample_rate = load_torchcodec(SAMPLE_SPEECH)

plot_specgram(original_speech, sample_rate, title="Original")

@@ -345,7 +346,7 @@ def apply_codec(waveform, sample_rate, format, encoder=None):
# Because the noise is recorded in the actual environment, we consider that
# the noise contains the acoustic feature of the environment. Therefore, we add
# the noise after RIR application.
- noise, _ = torchaudio.load(SAMPLE_NOISE)
+ noise, _ = load_torchcodec(SAMPLE_NOISE)
noise = noise[:, : rir_applied.shape[1]]

snr_db = torch.tensor([8])