Perceptually Aligned Music2Latent

This repository is a fork of Music2Latent, created by Marco Pasini at Queen Mary University of London in partnership with Sony Computer Science Laboratories Paris and licensed under CC BY-NC 4.0. It includes the model weights and modifications from "Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders" (Bjare et al., NeurIPS - AI for Music Workshop, 2025) and remains under the same CC BY-NC 4.0 license.

If you use this project in your research, please cite the following paper:

@inproceedings{
   bjare2025perceptually,
   title={Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders},
   author={Mathias Rose Bjare and Giorgia Cantisani and Marco Pasini and Stefan Lattner and Gerhard Widmer},
   booktitle={NeurIPS - AI for Music Workshop},
   year={2025},
   url={https://openreview.net/forum?id=rXUKO0ysUy}
}

and the original Music2Latent paper.

Original Music2Latent readme below.

Music2Latent

Encode and decode audio samples to/from compressed representations! Useful for efficient generative modelling applications and for other downstream tasks.

Read the ISMIR 2024 paper here. Listen to audio samples here.

Under the hood, Music2Latent uses a Consistency Autoencoder model to efficiently encode and decode audio samples.

44.1 kHz audio is encoded into a sequence of latents at a rate of ~10 Hz, and each latent has 64 channels. 48 kHz audio can also be encoded, which results in a latent rate of ~12 Hz. A generative model can then be trained on these embeddings, or they can be used for other downstream tasks.
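As a rough illustration of how the latent rate scales with the input sample rate, the sketch below assumes a fixed time-downsampling factor of 4096 audio samples per latent frame. That factor is an assumption chosen because it is consistent with the approximate rates quoted above; it is not stated in this README, and the helper names are hypothetical. The actual sequence length returned by the model may differ slightly because of padding.

```python
import math

# ASSUMPTION: one latent frame per 4096 input samples. This value is not
# taken from the README; it is picked because it roughly reproduces the
# approximate rates quoted there.
ASSUMED_SAMPLES_PER_LATENT = 4096

def latent_rate_hz(sample_rate: int) -> float:
    """Approximate number of latent frames per second of audio."""
    return sample_rate / ASSUMED_SAMPLES_PER_LATENT

def approx_num_latents(duration_s: float, sample_rate: int) -> int:
    """Rough latent sequence length for a clip (ignores model-specific padding)."""
    return math.ceil(duration_s * sample_rate / ASSUMED_SAMPLES_PER_LATENT)

print(round(latent_rate_hz(44100), 2))   # 10.77
print(round(latent_rate_hz(48000), 2))   # 11.72
print(approx_num_latents(10.0, 44100))   # 108
```

Under this assumption, a 10-second clip at 44.1 kHz maps to roughly a hundred latent frames of 64 channels each, which is the sequence a downstream generative model would operate on.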

Music2Latent was trained on music and on speech. Refer to the paper for more details.

Installation

pip install music2latent

The model weights will be downloaded automatically the first time the code is run.

How to use

To encode and decode audio samples to/from latent embeddings:

import librosa

# Load an example clip at 44.1 kHz
audio_path = librosa.example('trumpet')
wv, sr = librosa.load(audio_path, sr=44100)

from music2latent import EncoderDecoder
encdec = EncoderDecoder()

latent = encdec.encode(wv)
# latent has shape (batch_size/audio_channels, dim (64), sequence_length)

wv_rec = encdec.decode(latent)

To extract encoder features to use in downstream tasks:

# encdec is the EncoderDecoder instance created above
features = encdec.encode(wv, extract_features=True)

These features are extracted before the encoder bottleneck, and thus have more channels (contain more information) than the latents used for reconstruction. It will not be possible to directly decode these features back to audio.
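One common way to use such features in a downstream task is to pool them over time into a fixed-size vector per clip. The sketch below is a minimal example of that idea, not part of the music2latent API: `pool_features` is a hypothetical helper, the features are assumed to have already been converted to a NumPy array with layout (batch, channels, sequence_length), and the channel count of 128 in the stand-in array is a placeholder rather than the model's actual feature dimension.

```python
import numpy as np

def pool_features(features: np.ndarray) -> np.ndarray:
    """Mean- and std-pool over the time axis.

    ASSUMPTION: features has shape (batch, channels, sequence_length).
    Returns an array of shape (batch, 2 * channels).
    """
    mean = features.mean(axis=-1)
    std = features.std(axis=-1)
    return np.concatenate([mean, std], axis=-1)

# Stand-in for real encoder features; the shape is a placeholder.
rng = np.random.default_rng(0)
dummy_features = rng.standard_normal((1, 128, 50)).astype(np.float32)

clip_embedding = pool_features(dummy_features)
print(clip_embedding.shape)  # (1, 256)
```

The pooled vector can then be fed to a lightweight classifier or regressor. Other pooling choices (max pooling, attention pooling) may suit some tasks better.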

music2latent supports more advanced usage, including GPU memory management controls. Please refer to tutorial.ipynb.

License

This library is released under the CC BY-NC 4.0 license. Please refer to the LICENSE file for more details.

This work was conducted by Marco Pasini during his PhD at Queen Mary University of London, in partnership with Sony Computer Science Laboratories Paris. This work was supervised by Stefan Lattner and George Fazekas.
