Replies: 3 comments
It shouldn't be limited to 10s; the max sequence length of the model is 2048 tokens, which is ~163 seconds of audio. If you have a lot of context, e.g. previous turns or voice prompts, it'll reduce the maximum generation length you can produce. Also take a look at
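A back-of-envelope sketch of that token budget, using only the numbers above (2048 tokens ≈ 163 s, so roughly 12.5 tokens per second of audio); the helper name is hypothetical:

```python
# Rough audio budget for the model, assuming the thread's numbers:
# a 2048-token max sequence length covering ~163 seconds of audio.
MAX_SEQ_LEN = 2048
SECONDS_AT_MAX = 163
TOKENS_PER_SECOND = MAX_SEQ_LEN / SECONDS_AT_MAX  # ~12.56 tokens/s

def remaining_audio_seconds(context_tokens: int) -> float:
    """Estimate how many seconds of audio can still be generated once
    `context_tokens` of the window are consumed by previous turns,
    voice prompts, and the text prompt itself."""
    return max(0, MAX_SEQ_LEN - context_tokens) / TOKENS_PER_SECOND

print(remaining_audio_seconds(0))     # full window available: 163.0 s
print(remaining_audio_seconds(1024))  # half the window used: 81.5 s
```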
Thanks Zack. Then the code below should work just fine (taken from the example)?

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs via the chat template
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "long set of text pasted here in actuality"}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# run inference and save the generated audio
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "copypasta.wav")
```

This is resulting in only 10s of output, despite the input text being much longer than 10 seconds' worth of audio.
We didn't implement the transformers version of CSM; you'll want to check their docs: https://huggingface.co/docs/transformers/model_doc/csm. In the transformers library there are standard ways to change the max output length, e.g. via GenerationConfig.
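A minimal sketch of what that looks like with the transformers generation API (the `max_new_tokens` value here is illustrative, not tuned; the `model.generate` lines are shown as comments since they require the loaded model from the example above):

```python
from transformers import GenerationConfig

# Standard transformers way to lift the output-length cap: generation stops
# after max_new_tokens new tokens (or the model's max sequence length,
# whichever comes first).
gen_config = GenerationConfig(max_new_tokens=1500, do_sample=True)

# Then pass it to generate, e.g.:
#   audio = model.generate(**inputs, output_audio=True, generation_config=gen_config)
# or, equivalently, pass the kwarg directly:
#   audio = model.generate(**inputs, output_audio=True, max_new_tokens=1500)
```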
I only seem to be able to generate 10s of audio (running on Google Colab). Looking for a solution that lets me generate longer audio files.