Step 1: Open gradio_interface.py
Insert the following code above def build_interface(): (around line 196) and save the changes.
##### Plugin code Start #####
# Note: this block is pasted into gradio_interface.py, so it relies on names
# already in scope there (torch, torchaudio, Path, load_model_if_needed, make_cond_dict).
def batch_generate_audio(
    model_choice: str,
    text_dir: str,
    output_dir: str,
    language: str = "en-us",
    speaker_audio_path: str = None,
    cfg_scale: float = 2.0,
    min_p: float = 0.15,
    seed: int = 420,
    # Fixed emotion parameters matching the Gradio defaults
    e1: float = 1.0,   # Happiness
    e2: float = 0.05,  # Sadness
    e3: float = 0.05,  # Disgust
    e4: float = 0.05,  # Fear
    e5: float = 0.05,  # Surprise
    e6: float = 0.05,  # Anger
    e7: float = 0.1,   # Other
    e8: float = 0.2,   # Neutral
):
    """Batch generate audio from text files with a consistent tone."""
    # Load the model once
    model = load_model_if_needed(model_choice)
    device = model.device
    # Force deterministic settings
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Create the output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    # Process the speaker embedding once, if provided
    speaker_embedding = None
    if speaker_audio_path:
        wav, sr = torchaudio.load(speaker_audio_path)
        speaker_embedding = model.make_speaker_embedding(wav, sr)
        speaker_embedding = speaker_embedding.to(device, dtype=torch.bfloat16)
    # Fixed emotion tensor
    emotion_tensor = torch.tensor(
        [e1, e2, e3, e4, e5, e6, e7, e8],
        device=device,
        dtype=torch.float32,
    )
    # Process the text files
    text_files = list(Path(text_dir).glob("*.txt"))
    for text_file in text_files:
        with open(text_file, "r", encoding="utf-8") as f:
            text = f.read().strip()[:500]  # hard cap on characters per file
        # Create consistent conditioning
        cond_dict = make_cond_dict(
            text=text,
            language=language,
            speaker=speaker_embedding,
            emotion=emotion_tensor,
            device=device,
            # Only disable these if your model supports them
            unconditional_keys=["vqscore_8", "dnsmos_ovrl"],
        )
        conditioning = model.prepare_conditioning(cond_dict)
        # Deterministic generation
        with torch.no_grad():
            codes = model.generate(
                prefix_conditioning=conditioning,
                max_new_tokens=86 * 30,
                cfg_scale=cfg_scale,
                batch_size=1,
                sampling_params=dict(min_p=min_p),
            )
        # Audio processing and saving (keep the previous fix)
        wav_out = model.autoencoder.decode(codes).cpu().detach()
        if wav_out.dim() == 3:
            wav_out = wav_out.squeeze(0)
        if wav_out.dim() == 1:
            wav_out = wav_out.unsqueeze(0)
        if wav_out.dim() == 2 and wav_out.size(0) > 1:
            wav_out = wav_out[0:1, :]
        output_file = output_path / f"{text_file.stem}.wav"
        torchaudio.save(str(output_file), wav_out, model.autoencoder.sampling_rate)
    return f"Processed {len(text_files)} files with consistent tone to {output_dir}"
##### Plugin code End #####
Step 2: Create batch_process.py
In the same directory as gradio_interface.py, create a new file, insert the following code, and save it.
##### batch_process.py #####
import argparse

from gradio_interface import batch_generate_audio

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Batch process text files into audio')
    parser.add_argument('--model', type=str, default="Zyphra/Zonos-v0.1-transformer",
                        help='Model name from Hugging Face Hub')
    parser.add_argument('--text-dir', type=str, required=True,
                        help='Directory containing text files')
    parser.add_argument('--output-dir', type=str, required=True,
                        help='Directory to save audio files')
    parser.add_argument('--speaker-audio', type=str, default=None,
                        help='Path to speaker audio for voice cloning')
    parser.add_argument('--language', type=str, default="en-us",
                        help='Language code for synthesis')
    parser.add_argument('--seed', type=int, default=420,
                        help='Random seed for reproducibility')
    args = parser.parse_args()

    result = batch_generate_audio(
        model_choice=args.model,
        text_dir=args.text_dir,
        output_dir=args.output_dir,
        language=args.language,
        speaker_audio_path=args.speaker_audio,
        seed=args.seed,
    )
    print(result)
Step 3: Prepare your text
You can either manually divide your text into small chunks that produce at most 27 to 30 seconds of audio, or use a text splitter to segment it into smaller portions. Zonos generates a maximum of about 30 seconds of audio per run and does not count words, so a sentence with few words can still produce longer audio depending on the settings. With the default settings, I found 40 to 50 words work well. The script below splits your text at sentences ending with '.', '!', or '?' (you can adjust this filter) and keeps each chunk to at most 50 words; change the limit by editing the value of word_limit.
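As a rough sanity check on the 40-to-50-word guideline (the speaking rate below is an assumed average, not something Zonos guarantees):

# Back-of-the-envelope chunk sizing, assuming roughly 150 spoken words per minute.
words_per_second = 150 / 60      # ~2.5 words/s (assumption)
print(50 / words_per_second)     # 50 words -> ~20 s, safely under the 30 s cap
print(86 * 30)                   # 2580: the plugin's max_new_tokens, which matches
                                 # the 30 s cap if the model emits ~86 tokens/s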
Put your text into a text file and save it as data.txt.
You can change the file name, but update the input_file value to match.
Step 4: Split the text
Create a new file named text_split.py, copy and paste the following code into it, and save it. Then run:
python text_split.py
This script will create a folder named 'text_chunks' and save the text chunks there as numbered files. If any sentence exceeds 50 words, it is written to its own file (chunk_plus_50_1.txt, and so on) for you to address manually.
##### text_split.py #####
from pathlib import Path

def split_into_sentences(text):
    # Split text into sentences and keep track of line numbers
    lines = text.split('\n')
    sentences = []
    line_mappings = []  # Store (sentence, start_line, end_line)
    current_sentence = []
    start_line = 1
    for line_num, line in enumerate(lines, 1):
        words = line.strip().split()
        if not words:  # Skip empty lines
            if current_sentence:
                sentence = ' '.join(current_sentence)
                sentences.append(sentence)
                line_mappings.append((sentence, start_line, line_num))
                current_sentence = []
                start_line = line_num + 1
            continue
        current_sentence.extend(words)
        # Check for sentence endings
        if words[-1].endswith(('.', '!', '?')):
            sentence = ' '.join(current_sentence)
            sentences.append(sentence)
            line_mappings.append((sentence, start_line, line_num))
            current_sentence = []
            start_line = line_num + 1
    # Handle any remaining text
    if current_sentence:
        sentence = ' '.join(current_sentence)
        sentences.append(sentence)
        line_mappings.append((sentence, start_line, len(lines)))
    return sentences, line_mappings

def count_words(text):
    return len(text.split())

def create_mapping_file(chunk_mappings, output_dir):
    with open(output_dir / 'chunk_mapping.txt', 'w', encoding='utf-8') as f:
        f.write("Chunk Mapping Reference:\n")
        f.write("=" * 80 + "\n\n")
        for chunk_info in chunk_mappings:
            f.write(f"File: {chunk_info['filename']}\n")
            f.write(f"Lines: {chunk_info['start_line']} to {chunk_info['end_line']}\n")
            f.write(f"Preview: {chunk_info['preview'][:100]}...\n")
            f.write("-" * 80 + "\n\n")

def create_chunks(input_file, word_limit=50):
    output_dir = Path('text_chunks')
    output_dir.mkdir(exist_ok=True)
    with open(input_file, 'r', encoding='utf-8') as f:
        text = f.read()
    sentences, line_mappings = split_into_sentences(text)
    current_chunk = []
    current_word_count = 0
    chunk_number = 1
    long_sentence_number = 1
    chunk_mappings = []
    sentence_index = 0
    while sentence_index < len(sentences):
        sentence = sentences[sentence_index]
        sentence_word_count = count_words(sentence)
        current_mapping = line_mappings[sentence_index]
        # Handle sentences longer than the word limit
        if sentence_word_count > word_limit:
            # Save any accumulated chunk first
            if current_chunk:
                chunk_text = ' '.join(current_chunk)
                filename = f'chunk_{chunk_number}.txt'
                with open(output_dir / filename, 'w', encoding='utf-8') as f:
                    f.write(chunk_text)
                chunk_mappings.append({
                    'filename': filename,
                    'start_line': line_mappings[sentence_index - len(current_chunk)][1],
                    'end_line': line_mappings[sentence_index - 1][2],
                    'preview': chunk_text
                })
                chunk_number += 1
                current_chunk = []
                current_word_count = 0
            # Save the long sentence to its own file for manual editing
            filename = f'chunk_plus_{word_limit}_{long_sentence_number}.txt'
            with open(output_dir / filename, 'w', encoding='utf-8') as f:
                f.write(sentence)
            chunk_mappings.append({
                'filename': filename,
                'start_line': current_mapping[1],
                'end_line': current_mapping[2],
                'preview': sentence
            })
            long_sentence_number += 1
            sentence_index += 1
            continue
        if current_word_count + sentence_word_count <= word_limit:
            current_chunk.append(sentence)
            current_word_count += sentence_word_count
        else:
            if current_chunk:
                chunk_text = ' '.join(current_chunk)
                filename = f'chunk_{chunk_number}.txt'
                with open(output_dir / filename, 'w', encoding='utf-8') as f:
                    f.write(chunk_text)
                chunk_mappings.append({
                    'filename': filename,
                    'start_line': line_mappings[sentence_index - len(current_chunk)][1],
                    'end_line': line_mappings[sentence_index - 1][2],
                    'preview': chunk_text
                })
                chunk_number += 1
            current_chunk = [sentence]
            current_word_count = sentence_word_count
        sentence_index += 1
    # Save any remaining chunk
    if current_chunk:
        chunk_text = ' '.join(current_chunk)
        filename = f'chunk_{chunk_number}.txt'
        with open(output_dir / filename, 'w', encoding='utf-8') as f:
            f.write(chunk_text)
        chunk_mappings.append({
            'filename': filename,
            'start_line': line_mappings[sentence_index - len(current_chunk)][1],
            'end_line': line_mappings[sentence_index - 1][2],
            'preview': chunk_text
        })
    # Create the mapping reference file
    create_mapping_file(chunk_mappings, output_dir)

if __name__ == '__main__':
    input_file = 'data.txt'
    word_limit = 50  # default word limit
    try:
        create_chunks(input_file, word_limit)
        print("Text has been split into chunks in the 'text_chunks' directory")
        print("Check 'text_chunks/chunk_mapping.txt' for chunk locations")
    except Exception as e:
        print(f"An error occurred: {str(e)}")
Step 5: Clean your data & run
Examine the text files and review the results. 'chunk_mapping.txt' gives a detailed description of each chunk; move it out of the 'text_chunks' folder so that Zonos does not process it.
Place your speaker audio (your_voice.mp3 or *.wav) in the root directory of Zonos, or specify its path in the command below.
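Since the plugin picks up every *.txt file in the text directory, one way to keep the mapping file out of the run is a small move step (paths follow the defaults used above):

# Move chunk_mapping.txt out of text_chunks so it is not synthesized
from pathlib import Path
import shutil

mapping = Path('text_chunks/chunk_mapping.txt')
if mapping.exists():
    shutil.move(str(mapping), 'chunk_mapping.txt')  # park it next to data.txt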
I think you could concatenate the audio chunks into one single waveform tensor, which could then be written to a final file.
Yes, but it depends on the use case. For cases like this [https://www.youtube.com/watch?v=3OFkEtTFM84], it is practical to keep the output in parts.
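If you do want a single file, here is a minimal sketch of that concatenation (assuming all chunks came from the same batch run, so they share one sampling rate; the numeric sort key keeps chunk_10 from landing before chunk_2):

##### concat_audio.py (sketch) #####
import re
from pathlib import Path

import torch
import torchaudio

output_dir = Path("output_audio")

def chunk_index(p: Path) -> int:
    # Extract the chunk number so files sort numerically, not lexicographically
    m = re.search(r"(\d+)", p.stem)
    return int(m.group(1)) if m else 0

wavs = []
sample_rate = None
for wav_file in sorted(output_dir.glob("chunk_*.wav"), key=chunk_index):
    wav, sr = torchaudio.load(wav_file)
    if sample_rate is None:
        sample_rate = sr
    assert sr == sample_rate, f"Sampling rate mismatch in {wav_file}"
    wavs.append(wav)

# Concatenate along the time dimension and write one file
full_audio = torch.cat(wavs, dim=1)
torchaudio.save("final_audio.wav", full_audio, sample_rate)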
Usage:
python batch_process.py --text-dir ./text_chunks --output-dir ./output_audio --speaker-audio ./your_voice.mp3 --language en-us --seed 420
In my tests, this produces up to 80 to 85% consistent voice tone.
Another solution: #98 (comment)