An AI-powered podcast generation tool: automatically extract text from any source (webpage, YouTube, PDF) → generate multi-role dialogue scripts via Gemini LLM → synthesize speech with Edge TTS → store in Cloudflare R2 + Cloudflare D1.
- AI Podcast: https://podcast-997.pages.dev/
- AI Podcast UI & Cloudflare Worker: https://github.com/ai-bot-pro/react-nextjs-web-podcast
| Feature | Description |
|---|---|
| Content Extraction | Supports webpages (BeautifulSoup), YouTube (subtitles/transcription), PDF (PyMuPDF) |
| Script Generation | Gemini (google-genai + instructor) streams multi-role podcast dialogue |
| Speech Synthesis | Microsoft Edge TTS, supports Chinese/English/Japanese/Korean and more, with exponential backoff retry |
| Cover Art | SiliconFlow AI image generation, auto-compressed and uploaded to Cloudflare R2 |
| Word-level Subtitles | Gemini audio understanding produces per-token timing; writes JSON / WebVTT (karaoke <time> tags) / enhanced LRC / SRT |
| Storage / Database | Cloudflare R2 (audio/images/subtitles) + Cloudflare D1 (metadata) |
| RSS Feed | Generate Apple Podcasts-compatible RSS XML from D1, optionally upload to R2 |
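As an illustration of the extraction step: the project uses BeautifulSoup for webpages, but the same idea can be sketched with only the stdlib `html.parser` (this is not the tool's actual extractor):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside of skipped elements
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The real extractor additionally feeds the text through Gemini (via instructor) to structure it, which this sketch omits.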
```
podcast/
├── pyproject.toml
├── .env.example
├── podcast/
│   ├── gen_podcast.py                  # End-to-end CLI entry point
│   ├── gen_podcasts_xml.py             # Generate RSS feed XML from D1
│   ├── content_parser_tts.py           # Content extraction + TTS
│   ├── subtitle_generator.py           # Gemini audio → word-level subtitles
│   ├── insert_podcast.py               # R2 upload + D1 insert
│   ├── _llm_retry.py                   # Shared Gemini transient-error retry helpers
│   ├── audio_length.py
│   ├── image_compression.py
│   ├── siliconflow_api.py
│   ├── aws/
│   │   └── upload.py                   # Cloudflare R2 via boto3
│   ├── cloudflare/
│   │   └── rest_api.py                 # D1 REST API
│   └── content_parser/
│       ├── types.py                    # Language code mapping
│       ├── content_extractor_instructor.py
│       ├── pdf_extractor_instructor.py
│       ├── website_extractor_instructor.py
│       ├── youtube_transcriber_instructor.py
│       └── table/
│           └── podcast.py              # LLM prompt + structured output
└── audios/                             # Generated audio files (gitignored)
```
```shell
# Python 3.11+ recommended
pip install gen-podcast

# Or install from source in editable mode
cd podcast
make install  # equivalent to: pip install -e .
```

Copy `.env.example` to `.env` and fill in your values:

```shell
cp .env.example .env
```

| Variable | Description |
|---|---|
| `GOOGLE_API_KEY` | Google Gemini API key |
| `GEMINI_MODEL` | Model ID, default `gemini-3-flash-preview` (without `models/` prefix) |
| `GEMINI_FALLBACK_MODEL` | Optional fallback model on 503 overload (e.g. `gemini-2.5-flash`) |
| `GEMINI_SUBTITLE_MODEL` | Audio-capable model for word-level subtitles, default `gemini-2.5-flash` |
| `GEMINI_MAX_RETRIES` | LLM retry count, default 6 |
| `GEMINI_RETRY_BASE_SEC` | Retry base delay in seconds (exponential backoff), default 2 |
| `ROUND_CN` | Number of dialogue rounds (optional; random 20–50 if unset) |
| `PODCAST_D1_DB_ID` | Cloudflare D1 database ID |
| `CLOUDFLARE_API_KEY` | Cloudflare API token |
| `CLOUDFLARE_ACCOUNT_ID` | Cloudflare account ID |
| `CLOUDFLARE_ACCESS_KEY` | R2 access key |
| `CLOUDFLARE_SECRET_KEY` | R2 secret key |
| `CLOUDFLARE_REGION` | R2 region, default `apac` |
| `S3_BUCKET_URL` | R2 public base URL |
| `SILICONCLOUD_API_KEY` | SiliconFlow API key (for cover art generation) |
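The retry variables above drive the shared Gemini retry helper (`podcast/_llm_retry.py`). A minimal sketch of how such a policy can read the two env vars; the jitter and blanket exception handling here are illustrative assumptions, and the real helper may differ:

```python
import os
import random
import time


def retry_with_backoff(fn, *args, **kwargs):
    """Call fn, retrying failures with exponential backoff.

    Attempt delays grow as base * 2**attempt (2s, 4s, 8s, ... with the
    defaults), plus proportional random jitter to spread out retries.
    """
    max_retries = int(os.environ.get("GEMINI_MAX_RETRIES", "6"))
    base = float(os.environ.get("GEMINI_RETRY_BASE_SEC", "2"))
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the last error
            delay = base * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```

In the real helper, only transient errors (such as 503 overload) should trigger a retry; a 404 from an invalid model ID should fail immediately.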
```shell
# English podcast
gen-podcast run \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Chinese podcast (specify Chinese voices)
gen-podcast run \
  --role-tts-voices zh-CN-YunjianNeural \
  --role-tts-voices zh-CN-XiaoxiaoNeural \
  --language zh \
  --category 1 \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Multiple sources + publish
gen-podcast run \
  --role-tts-voices zh-CN-YunjianNeural \
  --role-tts-voices zh-CN-XiaoxiaoNeural \
  --language zh \
  --category 1 \
  --is-published \
  "https://www.youtube.com/watch?v=aR6CzM0x-g0" \
  "https://en.wikipedia.org/wiki/Large_language_model" \
  "/path/to/paper.pdf"

# Or use make (pass extra arguments via ARGS)
make gen-podcast ARGS="--language zh https://en.wikipedia.org/wiki/Large_language_model"
```

`run` options:
| Option | Default | Description |
|---|---|---|
| SOURCES | — | One or more sources (webpage URL / YouTube URL / PDF path) |
| `--role-tts-voices` | `en-US-JennyNeural` `en-US-EricNeural` | Edge TTS voice(s), repeatable |
| `--language` | `en` | Dialogue language (`zh` / `en` / `ja` / `ko`, etc.) |
| `--save-dir` | `./audios/podcast` | Audio output directory |
| `--category` | `0` | Category (0=unknown, 1=tech, 2=education, 3=food, 4=travel, 5=code, …) |
| `--is-published` | `False` | When set, marks the podcast as published in D1 and prints the public URL |
| `--subtitles` / `--no-subtitles` | `False` | When enabled, generates word-level subtitles via Gemini, uploads them to R2, and saves the URLs to D1 (opt-in) |
| `--subtitle-model` | `""` | Override `GEMINI_SUBTITLE_MODEL` for the subtitle pass only |
```shell
# Single source → generate mp3 + vtt
content-parser-tts instruct-content-tts \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Chinese
content-parser-tts instruct-content-tts \
  --role-tts-voices zh-CN-YunjianNeural \
  --role-tts-voices zh-CN-XiaoxiaoNeural \
  --language zh \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Manually merge segmented audio
content-parser-tts merge-audio-files \
  audios/podcast/Large_language_model/0 audios/podcast/LLM.mp3
```

```shell
# Generate rss.xml locally from D1 podcast data
gen-podcasts-xml gen_xml_from_d1_podcast

# Generate and upload to Cloudflare R2
gen-podcasts-xml gen_xml_from_d1_podcast --is-upload

# Or use make
make gen-rss         # generate only
make gen-rss-upload  # generate + upload to R2
```

```shell
# Upload audio + generate cover + write to D1
insert-podcast insert-podcast-to-d1 \
  ./audios/podcast/LLM.mp3 \
  "Large Language Model" \
  "weedge" \
  "en-US-EricNeural,en-US-JennyNeural" \
  --language en \
  --category 1 \
  --is-published

# Update cover art
insert-podcast update-podcast-cover-to-d1 \
  <pid> "https://example.com/cover.png"

# Backfill subtitle URLs for an existing podcast
insert-podcast update-podcast-subtitles-to-d1 \
  <pid> \
  --subtitle-json-url https://.../foo.words.json \
  --subtitle-vtt-url https://.../foo.words.vtt \
  --subtitle-lrc-url https://.../foo.words.lrc \
  --subtitle-srt-url https://.../foo.srt
```

Transcribe the final merged mp3 with Gemini and emit four subtitle artifacts next to the audio: `<stem>.words.json`, `<stem>.words.vtt` (WebVTT with inline `<00:00:00.500>` karaoke time tags), `<stem>.words.lrc` (enhanced LRC / A2 extension), and `<stem>.srt`. CJK languages are tokenized as 2–5-character phrases; space-separated languages are tokenized per word.
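The WebVTT output described above can be sketched from word timings. The `(text, start_sec, end_sec)` word-list shape below is an assumption for illustration, not the exact JSON schema `subtitle_generator.py` emits:

```python
def vtt_ts(sec: float) -> str:
    """Format seconds as a WebVTT HH:MM:SS.mmm timestamp."""
    ms = round(sec * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def words_to_vtt(words):
    """Render [(text, start_sec, end_sec), ...] as one karaoke WebVTT cue.

    Each word after the first is prefixed with an inline <HH:MM:SS.mmm>
    timestamp tag, which WebVTT-aware players use to highlight the cue
    word by word (karaoke style).
    """
    lines = ["WEBVTT", ""]
    cue_start, cue_end = words[0][1], words[-1][2]
    lines.append(f"{vtt_ts(cue_start)} --> {vtt_ts(cue_end)}")
    parts = [words[0][0]]
    for text, start, _end in words[1:]:
        parts.append(f"<{vtt_ts(start)}>{text}")
    lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"
```

The real generator additionally splits long dialogue into multiple cues and handles CJK phrase tokens; this sketch emits a single cue.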
```shell
# Standalone: generate subtitles for any mp3
subtitle-gen generate audios/podcast/LLM.mp3 --language zh

# With optional reference transcript (helps alignment)
subtitle-gen generate audios/podcast/LLM.mp3 \
  --language zh \
  --script-file audios/podcast/LLM.txt \
  --output-dir /tmp/subs

# Pick specific formats only
subtitle-gen generate audios/podcast/LLM.mp3 --formats json --formats vtt

# Or via make (AUDIO=required, LANG defaults to en)
make subtitle-gen AUDIO=audios/podcast/LLM.mp3 LANG=zh
make subtitle-gen AUDIO=audios/podcast/LLM.mp3 LANG=zh ARGS="--output-dir /tmp/subs"
```

When used through `gen-podcast run`, subtitle generation is opt-in: pass `--subtitles` to generate subtitles, upload them to R2 automatically, and store the URLs in the D1 columns `subtitle_json_url` / `subtitle_vtt_url` / `subtitle_lrc_url` / `subtitle_srt_url`. Before enabling it, existing D1 databases need this one-time migration:
```sql
ALTER TABLE podcast ADD COLUMN subtitle_json_url text DEFAULT "";
ALTER TABLE podcast ADD COLUMN subtitle_vtt_url text DEFAULT "";
ALTER TABLE podcast ADD COLUMN subtitle_lrc_url text DEFAULT "";
ALTER TABLE podcast ADD COLUMN subtitle_srt_url text DEFAULT "";
```

| Voice ID | Gender | Style |
|---|---|---|
| `zh-CN-YunjianNeural` | Male | Broadcast style |
| `zh-CN-YunxiNeural` | Male | Narrative style |
| `zh-CN-YunyangNeural` | Male | News style |
| `zh-CN-XiaoxiaoNeural` | Female | Natural & friendly (recommended) |
| `zh-CN-XiaoyiNeural` | Female | Gentle & sweet |
| Voice ID | Gender |
|---|---|
| `en-US-EricNeural` | Male |
| `en-US-JennyNeural` | Female |

Note: when `--language zh` is set without Chinese voices, the tool automatically replaces them and prints a warning with the recommended list.
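The auto-replacement behavior in that note can be sketched as a locale-prefix check; the default-voice mapping below is an assumption for illustration, not the tool's exact table:

```python
# Assumed defaults per language; the real tool's mapping may differ.
DEFAULT_VOICES = {
    "zh": ["zh-CN-YunjianNeural", "zh-CN-XiaoxiaoNeural"],
    "en": ["en-US-EricNeural", "en-US-JennyNeural"],
}


def ensure_voices_match(language: str, voices: list[str]) -> list[str]:
    """Replace the voice list if any voice's locale prefix mismatches.

    Edge TTS voice IDs start with a locale tag (e.g. "zh-CN-..."), so a
    simple prefix check catches English voices used for a zh podcast.
    """
    if all(v.lower().startswith(language.lower()) for v in voices):
        return voices
    fallback = DEFAULT_VOICES.get(language, voices)
    print(f"warning: voices {voices} do not match --language {language}; "
          f"using {fallback}")
    return fallback
```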
| Model ID | Description |
|---|---|
| `gemini-3-flash-preview` | Default; Gemini 3 Flash (fast) |
| `gemini-3.1-flash-lite-preview` | Gemini 3.1 Lite |
| `gemini-2.5-flash` | Stable, production-ready |
| `gemini-2.5-pro` | Best reasoning, higher cost |

Note: `gemini-3.1-flash-preview` does not exist and will return a 404 error.
```
Source (URL / PDF)
        │
        ▼
ContentExtractor ← Webpage / YouTube / PDF
        │
        ▼
Gemini LLM (instructor) ← Streams structured multi-role Podcast object
        │
        ▼
Edge TTS (per-role) ← Auto-cleans Markdown/SSML, exponential backoff retry
        │
        ▼
pydub merge mp3
        │
        ├─── SiliconFlow cover art (translate title → generate → compress)
        │
        ├─── Gemini audio → word-level subtitles (json / vtt / lrc / srt) [opt-in via --subtitles]
        │
        ▼
Cloudflare R2 upload (audio + cover + subtitles)
        │
        ▼
Cloudflare D1 metadata insert (includes subtitle URLs)
```
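The final step talks to Cloudflare's D1 HTTP `query` endpoint (wrapped by `podcast/cloudflare/rest_api.py`). A minimal sketch of building such a request with the stdlib, using the credentials from the configuration table; the real module may differ in detail:

```python
import json
import urllib.request

# Cloudflare D1 query endpoint (account-scoped)
D1_QUERY_URL = ("https://api.cloudflare.com/client/v4/accounts/"
                "{account_id}/d1/database/{db_id}/query")


def build_d1_query(account_id, db_id, api_token, sql, params=()):
    """Build (but do not send) a POST request for the D1 query endpoint.

    The body carries the SQL statement plus positional bind parameters;
    authentication is a bearer token (CLOUDFLARE_API_KEY).
    """
    url = D1_QUERY_URL.format(account_id=account_id, db_id=db_id)
    body = json.dumps({"sql": sql, "params": list(params)}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )

# Send with: urllib.request.urlopen(build_d1_query(...))
```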
```shell
make help            # Show all available commands
make install         # Install the package in editable mode
make gen-podcast     # Run gen-podcast CLI (pass ARGS="..." for extra arguments)
make gen-rss         # Generate RSS feed XML from D1 podcast data
make gen-rss-upload  # Generate RSS feed XML and upload to Cloudflare R2
make subtitle-gen    # Word-level subtitles via Gemini (AUDIO=path [LANG=zh] [ARGS="..."])
make build           # Build source and wheel distributions
make dist-local      # Install the built wheel locally
make publish-test    # Publish package to TestPyPI
make publish         # Publish package to PyPI
make clean           # Remove build artifacts
```

| Error | Cause | Solution |
|---|---|---|
| `503 UNAVAILABLE` | Gemini service overloaded | Set `GEMINI_FALLBACK_MODEL=gemini-2.5-flash` for automatic retry with the fallback model |
| `404 not found` | Invalid model ID | Check `GEMINI_MODEL`; do not use `gemini-3.1-flash-preview` |
| `NoAudioReceived` | Text contains unsupported characters, or Edge TTS service issue | The tool auto-cleans and retries; the segment is skipped if all attempts fail |
| `ModuleNotFoundError: jsonref` | Missing dependency | `pip install jsonref` or `pip install -e .` |
| Chinese podcast opens/closes in English | Default English voices used or `--language zh` not set | Add `--language zh --role-tts-voices zh-CN-YunjianNeural ...` |