An AI-powered podcast generation tool: automatically extract text from any source (webpage, YouTube, PDF) → generate multi-role dialogue scripts via Gemini LLM → synthesize speech with Edge TTS → store in Cloudflare R2 + Cloudflare D1.
- AI Podcast: https://podcast-997.pages.dev/
- AI Podcast UI & Cloudflare Worker: https://github.com/ai-bot-pro/react-nextjs-web-podcast
| Feature | Description |
|---|---|
| Content Extraction | Supports webpages (BeautifulSoup), YouTube (subtitles/transcription), PDF (PyMuPDF) |
| Script Generation | Gemini (google-genai + instructor) streams multi-role podcast dialogue |
| Speech Synthesis | Microsoft Edge TTS, supports Chinese/English/Japanese/Korean and more, with exponential backoff retry |
| Cover Art | SiliconFlow AI image generation, auto-compressed and uploaded to Cloudflare R2 |
| Word-level Subtitles | Gemini audio understanding produces per-token timing; writes JSON / WebVTT (karaoke <time> tags) / enhanced LRC / SRT |
| Storage / Database | Cloudflare R2 (audio/images/subtitles) + Cloudflare D1 (metadata) |
| RSS Feed | Generate Apple Podcasts-compatible RSS XML from D1, optionally upload to R2 |
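As an illustration of the extraction step: the project uses BeautifulSoup for webpages, but the same idea can be sketched with only the stdlib `html.parser` (this is not the tool's actual extractor):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside of skipped elements
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The real extractor additionally feeds the text through Gemini (via instructor) to structure it, which this sketch omits.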
```
podcast/
├── pyproject.toml
├── .env.example
├── podcast/
│   ├── gen_podcast.py                  # End-to-end CLI entry point
│   ├── gen_podcasts_xml.py             # Generate RSS feed XML from D1
│   ├── content_parser_tts.py           # Content extraction + TTS
│   ├── subtitle_generator.py           # Gemini audio → word-level subtitles
│   ├── insert_podcast.py               # R2 upload + D1 insert
│   ├── _llm_retry.py                   # Shared Gemini transient-error retry helpers
│   ├── audio_length.py
│   ├── image_compression.py
│   ├── siliconflow_api.py
│   ├── aws/
│   │   └── upload.py                   # Cloudflare R2 via boto3
│   ├── cloudflare/
│   │   └── rest_api.py                 # D1 REST API
│   └── content_parser/
│       ├── types.py                    # Language code mapping
│       ├── content_extractor_instructor.py
│       ├── pdf_extractor_instructor.py
│       ├── website_extractor_instructor.py
│       ├── youtube_transcriber_instructor.py
│       └── table/
│           └── podcast.py              # LLM prompt + structured output
└── audios/                             # Generated audio files (gitignored)
```
```shell
# Python 3.11+ recommended
pip install gen-podcast

# Or install from source in editable mode
cd podcast
make install  # equivalent to: pip install -e .
```

Copy `.env.example` to `.env` and fill in your values:

```shell
cp .env.example .env
```

| Variable | Description |
|---|---|
| `GOOGLE_API_KEY` | Google Gemini API key |
| `GEMINI_MODEL` | Model ID, default `gemini-3-flash-preview` (without `models/` prefix) |
| `GEMINI_FALLBACK_MODEL` | Optional fallback model on 503 overload (e.g. `gemini-2.5-flash`) |
| `GEMINI_SUBTITLE_MODEL` | Audio-capable model for word-level subtitles, default `gemini-2.5-flash` |
| `GEMINI_MAX_RETRIES` | LLM retry count, default 6 |
| `GEMINI_RETRY_BASE_SEC` | Retry base delay in seconds (exponential backoff), default 2 |
| `ROUND_CN` | Number of dialogue rounds (optional; random 20–50 if unset) |
| `PODCAST_D1_DB_ID` | Cloudflare D1 database ID |
| `CLOUDFLARE_API_KEY` | Cloudflare API token |
| `CLOUDFLARE_ACCOUNT_ID` | Cloudflare account ID |
| `CLOUDFLARE_ACCESS_KEY` | R2 access key |
| `CLOUDFLARE_SECRET_KEY` | R2 secret key |
| `CLOUDFLARE_REGION` | R2 region, default `apac` |
| `S3_BUCKET_URL` | R2 public base URL |
| `SILICONCLOUD_API_KEY` | SiliconFlow API key (for cover art generation) |
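The retry variables above drive the shared Gemini retry helper (`podcast/_llm_retry.py`). A minimal sketch of how such a policy can read the two env vars; the jitter and blanket exception handling here are illustrative assumptions, and the real helper may differ:

```python
import os
import random
import time


def retry_with_backoff(fn, *args, **kwargs):
    """Call fn, retrying failures with exponential backoff.

    Attempt delays grow as base * 2**attempt (2s, 4s, 8s, ... with the
    defaults), plus proportional random jitter to spread out retries.
    """
    max_retries = int(os.environ.get("GEMINI_MAX_RETRIES", "6"))
    base = float(os.environ.get("GEMINI_RETRY_BASE_SEC", "2"))
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the last error
            delay = base * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```

In the real helper, only transient errors (such as 503 overload) should trigger a retry; a 404 from an invalid model ID should fail immediately.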
```shell
# English podcast
gen-podcast run \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Chinese podcast (specify Chinese voices)
gen-podcast run \
  --role-tts-voices zh-CN-YunjianNeural \
  --role-tts-voices zh-CN-XiaoxiaoNeural \
  --language zh \
  --category 1 \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Multiple sources + publish
gen-podcast run \
  --role-tts-voices zh-CN-YunjianNeural \
  --role-tts-voices zh-CN-XiaoxiaoNeural \
  --language zh \
  --category 1 \
  --is-published \
  "https://www.youtube.com/watch?v=aR6CzM0x-g0" \
  "https://en.wikipedia.org/wiki/Large_language_model" \
  "/path/to/paper.pdf"

# Or use make (pass extra arguments via ARGS)
make gen-podcast ARGS="--language zh https://en.wikipedia.org/wiki/Large_language_model"
```

`run` options:
| Option | Default | Description |
|---|---|---|
| SOURCES | — | One or more sources (webpage URL / YouTube URL / PDF path) |
| `--role-tts-voices` | `en-US-JennyNeural` `en-US-EricNeural` | Edge TTS voice(s), repeatable |
| `--language` | `en` | Dialogue language (`zh` / `en` / `ja` / `ko`, etc.) |
| `--save-dir` | `./audios/podcast` | Audio output directory |
| `--category` | `0` | Category (0=unknown, 1=tech, 2=education, 3=food, 4=travel, 5=code, …) |
| `--is-published` | `False` | When set, marks the podcast as published in D1 and prints the public URL |
| `--subtitles` / `--no-subtitles` | `False` | When enabled, generates word-level subtitles via Gemini, uploads them to R2, and saves the URLs to D1 (opt-in) |
| `--subtitle-model` | `""` | Override `GEMINI_SUBTITLE_MODEL` for the subtitle pass only |
```shell
# Single source → generate mp3 + vtt
content-parser-tts instruct-content-tts \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Chinese
content-parser-tts instruct-content-tts \
  --role-tts-voices zh-CN-YunjianNeural \
  --role-tts-voices zh-CN-XiaoxiaoNeural \
  --language zh \
  "https://en.wikipedia.org/wiki/Large_language_model"

# Manually merge segmented audio
content-parser-tts merge-audio-files \
  audios/podcast/Large_language_model/0 audios/podcast/LLM.mp3
```

```shell
# Generate rss.xml locally from D1 podcast data
gen-podcasts-xml gen_xml_from_d1_podcast

# Generate and upload to Cloudflare R2
gen-podcasts-xml gen_xml_from_d1_podcast --is-upload

# Or use make
make gen-rss         # generate only
make gen-rss-upload  # generate + upload to R2
```

```shell
# Upload audio + generate cover + write to D1
insert-podcast insert-podcast-to-d1 \
  ./audios/podcast/LLM.mp3 \
  "Large Language Model" \
  "weedge" \
  "en-US-EricNeural,en-US-JennyNeural" \
  --language en \
  --category 1 \
  --is-published

# Update cover art
insert-podcast update-podcast-cover-to-d1 \
  <pid> "https://example.com/cover.png"

# Backfill subtitle URLs for an existing podcast
insert-podcast update-podcast-subtitles-to-d1 \
  <pid> \
  --subtitle-json-url https://.../foo.words.json \
  --subtitle-vtt-url https://.../foo.words.vtt \
  --subtitle-lrc-url https://.../foo.words.lrc \
  --subtitle-srt-url https://.../foo.srt
```

Transcribe the final merged mp3 with Gemini and emit four subtitle artifacts next to the audio: `<stem>.words.json`, `<stem>.words.vtt` (WebVTT with inline `<00:00:00.500>` karaoke time tags), `<stem>.words.lrc` (enhanced LRC / A2 extension), and `<stem>.srt`. CJK languages are tokenized as 2–5-character phrases; space-separated languages are tokenized per word.
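The WebVTT output described above can be sketched from word timings. The `(text, start_sec, end_sec)` word-list shape below is an assumption for illustration, not the exact JSON schema `subtitle_generator.py` emits:

```python
def vtt_ts(sec: float) -> str:
    """Format seconds as a WebVTT HH:MM:SS.mmm timestamp."""
    ms = round(sec * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def words_to_vtt(words):
    """Render [(text, start_sec, end_sec), ...] as one karaoke WebVTT cue.

    Each word after the first is prefixed with an inline <HH:MM:SS.mmm>
    timestamp tag, which WebVTT-aware players use to highlight the cue
    word by word (karaoke style).
    """
    lines = ["WEBVTT", ""]
    cue_start, cue_end = words[0][1], words[-1][2]
    lines.append(f"{vtt_ts(cue_start)} --> {vtt_ts(cue_end)}")
    parts = [words[0][0]]
    for text, start, _end in words[1:]:
        parts.append(f"<{vtt_ts(start)}>{text}")
    lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"
```

The real generator additionally splits long dialogue into multiple cues and handles CJK phrase tokens; this sketch emits a single cue.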
```shell
# Standalone: generate subtitles for any mp3
subtitle-gen generate audios/podcast/LLM.mp3 --language zh

# With optional reference transcript (helps alignment)
subtitle-gen generate audios/podcast/LLM.mp3 \
  --language zh \
  --script-file audios/podcast/LLM.txt \
  --output-dir /tmp/subs

# Pick specific formats only
subtitle-gen generate audios/podcast/LLM.mp3 --formats json --formats vtt

# Or via make (AUDIO=required, LANG defaults to en)
make subtitle-gen AUDIO=audios/podcast/LLM.mp3 LANG=zh
make subtitle-gen AUDIO=audios/podcast/LLM.mp3 LANG=zh ARGS="--output-dir /tmp/subs"
```

When used through `gen-podcast run`, subtitle generation is opt-in: pass `--subtitles` to generate subtitles, upload them to R2 automatically, and store the URLs in the D1 columns `subtitle_json_url` / `subtitle_vtt_url` / `subtitle_lrc_url` / `subtitle_srt_url`. Before enabling it, existing D1 databases need this one-time migration:
```sql
ALTER TABLE podcast ADD COLUMN subtitle_json_url text DEFAULT "";
ALTER TABLE podcast ADD COLUMN subtitle_vtt_url text DEFAULT "";
ALTER TABLE podcast ADD COLUMN subtitle_lrc_url text DEFAULT "";
ALTER TABLE podcast ADD COLUMN subtitle_srt_url text DEFAULT "";
```

| Voice ID | Gender | Style |
|---|---|---|
| `zh-CN-YunjianNeural` | Male | Broadcast style |
| `zh-CN-YunxiNeural` | Male | Narrative style |
| `zh-CN-YunyangNeural` | Male | News style |
| `zh-CN-XiaoxiaoNeural` | Female | Natural & friendly (recommended) |
| `zh-CN-XiaoyiNeural` | Female | Gentle & sweet |
| Voice ID | Gender |
|---|---|
| `en-US-EricNeural` | Male |
| `en-US-JennyNeural` | Female |

Note: when `--language zh` is set without Chinese voices, the tool automatically replaces them and prints a warning with the recommended list.
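The auto-replacement behavior in that note can be sketched as a locale-prefix check; the default-voice mapping below is an assumption for illustration, not the tool's exact table:

```python
# Assumed defaults per language; the real tool's mapping may differ.
DEFAULT_VOICES = {
    "zh": ["zh-CN-YunjianNeural", "zh-CN-XiaoxiaoNeural"],
    "en": ["en-US-EricNeural", "en-US-JennyNeural"],
}


def ensure_voices_match(language: str, voices: list[str]) -> list[str]:
    """Replace the voice list if any voice's locale prefix mismatches.

    Edge TTS voice IDs start with a locale tag (e.g. "zh-CN-..."), so a
    simple prefix check catches English voices used for a zh podcast.
    """
    if all(v.lower().startswith(language.lower()) for v in voices):
        return voices
    fallback = DEFAULT_VOICES.get(language, voices)
    print(f"warning: voices {voices} do not match --language {language}; "
          f"using {fallback}")
    return fallback
```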
| Model ID | Description |
|---|---|
| `gemini-3-flash-preview` | Default; Gemini 3 Flash (fast) |
| `gemini-3.1-flash-lite-preview` | Gemini 3.1 Lite |
| `gemini-2.5-flash` | Stable, production-ready |
| `gemini-2.5-pro` | Best reasoning, higher cost |

Note: `gemini-3.1-flash-preview` does not exist and will return a 404 error.
```
Source (URL / PDF)
        │
        ▼
ContentExtractor ← Webpage / YouTube / PDF
        │
        ▼
Gemini LLM (instructor) ← Streams structured multi-role Podcast object
        │
        ▼
Edge TTS (per-role) ← Auto-cleans Markdown/SSML, exponential backoff retry
        │
        ▼
pydub merge mp3
        │
        ├─── SiliconFlow cover art (translate title → generate → compress)
        │
        ├─── Gemini audio → word-level subtitles (json / vtt / lrc / srt) [opt-in via --subtitles]
        │
        ▼
Cloudflare R2 upload (audio + cover + subtitles)
        │
        ▼
Cloudflare D1 metadata insert (includes subtitle URLs)
```
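The final step talks to Cloudflare's D1 HTTP `query` endpoint (wrapped by `podcast/cloudflare/rest_api.py`). A minimal sketch of building such a request with the stdlib, using the credentials from the configuration table; the real module may differ in detail:

```python
import json
import urllib.request

# Cloudflare D1 query endpoint (account-scoped)
D1_QUERY_URL = ("https://api.cloudflare.com/client/v4/accounts/"
                "{account_id}/d1/database/{db_id}/query")


def build_d1_query(account_id, db_id, api_token, sql, params=()):
    """Build (but do not send) a POST request for the D1 query endpoint.

    The body carries the SQL statement plus positional bind parameters;
    authentication is a bearer token (CLOUDFLARE_API_KEY).
    """
    url = D1_QUERY_URL.format(account_id=account_id, db_id=db_id)
    body = json.dumps({"sql": sql, "params": list(params)}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )

# Send with: urllib.request.urlopen(build_d1_query(...))
```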
```shell
make help            # Show all available commands
make install         # Install the package in editable mode
make gen-podcast     # Run gen-podcast CLI (pass ARGS="..." for extra arguments)
make gen-rss         # Generate RSS feed XML from D1 podcast data
make gen-rss-upload  # Generate RSS feed XML and upload to Cloudflare R2
make subtitle-gen    # Word-level subtitles via Gemini (AUDIO=path [LANG=zh] [ARGS="..."])
make build           # Build source and wheel distributions
make dist-local      # Install the built wheel locally
make publish-test    # Publish package to TestPyPI
make publish         # Publish package to PyPI
make clean           # Remove build artifacts
```

| Error | Cause | Solution |
|---|---|---|
| `503 UNAVAILABLE` | Gemini service overloaded | Set `GEMINI_FALLBACK_MODEL=gemini-2.5-flash` for automatic retry with the fallback model |
| `404 not found` | Invalid model ID | Check `GEMINI_MODEL`; do not use `gemini-3.1-flash-preview` |
| `NoAudioReceived` | Text contains unsupported characters, or Edge TTS service issue | The tool auto-cleans and retries; the segment is skipped if all attempts fail |
| `ModuleNotFoundError: jsonref` | Missing dependency | `pip install jsonref` or `pip install -e .` |
| Chinese podcast opens/closes in English | Default English voices used or `--language zh` not set | Add `--language zh --role-tts-voices zh-CN-YunjianNeural ...` |