Headless bot that joins a Daily.co room, records all participants, and produces a composed MP4 with dynamic layout switching — without using Daily's cloud recording API.
| File | Description |
|---|---|
| bot_recorder.py | Approach 1: PNG-based recorder. Saves each video frame as a PNG file during recording, then encodes to MP4 in post-processing via the ffmpeg concat demuxer. High disk usage (~2 GB per participant for 8 min), slower composition. |
| bot_recorder_v2.py | Approach 2: Direct MP4 encoder. Encodes video frames directly to MP4 in real time via PyAV. No intermediate PNG files. Low disk usage (~10–15 MB per participant), faster composition. Recommended. |
| compose_video.py | Standalone composition script. Use it to retry or recompose from saved session files without re-running the bot. Works with output from both approaches. |
| create_room.py | Creates a Daily room and prints the URL. |
During recording, each RGBA video frame is saved as a PNG file named by its capture timestamp in microseconds (000000033411.png, etc.). After the session ends:
- Per-participant PNGs → MP4 via ffmpeg concat demuxer (exact variable framerate from timestamps)
- Per-participant WAV files written from raw PCM
- Audio mixed with per-track silence padding for timeline alignment
- Timeline windows computed from join/leave timestamps
- Segments encoded (one per layout window) via ffmpeg -ss/-t
- Segments concatenated into final MP4 with mixed audio
- Uploaded to S3
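
The first step above (timestamp-named PNGs fed to the concat demuxer) can be sketched as follows. This is a minimal illustration, not the bot's actual code: `build_concat_list` and the fallback duration for the last frame are assumptions. It only relies on the concat demuxer's `file` and `duration` directives.

```python
from pathlib import Path


def build_concat_list(frame_names, last_frame_duration_s=1 / 30):
    """Build concat-demuxer text for PNGs named by capture time in microseconds.

    Hypothetical helper: each frame's duration is the gap to the next frame,
    which is what preserves the variable framerate.
    """
    frames = sorted(frame_names, key=lambda n: int(Path(n).stem))
    stamps = [int(Path(n).stem) for n in frames]
    lines = ["ffconcat version 1.0"]
    for i, name in enumerate(frames):
        if i + 1 < len(frames):
            duration = (stamps[i + 1] - stamps[i]) / 1_000_000
        else:
            # Assumption: the last frame has no successor, so use a nominal duration.
            duration = last_frame_duration_s
        lines.append(f"file '{name}'")
        lines.append(f"duration {duration:.6f}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_concat_list(["000000066822.png", "000000033411.png"]))
```

The resulting text would be written to a list file next to the frames and passed to ffmpeg with `-f concat -i <list file>`.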
~2.2 GB per participant for an 8-minute session
~4.5 GB total for a 2-participant session
PNG files are kept in recordings/<session_id>/<name>_frames/ and are never deleted.
# Create a room
python create_room.py
# Start the bot
python bot_recorder.py --room-url "https://yourapp.daily.co/<room-name>"

Frames are encoded directly into an H.264 MP4 container in real time using PyAV as they arrive from Daily's on_video_frame callback. No intermediate files written during recording.
Timebase is 1/90000 (standard H.264/MP4). Each frame's pts is:
pts = int(elapsed_us * 90000 / 1_000_000)
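
As a small worked example of the formula above, the sketch below wraps it in a function and converts a pts back to seconds. `frame_pts` and `TIME_BASE` are illustrative names, not taken from the bot's source.

```python
from fractions import Fraction

TIME_BASE = Fraction(1, 90000)  # the 1/90000 timebase stated above


def frame_pts(elapsed_us: int) -> int:
    """Convert microseconds since the first frame into 90 kHz ticks."""
    return int(elapsed_us * 90000 / 1_000_000)


if __name__ == "__main__":
    print(frame_pts(33_411))                 # 3006
    print(frame_pts(1_000_000))              # 90000
    print(frame_pts(1_000_000) * TIME_BASE)  # 1 (second)
```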
After close_video(), the raw MP4 is re-encoded to CFR 30fps via normalize_video() — this is required because PyAV produces a VFR stream (r_frame_rate=90000/1) that confuses ffmpeg's seek, causing frozen frames in composition. After normalization all downstream ffmpeg operations work correctly.
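
The body of normalize_video() is not shown in this README. A minimal sketch, assuming it shells out to ffmpeg with the same flags as the manual re-encode command documented for older recordings, might look like this. `build_normalize_cmd` and `run_normalize` are hypothetical names.

```python
import subprocess


def build_normalize_cmd(src: str, dst: str, fps: int = 30) -> list:
    """Return an ffmpeg argument list that re-encodes src to constant-framerate H.264."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264", "-preset", "fast", "-crf", "18",
        "-pix_fmt", "yuv420p", "-vf", f"fps={fps}",
        dst,
    ]


def run_normalize(src: str, dst: str) -> None:
    """Run the re-encode; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_normalize_cmd(src, dst), check=True)
```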
The post-processing pipeline is the same as Approach 1 from step 2 (writing the per-participant WAV files) onwards.
~10–15 MB per participant for an 8-minute session
~25–30 MB total for a 2-participant session
# Create a room
python create_room.py
# Start the bot
python bot_recorder_v2.py --room-url "https://yourapp.daily.co/<room-name>"

Step 1 done in Xs ← close + normalize per-participant videos
Composition done in Xs ← segments + concat
DONE — total post-processing: Xs
Both approaches produce a composed MP4 where the layout reflects exactly who was present at each moment:
| Participants present | Layout |
|---|---|
| 1 | Full-width (OUTPUT_WIDTH × OUTPUT_HEIGHT) |
| 2 | Side-by-side hstack (OUTPUT_WIDTH/2 each) |
| 3+ | Equal-width hstack |
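
One way to express the layout table as an ffmpeg filter graph is sketched below, using ffmpeg's `scale` and `hstack` filters. `build_layout_filter` is a hypothetical helper, and the defaults of 1280 and 720 are assumed stand-ins for OUTPUT_WIDTH and OUTPUT_HEIGHT. The bot's real filter graph may differ.

```python
def build_layout_filter(n: int, width: int = 1280, height: int = 720) -> str:
    """Return a filter_complex string that tiles n inputs side by side."""
    if n < 1:
        raise ValueError("need at least one participant")
    if n == 1:
        return f"[0:v]scale={width}:{height}[out]"
    tile_w = width // n
    tile_w -= tile_w % 2  # keep tile widths even, as yuv420p encoding requires
    scaled = [f"[{i}:v]scale={tile_w}:{height}[v{i}]" for i in range(n)]
    inputs = "".join(f"[v{i}]" for i in range(n))
    return ";".join(scaled) + f";{inputs}hstack=inputs={n}[out]"


if __name__ == "__main__":
    print(build_layout_filter(2))
    # [0:v]scale=640:720[v0];[1:v]scale=640:720[v1];[v0][v1]hstack=inputs=2[out]
```

Note that with this rounding the total width for three tiles is 1278 rather than 1280; a real implementation would need to pad or distribute the remainder.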
Example (Aditya joins at T=0, iPhone joins at T=40s, iPhone leaves at T=55s, session ends T=65s):
T=0 → T=40 : Aditya full-width
T=40 → T=55 : Aditya | iPhone (side-by-side)
T=55 → T=65 : Aditya full-width
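
The window computation behind this example can be sketched as below. This is an illustrative version under the assumption that windows are cut at every join and leave time; `layout_windows` is a hypothetical name, not the bot's function.

```python
def layout_windows(participants, session_end_s):
    """Split the session into windows with a constant set of participants.

    participants: list of (name, join_s, leave_s); leave_s may be None for
    someone still present when the session ends.
    Returns a list of (start_s, end_s, [names present]).
    """
    clipped = [
        (name, join_s, session_end_s if leave_s is None else min(leave_s, session_end_s))
        for name, join_s, leave_s in participants
    ]
    # Every join or leave is a potential layout change.
    cuts = sorted({t for _, join_s, leave_s in clipped for t in (join_s, leave_s)})
    windows = []
    for start, end in zip(cuts, cuts[1:]):
        present = [name for name, join_s, leave_s in clipped
                   if join_s <= start and end <= leave_s]
        if present:  # skip gaps where nobody is in the room
            windows.append((start, end, present))
    return windows


if __name__ == "__main__":
    for window in layout_windows([("Aditya", 0, None), ("iPhone", 40, 55)], 65):
        print(window)
    # (0, 40, ['Aditya'])
    # (40, 55, ['Aditya', 'iPhone'])
    # (55, 65, ['Aditya'])
```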
All files are kept in recordings/<session_id>/ and never deleted:
recordings/<session_id>/
Aditya_video.mp4 per-participant video
Aditya_audio.wav per-participant audio
Iphone_video.mp4
Iphone_audio.wav
Iphone_audio_padded.wav silence-padded (if iPhone joined later than Aditya)
mixed.wav merged stereo audio (aligned to video timeline)
session.json timeline metadata — join/leave times, file paths
<session_id>_recording.mp4 final composed output
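
The silence-padded track in the listing above could be produced along these lines with Python's standard `wave` module. This is a sketch under two assumptions: the audio is signed PCM (for example 16-bit, where zero bytes are silence), and the offset is the participant's join time relative to the start of the timeline. `pad_with_silence` is a hypothetical helper.

```python
import wave


def pad_with_silence(src_path, dst_path, offset_s):
    """Write a copy of src_path with offset_s seconds of silence prepended.

    Returns the number of silent frames added.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    pad_frames = int(round(offset_s * params.framerate))
    # Zero bytes are silence for signed PCM; 8-bit unsigned WAV would need 0x80.
    silence = b"\x00" * (pad_frames * params.nchannels * params.sampwidth)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed up on close
        dst.writeframes(silence + frames)
    return pad_frames
```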
For quick review in this repository, one representative composed output is committed at:
recordings/20260318_132416/20260318_132416_recording.mp4
All other recordings/* content remains ignored to avoid committing large generated artifacts.
session.json example:
{
  "session_id": "20260318_132416",
  "session_dir": "/path/to/recordings/20260318_132416",
  "participants": [
    {
      "name": "Aditya",
      "join_time_s": 4.46,
      "leave_time_s": 54.27,
      "video_path": "...",
      "audio_path": "..."
    },
    {
      "name": "iPhone",
      "join_time_s": 39.01,
      "leave_time_s": 51.49,
      "video_path": "...",
      "audio_path": "..."
    }
  ]
}

If the bot ran but composition failed, or you want to recompose with different settings:
python compose_video.py --session recordings/<session_id>/session.json

Output: recordings/<session_id>/<session_id>_composed.mp4
python compose_video.py \
--p1-video recordings/<session_id>/Aditya_video.mp4 \
--p1-name Aditya \
--p1-join 4.46 \
--p1-leave 54.27 \
--p1-audio recordings/<session_id>/Aditya_audio.wav \
--p2-video recordings/<session_id>/Iphone_video.mp4 \
--p2-name iPhone \
--p2-join 39.01 \
--p2-leave 51.49 \
--p2-audio recordings/<session_id>/Iphone_audio.wav \
--output recordings/<session_id>/output.mp4

Join/leave times come from session.json.
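
To avoid copying those values by hand, the flags can be derived from session.json. The sketch below assumes the JSON layout shown in the example and only covers the two-participant flags documented here (`--p1-*`, `--p2-*`). `manual_args` is a hypothetical helper, not part of compose_video.py.

```python
import json


def manual_args(session_json_text):
    """Build the per-participant compose_video.py flags from session.json text."""
    session = json.loads(session_json_text)
    args = []
    # Only the first two participants: flags beyond --p2-* are not documented.
    for i, p in enumerate(session["participants"][:2], start=1):
        args += [
            f"--p{i}-video", p["video_path"],
            f"--p{i}-name", p["name"],
            f"--p{i}-join", str(p["join_time_s"]),
            f"--p{i}-leave", str(p["leave_time_s"]),
            f"--p{i}-audio", p["audio_path"],
        ]
    return args
```

The returned list can be appended to `["python", "compose_video.py"]` together with an `--output` path.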
Videos recorded with an older bot version (before normalize_video()) will have r_frame_rate=90000/1. Re-encode them first:
ffmpeg -y -i recordings/<session_id>/Aditya_video.mp4 \
-c:v libx264 -preset fast -crf 18 -pix_fmt yuv420p -vf "fps=30" \
recordings/<session_id>/Aditya_video_fixed.mp4
ffmpeg -y -i recordings/<session_id>/Iphone_video.mp4 \
-c:v libx264 -preset fast -crf 18 -pix_fmt yuv420p -vf "fps=30" \
recordings/<session_id>/Iphone_video_fixed.mp4

Then pass the _fixed.mp4 files to compose_video.py.
pip install -r requirements.txt
brew install ffmpeg # or: apt install ffmpeg

.env:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_REGION=ap-south-1
S3_BUCKET=your-bucket
BASE_CDN_URL=https://your-cdn.com # optional
RECORDINGS_DIR=./recordings # optional, default: ./recordings
OUTPUT_WIDTH=1280 # optional
OUTPUT_HEIGHT=720 # optional
LOG_LEVEL=INFO # set to DEBUG for verbose timeline/ffmpeg output
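
For orientation, reading the optional settings might look like the sketch below. Only the `./recordings` default is stated in this README; treating 1280, 720, and INFO as defaults is an assumption, and `load_settings` is a hypothetical helper rather than the bot's own loader.

```python
import os


def load_settings(env=None):
    """Read the optional settings, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return {
        "recordings_dir": env.get("RECORDINGS_DIR", "./recordings"),  # documented default
        "output_width": int(env.get("OUTPUT_WIDTH", "1280")),         # assumed default
        "output_height": int(env.get("OUTPUT_HEIGHT", "720")),        # assumed default
        "log_level": env.get("LOG_LEVEL", "INFO"),                    # assumed default
        "base_cdn_url": env.get("BASE_CDN_URL"),                      # None when unset
    }
```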
For an 8-minute session with 2 participants (Daily pricing, 10k–100k minute tier at $0.004/participant-min):
| Component | Daily cloud recording | Bot solution |
|---|---|---|
| Participant minutes (2 users) | 2 × 8 × $0.004 = $0.064 | 2 × 8 × $0.004 = $0.064 |
| Bot participant minutes | — | 1 × 8 × $0.004 = $0.032 |
| Cloud recording | 8 × $0.01349 = $0.108 | $0.00 |
| Modal compute | — | ~$0.013 |
| Total | $0.172 | ~$0.109 |
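The bot solution is ~37% cheaper per session. At 10,000 sessions/month that's ~$630/month saved. The saving comes from eliminating Daily's cloud recording charge ($0.108 per session), partly offset by the bot's own participant minutes and compute (~$0.045 per session).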
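
The arithmetic in the table can be reproduced with a few lines of Python. The rates are the ones quoted above; treating the Modal compute figure as a fixed per-session cost is an assumption, and `session_costs` is an illustrative name.

```python
RATE_PER_PARTICIPANT_MIN = 0.004   # Daily, 10k-100k minute tier
CLOUD_RECORDING_PER_MIN = 0.01349  # Daily cloud recording
MODAL_COMPUTE_PER_SESSION = 0.013  # approximate, for an 8-minute session


def session_costs(minutes=8, users=2):
    """Return (cloud_recording_total, bot_total) in dollars for one session."""
    participants = users * minutes * RATE_PER_PARTICIPANT_MIN
    cloud = participants + minutes * CLOUD_RECORDING_PER_MIN
    bot = participants + minutes * RATE_PER_PARTICIPANT_MIN + MODAL_COMPUTE_PER_SESSION
    return cloud, bot


if __name__ == "__main__":
    cloud, bot = session_costs()
    print(f"cloud ${cloud:.3f}, bot ${bot:.3f}")                 # cloud $0.172, bot $0.109
    print(f"saving {100 * (cloud - bot) / cloud:.0f}%")          # saving 37%
    print(f"per 10,000 sessions ${10_000 * (cloud - bot):.0f}")  # per 10,000 sessions $629
```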