Real-time audio transcription & AI summarization platform
ScribeAI is a production-ready web application that streams live audio, transcribes speech using Whisper AI (local, offline), and generates intelligent summaries with OpenAI GPT-4o-mini. Built with Next.js, Socket.io, and PostgreSQL for scalable real-time processing.
- Node.js 18+ and npm/pnpm
- PostgreSQL database (local or cloud)
- OpenAI API Key (pay-as-you-go, no quota limits)
```bash
# Clone the repository
git clone https://github.com/Jasonwill2004/ScribeAI.git
cd ScribeAI

# Install dependencies (monorepo with Turborepo)
npm install

# Set up environment variables
cp .env.example .env
# Edit .env and add:
# DATABASE_URL="postgresql://user:password@localhost:5432/scribeai"
# OPENAI_API_KEY="your-openai-api-key"

# Run database migrations
npx prisma migrate dev

# Start development servers (web + api-socket)
npm run dev
```

- Open http://localhost:3000
- Click "New Session" to start recording
- Allow microphone access
- Speak into your microphone
- Click "End Session" to generate AI summary
- View your session on the Sessions page
ScribeAI uses a chunked streaming architecture for real-time transcription with minimal latency:
```mermaid
graph LR
    A[Client Browser] -->|MediaRecorder| B[Socket.io WebSocket]
    B -->|Audio Chunks| C[API Socket Server]
    C -->|WebM Opus| D[Whisper Base Model]
    D -->|Text Chunks| E[PostgreSQL DB]
    E -->|Aggregated Transcript| F[OpenAI GPT-4o-mini]
    F -->|Summary + Insights| E
    E -->|Session Data| G[Next.js App Router]
    G -->|SSR/API Routes| A
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style D fill:#e8f5e9
    style F fill:#f3e5f5
    style E fill:#ffe0b2
```
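The capture side of this pipeline is small. Below is a minimal TypeScript sketch of the browser loop; the `start_session`/`audio_chunk` event names come from the Key Files section further down, while the server URL and the payload shape are assumptions for illustration, not the exact wire format.

```ts
import { io } from "socket.io-client";

const socket = io("http://localhost:4001"); // assumed socket server URL

export async function startRecording(sessionId: string): Promise<MediaRecorder> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, {
    mimeType: "audio/webm;codecs=opus", // the WebM/Opus format shown in the diagram
  });

  socket.emit("start_session", { sessionId });

  // MediaRecorder fires ondataavailable once per timeslice, so audio
  // streams as bounded chunks instead of one giant end-of-session blob.
  recorder.ondataavailable = async (event: BlobEvent) => {
    socket.emit("audio_chunk", {
      sessionId,
      chunk: await event.data.arrayBuffer(), // payload shape is illustrative
    });
  };

  recorder.start(30_000); // 30-second chunks, matching the pipeline described below
  return recorder;
}
```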
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 14 App Router | React SSR, server components, API routes |
| Real-time | Socket.io | Bidirectional WebSocket for audio streaming |
| Transcription | Whisper Base (@xenova/transformers) | Local, offline speech-to-text (~150MB model) |
| AI Summary | OpenAI GPT-4o-mini | Natural language understanding, key insights |
| Database | PostgreSQL + Prisma ORM | Session persistence, relational data |
| Monorepo | Turborepo | Unified build system for apps/web + apps/api-socket |
| Approach | Cost | Latency | Accuracy | Offline | ScribeAI Choice |
|---|---|---|---|---|---|
| Whisper Base (Local) | FREE | ~200ms | Good (85-90%) | ✅ Yes | ✅ Selected |
| Whisper Large v3 | FREE | ~500ms | Excellent (95%+) | ✅ Yes | ❌ Too heavy (3GB) |
| OpenAI Whisper API | $0.006/min | ~100ms | Excellent | ❌ No | ❌ Recurring cost |
| Google Speech-to-Text | $0.016/min | ~50ms | Excellent | ❌ No | ❌ Expensive at scale |
| AWS Transcribe | $0.024/min | ~100ms | Excellent | ❌ No | ❌ Most expensive |
Why Whisper Base? Balance of speed, accuracy, and zero cost. Runs entirely client-side or server-side without external API dependencies.
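For reference, local transcription with @xenova/transformers reduces to a few lines. A minimal sketch, assuming the WebM/Opus chunk has already been decoded to mono 16 kHz PCM (the decoding step is not shown here):

```ts
import { pipeline } from "@xenova/transformers";

// Downloads ~150 MB of weights on first run, then serves from the local
// cache; no network calls are needed afterwards.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-base"
);

export async function transcribeChunk(samples: Float32Array): Promise<string> {
  // Whisper expects mono PCM at 16 kHz; decoded chunk input keeps memory flat.
  const result = await transcriber(samples);
  return (result as { text: string }).text;
}
```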
| Approach | Cost | Quality | Streaming | ScribeAI Choice |
|---|---|---|---|---|
| OpenAI GPT-4o-mini | $0.15/1M in + $0.60/1M out | Excellent | ✅ Yes | ✅ Selected |
| Google Gemini 1.5 Flash | FREE (15 RPM) | Excellent | ✅ Yes | ❌ Quota limits |
| OpenAI GPT-4o | $5/1M tokens | Excellent | ✅ Yes | ❌ More expensive |
| Claude 3.5 Sonnet | $3/1M tokens | Excellent | ✅ Yes | ❌ More expensive |
| Llama 3 (Local) | FREE | Good | ❌ No | ❌ Resource intensive |
Why GPT-4o-mini? Pay-as-you-go pricing with no free-tier quota caps, fast response times (~1-2s), excellent natural language understanding, and JSON mode for structured output.
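A minimal sketch of the summarization call using JSON mode. The prompt and the output fields shown here are hypothetical; the actual ones live in `apps/api-socket/src/lib/summary.ts`.

```ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical output shape; the real fields are defined in lib/summary.ts.
interface SummaryResult {
  summary: string;
  keyPoints: string[];
  actionItems: string[];
}

export async function summarize(transcript: string): Promise<SummaryResult> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    // JSON mode guarantees the reply parses as a single JSON object.
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Summarize the meeting transcript. Respond with JSON: " +
          '{"summary": string, "keyPoints": string[], "actionItems": string[]}',
      },
      { role: "user", content: transcript },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}
```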
Resilient Fallback Design: ScribeAI implements graceful degradation when the OpenAI API is temporarily unavailable (network issues, etc.). In these cases:
- ✅ Session is always saved with full transcript chunks
- ✅ A fallback summary is generated showing a transcript preview
- ✅ The UI displays an "AI Unavailable - Fallback Summary" badge
- ✅ Users can still download and review raw transcripts
- ✅ No data loss occurs; summaries can be regenerated later
This design keeps core transcription fully functional even when the OpenAI API fails: AI summarization is an enhancement layer, not a hard dependency.
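One way to express that layering is a thin wrapper around the `summarize` helper sketched above; this is illustrative, not the repo's exact code. Chunk persistence has already happened by this point, so the catch branch never loses data.

```ts
// Hypothetical wrapper illustrating the graceful-degradation path.
export async function summarizeWithFallback(transcript: string) {
  try {
    return { fallback: false, ...(await summarize(transcript)) };
  } catch (err) {
    console.error("OpenAI unavailable, storing fallback summary", err);
    return {
      fallback: true, // rendered as the "AI Unavailable - Fallback Summary" badge
      summary: transcript.slice(0, 500) + "…", // transcript preview
      keyPoints: [],
      actionItems: [],
    };
  }
}
```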
ScribeAI is architected for horizontal scalability with session-based isolation:
1. Chunked Streaming Pipeline
Audio is processed in 30-second chunks (configurable) rather than waiting for the entire recording. This enables:
- Memory efficiency: Fixed ~150MB RAM per session (Whisper model)
- Progressive UX: Users see transcripts appear in real-time
- Fault tolerance: Failed chunks don't invalidate the entire session
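The server side of this pipeline might look like the sketch below. The `audio_chunk` event name and the `TranscriptChunk` model come from the Key Files section; `decodeWebmToPcm`, the field names, and the handler body are assumptions for illustration.

```ts
import { Server } from "socket.io";
import { PrismaClient } from "@prisma/client";
import { transcribeChunk } from "./transcribe"; // the Whisper sketch above (path assumed)

// Assumed helper: decode a WebM/Opus chunk to mono 16 kHz PCM.
declare function decodeWebmToPcm(chunk: ArrayBuffer): Promise<Float32Array>;

const prisma = new PrismaClient();

export function registerAudioHandlers(io: Server) {
  io.on("connection", (socket) => {
    // Each chunk is transcribed and persisted independently, so one bad
    // chunk is logged and skipped without invalidating the session.
    socket.on("audio_chunk", async ({ sessionId, chunk, index }) => {
      try {
        const text = await transcribeChunk(await decodeWebmToPcm(chunk));
        await prisma.transcriptChunk.create({
          data: { sessionId, index, text }, // field names assumed
        });
      } catch (err) {
        console.error(`chunk ${index} failed for session ${sessionId}`, err);
      }
    });
  });
}
```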
2. Stateless Session Management
Each recording session is isolated in PostgreSQL with a unique sessionId. The Socket.io server maintains no in-memory state beyond active connections, allowing:
- Horizontal scaling: Multiple API servers behind a load balancer
- Session recovery: Clients reconnect using `sessionId` to resume
- Database-driven state: All progress persisted (recording → paused → processing → completed)
3. Async Summary Generation
OpenAI API calls happen asynchronously after session end, preventing blocking:
- Non-blocking: User can close the browser while summarization runs
- Event-driven: Socket.io emits a `completed` event when the summary finishes
- Retry logic: Failed summaries can be regenerated without re-transcribing
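Continuing the connection handler sketched earlier, the end-of-session flow might fire-and-forget the summarization, under the same assumptions (hypothetical field names; rooms keyed by `sessionId`):

```ts
socket.on("end_session", async ({ sessionId }) => {
  // Acknowledge immediately; summarization continues in the background,
  // so the client may disconnect without losing anything.
  await prisma.session.update({
    where: { id: sessionId },
    data: { status: "processing" }, // recording → paused → processing → completed
  });

  void (async () => {
    const chunks = await prisma.transcriptChunk.findMany({
      where: { sessionId },
      orderBy: { index: "asc" },
    });
    const result = await summarizeWithFallback(chunks.map((c) => c.text).join(" "));
    await prisma.session.update({
      where: { id: sessionId },
      data: { status: "completed" },
    });
    // If the client has already disconnected, the state still lives in
    // PostgreSQL and the Sessions page picks it up on the next load.
    io.to(sessionId).emit("completed", { sessionId, ...result });
  })();
});
```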
Projected Capacity (Single 4-core server):
- Concurrent sessions: ~20-30 (limited by Whisper CPU usage)
- Database load: <100 sessions/sec write throughput
- Cost at 1000 users/day: ~$3-5 (GPT-4o-mini: $0.15/1M input tokens, avg ~500 tokens/summary)
Bottlenecks to Monitor:
- Whisper CPU usage (solution: GPU acceleration or cloud Whisper API)
- OpenAI rate limits (solution: tier-based limits, typically 500 RPM on tier 1)
- PostgreSQL connections (solution: PgBouncer pooling)
- Start: Click "New Session" → allow mic access
- Record: Speak naturally (app auto-chunks every 30s)
- Pause/Resume: Control recording with toolbar buttons
- End: Click "End Session" to trigger AI summary
- Sessions Page: Grid of all sessions with state badges
- Session Detail: Full transcript with timestamps + AI insights
- Download: Export transcript as formatted TXT file
Use the included script to generate sample sessions:
```bash
./scripts/record-demo.sh
```

```
ScribeAI/
├── apps/
│   ├── web/                        # Next.js frontend
│   │   ├── app/
│   │   │   ├── sessions/
│   │   │   │   ├── page.tsx        # Sessions list
│   │   │   │   └── [id]/page.tsx   # Session detail
│   │   │   └── api/
│   │   │       └── download/[sessionId]/route.ts
│   │   └── lib/
│   │       └── auth.ts             # Auth helpers (Better Auth TODO)
│   └── api-socket/                 # Socket.io server
│       ├── src/
│       │   ├── socket.ts           # Socket event handlers
│       │   ├── summary/
│       │   │   └── processor.ts    # Summary aggregation logic
│       │   └── lib/
│       │       └── summary.ts      # OpenAI API integration
│       └── package.json
├── packages/
│   └── database/                   # Prisma schema & migrations
│       └── prisma/
│           └── schema.prisma
├── turbo.json                      # Turborepo config
└── README.md
```
Frontend
- `apps/web/app/sessions/page.tsx`: Server component fetching sessions with Prisma
- `apps/web/app/sessions/[id]/page.tsx`: Detail view with transcript + summary
- `apps/web/app/api/download/[sessionId]/route.ts`: TXT export endpoint
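The TXT export endpoint maps naturally onto a Next.js route handler. A minimal sketch, with Prisma field names assumed rather than taken from the repo's schema:

```ts
// apps/web/app/api/download/[sessionId]/route.ts (illustrative sketch)
import { NextResponse } from "next/server";
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function GET(
  _req: Request,
  { params }: { params: { sessionId: string } }
) {
  const chunks = await prisma.transcriptChunk.findMany({
    where: { sessionId: params.sessionId },
    orderBy: { index: "asc" }, // field names assumed
  });
  const body = chunks.map((c) => c.text).join("\n");
  return new NextResponse(body, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Content-Disposition": `attachment; filename="session-${params.sessionId}.txt"`,
    },
  });
}
```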
Backend
- `apps/api-socket/src/socket.ts`: Socket.io events (`start_session`, `audio_chunk`, `end_session`)
- `apps/api-socket/src/summary/processor.ts`: Core logic for aggregating the transcript & generating the summary
- `apps/api-socket/src/lib/summary.ts`: OpenAI API wrapper
Database
- `packages/database/prisma/schema.prisma`: Models: `User`, `Session`, `TranscriptChunk`, `Summary`
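A condensed sketch of what those models plausibly look like (`User` omitted for brevity). The model names and the session state machine match this README; everything else is an assumption, not the repo's exact schema:

```prisma
enum SessionStatus {
  recording
  paused
  processing
  completed
}

model Session {
  id        String            @id @default(cuid())
  status    SessionStatus     @default(recording)
  createdAt DateTime          @default(now())
  chunks    TranscriptChunk[]
  summary   Summary?
}

model TranscriptChunk {
  id        String  @id @default(cuid())
  sessionId String
  index     Int     // (assumed) chunk ordering
  text      String
  session   Session @relation(fields: [sessionId], references: [id])
}

model Summary {
  id        String  @id @default(cuid())
  sessionId String  @unique
  content   String
  fallback  Boolean @default(false) // (assumed) marks fallback summaries
  session   Session @relation(fields: [sessionId], references: [id])
}
```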
```bash
# Development
npm run dev          # Start all apps (web + api-socket)
npm run dev:web      # Start Next.js only
npm run dev:api      # Start Socket.io only

# Build
npm run build        # Build all apps
npm run build:web    # Build Next.js
npm run build:api    # Build Socket.io

# Database
npx prisma migrate dev   # Apply migrations
npx prisma studio        # Visual DB browser
npx prisma generate      # Regenerate Prisma Client

# Testing
npm run test             # Run tests (TODO: Add tests)
npm run test:summary     # Test OpenAI summarization
```

```bash
# Database
DATABASE_URL="postgresql://user:password@localhost:5432/scribeai"

# AI Services
OPENAI_API_KEY="sk-..."  # Get from https://platform.openai.com/api-keys

# Better Auth (TODO)
BETTER_AUTH_SECRET="your-secret-key"
BETTER_AUTH_URL="http://localhost:3000"
```

- PR #5: MediaRecorder → Socket.io audio streaming
- PR #6: Whisper Base local transcription
- PR #7: OpenAI GPT-4o-mini summarization
- PR #8: Session list/detail pages + README
- Better Auth integration (email/password + OAuth)
- Real-time transcript display during recording
- WebSocket heartbeat monitoring
- Export to PDF/JSON formats
- Speaker diarization (multi-speaker support)
- Custom summary templates
- Mobile app (React Native)
- Self-hosted deployment guides (Docker)
MIT License - see LICENSE for details.
- Whisper AI: OpenAI's state-of-the-art speech recognition
- Xenova Transformers: Lightweight Whisper.js implementation
- OpenAI GPT-4o-mini: Language model powering session summaries
- Next.js: React framework for production-grade apps
- Socket.io: Reliable WebSocket library
Built with ❤️ by Jason William

GitHub • Issues • Pull Requests

---
AI-powered audio transcription and meeting summarization tool built with Next.js, Socket.io, and Google Gemini API.
- Real-time Audio Transcription: Capture and transcribe audio from microphone or shared meeting tabs (Google Meet/Zoom)
- Live Streaming: Stream audio chunks to Gemini API for incremental transcription
- Session Management: Record, pause, resume, and stop recording sessions
- AI Summaries: Generate meeting summaries with key points, action items, and decisions
- Long-duration Support: Architected for 1+ hour recording sessions with chunked streaming
- Real-time Updates: Socket.io integration for live status updates
- Dark Mode: Full dark mode support for extended sessions
- Frontend/Backend: Next.js 14+ (App Router, TypeScript)
- Real-time Communication: Socket.io
- Database: PostgreSQL via Prisma ORM
- AI/ML: Google Gemini API
- Authentication: Better Auth (planned)
- Styling: Tailwind CSS
- Code Quality: ESLint, Prettier
```
ScribeAI/
├── apps/
│   ├── web/                        # Next.js frontend application
│   │   ├── app/
│   │   │   ├── components/         # React components
│   │   │   ├── providers/          # Context providers
│   │   │   ├── layout.tsx          # Root layout with dark mode
│   │   │   ├── page.tsx            # Home page
│   │   │   └── globals.css         # Global styles
│   │   ├── next.config.js
│   │   ├── tailwind.config.cjs
│   │   └── package.json
│   └── api-socket/                 # Node.js Socket.io server
│       ├── src/
│       │   ├── index.ts            # Express server setup
│       │   └── socket.ts           # Socket event handlers
│       ├── tsconfig.json
│       └── package.json
├── docker-compose.yml              # PostgreSQL container
├── .eslintrc.cjs                   # ESLint configuration
├── .prettierrc                     # Prettier configuration
├── .env.example                    # Environment variables template
└── package.json                    # Root workspace configuration
```
- Node.js 18+ and npm
- Docker and Docker Compose (for PostgreSQL)
- Google Gemini API key (Get it here)
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd ScribeAI
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Set up environment variables

   ```bash
   cp .env.example .env
   ```

   Edit `.env` and add your Gemini API key:

   ```bash
   GEMINI_API_KEY=your_actual_api_key_here
   ```

4. Start PostgreSQL with Docker

   ```bash
   docker-compose up -d
   ```

   Verify it's running:

   ```bash
   docker-compose ps
   ```

5. Run the development servers

   ```bash
   # Start both Next.js and Socket.io servers concurrently
   npm run dev

   # Or run them separately:
   npm run dev:app     # Next.js on http://localhost:3000
   npm run dev:socket  # Socket.io on http://localhost:4001
   ```

6. Access the application

   - Frontend: http://localhost:3000
   - Socket.io server: http://localhost:4001
   - Health check: http://localhost:4001/health
- `npm run dev` - Start both app and socket server concurrently
- `npm run dev:app` - Start Next.js development server only
- `npm run dev:socket` - Start Socket.io server only
- `npm run build` - Build all workspaces
- `npm run start` - Start production servers
- `npm run lint` - Lint all workspaces
- `npm run format` - Format code with Prettier
- `npm run format:check` - Check code formatting

- `npm run dev --workspace=apps/web` - Start Next.js dev server
- `npm run build --workspace=apps/web` - Build Next.js app
- `npm run start --workspace=apps/web` - Start Next.js production server

- `npm run dev --workspace=apps/api-socket` - Start Socket.io dev server
- `npm run build --workspace=apps/api-socket` - Build TypeScript to JavaScript
- `npm run start --workspace=apps/api-socket` - Start production server
- `session:start` - Initialize a new recording session
- `audio:chunk` - Send an audio data chunk
- `session:pause` - Pause the current session
- `session:resume` - Resume a paused session
- `session:stop` - Stop and process the session

- `session:status` - Session state updates (recording, paused, processing, completed)
- `audio:received` - Acknowledge audio chunk receipt
- `error` - Error notifications
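A minimal client exchange using these events. The payload shapes are assumptions; only the event names above come from this README:

```ts
import { io } from "socket.io-client";

const socket = io("http://localhost:4001");

// Listen for server → client events.
socket.on("session:status", (status) => console.log("state:", status));
socket.on("audio:received", (ack) => console.log("chunk acked:", ack));
socket.on("error", (err) => console.error(err));

// Drive a session with client → server events.
socket.emit("session:start", { title: "Weekly sync" }); // payload shape assumed
socket.emit("audio:chunk", { data: new ArrayBuffer(0) }); // placeholder chunk
socket.emit("session:stop");
```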
```bash
# Start PostgreSQL
docker-compose up -d

# Stop PostgreSQL
docker-compose down

# View logs
docker-compose logs -f postgres

# Remove volumes (⚠️ deletes all data)
docker-compose down -v
```

- Make changes to code
- Hot reload will update automatically
- Check console for errors
- Format code before committing with `npm run format`
- TypeScript: Strict type checking enabled
- ESLint: Configured for TypeScript and React best practices
- Prettier: Consistent code formatting
- JSDoc: Inline documentation for functions and components
- Prisma ORM integration
- Better Auth authentication
- Gemini API transcription integration
- Session history dashboard
- Audio chunk streaming implementation
- Meeting summary generation
- Export transcripts (PDF, TXT, JSON)
- Multi-speaker diarization
- WebRTC implementation for tab sharing
- Unit and integration tests
- Create a feature branch
- Make your changes
- Run linting and formatting
- Submit a pull request
This project is part of the AttackCapital assignment.
Built with ❤️ for productivity professionals