
ScribeAI 🎙️

Real-time audio transcription & AI summarization platform

ScribeAI is a production-ready web application that streams live audio, transcribes speech using Whisper AI (local, offline), and generates intelligent summaries with OpenAI GPT-4o-mini. Built with Next.js, Socket.io, and PostgreSQL for scalable real-time processing.


🚀 Quick Start

Prerequisites

  • Node.js 18+ and npm/pnpm
  • PostgreSQL database (local or cloud)
  • OpenAI API Key (pay-as-you-go, no quota limits)

Installation

# Clone the repository
git clone https://github.com/Jasonwill2004/ScribeAI.git
cd ScribeAI

# Install dependencies (monorepo with Turborepo)
npm install

# Set up environment variables
cp .env.example .env
# Edit .env and add:
#   DATABASE_URL="postgresql://user:password@localhost:5432/scribeai"
#   OPENAI_API_KEY="your-openai-api-key"

# Run database migrations
npx prisma migrate dev

# Start development servers (web + api-socket)
npm run dev

First Recording

  1. Open http://localhost:3000
  2. Click "New Session" to start recording
  3. Allow microphone access
  4. Speak into your microphone
  5. Click "End Session" to generate AI summary
  6. View your session on the Sessions page

πŸ—οΈ Architecture

ScribeAI uses a chunked streaming architecture for real-time transcription with minimal latency:

graph LR
    A[Client Browser] -->|MediaRecorder| B[Socket.io WebSocket]
    B -->|Audio Chunks| C[API Socket Server]
    C -->|WebM Opus| D[Whisper Base Model]
    D -->|Text Chunks| E[PostgreSQL DB]
    E -->|Aggregated Transcript| F[OpenAI GPT-4o-mini]
    F -->|Summary + Insights| E
    E -->|Session Data| G[Next.js App Router]
    G -->|SSR/API Routes| A
    
    style A fill:#e1f5ff
    style C fill:#fff4e1
    style D fill:#e8f5e9
    style F fill:#f3e5f5
    style E fill:#ffe0b2

Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 14 App Router | React SSR, server components, API routes |
| Real-time | Socket.io | Bidirectional WebSocket for audio streaming |
| Transcription | Whisper Base (@xenova/transformers) | Local, offline speech-to-text (~150MB model) |
| AI Summary | OpenAI GPT-4o-mini | Natural language understanding, key insights |
| Database | PostgreSQL + Prisma ORM | Session persistence, relational data |
| Monorepo | Turborepo | Unified build system for apps/web + apps/api-socket |

📊 Technology Comparison

Transcription Approaches

| Approach | Cost | Latency | Accuracy | Offline | ScribeAI Choice |
|---|---|---|---|---|---|
| Whisper Base (Local) | FREE | ~200ms | Good (85-90%) | ✅ Yes | ✅ Selected |
| Whisper Large v3 | FREE | ~500ms | Excellent (95%+) | ✅ Yes | ❌ Too heavy (3GB) |
| OpenAI Whisper API | $0.006/min | ~100ms | Excellent | ❌ No | ❌ Recurring cost |
| Google Speech-to-Text | $0.016/min | ~50ms | Excellent | ❌ No | ❌ Expensive at scale |
| AWS Transcribe | $0.024/min | ~100ms | Excellent | ❌ No | ❌ Most expensive |

Why Whisper Base? Balance of speed, accuracy, and zero cost. Runs entirely client-side or server-side without external API dependencies.

AI Summarization Approaches

| Approach | Cost | Quality | Streaming | ScribeAI Choice |
|---|---|---|---|---|
| OpenAI GPT-4o-mini | $0.15/1M in + $0.60/1M out | Excellent | ✅ Yes | ✅ Selected |
| Google Gemini 1.5 Flash | FREE (15 RPM) | Excellent | ✅ Yes | ❌ Quota limits |
| OpenAI GPT-4o | $5/1M tokens | Excellent | ✅ Yes | ❌ More expensive |
| Claude 3.5 Sonnet | $3/1M tokens | Excellent | ✅ Yes | ❌ More expensive |
| Llama 3 (Local) | FREE | Good | ❌ No | ❌ Resource intensive |

Why GPT-4o-mini? Unlimited usage (pay-as-you-go), fast response times (~1-2s), excellent natural language understanding, JSON mode for structured output, no quota limits.

Resilient Fallback Design: ScribeAI implements graceful degradation when the OpenAI API is temporarily unavailable (network issues, etc.). In these cases:

  • ✅ Session is always saved with full transcript chunks
  • ✅ Fallback summary is generated showing a transcript preview
  • ✅ UI displays an "AI Unavailable - Fallback Summary" badge
  • ✅ Users can still download and review raw transcripts
  • ✅ No data loss occurs - summaries can be regenerated later

This architecture ensures 100% reliability for core transcription functionality, with AI summarization as an enhancement layer rather than a hard dependency.
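The degradation path above can be sketched as a small pure function; note that the type names, preview length, and function signatures here are hypothetical illustrations, not the repository's actual implementation:

```typescript
// Hypothetical sketch of the fallback path: when the OpenAI call fails,
// build a summary object from the transcript itself so the session is
// still saved and the UI can show the "AI Unavailable" badge.
interface ChunkText {
  index: number;
  text: string;
}

interface SessionSummary {
  text: string;
  isFallback: boolean; // drives the "AI Unavailable - Fallback Summary" badge
}

function buildFallbackSummary(chunks: ChunkText[], previewChars = 280): SessionSummary {
  // Reassemble the transcript in chunk order and truncate it to a preview.
  const transcript = chunks
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((c) => c.text)
    .join(" ")
    .trim();
  const preview =
    transcript.length > previewChars ? transcript.slice(0, previewChars) + "…" : transcript;
  return { text: preview, isFallback: true };
}

async function summarizeWithFallback(
  chunks: ChunkText[],
  callOpenAI: (transcript: string) => Promise<string>
): Promise<SessionSummary> {
  const transcript = chunks.map((c) => c.text).join(" ");
  try {
    return { text: await callOpenAI(transcript), isFallback: false };
  } catch {
    // API unreachable: degrade gracefully, never lose the session.
    return buildFallbackSummary(chunks);
  }
}
```

Because the fallback is derived purely from the already-persisted chunks, a later "regenerate summary" action only needs to re-run the OpenAI call, not the transcription.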


📈 Scalability Design

ScribeAI is architected for horizontal scalability with session-based isolation:

1. Chunked Streaming Pipeline
Audio is processed in 30-second chunks (configurable) rather than waiting for the entire recording. This enables:

  • Memory efficiency: Fixed ~150MB RAM per session (Whisper model)
  • Progressive UX: Users see transcripts appear in real-time
  • Fault tolerance: Failed chunks don't invalidate the entire session
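The chunk-level fault tolerance above reduces to filtering and ordering at aggregation time. A minimal sketch (the `ChunkResult` shape is an assumption for illustration, not the repo's actual type):

```typescript
// Illustrative sketch: aggregate transcribed chunks in index order,
// skipping failed ones so a single bad chunk never invalidates the session.
type ChunkResult =
  | { index: number; status: "ok"; text: string }
  | { index: number; status: "failed" };

function aggregateTranscript(results: ChunkResult[]): string {
  return results
    .filter((r): r is Extract<ChunkResult, { status: "ok" }> => r.status === "ok")
    .sort((a, b) => a.index - b.index)
    .map((r) => r.text)
    .join(" ");
}
```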

2. Stateless Session Management
Each recording session is isolated in PostgreSQL with a unique sessionId. The Socket.io server maintains no in-memory state beyond active connections, allowing:

  • Horizontal scaling: Multiple API servers behind a load balancer
  • Session recovery: Clients reconnect using sessionId to resume
  • Database-driven state: All progress persisted (recording → paused → processing → completed)
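The lifecycle above can be enforced with a small transition table. The states come from this README; the exact set of allowed transitions is an assumption for illustration:

```typescript
// Sketch of the session lifecycle guard (transition table is an
// assumption based on the states named in this README).
type SessionState = "recording" | "paused" | "processing" | "completed";

const allowed: Record<SessionState, SessionState[]> = {
  recording: ["paused", "processing"],
  paused: ["recording", "processing"],
  processing: ["completed"],
  completed: [], // terminal state
};

function transition(from: SessionState, to: SessionState): SessionState {
  if (!allowed[from].includes(to)) {
    throw new Error(`Invalid transition: ${from} → ${to}`);
  }
  return to;
}
```

Validating transitions before writing them to PostgreSQL keeps database-driven state consistent even when multiple API servers handle the same reconnecting client.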

3. Async Summary Generation
OpenAI API calls happen asynchronously after session end, preventing blocking:

  • Non-blocking: User can close the browser while summarization runs
  • Event-driven: Socket.io emits completed event when summary finishes
  • Retry logic: Failed summaries can be regenerated without re-transcribing
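A hedged sketch of the retry idea: only the summary call is re-run, never the transcription. `retrySummary` is a hypothetical helper, not a function from the codebase:

```typescript
// Illustrative retry wrapper with exponential backoff for the
// async summary call (500ms, 1s, 2s between attempts by default).
async function retrySummary<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt: baseDelayMs * 2^i.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```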

Projected Capacity (Single 4-core server):

  • Concurrent sessions: ~20-30 (limited by Whisper CPU usage)
  • Database load: <100 sessions/sec write throughput
  • Cost at 1000 users/day: ~$3-5 (GPT-4o-mini: $0.15/1M input tokens, avg ~500 tokens/summary)

Bottlenecks to Monitor:

  • Whisper CPU usage (solution: GPU acceleration or cloud Whisper API)
  • OpenAI rate limits (solution: tier-based limits, typically 500 RPM on tier 1)
  • PostgreSQL connections (solution: PgBouncer pooling)

🎬 Video Walkthrough

Recording a Session

  1. Start: Click "New Session" → allow mic access
  2. Record: Speak naturally (app auto-chunks every 30s)
  3. Pause/Resume: Control recording with toolbar buttons
  4. End: Click "End Session" to trigger AI summary

Viewing Results

  • Sessions Page: Grid of all sessions with state badges
  • Session Detail: Full transcript with timestamps + AI insights
  • Download: Export transcript as formatted TXT file
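The TXT export can be illustrated with a small formatter. The exact output format is an assumption here, not necessarily what the download route produces:

```typescript
// Hypothetical TXT export formatter: one line per transcript chunk,
// prefixed with a [mm:ss] timestamp derived from its offset in seconds.
interface TimedChunk {
  offsetSeconds: number;
  text: string;
}

function formatTimestamp(totalSeconds: number): string {
  const m = Math.floor(totalSeconds / 60);
  const s = Math.floor(totalSeconds % 60);
  return `[${String(m).padStart(2, "0")}:${String(s).padStart(2, "0")}]`;
}

function toTxt(title: string, chunks: TimedChunk[]): string {
  const lines = chunks.map((c) => `${formatTimestamp(c.offsetSeconds)} ${c.text}`);
  return [title, "", ...lines].join("\n");
}
```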

Demo Script

Use the included script to generate sample sessions:

./scripts/record-demo.sh

πŸ› οΈ Development

Project Structure

ScribeAI/
├── apps/
│   ├── web/              # Next.js frontend
│   │   ├── app/
│   │   │   ├── sessions/
│   │   │   │   ├── page.tsx         # Sessions list
│   │   │   │   └── [id]/page.tsx    # Session detail
│   │   │   └── api/
│   │   │       └── download/[sessionId]/route.ts
│   │   └── lib/
│   │       └── auth.ts              # Auth helpers (Better Auth TODO)
│   └── api-socket/       # Socket.io server
│       ├── src/
│       │   ├── socket.ts            # Socket event handlers
│       │   ├── summary/
│       │   │   └── processor.ts     # Summary aggregation logic
│       │   └── lib/
│       │       └── summary.ts       # OpenAI API integration
│       └── package.json
├── packages/
│   └── database/         # Prisma schema & migrations
│       └── prisma/
│           └── schema.prisma
├── turbo.json            # Turborepo config
└── README.md

Key Files

Frontend

  • apps/web/app/sessions/page.tsx: Server component fetching sessions with Prisma
  • apps/web/app/sessions/[id]/page.tsx: Detail view with transcript + summary
  • apps/web/app/api/download/[sessionId]/route.ts: TXT export endpoint

Backend

  • apps/api-socket/src/socket.ts: Socket.io events (start_session, audio_chunk, end_session)
  • apps/api-socket/src/summary/processor.ts: Core logic for aggregating transcript & generating summary
  • apps/api-socket/src/lib/summary.ts: OpenAI API wrapper

Database

  • packages/database/prisma/schema.prisma: Models: User, Session, TranscriptChunk, Summary
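The models named above could plausibly look like the following Prisma sketch; field names, types, and relations are assumptions for illustration (the `User` model is omitted), not the repository's actual schema:

```prisma
// Hypothetical shape of the models listed above -- not the actual schema.
model Session {
  id        String            @id @default(cuid())
  state     String            // recording | paused | processing | completed
  createdAt DateTime          @default(now())
  chunks    TranscriptChunk[]
  summary   Summary?
}

model TranscriptChunk {
  id        String  @id @default(cuid())
  sessionId String
  index     Int
  text      String
  session   Session @relation(fields: [sessionId], references: [id])
}

model Summary {
  id         String  @id @default(cuid())
  sessionId  String  @unique
  text       String
  isFallback Boolean @default(false)
  session    Session @relation(fields: [sessionId], references: [id])
}
```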

Available Scripts

# Development
npm run dev              # Start all apps (web + api-socket)
npm run dev:web          # Start Next.js only
npm run dev:api          # Start Socket.io only

# Build
npm run build            # Build all apps
npm run build:web        # Build Next.js
npm run build:api        # Build Socket.io

# Database
npx prisma migrate dev   # Apply migrations
npx prisma studio        # Visual DB browser
npx prisma generate      # Regenerate Prisma Client

# Testing
npm run test             # Run tests (TODO: Add tests)
npm run test:summary     # Test OpenAI summarization

Environment Variables

# Database
DATABASE_URL="postgresql://user:password@localhost:5432/scribeai"

# AI Services
OPENAI_API_KEY="sk-..."  # Get from https://platform.openai.com/api-keys

# Better Auth (TODO)
BETTER_AUTH_SECRET="your-secret-key"
BETTER_AUTH_URL="http://localhost:3000"

🚧 Roadmap

✅ Completed

  • PR #5: MediaRecorder → Socket.io audio streaming
  • PR #6: Whisper Base local transcription
  • PR #7: OpenAI GPT-4o-mini summarization
  • PR #8: Session list/detail pages + README

🔄 In Progress

  • Better Auth integration (email/password + OAuth)
  • Real-time transcript display during recording
  • WebSocket heartbeat monitoring

📋 Planned

  • Export to PDF/JSON formats
  • Speaker diarization (multi-speaker support)
  • Custom summary templates
  • Mobile app (React Native)
  • Self-hosted deployment guides (Docker)

📄 License

MIT License - see LICENSE for details.


πŸ™ Acknowledgments

  • Whisper AI: OpenAI's state-of-the-art speech recognition
  • Xenova Transformers: Lightweight Whisper.js implementation
  • OpenAI GPT-4o-mini: Fast, cost-effective language model for summarization
  • Next.js: React framework for production-grade apps
  • Socket.io: Reliable WebSocket library

Built with ❤️ by Jason William
GitHub • Issues • Pull Requests

AI-powered audio transcription and meeting summarization tool built with Next.js, Socket.io, and Google Gemini API.


🚀 Features

  • Real-time Audio Transcription: Capture and transcribe audio from microphone or shared meeting tabs (Google Meet/Zoom)
  • Live Streaming: Stream audio chunks to Gemini API for incremental transcription
  • Session Management: Record, pause, resume, and stop recording sessions
  • AI Summaries: Generate meeting summaries with key points, action items, and decisions
  • Long-duration Support: Architected for 1+ hour recording sessions with chunked streaming
  • Real-time Updates: Socket.io integration for live status updates
  • Dark Mode: Full dark mode support for extended sessions

📋 Tech Stack

  • Frontend/Backend: Next.js 14+ (App Router, TypeScript)
  • Real-time Communication: Socket.io
  • Database: PostgreSQL via Prisma ORM
  • AI/ML: Google Gemini API
  • Authentication: Better Auth (planned)
  • Styling: Tailwind CSS
  • Code Quality: ESLint, Prettier

πŸ—οΈ Project Structure

ScribeAI/
├── apps/
│   ├── web/                    # Next.js frontend application
│   │   ├── app/
│   │   │   ├── components/     # React components
│   │   │   ├── providers/      # Context providers
│   │   │   ├── layout.tsx      # Root layout with dark mode
│   │   │   ├── page.tsx        # Home page
│   │   │   └── globals.css     # Global styles
│   │   ├── next.config.js
│   │   ├── tailwind.config.cjs
│   │   └── package.json
│   └── api-socket/             # Node.js Socket.io server
│       ├── src/
│       │   ├── index.ts        # Express server setup
│       │   └── socket.ts       # Socket event handlers
│       ├── tsconfig.json
│       └── package.json
├── docker-compose.yml          # PostgreSQL container
├── .eslintrc.cjs               # ESLint configuration
├── .prettierrc                 # Prettier configuration
├── .env.example                # Environment variables template
└── package.json                # Root workspace configuration

πŸ› οΈ Setup Instructions

Prerequisites

  • Node.js 18+ and npm
  • Docker and Docker Compose (for PostgreSQL)
  • Google Gemini API key (available from Google AI Studio)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd ScribeAI
  2. Install dependencies

    npm install
  3. Set up environment variables

    cp .env.example .env

    Edit .env and add your Gemini API key:

    GEMINI_API_KEY=your_actual_api_key_here
  4. Start PostgreSQL with Docker

    docker-compose up -d

    Verify it's running:

    docker-compose ps
  5. Run the development servers

    # Start both Next.js and Socket.io servers concurrently
    npm run dev
    
    # Or run them separately:
    npm run dev:app      # Next.js on http://localhost:3000
    npm run dev:socket   # Socket.io on http://localhost:4001
  6. Access the application at http://localhost:3000

📦 Available Scripts

Root Level

  • npm run dev - Start both app and socket server concurrently
  • npm run dev:app - Start Next.js development server only
  • npm run dev:socket - Start Socket.io server only
  • npm run build - Build all workspaces
  • npm run start - Start production servers
  • npm run lint - Lint all workspaces
  • npm run format - Format code with Prettier
  • npm run format:check - Check code formatting

Web App (apps/web)

  • npm run dev --workspace=apps/web - Start Next.js dev server
  • npm run build --workspace=apps/web - Build Next.js app
  • npm run start --workspace=apps/web - Start Next.js production server

Socket Server (apps/api-socket)

  • npm run dev --workspace=apps/api-socket - Start Socket.io dev server
  • npm run build --workspace=apps/api-socket - Build TypeScript to JavaScript
  • npm run start --workspace=apps/api-socket - Start production server

🔌 Socket.io Events

Client → Server

  • session:start - Initialize new recording session
  • audio:chunk - Send audio data chunk
  • session:pause - Pause current session
  • session:resume - Resume paused session
  • session:stop - Stop and process session

Server → Client

  • session:status - Session state updates (recording, paused, processing, completed)
  • audio:received - Acknowledge audio chunk receipt
  • error - Error notifications
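Payloads for the events above can be validated at the boundary with a type guard. The README lists only the event names, so the field names below are assumptions for illustration:

```typescript
// Hypothetical payload shape for the session:status event; the guard
// rejects anything that does not carry a sessionId and a known state.
interface SessionStatusPayload {
  sessionId: string;
  state: "recording" | "paused" | "processing" | "completed";
}

function isSessionStatusPayload(value: unknown): value is SessionStatusPayload {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const state = v.state;
  return (
    typeof v.sessionId === "string" &&
    (state === "recording" ||
      state === "paused" ||
      state === "processing" ||
      state === "completed")
  );
}
```

Guarding untyped Socket.io payloads this way keeps a malformed message from corrupting session state on either side of the connection.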

🐳 Docker Commands

# Start PostgreSQL
docker-compose up -d

# Stop PostgreSQL
docker-compose down

# View logs
docker-compose logs -f postgres

# Remove volumes (⚠️ deletes all data)
docker-compose down -v

🔧 Development Workflow

  1. Make changes to code
  2. Hot reload will update automatically
  3. Check console for errors
  4. Format code before committing:
    npm run format

πŸ“ Code Quality

  • TypeScript: Strict type checking enabled
  • ESLint: Configured for TypeScript and React best practices
  • Prettier: Consistent code formatting
  • JSDoc: Inline documentation for functions and components

🚧 TODO / Roadmap

  • Prisma ORM integration
  • Better Auth authentication
  • Gemini API transcription integration
  • Session history dashboard
  • Audio chunk streaming implementation
  • Meeting summary generation
  • Export transcripts (PDF, TXT, JSON)
  • Multi-speaker diarization
  • WebRTC implementation for tab sharing
  • Unit and integration tests

🀝 Contributing

  1. Create a feature branch
  2. Make your changes
  3. Run linting and formatting
  4. Submit a pull request

📄 License

This project is part of the AttackCapital assignment.


Built with ❤️ for productivity professionals
