This project provides a hybrid semantic search solution that combines Elasticsearch keyword search with OpenAI-generated embeddings, delivering more accurate and relevant results, especially for large-scale websites.
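At query time, the hybrid approach boils down to sending Elasticsearch a single request that contains both a lexical clause and a vector (kNN) clause and letting it blend the scores. The sketch below is illustrative only and not part of this repository: the field names (`body`, `body_embedding`), the embedding model, and the boosts are assumptions to adapt to your own index, and the environment variables are the ones described in the configuration table further down.

```python
# hybrid_search_sketch.py -- illustrative only, not part of this repository.
# Assumptions: a text field "body" and a dense_vector field "body_embedding"
# in the index, the text-embedding-3-small model, and the ES_* / OPENAI_*
# environment variables described in the configuration table below.
import os

import requests
from openai import OpenAI


def hybrid_search(text: str, size: int = 10) -> dict:
    # 1) Embed the query with the same OpenAI model used at indexing time.
    openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumption: match your pipeline's model
        input=text,
    ).data[0].embedding

    # 2) Send one search request containing both a lexical clause and a kNN
    #    clause; Elasticsearch sums the (boosted) scores of the two clauses.
    body = {
        "size": size,
        "query": {"match": {"body": {"query": text, "boost": 0.3}}},
        "knn": {
            "field": "body_embedding",
            "query_vector": query_vector,
            "k": size,
            "num_candidates": 100,
            "boost": 0.7,
        },
    }
    resp = requests.post(
        f"{os.environ['ES_URL']}/{os.environ['ES_INDEX']}/_search",
        json=body,
        auth=("elastic", os.environ["ELASTIC_PASSWORD"]),
        verify="certs/ca.crt",  # assumption: CA bundle produced by the setup scripts
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    for hit in hybrid_search("how does semantic search work")["hits"]["hits"]:
        print(round(hit["_score"], 3), hit["_source"].get("title"))
```

The boosts (0.3 lexical vs. 0.7 semantic here) are the main tuning knob for how much the embedding side influences ranking.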
- Hybrid Search: Combines traditional Elasticsearch keyword queries with semantic embeddings.
- Scalable: Optimized for indexing large volumes of website data.
- Flexible Embedding Options: Uses OpenAI by default, with easy alternatives.
- Synthetic Data Generation: Built-in tool for generating test data with semantic relationships.
- Elasticsearch & Kibana: Core search and data visualization services.
- OpenAI Embeddings API: Semantic embedding generation.
- Open Web Crawler: Web content ingestion.
- Next.js: User-facing search interface.
- Synthetic Data Generator: Python package for generating test data.
- Docker (20.10+)
- Docker Compose (2.27+)
- OpenAI API Key (minimal costs involved)
- Python 3.11+ (for synthetic data generation)
Copy env.example to .env and provide required values:
| Variable | Description | Example | 
|---|---|---|
| ELASTIC_PASSWORD | Elasticsearch admin password | secure_pass! | 
| KIBANA_PASSWORD | Kibana system user password | secure_pass! | 
| OPENAI_API_KEY | OpenAI API key | key_1234567 | 
| ES_MEM_LIMIT | Elasticsearch memory limit in bytes | 4000000000 | 
| ES_INDEX | Elasticsearch index name | site-index | 
| ES_URL | Elasticsearch URL | https://localhost:9200 | 
| ES_API_KEY | Elasticsearch API key | your_api_key | 
| ES_PIPELINE | Embeddings pipeline name | openai_embeddings_pipeline | 
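If you want a quick sanity check before starting the containers, a few lines of Python can confirm the variables are present. This helper does not ship with the repository; it assumes the values from .env have been exported into your shell (for example with `set -a; source .env; set +a`).

```python
# check_env.py -- convenience sketch, not part of this repository.
# Assumes the values from .env are exported into the current shell.
import os
import sys

REQUIRED = [
    "ELASTIC_PASSWORD",
    "KIBANA_PASSWORD",
    "OPENAI_API_KEY",
    "ES_MEM_LIMIT",
    "ES_INDEX",
    "ES_URL",
    "ES_API_KEY",
    "ES_PIPELINE",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit("Missing environment variables: " + ", ".join(missing))
print("All required variables are set.")
```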
Run the setup script (creates Elasticsearch, Kibana, SSL certificates, and verifies connectivity):
./scripts/start-elastic.sh --build
./scripts/copy-certs.sh
./scripts/test-elastic.sh

Check Kibana at http://localhost:5601. Log in with the user elastic and your ELASTIC_PASSWORD.
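The test script covers the connectivity check; doing the same by hand is a single authenticated HTTPS request against the cluster root. The snippet below is only a sketch, and the CA certificate path is an assumption to adjust to wherever copy-certs.sh places the certificates.

```python
# es_ping.py -- optional sketch, not part of this repository.
# Assumption: "certs/ca.crt" matches where copy-certs.sh puts the CA bundle.
import os

import requests

resp = requests.get(
    os.environ.get("ES_URL", "https://localhost:9200"),
    auth=("elastic", os.environ["ELASTIC_PASSWORD"]),
    verify="certs/ca.crt",
    timeout=10,
)
resp.raise_for_status()
info = resp.json()
print(f"Connected to {info['cluster_name']} (Elasticsearch {info['version']['number']})")
```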
Set up index, inference endpoint, and embedding pipeline:
./scripts/create-index.sh
./scripts/create-openai-inference-endpoint.sh
./scripts/create-openai-embeddings-pipeline.sh

The synthetic data generator creates controlled, semantically related documents, perfect for testing and development. No external dependencies or permissions needed.
- Set up Python environment:
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install the package
pip install -e .

- Generate synthetic data:
# Generate 100 documents and save to data/synthetic_data.json
python scripts/generate-synthetic-data.py --num-documents 100
# Generate with a specific seed for reproducibility
python scripts/generate-synthetic-data.py --seed 42
# Generate without indexing to Elasticsearch
python scripts/generate-synthetic-data.py --no-index

Generated documents include:
{
  "id": "doc1",
  "title": "Machine Learning in AI",
  "content": "Machine learning is a AI technique that focuses on feature engineering...",
  "category": "AI",
  "tags": ["machine learning", "feature engineering", "AI"],
  "type": "definition",
  "difficulty": "intermediate",
  "length": "short"
}

Crawling websites without the owner's permission may:

- Violate terms of service
- Overwhelm servers (potential DoS)
- Be illegal in some jurisdictions
- Result in IP bans
If you have permission to crawl a site:
- Configure the crawler at backend/crawler/config/private/crawler-config.yml:
domains:
  - url: https://example.com
    sitemap_urls:
      - https://example.com/sitemap.xml
output_sink: elasticsearch
output_index: site-index
max_crawl_depth: 2
elasticsearch:
  host: https://es01
  port: 9200
  username: elastic
  password: <ELASTIC_PASSWORD>
  api_key: <api key>
  ca_fingerprint: <Fingerprint from certs/es01.crt>
  pipeline: openai_embeddings_pipeline
  pipeline_enabled: true

- Generate an API key (recommended):
./scripts/create-crawler-key.sh

- Run the crawler:
./scripts/start-crawler.sh
docker exec -it crawler bin/crawler crawl config/private/crawler-config.yml

Start the Next.js frontend:
./scripts/start-next.sh

Access your search UI at http://localhost:3000.
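Before judging search quality in the UI, it is worth confirming that indexed documents actually received embeddings. A quick count query is enough; the sketch below is not part of the repository, and `body_embedding` is an assumed field name to replace with whatever your pipeline writes.

```python
# check_embeddings.py -- optional sketch, not part of this repository.
# Assumption: the pipeline writes embeddings to a field named "body_embedding".
import os

import requests

resp = requests.post(
    f"{os.environ.get('ES_URL', 'https://localhost:9200')}"
    f"/{os.environ.get('ES_INDEX', 'site-index')}/_count",
    json={"query": {"exists": {"field": "body_embedding"}}},
    auth=("elastic", os.environ["ELASTIC_PASSWORD"]),
    verify="certs/ca.crt",  # assumption: CA bundle path
    timeout=10,
)
resp.raise_for_status()
print(f"{resp.json()['count']} documents contain an embedding")
```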
User -> Next.js Frontend -> Elasticsearch API
                             ^
                             |
                      Elasticsearch Index
                             ^
                             |
Open Web Crawler -> Elasticsearch Pipeline
                        |
                        v
              OpenAI Embeddings API
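The "Elasticsearch Pipeline" box above is an ingest pipeline whose inference processor calls the OpenAI inference endpoint for every incoming document. Conceptually it amounts to something like the following rough sketch; the endpoint id and field names are assumptions, and the create-openai-embeddings-pipeline.sh script remains the source of truth.

```python
# pipeline_sketch.py -- conceptual sketch only; the repository's
# create-openai-embeddings-pipeline.sh script is the source of truth.
import os

import requests

# Assumptions: "openai_embeddings" is the id of the inference endpoint created
# earlier, and crawled text lives in a "body" field that should be embedded
# into "body_embedding".
pipeline = {
    "description": "Add OpenAI embeddings to incoming documents",
    "processors": [
        {
            "inference": {
                "model_id": "openai_embeddings",
                "input_output": [
                    {"input_field": "body", "output_field": "body_embedding"}
                ],
            }
        }
    ],
}

resp = requests.put(
    f"{os.environ.get('ES_URL', 'https://localhost:9200')}/_ingest/pipeline/"
    f"{os.environ.get('ES_PIPELINE', 'openai_embeddings_pipeline')}",
    json=pipeline,
    auth=("elastic", os.environ["ELASTIC_PASSWORD"]),
    verify="certs/ca.crt",  # assumption: CA bundle path
    timeout=10,
)
resp.raise_for_status()
print("Pipeline registered:", resp.json())
```

Because the crawler config above sets pipeline: openai_embeddings_pipeline with pipeline_enabled: true, every crawled page passes through this processor before it is stored.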
Detailed architecture available in docs/ARCHITECTURE.md.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request