This project provides a hybrid semantic search solution that combines Elasticsearch keyword search with OpenAI-generated embeddings, delivering more accurate and relevant search results, especially for large-scale websites.
- Hybrid Search: Combines traditional Elasticsearch keyword queries with semantic embeddings.
- Scalable: Optimized for indexing large volumes of website data.
- Flexible Embedding Options: Uses OpenAI by default, with easy alternatives.
- Synthetic Data Generation: Built-in tool for generating test data with semantic relationships.
- Elasticsearch & Kibana: Core search and data visualization services.
- OpenAI Embeddings API: Semantic embedding generation.
- Open Web Crawler: Web content ingestion.
- Next.js: User-facing search interface.
- Synthetic Data Generator: Python package for generating test data.
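As a sketch of the hybrid approach, a single search request can combine a keyword `match` clause with a `knn` clause over an embedding field. The field names `content` and `content_embedding` below are illustrative assumptions, not this project's actual schema:

```python
# Sketch of a hybrid Elasticsearch query body: a keyword `match` clause
# combined with a `knn` clause over a dense-vector embedding field.
# Field names (`content`, `content_embedding`) are illustrative assumptions.
def build_hybrid_query(text: str, embedding: list[float], k: int = 10) -> dict:
    return {
        "query": {"match": {"content": {"query": text}}},  # keyword part
        "knn": {
            "field": "content_embedding",   # vector field holding embeddings
            "query_vector": embedding,
            "k": k,
            "num_candidates": 5 * k,        # candidate pool per shard
        },
        "size": k,
    }
```

Elasticsearch blends the scores of both clauses, so documents that match on keywords and semantics rank highest.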
- Docker (20.10+)
- Docker Compose (2.27+)
- OpenAI API Key (minimal costs involved)
- Python 3.11+ (for synthetic data generation)
Copy `env.example` to `.env` and provide the required values:
| Variable | Description | Example |
|---|---|---|
| `ELASTIC_PASSWORD` | Elasticsearch admin password | `secure_pass!` |
| `KIBANA_PASSWORD` | Kibana system user password | `secure_pass!` |
| `OPENAI_API_KEY` | OpenAI API key | `key_1234567` |
| `ES_MEM_LIMIT` | Elasticsearch max memory (bytes) | `4000000000` |
| `ES_INDEX` | Elasticsearch index name | `site-index` |
| `ES_URL` | Elasticsearch URL | `https://localhost:9200` |
| `ES_API_KEY` | Elasticsearch API key | `your_api_key` |
| `ES_PIPELINE` | Embeddings pipeline name | `openai_embeddings_pipeline` |
Run the setup scripts (they start Elasticsearch and Kibana, copy the SSL certificates, and verify connectivity):
./scripts/start-elastic.sh --build
./scripts/copy-certs.sh
./scripts/test-elastic.sh
Check Kibana at http://localhost:5601. Log in with user `elastic` and your `ELASTIC_PASSWORD`.
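As an illustrative counterpart to `test-elastic.sh`, a connectivity check in Python could build an authenticated request against the cluster health endpoint. The helper below is hypothetical; a real call would also need the cluster's CA certificate from `certs/`, since the stack uses self-signed certificates:

```python
import base64
import urllib.request

def health_request(es_url: str, user: str, password: str) -> urllib.request.Request:
    """Build a Basic-auth request for the cluster health endpoint.

    Hypothetical helper; urlopen() on the result would additionally
    require an SSL context trusting the cluster's self-signed CA.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(f"{es_url}/_cluster/health")
    request.add_header("Authorization", f"Basic {token}")
    return request
```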
Set up index, inference endpoint, and embedding pipeline:
./scripts/create-index.sh
./scripts/create-openai-inference-endpoint.sh
./scripts/create-openai-embeddings-pipeline.sh
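The pipeline-creation script likely sends an ingest pipeline definition with an `inference` processor of roughly this shape (the endpoint id and field names below are illustrative assumptions, not the script's actual values):

```python
# Sketch of the body for PUT _ingest/pipeline/openai_embeddings_pipeline.
# The inference endpoint id and field names are illustrative assumptions.
def embeddings_pipeline_body(endpoint_id: str,
                             source_field: str = "content",
                             target_field: str = "content_embedding") -> dict:
    return {
        "description": "Generate OpenAI embeddings at ingest time",
        "processors": [
            {
                "inference": {
                    "model_id": endpoint_id,  # inference endpoint created above
                    "input_output": [
                        {"input_field": source_field,
                         "output_field": target_field}
                    ],
                }
            }
        ],
    }
```

With this pipeline in place, documents indexed through it get their embedding computed automatically, which is how both the crawler and the synthetic data generator can stay embedding-agnostic.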
The synthetic data generator creates controlled, semantically related documents perfect for testing and development. No external dependencies or permissions needed.
- Set up Python environment:
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the package
pip install -e .
- Generate synthetic data:
# Generate 100 documents and save to data/synthetic_data.json
python scripts/generate-synthetic-data.py --num-documents 100
# Generate with a specific seed for reproducibility
python scripts/generate-synthetic-data.py --seed 42
# Generate without indexing to Elasticsearch
python scripts/generate-synthetic-data.py --no-index
Generated documents look like this:
{
"id": "doc1",
"title": "Machine Learning in AI",
"content": "Machine learning is a AI technique that focuses on feature engineering...",
"category": "AI",
"tags": ["machine learning", "feature engineering", "AI"],
"type": "definition",
"difficulty": "intermediate",
"length": "short"
}
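The generator script is the source of truth; as a minimal sketch of how seeded, reproducible generation can work (the categories, tags, and function below are illustrative, not the package's actual code):

```python
import random

# Illustrative category/tag pools; the real generator has its own vocabulary.
CATEGORIES = {
    "AI": ["machine learning", "feature engineering", "neural networks"],
    "Search": ["elasticsearch", "ranking", "embeddings"],
}

def generate_document(doc_id: int, rng: random.Random) -> dict:
    """Produce one synthetic document; a sketch, not the package's code."""
    category = rng.choice(sorted(CATEGORIES))
    tags = rng.sample(CATEGORIES[category], k=2)
    return {
        "id": f"doc{doc_id}",
        "title": f"{tags[0].title()} in {category}",
        "category": category,
        "tags": tags,
        "difficulty": rng.choice(["beginner", "intermediate", "advanced"]),
    }

# Passing the same seeded random.Random yields identical documents,
# mirroring the --seed flag shown above.
```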
Crawling websites without permission may:
- Violate terms of service
- Overwhelm servers (potential DoS)
- Be illegal in some jurisdictions
- Result in IP bans
If you have permission to crawl a site:
- Configure the crawler at `backend/crawler/config/private/crawler-config.yml`:
domains:
- url: https://example.com
sitemap_urls:
- https://example.com/sitemap.xml
output_sink: elasticsearch
output_index: site-index
max_crawl_depth: 2
elasticsearch:
host: https://es01
port: 9200
username: elastic
password: <ELASTIC_PASSWORD>
api_key: <api key>
ca_fingerprint: <Fingerprint from certs/es01.crt>
pipeline: openai_embeddings_pipeline
pipeline_enabled: true
- Generate an API key (recommended):
./scripts/create-crawler-key.sh
- Run the crawler:
./scripts/start-crawler.sh
docker exec -it crawler bin/crawler crawl config/private/crawler-config.yml
Start Next.js frontend:
./scripts/start-next.sh
Access your search UI at http://localhost:3000.
User -> Next.js Frontend -> Elasticsearch API
                                   ^
                                   |
                          Elasticsearch Index
                                   ^
                                   |
  Open Web Crawler  ->  Elasticsearch Pipeline
                                   |
                                   v
                          OpenAI Embeddings API
Detailed architecture is available in `docs/ARCHITECTURE.md`.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request