Commit a76baf4

switched from ElasticSearch to Weaviate
1 parent 6dc997c commit a76baf4

13 files changed: +425 -248 lines changed

Gemfile

Lines changed: 1 addition & 2 deletions
@@ -16,8 +16,7 @@ gem "puma", ">= 5.0"
 gem "bcrypt", "~> 3.1.7"

 # Search & Caching
-gem "elasticsearch-rails", "~> 8.0"
-gem "elasticsearch-model", "~> 8.0"
+gem "faraday", "~> 2.0" # For Weaviate HTTP client
 gem "redis", "~> 5.0"
 gem "redis-namespace", "~> 1.11"
 gem "connection_pool", "~> 2.4"

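Since the Gemfile now pulls in Faraday rather than a search-specific gem, the application presumably talks to Weaviate over plain HTTP. The client itself is not part of this excerpt, so the following is only a sketch of the GraphQL query text such a client might POST to Weaviate's `/v1/graphql` endpoint for a tenant-scoped BM25 search. The helper name `bm25_query` and the `tenant_id` property are assumptions made for illustration; the commit may model tenancy differently.

```ruby
require "json"

# Hypothetical helper: builds the GraphQL text for a paginated BM25 search.
# `tenant_id` as a filterable integer property is an assumption.
def bm25_query(class_name:, tenant_id:, query:, page: 1, per_page: 10)
  page = [page.to_i, 1].max
  per_page = per_page.to_i.clamp(1, 100)
  offset = (page - 1) * per_page

  <<~GRAPHQL
    {
      Get {
        #{class_name}(
          bm25: { query: #{query.to_s.to_json} }
          where: { path: ["tenant_id"], operator: Equal, valueInt: #{Integer(tenant_id)} }
          limit: #{per_page}
          offset: #{offset}
        ) {
          title
          _additional { score }
        }
      }
    }
  GRAPHQL
end

puts bm25_query(class_name: "Document", tenant_id: 7, query: "car", page: 2)
```

Encoding the user's query with `to_json` keeps embedded quotes from breaking out of the GraphQL string, and `Integer(tenant_id)` raises on anything that is not a number instead of interpolating it.
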
Gemfile.lock

Lines changed: 1 addition & 17 deletions
@@ -96,19 +96,6 @@ GEM
     dotenv (3.1.8)
     drb (2.2.3)
     ed25519 (1.4.0)
-    elastic-transport (8.4.1)
-      faraday (< 3)
-      multi_json
-    elasticsearch (8.19.1)
-      elastic-transport (~> 8.3)
-      elasticsearch-api (= 8.19.1)
-    elasticsearch-api (8.19.1)
-      multi_json
-    elasticsearch-model (8.0.1)
-      activesupport (> 3)
-      elasticsearch (~> 8)
-      hashie
-    elasticsearch-rails (8.0.1)
     erb (5.1.1)
     erubi (1.13.1)
     et-orbi (1.4.0)
@@ -140,7 +127,6 @@ GEM
       raabro (~> 1.4)
     globalid (1.3.0)
       activesupport (>= 6.1)
-    hashie (5.0.0)
     i18n (1.14.7)
       concurrent-ruby (~> 1.0)
     io-console (0.8.1)
@@ -232,7 +218,6 @@ GEM
     mini_portile2 (2.8.9)
     minitest (5.26.0)
     msgpack (1.8.0)
-    multi_json (1.17.0)
     net-http (0.6.0)
       uri
     net-imap (0.5.12)
@@ -469,10 +454,9 @@ DEPENDENCIES
   circuitbox (~> 2.0)
   connection_pool (~> 2.4)
   debug
-  elasticsearch-model (~> 8.0)
-  elasticsearch-rails (~> 8.0)
   factory_bot_rails (~> 6.4)
   faker (~> 3.2)
+  faraday (~> 2.0)
   fast_jsonapi (~> 1.5)
   kamal
   kaminari (~> 1.2)

README.md

Lines changed: 41 additions & 26 deletions
@@ -1,28 +1,32 @@
 # DDoc Search

-A high-performance, multi-tenant document search API built with Ruby on Rails. This application provides full-text search capabilities powered by Elasticsearch, with support for tenant isolation, rate limiting, caching, and asynchronous document indexing via Kafka.
+A high-performance, multi-tenant document search API built with Ruby on Rails. This application provides full-text search capabilities powered by Weaviate, with support for tenant isolation, rate limiting, caching, and asynchronous document indexing via Kafka.

 ## Quick Start

 ### Running Locally

 1. **Install dependencies**
+
 ```bash
 bundle install
 ```

-2. **Start required services** (Elasticsearch, Redis, Kafka)
+2. **Start required services** (Weaviate, Redis, Kafka)
+
 ```bash
 docker compose -f docker-compose.dev.yml up -d
 ```

-3. **Setup database and Elasticsearch**
+3. **Setup database and Weaviate**
+
 ```bash
 rails db:drop db:create db:migrate
-rails runner "Document.__elasticsearch__.create_index! force: true"
+rails runner "Document.ensure_weaviate_schema!"
 ```

 4. **Create a test tenant**
+
 ```bash
 rails runner tmp/create_tenant.rb
 # Save the API key from the output!
@@ -43,9 +47,7 @@ A high-performance, multi-tenant document search API built with Ruby on Rails. T

 ### Ready-to-Use curl Commands

-```bash
-export TEST_API_KEY="df1a5764855153924486beaae96cebef739f3f54f68e28ebdf0338aea5155ee5"
-```
+Using test API key: `aead1b358e37d400e37bd9f6d031fe3a0fab53f6f6e3839b494740b7373658fe`

 **Health Check:**

@@ -57,7 +59,7 @@ curl http://localhost:3000/health | jq '.'

 ```bash
 curl -X POST http://localhost:3000/v1/documents \
--H "X-API-Key: $TEST_API_KEY" \
+-H "X-API-Key: aead1b358e37d400e37bd9f6d031fe3a0fab53f6f6e3839b494740b7373658fe" \
 -H "Content-Type: application/json" \
 -d '{
 "document": {
@@ -68,25 +70,38 @@ curl -X POST http://localhost:3000/v1/documents \
 }' | jq '.'
 ```

+```bash
+curl -X POST http://localhost:3000/v1/documents \
+-H "X-API-Key: aead1b358e37d400e37bd9f6d031fe3a0fab53f6f6e3839b494740b7373658fe" \
+-H "Content-Type: application/json" \
+-d '{
+"document": {
+"title": "The Symphony of Earth",
+"content": "'"$(cat test/fixtures/files/earth.txt | tr '\n' ' ' | sed 's/"/\\"/g')"'",
+"metadata": {"category": "story", "tags": ["earth", "life", "harmony"]}
+}
+}' | jq '.'
+```
+
 **Retrieve Document:**

 ```bash
 curl http://localhost:3000/v1/documents/1 \
--H "X-API-Key: $TEST_API_KEY" | jq '.'
+-H "X-API-Key: aead1b358e37d400e37bd9f6d031fe3a0fab53f6f6e3839b494740b7373658fe" | jq '.'
 ```

 **Search Documents:**

 ```bash
 curl "http://localhost:3000/v1/search?q=car&page=1&per_page=10" \
--H "X-API-Key: $TEST_API_KEY" | jq '.'
+-H "X-API-Key: aead1b358e37d400e37bd9f6d031fe3a0fab53f6f6e3839b494740b7373658fe" | jq '.'
 ```

 **Delete Document:**

 ```bash
 curl -X DELETE http://localhost:3000/v1/documents/1 \
--H "X-API-Key: $TEST_API_KEY" | jq '.'
+-H "X-API-Key: aead1b358e37d400e37bd9f6d031fe3a0fab53f6f6e3839b494740b7373658fe" | jq '.'
 ```

 ### Test Files Available
@@ -102,10 +117,10 @@ The project includes three test files in `test/fixtures/files/` that you can use
 DDoc Search is designed to handle document storage and search for multiple tenants with the following key features:

 - **Multi-tenant Architecture**: Complete data isolation per tenant with subdomain-based routing
-- **Full-text Search**: Powered by Elasticsearch with custom analyzers and highlighting
+- **Full-text Search**: Powered by Weaviate with BM25 keyword search
 - **Asynchronous Processing**: Kafka-based message queue for document indexing operations
 - **High Performance**: Redis caching, circuit breakers, and rate limiting
-- **Scalable Design**: Horizontal scaling support with configurable shards and replicas
+- **Scalable Design**: Horizontal scaling support with Weaviate's distributed architecture
 - **RESTful API**: Clean JSON API with comprehensive error handling

 ## Architecture
@@ -115,7 +130,7 @@ DDoc Search is designed to handle document storage and search for multiple tenan
 - **Framework**: Ruby on Rails 8.0.3
 - **Ruby Version**: 3.3.0 (3.4.7 for Docker)
 - **Database**: SQLite3 (development/test), with support for multiple databases in production
-- **Search Engine**: Elasticsearch 8.0
+- **Search Engine**: Weaviate 1.26.1
 - **Cache**: Redis 5.0 with connection pooling
 - **Message Queue**: Kafka (via Karafka 2.4)
 - **Background Jobs**: Sidekiq 7.0
@@ -125,7 +140,7 @@ DDoc Search is designed to handle document storage and search for multiple tenan
 ### Key Components

 - **Tenant Middleware**: Request-level tenant identification via API keys
-- **Circuit Breaker**: Prevents cascading failures to Elasticsearch
+- **Circuit Breaker**: Prevents cascading failures to Weaviate
 - **Rate Limiter**: Redis-based sliding window rate limiting per tenant
 - **Document Indexing**: Async Kafka-based indexing with automatic retries
 - **Search Analytics**: Background job processing for usage metrics
@@ -164,7 +179,7 @@ Ensure you have the following installed:
 - Ruby 3.3.0 or higher
 - Bundler 2.x
 - PostgreSQL (if migrating from SQLite)
-- Elasticsearch 8.0+
+- Weaviate 1.26+
 - Redis 5.0+
 - Kafka (Apache Kafka or compatible)

@@ -191,8 +206,8 @@ Ensure you have the following installed:
 # Database
 DATABASE_URL=sqlite3:storage/development.sqlite3

-# Elasticsearch
-ELASTICSEARCH_URL=http://localhost:9200
+# Weaviate
+WEAVIATE_URL=http://localhost:8080

 # Redis
 REDIS_URL=redis://localhost:6379/0
@@ -213,13 +228,13 @@ Ensure you have the following installed:
 rails db:seed # Optional: creates sample data
 ```

-5. **Configure Elasticsearch**
+5. **Configure Weaviate**

-Ensure Elasticsearch is running, then create the index:
+Ensure Weaviate is running, then create the schema:

 ```bash
 rails console
-> Document.__elasticsearch__.create_index! force: true
+> Document.ensure_weaviate_schema!
 ```

 6. **Start required services**
@@ -272,7 +287,7 @@ docker build -t ddoc_search .
 docker run -d \
 -p 80:80 \
 -e RAILS_MASTER_KEY=<value from config/master.key> \
--e ELASTICSEARCH_URL=http://elasticsearch:9200 \
+-e WEAVIATE_URL=http://weaviate:8080 \
 -e REDIS_URL=redis://redis:6379/0 \
 -e KAFKA_BROKERS=kafka:9092 \
 --name ddoc_search \
@@ -294,7 +309,7 @@ kamal deploy

 ### Application Configuration

-- [config/initializers/elasticsearch.rb](config/initializers/elasticsearch.rb) - Elasticsearch client configuration
+- [config/initializers/weaviate.rb](config/initializers/weaviate.rb) - Weaviate client configuration
 - [config/initializers/redis.rb](config/initializers/redis.rb) - Redis connection pool setup
 - [config/initializers/karafka.rb](config/initializers/karafka.rb) - Kafka consumer configuration
 - [config/initializers/sidekiq.rb](config/initializers/sidekiq.rb) - Sidekiq background job configuration
@@ -389,10 +404,10 @@ bundle exec brakeman
 ## Performance Features

 - **Caching**: Search results cached for 10 minutes, documents cached for 1 hour
-- **Circuit Breaker**: Automatic fallback to SQL search when Elasticsearch is unavailable
+- **Circuit Breaker**: Automatic fallback to SQL search when Weaviate is unavailable
 - **Rate Limiting**: Configurable per-tenant rate limits with Redis-backed sliding window
 - **Connection Pooling**: Redis connection pooling for efficient resource utilization
-- **Elasticsearch Optimization**: 10 shards, 2 replicas, custom analyzers with snowball stemming
+- **Weaviate BM25 Search**: Efficient keyword-based search with relevance scoring

 ## Monitoring

@@ -401,7 +416,7 @@ The application includes:
 - Health check endpoints at `/health` and `/up`
 - Search analytics tracking (query, results count, response time)
 - Lograge for structured logging
-- Circuit breaker metrics for Elasticsearch availability
+- Circuit breaker metrics for Weaviate availability

 ## License

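The README describes an automatic fallback to SQL search when Weaviate is unavailable. The project's own `CircuitBreaker` wrapper is not shown in this excerpt (the lockfile lists `circuitbox`), so the class below is not its implementation. It is a minimal, self-contained sketch of the pattern: after a run of failures the breaker opens and routes calls straight to the fallback until a cool-off period passes. `SimpleBreaker`, its thresholds, and the injectable clock are all hypothetical.

```ruby
# Hypothetical minimal circuit breaker with a fallback path.
class SimpleBreaker
  def initialize(failure_threshold: 3, cool_off: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @cool_off = cool_off
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  # Runs the block (the primary search); uses the fallback when the
  # breaker is open or the block raises.
  def call(fallback:)
    return fallback.call if open?

    begin
      result = yield
    rescue StandardError
      record_failure
      return fallback.call
    end

    @failures = 0
    result
  end

  def open?
    return false if @opened_at.nil?
    return true if @clock.call - @opened_at < @cool_off

    # Cool-off elapsed: close the breaker and allow a trial call.
    @opened_at = nil
    @failures = 0
    false
  end

  private

  def record_failure
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
  end
end

breaker = SimpleBreaker.new(failure_threshold: 2, cool_off: 10)
result = breaker.call(fallback: -> { :sql_results }) { raise "weaviate unreachable" }
puts result.inspect
```

Passing the clock in as a lambda keeps the cool-off logic testable without sleeping; a production breaker would also need thread safety, which this sketch leaves out.
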
app/controllers/health_controller.rb

Lines changed: 6 additions & 6 deletions
@@ -27,7 +27,7 @@ def show
   def check_dependencies
     {
       postgresql: check_postgresql,
-      elasticsearch: check_elasticsearch,
+      weaviate: check_weaviate,
       redis: check_redis,
       kafka: check_kafka
     }
@@ -45,15 +45,15 @@ def check_postgresql
     { status: "down", error: e.message } # Record the error message for debugging purposes.
   end

-  # Check Elasticsearch cluster health and response time.
-  def check_elasticsearch
+  # Check Weaviate health and response time.
+  def check_weaviate
     start = Time.current # Record the current time for latency calculation.

-    # Call the Elasticsearch client to retrieve cluster health information.
-    Elasticsearch::Model.client.cluster.health
+    # Call the Weaviate client to retrieve schema information to verify connectivity.
+    WEAVIATE_CLIENT.schema.get
     { status: "up", latency_ms: ((Time.current - start) * 1000).round(2) } # Successful execution, record latency.
   rescue => e
-    # Handle any errors during Weaviate connection or query execution.
+    # Handle any errors during Weaviate connection or query execution.
     { status: "down", error: e.message } # Record the error message for debugging purposes.
   end

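Each `check_*` method in this controller repeats the same shape: note the start time, call the dependency, and return either an "up" hash with latency or a "down" hash with the error. As a sketch only, that shape could be pulled into one helper. `timed_check` and `overall_status` are hypothetical names, not part of the commit, and the monotonic clock is a swapped-in choice (the commit uses `Time.current`) so that wall-clock adjustments cannot skew the latency figure.

```ruby
# Hypothetical helper: wraps any dependency probe and reports status plus latency.
def timed_check
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(2)
  { status: "up", latency_ms: elapsed_ms }
rescue StandardError => e
  { status: "down", error: e.message }
end

# Hypothetical roll-up: healthy only when every dependency reports "up".
def overall_status(checks)
  checks.values.all? { |check| check[:status] == "up" } ? "healthy" : "degraded"
end

checks = {
  weaviate: timed_check { :schema_fetched },          # stand-in for the real probe
  redis: timed_check { raise "connection refused" }   # simulated outage
}
puts overall_status(checks)
```

In the controller, the block passed to `timed_check` would be the existing probe call, for example the schema fetch shown in the diff.
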
app/controllers/search_controller.rb

Lines changed: 13 additions & 16 deletions
@@ -22,15 +22,15 @@ def index
     # Records the start time of the search operation for performance monitoring purposes.
     start_time = Time.current

-    # Performs the actual search on the documents using the Elasticsearch client.
+    # Performs the actual search on the documents using the Weaviate client.
     results = Document.search_for_tenant(@current_tenant.id, query, page: page, per_page: per_page)

     # Calculates the elapsed time since the search began in milliseconds.
     took_ms = ((Time.current - start_time) * 1000).round

-    # Handle both Elasticsearch and SQL results
+    # Handle both Weaviate and SQL results
     if results.respond_to?(:total)
-      # Elasticsearch results
+      # Weaviate results
       total = results.total
       documents = results.records
       SearchAnalyticsJob.perform_later(@current_tenant.id, query, total, took_ms)
@@ -66,29 +66,26 @@ def index
   private

   def format_search_result(document, search_results)
-    # Handle both Elasticsearch and SQL results
+    # Handle both Weaviate and SQL results
     if search_results.respond_to?(:response)
-      # Elasticsearch results - with highlighting and scores
-      highlight = search_results.response.dig("hits", "hits")
-        .find { |h| h["_id"] == document.id.to_s }
-        &.dig("highlight", "content")
-        &.first
+      # Weaviate results - with scores
+      weaviate_docs = search_results.response.dig("data", "Get", Document.weaviate_class_name) || []
+      weaviate_doc = weaviate_docs.find { |d| d["title"] == document.title }

-      score = search_results.response.dig("hits", "hits")
-        .find { |h| h["_id"] == document.id.to_s }
-        &.dig("_score")
+      score = weaviate_doc&.dig("_additional", "score")
+      # Weaviate doesn't provide highlighting by default, so we'll use truncated content
+      snippet = document.content.truncate(200)
     else
       # SQL fallback - no highlighting or scores
-      highlight = nil
+      snippet = document.content.truncate(200)
       score = nil
     end

-    # Formats a search result object with the ID, title, snippet (either highlighted content or truncated original content),
-    # score, and created at time of the matched document.
+    # Formats a search result object with the ID, title, snippet, score, and created at time
     {
       id: document.id,
       title: document.title,
-      snippet: highlight || document.content.truncate(200),
+      snippet: snippet,
       score: score,
       created_at: document.created_at
     }

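The new `format_search_result` pairs a database record with its Weaviate hit by comparing titles, where the Elasticsearch version compared IDs. Two documents that share a title would therefore receive the same hit and score. One possible alternative, sketched below, assumes each Weaviate object also stores the database ID in a `document_id` property. That property is hypothetical and is not confirmed by this diff, and `index_scores_by_document_id` is an illustrative name. The sketch also converts the score to a float, since it may arrive as a string.

```ruby
# Hypothetical helper: builds a { document_id => score } lookup once per response,
# instead of scanning the hit list by title for every document.
def index_scores_by_document_id(response, class_name)
  hits = response.dig("data", "Get", class_name) || []

  hits.each_with_object({}) do |hit, scores|
    id = hit["document_id"] # assumed property, see lead-in
    next if id.nil?

    raw = hit.dig("_additional", "score")
    scores[id.to_i] = raw.nil? ? nil : Float(raw, exception: false)
  end
end

response = {
  "data" => { "Get" => { "Document" => [
    { "document_id" => 1, "title" => "Report", "_additional" => { "score" => "1.25" } },
    { "document_id" => 2, "title" => "Report", "_additional" => { "score" => "0.5" } }
  ] } }
}

scores = index_scores_by_document_id(response, "Document")
puts scores.inspect
```

Building the lookup once also avoids the repeated linear `find` per formatted result.
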
app/jobs/index_document_job.rb

Lines changed: 14 additions & 7 deletions
@@ -1,6 +1,6 @@
 # frozen_string_literal: true

-# This class represents a job for indexing documents in Elasticsearch.
+# This class represents a job for indexing documents in Weaviate.
 # It takes care of checking document ownership, attempting to index it,
 # and logging various events throughout the process.
 class IndexDocumentJob < ApplicationJob
@@ -25,12 +25,18 @@ def perform(document_id, tenant_id)
       return
     end

+    # Ensure Weaviate schema exists before indexing
+    Document.ensure_weaviate_schema!
+
     # Use a circuit breaker to limit the number of concurrent indexing attempts.
-    CircuitBreaker.call(:elasticsearch) do
-      # Index the document using Elasticsearch's Ruby client. This is
-      # where the actual indexing happens, and it should succeed most
-      # of the time if everything's set up correctly.
-      document.__elasticsearch__.index_document
+    CircuitBreaker.call(:weaviate) do
+      # Index the document using Weaviate's Ruby client.
+      weaviate_object = document.to_weaviate_object
+
+      WEAVIATE_CLIENT.objects.create(
+        class_name: Document.weaviate_class_name,
+        properties: weaviate_object[:properties]
+      )

       # Update the document with a timestamp indicating when it was indexed.
       document.update_column(:indexed_at, Time.current)
@@ -42,7 +48,7 @@ def perform(document_id, tenant_id)
     # If the document doesn't exist in the database, log a warning and skip the indexing attempt.
     Rails.logger.warn("Document #{document_id} not found, skipping indexing")
   rescue => e
-    # Catch any other exceptions that might occur during indexing. This includes errors like network connectivity issues or Elasticsearch client errors.
+    # Catch any other exceptions that might occur during indexing. This includes errors like network connectivity issues or Weaviate client errors.
     Rails.logger.error("Failed to index document #{document_id}: #{e.message}")

     # If we've reached the maximum retry count (5 attempts), send a job to the dead-letter queue with the exception details.
@@ -61,3 +67,4 @@ def sidekiq_retry_count
     self.class.sidekiq_options_hash["retry_count"] || 0
   end
 end
+

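As shown, the job passes only a class name and properties to `objects.create`, so a retried or repeated job may well create a second Weaviate object for the same document. Weaviate's REST API accepts a client-supplied UUID as the object `id`; whether the project's `WEAVIATE_CLIENT` wrapper exposes that option is not visible in this diff. Assuming it can, one way to make indexing repeatable is to derive the UUID deterministically from the tenant and document IDs. The sketch below implements a name-based (version 5) UUID using only the standard library. The helper names and the choice of the RFC 4122 DNS namespace are illustrative assumptions.

```ruby
require "digest/sha1"

# RFC 4122 DNS namespace, used here only as a fixed, well-known namespace value.
NAMESPACE_DNS = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

# Name-based UUID (version 5): SHA-1 over namespace bytes plus name,
# truncated to 16 bytes with the version and variant bits set.
def uuid_v5(namespace_uuid, name)
  namespace_bytes = [namespace_uuid.delete("-")].pack("H*")
  bytes = Digest::SHA1.digest(namespace_bytes + name.to_s.b).bytes.first(16)
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # version 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # RFC 4122 variant
  hex = bytes.pack("C*").unpack1("H*")
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

# Hypothetical helper: the same tenant/document pair always maps to the same ID.
def weaviate_object_id(tenant_id, document_id)
  uuid_v5(NAMESPACE_DNS, "tenant:#{tenant_id}/document:#{document_id}")
end

puts weaviate_object_id(1, 42)
```

With a stable ID, a repeated job can update or replace the existing object rather than adding a duplicate, and the delete path can address the object without a lookup.
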