A Telegram bot for music discovery using semantic similarity search with FAISS indexing and Weaviate vector database.
- Data Source: YouTube Creative Commons music dataset from HuggingFace
- Processing Pipeline:
  - Data cleaning and preprocessing in main.ipynb
  - Embedding generation using SentenceTransformers
  - Dimensionality reduction with PCA for efficient storage
- Cleaned dataset with reduced embeddings for similarity search
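
The dimensionality reduction step can be illustrated without any dependencies. The project lists scikit-learn for this; the sketch below is not the project's code but a minimal pure-Python stand-in that finds only the first principal component by power iteration. All function names and the toy rows are assumptions made for illustration.

```python
import math

def mean_center(rows):
    """Subtract the per-column mean so the data is centered at the origin."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def top_component(cov, iterations=200):
    """Power iteration: multiply by the covariance matrix and renormalize
    until the vector settles on the direction of largest variance."""
    dims = len(cov)
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec

def project(centered, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]

# Toy 3-dimensional "embeddings" (made up): the second column is roughly
# twice the first, the third is small noise.
rows = [[1.0, 2.0, 0.0], [2.0, 4.1, 0.1], [3.0, 5.9, -0.1], [4.0, 8.0, 0.0]]
centered = mean_center(rows)
component = top_component(covariance(centered))
reduced = project(centered, component)  # one value per row instead of three
```

In practice a library PCA would keep several components rather than one; the idea of projecting centered embeddings onto the directions of largest variance is the same.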
- FAISS Index: IndexIVFFlat with Inner Product (IP) distance for fast approximate nearest neighbor search
- Index Configuration:
  - Uses inverted file structure with adaptive nlist (sqrt of dataset size)
  - L2-normalized embeddings with Inner Product similarity (equivalent to cosine similarity on unit vectors)
  - Chosen for balance between speed and accuracy on medium-scale datasets
- Index Generation: make_index.py creates an optimized FAISS index from embeddings
- Similarity Engine: similarity.py provides search functionality with:
  - Hybrid search combining semantic and lexical matching
  - Query expansion using WordNet synonyms
  - Integration with Weaviate vector database
- Advantages: Sub-linear search time, memory efficient, supports clustering-based search
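
To make the index configuration above concrete, here is a toy inverted-file search in pure Python. It is not FAISS and not the code in make_index.py; it only mirrors the ideas described: nlist chosen as the square root of the dataset size, L2-normalized vectors scored by inner product, and a search that visits only a few cells. The function names, the random data, and the shortcut of sampling centroids from the data (FAISS trains them with k-means) are assumptions for illustration. `nprobe` borrows the FAISS name for the number of cells visited per query.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def choose_nlist(n_vectors):
    """Adaptive number of cells: floor of sqrt(dataset size), at least 1."""
    return max(1, int(math.sqrt(n_vectors)))

def build_toy_ivf(vectors, seed=0):
    """Assign every vector to the cell of its most similar centroid.
    Assumption: centroids are sampled from the data instead of trained."""
    nlist = choose_nlist(len(vectors))
    centroids = random.Random(seed).sample(vectors, nlist)
    cells = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        best = max(range(nlist), key=lambda c: inner_product(vec, centroids[c]))
        cells[best].append(idx)
    return centroids, cells

def search(query, vectors, centroids, cells, k=3, nprobe=2):
    """Score only the vectors stored in the nprobe most similar cells."""
    q = l2_normalize(query)
    ranked_cells = sorted(range(len(centroids)),
                          key=lambda c: inner_product(q, centroids[c]),
                          reverse=True)
    candidates = [i for c in ranked_cells[:nprobe] for i in cells[c]]
    scored = sorted(((inner_product(q, vectors[i]), i) for i in candidates),
                    reverse=True)
    return scored[:k]

# 100 random unit vectors stand in for reduced song embeddings.
rng = random.Random(42)
data = [l2_normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(100)]
centroids, cells = build_toy_ivf(data)

# Querying with a stored vector returns that vector first, with score ~1.0.
results = search(data[7], data, centroids, cells, k=3, nprobe=2)
```

Only the vectors in the probed cells are scored, which is where the sub-linear search time comes from; raising `nprobe` trades speed for recall.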
- Bot Implementation: main.py handles user interactions and search requests
- Logging System: search_logger.py tracks user queries and system performance
- Features:
  - Natural language music search
  - Query processing and result filtering
  - User session management
  - Comprehensive logging for analytics
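
Query processing and result filtering, together with the hybrid search and synonym expansion attributed to similarity.py, might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the `SYNONYMS` table is a hypothetical stand-in for WordNet lookups, the weighted sum with `alpha` is one common way to blend semantic and lexical scores, and the semantic scores in the example are made-up inputs.

```python
# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "calm": ["relaxing", "peaceful"],
    "happy": ["cheerful", "upbeat"],
}

def expand_query(query):
    """Lowercase, tokenize, and append known synonyms without duplicates."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for tok in tokens:
        for syn in SYNONYMS.get(tok, []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded

def lexical_score(query_tokens, text):
    """Fraction of query tokens that appear as words in the text."""
    if not query_tokens:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for tok in query_tokens if tok in words) / len(query_tokens)

def hybrid_score(semantic, lexical, alpha=0.7):
    """Weighted blend; alpha is the weight given to the semantic score."""
    return alpha * semantic + (1 - alpha) * lexical

def rank(query, candidates, alpha=0.7, min_score=0.0):
    """candidates: (title, semantic_score) pairs, e.g. from a vector search.
    Returns (hybrid_score, title) pairs, best first, filtered by min_score."""
    tokens = expand_query(query)
    scored = [(hybrid_score(sem, lexical_score(tokens, title), alpha), title)
              for title, sem in candidates]
    return sorted((s for s in scored if s[0] >= min_score), reverse=True)

# Made-up semantic scores for three hypothetical tracks.
candidates = [
    ("Relaxing Piano Session", 0.60),
    ("Heavy Metal Riffs", 0.65),
    ("Calm Ocean Waves", 0.55),
]
ranked = rank("calm piano", candidates)
top_only = rank("calm piano", candidates, min_score=0.5)
```

Here the expansion of "calm" to "relaxing" lets the first track gain lexical credit, so it outranks the track with the higher raw semantic score.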
- Start the Weaviate database:
  docker-compose up -d
  (Also run main.ipynb in the dataset folder to populate the database.)
- Run the Telegram bot:
  cd tg
  python3 main.py
- FAISS for similarity search
- Weaviate for vector database
- SentenceTransformers for embeddings
- python-telegram-bot for bot functionality
- scikit-learn for dimensionality reduction