Skip to content

Latest commit

Β 

History

History
151 lines (115 loc) Β· 4.67 KB

File metadata and controls

151 lines (115 loc) Β· 4.67 KB

MLX-MemOS

MLX-MemOS is a high-performance LLM serving and RAG (Retrieval-Augmented Generation) infrastructure toolkit optimized for Apple Silicon (macOS). Built on top of Apple's MLX framework, it provides a seamless experience for running large language models (like Qwen3) and embedding/reranking models locally with OpenAI-compatible APIs.

πŸš€ Key Features

  • Apple Silicon Optimized: Leverages MLX for efficient inference on Mac devices (M1/M2/M3/M4).
  • OpenAI Compatible: Provides a drop-in replacement for OpenAI's Chat Completions API.
  • RAG Ready: Includes a dedicated server for Embeddings (bge-m3) and Reranking (bge-reranker-v2-m3).
  • Model Management: Ready-to-use scripts for converting and managing Qwen3 models (0.6B, 4B, 8B, 14B).
  • Production Friendly: Includes startup/shutdown scripts, PID management, and logging.
  • Benchmarking: Built-in tools to stress test and verify model performance.

πŸ“‹ Prerequisites

  • macOS 13.0+ (Ventura or later recommended)
  • Python 3.10+
  • Apple Silicon (M-series chip)

πŸ› οΈ Installation

  1. Clone the repository

    git clone https://github.com/yourusername/MLX-MemOS.git
    cd MLX-MemOS
  2. Create and activate a virtual environment

    python3 -m venv .venv
    source .venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt

πŸ—οΈ Model Preparation

41β†’ 42β†’MLX-MemOS expects models to be placed in the models/ directory. 43β†’ 44β†’### ⚠️ Important: Restore Large Models 45β†’ 46β†’Due to GitHub's file size limits, some large model files (over 2GB) are split into chunks. You must run the following command after cloning to restore them: 47β†’ 48β†’bash 49β†’./script/manage_large_files.sh merge 50β†’ 51β†’ 52β†’This will reassemble files like pytorch_model.bin and model.safetensors from their split parts. Specifically, it handles:

  • models/bge-m3/pytorch_model.bin
  • models/bge-reranker-v2-m3/model.safetensors
  • models/Qwen3-8B-MLX/model.safetensors
  • models/Qwen3-4B-MLX/model.safetensors
  • models/Qwen3-14B-MLX/model-00001-of-00002.safetensors
  • models/Qwen3-14B-MLX/model-00002-of-00002.safetensors 53β†’ 54β†’### Download & Convert Models 55β†’ 56β†’We also provide scripts to help you convert Hugging Face models to MLX format.
# Example: Convert Qwen3-14B
./script/convert_qwen3_14b.sh

# Example: Convert Qwen3-8B
./script/convert_qwen3_8b.sh

Ensure you have sufficient disk space and memory for the conversion process.

🚦 Usage

1. Start the LLM Server (Chat Completions)

This starts an OpenAI-compatible server hosting the LLM (default: Qwen3-14B-MLX).

./script/start_mlx_server.sh start
  • Port: 8080
  • Endpoint: http://127.0.0.1:8080/v1/chat/completions
  • Logs: logs/mlx_server.log

To stop or restart:

./script/start_mlx_server.sh stop
./script/start_mlx_server.sh restart
./script/start_mlx_server.sh status

2. Start the Embedding & Rerank Server

This starts a separate server for text embeddings and document reranking.

./script/start_embedding_server.sh start
  • Port: 8081
  • Endpoints:
    • Embeddings: http://127.0.0.1:8081/v1/embeddings
    • Rerank: http://127.0.0.1:8081/v1/rerank
  • Logs: logs/embedding_server.log

3. Verification

Verify that the servers are running correctly:

# Verify LLM Server
./script/verify_mlx_curl.sh
# OR using Python script
python script/verify_mlx_server.py

# Verify Embedding/Rerank Server
./script/verify_embedding_server.sh

πŸ“Š Benchmarking

You can benchmark the performance of the LLM server using the included Python script:

python script/benchmark_mlx.py

This script will simulate concurrent requests and report token generation speeds (TPS) and latency metrics.

πŸ“‚ Project Structure

MLX-MemOS/
β”œβ”€β”€ models/                 # Model checkpoints (MLX format)
β”œβ”€β”€ script/                 # Operation scripts
β”‚   β”œβ”€β”€ start_mlx_server.sh         # Manage LLM server
β”‚   β”œβ”€β”€ start_embedding_server.sh   # Manage Embedding server
β”‚   β”œβ”€β”€ convert_*.sh                # Model conversion scripts
β”‚   β”œβ”€β”€ verify_*.sh                 # Verification scripts
β”‚   └── benchmark_mlx.py            # Performance testing
β”œβ”€β”€ logs/                   # Server logs
β”œβ”€β”€ requirements.txt        # Python dependencies
└── README.md               # Project documentation

πŸ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.