Skip to content

Incremental learning with CatBoost and Ray for scalable training, tuning, and serving of large ML models

License

Notifications You must be signed in to change notification settings

xRiskLab/catboost-incremental

Repository files navigation

catboost-incremental

Python 3.9 Python 3.10 Python 3.11 Python 3.12 CatBoost Ray License

πŸ“š Table of Contents

πŸͺͺ Introduction

This repo contains a machine learning (ML) training and serving system for CatBoost + Ray incremental learning. CatBoost and Ray are a powerful combination for training and serving machine learning models.

CatBoost is a gradient boosting library that is particularly effective for categorical / text / embedding features, while Ray is a distributed computing framework that allows you to scale your machine learning workloads.

The package serves to address the following use cases:

  • Training on datasets that do not fit into memory (e.g., S3 parquet).
  • Tuning hyperparameters with Ray Tune on large datasets.
  • Serving incrementally trained CatBoost models with Ray Serve.

βš™οΈ Prerequisites

  • Python 3.9+
  • Ray >= 2.0
  • uv
  • Data in Parquet format (partitioned)
  • Docker (optional)

πŸš€ Getting Started

To train your first model, just point DataLoader to a local or S3 parquet directory and call trainer.train(). See Training for full examples.

πŸ—‚ Dataset preparation

Your dataset should be partitioned in Parquet files, with a partition column (e.g., partition_id).

Note if label_col is not provided to DataLoader, the last column is used as label.

You can generate the dataset using the following code:

uv run tools/generate_data.py

πŸ”„ Incremental learning

CatBoost supports incremental learning mode, allowing to train models on a large data chunks. This is useful when the dataset is too large to fit into memory all at once.

In CatBoost, incremental training is supported on CPU starting from version 0.15. GPU issue is still open.

The implementation of Ray Tuner is adopted from "Example of using Ray Tune and CatBoost" by Richard Decal (GitHub code).

The serving API is the author's implementation and is not based on any existing code.

πŸ’» Installation

Install from GitHub with pip:

pip install git+https://github.com/xRiskLab/catboost-incremental.git

Or install from a local clone using uv:

uv pip install -e .

Examples of package usage are provided in the notebooks/ directory.

πŸ’» Training

When reading parquet data with DataLoader, use the use_cols argument to specify the features and label to include in the training pool.

Make sure the label column is:

  • included in use_cols
  • explicitly passed via label_col if it is not the last column in the dataset.

πŸ“ Local Parquet

We can load the dataset from a local directory. The DataLoader will read the data in chunks and train the model incrementally.

import pyarrow.dataset as ds
from catboost_incremental.catboost_trainer import CatBoostTrainer
from catboost_incremental.data_loader import DataLoader

# Load full dataset
dataset_path = "../data/"
dataset = ds.dataset(dataset_path)
full_df = dataset.to_table().to_pandas().sample(1000)
label = "target"

data_loader = DataLoader(
    dataset_path, chunk_size=1000, partition_id_col="partition_id", label_col=label
)
trainer = CatBoostTrainer(
    data_loader=data_loader,
    label_col=label,
    model_config={"allow_writing_files": False},
)

model = trainer.train()  # returns trained CatBoost model artifact
score = trainer.evaluate(full_df)
print(f"Accuracy: {score:.4f}")

☁️ Using with AWS S3

When loading from S3, pass an active boto3.Session to the DataLoader:

import boto3

session = boto3.Session(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    region_name="your-region",
)

data_loader = DataLoader(
    dataset_path="s3://your-bucket/path/to/data/",
    chunk_size=1000,
    partition_id_col="partition_id",
    label_col="target",
    boto3_session=session,
)

πŸͺ› Tuning

from pathlib import Path
import pyarrow.dataset as ds
from ray import tune
from catboost_incremental.catboost_trainer import CatBoostTrainer
from catboost_incremental.catboost_tuner import CatBoostTuner
from catboost_incremental.data_loader import DataLoader

dataset_path = str(Path("../data/").resolve()) + "/"

dataset = ds.dataset(dataset_path)
full_df = dataset.to_table().to_pandas()

label = "target"

# Setup DataLoader
data_loader = DataLoader(
    dataset_path,
    chunk_size=10_000,
    partition_id_col="partition_id",
    label_col=label,
)

# Create data_loader and trainer
data_loader = DataLoader(
    dataset_path, chunk_size=10_000, partition_id_col="partition_id", label_col=label
)
trainer = CatBoostTrainer(data_loader=data_loader, label_col=label)

# Generate train_data generator and test set
train_data = trainer.data_loader.read_parquet()
test_df = ds.dataset(dataset_path).to_table().to_pandas()

# Initialize tuner
tuner = CatBoostTuner(trainer=trainer, metric="accuracy")

result = tuner.tune(
    param_space={
        "iterations": tune.choice([50, 100]),
        "learning_rate": tune.uniform(0.01, 0.3),
        "depth": tune.choice([4, 6, 8]),
        "verbose": 0,
        "allow_writing_files": False,
    },
    num_samples=5,
)

print(f"Best config: {result.config}")

🧬 Serving

πŸ““ Jupyter notebooks

If you're running inside a Jupyter notebook, you can serve the model using:

# Start Ray and Serve
ray.init(ignore_reinit_error=True)

# Deploy and run the app
app = CatBoostModelDeployment.bind(model_path="models/cb_model.cbm")
serve.start(detached=True, http_options={"host": "0.0.0.0", "port": 8000})
serve.run(app, route_prefix="/predict")

πŸ–₯️ Command line

You can start Ray cluster from the command line:

source .venv/bin/activate
ray start --head

You should see in the terminal:

Local node IP: 127.0.0.1

--------------------
Ray runtime started.
--------------------

Start the Ray Serve API server by running:

uv run catboost_incremental/serve_ray.py

You should see:

2025-04-01 01:06:29,822 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
2025-04-01 01:06:29,826 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265

Test the endpoint with curl:

curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"0": 1.2, "1": 3.4, "2": 0.0, "3": 2.1, "4": 5.5, "5": 0.7, "6": 1.0, "7": 4.2, "8": 0.9, "9": 3.3}'

Stop the server with:

ray stop

You should see:

{"proba":[0.8514547423772816,0.14854525762271836]}

Docker

docker build -t ray-catboost-serve .

And run the container:

docker run -d -p 8000:8000 -p 8265:8265 ray-catboost-serve

Send a request to the server:

curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"0": 1.2, "1": 3.4, "2": 0.0, "3": 2.1, "4": 5.5, "5": 0.7, "6": 1.0, "7": 4.2, "8": 0.9, "9": 3.3}'

You should see:

{"proba":[0.8514547423772816,0.14854525762271836]}

To stop Docker (and Ray) run:

docker stop <container_id>

For example, my container ID is 'a5ae3ebc6f' so I run:

docker stop a5ae3ebc6f

Note: Docker containers exit by default once the entrypoint script finishes execution. To keep the container alive, the entrypoint script (entrypoint.sh) ends with: exec tail -f /dev/null (See this Stack Overflow thread).

πŸ“„ License

This project is licensed under the MIT License. See the LICENSE file for details.

About

Incremental learning with CatBoost and Ray for scalable training, tuning, and serving of large ML models

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published