What is HMC?

HMC (Heterogeneous Memory Communication) is a communication framework for heterogeneous systems (CPU / GPU / accelerators). It provides:

  • Unified memory abstraction (Memory): a device-agnostic interface for allocating/freeing buffers and copying data across devices.
  • Unified registered I/O buffer (ConnBuffer): a stable, registered buffer used as the staging area for all network transfers.
  • Unified transport (Communicator): a single interface for one-sided data movement between peers’ buffers:
    • ConnType.RDMA (platform/device dependent)
    • ConnType.UCX (commonly used for CPU and some GPU-direct configurations)
  • Control plane for synchronization & small messages: a rank-based control channel over TCP and/or UDS (same host).

Supported devices

HMC supports CPU memory and multiple accelerator backends, including:

  • NVIDIA GPUs (CUDA)
  • AMD GPUs (ROCm)
  • Hygon platforms / GPUs (platform-dependent; enabled when the corresponding backend is available in your build)
  • Cambricon MLUs (CNRT / Neuware)
  • Moore Threads GPUs (MUSA)

Availability depends on how HMC is built (e.g., enabling CUDA/ROCm/CNRT/MUSA) and on the runtime/driver environment on your machine. We develop and test on NVIDIA ConnectX NICs, so the best performance and compatibility are expected in environments with similar hardware and driver configurations.

Core model

Data movement always happens between two peers’ registered buffers (ConnBuffer) using (offset, size). Your application decides offsets and uses a control message (tag) to coordinate.
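
In code terms, a transfer has three steps, as the sketch below shows (it uses the Python Session API from Part 1; session setup is omitted, and the offset, port, peer rank, and tag value are arbitrary illustration choices):

# 1) stage: copy payload bytes into the local registered buffer at a chosen offset
n = sess.buf.put(b"payload", offset=0)

# 2) move: one-sided put of local [0, n) into the peer's ConnBuffer at [0, n)
sess.put_remote(peer_ip, 2025, local_off=0, remote_off=0,
                nbytes=n, conn=hmc.ConnType.UCX)

# 3) coordinate: tell the peer (rank 0) which region is ready, using an agreed tag
sess.ctrl_send(peer=0, tag=1)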

Part 1 — Python (Recommended)

1. Installation (Python)

HMC Python bindings are built via pybind11 on top of the C++ core.

1.1 Prerequisites

  • Python 3.8+
  • C++14+
  • CMake ≥ 3.18
  • UCX (required for ConnType.UCX)
    • Install UCX yourself and place it under /usr/local/ucx
    • For NVIDIA / AMD GPU scenarios, it’s recommended to install a UCX build with GPU-Direct / GDR support enabled (and matched to your CUDA/ROCm toolchain)

Tip: Make sure the UCX runtime libraries are visible at runtime (e.g., /usr/local/ucx/lib is on the dynamic linker search path via LD_LIBRARY_PATH or system linker configuration).

1.2 Build & Install from Source

git clone https://github.com/IIC-SIG-MLsys/HMC.git
cd HMC
git submodule update --init --recursive

Build with Python module enabled:

mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_PYTHON_MOD=ON
make -j

Build wheel & install:

cd ..
pip install build   # the PyPA 'build' frontend, if not already installed
python -m build
pip install dist/hmc-*.whl

Verify:

python -c "import hmc; print('hmc imported:', hmc.__file__)"

1.3 (Optional) GPU/Accelerator Backends

If you enable the CUDA/ROCm/CNRT/MUSA backends, make sure the corresponding toolchain is visible at build time.


2. Key Concepts

  • Session is the recommended high-level API.

  • All transfers are buffer-staged:

    1. Copy payload into local ConnBuffer (via sess.buf.put(...))
    2. Send/receive bytes using sess.put_remote(...) / sess.get_remote(...)
    3. Copy bytes out of local ConnBuffer to your destination (via sess.buf.get_into(...))
  • Offsets are called bias/offset in docs and code.

3. Example: CPU-to-CPU Transfer (UCX)

3.1 Server

import hmc

ip = "192.168.2.244"
rank = 0

sess = hmc.create_session(
    device_id=0,
    buffer_size=128 * 1024 * 1024,
    mem_type=hmc.MemoryType.CPU,
    num_chs=1,
    local_ip=ip,
)

sess.init_server(
    bind_ip=ip,
    ucx_port=2025,
    rdma_port=2026,
    ctrl_tcp_port=5001,
    self_id=rank,
    ctrl_uds_dir="/tmp",
)

print("UCX server ready")

3.2 Client

import hmc

server_ip = "192.168.2.244"
my_ip = "192.168.2.248"

sess = hmc.create_session(mem_type=hmc.MemoryType.CPU, local_ip=my_ip)

sess.connect(
    peer_id=0,
    self_id=1,
    peer_ip=server_ip,
    data_port=2025,
    ctrl_tcp_port=5001,
    conn=hmc.ConnType.UCX,
)

payload = b"hello hmc"

# stage into local ConnBuffer
n = sess.buf.put(payload, offset=0)

# put local [0, n) -> remote [0, n)
sess.put_remote(server_ip, 2025, local_off=0, remote_off=0, nbytes=n, conn=hmc.ConnType.UCX)
print("sent", n, "bytes")

3.3 Coordinating “data ready” (recommended)

You typically add a simple tag handshake so the server knows when to read:

# sender side
sess.ctrl_send(peer=0, tag=1)
# receiver side
tag = sess.ctrl_recv(peer=1)
assert tag == 1
# then read data from server-side ConnBuffer using sess.buf.get_into(...)

Your application should define a protocol: which offset holds which message, and what tags mean.
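
For example, a small shared set of constants can pin down the offset layout and tag meanings both sides rely on. The sketch below is illustrative only: the offsets, tag values, and the 8-byte length header are application choices, not anything HMC mandates.

# protocol constants shared by both sides (illustrative choices)
HEADER_OFF = 0          # 8-byte little-endian payload length lives here
PAYLOAD_OFF = 64        # payload bytes start here
TAG_DATA_READY = 1      # sender -> receiver: payload is in place
TAG_ACK = 2             # receiver -> sender: safe to overwrite the region

# sender (client, rank 1)
import struct
sess.buf.put(struct.pack("<Q", len(payload)), offset=HEADER_OFF)
n = sess.buf.put(payload, offset=PAYLOAD_OFF)
# one put covers header + gap + payload: local [0, PAYLOAD_OFF + n) -> remote same range
sess.put_remote(server_ip, 2025, local_off=0, remote_off=0,
                nbytes=PAYLOAD_OFF + n, conn=hmc.ConnType.UCX)
sess.ctrl_send(peer=0, tag=TAG_DATA_READY)

# receiver (server, rank 0): wait for the tag, read the header, then the payload
import numpy as np, struct
assert sess.ctrl_recv(peer=1) == TAG_DATA_READY
hdr = np.empty(8, dtype=np.uint8)
sess.buf.get_into(hdr, nbytes=8, offset=HEADER_OFF)
nbytes = struct.unpack("<Q", hdr.tobytes())[0]
data = np.empty(nbytes, dtype=np.uint8)
sess.buf.get_into(data, nbytes=nbytes, offset=PAYLOAD_OFF)
sess.ctrl_send(peer=1, tag=TAG_ACK)

The receiver’s ack lets the sender safely overwrite [0, PAYLOAD_OFF + n) for the next message.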

4. NumPy & PyTorch (CPU)

4.1 NumPy Send/Recv

import hmc, numpy as np

x = np.arange(1024, dtype=np.int32)

n = sess.buf.put(x, offset=0)
sess.put_remote(server_ip, 2025, local_off=0, remote_off=0, nbytes=n, conn=hmc.ConnType.UCX)

y = np.empty_like(x)
sess.get_remote(server_ip, 2025, local_off=0, remote_off=0, nbytes=y.nbytes, conn=hmc.ConnType.UCX)
sess.buf.get_into(y, nbytes=y.nbytes, offset=0)

Notes:

  • Source arrays should be C-contiguous; the wrapper may copy if they are not (see the sketch after these notes).
  • Destination arrays must be writable and contiguous.
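
If a source array is a strided view, make a contiguous copy before staging it. A minimal sketch (np.ascontiguousarray is standard NumPy, not an HMC API):

import numpy as np

view = np.arange(2048, dtype=np.int32)[::2]   # strided view: not C-contiguous
src = np.ascontiguousarray(view)              # densify before staging
n = sess.buf.put(src, offset=0)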

4.2 PyTorch CPU Tensor

import torch

t = torch.arange(4096, dtype=torch.int32)
n = sess.buf.put(t, offset=0)

sess.put_remote(server_ip, 2025, local_off=0, remote_off=0, nbytes=n, conn=hmc.ConnType.UCX)

out = torch.empty_like(t)
sess.get_remote(server_ip, 2025, local_off=0, remote_off=0, nbytes=out.numel() * out.element_size(), conn=hmc.ConnType.UCX)
sess.buf.get_into(out, nbytes=out.numel() * out.element_size(), offset=0)
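
Because transfers are addressed purely by (offset, size), several tensors can share one staging buffer. A sketch (the offsets here are application choices, and the single put_remote assumes both regions fit within the remote ConnBuffer):

import torch

a = torch.ones(1024, dtype=torch.float32)
b = torch.zeros(256, dtype=torch.int64)

off_a = 0
off_b = a.numel() * a.element_size()          # pack b right after a

sess.buf.put(a, offset=off_a)
sess.buf.put(b, offset=off_b)

# one put covers both regions: local [0, total) -> remote [0, total)
total = off_b + b.numel() * b.element_size()
sess.put_remote(server_ip, 2025, local_off=0, remote_off=0,
                nbytes=total, conn=hmc.ConnType.UCX)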

5. GPU Tensors (Advanced)

HMC can stage CUDA tensors via writeFromGpu/readToGpu, using the tensor's device pointer (data_ptr()) internally.

import torch

t = torch.randn(1024 * 1024, device="cuda")
n = sess.buf.put(t, offset=0, device="cuda")  # GPU -> ConnBuffer

sess.put_remote(server_ip, 2026, local_off=0, remote_off=0, nbytes=n, conn=hmc.ConnType.RDMA)

recv = torch.empty_like(t)
sess.get_remote(server_ip, 2026, local_off=0, remote_off=0, nbytes=n, conn=hmc.ConnType.RDMA)
sess.buf.get_into(recv, nbytes=n, offset=0, device="cuda")

Caveats:

  • GPU-direct depends on platform/driver/NIC/UCX/RDMA configuration.
  • Start with CPU path first to validate protocol and correctness.
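
While validating your protocol, or when GPU-direct is unavailable on your platform, you can fall back to the CPU path by copying the tensor to host memory first (a minimal sketch reusing the CPU/UCX pattern from Section 4):

import torch

t = torch.randn(1024 * 1024, device="cuda")
host = t.cpu()                                 # explicit device-to-host copy
n = sess.buf.put(host, offset=0)               # stage from host memory

sess.put_remote(server_ip, 2025, local_off=0, remote_off=0,
                nbytes=n, conn=hmc.ConnType.UCX)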

6. Troubleshooting (Python)

Common errors

  • Buffer overflow: offset + nbytes > buffer_size
    • Increase buffer_size or adjust offsets (a simple bounds check is sketched after this list)
  • Not connected: connect() not called or wrong data_port/conn
  • Ctrl mismatch (UDS vs TCP)
    • Use CTRL_TRANSPORT=tcp or CTRL_TRANSPORT=uds to force a choice
  • UDS path confusion
    • UDS uses a full socket file path, not a directory
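
A minimal bounds check before staging or issuing a transfer can catch overflows early (a sketch; buffer_size is whatever you passed to create_session, and the application is responsible for tracking it):

def check_range(offset, nbytes, buffer_size):
    # Reject any region that falls outside the registered ConnBuffer.
    if offset < 0 or nbytes < 0 or offset + nbytes > buffer_size:
        raise ValueError(
            f"[{offset}, {offset + nbytes}) exceeds the {buffer_size}-byte ConnBuffer"
        )

check_range(0, len(payload), 128 * 1024 * 1024)   # e.g., before the put in Section 3.2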

Best practices

  • Define a clear protocol (offset layout + tag meanings).
  • Avoid overwriting remote offsets without an ack handshake.
  • Use put_nb/get_nb + wait when you need overlap or concurrent transfers (see the sketch after this list).
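
A sketch of overlapping two transfers, assuming put_nb mirrors put_remote's arguments and that wait() blocks until outstanding transfers complete (the exact signatures may differ; check the bindings):

chunk_a = b"x" * 4096
chunk_b = b"y" * 4096

sess.buf.put(chunk_a, offset=0)
sess.buf.put(chunk_b, offset=len(chunk_a))

# Issue both puts without blocking, then wait before reusing the staged regions.
sess.put_nb(server_ip, 2025, local_off=0, remote_off=0,
            nbytes=len(chunk_a), conn=hmc.ConnType.UCX)
sess.put_nb(server_ip, 2025, local_off=len(chunk_a), remote_off=len(chunk_a),
            nbytes=len(chunk_b), conn=hmc.ConnType.UCX)
sess.wait()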

Part 2 — C++ Core Library (Advanced / Integration)

7. Installation (C++)

7.1 Prerequisites

  • C++14+
  • CMake ≥ 3.18

Optional:

sudo apt-get install -y libgtest-dev

7.2 Build

git clone https://github.com/IIC-SIG-MLsys/HMC.git
cd HMC
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j

8. Usage (C++)

8.1 Core Components

  • Memory: unified allocate/copy across devices
  • ConnBuffer: stable registered buffer; all transfers use (offset, size)
  • CtrlSocketManager: ctrl plane, rank-based, TCP/UDS, HELLO binding
  • Communicator: data plane + ctrl integration: put/get/putNB/getNB/wait, connectTo/initServer

8.2 Minimal Data-Plane Example (UCX put)

#include <hmc.h>
#include <memory>
#include <vector>
#include <string>

using namespace hmc;

int main() {
  auto buf = std::make_shared<ConnBuffer>(0, 128ull * 1024 * 1024, MemoryType::CPU);
  Communicator comm(buf);

  std::string server_ip = "192.168.2.244";
  uint16_t data_port = 2025;
  uint16_t ctrl_tcp_port = 5001;

  Communicator::CtrlLink link;
  link.transport = Communicator::CtrlTransport::TCP;
  link.ip = server_ip;
  link.port = ctrl_tcp_port;

  comm.connectTo(/*peer_id=*/0, /*self_id=*/1, server_ip, data_port, link, ConnType::UCX);

  std::vector<char> payload(1024, 'A');
  buf->writeFromCpu(payload.data(), payload.size(), 0);

  comm.put(server_ip, data_port, /*local_off=*/0, /*remote_off=*/0, payload.size(), ConnType::UCX);
  return 0;
}

9. Appendix: API Cheatsheet

Python

  • create_session(...) -> Session
  • Session.init_server(...)
  • Session.connect(...)
  • Session.put_remote(...) / get_remote(...)
  • Session.put_nb(...) / get_nb(...) / wait(...)
  • Session.ctrl_send(...) / ctrl_recv(...)
  • sess.buf.put(...) / sess.buf.get_into(...) / buffer_copy_within(...)

C++

  • ConnBuffer::{writeFromCpu/readToCpu/writeFromGpu/readToGpu/copyWithin}
  • Communicator::{initServer/connectTo/put/get/putNB/getNB/wait/ctrlSend/ctrlRecv}
  • CtrlSocketManager::{start/connectTcp/connectUds/sendU64/recvU64/...}

© 2025 SDU spgroup Holding Limited. All rights reserved.
