88 changes: 0 additions & 88 deletions crates/cudnn/README.md

This file was deleted.

131 changes: 130 additions & 1 deletion crates/cudnn/src/lib.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,134 @@
//! # cudnn
//! Type safe cuDNN wrapper for the Rust programming language.
//!
//! ## Project status
//!
//! The current version of cuDNN targeted by this wrapper is 8.3.2. You can refer to the official
//! [release notes] and to the [support matrix] by NVIDIA.
//!
//! [release notes]: https://docs.nvidia.com/deeplearning/cudnn/release-notes/index.html
//! [support matrix]: https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html
//!
//! The legacy API is mostly complete and usable, but the backend API is still a work in progress
//! and its usage is discouraged. Both APIs are under active development, so expect bugs and
//! occasional breaking changes whilst using this crate.
//!
//! The project is part of the Rust CUDA ecosystem and is actively maintained by
//! [frjnn](https://github.com/frjnn).
//!
//! ## Primer
//!
//! What follows is a list of key concepts that should serve as a handbook for users of the crate.
//! It is not intended to be the full documentation, as each wrapped struct, enum and function has
//! its own docs, but rather a quick summary of the key points of the API. For a deeper view, you
//! should refer both to the docs of each item and to the [official ones] by NVIDIA. Furthermore,
//! if you are new to cuDNN we strongly suggest reading the [official developer
//! guide].
//!
//! [official ones]: https://docs.nvidia.com/deeplearning/cudnn/api/index.html
//! [official developer guide]: https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#overview
//!
//! ### Device buffers
//!
//! This crate is built around [`cust`](https://docs.rs/cust/latest/cust/index.html), our wrapper
//! of choice for interfacing with the CUDA driver API. Device memory is allocated and transferred
//! through its buffer types, such as `DeviceBuffer`.
//!
//! ### cuDNN statuses and Result
//!
//! All cuDNN library functions return their status. This crate checks those statuses and exposes
//! them through [`Result`](https://doc.rust-lang.org/std/result/enum.Result.html), giving a
//! leaner, more idiomatic and easier to manage API.
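//!
//! For example, a fallible setup function can propagate any cuDNN status with the `?` operator.
//! This is a minimal sketch and assumes the crate exports its error type as `CudnnError`:
//!
//! ```rust
//! use cudnn::{CudnnContext, CudnnError};
//!
//! // Create a context, propagating any cuDNN failure to the caller.
//! fn create_context() -> Result<CudnnContext, CudnnError> {
//!     let ctx = CudnnContext::new()?;
//!     Ok(ctx)
//! }
//! ```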
//!
//! ### cuDNN handles and RAII
//!
//! The main entry point of the cuDNN library is the `CudnnContext` struct. This handle is tied to
//! a device and is explicitly passed to every subsequent library function that operates on GPU
//! data. It manages resource allocations both on the host and the device and takes care of the
//! synchronization of all the cuDNN primitives.
//!
//! The handles, and the other cuDNN structs wrapped by this crate, are implementors of the
//! [`Drop`](https://doc.rust-lang.org/std/ops/trait.Drop.html) trait which implicitly calls their
//! destructors on the cuDNN side when they go out of scope.
//!
//! cuDNN contexts can be created as shown in the following snippet:
//!
//! ```rust
//! use cudnn::CudnnContext;
//!
//! let ctx = CudnnContext::new().unwrap();
//! ```
//!
//! ### cuDNN data types
//!
//! In order to enforce type safety as much as possible at compile time, we shifted away from the
//! original cuDNN enumerated data types and instead opted to leverage Rust's generics. In
//! practice, this means that specifying the data type of a cuDNN tensor descriptor is done as
//! follows:
//!
//! ```rust
//! use cudnn::{CudnnContext, TensorDescriptor};
//!
//! let ctx = CudnnContext::new().unwrap();
//!
//! let shape = &[5, 5, 10, 25];
//! let strides = &[1250, 250, 25, 1];
//!
//! // f32 tensor
//! let desc = TensorDescriptor::<f32>::new_strides(shape, strides).unwrap();
//! ```
//!
//! This API also allows for using Rust's own types as cuDNN data types, which we see as a
//! desirable property.
//!
//! Safely manipulating cuDNN data types that do not have any such direct match, such as vectorized
//! ones, whilst still performing compile time compatibility checks can be done as follows:
//!
//! ```rust
//! use cudnn::{CudnnContext, TensorDescriptor, Vec4};
//!
//! let ctx = CudnnContext::new().unwrap();
//!
//! let shape = &[4, 32, 32, 32];
//!
//! // in cuDNN this is equal to the INT8x4 data type and CUDNN_TENSOR_NCHW_VECT_C format
//! let desc = TensorDescriptor::<i8>::new_vectorized::<Vec4>(shape).unwrap();
//! ```
//!
//! The previous tensor descriptor can be used together with an `i8` device buffer, and cuDNN will
//! see it as being a tensor of `CUDNN_TENSOR_NCHW_VECT_C` format and `CUDNN_DATA_INT8x4` data
//! type.
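//!
//! As a sketch, allocating a matching device buffer with `cust` might look like the following;
//! the use of `cust::quick_init` and `DeviceBuffer::zeroed` here is illustrative:
//!
//! ```rust
//! use cudnn::{CudnnContext, TensorDescriptor, Vec4};
//! use cust::memory::DeviceBuffer;
//!
//! // A CUDA context must be alive before allocating device memory.
//! let _cuda_ctx = cust::quick_init().unwrap();
//! let ctx = CudnnContext::new().unwrap();
//!
//! let shape = &[4, 32, 32, 32];
//! let desc = TensorDescriptor::<i8>::new_vectorized::<Vec4>(shape).unwrap();
//!
//! // One i8 element per logical tensor element: 4 * 32 * 32 * 32.
//! let buf = DeviceBuffer::<i8>::zeroed(4 * 32 * 32 * 32).unwrap();
//! ```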
//!
//! Currently this crate does not support `f16` and `bf16` data types.
//!
//! ### cuDNN tensor formats
//!
//! We decided not to check tensor format configurations at compile time, as doing so would be too
//! strong a requirement. As a consequence, should you mess up, the program will fail at run time.
//! A proper understanding of the cuDNN API mechanics is thus fundamental to using this crate
//! correctly.
//!
//! You can refer to this [extract] from the cuDNN developer guide to learn more about tensor
//! formats.
//!
//! [extract]: https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#data-layout-formats
//!
//! We split the original cuDNN tensor format enum, which has three variants, into two parts: the
//! `ScalarC` enum and the `TensorFormat::NchwVectC` enum variant. The former stands for "scalar
//! channel" and encapsulates the `Nchw` and `Nhwc` formats. Both scalar channel formats can be
//! converted to the `TensorFormat` enum with
//! [`.into()`](https://doc.rust-lang.org/std/convert/trait.Into.html).
//!
//! ```rust
//! use cudnn::{ScalarC, TensorFormat};
//!
//! let sc_fmt = ScalarC::Nchw;
//!
//! let vc_fmt = TensorFormat::NchwVectC;
//!
//! let sc_to_tf: TensorFormat = sc_fmt.into();
//! ```

#![deny(rustdoc::broken_intra_doc_links)]
#[doc = include_str!("../README.md")]

mod activation;
mod attention;
mod backend;
5 changes: 0 additions & 5 deletions crates/cust/README.md

This file was deleted.

2 changes: 1 addition & 1 deletion crates/cust/src/error.rs
@@ -1,6 +1,6 @@
//! Types for error handling
//!
-//! # Error handling in CUDA:
+//! # Error handling in CUDA
//!
//! cust uses the [`CudaError`](enum.CudaError.html) enum to represent the errors returned by
//! the CUDA API. It is important to note that nearly every function in CUDA (and therefore
2 changes: 1 addition & 1 deletion crates/cust/src/function.rs
@@ -474,7 +474,7 @@ impl Function<'_> {

/// Launch a kernel function asynchronously.
///
-/// # Syntax:
+/// # Syntax
///
/// The format of this macro is designed to resemble the triple-chevron syntax used to launch
/// kernels in CUDA C. There are two forms available:
10 changes: 5 additions & 5 deletions crates/cust/src/lib.rs
@@ -6,17 +6,17 @@
//! provides unsafe functions for retrieving and setting handles to raw CUDA objects.
//! This allows advanced users to embed libraries that rely on CUDA, such as OptiX.
//!
-//! # CUDA Terminology:
+//! # CUDA Terminology
//!
-//! ## Devices and Hosts:
+//! ## Devices and Hosts
//!
//! This crate and its documentation use the terms "device" and "host" frequently, so it's worth
//! explaining them in more detail. A device refers to a CUDA-capable GPU or similar device and its
//! associated external memory space. The host is the CPU and its associated memory space. Data
//! must be transferred from host memory to device memory before the device can use it for
//! computations, and the results must then be transferred back to host memory.
//!
-//! ## Contexts, Modules, Streams and Functions:
+//! ## Contexts, Modules, Streams and Functions
//!
//! A CUDA context is akin to a process on the host - it contains all of the state for working with
//! a device, all memory allocations, etc. Each context is associated with a single device.
@@ -30,7 +30,7 @@
//! stream. Work within a single stream will execute sequentially in the order that it was
//! submitted, and may interleave with work from other streams.
//!
-//! ## Grids, Blocks and Threads:
+//! ## Grids, Blocks and Threads
//!
//! CUDA devices typically execute kernel functions on many threads in parallel. These threads can
//! be grouped into thread blocks, which share an area of fast hardware memory known as shared
@@ -44,7 +44,7 @@
//! hand, if the thread blocks are too small each processor will be under-utilized and the
//! code will be unable to make effective use of shared memory.
//!
-//! # Usage:
+//! # Usage
//!
//! Before using cust, you must install the CUDA development libraries for your system. Version
//! 9.0 or newer is required. You must also have a CUDA-capable GPU installed with the appropriate
2 changes: 1 addition & 1 deletion crates/cust/src/memory/pointer.rs
@@ -429,7 +429,7 @@ impl<T: DeviceCopy> UnifiedPointer<T> {

/// Returns a null unified pointer.
///
-/// # Examples:
+/// # Examples
///
/// ```
/// # let _context = cust::quick_init().unwrap();
2 changes: 1 addition & 1 deletion crates/cust/src/module.rs
@@ -338,7 +338,7 @@ impl Module {

/// Get a reference to a global symbol, which can then be copied to/from.
///
-/// # Panics:
+/// # Panics
///
/// This function panics if the size of the symbol is not the same as `mem::size_of::<T>()`.
///
1 change: 0 additions & 1 deletion crates/cust_derive/README.md

This file was deleted.

2 changes: 2 additions & 0 deletions crates/cust_derive/src/lib.rs
@@ -1,3 +1,5 @@
//! Custom derive macro crate for cust.

#[macro_use]
extern crate quote;
extern crate proc_macro;
35 changes: 0 additions & 35 deletions crates/gpu_rand/README.md

This file was deleted.

40 changes: 34 additions & 6 deletions crates/gpu_rand/src/lib.rs
@@ -1,13 +1,41 @@
-//! gpu_rand is the Rust CUDA Project's equivalent of cuRAND. cuRAND unfortunately does not work with
-//! the CUDA Driver API, therefore, we reimplement (and extend) some of its algorithms and provide them in this crate.
+//! gpu_rand is the Rust CUDA Project's equivalent of cuRAND. cuRAND unfortunately does not work
+//! with the CUDA Driver API, therefore, we reimplement (and extend) some of its algorithms and
+//! provide them in this crate.
 //!
-//! This crate is meant to be gpu-centric, which means it may special-case certain things to run faster on the GPU by using PTX
-//! assembly. However, it is supposed to also work on the CPU, allowing you to reuse the same random states across CPU and GPU.
+//! This crate is meant to be GPU-centric, which means it may special-case certain things to run
+//! faster on the GPU by using PTX assembly. However, it is supposed to also work on the CPU,
+//! allowing you to reuse the same random states across CPU and GPU.
 //!
-//! A lot of the initial code is taken from the [rust-random project](https://github.com/rust-random) and modified to make it able to
-//! pass to the GPU, as well as cleaning up certain things and updating it to edition 2024.
+//! A lot of the initial code is taken from the [rust-random
+//! project](https://github.com/rust-random) and modified to make it able to pass to the GPU, as
+//! well as cleaning up certain things and updating it to edition 2024.
 //!
-//! The random generators currently implemented are:
+//! The following generators are implemented:
//!
//! 32-bit:
//! - Xoroshiro64**
//! - Xoroshiro64*
//! - Xoroshiro128+
//! - Xoroshiro128++
//! - Xoroshiro128**
//!
//! 64-bit:
//! - Xoroshiro128+
//! - Xoroshiro128++
//! - Xoroshiro128**
//! - Xoroshiro256+
//! - Xoroshiro256++
//! - Xoroshiro256**
//! - Xoroshiro512+
//! - Xoroshiro512++
//! - Xoroshiro512**
//! - SplitMix64
//!
//! We also provide a default 64-bit generator which should be more than enough for most
//! applications. The default currently uses Xoroshiro128** but that is subject to change in the
//! future.
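//!
//! A minimal sketch of host-side usage; it assumes the default generator is exported as
//! `DefaultRand` and implements `rand_core`'s `SeedableRng` and `RngCore` traits:
//!
//! ```rust
//! use gpu_rand::DefaultRand;
//! use rand_core::{RngCore, SeedableRng};
//!
//! // Deterministically seed the default generator and draw a value.
//! let mut rng = DefaultRand::seed_from_u64(42);
//! let x = rng.next_u64();
//! ```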

#![deny(missing_docs)]
#![deny(missing_debug_implementations)]