Releases: PennyLaneAI/pennylane-lightning-gpu
Release v0.32.0
New features since last release
- Add sparse Hamiltonian support to multi-node/multi-GPU adjoint methods. (#128)
- Add sparse Hamiltonian support for expectation value calculation. (#127) A combined sketch of these two features is shown below.
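A minimal sketch combining these two features, assuming the `qml.SparseHamiltonian` interface shown in the v0.27.0 notes below and the `mpi=True` device argument introduced in v0.31.0; the Hamiltonian and its sparse-matrix construction here are illustrative assumptions, not taken from these notes:

```python
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np

n_wires = 8  # illustrative size
dev = qml.device("lightning.gpu", wires=n_wires, mpi=True)

# Illustrative observable, converted to a sparse representation.
H = qml.Hamiltonian([0.5, 0.5], [qml.PauliZ(0) @ qml.PauliZ(1), qml.PauliX(2)])
SpH = qml.SparseHamiltonian(H.sparse_matrix(wire_order=range(n_wires)), wires=range(n_wires))

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for i in range(n_wires):
        qml.RX(params[i], wires=i)
    return qml.expval(SpH)

params = np.random.random(n_wires)
grad = qml.grad(circuit)(params)
```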
Breaking changes
- Rename `QubitStateVector` to `StatePrep` in the `LightningGPU` class. (#134) See the sketch below.
- Deprecate Python 3.8. (#134)
- Update PennyLane-Lightning imports following the (#472) refactoring. (#134)
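A minimal sketch of the renamed state-preparation operation (the circuit and state are illustrative):

```python
import pennylane as qml
import numpy as np

dev = qml.device("lightning.gpu", wires=2)

@qml.qnode(dev)
def circuit():
    # qml.StatePrep replaces the former QubitStateVector.
    qml.StatePrep(np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2), wires=[0, 1])
    return qml.state()
```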
Improvements
- Optimize the single-qubit rotation gates by using a single cuStateVec API call instead of separate Pauli gate applications. (#132)
Bug fixes
- `apply` no longer mutates the inputted list of operations, and the missing `_dp` is added to the `LightningGPU` class with the single-GPU backend. (#133) See the sketch below.
- Ensure the active-return check doesn't break CI. (#136)
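A minimal sketch of the fixed `apply` behaviour (the operations list is illustrative):

```python
import pennylane as qml

dev = qml.device("lightning.gpu", wires=1)

ops = [qml.PauliX(wires=0), qml.Hadamard(wires=0)]
dev.apply(ops)

# The input list of operations is no longer modified in place.
assert len(ops) == 2
```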
Contributors
This release contains contributions from (in alphabetical order):
David Clark (NVIDIA), Vincent Michaud-Rioux, Shuli Shu
Release v0.31.0
New features since last release
- Add multi-node/multi-GPU support to adjoint methods. (#119)
  Note that each MPI process will return the overall result of the adjoint method. The MPI adjoint method has two options:
  - The default method is faster if the problem fits into the available GPU memory, and is enabled simply with the `mpi=True` device argument. With the default method, a separate `bra` is created for each observable and the `ket` is only updated once for each operation, regardless of the number of observables. This approach may consume more memory due to the up-front creation of multiple `bra`s.
  - The memory-optimized method requires less memory but is slower due to the serialization of the execution. The memory-optimized method uses a single `bra` object that is reused for all observables. The `ket` needs to be updated `n` times for each operation, where `n` is the number of observables. This approach reduces memory consumption as only one `bra` object is created; however, it may lead to slower execution due to the multiple `ket` updates per gate operation.
The workflow for the default adjoint method with MPI support is as follows:
```python
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_wires = 20
n_layers = 2

dev = qml.device('lightning.gpu', wires=n_wires, mpi=True)

@qml.qnode(dev, diff_method="adjoint")
def circuit_adj(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)])

if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

params = comm.bcast(params, root=0)
jac = qml.jacobian(circuit_adj)(params)
```
To enable the memory-optimized method, `batch_obs` should be set to `True`. The workflow for the memory-optimized method is as follows:

```python
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True, batch_obs=True)
```
- Add multi-node/multi-GPU support to measurement methods, including `expval`, `generate_samples` and `probability`. (#116)
  Note that each MPI process will return the overall result for expectation values and sample generation. However, `probability` will return local probability results, and users are responsible for collecting the probability results across the MPI processes.
  The workflow for collecting probability results across the MPI processes is as follows:
```python
from mpi4py import MPI
import pennylane as qml
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

dev = qml.device('lightning.gpu', wires=8, mpi=True)
prob_wires = [0, 1]

@qml.qnode(dev)
def mpi_circuit():
    qml.Hadamard(wires=1)
    return qml.probs(wires=prob_wires)

local_probs = mpi_circuit()

# For data collection across MPI processes.
recv_counts = comm.gather(len(local_probs), root=0)
if rank == 0:
    probs = np.zeros(2**len(prob_wires))
else:
    probs = None

comm.Gatherv(local_probs, [probs, recv_counts], root=0)
if rank == 0:
    print(probs)
```
- Add multi-node/multi-GPU support to gate operations. (#112)
  This new feature empowers users to leverage the computational power of multi-node and multi-GPU systems for running large-scale applications. It requires both the total number of MPI processes and the number of MPI processes per node to be powers of 2, with the same number of processes on each node. Each MPI process is responsible for managing one GPU for the moment.
  To enable this feature, users can set `mpi=True`. Furthermore, users can fine-tune the performance of MPI operations by adjusting the `mpi_buf_size` parameter. This parameter determines the allocation of `mpi_buf_size` MiB (mebibytes, 2^20 bytes) of GPU memory for MPI operations. Note that `mpi_buf_size` should also be a power of 2, and a runtime warning will be raised if the GPU memory buffer for MPI operations is larger than the GPU memory allocated for the local state vector. By default (`mpi_buf_size=0`), the GPU memory allocated for MPI operations will be the same as the size of the local state vector, with an upper limit of 64 MiB. Note that MiB (2^20 bytes) is different from MB (megabytes, 10^6 bytes).
  The workflow for the new feature is as follows:

```python
from mpi4py import MPI
import pennylane as qml

dev = qml.device('lightning.gpu', wires=8, mpi=True, mpi_buf_size=1)

@qml.qnode(dev)
def circuit_mpi():
    qml.PauliX(wires=[0])
    return qml.state()

local_state_vector = circuit_mpi()
print(local_state_vector)
```
  Note that each MPI process will return its local state vector with `qml.state()` here; the local slices can be gathered across processes as sketched below.
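For completeness, here is a sketch of gathering those local state-vector slices on a single rank, in the same spirit as the probability example above. It assumes the local slices are contiguous and ordered by rank, which is not stated in these notes:

```python
from mpi4py import MPI
import pennylane as qml
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_wires = 8
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True)

@qml.qnode(dev)
def circuit_mpi():
    qml.PauliX(wires=[0])
    return qml.state()

local_state_vector = circuit_mpi()

# Gather the equally sized local slices on rank 0 (assumes rank-ordered slices).
state = np.zeros(2**n_wires, dtype=local_state_vector.dtype) if rank == 0 else None
comm.Gather(local_state_vector, state, root=0)
if rank == 0:
    print(state)
```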
Breaking changes
- Update tests to be compliant with PennyLane v0.31.0 development changes and deprecations. (#114)
Improvements
- Use `Operator.name` instead of `Operation.base_name`. (#115)
- Updated the runs-on label for self-hosted runner workflows. (#117)
- Update workflow to support multi-GPU self-hosted runners. (#118)
- Add compat workflows. (#121)
Documentation
- Update `README.rst` and `CHANGELOG.md` for the MPI backend. (#122)
Contributors
This release contains contributions from (in alphabetical order):
Christina Lee, Rashid N H M, Shuli Shu
Release v0.30.0
New features since last release
Improvements
- Wheels are now checked with `twine check` post-creation for PyPI compatibility. (#103)
Bug fixes
- Fix CUDA version to 11 for the cuQuantum dependency in CI. (#107)
- Fix the controlled-gate generators, which are now fully used in the adjoint pipeline following PennyLane PR (#3874). (#101)
- Update to use the new call signature for `QuantumScript.get_operation`. (#104)
Contributors
This release contains contributions from (in alphabetical order):
Vincent Michaud-Rioux, Romain Moyard, Lee James O'Riordan
Release v0.29.1
Improvements
- Optimization updates to the cuStateVec integration, e.g., creation of fewer cuBLAS, cuSPARSE and cuStateVec handles, and fewer calls to small data transfers between host and device. (#73)
Contributors
This release contains contributions from (in alphabetical order):
Ania Brown (NVIDIA), Andreas Hehn (NVIDIA)
Release v0.29.0
Improvements
- Update `inv()` to `qml.adjoint()` in Python tests following recent changes in PennyLane. (#88) See the sketch below.
- Remove explicit NumPy requirement. (#90)
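A minimal sketch of the replacement (the circuit is illustrative):

```python
import pennylane as qml

dev = qml.device("lightning.gpu", wires=1)

@qml.qnode(dev)
def circuit(x):
    # Previously written as qml.RX(x, wires=0).inv()
    qml.adjoint(qml.RX(x, wires=0))
    return qml.expval(qml.PauliZ(0))
```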
Bug fixes
- Ensure early failure, rather than the return of incorrect results, for out-of-order probs wires. (#94)
Contributors
This release contains contributions from (in alphabetical order):
Amintor Dusko, Lee James O'Riordan, Shuli Shu
Release 0.28.1
Release 0.28.0
New features since last release
- Add customized CUDA kernels for statevector initialization to the C++ layer. (#70)
Breaking changes
- Deprecate `_state` and `_pre_rotated_state`, and refactor `syncH2D` and `syncD2H`. (#70)
  The refactor of `syncH2D` and `syncD2H` allows users to explicitly access and update the statevector data on the device when needed, and can reduce unnecessary memory allocation on the host.
  The workflow for `syncH2D` is:
```python
import pennylane as qml
import numpy as np

dev = qml.device('lightning.gpu', wires=3)
obs = qml.Identity(0) @ qml.PauliX(1) @ qml.PauliY(2)
obs1 = qml.Identity(1)
H = qml.Hamiltonian([1.0, 1.0], [obs1, obs])
state_vector = np.array([0.0 + 0.0j, 0.0 + 0.1j, 0.1 + 0.1j, 0.1 + 0.2j,
                         0.2 + 0.2j, 0.3 + 0.3j, 0.3 + 0.4j, 0.4 + 0.5j], dtype=np.complex64)
dev.syncH2D(state_vector)
res = dev.expval(H)
```
  The workflow for `syncD2H` is:

```python
import pennylane as qml
import numpy as np

num_wires = 3  # illustrative value; not defined in the original snippet

dev = qml.device('lightning.gpu', wires=num_wires)
dev.apply([qml.PauliX(wires=[0])])
state_vector = np.zeros(2**dev.num_wires).astype(dev.C_DTYPE)
dev.syncD2H(state_vector)
```
Improvements
- `lightning.gpu` is decoupled from the NumPy layer during initialization and execution, and `lightning.gpu` now inherits from `QubitDevice` instead of `LightningQubit`. (#70)
- Add support for CI checks. (#76)
- Implement an improved `stopping_condition` method, and make Linux wheel builds more performant. (#77)
Bug fixes
- Fix wheel-builder to pin the CUDA version to 11.8 instead of latest. (#83)
- Pin CMake to 3.24.x in the wheel-builder to avoid a "Python not found" error in CMake 3.25. (#75)
- Fix the data copy method in the `state()` method. (#82)
Contributors
This release contains contributions from (in alphabetical order):
Amintor Dusko, Lee J. O'Riordan, Shuli Shu
Release v0.27.0
New features since last release
- Explicit support for `qml.SparseHamiltonian` using the adjoint gradient method. (#72)
  This support allows users to explicitly make use of `qml.SparseHamiltonian` in expectation value calculations, and ensures the gradients can be taken efficiently.
  A user can now explicitly decide whether to decompose the Hamiltonian into separate Pauli words, with evaluations happening over multiple GPUs, or convert the Hamiltonian directly to a sparse representation for evaluation on a single GPU. Depending on the Hamiltonian structure, a user may benefit from one method or the other.
  The workflow for decomposing a Hamiltonian is:
```python
import pennylane as qml

num_wires = 4   # illustrative value; not defined in the original snippet
obs_per_gpu = 1

dev = qml.device("lightning.gpu", wires=num_wires, batch_obs=obs_per_gpu)
H = sum([0.5*(i+1)*(qml.PauliZ(i)@qml.PauliZ(i+1)) for i in range(0, num_wires-1, 2)])

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for i in range(num_wires):
        qml.RX(params[i], i)
    return qml.expval(H)
```
  For the new `qml.SparseHamiltonian` support, the above script becomes:

```python
import pennylane as qml

num_wires = 4   # illustrative value; not defined in the original snippet

dev = qml.device("lightning.gpu", wires=num_wires)
H = sum([0.5*(i+1)*(qml.PauliZ(i)@qml.PauliZ(i+1)) for i in range(0, num_wires-1, 2)])
H_sparse_matrix = qml.utils.sparse_hamiltonian(H, wires=range(num_wires))
SpH = qml.SparseHamiltonian(H_sparse_matrix, wires=range(num_wires))

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for i in range(num_wires):
        qml.RX(params[i], i)
    return qml.expval(SpH)
```
- Enable building of Python 3.11 wheels and upgrade Python in CI/CD workflows to 3.8. (#71)
Improvements
- Update the `LightningGPU` device following changes in `LightningQubit` inheritance from `DefaultQubit` to `QubitDevice`. (#74)
Bug fixes
- Ensure device fallback successfully carries through for 0 CUDA devices. (#67)
- Fix void data type used in SparseSpMV. (#69)
Contributors
This release contains contributions from (in alphabetical order):
Amintor Dusko, Lee J. O'Riordan, Shuli Shu
Release v0.26.2
Bug fixes
- Fix reduction over batched and decomposed Hamiltonians in the adjoint pipeline. (#64)
Contributors
Lee J. O'Riordan
Release v0.26.1
This is a minor release with an update to how `qml.Hamiltonian` objects are handled at the C++ layer.
Bug fixes
- Ensure `qml.Hamiltonian` is auto-decomposed for the adjoint differentiation pipeline to avoid OOM errors. (#62) A sketch of the resulting usage is given below.
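A minimal sketch of the workflow this fix enables, with an illustrative Hamiltonian and circuit; the device decomposes the multi-term `qml.Hamiltonian` internally for the adjoint pipeline rather than materializing it as a dense matrix:

```python
import pennylane as qml
from pennylane import numpy as np

n_wires = 10  # illustrative size
dev = qml.device("lightning.gpu", wires=n_wires)

# Multi-term Hamiltonian; it is auto-decomposed term by term for adjoint differentiation.
H = qml.Hamiltonian(
    [0.5 * (i + 1) for i in range(n_wires - 1)],
    [qml.PauliZ(i) @ qml.PauliZ(i + 1) for i in range(n_wires - 1)],
)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for i in range(n_wires):
        qml.RY(params[i], wires=i)
    return qml.expval(H)

params = np.random.random(n_wires)
grads = qml.grad(circuit)(params)
```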
Contributors
Lee J. O'Riordan