diff --git a/README.md b/README.md
index f232e06..a529b9b 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,320 @@
 | [![](https://img.shields.io/badge/Developed_by-PSOR_Lab-342674)](https://psor.uconn.edu/) | [![Build Status](https://github.com/PSORLab/SourceCodeMcCormick.jl/workflows/CI/badge.svg?branch=master)](https://github.com/PSORLab/SourceCodeMcCormick.jl/actions?query=workflow%3ACI) [![codecov](https://codecov.io/gh/PSORLab/SourceCodeMcCormick.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/PSORLab/SourceCodeMcCormick.jl)|
+This package uses source-code generation to create CUDA kernels that return evaluations of McCormick-based
+relaxations. Expressions composed of `Symbolics.jl`-type variables can be passed into the main `SourceCodeMcCormick.jl`
+(`SCMC`) function, `kgen`, after which the expressions are factored, algebraic rearrangements and other
+modifications are applied as necessary, and a new, custom function is written that calculates inclusion
+monotonic interval extensions, convex and concave relaxations, and subgradients of the convex and concave
+relaxations of the original expression. These new custom functions are meant to be used in, e.g., a branch-and-bound
+routine where the optimizer would like to calculate relaxations for many nodes simultaneously using GPU
+resources. They are designed to be used with `CUDA.CuArray{Float64}` objects, with 64-bit (double-precision)
+numbers recommended to maintain accuracy.
+
+
+## Basic Functionality
+
+The primary user-facing function is `kgen` ("kernel generator"). The `kgen` function returns a source-code-generated
+function that provides evaluations of convex and concave relaxations, inclusion monotonic interval extensions,
+convex relaxation subgradients, and concave relaxation subgradients for an input symbolic expression. Inputs
+to the newly generated function are expected to be an output placeholder, followed by a separate `CuArray` for
+each input variable, sorted in alphabetical order. The "input" variables must be `(m,3)`-dimensional `CuArray`s,
+with the columns corresponding to points where evaluations are requested, lower bounds, and upper bounds,
+respectively. The "output" placeholder must be `(m,4+2n)`-dimensional, where `n` is the number of variables
+in the expression. The columns will hold, respectively, the convex and concave relaxations, the lower and
+upper bounds of an interval extension, a subgradient of the convex relaxation, and a subgradient of the
+concave relaxation. Each row `m` in the inputs and output is independent of the others, allowing relaxation
+information to be calculated at multiple points on (potentially) different domains.
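+
+For reference, here is a minimal sketch (not part of the package; the helper and its name are illustrative)
+of how the output columns described above could be unpacked:
+
+```julia
+# Unpack the (m, 4+2n) output of a `kgen`-generated function, assuming the
+# column layout {cv, cc, lo, hi, cvgrad..., ccgrad...} described above
+function unpack_output(OUT, n)
+    cv     = OUT[:, 1]           # convex relaxation values
+    cc     = OUT[:, 2]           # concave relaxation values
+    lo     = OUT[:, 3]           # interval extension lower bounds
+    hi     = OUT[:, 4]           # interval extension upper bounds
+    cvgrad = OUT[:, 5:4+n]       # subgradients of the convex relaxation
+    ccgrad = OUT[:, 5+n:4+2n]    # subgradients of the concave relaxation
+    return (; cv, cc, lo, hi, cvgrad, ccgrad)
+end
+```
+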
+A demonstration of how to use `kgen` is shown here,
+with the output compared to the multiple-dispatch-based `McCormick.jl`:
+
+```julia
+using SourceCodeMcCormick, Symbolics, CUDA
+Symbolics.@variables x, y
+expr = exp(x/y) - (x*y^2)/(y+1)
+new_func = kgen(expr)
+x_in = CuArray([1.0 0.5 3.0;])
+y_in = CuArray([0.7 0.1 2.0;])
+OUT = CUDA.zeros(Float64, 1, 8)
+
+using McCormick
+xMC = MC{2,NS}(1.0, Interval(0.5, 3.0), 1)
+yMC = MC{2,NS}(0.7, Interval(0.1, 2.0), 2)
+
+
+julia> new_func(OUT, x_in, y_in) # SourceCodeMcCormick.jl
+CUDA.HostKernel for f_3zSdMo7LFdg_1(CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1}, CuDeviceMatrix{Float64, 1})
+
+julia> Array(OUT)
+1×8 Matrix{Float64}:
+ 0.228368  2.96348e12  -9.62507  1.06865e13  -2.32491  -3.62947  3.59209e12  -8.98023e11
+
+julia> exp(xMC/yMC) - (xMC*yMC^2)/(yMC+1) # McCormick.jl
+MC{2, NS}(0.22836802303235793, 2.963476144457207e12, [-9.62507, 1.06865e+13], [-2.3249068975747305, -3.6294726270892967], [3.592092296310309e12, -8.980230740778097e11], false)
+```
+
+In this example, the symbolic expression `exp(x/y) - (x*y^2)/(y+1)` is passed into `kgen`, which returns the
+function `new_func`. Passing inputs representing the values and domain where calculations are desired for `x`
+and `y` (i.e., evaluating `x` at `1.0` in the domain `[0.5, 3.0]` and `y` at `0.7` in the domain `[0.1, 2.0]`)
+into `new_func` returns pointwise evaluations of {cv, cc, lo, hi, [cvgrad]..., [ccgrad]...} for the original
+expression `exp(x/y) - (x*y^2)/(y+1)` on the specified domain of `x` and `y`.
+
+Evaluating large numbers of points simultaneously using a GPU allows for faster relaxation calculations than
+evaluating points individually using a CPU. This is demonstrated in the following example:
+
+```julia
+using SourceCodeMcCormick, Symbolics, CUDA, McCormick, BenchmarkTools
+Symbolics.@variables x, y
+expr = exp(x/y) - (x*y^2)/(y+1)
+new_func = kgen(expr)
+
+# CuArray values for SCMC timing
+x_in = hcat(2.5.*CUDA.rand(Float64, 8192) .+ 0.5,
+            0.5.*CUDA.ones(Float64, 8192),
+            3.0.*CUDA.ones(Float64, 8192))
+y_in = hcat(1.9.*CUDA.rand(Float64, 8192) .+ 0.1,
+            0.1.*CUDA.ones(Float64, 8192),
+            2.0.*CUDA.ones(Float64, 8192))
+OUT = CUDA.zeros(Float64, 8192, 8)
+
+# McCormick objects for McCormick.jl timing
+xMC = MC{2,NS}(1.0, Interval(0.5, 3.0), 1)
+yMC = MC{2,NS}(0.7, Interval(0.1, 2.0), 2)
+
+########################################################
+##### Timing
+# CPU: Intel i7-9850H
+# GPU: NVIDIA Quadro T2000
+
+
+##### McCormick.jl (single evaluation on CPU)
+@btime exp($xMC/$yMC) - ($xMC*$yMC^2)/($yMC+1);
+# 236.941 ns (0 allocations: 0 bytes)
+
+
+##### SourceCodeMcCormick.jl (8192 simultaneous evaluations on GPU)
+@btime new_func($OUT, $x_in, $y_in);
+# 75.700 μs (18 allocations: 384 bytes)
+
+# Average time per evaluation = 75.700 μs / 8192 = 9.241 ns
+
+
+# Time to move results back to CPU (if needed):
+@btime Array($OUT);
+# 107.200 μs (8 allocations: 512.16 KiB)
+
+# Time including evaluation:
+# 75.700 μs + 107.200 μs = 182.900 μs
+# Average time per evaluation = 182.900 μs / 8192 = 22.327 ns
+```
+
+As shown by the two previous examples, functions generated by `kgen` can be significantly faster than
+the same calculations done by `McCormick.jl` while providing the same information for a given input
+expression. There is a considerable time penalty for bringing the results from device memory (GPU) back
+to host memory (CPU) due to the volume of information generated. If possible, an algorithm designed
+to make use of `SCMC`-generated functions should therefore use the calculated relaxation information
+within the GPU rather than transfer it between hardware components.
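+
+For instance, a lower-bounding step could reduce the interval lower bounds directly on the device, so that
+only a scalar crosses the GPU/CPU boundary (a sketch, assuming the column layout described earlier and the
+arrays from the timing example above):
+
+```julia
+new_func(OUT, x_in, y_in)       # relaxation information stays in device memory
+lowest_lb = minimum(OUT[:, 3])  # column 3 holds the interval lower bounds; only
+                                # the scalar result is returned to the host
+```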
+
+
+## Arguments for `kgen`
+
+The `kgen` function can be called with only a `Num`-type expression as an input (as in the examples in the
+previous section), or with several possible arguments that affect `kgen`'s behavior. Some of these
+functionalities and their use cases are shown in this section.
+
+### 1) Overwrite
+
+Functions and kernels created by `kgen` are human-readable and are [currently] saved in the
+"src\kernel_writer\storage" folder with a file name created by hashing the input expression and
+the list of relevant variables. By saving the files in this way, calling `kgen` multiple times
+with the same expression in the same Julia session can skip the steps of re-writing and
+re-compiling the same expression. If the same expression is used across Julia sessions,
+`kgen` can skip the re-writing step and instead compile the existing function and kernels.
+
+In some cases, this behavior may not be desired. For example, if there is an error or interrupt
+during the initial writing of the kernel, or if the source code of `kgen` is being adjusted
+to modify how kernels are written, it is preferable to re-write the kernel from scratch
+rather than rely on the existing generated code. For these cases, the setting `overwrite=true`
+may be used to tell `kgen` to re-write and re-compile the generated code. For example:
+
+```julia
+using SourceCodeMcCormick, Symbolics, CUDA
+Symbolics.@variables x, y
+expr = exp(x/y) - (x*y^2)/(y+1)
+
+# First call with an entirely new expression
+@time new_func = kgen(expr);
+# 3.958299 seconds (3.71 M allocations: 162.940 MiB, 13.60% compilation time)
+
+# Second call in the same Julia session
+@time new_func = kgen(expr);
+# 0.000310 seconds (252 allocations: 10.289 KiB)
+
+# Third call, directing `kgen` to re-write and re-compile
+@time new_func = kgen(expr, overwrite=true);
+# 3.933316 seconds (3.70 M allocations: 162.619 MiB, 13.10% compilation time)
+```
+
+
+### 2) Problem Variables
+
+When using `SCMC` to calculate relaxations for expressions that are components of a larger optimization
+problem (such as individual constraints), it is necessary to provide `kgen` with information about all of
+the problem variables. For example, if the objective function in a problem is `x + y + z` and one constraint
+is `y^2 - z <= 0`, the subgradients of the relaxations for the constraint should be 3-dimensional despite
+the constraint depending only on `y` and `z`. Further, the dimensions must match across expressions such
+that the first subgradient dimension always corresponds to `x`, the second always corresponds to `y`, and
+so on. This is critically important for the primary use case of `SCMC`, so it is built in and does not
+require a keyword argument. For example:
+
+```julia
+using SourceCodeMcCormick, Symbolics, CUDA
+Symbolics.@variables x, y, z
+obj = x + y + z
+constr1 = y^2 - z
+
+# A valid way to call `kgen`, which is incorrect in this case because the problem
+# variables are {x, y, z}, but `constr1` only depends on {y, z}
+constr1_func_A = kgen(constr1)
+
+# The more correct way to call `kgen` in this case, informing it that the problem
+# variables are {x, y, z}
+constr1_func_B = kgen(constr1, [x, y, z])
+```
+
+In this example, `constr1_func_A` assumes that the only variables are `y` and `z`, and it therefore expects
+the `OUT` storage to be of size `(m,8)` (i.e., each subgradient is 2-dimensional). Because `kgen` is being
+used for one constraint as part of an optimization problem, it is more correct to use `constr1_func_B`,
+which expects `OUT` to be of size `(m,10)` (i.e., each subgradient is 3-dimensional).
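+
+A sketch of calling the second function is shown below (the array names and sampled values are illustrative);
+note the ten-column output and that an input block is required for `x` even though `constr1` does not depend
+on it:
+
+```julia
+m = 4   # number of points to evaluate simultaneously
+x_in = hcat(2.0.*CUDA.rand(Float64, m), CUDA.zeros(Float64, m), 2.0.*CUDA.ones(Float64, m))
+y_in = hcat(1.0 .+ 2.0.*CUDA.rand(Float64, m), CUDA.ones(Float64, m), 3.0.*CUDA.ones(Float64, m))
+z_in = hcat(1.0 .+ 3.0.*CUDA.rand(Float64, m), CUDA.ones(Float64, m), 4.0.*CUDA.ones(Float64, m))
+
+# 4 + 2*3 = 10 columns, since there are 3 problem variables
+OUT_B = CUDA.zeros(Float64, m, 10)
+
+# One input per problem variable, in the order given to `kgen`
+constr1_func_B(OUT_B, x_in, y_in, z_in)
+```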
+
+This sizing requirement is directly analogous to the typical use of `McCormick.jl`, where `MC` objects are
+constructed with the larger problem in mind. E.g.:
+
+```julia
+using McCormick
+
+xMC = MC{3,NS}(1.0, Interval(0.0, 2.0), 1)
+yMC = MC{3,NS}(1.5, Interval(1.0, 3.0), 2)
+zMC = MC{3,NS}(1.8, Interval(1.0, 4.0), 3)
+
+obj = xMC + yMC + zMC
+constr1 = yMC^2 - zMC
+```
+
+Here, for the user to calculate `constr1`, they must have already created the `MC` objects `yMC` and `zMC`.
+In the construction of these objects, it was specified that the subgradient would be 3-dimensional (the
+`3` in `MC{3,NS}`), and `yMC` and `zMC` were selected as the second and third dimensions of the subgradient
+by the `2` and `3` that follow the calls to `Interval()`. `SCMC` needs the same information; only the
+format differs from that of `McCormick.jl`.
+
+Note that if a list of problem variables is not provided, `SCMC` will collect the variables in the
+expression it is given and sort them alphabetically, then numerically (i.e., `x1 < x2 < x10 < y1`).
+If a variable list is provided, `SCMC` will respect the order of variables in the list without sorting
+them. For example, if `constr1_func_B` were instead created by calling `kgen(constr1, [z, y, x])`, then
+it would expect its input arguments to correspond to `(OUTPUT, z, y, x)` instead of the alphabetical
+`(OUTPUT, x, y, z)`.
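+
+For instance (a sketch reusing the illustrative arrays from above):
+
+```julia
+# With an explicitly reordered variable list, the generated function expects
+# its inputs in that same order: (OUTPUT, z, y, x)
+constr1_func_C = kgen(constr1, [z, y, x])
+constr1_func_C(OUT_B, z_in, y_in, x_in)
+```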
+
+
+### 3) Splitting
+
+Internally, `kgen` may split the input expression into multiple subexpressions if the generated kernels
+would be too long. In general, longer kernels are faster but require longer compilation times. The default
+settings attempt to provide a balance of good performance and acceptably low compilation times, but
+situations may arise where a different balance is preferred. The amount of splitting can be modified by
+setting the keyword argument `splitting` to one of `{:low, :default, :high, :max}`, where higher values
+indicate the (internal) creation of more, shorter kernels (faster compilation, slower performance) and lower
+values indicate the (internal) creation of fewer, longer kernels (longer compilation, faster performance).
+Actual compilation time and performance, as well as the impact of different `splitting` settings, are
+expression-dependent. In most cases, however, the default value should be sufficient.
+
+Here's an example showing how different values affect compilation time and performance:
+```julia
+using SourceCodeMcCormick, Symbolics, CUDA
+Symbolics.@variables x, y
+expr = exp(x/y) - (x*y^2)/(y+1)
+
+# CuArray values for SCMC timing
+x_in = hcat(2.5.*CUDA.rand(Float64, 8192) .+ 0.5,
+            0.5.*CUDA.ones(Float64, 8192),
+            3.0.*CUDA.ones(Float64, 8192))
+y_in = hcat(1.9.*CUDA.rand(Float64, 8192) .+ 0.1,
+            0.1.*CUDA.ones(Float64, 8192),
+            2.0.*CUDA.ones(Float64, 8192))
+OUT = CUDA.zeros(Float64, 8192, 8)
+
+##################################
+# Default splitting (creates 1 kernel internally, based on the size of this expr)
+@time new_func = kgen(expr, overwrite=true);
+# 4.100798 seconds (3.70 M allocations: 162.637 MiB, 1.47% gc time, 12.29% compilation time)
+
+@btime new_func($OUT, $x_in, $y_in);
+# 77.000 μs (18 allocations: 384 bytes)
+
+# --> This provides a good balance of compilation time and fast performance
+
+##################################
+# Higher splitting (creates 2 kernels internally)
+@time new_func = kgen(expr, splitting=:high, overwrite=true);
+# 3.452594 seconds (3.87 M allocations: 170.845 MiB, 13.87% compilation time)
+
+@btime new_func($OUT, $x_in, $y_in);
+# 117.000 μs (44 allocations: 960 bytes)
+
+# --> Splitting the function into 2 shorter kernels yields a marginally faster
+#     compilation time, but performance is marginally worse due to needing twice
+#     as many calls to kernels
+
+##################################
+# Maximum splitting (creates 4 kernels internally)
+@time new_func = kgen(expr, splitting=:max, overwrite=true);
+# 5.903894 seconds (5.88 M allocations: 260.864 MiB, 12.91% compilation time)
+
+@btime new_func($OUT, $x_in, $y_in);
+# 184.900 μs (83 allocations: 1.91 KiB)
+
+# --> Given the simplicity of the expression, splitting into 4 kernels does not
+#     provide a benefit over the 2 kernels created with :high. Compilation time
+#     is longer because twice as many kernels need to be compiled (and each is
+#     not much less complicated than in the :high case), and performance is worse
+#     (because the number of kernel calls is doubled relative to the :high case
+#     without much decrease in complexity).
+```
+
+## The ParBB Algorithm
+
+The intended use of `SCMC` is in conjunction with a branch-and-bound algorithm that is able
+to make use of relaxation and subgradient information calculated in parallel on a
+GPU. An implementation of such an algorithm is available for `SCMC` version 0.4 and is
+under development for version 0.5. See the paper referenced in the section "Citing SourceCodeMcCormick" for
+a more complete description of the ParBB algorithm for version 0.4. Briefly, ParBB is built
+as an extension of the EAGO solver, and it works by parallelizing the node processing routines
+in such a way that tasks may be performed by a GPU.
+
+
+## Citing SourceCodeMcCormick
+
+Please cite the following paper when using SourceCodeMcCormick.jl. In plain text form this is:
+```
+Gottlieb, R. X., Xu, P., and Stuber, M. D. Automatic source code generation for deterministic
+global optimization with parallel architectures. Optimization Methods and Software, 1–39 (2024).
+DOI: 10.1080/10556788.2024.2396297
+```
+A BibTeX entry is given below:
+```bibtex
+@Article{doi:10.1080/10556788.2024.2396297,
+  author    = {Robert X. Gottlieb and Pengfei Xu and Matthew D. Stuber},
+  journal   = {Optimization Methods and Software},
+  title     = {Automatic source code generation for deterministic global optimization with parallel architectures},
+  year      = {2024},
+  pages     = {1--39},
+  doi       = {10.1080/10556788.2024.2396297},
+  eprint    = {https://doi.org/10.1080/10556788.2024.2396297},
+  publisher = {Taylor \& Francis},
+  url       = {https://doi.org/10.1080/10556788.2024.2396297},
+}
+```
+
+
+# README for v0.4 (Deprecated)
+
 This package uses source-code transformation to construct McCormick-based relaxations. Expressions composed
 of `Symbolics.jl`-type variables can be passed into `SourceCodeMcCormick.jl` (`SCMC`) functions, after which
 the expressions are factored, generalized McCormick relaxation rules and inclusion monotonic interval
@@ -18,7 +332,7 @@
 numbers are recommended for relaxations to maintain accuracy.
 
 
-## Basic Functionality
+## Basic Functionality (v0.4)
 
 The primary user-facing function is `fgen` ("function generator"). The `fgen` function returns a source-code-generated
 function that provides evaluations of convex and concave relaxations, inclusion monotonic interval extensions,
@@ -143,7 +457,7 @@ SourceCodeMcCormick functionality that creates these functions which will even f
 calculation speed.
 
 
-## Arguments for `fgen`
+## Arguments for `fgen` (v0.4)
 
 The `fgen` function can be called with only a `Num`-type expression as an input (as in the examples in the
 previous section), or with many other possible arguments that affect `fgen`'s behavior. Some of these
@@ -339,7 +653,7 @@ these outputs does not impact the speed of the generated functions, since the ir
 aren't used in any calculations.
 
 
-## The ParBB Algorithm
+## The ParBB Algorithm (v0.4)
 
 The intended use of SourceCodeMcCormick is in conjunction with a branch-and-bound
 algorithm that is able to make use of relaxations and subgradient information that are calculated in
@@ -705,7 +1019,7 @@ that ParBB is able to make use of the faster computing hardware of GPUs, and eve
 processed, can converge more quickly overall.
 
 
-## Limitations
+## Limitations (v0.4)
 
 (See "Citing SourceCodeMcCormick" for limitations on a pre-subgradient version of
 SourceCodeMcCormick.)
@@ -725,7 +1039,7 @@ more frequently with respect to the bounds on its domain may perform worse than
 where the structure of the relaxation is more consistent.
 
 
-## Citing SourceCodeMcCormick
+## Citing SourceCodeMcCormick (v0.3)
 Please cite the following paper when using SourceCodeMcCormick.jl. In plain text form this is:
 ```
 Gottlieb, R. X., Xu, P., and Stuber, M. D. Automatic source code generation for deterministic