# AdvancedVI.jl

[AdvancedVI](https://github.com/TuringLang/AdvancedVI.jl) provides implementations of variational inference (VI) algorithms, a family of methods that aim for scalable approximate Bayesian inference by leveraging optimization.
`AdvancedVI` is part of the [Turing](https://turinglang.org/stable/) probabilistic programming ecosystem.
The purpose of this package is to provide a common, accessible interface for various VI algorithms and utilities so that other packages, e.g. `Turing`, only need to write a light wrapper for integration.
For example, integrating `Turing` with `AdvancedVI.ADVI` only involves converting a `Turing.Model` into a [`LogDensityProblem`](https://github.com/tpapp/LogDensityProblems.jl) and extracting a corresponding `Bijectors.bijector`.

## Examples

`AdvancedVI` works with differentiable models specified as a [`LogDensityProblem`](https://github.com/tpapp/LogDensityProblems.jl).
For example, for the normal-log-normal model:

$$
\begin{aligned}
x &\sim \mathrm{LogNormal}\left(\mu_x, \sigma_x^2\right) \\
y &\sim \mathcal{N}\left(\mu_y, \sigma_y^2\right),
\end{aligned}
$$

a `LogDensityProblem` can be implemented as
```julia
using Distributions
using LogDensityProblems

struct NormalLogNormal{MX,SX,MY,SY}
    μ_x::MX
    σ_x::SX
    μ_y::MY
    Σ_y::SY
end

function LogDensityProblems.logdensity(model::NormalLogNormal, θ)
    (; μ_x, σ_x, μ_y, Σ_y) = model
    logpdf(LogNormal(μ_x, σ_x), θ[1]) + logpdf(MvNormal(μ_y, Σ_y), θ[2:end])
end

function LogDensityProblems.dimension(model::NormalLogNormal)
    length(model.μ_y) + 1
end

function LogDensityProblems.capabilities(::Type{<:NormalLogNormal})
    LogDensityProblems.LogDensityOrder{0}()
end
```

Since the support of `x` is constrained to be positive, while VI is best performed in unconstrained Euclidean space, we need a *bijector* that maps `x` to the unconstrained space. We will use the [`Bijectors.jl`](https://github.com/TuringLang/Bijectors.jl) package for this purpose.
This corresponds to the automatic differentiation variational inference (ADVI) formulation[^KTRGB2017].
```julia
using Bijectors

function Bijectors.bijector(model::NormalLogNormal)
    (; μ_x, σ_x, μ_y, Σ_y) = model
    Bijectors.Stacked(
        Bijectors.bijector.([LogNormal(μ_x, σ_x), MvNormal(μ_y, Σ_y)]),
        [1:1, 2:1+length(μ_y)])
end
```

A simpler approach is to use `Turing`, where a `Turing.Model` is automatically converted into a `LogDensityProblem` and a corresponding `bijector` is generated.
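
As an illustrative sketch (not part of the original example, and assuming `Turing` is installed; the model name `normal_lognormal` is hypothetical), the same model could be written with `Turing`'s `@model` macro, with the `LogDensityProblem` conversion and the bijector handled internally:

```julia
using Turing

# Hypothetical sketch: the same normal-log-normal model in Turing's modeling
# language. Turing takes care of the LogDensityProblem conversion and of the
# bijector for the positively-constrained variable x.
@model function normal_lognormal(μ_x, σ_x, μ_y, Σ_y)
    x ~ LogNormal(μ_x, σ_x)
    y ~ MvNormal(μ_y, Σ_y)
end
```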

Let us instantiate a random normal-log-normal model.
```julia
using LinearAlgebra

n_dims = 10
μ_x = randn()
σ_x = exp.(randn())
μ_y = randn(n_dims)
σ_y = exp.(randn(n_dims))
model = NormalLogNormal(μ_x, σ_x, μ_y, Diagonal(σ_y.^2))
```
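
Before running VI, it can be helpful to sanity-check the pieces defined above. The following sketch (not part of the original example) evaluates the joint log density at a point in the model's support and verifies that the bijector round-trips between the constrained and unconstrained spaces:

```julia
# Hypothetical sanity check of the LogDensityProblem and bijector definitions.
θ = [exp(randn()); randn(n_dims)]        # a point with x > 0, followed by y

LogDensityProblems.dimension(model)      # n_dims + 1
LogDensityProblems.logdensity(model, θ)  # finite log joint density

b    = Bijectors.bijector(model)         # constrained -> unconstrained map
binv = inverse(b)                        # unconstrained -> constrained map
binv(b(θ)) ≈ θ                           # should hold up to floating-point error
```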

We can perform VI with stochastic gradient descent (SGD) using reparameterization gradient estimates of the ELBO[^TL2014][^RMW2014][^KW2014] as follows:
```julia
using Optimisers
using ADTypes, ForwardDiff
using AdvancedVI

# ELBO objective with the reparameterization gradient
n_montecarlo = 10
elbo = AdvancedVI.RepGradELBO(n_montecarlo)

# Mean-field Gaussian variational family
d = LogDensityProblems.dimension(model)
μ = zeros(d)
L = Diagonal(ones(d))
q = AdvancedVI.MeanFieldGaussian(μ, L)

# Match support by applying the `model`'s inverse bijector
b = Bijectors.bijector(model)
binv = inverse(b)
q_transformed = Bijectors.TransformedDistribution(q, binv)

# Run inference
max_iter = 10^3
q, stats, _ = AdvancedVI.optimize(
    model,
    elbo,
    q_transformed,
    max_iter;
    adbackend = ADTypes.AutoForwardDiff(),
    optimizer = Optimisers.Adam(1e-3)
)

# Evaluate the final ELBO with 10^4 Monte Carlo samples
estimate_objective(elbo, q, model; n_samples=10^4)
```
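
The fitted `q` can then be used like any other distribution. For instance, a hypothetical usage sketch (not part of the original example) for drawing an approximate posterior sample:

```julia
# Hypothetical usage sketch: draw a sample from the fitted variational approximation.
# Since `q` is a `Bijectors.TransformedDistribution`, draws are returned on the
# original (constrained) scale: the first coordinate is x > 0, the rest is y.
z = rand(q)
x_draw = z[1]
y_draw = z[2:end]
```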

For more examples and details, please refer to the documentation.

## References

[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*. PMLR.
[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*. PMLR.
[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In *International Conference on Learning Representations*.
[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*.