JuliaCon 2022
Don’t put all your 🥚 in one 🧺.
[…] parameters correspond to a diverse variety of compelling explanations for the data. (Wilson 2020)
\(\theta\) is a random variable. Shouldn’t we treat it that way?
\[ p(y|x,\mathcal{D}) = \int p(y|x,\theta)p(\theta|\mathcal{D})d\theta \qquad(1)\]
Intractable!
In practice we typically rely on a plugin approximation (Murphy 2022).
\[ p(y|x,\mathcal{D}) = \int p(y|x,\theta)p(\theta|\mathcal{D})d\theta \approx p(y|x,\hat\theta) \qquad(2)\]
Yes, “plugin” is literal … can we do better?
Yes, we can!
MCMC (see Turing)
Variational Inference (Blundell et al. 2015)
Monte Carlo Dropout (Gal and Ghahramani 2016)
Deep Ensembles (Lakshminarayanan, Pritzel, and Blundell 2017)
. . .
We first need to estimate the weight posterior \(p(\theta|\mathcal{D})\) …
Idea 💡: Taylor approximation at the mode.
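Concretely: expand the log posterior to second order around the MAP estimate \(\hat\theta\) (the first-order term vanishes at the mode), which yields a Gaussian approximation,
\[ \log p(\theta|\mathcal{D}) \approx \log p(\hat\theta|\mathcal{D}) - \frac{1}{2} (\theta-\hat\theta)^\mathsf{T} \mathbf{H} (\theta-\hat\theta) \quad \Longrightarrow \quad p(\theta|\mathcal{D}) \approx \mathcal{N}\left(\theta | \hat\theta, \mathbf{H}^{-1}\right), \]
where \(\mathbf{H} = \nabla_{\theta}\nabla_{\theta}^\mathsf{T}\ell(\theta)\big|_{\hat\theta}\) is the Hessian of the negative log joint at the mode.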
Now we can rely on MC sampling or the probit approximation to compute the posterior predictive (classification).
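For reference, the probit approximation replaces the integral over the Gaussian logit with a rescaled sigmoid (see Murphy 2022):
\[ p(y=1|x,\mathcal{D}) \approx \sigma\left(\frac{\hat\mu(x)}{\sqrt{1 + \pi \hat\sigma^2(x)/8}}\right), \]
where \(\hat\mu(x)\) and \(\hat\sigma^2(x)\) are the posterior mean and variance of the logit.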
LaplaceRedux.jl
- a small package 📦
What started out as my first coding project in Julia …
… has turned into a small package 📦 with great potential.
LaplaceRedux.jl and another blog post.
. . .
We assume a Gaussian prior for our weights … \[ p(\theta) = \mathcal{N} \left( \theta | \mathbf{0}, \lambda^{-1} \mathbf{I} \right)=\mathcal{N} \left( \theta | \mathbf{0}, \mathbf{H}_0^{-1} \right) \qquad(3)\]
. . .
… which corresponds to the logit binary crossentropy loss with weight decay:
\[ \ell(\theta)= - \sum_{n=1}^N [y_n \log \mu_n + (1-y_n)\log (1-\mu_n)] + \frac{1}{2} (\theta-\theta_0)^\mathsf{T}\mathbf{H}_0(\theta-\theta_0) \qquad(4)\]
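As a minimal sketch in plain Julia (the helper sigmoid and the name 𝓁 are illustrative, not part of any package):
# Logits to probabilities: μ_n = σ(x_nᵀθ)
sigmoid(θ, X) = 1 ./ (1 .+ exp.(-X * θ))
# Penalized negative log-likelihood (Equation 4):
function 𝓁(θ, θ_0, H_0, X, y)
    μ = sigmoid(θ, X)
    nll = -sum(y .* log.(μ) .+ (1 .- y) .* log.(1 .- μ))
    return nll + 0.5 * (θ - θ_0)' * H_0 * (θ - θ_0)
end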
. . .
For Logistic Regression we have the Hessian in closed form (p. 338 in Murphy (2022)):
\[ \nabla_{\theta}\nabla_{\theta}^\mathsf{T}\ell(\theta) = \sum_{n=1}^N(\mu_n(1-\mu_n)\mathbf{x}_n)\mathbf{x}_n^\mathsf{T} + \mathbf{H}_0 \qquad(5)\]
. . .
# Hessian of the penalized negative log-likelihood (Equation 5):
function ∇∇𝓁(θ, θ_0, H_0, X, y)
    N = length(y)
    μ = sigmoid(θ, X)    # predicted probabilities μ_n = σ(x_nᵀθ)
    H = sum(μ[n] * (1 - μ[n]) * X[n, :] * X[n, :]' for n = 1:N)
    return H + H_0       # data curvature plus prior precision
end
Gotta love Julia ❤️💜💚
. . .
Logistic Regression can be done in Flux …
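For instance, a rough sketch using the implicit-parameter training style Flux offered at the time (data and hyperparameters are purely illustrative):
using Flux

# Toy data: 100 points with 2 features:
X = randn(Float32, 2, 100)
y = reshape(Float32.(rand(0:1, 100)), 1, :)

# A single Dense layer is exactly logistic regression:
nn = Dense(2, 1)
loss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y)

# Plain gradient descent:
opt = Descent(0.1)
for epoch in 1:100
    Flux.train!(loss, Flux.params(nn), [(X, y)], opt)
end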
. . .
… but now we autograd! Leveraged in LaplaceRedux.
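The idea, roughly: flatten the model parameters and let Zygote compute the curvature instead of deriving Equation 5 by hand (λ and the loss definition here are illustrative, not the package internals):
using Flux, Zygote

# Flatten the network weights into a single vector θ:
θ, re = Flux.destructure(nn)

# Penalized negative log-likelihood as a function of θ alone:
λ = 0.1f0
ℓ(θ) = Flux.Losses.logitbinarycrossentropy(re(θ)(X), y; agg = sum) + λ / 2 * sum(abs2, θ)

# Hessian via automatic differentiation:
H = Zygote.hessian(ℓ, θ)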
. . .
An actual MLP …
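For example (the architecture is illustrative):
using Flux

# A small multi-layer perceptron for binary classification:
nn = Chain(
    Dense(2, 32, relu),
    Dense(32, 1)
)
loss(x, y) = Flux.Losses.logitbinarycrossentropy(nn(x), y)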
. . .
… same API call:
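Roughly (the constructor keywords and the expected data format are assumptions on my part; see the LaplaceRedux.jl README for the current API):
using LaplaceRedux

# Training data as an iterable of (x, y) pairs (assumed format):
data = zip(eachcol(X), vec(y))

# Fit the Laplace approximation to the trained network ...
la = Laplace(nn; likelihood=:classification)
fit!(la, data)

# ... and compute the (approximate) posterior predictive:
predict(la, X)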
. . .
Low prior uncertainty \(\rightarrow\) posterior dominated by prior. High prior uncertainty \(\rightarrow\) posterior approaches MLE.
We’ve really been using linearized neural networks …
. . .
Applying the GGN approximation […] turns the underlying probabilistic model locally from a BNN into a GLM […] Because we have effectively done inference in the GGN-linearized model, we should instead predict using these modified features. — Immer, Korzepa, and Bauer (2020)
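In other words, predictions use the network linearized at the mode,
\[ f(x;\theta) \approx f(x;\hat\theta) + \mathcal{J}_{\hat\theta}(x)(\theta - \hat\theta), \]
where \(\mathcal{J}_{\hat\theta}(x)\) is the Jacobian of the network output with respect to the weights at \(\hat\theta\), so the model is locally linear in \(\theta\) and behaves like a GLM.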
Learn about Laplace Redux by implementing it in Julia.
Turn code into a small package.
Submit to JuliaCon 2022 and share the idea.
. . .
The package is bare-bones at this point and needs a lot of work.
Effortless Bayesian Deep Learning through Laplace Redux – JuliaCon 2022 – Patrick Altmeyer