Analyzing Inverse Problems with Invertible Neural Networks

This work was accepted at ICLR 2019.

In a recent collaboration with experts from natural and medical sciences, we show how Invertible Neural Networks can help us deal with the ill-posed inverse problems that often arise in these fields. This page aims to provide an intuitive introduction to the idea.


1. Invertible Neural Networks

The basic building block of our Invertible Neural Network is the affine coupling layer popularized by the Real NVP model. It works by splitting the input data into two parts $[\mathbf{u}_1, \mathbf{u}_2]$ , which are transformed by learned functions $s_i, t_i$ and coupled in an alternating fashion like so:

Diagram of a 'Coupling Layer', the building block of our Invertible Neural Network

where $\odot$ is element-wise multiplication. The output is just the concatenation of the resulting parts $[\mathbf{v}_1, \mathbf{v}_2]$ .

With only minor rearrangements, we can recover $[\mathbf{u}_1, \mathbf{u}_2]$ from $[\mathbf{v}_1, \mathbf{v}_2]$ to compute the inverse of the whole affine coupling layer:

Diagram of the inverse of a 'Coupling Layer'

with $\oslash$ being element-wise division.
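The two diagrams are not reproduced in this text. Read off the description above, and consistent with the exponential variant given further down, the plain multiplicative form presumably reads as follows for the forward pass

$\begin{align*} \mathbf{v}_1 &= \mathbf{u}_1 \odot s_2(\mathbf{u}_2) + t_2(\mathbf{u}_2) \\ \mathbf{v}_2 &= \mathbf{u}_2 \odot s_1(\mathbf{v}_1) + t_1(\mathbf{v}_1), \end{align*}$

and for the inverse

$\begin{align*} \mathbf{u}_2 &= (\mathbf{v}_2 - t_1(\mathbf{v}_1)) \oslash s_1(\mathbf{v}_1) \\ \mathbf{u}_1 &= (\mathbf{v}_1 - t_2(\mathbf{u}_2)) \oslash s_2(\mathbf{u}_2). \end{align*}$

Note that $\mathbf{u}_2$ has to be recovered first, because $s_2$ and $t_2$ take it as their input.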

Like in many scenarios, direct division can lead to numerical problems. So in practice we use the exponential function and clip extreme values of $s_i(\cdot)$ , which leads to a forward pass of the form

$\begin{align*} \mathbf{v}_1 &= \mathbf{u}_1 \odot \exp\!\big(s_2(\mathbf{u}_2)\big) + t_2(\mathbf{u}_2) \\ \mathbf{v}_2 &= \mathbf{u}_2 \odot \exp\!\big(s_1(\mathbf{v}_1)\big) + t_1(\mathbf{v}_1), \end{align*}$

with the inverse given by

$\begin{align*} \mathbf{u}_2 &= (\mathbf{v}_2 - t_1(\mathbf{v}_1)) \odot \exp\!\big(-s_1(\mathbf{v}_1)\big) \\ \mathbf{u}_1 &= (\mathbf{v}_1 - t_2(\mathbf{u}_2)) \odot \exp\!\big(-s_2(\mathbf{u}_2)\big). \end{align*}$
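The exponential form can be checked numerically with a few lines of NumPy. This is only a minimal sketch, not the implementation used in the paper: the helper names (`make_subnet`, `clipped`, `coupling_forward`, `coupling_inverse`), the tiny random networks standing in for the learned $s_i, t_i$, and the hard clipping limit of 2.0 are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subnet(dim_in, dim_out, hidden=16):
    """Tiny random MLP standing in for a learned s_i or t_i (not invertible itself)."""
    w1 = rng.normal(scale=0.5, size=(dim_in, hidden))
    w2 = rng.normal(scale=0.5, size=(hidden, dim_out))
    return lambda a: np.tanh(a @ w1) @ w2

def clipped(s, limit=2.0):
    """Hard-clip the log-scale so exp() stays well-behaved (limit is an assumption)."""
    return lambda a: np.clip(s(a), -limit, limit)

d1, d2 = 3, 2  # sizes of the two lanes u1 and u2
s1, t1 = clipped(make_subnet(d1, d2)), make_subnet(d1, d2)  # read v1, act on u2
s2, t2 = clipped(make_subnet(d2, d1)), make_subnet(d2, d1)  # read u2, act on u1

def coupling_forward(u1, u2):
    v1 = u1 * np.exp(s2(u2)) + t2(u2)
    v2 = u2 * np.exp(s1(v1)) + t1(v1)
    return v1, v2

def coupling_inverse(v1, v2):
    u2 = (v2 - t1(v1)) * np.exp(-s1(v1))  # u2 first: s2 and t2 need it
    u1 = (v1 - t2(u2)) * np.exp(-s2(u2))
    return u1, u2

u1, u2 = rng.normal(size=(4, d1)), rng.normal(size=(4, d2))
v1, v2 = coupling_forward(u1, u2)
r1, r2 = coupling_inverse(v1, v2)
max_error = max(np.abs(r1 - u1).max(), np.abs(r2 - u2).max())
```

Here `max_error` should sit at the level of floating-point round-off, even though the subnetworks are arbitrary and never inverted themselves.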

To construct deep invertible networks, we can simply chain these affine coupling layers much like ResNet blocks.

Crucially, the transformations $s_i$ and $t_i$ themselves need not be invertible and can be represented by arbitrary neural networks, which are trained by standard backpropagation along the computation graph. What’s more, invertibility allows us to apply loss functions for the forward pass and the inverse pass at the same time, and compute gradients for $s_i$ and $t_i$ from either direction – an opportunity we will exploit later on.

Rules for how to assign data to the upper or lower lane (i.e. how to split the input into $\mathbf{u}_1$ and $\mathbf{u}_2$ ) are still an active area of research. In a fully connected network, one typically splits in a random (but fixed!) way, and changes the assignment from layer to layer. When the data have spatial structure (think of images) and the transformations $s_i, t_i$ use a convolutional architecture, one usually divides along the channel dimension in every pixel. Recently, Kingma and Dhariwal proposed to learn the assignment.
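For the fully connected case, a random but fixed assignment can be sketched as one permutation per layer that is drawn once and then reused. The function names `split` and `merge` and the sizes below are hypothetical. The coupling transformation that would sit between the two calls is left out, so that only the bookkeeping is visible.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_layers = 6, 3

# One random permutation per layer, drawn once and then kept fixed.
perms = [rng.permutation(dim) for _ in range(n_layers)]

def split(u, perm):
    """Reorder the features with a fixed permutation, then cut into two lanes."""
    shuffled = u[:, perm]
    return shuffled[:, : dim // 2], shuffled[:, dim // 2 :]

def merge(u1, u2, perm):
    """Undo split(): concatenate the lanes and invert the permutation."""
    shuffled = np.concatenate([u1, u2], axis=1)
    return shuffled[:, np.argsort(perm)]

u = rng.normal(size=(5, dim))
restored = u
for perm in perms:
    lane1, lane2 = split(restored, perm)
    # (a coupling layer would transform lane1 and lane2 here)
    restored = merge(lane1, lane2, perm)
```

Because every permutation is stored, it can be undone exactly with `np.argsort(perm)`, so shuffling does not affect invertibility.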

So where’s the catch?

The big constraint for this kind of scheme is that input and output of each module must have the exact same dimensionality. INNs look like auto-encoders whose codes have the same size as the original data. While this appears strange at first, the encoding $\mathbf{y}$ can still disentangle complex data distributions $p(\mathbf{x})$ to the point where surprising data manipulations become possible. Look at the demo of the Glow network to get an idea of the method’s potential when applied to images.

Forward process without information loss

2. Ambiguous Inverse Problems

When we observe a system in the natural world, we generally can’t measure its internal parameters $\mathbf{x}$ directly. Instead, our observations $\mathbf{y}$ are produced by a forward process, which translates system parameters into observable quantities. Often, the forward process is well understood, but incurs a loss of information. For example, when the 3D world is projected onto a camera image, information about depth, surface normals, light source positions etc. is lost. As a result, different system states $\mathbf{x}$ are mapped onto identical observations $\mathbf{y}$ :

Forward process with information loss

The inverse process, which we need to infer parameters $\mathbf{x}$ from observations $\mathbf{y}$ , is therefore ambiguous and ill-posed, and its explicit modeling intractable. Instead, one applies statistical inference techniques to express the ambiguities in the form of conditional probabilities $p(\mathbf{x} \,|\, \mathbf{y})$ . Classical Bayesian methods for this problem like MCMC sampling or Approximate Bayesian Computation employ different sampling approaches, but quickly become very expensive even for moderate real-world problems.

Can we learn it?

In many domains, experts already possess sophisticated models for the forward process, and they can easily generate large data sets of matching state/observation pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}$ by simulation. An abundance of data makes machine learning, and especially neural networks, a promising approach. But direct supervised learning of $\mathbf{y} \rightarrow \mathbf{x}$ is problematic for ambiguous inverse problems. Using standard network architectures, the learned mapping will either pick only one of the eligible $\mathbf{x}$ for a given $\mathbf{y}$ , or even worse, will form an average between multiple correct, but incompatible inverses:

Ambiguous inverse process

We actually want our network to learn the full posterior distribution $p(\mathbf{x} \,|\, \mathbf{y})$ . We could learn to predict fitting parameters of a simple distribution, or make the network weights themselves variational, or even both, but in any case this will restrict us to one chosen (simple) family of distributions. We could also turn to conditional GANs, but these are notoriously difficult to train and often suffer from hard-to-detect mode collapse.

Resolving the ambiguity

What we propose to do instead is to introduce additional latent variables $\mathbf{z}$ which capture the information that would otherwise get lost in the forward process. Consequently, $\mathbf{x} \leftrightarrow [\mathbf{y}, \mathbf{z}]$ becomes a bijective mapping:

Forward process with additional latent variables

And this bijective mapping is a great fit for the Invertible Neural Networks we discussed in the beginning! Of course we have to make sure that $\mathbf{x}$ and $[\mathbf{y}, \mathbf{z}]$ have the same total dimensionality. But it turns out that we can cheat this rule, in a way, by artificially increasing the dimensionality of either side with zero padding. We can even pad both sides, which means that intermediate representations have higher dimensionality as well, making the model more powerful.
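The padding trick amounts to appending zero columns until both sides have the same width. The following is a small sketch with made-up sizes (3 parameters, 2 observables, 2 latent dimensions, padded to 6). The helper `pad_to` is hypothetical and not taken from the paper's code.

```python
import numpy as np

def pad_to(a, total_dim):
    """Append zero columns so that a has total_dim features."""
    extra = total_dim - a.shape[1]
    if extra < 0:
        raise ValueError("total_dim is smaller than the input dimension")
    return np.concatenate([a, np.zeros((a.shape[0], extra))], axis=1)

# Hypothetical sizes: 3 parameters, 2 observables, 2 latent dimensions.
x = np.ones((4, 3))
y = np.ones((4, 2))
z = np.ones((4, 2))

total_dim = 6  # chosen larger than both sides, so both get padded
x_padded = pad_to(x, total_dim)                                # [x, 0, 0, 0]
yz_padded = pad_to(np.concatenate([y, z], axis=1), total_dim)  # [y, z, 0, 0]
```

Both `x_padded` and `yz_padded` now have six columns, so an invertible network of width six can map one onto the other.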

Assume for a moment that we have an invertible network which perfectly reproduces the simulation of the forward process, and arranges $\mathbf{z}$ to follow some simple distribution (e.g. standard normal). With this, we can approximate the distribution $p(\mathbf{x} \,|\, \mathbf{y})$ just by repeatedly sampling $\mathbf{z}$ and running the inverse pass of the network, i.e. $[\mathbf{y}, \mathbf{z}] \rightarrow \mathbf{x}$ . In other words, $p(\mathbf{x} \,|\, \mathbf{y})$ has been reparametrized into a deterministic function $\mathbf{x} = f(\mathbf{y}, \mathbf{z})$ with noise variable $\mathbf{z}$ . If you think this sounds like a conditional GAN, you are absolutely right. At test time, it’s the same thing! But the invertible architecture allows for a completely different training scheme, with some distinct advantages.
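The sampling procedure can be illustrated without a trained network. In the sketch below, a hand-written rotation (`toy_forward`, `toy_inverse`) stands in for the INN. It is a purely hypothetical stand-in, chosen because a standard normal $\mathbf{x}$ in 2D then yields independent, standard normal $y$ and $z$.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def toy_forward(x):
    """x -> [y, z]: a rotation, standing in for the forward pass of a trained INN."""
    y = (x[:, 0] + x[:, 1]) / SQRT2
    z = (x[:, 0] - x[:, 1]) / SQRT2
    return y, z

def toy_inverse(y, z):
    """[y, z] -> x, the exact inverse of toy_forward."""
    return np.stack([(y + z) / SQRT2, (y - z) / SQRT2], axis=1)

def sample_posterior(y_obs, n_samples, rng):
    """Approximate p(x | y_obs): draw z ~ N(0, 1) repeatedly and run the inverse pass."""
    z = rng.standard_normal(n_samples)
    y = np.full(n_samples, y_obs)
    return toy_inverse(y, z)

rng = np.random.default_rng(0)
x_samples = sample_posterior(y_obs=1.0, n_samples=1000, rng=rng)
y_check, _ = toy_forward(x_samples)  # every sample reproduces the observation
```

All samples in `x_samples` lie on the line $x_1 + x_2 = \sqrt{2}\, y$, which is exactly the set of parameters compatible with the observation in this toy setup.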

Loss functions used in training our Invertible Neural Networks

Because of the bijective nature of our network, we can train it to solve the well-posed forward process $\mathbf{x} \rightarrow \mathbf{y}$ in a supervised manner, instead of the ill-posed inverse process. We require the latent variables $\mathbf{z}$ to be independent of $\mathbf{y}$ , and to follow an easy-to-sample-from distribution, like $\mathcal{N}(\mathbf{0}, \mathbf{1})$ . Both conditions can be achieved with a Maximum Mean Discrepancy (MMD) loss, which matches two distributions by comparing samples.
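To give an idea of what comparing samples means here, the following sketch computes a biased estimate of the squared MMD between two sample sets. The kernel (an inverse multiquadratic form) and its bandwidth `h` are assumptions made for this example. The text above does not fix them, and this is not the training code from the paper.

```python
import numpy as np

def kernel(a, b, h=1.0):
    """Inverse multiquadratic kernel k(a, b) = 1 / (1 + ||a - b||^2 / h^2)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return 1.0 / (1.0 + sq_dist / h**2)

def mmd2(p_samples, q_samples, h=1.0):
    """Biased sample estimate of the squared MMD between two sample sets."""
    k_pp = kernel(p_samples, p_samples, h).mean()
    k_qq = kernel(q_samples, q_samples, h).mean()
    k_pq = kernel(p_samples, q_samples, h).mean()
    return k_pp + k_qq - 2.0 * k_pq

rng = np.random.default_rng(0)
z_model = rng.standard_normal((500, 2))  # stand-in for latent outputs of a model
z_prior = rng.standard_normal((500, 2))  # samples from the target N(0, 1)
z_shifted = z_model + 3.0                # a clearly mismatched distribution

mmd_matched = mmd2(z_model, z_prior)
mmd_mismatched = mmd2(z_shifted, z_prior)
```

The estimate is close to zero when both sets come from the same distribution and grows when they differ, which is what makes it usable as a loss on network outputs.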

As a mild form of regularization, we also apply MMD between $p(\mathbf{x})$ from our model (marginalized over all $\mathbf{y}$ and $\mathbf{z}$ ) and the prior represented by the training data. Here, we make use of the previously mentioned bi-directional training to accumulate gradients from loss terms on either end of the network. In the same way we could add a GAN-like discriminator loss on $\mathbf{x}$ . But in our applications so far, MMD turned out to be sufficient, so we can avoid the troubles of adversarial training. Finally, if we use zero padding for the input or output, we put a simple sparsity-enforcing loss on the dimensions in question.

Intuitively, our network learns to mimic the simulation, while splitting information about the ambiguity of its inverse off and transforming it into normally distributed latent variables. When we learn this forward transformation, we get the inverse for free, thanks to the invertible construction of our network. It is not even necessary to make any assumptions about $\mathbf{x}$ or the target distribution $p(\mathbf{x} \,|\, \mathbf{y})$ . Also, since each training pair $(\mathbf{x}_i, \mathbf{y}_i)$ must be mapped to some position $\mathbf{z}_i$ in latent space, it can be recovered when sampling $\mathbf{z}$ for the inverse direction. We observe this to be a strong mechanism against mode collapse.

3. Does it work? A Toy Example

To check how well we can recover the shape of $p(\mathbf{x} \,|\, \mathbf{y})$ , consider the following set of toy problems. Our parameters $\mathbf{x}$ are the 2D coordinates of points distributed according to a Gaussian Mixture Model with eight components, arranged in a circle. Samples from this $\mathbf{x}$ -distribution are colored depending on which mode they were drawn from – we take these color labels to be our measurements $\mathbf{y}$ . We test three different measurement “labellings”: each mode has a different color (left), some colors are shared (middle) or all modes have the same color (right). These correspond to progressively more ambiguous forward processes.
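A data set of this kind could be generated roughly as follows. The radius, the standard deviation of the modes, the equal mixture weights and in particular the grouping of modes in the `"middle"` labelling are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

def sample_gmm_ring(n, rng, n_modes=8, radius=1.0, std=0.05):
    """Draw n points from an equally weighted GMM with modes on a circle."""
    angles = 2.0 * np.pi * np.arange(n_modes) / n_modes
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    mode = rng.integers(n_modes, size=n)  # which component each point comes from
    x = centers[mode] + std * rng.standard_normal((n, 2))
    return x, mode, centers

# Lookup tables: mode index -> color label y. The middle grouping is hypothetical.
LABELLINGS = {
    "left": np.arange(8),                           # every mode has its own color
    "middle": np.array([0, 0, 1, 1, 1, 1, 2, 3]),   # some colors are shared
    "right": np.zeros(8, dtype=int),                # all modes share one color
}

rng = np.random.default_rng(0)
x, mode, centers = sample_gmm_ring(2000, rng)
y = {name: table[mode] for name, table in LABELLINGS.items()}
```

The same points `x` are reused for all three settings and only the labels `y` change, so the forward process becomes more ambiguous from left to right.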

Ground truth distributions for the three settings of our toy example

The image above shows samples from the ground truth setup of the three different labellings. Corresponding samples from our network’s learned posterior distributions are shown in the video below, as they evolve over the course of training:


All three experiments were performed with the same model architecture, all losses weighted equally. It is easy to see that our Invertible Neural Network is able to reproduce the original distributions reliably and accurately. In the paper’s supplemental material we also show results obtained by various established methods. Most of them struggle with this toy example when afforded the same number of trainable parameters.

Structure of the latent space

For the experiments shown above, we only used a two-dimensional latent space. This way we can visualize as an image how the model is using $\mathbf{z}$ , given a specific label $\mathbf{y}$ . For each coordinate $\mathbf{z}_{i}$ in latent space, we run $[\mathbf{y}, \mathbf{z}_{i}]$ through the inverse pass of our model to obtain a sample $\hat{\mathbf{x}}_{i}$ , then color the corresponding pixel as follows: The hue depends on the mode in $\mathbf{x}$ -space that $\hat{\mathbf{x}}_{i}$ is closest to. The intensity depends on how far away $\hat{\mathbf{x}}_{i}$ is from that mode.
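The coloring rule can be sketched as a function that evaluates the inverse pass on a grid of latent coordinates. Both `latent_map` and `dummy_inverse` are hypothetical names. The dummy inverse merely stands in for a trained model, ignoring $\mathbf{y}$ and letting the sign of the first latent coordinate choose between two modes.

```python
import numpy as np

def latent_map(inverse_fn, y, centers, extent=3.0, resolution=50):
    """For a grid of z values, return the closest mode and the distance to it."""
    axis = np.linspace(-extent, extent, resolution)
    z1, z2 = np.meshgrid(axis, axis)
    z = np.stack([z1.ravel(), z2.ravel()], axis=1)
    x_hat = inverse_fn(y, z)  # one sample per pixel, shape (N, 2)
    dist = np.linalg.norm(x_hat[:, None, :] - centers[None, :, :], axis=-1)
    closest = dist.argmin(axis=1)   # decides the hue
    distance = dist.min(axis=1)     # decides the intensity
    shape = (resolution, resolution)
    return closest.reshape(shape), distance.reshape(shape)

# Stand-in for a trained inverse pass (assumption, for illustration only).
centers = np.array([[1.0, 0.0], [-1.0, 0.0]])

def dummy_inverse(y, z):
    picked = centers[(z[:, 0] < 0).astype(int)]
    return picked + 0.1 * z

hue, intensity = latent_map(dummy_inverse, y=None, centers=centers)
```

With this stand-in, `hue` splits the latent grid into two halves of equal size. A real model would be plugged in through `inverse_fn`.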

For the label $\mathbf{y} = \text{orange}$ in the middle setting, the latent space of our converged model looks like this:

Latent space layout for the orange label in the middle setting

The colors used here have nothing to do with the ones used as $\mathbf{y}$ -labels before. The two circles mark the areas that contain 50% and 90% of the probability mass of the Gaussian latent prior $p(\mathbf{z})$ , respectively. We can see that $\mathbf{z}$ -space is divided into two equally sized regions, corresponding to the two orange modes in the middle setting.
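For reference, the radii of such circles follow from a short calculation, assuming the prior is a two-dimensional standard normal centered at the origin. In that case $\|\mathbf{z}\|^2$ is chi-squared distributed with two degrees of freedom, so the mass inside radius $r$ has a closed form:

$\begin{align*} P(\|\mathbf{z}\| \le r) &= 1 - \exp\!\big(-r^2 / 2\big) \\ r_p &= \sqrt{-2 \ln(1 - p)} \\ r_{0.5} &= \sqrt{2 \ln 2} \approx 1.18 \\ r_{0.9} &= \sqrt{2 \ln 10} \approx 2.15 \end{align*}$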

We can also compare the layout for $\mathbf{y} = \text{blue}$ in all three settings. The video below shows how the latent space visualization develops over the entire course of training:


4. Where can we apply this?

Ambiguous inverse problems as described above pop up in many places, and often enough scientists can simulate a system better than they can effectively observe it. We looked at two such problems, the first coming from medical science.

In medical science

To make optimal decisions during minimally invasive surgery, doctors ideally want to know a number of local properties of the tissue they operate on, such as oxygen saturation, layer thickness and blood flow. However, for practical reasons, they can only inspect the tissue’s surface with a tiny multispectral camera. Many different configurations of the tissue parameters $\mathbf{x}$ can result in the same spectral response $\mathbf{y}$ , and so they end up with an ambiguous inverse problem. Using data from high-quality simulations of this process, we applied our new method to tackle this problem.

The following animation shows how we reconstruct the posterior $p(\mathbf{x} \,|\, \mathbf{y})$ with our fully trained Invertible Neural Network by sampling more and more instances of $\mathbf{z}$ for a single given observation $\mathbf{y}$ .


Each panel develops the marginal posterior $p(\mathbf{x}_i \,|\, \mathbf{y})$ for a single parameter $\mathbf{x}_i$ , shown in orange. The gray areas show the prior $p(\mathbf{x})$ over the whole data set for context. The dotted lines are ground truth values from an actual $\mathbf{x}$ associated with $\mathbf{y}$ in the test set.

We can see that the network is very certain (and apparently spot-on) about oxygen saturation. For blood density, the posterior is visibly lopsided in order to avoid values outside of the prior’s support. Most interestingly, in the last two panels, posterior and prior are practically identical. This is our network telling us that we simply cannot derive any information about these parameters from our spectral measurements $\mathbf{y}$ .

Details about the data set and experiment can be found in the paper. There we also show that we can find correlations between the parameters’ posteriors, which are especially interesting to domain experts.

In astrophysics

A very similar problem arises in a branch of astrophysics that explores the life cycle of star clusters in interstellar gas clouds. The internal parameters of such systems, and simulations thereof, are very complex and interact a lot over time. But observation of real objects is again essentially limited to snapshots of the emitted light spectrum, introducing strong ambiguity. As before, we trained our network on data from state-of-the-art simulations.

And again, we can sample from the latent space to generate marginal posterior distributions with our fully trained model, given an observation $\mathbf{y}$ :


Colors have the same meaning as before. The peculiar shape of the prior here is due to the very dynamic behavior of star cluster simulations over time.

In this scenario, we actually find distinctly multi-modal distributions for some parameters $\mathbf{x}_i$ . These multiple modes, combined with the correlation between marginal posteriors, offer great insights into the system. We can for example say that this specific observation $\mathbf{y}$ either corresponds to a young cluster with large expansion velocity, or to an older system that expands slowly.

5. Closing thoughts

An open question, which we share with e.g. Autoencoder architectures, is how to determine the intrinsic dimension of a task or dataset. For best results, $\mathbf{z}$ should be neither smaller nor larger than this.

The special structure of the coupling layers in Invertible Neural Networks may preclude direct application of some other architectural tricks. On the flipside, it offers a unique memory/computation trade-off as we don’t need to store activations from the forward pass for backpropagation. We are however not making use of this in our current work.

Overall, the permutation of variables between subsequent (blocks of) coupling layers seems to be a crucial point. Without permutation, variables $u_{1,i}, u_{2,j}$ from the two streams $\mathbf{u}_1, \mathbf{u}_2$ can only interact via coupling, and never within the same subnetwork. A generalization of the simple shuffle we use is the main technical contribution of the brand new Glow framework.

Several previous works have proposed training networks for both directions of a task, some even with coupled weights. But we are not aware of any setting where loss functions were placed on either end of one and the same network [Update: Flow-GAN does so with an invertible generator network]. This truly bi-directional training could open up many possibilities, and we are excited to see where else it might be used.
