Abstract
The incorporation of generative models as regularisers within variational formulations for inverse problems has proven effective across numerous image reconstruction tasks. However, the resulting optimisation problem is often non-convex and challenging to solve. In this work, we show that score-based generative models (SGMs) can be used in a graduated optimisation framework to solve inverse problems. We show that the resulting graduated non-convexity flow converges to stationary points of the original problem and provide a numerical convergence analysis of a 2D toy example. We further provide experiments on computed tomography image reconstruction, where we show that this framework is able to recover high-quality images, independent of the initial value. The experiments highlight the potential of using SGMs in graduated optimisation frameworks. The code is available at https://github.com/alexdenker/GradOpt-SGM.
Index Terms—graduated optimisation, score-based generative models, optimisation, inverse problems
1 Introduction
Many problems in image reconstruction can be formulated as linear inverse problems, where the goal is to recover an image $x$ given noisy measurements $y$ that are related via
(1)  $y = Ax + \varepsilon.$
Here $A$ is a linear forward operator and $\varepsilon$ the noise. Inverse problems are often ill-posed and require regularisation to stabilise the reconstruction. Variational regularisation is a popular framework to address ill-posedness [1]. It formulates the reconstruction as an optimisation problem
(2)  $\min_x \; \mathcal{D}(Ax, y) + \alpha \mathcal{R}(x),$
where $\mathcal{D}$ quantifies fitness to the measurements, $\mathcal{R}$ is a regulariser, and $\alpha > 0$ balances the two terms. For additive Gaussian noise the datafit is typically chosen as $\mathcal{D}(Ax, y) = \tfrac{1}{2}\|Ax - y\|_2^2$. Finding a suitable regulariser is a difficult task. In recent years a wide variety of deep learning methods have been developed that aim to learn a regulariser from a given dataset, see [2] for a review.
The statistical perspective on inverse problems identifies the regulariser with the negative log-likelihood of a prior distribution $p$, i.e. $\mathcal{R}(x) = -\log p(x)$. In this context, learning a prior can be formulated as learning a parametrised distribution $p_\theta$ as a proxy for $p$ [3]. If we have access to the likelihood, we can consider the optimisation problem
(3)  $\min_x \; \mathcal{D}(Ax, y) - \alpha \log p_\theta(x)$
as an instance of variational regularisation. Generative regularisers of this form have been extensively studied for various inverse problems [3]. However, as the negative log-likelihood of the generative model is often highly non-convex, (3) is a challenging optimisation problem with many local minima, and its solution strongly depends on the initialisation.
We propose to tackle this problem using score-based generative models (SGMs) [4] as regularisers. SGMs learn a sequence of gradually perturbed distributions, starting at the data distribution and terminating in pure noise. It was recently observed that this sequence can be used in annealed Langevin sampling [5, 6] and graduated optimisation [7]. Graduated optimisation [8], also known as continuation methods [9], is a heuristic for dealing with non-convex problems in which the objective function is replaced with a convex surrogate that can be minimised efficiently. The surrogate objective is then gradually transformed back into the original non-convex objective.
In this work, we exploit the connection between SGMs and graduated optimisation to solve the underlying non-convex optimisation problem and avoid local minima. We showcase our algorithms on computed tomography.
2 Background
2.1 Graduated Optimisation
Graduated non-convexity is a heuristic global optimisation method for solving non-convex minimisation problems, which creates a sequence of surrogate optimisation problems that approximate the original problem through gradual levels of smoothing or convexification [8]. Namely, for a non-convex function $F$ let $\{F_t\}_{t \in [0,T]}$ be a family of surrogates such that $F_0 = F$ and $F_T$ is convex with a unique minimiser. The optimisation protocol consists of iterations over a decreasing schedule $T = t_K > t_{K-1} > \dots > t_0 = 0$. In step $k$ we find the minimiser of $F_{t_k}$, using the minimiser from the previous step as the starting value. The process is terminated by minimising the original function $F_0 = F$ using the last minimiser as initialisation.
The success of graduated optimisation strongly depends on the construction of the embedding function $F_t$. Under specific constraints on the type of embedding and the class of non-convex functions under consideration, it is possible to obtain global convergence results [10].
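The protocol can be summarised in a few lines. Below is a minimal sketch, in which `build_surrogate` and `minimise` are hypothetical placeholders for the family of smoothed objectives and an inner solver; it is not tied to any particular smoothing construction.

```python
def graduated_optimisation(build_surrogate, minimise, x_init, ts):
    """Generic graduated optimisation: ts is a decreasing smoothing schedule ending at 0,
    build_surrogate(t) returns the surrogate objective at smoothing level t, and
    minimise(f, x0) returns an (approximate) minimiser of f started at x0."""
    x = x_init
    for t in ts:                                # t_K > t_{K-1} > ... > t_0 = 0
        x = minimise(build_surrogate(t), x)     # warm-start with the previous solution
    return x
```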
2.2 Score-based Generative Models
Score-based generative models (SGMs) have emerged as a powerful tool for generative modelling [4, 11]. SGMs consist of a (predefined) forward process, during which noise is gradually added, and a learnable reverse process, which allows transforming noise into samples. The forward process is prescribed by an Itô stochastic differential equation (SDE)
(4)  $\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t.$
The drift function $f$ and the diffusion function $g$ are chosen such that the terminal distribution approximates a Gaussian, i.e. $p_T \approx \mathcal{N}(0, I)$. Under certain conditions there exists a reverse diffusion process
(5)  $\mathrm{d}x_t = \left[ f(x_t, t) - g(t)^2 \nabla_{x_t} \log p_t(x_t) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w}_t,$
which runs backwards in time [12]. SGMs approximate the score function $\nabla_{x_t} \log p_t(x_t)$ by a neural network $s_\theta(x_t, t)$, which can be trained using denoising score matching [13].
For an affine drift $f$, the transition kernel is given as $p_{t|0}(x_t \mid x_0) = \mathcal{N}(x_t; \gamma_t x_0, \nu_t^2 I)$ with parameters $\gamma_t, \nu_t > 0$ [14]. The SGM learns all intermediate distributions, i.e.
(6)  $p_t(x_t) = \int p_{t|0}(x_t \mid x_0)\, p_0(x_0)\, \mathrm{d}x_0.$
These intermediate distributions are similar to a Gaussian homotopy [9], as an SDE without a drift gives $p_t = p_0 * \mathcal{N}(0, \nu_t^2 I)$.
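For the VP-SDE used in the experiments below, the mean scaling and the standard deviation of the perturbation kernel have a closed form. The following is a small sketch, assuming the linear $\beta$-schedule of [4]; the endpoint values are illustrative placeholders.

```python
import numpy as np

def vpsde_kernel_params(t, beta_min=0.1, beta_max=20.0):
    """Mean scaling gamma_t and standard deviation nu_t of
    p_{t|0}(x_t | x_0) = N(gamma_t x_0, nu_t^2 I) for the VP-SDE
    dx = -0.5 beta(t) x dt + sqrt(beta(t)) dw with beta(t) linear in t."""
    log_gamma = -0.25 * t ** 2 * (beta_max - beta_min) - 0.5 * t * beta_min
    gamma_t = np.exp(log_gamma)
    nu_t = np.sqrt(1.0 - np.exp(2.0 * log_gamma))
    return gamma_t, nu_t
```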
3 Methods
3.1 Graduated non-convexity Flow
Using the SGM framework we define the negative log-likelihoods
(7)  $\mathcal{R}_t(x) := -\log p_t(x), \quad t \in [0, T].$
We expect that $\mathcal{R}_0$ is highly non-convex, so that the corresponding optimisation problem
(8)  $\min_x \; F_0(x) := \tfrac{1}{2}\|Ax - y\|_2^2 + \alpha\, \mathcal{R}_0(x)$
is challenging to solve. However, due to the convolution with a Gaussian in (6), there exists a $\bar{t} \in (0, T]$ such that $\mathcal{R}_t$ is convex for all $t \geq \bar{t}$ [7]. Since the datafit is convex, the resulting optimisation problem is also convex for $t \geq \bar{t}$ and can be solved using common convex optimisation methods. This motivates using SGMs as an embedding in a graduated optimisation scheme, giving a sequence of optimisation problems
(9)  $\min_x \; F_{t_k}(x) := \tfrac{1}{2}\|Ax - y\|_2^2 + \alpha_{t_k}\, \mathcal{R}_{t_k}(x), \qquad k = K, \dots, 0,$
where $\alpha_{t_k} > 0$ is a regularisation parameter and the optimisation at level $t_k$ is initialised with the solution of the previous level. Instead of exactly solving (9) in each step, we perform one step of gradient descent,
(10)  $x_{k+1} = x_k - \lambda \left( A^*(A x_k - y) - \alpha_{t_k}\, s_\theta(x_k, t_k) \right),$
where $s_\theta$ is a pre-trained score-based model approximating $\nabla_x \log p_t$. This corresponds to Algorithm 2 in [7] with a predefined smoothing schedule.
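A minimal sketch of the update (10) is given below. The callable `score_model(x, t)` stands in for the pretrained score network and `A` for a NumPy forward operator; all names and the constant step size are placeholders, not the released implementation.

```python
import numpy as np

def gnc_flow(A, y, x0, score_model, ts, alphas, step_size):
    """Graduated non-convexity flow: one gradient step per smoothing level t_k,
    with -alpha_k * score_model(x, t_k) acting as the gradient of the regulariser."""
    x = x0
    for t_k, alpha_k in zip(ts, alphas):                   # ts decreasing towards 0
        grad = A.T @ (A @ x - y) - alpha_k * score_model(x, t_k)
        x = x - step_size * grad
    return x
```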
3.2 Gradient-like Methods
The critical issue in surrogate optimisation methods is ensuring convergence to an optimum of the original problem. For the graduated non-convexity flow in Section 3.1 this boils down to the update directions used in (10), which are not ensured to point in a direction of descent of the original objective. However, convergence of iterations of the form $x_{k+1} = x_k + \lambda_k d_k$ can still be shown provided $d_k$ is a descent direction, i.e. $\langle d_k, \nabla F_0(x_k) \rangle < 0$. Iterative methods of this type are also known as gradient-like methods [15, Section 8.3, p. 75] or sufficient descent methods [16].
We consider the case when the maximal number of iterations (i.e. the fineness of the smoothing schedule) tends to infinity. Let $(t_k)_{k \in \mathbb{N}} \subset (0, T]$ be a decreasing sequence such that
(11)  $\lim_{k \to \infty} t_k = 0.$
To show the convergence of the iterates we use the following notion of gradient-like directions.
Definition 3.1 (Gradient-like Directions [15]).
Let $F : \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable and let $(x_k)_{k \in \mathbb{N}} \subset \mathbb{R}^n$. A sequence $(d_k)_{k \in \mathbb{N}}$ is called gradient-like with respect to $F$ and $(x_k)$ if for every subsequence $(x_{k_j})_j$ converging to a non-stationary point of $F$ there exist constants $C > 0$, $\varepsilon > 0$ and an index $j_0 \in \mathbb{N}$ such that
- (a) $\|d_{k_j}\| \leq C$ for all $j \geq j_0$, and
- (b) $\langle \nabla F(x_{k_j}), d_{k_j} \rangle \leq -\varepsilon$ for all $j \geq j_0$.
Under these conditions every limit point of the iterates produced by a gradient-like algorithm is a stationary point of $F$ (see [15, Theorem 8.9]). In order to show convergence we also require $\nabla \mathcal{R}_t$ to be Lipschitz continuous, which ensures that the search directions defined in (12) are gradient-like with respect to $F_0$ in (8) and the iterates $(x_k)$. With this, we have the following convergence result.
Theorem 3.2.
Let $F_0$ and $F_{t_k}$ be given as in (8) and (9), with $\mathcal{R}_t$ continuously differentiable for all $t \in [0, T]$. Assume that $(t_k)_{k \in \mathbb{N}}$ satisfies (11). Let $(x_k)$ and $(d_k)$ be the sequences generated by Algorithm 2, where the smoothing parameter is determined using the adaptive smoothing schedule. Moreover, we assume that $\nabla \mathcal{R}_t$ is Lipschitz continuous with a global Lipschitz constant $L$ for all $t \in [0, T]$. Then every limit point of the sequence $(x_k)$ is a stationary point of $F_0$.
In the case of the graduated non-convexity flow for the objective (9) and the iterations (10), the directions $d_k$ correspond to
(12)  $d_k = -\left( A^*(A x_k - y) - \alpha_{t_k}\, s_\theta(x_k, t_k) \right),$
which do not necessarily satisfy the descent condition
(13)  $\langle d_k, \nabla F_0(x_k) \rangle < 0$
needed for convergence. We can address this issue through an adaptive smoothing schedule by selecting, in each iteration, the largest smoothing parameter such that the descent condition (13) is satisfied. Such an index always exists as long as $x_k$ is not a stationary point of $F_0$, since the maximal index in Algorithm 2 leads to $d_k \approx -\nabla F_0(x_k)$ and hence $\langle d_k, \nabla F_0(x_k) \rangle < 0$.
Theorem 3.2 can be shown using this adaptive smoothing schedule and by ensuring the properties (a) and (b) in Definition 3.1 hold.
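A sketch of this adaptive smoothing schedule is given below: at the current iterate it returns the largest remaining smoothing parameter whose direction still satisfies the descent condition (13). The gradient callables `grad_F0` and `grad_Ft` are hypothetical placeholders for $\nabla F_0$ and $\nabla F_t$.

```python
import numpy as np

def adaptive_direction(x, ts, grad_F0, grad_Ft):
    """Return a smoothing level t and direction d = -grad_Ft(x, t) such that
    <d, grad_F0(x)> < 0, scanning ts from the largest to the smallest value."""
    g0 = grad_F0(x)
    for t in ts:                                   # ts sorted in decreasing order
        d = -grad_Ft(x, t)
        if np.dot(d.ravel(), g0.ravel()) < 0.0:
            return t, d
    return 0.0, -g0                                # t -> 0 always yields a descent direction
```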
3.3 Energy-based Parametrisation
Adaptive step-size iterative methods require evaluating the objective function in each iteration to ensure convergence. However, traditional SGMs approximate only the score function and do not offer access to the likelihood. An alternative are energy-based models (EBMs) [17], where the probabilistic model is parametrised as $p_\theta(x) = \exp(-E_\theta(x)) / Z_\theta$, for a scalar-valued neural network $E_\theta$ and a normalisation constant $Z_\theta$. EBMs can be trained using the score matching objective by defining $s_\theta(x) := -\nabla_x E_\theta(x)$ [18], which can be evaluated using common automatic differentiation libraries. This comes at a higher computational cost than the score parametrisation. Following Du et al. [19], we parametrise the energy as
(14)  $E_\theta(x, t) = \tfrac{1}{2}\left\| x - f_\theta(x, t) \right\|_2^2,$
where $f_\theta$ is implemented as a time-conditional U-Net [11]. The perturbed cost function is then given as
(15)  $F_t(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \alpha_t\, E_\theta(x, t).$
Remark 1.
The likelihood in the EBM framework is defined by
(16)  $p_t(x) = \frac{\exp(-E_\theta(x, t))}{Z_\theta(t)}.$
This introduces a time-step dependency into the (intractable) normalisation constant $Z_\theta(t)$. However, to compute the step sizes we only need to evaluate the objective up to additive constants, which means that $Z_\theta(t)$ does not need to be computed. The energy-based parametrisation was used in a similar fashion to compute Metropolis correction probabilities in [19].
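The following sketch shows how an energy and the corresponding score can be implemented with automatic differentiation. It assumes a time-conditional network `unet(x, t)` and the squared-denoiser form of (14); both are assumptions for illustration, not the exact released implementation.

```python
import torch

def energy(unet, x, t):
    # E_theta(x, t) = 0.5 * ||x - unet(x, t)||^2, one scalar per sample
    diff = x - unet(x, t)
    return 0.5 * diff.flatten(start_dim=1).pow(2).sum(dim=1)

def score_from_energy(unet, x, t):
    # s_theta(x, t) = -grad_x E_theta(x, t), obtained via autograd
    x = x.detach().requires_grad_(True)
    e = energy(unet, x, t).sum()
    return -torch.autograd.grad(e, x)[0]
```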
4 Experiments
In this section we investigate the numerical performance of Algorithm 2. We start with a toy example, where the data distribution is given by a Gaussian mixture model and the score can be computed analytically. For the second experiment we investigate computed tomography reconstruction on two datasets with a trained SGM. We use the PSNR and SSIM [20] to evaluate the reconstruction quality.
4.1 2D Toy Example
To illustrate the behaviour and the convergence properties, we consider a Gaussian mixture model consisting of five Gaussians. The density at time step $t$ is again a Gaussian mixture model with a perturbed covariance matrix. The diffusion process is given by the forward SDE with perturbation kernel $p_{t|0}(x_t \mid x_0) = \mathcal{N}(x_t; \gamma_t x_0, \nu_t^2 I)$. Furthermore, we consider a simple two-dimensional inverse problem with a linear forward operator and clean measurements. We choose a constant regularisation parameter and adjust one mean of the Gaussian mixture model so that the global minimum of the cost function $F_0$ is located at that position. We evaluate the graduated non-convexity flow (Algorithm 1) with a constant step size as well as the gradient-like method (Algorithm 2) with the adaptive smoothing schedule; the values $t_k$ are evenly spaced. The goal is to analyse the algorithms in terms of their convergence properties with respect to the initialisation $x_0$, the initial smoothing parameter $t_K$ and the iteration number. To do so, we run the algorithms with 1300 iterations for equally spaced initial points as well as 100 different values of $t_K$, which are evenly distributed on a log scale. The resulting rates of trajectories converging to stationary points and to the global minimum are shown as a pseudocolour plot in Fig. 1. The dependence on the initialisation is shown in the left column, where isolines display the loss landscape of $F_0$ and the orange paths of iterates show exemplary trajectories for different $x_0$ and $t_K$. The dependence on the iteration number and $t_K$ is shown in the middle and right columns.
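For reference, the analytic score of a Gaussian mixture used in this toy example can be computed as below; since the perturbed density at time $t$ is again a mixture (with scaled means and perturbed covariances), the same routine applies at every smoothing level. The weights, means and covariances are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_score(x, weights, means, covs):
    """Score of p(x) = sum_i w_i N(x; mu_i, Sigma_i):
    grad log p(x) = sum_i r_i(x) Sigma_i^{-1} (mu_i - x) with responsibilities r_i(x)."""
    dens = np.array([w * multivariate_normal.pdf(x, mean=m, cov=c)
                     for w, m, c in zip(weights, means, covs)])
    resp = dens / dens.sum()
    grads = np.array([np.linalg.solve(c, m - x) for m, c in zip(means, covs)])
    return (resp[:, None] * grads).sum(axis=0)
```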
4.2 Computed Tomography
We evaluate our approach on two datasets: Ellipses [21] and AAPM [22]. Ellipses contains images of a varying number of ellipses with random orientation and size. The AAPM dataset consists of CT images from several patients; we use all but one patient to train the SGM and the remaining patient (id C027) for evaluation. We simulate measurements using a parallel-beam Radon transform and add relative Gaussian noise. For both datasets, we train SGMs using the VP-SDE [4]. We use a backtracking line search based on the Armijo-Goldstein condition to determine a suitable step size, with the Barzilai-Borwein method [23] providing the candidate step size. We choose the regularisation parameter $\alpha_t$ based on the standard deviation and the mean scaling of the perturbation kernel; the value was set according to a coarse sweep over one example image. In Fig. 2 we show the result of the gradient-like method on one AAPM example, including the PSNR, objective value, step size and the gradient-like condition from (13). Fig. 3 shows example reconstructions for both datasets.
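A sketch of the step-size rule described above is given below: a Barzilai-Borwein candidate step, reduced by backtracking until an Armijo-type sufficient-decrease condition holds along the negative gradient. The objective `f`, the gradients and the constants are illustrative placeholders.

```python
import numpy as np

def bb_armijo_step(f, x, x_prev, g, g_prev, c1=1e-4, shrink=0.5, max_backtracks=20):
    """Candidate step from the Barzilai-Borwein rule, backtracked until the Armijo condition
    f(x + lam * d) <= f(x) + c1 * lam * <g, d> holds along the direction d = -g."""
    s, z = x - x_prev, g - g_prev
    lam = abs(float(np.dot(s.ravel(), z.ravel()) / (np.dot(z.ravel(), z.ravel()) + 1e-12)))
    lam = lam if lam > 0 else 1.0
    d, fx, slope = -g, f(x), float(np.dot(g.ravel(), -g.ravel()))
    for _ in range(max_backtracks):
        if f(x + lam * d) <= fx + c1 * lam * slope:
            break
        lam *= shrink
    return lam
```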
Initial value
The gradient-like method (Alg. 1) is deterministic, except for the choice of the initialisation $x_0$. We study the effect of $x_0$ for both Ellipses and AAPM (see Table 1) by comparing Alg. 1 with gradient descent applied to the objective function at $t = 0$, see (8). For both datasets we use a logarithmically spaced time schedule for Alg. 1. We evaluate several random initialisations per image and compute mean PSNR and SSIM values over all reconstructions, as well as the mean standard deviation over the initialisations per image. For both Ellipses and AAPM the mean PSNR and mean SSIM are higher for Alg. 1, while the mean standard deviation is lower. That is, the gradient-like method achieves better reconstructions more consistently.
Table 1: Mean PSNR and SSIM, and mean standard deviation over random initialisations, for Ellipses and AAPM, comparing gradient descent on (8) with Alg. 1.
Table 2: PSNR and SSIM on AAPM and Ellipses for Alg. 2 with a constant and an adaptive step size.
Adaptive step size
The energy-based parametrisation comes with a higher computational cost. However, it allows using adaptive step sizes for Alg. 2. Table 2 shows that this leads to better reconstructions than a fixed step size. Compare also the results in Fig. 2, which show that the computed step size changes by several orders of magnitude over the iterations. Moreover, fewer iterations are needed to obtain convincing reconstructions.
5 Discussion
In the toy example in Section 4.1 it seems that the gradient-like condition in Definition 3.1 hinders the ability of the algorithm to reach the global optimum, as the search direction is forced to point into the same halfspace as $-\nabla F_0(x_k)$. One possible approach to circumvent this is to enforce this condition only after a set number of iterations, in order to still ensure convergence to a stationary point of $F_0$. However, for the high-dimensional imaging experiments the gradient-like condition does not seem to be too restrictive. As we see in Fig. 2, the condition is satisfied for all iterations and we still achieve a good reconstruction. Further, the adaptive step-size rule leads to an improvement in terms of both PSNR and SSIM, while using fewer iterations, which warrants the additional computational cost of the step-size search.
6 Conclusion
In this work we show that SGMs can be used in a graduated optimisation framework to solve inverse problems. Further, we propose to use an energy-based parametrisation, which enables the use of line search methods for finding a suitable step size. Using the theory of gradient-like directions, we can prove convergence to stationary points. We hypothesise that this method is also better able to escape saddle points and converge to local minima. This research question is left for future work.
References
- [1] O. Scherzer, M. Grasmair, H. Grossauer, M. Haltmeier, and F. Lenzen, Variational Methods in Imaging, vol. 167, Springer, 2009.
- [2] S. Arridge, P. Maass, O. Öktem, and C.-B. Schönlieb, "Solving inverse problems using data-driven models," Acta Numerica, vol. 28, pp. 1–174, 2019.
- [3] M. Duff, N. Campbell, and M. J. Ehrhardt, "Regularising inverse problems with generative machine learning models," J. Math. Imaging Vis., vol. 66, no. 1, pp. 37–56, 2024.
- [4] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," in ICLR, 2021.
- [5] Y. Song and S. Ermon, "Generative modeling by estimating gradients of the data distribution," NeurIPS, vol. 32, 2019.
- [6] Y. Sun, Z. Wu, Y. Chen, B. T. Feng, and K. L. Bouman, "Provable probabilistic imaging using score-based generative priors," arXiv preprint arXiv:2310.10835, 2023.
- [7] E. Kobler and T. Pock, "Learning gradually non-convex image priors using score matching," arXiv preprint arXiv:2302.10502, 2023.
- [8] A. Blake and A. Zisserman, Visual Reconstruction, MIT Press, 1987.
- [9] H. Mobahi and J. W. Fisher, "On the link between Gaussian homotopy continuation and convex envelopes," in EMMCVPR 2015. Springer, 2015, pp. 43–56.
- [10] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz, "On graduated optimization for stochastic non-convex problems," in ICML. PMLR, 2016, pp. 1833–1841.
- [11] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," NeurIPS, vol. 34, pp. 8780–8794, 2021.
- [12] B. Anderson, "Reverse-time diffusion equation models," Stoch. Proc. Appl., vol. 12, no. 3, pp. 313–326, 1982.
- [13] P. Vincent, "A connection between score matching and denoising autoencoders," Neural Computation, vol. 23, no. 7, pp. 1661–1674, 2011.
- [14] S. Särkkä and A. Solin, Applied Stochastic Differential Equations, vol. 10, Cambridge University Press, 2019.
- [15] C. Geiger and C. Kanzow, Numerische Verfahren zur Lösung unrestringierter Optimierungsaufgaben, Springer-Verlag, 2013.
- [16] X. An, S. Li, and Y. Xiao, "Sufficient descent directions in unconstrained optimization," Comput. Optim. Appl., vol. 48, no. 3, pp. 515–532, 2011.
- [17] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
- [18] T. Salimans and J. Ho, "Should EBMs model the energy or the score?," in Energy Based Models Workshop – ICLR, 2021.
- [19] Y. Du et al., "Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC," in ICML. PMLR, 2023, pp. 8489–8510.
- [20] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
- [21] R. Barbano et al., "An educated warm start for deep image prior-based micro CT reconstruction," IEEE Trans. Comput. Imaging, vol. 8, pp. 1210–1222, 2022.
- [22] C. H. McCollough et al., "Low-dose CT for the detection and classification of metastatic liver lesions: results of the 2016 Low Dose CT Grand Challenge," Medical Physics, vol. 44, no. 10, pp. e339–e352, 2017.
- [23] J. Barzilai and J. M. Borwein, "Two-point step size gradient methods," IMA J. Numer. Anal., vol. 8, no. 1, pp. 141–148, 1988.
Appendix
Proof of Theorem 3.2
Theorem 3.2.
Let $F_0$ and $F_{t_k}$ be given as in (8) and (9) and let $\mathcal{R}_t$ be continuously differentiable for all $t \in [0, T]$. Furthermore, let $(t_k)_{k \in \mathbb{N}}$ be as in (11) and let $(x_k)$ as well as $(d_k)$ be the sequences generated based on the update rules of Algorithm 2, where the smoothing parameter is determined by using the adaptive smoothing schedule. Moreover, we assume that $\nabla \mathcal{R}_t$ is Lipschitz continuous with a global Lipschitz constant $L$ for all $t \in [0, T]$, i.e. there exists a constant $L > 0$ such that $\|\nabla \mathcal{R}_t(x) - \nabla \mathcal{R}_t(z)\| \leq L \|x - z\|$ for all $x, z$ and all $t \in [0, T]$.
Then every limit point of the sequence $(x_k)$ is a stationary point of $F_0$.
Proof.
As described in Section 3.2, the adaptive smoothing schedule selects in the $k$-th iteration of Algorithm 2 the largest smoothing parameter such that the descent condition in equation (13) with direction $d_k$ and iterate $x_k$ holds. In other words, we choose the smoothing parameter $\hat{t}_k$ as the largest element of the schedule for which the corresponding direction is a descent direction of $F_0$ at $x_k$.
Such an index always exists as long as $x_k$ is not a stationary point of $F_0$ (which is ensured by Algorithm 2 due to the stopping criterion in Line 2), as we have
$\lim_{t \to 0} \langle -\nabla F_t(x_k), \nabla F_0(x_k) \rangle = -\|\nabla F_0(x_k)\|^2 < 0,$
where the equation follows from the continuity of $t \mapsto \nabla F_t(x_k)$. Hence, there exists a $\bar{t} > 0$ such that the descent condition holds
for all $t \leq \bar{t}$. The above choice of the $\hat{t}_k$ ensures that $\langle d_k, \nabla F_0(x_k) \rangle < 0$ for all $k$, which is a needed criterion for the gradient-like algorithm to converge according to [15]. The sequence $(\hat{t}_k)$ is a subsequence of $(t_k)$ and fulfils the properties $\hat{t}_k > 0$ as well as $\hat{t}_k \to 0$.
Proof by contradiction: let $\bar{x}$ be a non-stationary point of $F_0$ and let $(x_{k_j})_j$ be a subsequence which converges to $\bar{x}$. The aim is to show both properties (a) and (b) in Definition 3.1. We would then obtain that $(d_k)$ is gradient-like with respect to $F_0$ and $(x_k)$. Hence, by [15, Theorem 8.9], we would obtain that every accumulation point of $(x_k)$ is a stationary point of $F_0$, which would give the desired contradiction to the assumption that $\bar{x}$ is non-stationary.
We first show that the subsequence $(d_{k_j})_j$ is bounded, which corresponds to property (a) of Definition 3.1. It holds that
$d_{k_j} = -\nabla F_{\hat{t}_{k_j}}(x_{k_j}).$
As $\mathcal{R}_t$ is continuously differentiable and $x_{k_j} \to \bar{x}$ with $\hat{t}_{k_j} \to 0$, we have that $\nabla F_{\hat{t}_{k_j}}(x_{k_j}) \to \nabla F_0(\bar{x})$ for $j \to \infty$, and the sequence is bounded, i.e. there exists a constant $C > 0$ such that $\|\nabla F_{\hat{t}_{k_j}}(x_{k_j})\| \leq C$. Therefore, we have
$\|d_{k_j}\| \leq C,$
which shows property (a) of Definition 3.1. To prove property (b), we compute
$\langle \nabla F_0(x_{k_j}), d_{k_j} \rangle = \langle \nabla F_0(x_{k_j}), d_{k_j} + \nabla F_0(x_{k_j}) \rangle - \|\nabla F_0(x_{k_j})\|^2.$
As $\nabla F_0$ is continuous and $x_{k_j} \to \bar{x}$ with $\bar{x}$ being a non-stationary point of $F_0$, we have that
$\|\nabla F_0(x_{k_j})\|^2 \to \|\nabla F_0(\bar{x})\|^2 > 0.$
The sequence $(d_{k_j} + \nabla F_0(x_{k_j}))_j$ converges to zero by the continuity properties of the scores, and it follows
$\langle \nabla F_0(x_{k_j}), d_{k_j} \rangle \to -\|\nabla F_0(\bar{x})\|^2 < 0.$
By definition of the convergence, it holds that for every $\delta > 0$ there exists a $j_0 \in \mathbb{N}$ with
$\langle \nabla F_0(x_{k_j}), d_{k_j} \rangle \leq -\|\nabla F_0(\bar{x})\|^2 + \delta \quad \text{for all } j \geq j_0.$
By choosing $\delta = \tfrac{1}{2}\|\nabla F_0(\bar{x})\|^2$ we obtain in particular
$\langle \nabla F_0(x_{k_j}), d_{k_j} \rangle \leq -\tfrac{1}{2}\|\nabla F_0(\bar{x})\|^2 =: -\varepsilon \quad \text{for all } j \geq j_0,$
which shows property (b) of Definition 3.1. ∎