Generative models, and normalizing-flow-based models in particular, have made great progress in recent years, both in their theoretical development and in a growing number of applications. As such models are applied more and more widely, so grows the desire for predictive uncertainty, i.e., for knowing when to trust the underlying model. In this extended abstract we target the application area of Large Hadron Collider (LHC) simulations and show how to extend normalizing flows with probabilistic, Bayesian-neural-network-based transformations to model LHC events with uncertainties.
Marco Bellagente; Michel Luchmann; Manuel Haussmann; Tilman Plehn
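To make the idea above concrete, here is a minimal sketch (ours, not the authors' implementation) of an affine coupling layer whose scale-and-shift network is built from mean-field Bayesian linear layers, so that repeatedly sampling the weights yields an ensemble of flows and hence an uncertainty estimate on generated events; the names `BayesianLinear` and `BayesianAffineCoupling` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian weight posterior (prior/KL terms omitted)."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.empty(n_out, n_in))
        self.w_rho = nn.Parameter(torch.full((n_out, n_in), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(n_out))
        self.b_rho = nn.Parameter(torch.full((n_out,), -5.0))
        nn.init.xavier_uniform_(self.w_mu)

    def forward(self, x):
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)  # sample weights
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)

class BayesianAffineCoupling(nn.Module):
    """Affine coupling whose subnetwork samples its weights on every forward pass."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            BayesianLinear(self.d, hidden), nn.ReLU(),
            BayesianLinear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # keep scale factors bounded
        y2 = x2 * torch.exp(s) + t
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)  # output and log|det J|
```

Drawing several weight samples per event and inspecting the spread of the resulting densities or generated observables is one simple way to read off a model uncertainty.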
Score-based methods, represented as stochastic differential equations on a continuous time domain, have recently proven successful as non-adversarial generative models.
Training such models relies on denoising score matching, which can be viewed as training multi-scale denoising autoencoders.
Here, we augment the denoising score-matching framework to enable representation learning without any supervised signal.
GANs and VAEs learn representations by directly transforming latent codes to data samples.
In contrast, score-based representation learning relies on a new formulation of the denoising score-matching objective and thus encodes information needed for denoising.
We show how this difference allows for manual control of the level of detail encoded in the representation.
Mobility-on-demand (MoD) systems represent a rapidly developing mode of transportation wherein travel requests are dynamically handled by a coordinated fleet of vehicles. Crucially, the efficiency of an MoD system highly depends on how well supply and demand distributions are aligned in spatio-temporal space (i.e., to satisfy user demand, cars have to be available in the correct place and at the desired time). When modelling urban mobility as temporal sequences, current approaches typically rely on either (i) a spatial discretization (e.g. ConvLSTMs), or (ii) a Gaussian mixture model to describe the conditional output distribution.
In this paper, we argue that both of these approaches could exhibit structural limitations when faced with highly complex data distributions such as for urban mobility densities. To address this issue, we introduce recurrent flow networks which combine deterministic and stochastic recurrent hidden states with conditional normalizing flows and show how the added flexibility allows our model to generate distributions matching potentially complex urban topologies.
Order-agnostic autoregressive distribution estimation (OADE), i.e., autoregressive distribution estimation where the features can occur in an arbitrary order, is a challenging problem in generative machine learning. Prior work on OADE has encoded feature identity (e.g., pixel location) by assigning each feature to a distinct fixed position in an input vector. As a result, architectures built for these inputs must strategically mask either the input or model weights to learn the various conditional distributions necessary for inferring the full joint distribution of the dataset in an order-agnostic way. In this paper, we propose an alternative approach for encoding feature identities, where each feature's identity is included alongside its value in the input. This feature identity encoding strategy allows neural architectures designed for sequential data to be applied to the OADE task without modification. As a proof of concept, we show that a Transformer trained on this input (which we refer to as "the DEformer", i.e., the distribution-estimating Transformer) can effectively model binarized MNIST, approaching the average negative log-likelihood of fixed-order autoregressive distribution estimation algorithms while still being entirely order-agnostic.
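As a hedged sketch of the encoding described above (our own rendering, with illustrative names), each pixel's identity is paired with its value and the pairs are presented in a random order, so a standard autoregressive Transformer can be used unchanged:

```python
import torch

def make_order_agnostic_sequence(image):
    """image: (784,) binarized MNIST vector -> (784, 2) sequence of (identity, value) pairs."""
    n = image.shape[0]
    order = torch.randperm(n)                 # fresh random ordering per example
    idx = order.float() / n                   # normalized feature identity (pixel index)
    val = image[order].float()
    return torch.stack([idx, val], dim=-1)    # the model predicts val_t autoregressively
                                              # from (idx_1, val_1, ..., idx_t)
```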
In this paper, we present a new class of invertible transformations with an application to flow-based generative models. We show that many well-known invertible transformations in reversible logic and reversible neural networks can be derived from our proposition. Next, we propose two new coupling layers that are important building blocks of flow-based generative models. In experiments on digit data, we show how these new coupling layers can be used in Integer Discrete Flows (IDF), and that they achieve better results than the standard coupling layers used in IDF and RealNVP.
Normalizing flows are a class of flexible deep generative models that offer easy likelihood computation. Despite their empirical success, there is little theoretical understanding of their expressiveness. In this work, we study residual flows, a class of normalizing flows composed of Lipschitz residual blocks. We prove residual flows are universal approximators in maximum mean discrepancy. We provide upper bounds on the number of residual blocks to achieve approximation under different assumptions.
Density estimation is an important technique for characterizing distributions given observations. Much existing research on density estimation has focused on cases wherein the data lies in a Euclidean space. However, some kinds of data are not well-modeled by supposing that their underlying geometry is Euclidean. Instead, it can be useful to model such data as lying on a {\it manifold} with some known structure. For instance, some kinds of data may be known to lie on the surface of a sphere. We study the problem of estimating densities on manifolds. We propose a method, inspired by the literature on "dequantization," which we interpret through the lens of a coordinate transformation of an ambient Euclidean space and a smooth manifold of interest. Using methods from normalizing flows, we apply this method to the dequantization of smooth manifold structures in order to model densities on the sphere, tori, and the orthogonal group.
Discrete-time diffusion-based generative models and score matching methods have shown promising results in modeling high-dimensional image data. Recently, Song et al. (2021) showed that diffusion processes can be reverted by learning the score function, i.e., the gradient of the log-density of the perturbed data, and proposed to plug the learned score function into an inverse formula to define a generative diffusion process. Despite the empirical success, a theoretical underpinning of this procedure is still lacking. In this work, we approach the (continuous-time) generative diffusion directly and derive a variational framework for likelihood estimation, which includes continuous-time normalizing flows as a special case, and can be seen as an infinitely deep variational autoencoder. Under this framework, we show that minimizing the score-matching loss is equivalent to maximizing the ELBO of the plug-in reverse SDE proposed by Song et al. (2021), bridging the theoretical gap.
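For orientation, the continuous-time formulation from Song et al. (2021) that the abstract refers to can be written (in our notation) as a forward noising SDE, its score-based reverse SDE, and the denoising score-matching objective:
\[
\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w,
\qquad
\mathrm{d}x = \big[f(x,t) - g(t)^{2}\,\nabla_{x}\log p_{t}(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},
\]
\[
\mathcal{L}(\theta) = \mathbb{E}_{t}\,\mathbb{E}_{x_{0}}\,\mathbb{E}_{x_{t}\mid x_{0}}\Big[\lambda(t)\,\big\|s_{\theta}(x_{t},t) - \nabla_{x_{t}}\log p_{t}(x_{t}\mid x_{0})\big\|^{2}\Big].
\]
The result above states that, with an appropriate weighting $\lambda(t)$, minimizing this loss maximizes an ELBO for the plug-in reverse SDE.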
In this work, we propose FastDPM, a unified framework for fast sampling in diffusion probabilistic models. FastDPM generalizes previous methods and gives rise to new algorithms with improved sample quality. We systematically investigate the fast sampling methods under this framework across different domains, on different datasets, and with different amounts of conditional information provided for generation. We find that the performance of a particular method depends on the data domain (e.g., image or audio), the trade-off between sampling speed and sample quality, and the amount of conditional information. We further provide insights and recipes on the choice of methods for practitioners.
Normalizing flows leverage the Change of Variables Formula (CVF) to define flexible density models. Yet, the requirement of smooth transformations (diffeomorphisms) in the CVF poses a significant challenge in the construction of these models. To enlarge the design space of flows, we introduce $\mathcal{L}$-diffeomorphisms as generalized transformations which may violate these requirements on zero Lebesgue-measure sets. This relaxation allows, e.g., the use of non-smooth activation functions such as ReLU. Finally, we apply the obtained results to planar, radial, and contractive residual flows.
Niklas Koenen; Marvin N. Wright; Peter Maass; Jens Behrmann
Normalizing flows, which learn a distribution by transforming the data to samples from a Gaussian base distribution, have proven to be powerful density approximators. But their expressive power is limited by this choice of the base distribution. We therefore propose to generalize the base distribution to a more elaborate copula distribution in order to capture the properties of the target distribution more accurately. In a first empirical analysis, we demonstrate that this replacement can dramatically improve vanilla normalizing flows in terms of flexibility, stability, and effectiveness for heavy-tailed data. Our results suggest that the improvements are related to an increased local Lipschitz-stability of the learned flow.
Several methods from two separate lines of work, namely data augmentation (DA) and adversarial training techniques, rely on perturbations performed in latent space. Often, these methods are either non-interpretable due to their non-invertibility, or notoriously difficult to train due to their numerous hyperparameters. We exploit the exactly reversible encoder-decoder structure of normalizing flows to perform perturbations in the latent space. We demonstrate that these on-manifold perturbations match the performance of advanced DA techniques---reaching $96.6\%$ test accuracy for CIFAR-10 using ResNet-18---and outperform existing methods, particularly in low-data regimes, yielding $10$--$25\%$ relative improvement in test accuracy over classical training. We find that our latent adversarial perturbations, being adaptive to the classifier throughout its training, are most effective.
Oğuz Kaan Yüksel; Sebastian U Stich; Martin Jaggi; Tatjana Chavdarova
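A minimal sketch of the mechanism (assuming a placeholder `flow` with exact `encode`/`decode` methods and a standard `classifier`; not the authors' code): perturb in latent space, then map back through the invertible flow so the perturbed example stays on the learned data manifold.

```python
import torch

def latent_adversarial_example(flow, classifier, x, y, eps=0.1):
    z = flow.encode(x)                                   # x -> z via the invertible flow
    z_adv = z.detach().clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(classifier(flow.decode(z_adv)), y)
    loss.backward()
    z_adv = z_adv + eps * z_adv.grad.sign()              # one ascent step in latent space
    return flow.decode(z_adv).detach()                   # on-manifold perturbed sample
```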
We introduce a transferable generative model for physically realistic conformer ensembles of small molecules. The model is auto-regressive and places atoms sequentially using a learned, transferable and equivariant function implemented using a normalizing flow. The model uses physico-chemical information from previously placed atoms in a molecule to place the next atom. Our preliminary results show the approach can effectively generate physically realistic conformations for previously-unseen molecules.
Juan Viguera Diez; Sara Romeo Atance; Ola Engkvist; Rocío Mercado; Simon Olsson
Normalizing flows can generate complex target distributions and thus show promise in many applications in Bayesian statistics as an alternative or complement to MCMC for sampling posteriors.
Since no data set from the target posterior distribution is available beforehand, the flow is typically trained using the reverse Kullback-Leibler (KL) divergence that only requires samples from a base distribution. This strategy may perform poorly when the posterior is complicated and hard to sample with an untrained normalizing flow.
Here we explore a distinct training strategy, using the direct KL divergence as the loss, in which samples from the posterior are generated by (i) assisting a local MCMC algorithm on the posterior with a normalizing flow to accelerate its mixing rate and (ii) using the data generated this way to train the flow.
The method only requires a limited amount of \textit{a~priori} input about the posterior, and can be used to estimate the evidence required for model validation, as we illustrate with examples.
Marylou Gabrié; Grant M. Rotskoff; Eric Vanden-Eijnden
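One way to realize step (i), sketched here as an independence Metropolis-Hastings move with the flow as the proposal (only one possible instantiation, not necessarily the authors' exact scheme), interleaves flow-proposed global moves with a local sampler and then fits the flow by maximum likelihood (direct KL) on the collected samples; `flow` (with `log_prob` and `sample_with_log_prob` methods), `log_post`, and `local_mcmc_step` are placeholders.

```python
import torch

def flow_mh_step(flow, log_post, x):
    """Independence Metropolis-Hastings move using the flow as the proposal."""
    with torch.no_grad():
        x_prop, log_q_prop = flow.sample_with_log_prob(1)
        log_alpha = (log_post(x_prop) + flow.log_prob(x)) - (log_post(x) + log_q_prop)
    if torch.log(torch.rand(())) < log_alpha:
        return x_prop
    return x

def training_round(flow, log_post, x, optimizer, local_mcmc_step, n_steps=100):
    samples = []
    for _ in range(n_steps):
        x = local_mcmc_step(log_post, x)        # e.g. one MALA or HMC step
        x = flow_mh_step(flow, log_post, x)     # flow-assisted global move
        samples.append(x.detach())
    batch = torch.cat(samples, dim=0)
    loss = -flow.log_prob(batch).mean()         # maximum-likelihood (direct-KL) update
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return x
```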
Normalizing Flows are bijective distribution mappings that benefit from Lipschitz regularization. In this paper, we tackle the constraints imposed by the bi-Lipschitz property of Normalizing Flows, and in particular the question of which properties a dataset must satisfy in order to be correctly mapped by a given network structure. First, we expose, from a theoretical point of view, the failures that such a mapping can encounter when learning a density. We then discuss more complex latent distributions, and whether they undermine or support potential remedies.
Alexandre Verine; Yann Chevaleyre; Fabrice Rossi; Benjamin Negrevergne
Discrete flow-based models are a recently proposed class of generative models that learn invertible transformations for discrete random variables. Since they do not require data dequantization and maximize an exact likelihood objective, they can be used in a straightforward manner for lossless compression. In this paper, we introduce a new discrete flow-based model for categorical random variables: Discrete Denoising Flows (DDFs). In contrast with other discrete flow-based models, our model can be locally trained without introducing gradient bias. We show that DDFs outperform Discrete Flows on modelling a toy example, binary MNIST, and Cityscapes segmentation maps, measured in log-likelihood.
Normalizing flows allow for tractable maximum likelihood estimation of their parameters but are incapable of modelling low-dimensional manifold structure in observed data. Flows which injectively map from low- to high-dimensional space provide promise for fixing this issue, but the resulting likelihood-based objective becomes more challenging to evaluate. Current approaches avoid computing the entire objective -- which may induce pathological behaviour -- or assume the manifold structure is known beforehand and thus are not widely applicable. Instead, we propose two methods relying on tricks from automatic differentiation and numerical linear algebra to either evaluate or approximate the full likelihood objective, performing end-to-end manifold learning and density estimation. We study the trade-offs between our methods, demonstrate improved results over previous injective flows, and show promising results on out-of-distribution detection.
Anthony L. Caterini; Gabriel Loaiza-Ganem; Geoff Pleiss; John Patrick Cunningham
Score matching (SM) and its related counterpart, Stein discrepancy (SD), have achieved great success in model training and evaluation. However, recent research shows their limitations when dealing with certain types of distributions. One possible fix is to combine the original score matching (or Stein discrepancy) with a diffusion matrix, which is called diffusion score matching (DSM) (or diffusion Stein discrepancy (DSD)). However, the lack of an interpretation of the diffusion matrix limits its usage to simple distributions and manually chosen matrices. In this work, we fill this gap by interpreting the diffusion matrix using normalizing flows. Specifically, we theoretically prove that DSM (or DSD) is equivalent to the original score matching (or Stein discrepancy) evaluated in the transformed space defined by the normalizing flow, where the diffusion matrix is the inverse of the flow's Jacobian matrix. In addition, we build its connection to Riemannian manifolds, and further extend it to continuous flows, where the change of DSM is characterized by an ODE.
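To fix notation, here is a hedged rendering (ours; the paper's exact definitions may differ) of score matching in its Fisher-divergence form and its diffusion-matrix generalization with a matrix-valued function $D(x)$:
\[
\mathrm{SM}(p, q_\theta) = \tfrac{1}{2}\,\mathbb{E}_{p}\!\left[\big\|\nabla_x \log p(x) - \nabla_x \log q_\theta(x)\big\|^2\right],
\qquad
\mathrm{DSM}(p, q_\theta) = \tfrac{1}{2}\,\mathbb{E}_{p}\!\left[\big\|D(x)^{\top}\big(\nabla_x \log p(x) - \nabla_x \log q_\theta(x)\big)\big\|^2\right].
\]
Read this way, the claim above is that choosing $D(x)$ as the inverse Jacobian of a normalizing flow $T$ makes DSM coincide with plain score matching evaluated between the pushforwards of the two densities under $T$.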
Simulation-based inference (SBI) has emerged as a family of methods for performing inference on complex simulation models with intractable likelihood functions. A common bottleneck in SBI is the construction of low-dimensional summary statistics of the data. In this respect, time-series data, often being high-dimensional, multivariate, and complex in structure, present a particular challenge. To address this we introduce deep signature statistics, a principled and automated method for combining summary statistic selection for time-series data with neural SBI methods. Our approach leverages deep signature transforms, trained concurrently with a neural density estimator, to produce informative statistics for multivariate sequential data that encode important geometric properties of the underlying path. We obtain competitive results across benchmark models.
Normalizing flows are diffeomorphisms which are parameterized by neural networks. As a result, they can induce coordinate transformations in the tangent space of the data manifold. In this work, we demonstrate that such transformations can be used to generate interpretable explanations for decisions of neural networks. More specifically, we perform gradient ascent in the base space of the flow to generate counterfactuals which are classified with high confidence as a specified target class. We analyze this generation process theoretically using Riemannian differential geometry and establish a rigorous theoretical connection between gradient ascent on the data manifold and in the base space of the flow.
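A minimal sketch of the generation procedure (with placeholder `flow` and `classifier` modules; not the authors' code): ascend the target-class log-probability with respect to the base-space representation and decode back through the flow.

```python
import torch

def counterfactual(flow, classifier, x, target_class, steps=200, lr=1e-2):
    z = flow.encode(x).detach().requires_grad_(True)     # base-space representation
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_cf = flow.decode(z)
        # maximize the log-probability of the target class
        loss = -torch.log_softmax(classifier(x_cf), dim=-1)[:, target_class].mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return flow.decode(z).detach()                       # counterfactual in data space
```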
Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. Training normalizing flows can be difficult because models which produce good samples typically need to be extremely deep and can often be poorly conditioned: since they are parametrized as invertible maps from $\mathbb{R}^d \to \mathbb{R}^d$, and typical training data such as images is intuitively lower-dimensional, the learned maps often have Jacobians that are close to being singular. In our paper, we tackle representational aspects of depth and conditioning of normalizing flows: both for general invertible architectures, and for a particular common architecture, affine couplings. We prove that $\Theta(1)$ affine coupling layers suffice to exactly represent a permutation or $1 \times 1$ convolution, as used in GLOW, showing that representationally the choice of partition is not a bottleneck for depth. We also show that shallow affine coupling networks are universal approximators in Wasserstein distance if ill-conditioning is allowed, and experimentally investigate related phenomena involving padding. Finally, we show a depth lower bound for general flow architectures with few neurons per layer and bounded Lipschitz constant.
While normalizing flows for continuous data have been extensively researched, flows for discrete data have only recently been explored. These prior models, however, suffer from limitations that are distinct from those of continuous flows. Most notably, discrete flow-based models cannot be straightforwardly optimized with conventional deep learning methods because gradients of discrete functions are undefined or zero, and backpropagation can be computationally burdensome compared to alternative discrete algorithms such as decision tree algorithms. Previous works approximate pseudo-gradients of the discrete functions but do not solve the problem on a fundamental level. Our approach seeks to reduce computational burden and remove the need for pseudo-gradients by developing a discrete flow based on decision trees---building upon the success of efficient tree-based methods for classification and regression on discrete data. We first define a tree-structured permutation (TSP) that compactly encodes a permutation of discrete data whose inverse is easy to compute; thus, we can efficiently compute the density value and sample new data. We then propose a decision tree algorithm that learns TSPs by estimating the tree structure and simple permutations at each node via a novel criterion. We empirically demonstrate the feasibility of our method on multiple datasets.
We develop an iterative (greedy) deep learning algorithm which is able to transform an arbitrary probability distribution function (PDF) into the target PDF. The model is based on iterative Optimal Transport of a series of 1D slices, matching on each slice the marginal PDF to the target. As special cases of this algorithm, we introduce two Sliced Iterative Normalizing Flow (SINF) models, which map from the data to the latent space (GIS) and vice versa (SIG). We show that SIG is able to generate high-quality samples of image datasets which match the GAN benchmarks, while GIS obtains competitive results on density estimation tasks compared to NFs trained for density estimation. SINF has very few hyperparameters and is very stable during training. When trained on small training sets, it is both faster and achieves higher $p(x)$ than current alternatives.
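A hedged sketch of one such greedy iteration in its simplest empirical form (illustrative only; the actual SINF layers use learned monotone 1D maps and optimized slice directions rather than this raw quantile matching):

```python
import numpy as np

def sliced_ot_iteration(x, target, rng):
    """One greedy step: match the marginal of x to that of `target` along a random 1D slice."""
    d = x.shape[1]
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)                      # random unit direction
    px, pt = x @ theta, target @ theta                  # 1D projections of both sample sets
    u = (np.argsort(np.argsort(px)) + 0.5) / len(px)    # empirical CDF ranks of px
    new_proj = np.quantile(pt, u)                       # 1D OT map = monotone quantile matching
    return x + np.outer(new_proj - px, theta)           # move samples along the slice only

# Usage sketch: repeat with fresh random directions to transport samples from an
# arbitrary PDF toward the target PDF, e.g.
# rng = np.random.default_rng(0)
# for _ in range(500): x = sliced_ot_iteration(x, target, rng)
```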
Unsupervised dataset alignment estimates a transformation that maps two or more source domains to a shared aligned domain given only the domain datasets. This task has many applications including generative modeling, unsupervised domain adaptation, and socially aware learning. Most prior works use adversarial learning (i.e., min-max optimization), which can be challenging to optimize and evaluate. A few recent works explore non-adversarial flow-based (i.e., invertible) approaches, but they lack a unified perspective. Therefore, we propose to unify and generalize previous flow-based approaches under a single non-adversarial framework, which we prove is equivalent to minimizing an upper bound on the Jensen-Shannon Divergence (JSD). Importantly, our problem reduces to a min-min, i.e., cooperative, problem and can provide a natural evaluation metric for unsupervised dataset alignment.
Training and using modern neural-network-based latent-variable generative models (like Variational Autoencoders) often requires simultaneously training a generative direction along with an inferential (encoding) direction, which approximates the posterior distribution over the latent variables. Thus, the question arises: how complex does the inferential model need to be in order to accurately model the posterior distribution of a given generative model? In this paper, we identify an important property of the generative map impacting the required size of the encoder. We show that if the generative map is ``strongly invertible'' (in a sense we suitably formalize), the inferential model need not be much more complex. Conversely, we prove that there exist non-invertible generative maps for which the encoding direction needs to be exponentially larger (under standard assumptions in computational complexity). Importantly, we do not require the generative model to be layerwise invertible, which much of the related literature assumes and which is not satisfied by many architectures used in practice (e.g., convolution- and pooling-based networks). Thus, we provide theoretical support for the empirical wisdom that learning deep generative models is harder when data lies on a low-dimensional manifold.
Normalizing flows are generative models that provide tractable density estimation by transforming a simple distribution into a complex one. However, flows cannot directly model data supported on an unknown low-dimensional manifold. We propose Conformal Embedding Flows, which learn low-dimensional manifolds with tractable densities. We argue that composing a standard flow with a trainable conformal embedding is the most natural way to model manifold-supported data. To this end, we present a series of conformal building blocks and demonstrate experimentally that flows can model manifold-supported distributions without sacrificing tractable likelihoods.
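For context, the injective change-of-variables identity underlying such models (our notation, a standard fact rather than the paper's specific construction): if $g:\mathbb{R}^d \to \mathbb{R}^D$ is an injective embedding with Jacobian $J_g$, the density of $x = g(z)$ on the image manifold is
\[
p_X(x) = p_Z(z)\,\det\!\big(J_g(z)^{\top} J_g(z)\big)^{-1/2},
\]
and for a conformal embedding $J_g(z)^{\top} J_g(z) = \lambda(z)^2 I_d$, so the log-determinant term collapses to $-\,d \log \lambda(z)$, which is what keeps the likelihood tractable.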
In this work we describe OMEN, a neural ODE based normalizing flow for the prediction of marginal distributions at flexible evaluation horizons, and apply it to agent position forecasting.
OMEN's architecture embeds an assumption that the marginal distributions of a given agent moving forward in time are related, allowing for an efficient representation of marginal distributions through time and for reliable interpolation between the prediction horizons seen in training.
Experiments on a popular agent forecasting dataset demonstrate significant improvements over most baseline approaches and performance comparable to the state of the art, while providing the new functionality of reliably interpolating predicted marginal distributions between prediction horizons, as demonstrated on synthetic data.
Alexander Radovic; Jiawei He; Janahan Ramanan; Marcus A Brubaker; Andreas Lehrmann
Tractably modelling distributions over manifolds has long been an important goal in the natural sciences. Recent work has focused on developing general machine learning models to learn such distributions. However, for many applications these distributions must respect manifold symmetries---a trait which most previous models disregard. In this paper, we lay the theoretical foundations for learning symmetry-invariant distributions on arbitrary manifolds via equivariant manifold flows. We demonstrate the utility of our approach by using it to learn gauge invariant densities over SU(n) in the context of quantum field theory.
Isay Katsman; Aaron Lou; Derek Lim; Qingxuan Jiang; Ser-Nam Lim; Christopher De Sa
Learning new tasks continuously without forgetting on a constantly changing data distribution is essential for real-world problems but extremely challenging for modern deep learning. In this work we propose HCL, a Hybrid generative-discriminative approach to Continual Learning for classification. We model the distribution of each task and each class with a normalizing flow. The flow is used to learn the data distribution, perform classification, identify task changes, and avoid forgetting, all leveraging the invertibility and exact likelihood which are uniquely enabled by the normalizing flow model. We use the generative capabilities of the flow to avoid catastrophic forgetting through generative replay and a novel functional regularization technique. For task identification, we use state-of-the-art anomaly detection techniques based on measuring the typicality of the model's statistics. We demonstrate the strong performance of HCL on a range of continual learning benchmarks such as split-MNIST, split-CIFAR, and SVHN-MNIST.
Polina Kirichenko; Mehrdad Farajtabar; Dushyant Rao; Balaji Lakshminarayanan; Nir Levine; Ang Li; Huiyi Hu; Andrew Gordon Wilson; Razvan Pascanu
Recent work has shown that Continuous Normalizing Flows (CNFs) can serve as generative models of images with exact likelihood calculation and invertible generation/density estimation. In this work we introduce a Multi-Resolution variant of such models (MRCNF), together with a transformation between resolutions that leaves the log-likelihood unchanged. We show that this approach yields comparable likelihood values on various image datasets, with improved performance at higher resolutions, using fewer parameters and only one GPU.
Vikram Voleti; Chris Finlay; Adam M Oberman; Christopher Pal
Affine-coupling models (Dinh et al., 2014; 2016) are a particularly common type of normalizing flow, for which the Jacobian of the latent-to-observable-variable transformation is triangular, allowing the likelihood to be computed in linear time. Despite the widespread usage of affine couplings, the special structure of the architecture makes understanding their representational power challenging. The question of universal approximation was only recently resolved by three parallel papers (Huang et al., 2020; Zhang et al., 2020; Koehler et al., 2020) -- who showed reasonably regular distributions can be approximated arbitrarily well using affine couplings -- albeit with networks with a nearly-singular Jacobian. As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows? In this paper, we show that any log-concave distribution can be approximated using well-conditioned affine-coupling flows. In terms of proof techniques, we uncover and leverage deep connections between affine coupling architectures, underdamped Langevin dynamics (a stochastic differential equation often used to sample from Gibbs measures) and Hénon maps (a structured dynamical system that appears in the study of symplectic diffeomorphisms). In terms of informing practice, we approximate a padded version of the input distribution with i.i.d. Gaussians -- a strategy which Koehler et al. (2020) empirically observed to result in better-conditioned flows, but which had hitherto no theoretical grounding. Our proof can thus be seen as providing theoretical evidence for the benefits of Gaussian padding when training normalizing flows.
Holden Lee; Chirag Pabbaraju; Anish Prasad Sevekari; Andrej Risteski
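For reference, the affine coupling transformation discussed above (in the notation of Dinh et al.) splits $x=(x_A, x_B)$ and applies
\[
y_A = x_A, \qquad y_B = x_B \odot \exp\!\big(s(x_A)\big) + t(x_A), \qquad \log\big|\det J\big| = \sum_i s_i(x_A),
\]
so the Jacobian is block-triangular with diagonal entries $1$ and $\exp(s_i(x_A))$; roughly speaking, "well-conditioned" above means keeping these scale factors bounded away from $0$ and $\infty$.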
This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows.
It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution to facilitate variable token duration during inference. We further propose a robust framework for the on-line extraction of speech-text alignments -- a critical yet highly unstable learning problem in end-to-end TTS frameworks. Our experiments demonstrate that the proposed techniques yield improved alignment quality and better output diversity compared to controlled baselines.
Kevin J. Shih; Rafael Valle; Rohan Badlani; Adrian Lancucki; Wei Ping; Bryan Catanzaro
Denoising diffusion probabilistic models have shown impressive results for generation of sequences by iteratively corrupting each example and then learning to map corrupted versions back to the original. However, previous work has largely focused on in-place corruption, adding noise to each pixel or token individually while keeping their locations the same. In this work, we consider a broader class of corruption processes and denoising models over sequence data that can insert and delete elements, while still being efficient to train and sample from. We demonstrate that these models outperform standard in-place models on an arithmetic sequence task, and that when trained on the text8 dataset they can be used to fix spelling errors without any fine-tuning.
Daniel D. Johnson; Jacob Austin; Rianne van den Berg; Daniel Tarlow
Current black-box variational inference (BBVI) methods require the user to make numerous design choices---such as the selection of the variational objective and the approximating family---yet there is little principled guidance on how to do so. We develop a conceptual framework and a set of experimental tools to understand the effects of these choices, which we leverage to propose best practices for maximizing posterior approximation accuracy. Our approach is based on studying the pre-asymptotic tail behavior of the density ratios between the joint distribution and the variational approximation, and then exploiting insights and tools from the importance sampling literature. We focus on normalizing flow models and give recommendations and diagnostics on how they should be used in BBVI, though our framework is not limited to them.
Akash Kumar Dhaka; Alejandro Catalina; Manushi Welandawe; Michael Riis Andersen; Jonathan H. Huggins; Aki Vehtari
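As a rough, hedged stand-in for the kind of tail diagnostic discussed above (not the exact Pareto-smoothed importance sampling estimator; `log_joint` and `q_sample_and_log_prob` are placeholder callables), one can fit a generalized Pareto distribution to the largest density ratios and inspect its shape parameter:

```python
import numpy as np
from scipy.stats import genpareto

def tail_shape_estimate(log_joint, q_sample_and_log_prob, n=4000, tail_frac=0.2):
    """Fit a generalized Pareto to the largest density ratios; heavier tails -> larger shape."""
    z, log_q = q_sample_and_log_prob(n)                # z ~ q, with log q(z)
    log_r = log_joint(z) - log_q                       # log density ratios
    r = np.exp(log_r - log_r.max())                    # normalize for numerical stability
    tail = np.sort(r)[-int(tail_frac * n):]            # peaks over the empirical threshold
    k_hat, _, _ = genpareto.fit(tail - tail[0], floc=0.0)
    return k_hat                                       # a large shape parameter flags unreliable ratios
```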
Among likelihood-based approaches to deep generative modelling, variational autoencoders (VAEs) offer scalable amortized posterior inference and fast sampling. However, VAEs are increasingly outperformed by competing models such as normalizing flows (NFs), deep energy-based models, and the new denoising diffusion probabilistic models (DDPMs). In this preliminary work, we improve VAEs by demonstrating how DDPMs can be used to model the prior distribution of the latent variables. The diffusion prior improves upon the Gaussian priors of classical VAEs and is competitive with NF-based priors. Finally, we hypothesize that hierarchical VAEs could similarly benefit from the enhanced capacity of diffusion priors.
Normalizing flows are a powerful class of generative models demonstrating strong performance in several speech and vision problems. In contrast to other generative models, normalizing flows have tractable likelihoods and allow for stable training. However, they have to be carefully designed to be bijective functions with efficient Jacobian calculation. In practice, these requirements lead to overparameterized and sophisticated architectures that are inferior to alternative feed-forward models in terms of inference time and memory consumption. In this work, we investigate whether one can distill knowledge from flow-based models to more efficient alternatives. We provide a positive answer to this question by proposing a simple distillation approach and demonstrating its effectiveness on state-of-the-art conditional flow-based models for image super-resolution and speech synthesis.
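A hedged sketch of one simple way such a distillation step could look (placeholder `teacher_flow` and `student` modules; the paper's actual losses and architectures may differ): the invertible teacher maps latents and conditioning to outputs, and a cheap feed-forward student is regressed onto them.

```python
import torch

def distillation_step(teacher_flow, student, cond_batch, optimizer):
    with torch.no_grad():
        z = torch.randn(cond_batch.shape[0], teacher_flow.latent_dim)
        x_teacher = teacher_flow.decode(z, cond_batch)     # e.g. a super-resolved image or waveform
    x_student = student(z, cond_batch)                     # single feed-forward pass
    loss = torch.nn.functional.l1_loss(x_student, x_teacher)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```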
The {\em skew-geometric Jensen-Shannon divergence} $\left(\textrm{JS}^{\textrm{G}_{\alpha}}\right)$ allows for an intuitive interpolation between forward and reverse Kullback-Leibler (KL) divergence based on the skew parameter $\alpha$. While the benefits of the skew in $\textrm{JS}^{\textrm{G}_{\alpha}}$ are clear---balancing forward/reverse KL in a comprehensible manner---the choice of optimal skew remains opaque and requires an expensive grid search. In this paper we introduce $\alpha$-VAEs, which extend the $\textrm{JS}^{\textrm{G}_{\alpha}}$ variational autoencoder by allowing for learnable, and therefore data-dependent, skew. We motivate the use of a parameterised skew in the dual divergence by analysing trends dependent on data complexity in synthetic examples. We also prove and discuss the dependency of the divergence minimum on the input data and encoder parameters, before empirically demonstrating that this dependency does not reduce to either direction of KL divergence for benchmark datasets. Finally, we demonstrate that optimised skew values consistently converge across a range of initial values and provide improved denoising and reconstruction properties. These render $\alpha$-VAEs an efficient and practical modelling choice across a range of tasks, datasets, and domains.
Jacob Deasy; Tom Andrew McIver; Nikola Simidjievski; Pietro Lio