This is about making Diffusion Models (DMs) more efficient for training and inference. The underlying idea is that raw pixel space carries a lot of detail that is unnecessary for understanding an image semantically (which is why we can compress images heavily and still get coherent results). Their goal is to find a compressed latent space that discards this unnecessary high-frequency detail, and to run the DM in that space. In practice they do this with a two-phase training procedure: the first phase trains an autoencoder, and the second trains the DM on the autoencoder's latents.
Their autoencoder uses a perceptual loss and a patch-based adversarial objective, which apparently improves local realism and reduces the blur commonly seen with standard L1 or L2 reconstruction losses. Importantly, they use CNN backbones so that the latent space keeps a 2D, image-like structure, rather than flattening the image into a 1D vector. Their encoder becomes, in effect, a trainable downsampling algorithm.
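To make the "trainable downsampling" picture concrete, here is a minimal sketch of the shape arithmetic only (no actual network). It assumes the encoder shrinks each spatial side by a factor f and outputs c latent channels; the function names and the example values f = 8 and c = 4 are illustrative choices, not settings taken from these notes.

```python
def latent_shape(height, width, f, channels):
    """Shape of a 2D latent when each spatial side shrinks by a factor f."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (height // f, width // f, channels)


def compression_ratio(height, width, f, channels, image_channels=3):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, f, channels)
    return (height * width * image_channels) / (h * w * c)


# Illustrative values only: a 256x256 RGB image, f = 8, 4 latent channels.
print(latent_shape(256, 256, 8, 4))       # (32, 32, 4)
print(compression_ratio(256, 256, 8, 4))  # 48.0
```

The latent is still a grid of height by width by channels, just a much smaller one, which is what lets the DM keep using convolutional layers on it.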
To model conditioning information, they present a number of possible methods. The one they emphasize most is a cross-attention mechanism, which takes a representation of the conditioning information (produced by a domain-specific encoder) and computes attention where the query is a projection of the DM's latent features, and both the key and the value are projections of the conditioning representation. The value projection is what carries the conditioning information into the same feature space as the latent. (I’m not sure how, or whether, the attention output is combined back with the original latent features.)
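A rough NumPy sketch of that cross-attention step may help. It is single-head, with random matrices standing in for learned projections, and all names and sizes here are hypothetical. The final residual addition is an assumption about how the output might be merged back into the latent, not something these notes confirm.

```python
import numpy as np


def cross_attention(z, cond, w_q, w_k, w_v):
    """Single-head cross-attention.

    z:    (n_latent, d_z) latent features, one row per spatial position
    cond: (n_tokens, d_c) conditioning representation
    """
    q = z @ w_q      # queries come from the latent:      (n_latent, d)
    k = cond @ w_k   # keys come from the conditioning:   (n_tokens, d)
    v = cond @ w_v   # values come from the conditioning: (n_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
h, w, d_z = 4, 4, 8          # hypothetical 4x4 latent grid with 8 features
n_tokens, d_c, d = 5, 6, 8   # hypothetical conditioning: 5 tokens of size 6

z = rng.normal(size=(h * w, d_z))   # latent grid flattened to a sequence
cond = rng.normal(size=(n_tokens, d_c))
w_q = rng.normal(size=(d_z, d))
w_k = rng.normal(size=(d_c, d))
w_v = rng.normal(size=(d_c, d))

out, weights = cross_attention(z, cond, w_q, w_k, w_v)
updated = z + out  # ASSUMPTION: residual add; works here only because d == d_z
print(out.shape, weights.shape)  # (16, 8) (16, 5)
```

Because the queries come from the latent, the output has one row per latent position, so it can be reshaped back to the 2D grid regardless of how many conditioning tokens there are.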