Diffusion models have been around for a while now, and I've always wondered what a good use case for them could be. Should I just generate images of baby Yoda riding a bicycle, or maybe an astronaut riding a horse?
Well, a recent release by the Google Research team [paper] demonstrates a crazy good use case: generating looping videos with the dynamics we experience from motion caused by wind, water currents, respiration, and other natural factors.
The basic idea behind this paper is to bring natural object dynamics to still images, optionally in response to interactive user excitation. The model is trained on a large collection of motion trajectories automatically extracted from real video sequences and learns to predict a neural stochastic motion texture, i.e. a set of coefficients of a motion basis that characterizes each pixel's trajectory into the future.
An Overview
The goal is to take an input image I0 and generate a video of length T featuring oscillatory dynamics. GID (Generative Image Dynamics) uses a frequency-coordinated diffusion sampling process to predict a per-pixel long-term motion representation in the Fourier domain, called a neural stochastic motion texture. This representation can then be converted into dense motion trajectories that span the entire video.
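To make the idea of a neural stochastic motion texture more concrete, here is a minimal NumPy sketch of how a set of per-pixel Fourier coefficients could be turned into dense motion trajectories with an inverse FFT. The function name, tensor shapes, and the zero-padding of higher frequencies are illustrative assumptions on my part, not the paper's exact formulation.

```python
import numpy as np

def motion_texture_to_trajectories(spectral_coeffs, num_frames):
    """Convert a per-pixel spectral motion representation into dense trajectories.

    spectral_coeffs: complex array of (assumed) shape (K, H, W, 2) holding, for
        each of K frequency bands, one Fourier coefficient per pixel and per
        (x, y) displacement axis.
    num_frames: number of output frames T.

    Returns displacement trajectories of shape (T, H, W, 2): how far each
    pixel has moved from its position in the input image at each time step.
    """
    K, H, W, _ = spectral_coeffs.shape
    # Place the K predicted low-frequency coefficients into a full spectrum
    # of length T, leaving the higher frequencies at zero.
    full_spectrum = np.zeros((num_frames, H, W, 2), dtype=np.complex64)
    full_spectrum[:K] = spectral_coeffs
    # Inverse FFT along the frequency axis turns the spectrum into a
    # real-valued displacement for every pixel at every future time step.
    return np.fft.ifft(full_spectrum, n=num_frames, axis=0).real

# Example: 16 frequency bands for a 256x256 image, 60 output frames.
coeffs = (np.random.randn(16, 256, 256, 2)
          + 1j * np.random.randn(16, 256, 256, 2)).astype(np.complex64)
fields = motion_texture_to_trajectories(coeffs, num_frames=60)
print(fields.shape)  # (60, 256, 256, 2)
```

The key point is that predicting a handful of low-frequency coefficients per pixel is enough to describe a smooth, long-range oscillatory trajectory for that pixel across the whole clip.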
The system comprises two modules:
- Motion prediction module
- Image-based rendering module
Motion prediction module: This module consists of a Latent Diffusion Model (LDM) that predicts a neural stochastic motion texture (essentially a frequency-domain representation of per-pixel motion trajectories) for the input image I0. The predicted neural stochastic motion texture is then transformed into a sequence of motion displacement fields F using an inverse discrete Fourier transform. These fields determine the position of each input pixel at every future time step.
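As a toy illustration of how a displacement field determines pixel positions, here is a small sketch where each pixel p of I0 is simply mapped to p + F_t(p). The function name and array layout are assumptions for demonstration; the paper's actual image-based rendering module is considerably more sophisticated than this direct coordinate shift.

```python
import numpy as np

def pixel_positions_at_time(displacement_field):
    """Map each input pixel to its predicted location at one future time step.

    displacement_field: one motion displacement field F_t of (assumed) shape
    (H, W, 2), giving a per-pixel (dx, dy) offset from the pixel's location
    in the input image I0.
    """
    H, W, _ = displacement_field.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    base_grid = np.stack([xs, ys], axis=-1).astype(np.float32)  # (x, y) per pixel
    return base_grid + displacement_field  # pixel positions at time t

# Example with a random field for a 256x256 image.
field_t = np.random.randn(256, 256, 2).astype(np.float32)
positions = pixel_positions_at_time(field_t)
print(positions.shape)  # (256, 256, 2)
```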