Recreating the Seminal Algorithm of Gatys et al.
Neural Style Transfer (NST), introduced by Gatys et al. in 2015, is a fascinating technique that merges the content of one image with the style of another. The "style" refers to textures, colors, and visual patterns, while the "content" is the higher-level structure of the image. The key insight was realizing that a pre-trained Convolutional Neural Network (CNN) could be used to mathematically define and separate these two properties.
At its heart, the algorithm works by optimizing a generated image to minimize a two-part loss function: a content loss that ensures the image looks like the content source, and a style loss that ensures it has the textures and patterns of the style source.
$$ \mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{content}} + \beta \cdot \mathcal{L}_{\text{style}} $$As an image passes through a CNN, early layers capture fine-grained details (like edges), while deeper layers capture more abstract, high-level features that define the image's content. We can therefore measure the difference in content between two images by comparing their feature maps from a deep layer in the network.
Let $F^l$ be the feature map of our generated image at a deep layer $l$, and $P^l$ be the feature map of the original content image at the same layer. The content loss is simply the squared-error loss between these two feature maps.
$$ \mathcal{L}_{\text{content}} = \frac{1}{2} \sum_{i,j} \left( F_{ij}^l - P_{ij}^l \right)^2 $$Style is a more complex property, captured across multiple layers of the network. To represent style, we use the Gram matrix. The Gram matrix of a layer's activations captures the correlations between different feature maps. In essence, it measures which features tend to activate together, which corresponds to the texture and patterns in that layer's receptive field.
The style loss for a single layer is the squared-error loss between the Gram matrices of the generated image ($G^l$) and the style reference image ($A^l$).
$$ \mathcal{L}_{\text{style}}^l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G_{ij}^l - A_{ij}^l \right)^2 $$The total style loss is the weighted sum of these individual layer losses, allowing us to capture stylistic elements at different scales, from fine textures to broader brush strokes.
$$ \mathcal{L}_{\text{style}} = \sum_{l} w_l \mathcal{L}_{\text{style}}^l $$The complete source code and instructions to reproduce these results are available on GitHub.