crush depth

Consistent Environments In Stable Diffusion

I've been working with Stable Diffusion lately.

One acknowledged "problem" with these kinds of text-to-image systems is that it's difficult to generate consistent subjects and/or background environments between images. Consider the case of a comic book artist using the system to create background images for each panel: If the characters are all standing in a room, then it's probably desirable for the room to look the same in each panel.

I've been working on a mostly-automated system for generating consistent scenes by combining an image-to-image workflow, 3D rendered images produced from Blender, and some scripting and techniques lifted from deferred rendering to produce a camera-dependent text prompt automatically.

Thinking Machines

Image generation systems like Stable Diffusion aren't intelligent. The system doesn't really understand what it's generating, and there's no way to tell it "give me the same room again but from a different camera angle". It has no concept of what a room is, or what a camera is.

It's somewhat difficult to describe processes in Stable Diffusion, because it's very tempting to talk about the system in terms of intelligence. We can talk about Stable Diffusion "improvising" objects and so on, but there's really nothing like human improvisation and cognition occurring. During generation, the system is taking manually-guided paths through a huge set of random numbers according to a huge, cleverly-constructed database of weight values. Nevertheless, for the sake of making this article readable, I'll talk about Stable Diffusion as if it's an intelligent system making choices, and how we can guide that system into making the choices we want.

In order to produce a consistent scene, it appears to be necessary to start with consistent source images, and to provide an accurate text prompt.

Requirements: Text Prompt

Stable Diffusion works with a text prompt. A text prompt describes the objects that should be generated in a scene. Without going into a ton of detail, the Stable Diffusion models were trained on a dataset called LAION-5B. The dataset consists of around 5.8 billion images, each of which was annotated with text describing the image. When you provide Stable Diffusion with a text prompt consisting of "A photograph of a ginger cat", it will sample from images in the original dataset that were annotated with photograph, ginger cat, and so on.

The generation process is incremental, and there appears to be a certain degree of feedback in the generation process; the system appears to inspect its own output during generation and will effectively try to "improvise" objects that it believes it might be generating if nothing in the text prompt appears to describe the random pile of pixels it is in the process of refining. More concretely, if the generation process starts producing something that looks orange and furry, and nothing in the prompt says ginger cat, then the system might produce an image of a dog with ginger hair. It might just as easily produce an image of a wicker basket!

It's therefore necessary, when trying to generate a consistent scene, to describe every important object visible in the scene in the text prompt. This requirement can lead to problems when trying to generate a scene from multiple different camera viewpoints in a room:

  • If an object O is visible in camera viewpoint A but not visible in camera viewpoint B, and the artist forgets to remove the description of O from the text prompt when generating an image from B, Stable Diffusion might replace one of the other objects in the scene with an improvised version of O, because it believes that the description of O matches something else that is visible in B.

  • If an object O is visible in both camera viewpoint A and camera viewpoint B, and the artist forgets to include a description of O in the prompt when generating an image from B, then O might appear completely different in A and B because, lacking a description of O, Stable Diffusion will essentially just improvise O.

Additionally, text that appears earlier in the prompt is effectively treated as more significant when generating the image. It's therefore recommended that objects that are supposed to be more prominently displayed in the foreground be described first, and less significant objects in the background be described last.

We would therefore like a system that returns, for each object visible in the current camera viewpoint, a one-sentence textual description of that object. If an object isn't visible, then no description is returned. The descriptions must be returned in ascending order of distance from the camera; the nearest objects have their descriptions returned first.

Requirements: Source Images

Stable Diffusion is capable of re-contextualizing existing images. The diffusion process, in a pure text-to-image workflow, essentially takes an image consisting entirely of gaussian noise and progressively denoises the image until the result (hopefully) looks like an image described by the text prompt.

Denoising

In an image-to-image workflow, the process is exactly the same except for the fact that the starting point isn't an image that is 100% noise. The input image is typically an existing drawing or photograph, and the workflow is instructed to introduce, say, 50% noise into the image, and then start the denoising process from that noisy image. This allows Stable Diffusion to essentially redraw an existing image, replacing forms and subjects by an artist-configurable amount.

For example, we can take the following rather badly-generated image:

Starting Coffee

We can feed this image into Stable Diffusion, and specify a text prompt that reads two cups of coffee next to a coffee pot. We can specify that we want the process to introduce 100% noise into the image before starting the denoising process. We get the following result:

100% Coffee

While the image is certainly coffee-themed, it bears almost no relation to the original image. This is because we specified 100% noise as the starting point; it's essentially as if we didn't bother to supply a starting image at all. If we instead specify that we want 75% noise introduced, we get the following image:

75% Coffee

This is starting to get closer to the original image. The composition is a little closer, and the colours are starting to match. Once we specify that we want 50% noise introduced, we get the following image:

50% Coffee

The image is very close to the original image, but the details are somewhat more refined. If we continue by going down to, say, 25% denoising, we get:

25% Coffee

The result is, at first glance, almost indistinguishable from the original image. Now, this would be fine if we were trying to achieve photorealism and the source image we started with was already photorealistic. However, if what we have as a source image isn't photorealistic, then going below about 50% denoising constrains the generation process far too much; the resulting image will be just as unrealistic as the source image.

There is a tension here: the noise amount must be low enough that the system doesn't generate an image completely different from our source image, but high enough to give the system room to improvise and (hopefully) produce a photorealistic image from a non-photorealistic input.

To produce a consistent set of source images to render a room from multiple viewpoints, we'll need to build a 3D model in Blender that contains basic colours, forms, and some lighting, but that makes no real attempt to be photorealistic. We'll depend on Stable Diffusion to take those images and produce a consistent but photorealistic output. We assume that we can't start out with photorealistic images; if we already had photographs taken in a room from all the camera angles we wanted, why would we be using Stable Diffusion in the first place?

In practice, I've found that noise values in the range 50-55% will work. We'll come back to this later on.

Automating Prompts

As mentioned earlier, we want to be able to have Blender produce the bulk of our text prompt for us in an automated manner. We need to do the following:

  1. Annotate individual objects in a scene with one-sentence descriptions. For example, "A red sofa", "A potted plant with pink flowers", etc.
  2. When rendering an image of the scene, produce a list S of the objects that are actually visible from the current viewpoint.
  3. Sort the list S such that the objects are ordered from nearest to furthest from the camera.
  4. For each object in S, extract the assigned text description and concatenate the descriptions into a comma-separated list for direct use in Stable Diffusion.

An example objectClass.blend scene is provided that implements all of the above.

Annotating Objects

For each of the objects in the scene, we have to create a custom string-typed Description property on the object that contains our text prompt for that object:

Object 0

Object 1

Object 2

Object 3

Note that we also specify a numeric, non-zero pass index for each object. Every object that has the same text prompt should have the same pass index, for reasons that will become clear shortly.
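
If you prefer to script the setup rather than click through the UI, the same annotations can be assigned from Blender's Python console. A minimal sketch (the object name and description here are just examples):

import bpy

_o = bpy.data.objects["Sofa"]

# A custom string-typed property holding the text prompt for this object.
_o["Description"] = "A red sofa"

# The pass index; every object sharing the same description shares this value.
_o.pass_index = 1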

Object Visibility

Blender, unfortunately, doesn't provide an API for determining object visibility. My original line of thought was to take the frustum of the currently active camera in the scene, and calculate the set of objects that intersect that frustum. Unfortunately, this is more work than I'm willing to attempt in a clumsy language like Python, and it also wouldn't take into account whether objects in the frustum are occluded by other objects. If an object is completely hidden behind another object in the foreground, then it shouldn't appear in the text prompt.

Instead, I turned to a technique used in some forms of deferred rendering. Essentially, we render an image of the scene where each pixel in the image contains exactly one integer value denoting the numeric index of the object that produced the surface at that pixel. More specifically, the pass index that we specified for each object earlier is written directly to the output image.

For example, whilst the standard colour render output of the example scene looks like this:

Render

The "object index pass" output looks like this (text added for clarity):

Index

The pixels that make up the cone all have integer value 4. The pixels that make up the sphere all have integer value 2, and so on. The pixels that make up the background have integer value 0 (there's a tiny bit of the background visible in the top-left corner of the image). We'll effectively ignore all pixels with value 0.

The same scene rendered from a different viewpoint produces the following images:

Render 2

Index 2

Note that, from this viewpoint, the cone is almost entirely occluded by the cube.

Now, unfortunately, due to a limitation in Blender, the EEVEE realtime renderer doesn't have an option to output object IDs directly as a render pass. I'm impatient and don't want to spend the time waiting for Cycles to render my scenes, so instead we can achieve the same result by using an AOV Output in each scene material.

First, create a new Value-typed AOV Output in the scene's View Layer properties:

AOV 0

Now, for each material used in the scene, declare an AOV output node and attach the object's (confusingly named) Object Index to it:

AOV 1

Finally, in the Compositor settings, attach the AOV output to a Viewer Node:

AOV 2

The reason for this last step is that we're about to write a script to examine the AOV image, and Blender has an infuriating internal limitation where it won't give us direct access to the pixels of a rendered image unless that rendered image came from a Viewer Node.

We now have images that can effectively tell us which objects are visible in the corresponding coloured render images. Note that we have the same limitation here as in traditional deferred rendering: We can't handle translucent or transparent objects. This hasn't, so far, turned out to be much of a problem in practice.

Producing The Prompt

We now need to write a small Python script that can take the rendered images and produce a prompt. The complete script is included as a text object in the included Blender scene, but I'll walk through the code here.

The first step is to get a reference to the currently active camera in the scene, and set up the various output files we'll be using:

import bpy
import math
import numpy

_output_directory = "/shared-tmp"

class ClassifiedObject:
    index: int
    distance: float
    description: str
    object: bpy.types.Object


_objects = []
_camera = bpy.context.scene.camera

print('# Camera: {0}'.format(_camera.name))

_output_props = '{0}/{1}.properties'.format(_output_directory, _camera.name)
_output_prompt = '{0}/{1}.prompt.txt'.format(_output_directory, _camera.name)
_output_samples = '{0}/{1}.npy'.format(_output_directory, _camera.name)

print('# Properties: {0}'.format(_output_props))
print('# Prompt:     {0}'.format(_output_prompt))
print('# Samples:    {0}'.format(_output_samples))

We'll hold references to visible objects in values of type ClassifiedObject.

We then iterate over all objects in the scene and calculate the Euclidean distance between each object's origin and the camera.

def _distance_to_camera(_camera, _object):
    _object_pos = _object.location
    _camera_pos = _camera.location
    _delta_x = _camera_pos.x - _object_pos.x
    _delta_y = _camera_pos.y - _object_pos.y
    _delta_z = _camera_pos.z - _object_pos.z
    _delta_x_sq = _delta_x * _delta_x
    _delta_y_sq = _delta_y * _delta_y
    _delta_z_sq = _delta_z * _delta_z
    _sum = _delta_x_sq + _delta_y_sq + _delta_z_sq
    return math.sqrt(_sum)
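
# Aside (a sketch, not part of the original script): mathutils vectors support
# subtraction and a length property, so the function above could equivalently
# be written as:
#
#     return (_camera.location - _object.location).length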

for _o in bpy.data.objects:
    if _o.pass_index:
        _r = ClassifiedObject()
        _r.distance = _distance_to_camera(_camera, _o)
        _r.object = _o
        _r.index = _o.pass_index
        _r.description = _o['Description']
        _objects.append(_r)

We then sort the objects we saved in ascending order of distance, so that objects nearest to the camera appear earliest in the list, and we save the objects to an external properties file in case we want to process them later:

_objects.sort(key=lambda o: o.distance)

with open(_output_props, "w") as f:
    for _o in _objects:
        f.write('objects.{0}.index = {0}\n'.format(_o.index))
        f.write('objects.{0}.description = {1}\n'.format(_o.index, _o.description))
        f.write('objects.{0}.name = {1}\n'.format(_o.index, _o.object.name))
        f.write('objects.{0}.distance = {1}\n'.format(_o.index, _o.distance))

We then get a reference to the image containing our integer object indices, and save the pixels to a numpy array. Plain Python is too slow to process images this large; using numpy avoids having to wait twenty or so minutes each time we want to process an image, and completes the operation in a couple of seconds.

The raw pixel array returned from the viewer node is actually an RGBA image where each pixel consists of four 32-bit floating point values. Our AOV output is automatically expanded such that the same value is placed into each of the four colour channels, so we can avoid sampling all four channels and just use the values in the red channel. The [::4] slice takes every fourth element of the array, starting at element 0; that is, the red component of each pixel.

_image = bpy.data.images['Viewer Node']
_data = numpy.array(_image.pixels)[::4]
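
Equivalently, the flat pixel array can be reshaped into rows of four components and the red channel selected by index; a sketch that produces the same array as the slice above:

_data = numpy.array(_image.pixels).reshape(-1, 4)[:, 0]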

We now need to iterate over all the pixels in the image and count the number of times each object index appears. Astute readers might be wondering why we don't just maintain an int -> boolean map where we set the value of a key to True whenever we encounter that object index in the image. There are two reasons we can't do this.

Firstly, if an object is almost entirely occluded by another object, but a few stray pixels of the object are still visible, we don't want to be mentioning that object in the prompt. Doing so would introduce the problem discussed earlier where an object is erroneously injected into the scene by Stable Diffusion because the prompt said one should be in the scene somewhere.

Secondly, there will be pixels in the image that do not correspond to any object index. Why does this happen? Inspect the edges of objects in the AOV image from earlier:

Sampling

Many pixels in the image will be some linear blend of the neighbouring pixels due to antialiasing. This will occur even if the Viewport and Render samples are set to 1 (although it will be reduced). This is not a problem in practice; it just means that we need to round pixel values to integers, and consider a threshold value below which objects are assumed not to be visible. We can accomplish this as follows:

_visibility = {}
_visibility_max = _image.size[0] * _image.size[1]
_visibility_threshold = 0.005

for _x in _data:
    _xi = int(_x)
    if _xi:
        _c = _visibility.get(_xi) or 0
        _visibility[_xi] = _c + 1

print("# Processed.")
print('# Visibility: {0}'.format(_visibility))

_visible = set()

for _i in _visibility:
    _c = _visibility[_i]
    _f = float(_c) / _visibility_max
    if _f > _visibility_threshold:
        _visible.add(_i)

print('# Visible: {0}'.format(_visible))

We calculate _visibility_max as the total number of pixels in the image. We declare _visibility_threshold to be the threshold value at which an object is considered to be visible. Essentially, if the pixels of an object cover more than 0.5% of the screen, then the object is considered to be visible. Each time we encounter a pixel that has value xi, we increment the count for object xi in the _visibility map.

Finally, we iterate over the _visibility map and extract the indices of all objects that were considered to be visible, given the declared threshold.
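
Incidentally, the per-pixel counting loop above can be collapsed into a single numpy call, which is considerably faster on large renders. A sketch that produces an equivalent _visibility map:

# Count the pixels covered by each object index in one pass; index 0 (the
# background) is discarded, matching the explicit loop above.
_indices, _counts = numpy.unique(_data.astype(int), return_counts=True)
_visibility = {int(i): int(c) for i, c in zip(_indices, _counts) if i != 0}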

The last step simply generates the prompt. We already have the list of objects in distance order, and we know which of those objects are actually visible, so we can simply iterate over the list and write the object descriptions to a file:

with open(_output_prompt, 'w') as f:
    for _o in _objects:
        if _o.index in _visible:
            f.write('{0},\n'.format(_o.description))

This prompt file can be consumed directly by Stable Diffusion in ComfyUI using a text file node.
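
For readers who would rather drive the image-to-image step programmatically, here is a minimal sketch using the Hugging Face diffusers library. This is an assumption on my part rather than the workflow used in this article (which is built in ComfyUI); the model id and file paths are placeholders:

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The Blender colour render and the generated prompt for the same camera.
_source = Image.open("/shared-tmp/Camera.002.png").convert("RGB")
_prompt = open("/shared-tmp/Camera.002.prompt.txt").read()

# "strength" is the fraction of noise introduced into the source image;
# values around 0.50-0.55 keep the composition while allowing the system
# to re-render the scene.
_result = _pipe(prompt=_prompt, image=_source, strength=0.52).images[0]
_result.save("/shared-tmp/Camera.002.generated.png")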

To run the script (assuming the script is in a text block in Blender's text editor), we simply render the scene and then, with the text editor selected, press ALT-P to run the script.

This image:

Render 2

... Produces this output on Blender's console:

# Camera: Camera.002
# Properties: /shared-tmp/Camera.002.properties
# Prompt:     /shared-tmp/Camera.002.prompt.txt
# Samples:    /shared-tmp/Camera.002.npy
# Processing 2073600 elements...
# Processed.
# Visibility: {1: 453103, 2: 481314, 3: 616150, 4: 85}
# Visible: {1, 2, 3}
# Saved.

Note that it correctly determined that objects 1, 2, and 3 are visible (the floor, the sphere, and the cube, respectively). The cone was not considered to be visible. The resulting prompt was:

A sphere,
A cube,
A wooden floor,

From the other viewpoint:

Render

... The following output and prompt are produced:

# Camera: Camera.000
# Properties: /shared-tmp/Camera.000.properties
# Prompt:     /shared-tmp/Camera.000.prompt.txt
# Samples:    /shared-tmp/Camera.000.npy
# Processing 2073600 elements...
# Processed.
# Visibility: {1: 1144784, 2: 287860, 3: 408275, 4: 232240}
# Visible: {1, 2, 3, 4}
# Saved.
A sphere,
A cone,
A cube,
A wooden floor,

Note how the cube and floor are mentioned last, as they are furthest away.

In Practice

Using the above techniques, I built a small house in Blender, populated it with objects, and took the following renders:

House Plan

House 0

House 1

House 2

No attempt was made at photorealism. I took care to only provide objects with minimal forms and textures (often just flat colours). I ran the images through an image-to-image workflow at approximately 52% denoising, and received the following images:

House 0

House 1

House 2

While by no means perfect, the room is at least recognizable as being the same room from different angles. With some more precise prompting and some minor inpainting, I believe the room could be made completely consistent between shots.

We must negate the machines-that-think. Humans must set their own guidelines. This is not something machines can do. Reasoning depends upon programming, not on hardware, and we are the ultimate program! Our Jihad is a "dump program." We dump the things which destroy us as humans!

— Frank Herbert, Children of Dune

New PGP Keys

It's that time of year again.

Fingerprint                                       | Comment
---------------------------------------------------------------------------
E362 BB4F 16A9 981D E781 2F6E 10E4 AAD0 B00D 6CDD | 2024 personal
37A9 97D5 970E 145A B9DB 1409 A203 E72A D3BB E1CE | 2024 maven-rsa-key

Keys are published to the keyservers as usual.

Batch Files Are Not Your Only Option

A while ago I got into a fight with jpackage. Long story short, I concluded that it wasn't possible to use jpackage to produce an application that runs in "module path" mode but that also has one or more automatic modules.

It turns out that it is possible, but it's not obvious from the documentation at all and requires some extra steps. The example project demonstrates this.

Assume I've got an application containing modules com.io7m.demo.m1, com.io7m.demo.m2, and com.io7m.demo.m3. The com.io7m.demo.m3 module is the module that contains the main class (at com.io7m.demo.m3/com.io7m.demo.m3.M3). In this example, assume that com.io7m.demo.m2 is actually an automatic module and therefore would cause jlink to fail if it tried to process it.

I first grab a JDK from Foojay and unpack it:

$ wget -O jdk.tar.gz -c 'https://api.foojay.io/disco/v3.0/ids/9604be3e0c32fe96e73a67a132a64890/redirect'
$ mkdir -p jdk
$ tar -x -v --strip-components=1 -f jdk.tar.gz --directory jdk

Then I grab a JRE from Foojay and unpack it:

$ wget -O jre.tar.gz -c 'https://api.foojay.io/disco/v3.0/ids/3981936b6f6b297afee4f3950c85c559/redirect'
$ mkdir -p jre
$ tar -x -v --strip-components=1 -f jre.tar.gz --directory jre

I could reuse the JDK from the first step, but the JRE is smaller, and that makes the application distribution smaller, because we won't be using jlink to strip out unused modules.

I then build the application, and this produces a set of platform-independent modular jar files:

$ mvn clean package

I copy the jars into a jars directory:

$ cp ./m1/target/m1-20231111.jar jars
$ cp ./m2/target/m2-20231111.jar jars
$ cp ./m3/target/m3-20231111.jar jars

Then I call jpackage:

$ jpackage \
  --runtime-image jre \
  -t app-image \
  --module com.io7m.demo.m3 \
  --module-path jars \
  --name jpackagetest

The key argument that makes this work is the --runtime-image option. It effectively means "don't try to produce a reduced jlink runtime".

This produces an application that works correctly:

$ file jpackagetest/bin/jpackagetest
jpackagetest/bin/jpackagetest: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, not stripped

$ ./jpackagetest/bin/jpackagetest 
M1: Module module com.io7m.demo.m1
M2: Module module m2
JRT: java.base
JRT: java.compiler
JRT: java.datatransfer
JRT: java.desktop
...

We can see from the first two lines of output that both com.io7m.demo.m1 and (the badly-named) m2 are on the module path and have not been placed on the class path. This means that any services declared in the module descriptors will actually work properly.

We can take a look at the internal configuration:

$ cat jpackagetest/lib/app/jpackagetest.cfg 
[Application]
app.mainmodule=com.io7m.demo.m3/com.io7m.demo.m3.M3

[JavaOptions]
java-options=-Djpackage.app-version=20231111
java-options=--module-path
java-options=$APPDIR/mods

We can see that the internal configuration uses an (undocumented) $APPDIR variable that expands to the full path to a mods directory inside the application distribution. The mods directory contains the unmodified application jars:

$ ls jpackagetest/lib/app/mods/
m1-20231111.jar  m2-20231111.jar  m3-20231111.jar

$ sha256sum jars/*
f8de3acf245428576dcf2ea47f5eb46cf64bb1a5daf43281e9fc39179cb3154f  jars/m1-20231111.jar
6ad0f7357cf03dcc654a3f9b8fa8ce658826fc996436dc848165f6f92973bb90  jars/m2-20231111.jar
b5c4d7d858dad6f819d224dd056b9b54009896a02b0cd5c357cf463de0d9fdd2  jars/m3-20231111.jar

$ sha256sum jpackagetest/lib/app/mods/*
f8de3acf245428576dcf2ea47f5eb46cf64bb1a5daf43281e9fc39179cb3154f  jpackagetest/lib/app/mods/m1-20231111.jar
6ad0f7357cf03dcc654a3f9b8fa8ce658826fc996436dc848165f6f92973bb90  jpackagetest/lib/app/mods/m2-20231111.jar
b5c4d7d858dad6f819d224dd056b9b54009896a02b0cd5c357cf463de0d9fdd2  jpackagetest/lib/app/mods/m3-20231111.jar

Now to try to get this working on Windows with the elderly wix tools...

Vulkan Memory Allocation (Part 1 of N)

I've recently been looking into the allocation of device memory in Vulkan. It turns out there's a lot of complexity around balancing the various constraints that the API imposes.

Device Memory

Vulkan has the concepts of device memory and host memory. Host memory is the common memory that developers are accustomed to: It's the memory that is accessed and controlled by the host CPU on the current machine, and the memory that is returned by malloc() and friends. We won't talk much about host memory here, because there's nothing new or interesting to say about it, and it works the same way in Vulkan as in regular programming environments.

Device memory, on the other hand, is memory that is directly accessible to whatever the GPU is on the current system. Device memory is exposed to the Vulkan programmer via a number of independent heaps. For example, on a system (at the time of writing) with an AMD Radeon RX 5700 GPU, device memory is exposed as three different heaps:

  • An 8gb heap of "device local" memory. This is 8gb of GDDR6 soldered directly to the card.
  • A portion of the host system's RAM exposed to the GPU. On my workstation with 32gb of RAM, the GPU is provided with a 16gb heap of system memory.
  • A small 256mb heap that is directly accessible to both the GPU and the host system (more on this later).

The different heaps have different performance characteristics and different capabilities. For example, some heaps can be directly memory-mapped for reading and writing by the host computer using the vkMapMemory function (similar to the POSIX mmap function). Some heaps are solely and directly accessible to the GPU and therefore are the fastest memory for the GPU to read and write. In order for the host CPU to read and write this memory, explicit transfer commands must be executed using the Vulkan API. Reads and writes to memory that is not directly connected to the GPU must typically go over the PCI bus, and are therefore slower than reads and writes to directly GPU-connected memory in relative terms.

Naturally, different types of GPU expose different sets of heaps. On systems with GPUs integrated into the CPU, there might only be a single heap. For example, on a fairly elderly laptop with an Intel HD 620 embedded GPU, there is simply one 12gb heap that is directly accessible by both the GPU and host CPU.

Vulkan also introduces the concept of memory types. A memory type is a rather vague concept, but it can be considered as a kind of access method for memory in a given heap. For example, a memory type for a given heap might advertise that it can be memory-mapped directly from the host CPU. A different memory type might advertise that there's CPU-side caching of the memory. Another memory type might advertise that it is incoherent, and therefore requires explicit flush operations in order to make any writes to the memory visible to the GPU.

Implementations might require that certain types of GPU resources be allocated in heaps using specific memory types. For example, some NVIDIA GPUs strictly separate memory allocations made to hold color images, allocations made to hold arrays of structured data, and so on. These memory type requirements can exist even when allocations of different types are being made to the same heap.
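
In Vulkan, the memory requirements reported for a resource include a bitmask of acceptable memory type indices, and the application picks an acceptable type that also has the property flags it wants (for example, "host visible" for mapping, or "device local" for speed). A sketch of that selection logic in Python (not real Vulkan bindings; the flag values below mirror the Vulkan constants but are included purely for illustration):

# Pick a memory type index that is both acceptable to the resource (according
# to the memory type bitmask) and has the properties the application wants.
def choose_memory_type(memory_type_bits, memory_type_flags, wanted_flags):
    for index, available in enumerate(memory_type_flags):
        acceptable = (memory_type_bits & (1 << index)) != 0
        has_wanted = (available & wanted_flags) == wanted_flags
        if acceptable and has_wanted:
            return index
    raise RuntimeError("No suitable memory type")

DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2
HOST_COHERENT = 0x4

# A hypothetical device advertising three memory types.
_types = [
    DEVICE_LOCAL,
    HOST_VISIBLE | HOST_COHERENT,
    DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT,
]

print(choose_memory_type(0b111, _types, DEVICE_LOCAL))                  # 0
print(choose_memory_type(0b110, _types, HOST_VISIBLE | HOST_COHERENT))  # 1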

As mentioned earlier, differing performance characteristics between heaps mean that developers will want to place different kinds of resources in different heaps in order to take advantage of the properties of each heap. For example, if the programmer knows that a given texture is immutable, and that it will be loaded from disk once and then sampled repeatedly by the GPU when rendering, then it makes sense for this texture to be placed into the fastest, directly GPU-connected memory.

On the other hand, consider what happens if a developer knows that the contents of a texture are mutable and will be updated on every frame: In one manner or another, the contents of that texture are almost certainly going to traverse the PCI bus each time it is updated (assuming a discrete GPU with a separate device-local heap). Therefore, it makes sense for that texture to be allocated in a heap that is directly CPU-accessible and have the GPU read from that memory as needed. Directly GPU-connected memory tends to be a more precious and less abundant resource, so there's little to be gained by wasting it on a texture that will need to be transferred anew to the GPU on every frame anyway!

The small 256mb heap mentioned at the start of this article is explicitly intended for those kinds of transfers: The CPU can quickly write data into that heap and then instruct Vulkan to perform a transfer from that heap into the main device-local heap. This is essentially a heap for staging buffers.

Allocation

When allocating a resource, the developer must ask the Vulkan API what the memory requirements will be for the given resource. Essentially, the conversation goes like this:

Developer: I want to create a 256x256 RGBA8 texture, and I want the texture to be laid out in the most optimal form for fast access by the GPU. What are the memory requirements for this data?

Vulkan: You will need to allocate 262144 bytes of memory for this texture, using memory type T, and the allocated memory must be aligned to a 65536 byte boundary.

The blissfully unaware developer then calls vkAllocateMemory, passing it a memory size of 262144 and a memory type T. The specification for vkAllocateMemory actually guarantees that whatever memory it returns will always obey the alignment restrictions for any kind of resource possible, so the developer doesn't need to worry about the 65536 byte alignment restriction above.

This all works fine for a while, but after having allocated a hundred or so textures like this, suddenly vkAllocateMemory returns VK_ERROR_OUT_OF_DEVICE_MEMORY. The developer immediately starts doubting their own sanity; after all, their GPU has an 8gb heap, and they've only allocated about ~26mb of textures. What's gone wrong?

Well, Vulkan imposes a limit on the number of active allocations that can exist at any given time. This is advertised to the developer in the maxMemoryAllocationCount field of the VkPhysicalDeviceLimits structure returned by the vkGetPhysicalDeviceProperties function. The Vulkan specification guarantees that this limit will be at least 4096, although it does give a laundry list of reasons why the limit in practice might be lower. In fact, in some situations, the limit can be much lower than this. To quote the Vulkan specification for vkAllocateMemory:

As a guideline, the Vulkan conformance test suite requires that at least 80 minimum-size allocations can exist concurrently when no other uses of protected memory are active in the system.

In practical terms, this means that Vulkan developers are required to ask for a small number of large chunks of memory, and then manually sub-allocate that memory for use with resources. This is where the real complexity begins.

Sub-Allocation Constraints

There are a practically unlimited number of possible ways to manage memory, and there are entire books on the subject. Vulkan developers wishing to sub-allocate memory must come up with algorithms that balance at least the following (often contradictory) requirements:

  1. A fixed-size heap must be divided up into separate allocations with at least one allocation for each memory type that will be used by the application. It's not possible to know which memory type a Vulkan implementation will need for each kind of resource in the application, and it's also not possible to know (without relying on hardware-specific information ahead of time) what kind of heaps will be available, so the heap divisions cannot be decided statically.
  2. A fixed-size heap must be divided up into as small a number of separate allocations as possible, in order to stay below the maxMemoryAllocationCount limit for vkAllocateMemory.
  3. A fixed-size heap must be divided up into separate allocations where each allocation is not larger than an implementation-defined limit. The developer must examine the maxMemoryAllocationSize of the VkPhysicalDeviceMaintenance3Properties structure returned by the vkGetPhysicalDeviceProperties2 function.
  4. Developers must obey alignment restrictions themselves. The vkAllocateMemory function is guaranteed to return memory that is suitably aligned for any possible resource, but developers sub-allocating from one of these allocations must ensure that they place resources at correctly-aligned offsets relative to the start of that allocation.
  5. Fast GPU memory can be in relatively short supply; it's important that as little of it is wasted as possible. Allocation schemes that result in a high degree of memory fragmentation are unsuitable, as this will result in a lot of precious GPU memory becoming unusable.
  6. There are often soft-realtime constraints. If a rendering operation of some kind needs memory right now, and there's a deadline of 16ms of rendering time in order to meet a 60hz frame rate, an allocation algorithm that takes a second or two to find free memory is unusable.
  7. Sub-allocations must consist of contiguous memory. There is no way to have, for example, a single texture use memory from multiple disjoint regions of memory. It follows that each allocation must be at least large enough to hold resources of the expected size. For example, on most platforms, allocating a 4096x1024 RGBA8 texture will require roughly 16mb of storage. If we divide the heap up into allocations no larger than 8mb, we will never be able to store a texture of this size.

Edit: Textures and buffers can use non-contiguous memory via sparse resources. Support for this cannot be relied upon.

It is fairly difficult to come up with memory allocation algorithms that will meet all of these requirements.
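
To make requirement 4 concrete: placing a resource inside an allocation is a matter of rounding the candidate offset up to the next multiple of the reported alignment. A minimal sketch (relying on the guarantee, quoted below, that alignments are powers of two):

def align_up(offset, alignment):
    # Round offset up to the next multiple of alignment; alignment is
    # guaranteed by the specification to be a power of two.
    return (offset + alignment - 1) & ~(alignment - 1)

# Placing a resource with a 65536-byte alignment requirement after an
# existing sub-allocation that ends at offset 70000 within the allocation:
print(align_up(70000, 65536))   # 131072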

Developers are expected to use large allocations in order to stay below the limit on the number of active allocations imposed by vkAllocateMemory, but at the same time they can't use allocations that are too large and would exceed the maxMemoryAllocationSize limit. Developers don't know what sizes and types of heaps they will be presented with, so allocation sizes must be decided by educated guesses and heuristics, and probing the heap sizes at application startup.

In order to obey alignment restrictions, reduce memory fragmentation and avoid wasting too much memory by having unused space between aligned objects, it's almost certainly necessary to bucket sub-allocations by size, and place them into separate regions of memory. If this bucketing is not performed, then large numbers of small sub-allocations within an allocation can result in there not being enough contiguous space for a larger sub-allocation, even if there is otherwise enough non-contiguous free space for it.

How should sub-allocations be bucketed, though? The Vulkan specification does provide some guarantees as to what the returned alignment restrictions will be, including (but not limited to) the following:

The alignment member is a power of two.

If usage included VK_BUFFER_USAGE_STORAGE_BUFFER_BIT, alignment must be an integer multiple of VkPhysicalDeviceLimits::minStorageBufferOffsetAlignment.

However, the alignment requirements can vary wildly between platforms. As an example, I wrote a small program that tried asking for the memory requirements for an RGBA8 texture in every combination of power-of-two sizes up to a maximum of 4096 (the largest texture width/height guaranteed to be supported by all Vulkan implementations). I specifically asked for textures using the tiling type VK_IMAGE_TILING_OPTIMAL as there is very little reason to use the discouraged VK_IMAGE_TILING_LINEAR. Use of VK_IMAGE_TILING_LINEAR can relax storage/alignment restrictions at the cost of much slower rendering performance.

I ran the program on a selection of platforms:

  • LinuxIntelMesa: Linux, Intel(R) HD Graphics 620, Mesa driver
  • LinuxAMDRADV: Linux, AMD Radeon RX 5700, RADV driver
  • WindowsAMDProp: Windows, AMD Integrated, proprietary driver
  • WindowsNVIDIAProp: Windows, NVIDIA GeForce GTX 1660 Ti, proprietary driver

The following graph shows the alignment requirements for every image size on every driver (click for full-size image):

Alignment Requirements

Some observations can be made from this data:

  • On the LinuxIntelMesa platform, the required alignment for image data is always 4096. This is almost certainly something to do with the fact that the GPU is integrated with the CPU, and simply expects image data to be aligned to the native page size of the platform.
  • On the WindowsNVIDIAProp platform, the required alignment for image data is always 1024.
  • On the LinuxAMDRADV platform, the required alignment for image data is either 4096 or 65536. Strangely, there appears to be no clear relation that explains why an image might require 65536 byte alignment instead of 4096 byte alignment. The first image size to require 65536 byte alignment is 128x128, which coincidentally requires 65536 bytes of storage. However, a smaller image size such as 256x64 also requires 65536 bytes of storage, but only has a reported alignment requirement of 4096 bytes.
  • The WindowsAMDProp platform behaves similarly to the LinuxAMDRADV platform except that it often allows for a smaller alignment of 256 bytes. Even some very large images such as 16x4096 can require a 256 byte alignment.
  • These platforms do, at least, tend to stick to a very small set of alignment values.

Similarly, the data for the storage requirements for each size of image (click for full-size image):

Storage Requirements

Some observations can be made from this data:

  • Images of the exact same size and format can take different amounts of space on different platforms. A 128x16 image using 4 bytes per pixel should theoretically take 128 * 16 * 4 = 8192 bytes of storage space, but it actually requires 20504, 16384, or 8192 bytes depending on the target platform.
  • All platforms seem to have some minimum storage size, such that even microscopic images will require a minimum amount of storage. This suggests some kind of internal block or page-based allocation scheme in the hardware. On LinuxIntelMesa, images will always consume at least 8126 bytes. On LinuxAMDRADV, images will always consume at least 4096 bytes. On WindowsNVIDIAProp, images will always consume at least 512 bytes. On WindowsAMDProp, images will always consume at least 256 bytes.
  • Frequently, images with similar values on one dimension will require the same amount of storage. For example, 2x4096 and 4x4096 sized images require the same amount of storage on all surveyed platforms. This is true for some platforms all the way up to 64x4096!
  • On LinuxIntelMesa, storage sizes vary a lot, and are often not powers of two. On the other platforms, storage sizes are always a power of two.

The raw datasets are available:

With all of this data, it suffices to say that it is not possible for an allocator to use any kind of statically-determined, platform-independent size-based bucketing policy; the storage and alignment requirements for any given image differ wildly across platforms and seem to bear very little relation to the dimensions of the images.

However, textures are fairly complex in the sense of having lots of different properties such as format, number of layers, number of mipmaps, tiling mode, etc. We know that most GPUs have hardware specifically dedicated to texture operations, and so we can infer that a lot of the odd storage and alignment restrictions might be due to the idiosyncrasies of that hardware.

Vulkan developers also work with buffers, which can more or less be thought of as arrays that live in device memory. Do buffers also have the same storage and alignment oddities in practice? I wrote another program that requests memory requirements for a range of different sizes of buffer and ran it on the set of platforms above. I requested a buffer with a usage of type VK_BUFFER_USAGE_STORAGE_BUFFER_BIT, although trying different usage flags didn't seem to change the numbers returned on any platform, so we can probably assume that the values will be fairly consistent for all usage flags.

The following graph shows the alignment requirements for a range of buffer sizes on every driver (click for full-size image):

Alignment Requirements

Only one observation can be made from this data:

  • The alignment requirements appear to be fixed on each platform, irrespective of the requested buffer size.

The following graph shows the storage requirements for a range of buffer sizes on every driver (click for full-size image):

Storage Requirements

Some observations can be made from this data:

  • On all platforms, for buffers larger than about 32 bytes, the storage size required for a given buffer is typically within about eight bytes of the original requested size.
  • On LinuxAMDRADV and WindowsNVIDIAProp, requesting a buffer of less than 16 bytes simply results in an allocation of 16 bytes.
  • For buffers larger than about 1000 bytes (the threshold will almost certainly turn out to be 1024), the required storage size is exactly equal to the requested buffer size.

The raw datasets are available:

So, in practice, on these particular platforms, buffers do not appear to have such a wide range of storage and alignment requirements.

Assumptions

By combining some of the measurements we've seen so far, and by seeing what guarantees the Vulkan spec gives us, we can try to put together a set of assumptions that might help in designing a system for allocating memory that can satisfy all the fairly painful requirements Vulkan demands.

Firstly, I believe that allocations for textures should be treated separately from allocations for buffers.

For textures: On the platforms we surveyed, textures have alignment requirements that fall within the integer powers of two in the range [2⁸, 2¹⁶]. We could therefore divide the heap into allocations based on alignment size and memory type. On the platforms we surveyed, this would effectively avoid creating too many allocations, because there were at most three different alignment values on a given platform.

When a texture is required that has an alignment size S and memory type T, we sub-allocate from an existing allocation that has been created for alignment size S and memory type T, or create a new one if either the existing allocations are full, or no allocations exist. Within an allocation, we can track blocks of size S. By working in terms of blocks of size S, we guarantee that sub-allocations always have the correct alignment. Additionally, by fixing S on a per-allocation basis, we reduce wasted space: There will be no occurrences of small, unaligned sub-allocations breaking up contiguous free space and preventing larger aligned sub-allocations from being created.
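
As a sketch of that bucketing arithmetic (an illustration only, not a finished design): requests are rounded up to whole blocks of the alignment size, and allocations are looked up by an (alignment, memory type) key:

def blocks_required(resource_size, block_size):
    # The number of whole blocks of size block_size needed to hold a resource.
    return -(-resource_size // block_size)

# The 256x256 RGBA8 texture from earlier: 262144 bytes of storage with a
# 65536-byte alignment requirement; the memory type index here is made up.
_size, _alignment, _memory_type = 262144, 65536, 3

_bucket_key = (_alignment, _memory_type)      # which allocations to search
print(blocks_required(_size, _alignment))     # 4 blocks of 65536 bytes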

We could choose to also group allocations by texture size, so allocations would be created for a combination of alignment size S, memory type T, and texture size P. I think this would likely be a bad idea unless the application only used a single texture size; in applications that used a wide range of texture sizes this would result in a large number of different allocations being created, and it's possible the allocation count limit could be reached.

In terms of sizing the allocations used for textures, we can simplify the situation further if we are willing to limit the maximum size of textures that the application will accept. We can see from the existing data that a 4096x4096 texture using four bytes per pixel will require just over 64mb of storage space. Many existing GPUs are capable of using textures at sizes of 8192x8192 and above. We could make the simplifying assumption that any textures over, say, 2048x2048 are classed as humongous and would therefore use a different allocation scheme. The Java virtual machine takes a similar approach for objects that have a size that is over a certain percentage of the heap size.

If we had an 8gb heap and divided it up into 32mb allocations, we could cover the entire heap in around 250 allocations, and each allocation would be able to store a 2048x2048 texture with room to spare. The same heap divided into 128mb allocations would need just over 62 allocations to cover the entire heap. A 128mb allocation would easily hold at least one 4096x4096 texture. However, the larger the individual allocations, the more likely it is that the entirety of the heap could be used up before allocations could be created for all the required combinations of S and T. We can derive a rough heuristic for the allocation size for a heap of size H where the maximum allowed size for a resource is M:

∃d. H / d ≃ K, size = max(M, d) 

That is, there's some allocation size d that will divide the heap into roughly K parts. The maximum allowed size of a resource is M. Therefore, the size used for allocations should be whichever of d or M is larger. If we choose K = 62 for an 8gb heap, then d ≈ 8000000000 / 62 ≈ 129000000 and, if we are satisfied with resources that are at most 64mb, size = max(M, d) = max(64000000, 129000000) = 129000000.
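
The heuristic is simple enough to state as code; a sketch using the same numbers as above (treating "gb" and "mb" as powers of ten):

def allocation_size(heap_size, max_resource_size, parts):
    # Divide the heap into roughly `parts` allocations, but never make an
    # allocation smaller than the largest resource we intend to store in it.
    _d = heap_size // parts
    return max(max_resource_size, _d)

print(allocation_size(8_000_000_000, 64_000_000, 62))   # 129032258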

We could simplify matters further by requiring that the application provide up-front hints as to the range of texture sizes and formats that it is going to use (and the ways in which those textures are going to be used). This would be an impossibly onerous restriction for a general-purpose malloc(), but it's perfectly feasible for a typical rendering engine.

This would allow us to evaluate the memory requirements of all the combinations of S and T that are likely to be encountered when the application runs, and try to arrange for an optimal set of allocations of sizes suitable for the system's heap size. Obviously, the ideal situation for this kind of allocator would be that the application would use exactly one size of texture, and would use those textures in exactly one way. This is rarely the case for real applications!

Within an allocation, we would take care to sub-allocate blocks using a best-fit algorithm in order to reduce fragmentation. Most best-fit algorithms run in O(N) time over the set of free spaces, but the size of this set can be kept small by merging adjacent free spaces when deallocating sub-allocations.
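
A minimal sketch of such a scheme, working in units of blocks within a single allocation (an assumption about how this might look rather than a finished implementation):

class BlockAllocation:
    # Free space is tracked as (offset, length) ranges measured in blocks.
    def __init__(self, block_count):
        self.free_ranges = [(0, block_count)]

    def take(self, blocks):
        # Best fit: the smallest free range that can hold the request.
        _candidates = [r for r in self.free_ranges if r[1] >= blocks]
        if not _candidates:
            return None
        _offset, _length = min(_candidates, key=lambda r: r[1])
        self.free_ranges.remove((_offset, _length))
        if _length > blocks:
            self.free_ranges.append((_offset + blocks, _length - blocks))
        return _offset

    def give_back(self, offset, blocks):
        # Return a range to the free list, merging adjacent free ranges so
        # that the list stays small and contiguous space is preserved.
        self.free_ranges.append((offset, blocks))
        self.free_ranges.sort()
        _merged = [self.free_ranges[0]]
        for _o, _l in self.free_ranges[1:]:
            _po, _pl = _merged[-1]
            if _po + _pl == _o:
                _merged[-1] = (_po, _pl + _l)
            else:
                _merged.append((_o, _l))
        self.free_ranges = _merged

# A 128mb allocation of 64kb blocks; the texture from earlier takes 4 blocks.
_a = BlockAllocation(block_count=2048)
_t = _a.take(4)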

For humongous textures, the situation is slightly unclear. Unless the application is routinely texturing all of its models with massive images, then those humongous textures are likely to be render targets. If they aren't render targets, then the application likely has bigger problems! I suspect that the right thing to do in this case is to simply reject the allocation and tell the user "if you want to allocate a render target, then use this specific API for doing so". The render target textures can then be created as dedicated allocations and won't interfere with the rest of the texture memory allocation scheme.

For buffers: The situation appears to be much simpler. On all platforms surveyed, the alignment restrictions for buffers fall within a small range of small powers of two, and don't appear to change based on the buffer parameters at all. We can use the same kind of S and T based bucketing scheme, but be happy in the knowledge that all of our created allocations will probably have the same S value.

Next Steps

I'm going to start work on a Java API to try to codify all of the above. Ideally there would be an API to examine the current platform and suggest reasonable allocation defaults, and a separate API to actually manage the heap(s) according to the chosen values. The first API would work along the lines of "here's the size of my heap, here are the texture sizes and formats I'm going to use; give me what you think are sensible allocation sizes".

There'll also need to be some introspection tools to measure properties such as contiguous space usage, fragmentation, etc.

Compaction and defragmentation is a topic I've not covered. It doesn't really seem like there's much to it other than "take all allocations and then sort all sub-allocations by size to free up space at the ends of allocations". It's slightly harder to actually implement because there will be existing Vulkan resources referring to memory that will have "moved" after defragmentation. The difficulty is really just a matter of designing a safe API around it.

To Fastmail

I'm shutting down my mail server and moving to fastmail. If everything works correctly, nothing should appear to be any different to the outside world; all old email addresses will continue to work.