crush depth

Becoming Organized

I have an idiotic number of GitHub repositories. There are, at the time of writing, 536 repositories in my account, of which I'm the "source" for 454. At last count, I'm maintaining 159 of my own projects, plus somewhere between ten and a hundred open source and closed source projects for third parties.

When I registered on GitHub back in something like 2011, there was no concept of an "organization". People had personal accounts and that was that.

GitHub exposes many configuration options that can be applied on a per-organization basis. For example, if a CI process requires access to secrets, those secrets can be configured in a single place and shared across all builds in an organization. If the same CI process is used in a personal account, the secrets it uses have to be configured in every repository that uses the CI process. This is obviously unmaintainable.

I've decided to finally set up an organization, and I'll be transferring actively maintained ongoing projects there.

Seasonic SS-600H2U Issues

I recently moved the components of an existing server into a new SC-316 case. I ordered a Seasonic SS-600H2U power supply to replace the somewhat elderly Seasonic ATX power supply that was in the old case.

Unfortunately, when I turned on the power supply for the first time, I was greeted with an angry vacuum cleaner sound that suggested bad fan bearings.

See the following video:

Noise

I filed a support request with ServerCase, sent them the video, and they wordlessly sent me a new power supply without even asking for the old one back. Thanks very much!

A Workflow For Wide Outpainting

Edit: https://github.com/io7m/com.io7m.visual.comfyui.wideoutpaint

One of the acknowledged limitations of Stable Diffusion is that the underlying models used for image generation have a definite preferred image size. For example, the Stable Diffusion 1.5 model prefers to generate images that are 512x512 pixels in size. SDXL, on the other hand, prefers to generate images that are 1024x1024 pixels in size. Models will typically also tolerate other aspect ratios, as long as the product of the dimensions stays close to the number of pixels in the "native" image size. For example, there's a list of "permitted" dimensions that will work for SDXL: 1024x1024, 1152x896, 1216x832, and so on. The product of each of these pairs of dimensions is always close to 1048576.
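
As a rough illustration (my own sketch, not taken from any SDXL documentation), the following enumerates width/height pairs where both dimensions are multiples of 64 and the total pixel count stays within a few percent of 1024x1024:

# Enumerate candidate SDXL-style resolutions: both dimensions are
# multiples of 64, and the total pixel count stays within 5% of 1024x1024.
NATIVE_PIXELS = 1024 * 1024

for width in range(512, 2049, 64):
    for height in range(512, 2049, 64):
        pixels = width * height
        if abs(pixels - NATIVE_PIXELS) / NATIVE_PIXELS <= 0.05:
            print(f"{width}x{height} ({pixels} pixels)")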

This is all fine, except that often we want to generate images larger than the native size. There are methods such as model-based upscaling (Real-ESRGAN and SwinIR being the current favourites), but these don't really result in more detail being created in the image; they more or less just sharpen lines and give definition to forms.

It would obviously be preferable if we could just ask Stable Diffusion to generate larger images in the first place. Unfortunately, if we ask any of the models to generate images larger than their native image size, the results are often problematic, to say the least. Here's a prompt:

high tech office courtyard, spring, mexico, sharp focus, fine detail

From Stable Diffusion 1.5 using the Realistic Vision checkpoint, this yields the following image at 512x512 resolution:

Image 0

That's fine. There's nothing really wrong with it. But what happens if we want the same image but at 1536x512 instead? Here's the exact same sampling settings and prompt, but with the image size set to 1536x512:

Image 1

It's clearly not just the same image but wider. It's an entirely different location. Worse, specifying large initial image sizes like this can lead to subject duplication depending on the prompt. Here's another prompt:

woman sitting in office, spain, business suit, sharp focus, fine detail

Image 2

Now here's the exact same prompt and sampling setup, but at 1536x512:

Image 3

We get subject duplication.

Outpainting

It turns out that outpainting can be used quite successfully to generate images much larger than the native model image size, and can avoid problems such as subject duplication.

The process of outpainting is merely a special case of the process of inpainting. For inpainting, we mark a region of an image with a mask and instruct Stable Diffusion to regenerate the contents of that region without affecting anything outside the region. For outpainting, we simply pad the edges of an existing image to make it larger, and mark those padded regions with a mask for inpainting.

The basic approach is to select a desired output size such that the dimensions are a multiple of the native image dimension size for the model being used. For example, I'll use 1536x512: Three times the width of the Stable Diffusion 1.5 native size.

The workflow is as follows:

  1. Generate a 512x512 starting image C. This is the "center" image.

    Stage 0

  2. Pad C rightwards by 512 pixels, yielding a padded image D.

    Stage 1

  3. Feed D into an inpainting sampling stage. This will fill in the padded area using contextual information from C, without rewriting any of the original image content from C. This will yield an image E.

    Stage 2

  4. Pad E leftwards by 512 pixels, yielding a padded image F.

    Stage 3

  5. Feed F into an inpainting sampling stage. This will fill in the padded area using contextual information from E, without rewriting any of the original image content from E. This will yield an image G.

    Stage 4

  6. Optionally, run G through another sampling step set to resample the entire image at a low denoising value. This can help bring out fine details, and can correct any visible seams between image regions that might have been created by the above steps.
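
To make the padding-plus-masking idea in steps 2 and 4 concrete outside of ComfyUI, here's a minimal sketch using Pillow; the function name and the white-means-regenerate mask convention are my own assumptions, and real inpainting pipelines each have their own conventions for how the mask is supplied:

from PIL import Image

def pad_right_with_mask(image, pad=512):
    # Create a canvas wider than the original image, paste the original
    # at the left edge, and leave the padded region to be filled in.
    width, height = image.size
    padded = Image.new("RGB", (width + pad, height), (127, 127, 127))
    padded.paste(image, (0, 0))

    # The mask marks the padded region as "to be regenerated" (white)
    # and the original region as "keep" (black).
    mask = Image.new("L", (width + pad, height), 0)
    mask.paste(255, (width, 0, width + pad, height))
    return padded, mask

center = Image.open("center.png")          # the 512x512 image C
padded, mask = pad_right_with_mask(center) # the 1024x512 image D plus its mask
padded.save("padded_right.png")
mask.save("padded_right_mask.png")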

By breaking the generation of the full size image into separate steps, we can apply different prompts to different sections of the image and effectively avoid even the possibility of subject duplication. For example, let's try this prompt again:

woman sitting in office, spain, business suit, sharp focus, fine detail

We'll use this prompt for the center image, but we'll adjust the prompt for the left and right images:

office, spain, sharp focus, fine detail

We magically obtain the following image:

Image 4

ComfyUI

I've included a ComfyUI workflow that implements the above steps. Execution starts at the top left and proceeds downwards and rightwards.

The initial groups of nodes at the top left allow for specifying a base image size, prompts for each image region, and the checkpoints used for image generation. Note that the checkpoint for the center image must be a non-inpainting model, whilst the left and right image checkpoints must be inpainting models. ComfyUI will fail with a nonsensical error at generation time if the wrong kinds of checkpoints are used.

Workflow 0

Execution proceeds to the CenterSample stage which, unsurprisingly, generates the center image:

Workflow 1

Execution then proceeds to the PadRight and RightSample stages, which pad the image rightwards and then produce the rightmost image, respectively:

Workflow 2

Execution then proceeds to the PadLeft and LeftSample stages, which pad the image leftwards and then produce the leftmost image, respectively:

Workflow 3

Execution then proceeds to a FinalCleanup stage that runs a ~50% denoising pass over the entire image using a non-inpainting checkpoint. As mentioned, this can fine-tune details in the image, and eliminate any potential visual seams. Note that we must use a Set Latent Noise Mask node; the latent image will be arriving from the previous LeftSample stage with a mask that restricts denoising to the region covered by the leftmost image. If we want to denoise the entire image, we must reset this mask to one that covers the entire image.

Workflow 4

The output stage displays the preview images for each step of the process:

Workflow 5

send-pack: unexpected disconnect while reading sideband packet

Mostly a post just to remind my future self.

I run a Forgejo installation locally. I'd been seeing the following error on trying to git push to some repositories:

$ git push --all
Enumerating objects: 12, done.
Counting objects: 100% (12/12), done.
Delta compression using up to 12 threads
Compressing objects: 100% (8/8), done.
send-pack: unexpected disconnect while reading sideband packet
Writing objects: 100% (8/8), 3.00 MiB | 9.56 MiB/s, done.
Total 8 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
fatal: the remote end hung up unexpectedly
Everything up-to-date

The Forgejo server is running in a container behind an nginx proxy.

The logs for nginx showed every request returning a 200 status code, so that was no help.

The logs for Forgejo showed:

podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 401 Unauthorized in 1.0ms @ repo/githttp.go:532(repo.GetInfoRefs)
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 401 Unauthorized in 1.0ms @ repo/githttp.go:532(repo.GetInfoRefs)
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 200 OK in 145.1ms @ repo/githttp.go:532(repo.GetInfoRefs)
podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 200 OK in 145.1ms @ repo/githttp.go:532(repo.GetInfoRefs)
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0  in 143.8ms @ repo/githttp.go:500(repo.ServiceReceivePack)
podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0  in 143.8ms @ repo/githttp.go:500(repo.ServiceReceivePack)
forge01[1712171]: 2024/04/13 09:36:59 .../web/repo/githttp.go:485:serviceRPC() [E] Fail to serve RPC(receive-pack) in /var/lib/gitea/git/repositories/git/example.git: exit status 128 - fatal: the remote end hung up unexpectedly
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0  in 116.1ms @ repo/githttp.go:500(repo.ServiceReceivePack)
podman[1712113]: 2024/04/13 09:36:59 .../web/repo/githttp.go:485:serviceRPC() [E] Fail to serve RPC(receive-pack) in /var/lib/gitea/git/repositories/git/example.git: exit status 128 - fatal: the remote end hung up unexpectedly
podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0  in 116.1ms @ repo/githttp.go:500(repo.ServiceReceivePack)

Less than helpful, but it does seem to place blame on the client.

Apparently, what I needed was:

$ git config http.postBuffer 157286400

After doing that, everything magically worked:

$ git push
Enumerating objects: 12, done.
Counting objects: 100% (12/12), done.
Delta compression using up to 12 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 3.00 MiB | 153.50 MiB/s, done.
Total 8 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
remote: . Processing 1 references
remote: Processed 1 references in total
To https://forgejo/git/example.git
   bf6389..5f216b  master -> master

The git documentation says:

http.postBuffer

Maximum size in bytes of the buffer used by smart HTTP transports when POSTing data to the remote system. For requests larger than this buffer size, HTTP/1.1 and Transfer-Encoding: chunked is used to avoid creating a massive pack file locally. Default is 1 MiB, which is sufficient for most requests.

I don't know why this fixes the problem.

Consistent Environments In Stable Diffusion

I've been working with Stable Diffusion lately.

One acknowledged "problem" with these kinds of text-to-image systems is that it's difficult to generate consistent subjects and/or background environments between images. Consider the case of a comic book artist using the system to create background images for each panel: If the characters are all standing in a room, then it's probably desirable for the room to look the same in each panel.

I've been working on a mostly-automated system for generating consistent scenes by combining an image-to-image workflow, 3D rendered images produced from Blender, and some scripting and techniques lifted from deferred rendering to produce a camera-dependent text prompt automatically.

Thinking Machines

Image generation systems like Stable Diffusion aren't intelligent. The system doesn't really understand what it's generating, and there's no way to tell it "give me the same room again but from a different camera angle". It has no concept of what a room is, or what a camera is.

It's somewhat difficult to describe processes in Stable Diffusion, because it's very tempting to talk about the system in terms of intelligence. We can talk about Stable Diffusion "improvising" objects and so on, but there's really nothing like human improvisation and cognition occurring. During generation, the system is taking manually-guided paths through a huge set of random numbers according to a huge, cleverly-constructed database of weight values. Nevertheless, for the sake of making this article readable, I'll talk about Stable Diffusion as if it's an intelligent system making choices, and how we can guide that system into making the choices we want.

In order to produce a consistent scene, it appears to be necessary to start with consistent source images, and to provide an accurate text prompt.

Requirements: Text Prompt

Stable Diffusion works with a text prompt. A text prompt describes the objects that should be generated in a scene. Without going into a ton of detail, the Stable Diffusion models were trained on a dataset called LAION-5B. The dataset consists of around 5.8 billion images, each of which was annotated with text describing the image. When you provide Stable Diffusion with a text prompt consisting of "A photograph of a ginger cat", it will sample from images in the original dataset that were annotated with photograph, ginger cat, and so on.

The generation process is incremental, and there appears to be a certain degree of feedback in the generation process; the system appears to inspect its own output during generation and will effectively try to "improvise" objects that it believes it might be generating if nothing in the text prompt appears to describe the random pile of pixels it is in the process of refining. More concretely, if the generation process starts producing something that looks orange and furry, and nothing in the prompt says ginger cat, then the system might produce an image of a dog with ginger hair. It might just as easily produce an image of a wicker basket!

It's therefore necessary, when trying to generate a consistent scene, to describe every important object visible in the scene in the text prompt. This requirement can lead to problems when trying to generate a scene from multiple different camera viewpoints in a room:

  • If an object O is visible in camera viewpoint A but not visible in camera viewpoint B, and the artist forgets to remove the description of O from the text prompt when generating an image from B, Stable Diffusion might replace one of the other objects in the scene with an improvised version of O, because it believes that the description of O matches something else that is visible in B.

  • If an object O is visible in both camera viewpoint A and camera viewpoint B, and the artist forgets to include a description of O in the prompt when generating an image from B, then O might appear completely different in A and B because, lacking a description of O, Stable Diffusion will essentially just improvise O.

Additionally, the way the text prompt works in Stable Diffusion is that text that appears earlier in the prompt is effectively treated as more significant when generating the image. It's therefore recommended that objects that are supposed to be more prominently displayed in the foreground be described first, and less significant objects in the background be described last.

We would therefore like a system that, for each visible object in the current camera viewpoint, returns a one-sentence textual description of the object. If an object isn't visible, then no description is returned. The textual descriptions must be returned in ascending order of distance from the camera; the nearest objects have their descriptions returned first.

Requirements: Source Images

Stable Diffusion is capable of re-contextualizing existing images. The diffusion process, in a pure text-to-image workflow, essentially takes an image consisting entirely of gaussian noise and progressively denoises the image until the result (hopefully) looks like an image described by the text prompt.

Denoising

In an image-to-image workflow, the process is exactly the same except for the fact that the starting point isn't an image that is 100% noise. The input image is typically an existing drawing or photograph, and the workflow is instructed to introduce, say, 50% noise into the image, and then start the denoising process from that noisy image. This allows Stable Diffusion to essentially redraw an existing image, replacing forms and subjects by an artist-configurable amount.
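
For comparison with the ComfyUI workflow used throughout this post, the same idea can be sketched with the diffusers library, where the strength parameter plays the role of the introduced-noise percentage; the checkpoint name and file paths here are illustrative assumptions:

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load a Stable Diffusion 1.5 checkpoint for image-to-image generation.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# strength=0.5 corresponds to introducing roughly 50% noise into the
# starting image before denoising begins.
source = Image.open("coffee.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="two cups of coffee next to a coffee pot",
    image=source,
    strength=0.5,
).images[0]
result.save("coffee-50.png")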

For example, we can take the following rather badly-generated image:

Starting Coffee

We can feed this image into Stable Diffusion, and specify a text prompt that reads two cups of coffee next to a coffee pot. We can specify that we want the process to introduce 100% noise into the image before starting the denoising process. We get the following result:

100% Coffee

While the image is certainly coffee-themed, it bears almost no relation to the original image. This is because we specified 100% noise as the starting point; it's essentially as if we didn't bother to supply a starting image at all. If we instead specify that we want 75% noise introduced, we get the following image:

75% Coffee

This is starting to get closer to the original image. The composition is a little closer, and the colours are starting to match. Once we specify that we want 50% noise introduced, we get the following image:

50% Coffee

The image is very close to the original image, but the details are somewhat more refined. If we continue by going down to, say, 25% denoising, we get:

25% Coffee

The result is, at first glance, almost indistinguishable from the original image. Now, this would be fine if we were trying to achieve photorealism and the source image we started with was already photorealistic. However, if what we have as a source image isn't photorealistic, then going below about 50% denoising constrains the generation process far too much; the resulting image will be just as unrealistic as the source image.

We have a tension between specifying a noise amount low enough that the system doesn't generate an image completely different from our source image, and one high enough to give the system room to improvise and (hopefully) produce a photorealistic image from a non-photorealistic input.

To produce a consistent set of source images to render a room from multiple viewpoints, we'll need to build a 3D model in Blender that contains basic colours, forms, and some lighting, but that makes no real attempt to be photorealistic. We'll depend on Stable Diffusion to take those images and produce a consistent but photorealistic output. We assume that we can't start out with photorealistic images; if we already had photographs taken in a room from all the camera angles we wanted... Why would we be using Stable Diffusion in the first place?

In practice, I've found that noise values in the range 50-55% will work. We'll come back to this later on.

Automating Prompts

As mentioned earlier, we want to be able to have Blender produce the bulk of our text prompt for us in an automated manner. We need to do the following:

  1. Annotate individual objects in a scene with one-sentence descriptions. For example, "A red sofa", "A potted plant with pink flowers", etc.
  2. When rendering an image of the scene, produce a list S of the objects that are actually visible from the current viewpoint.
  3. Sort the list S such that the objects are ordered from nearest to furthest from the camera.
  4. For each object in S, extract the assigned text description and concatenate the descriptions into a comma-separated list for direct use in Stable Diffusion.

An example objectClass.blend scene is provided that implements all of the above.

Annotating Objects

For each of the objects in the scene, we have to create a custom string-typed Description property on the object that contains our text prompt for that object:

Object 0

Object 1

Object 2

Object 3

Note that we also specify a numeric, non-zero pass index for each object. Every object that has the same text prompt should have the same pass index, for reasons that will become clear shortly.
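
For reference, the same annotations can also be applied from a script rather than through the UI; a minimal sketch, assuming hypothetical object names:

import bpy

# Attach a text description (a custom property) and a non-zero pass
# index to each object; objects sharing a prompt share a pass index.
_sofa = bpy.data.objects["Sofa"]
_sofa["Description"] = "A red sofa"
_sofa.pass_index = 1

_plant = bpy.data.objects["Plant"]
_plant["Description"] = "A potted plant with pink flowers"
_plant.pass_index = 2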

Object Visibility

Blender, unfortunately, doesn't provide an API for determining object visibility. My original line of thought was to take the frustum of the currently active camera in the scene, and calculate the set of objects that intersect that frustum. Unfortunately, this is more work than I'm willing to attempt in a clumsy language like Python, and it also wouldn't take into account whether objects in the frustum are occluded by other objects. If an object is completely hidden behind another object in the foreground, then it shouldn't appear in the text prompt.

Instead, I turned to a technique used in some forms of deferred rendering. Essentially, we render an image of the scene where each pixel in the image contains exactly one integer value denoting the numeric index of the object that produced the surface at that pixel. More specifically, the pass index that we specified for each object earlier is written directly to the output image.

For example, whilst the standard colour render output of the example scene looks like this:

Render

The "object index pass" output looks like this (text added for clarity):

Index

The pixels that make up the cone all have integer value 4. The pixels that make up the sphere all have integer value 2, and so on. The pixels that make up the background have integer value 0 (there's a tiny bit of the background visible in the top-left corner of the image). We'll effectively ignore all pixels with value 0.

The same scene rendered from a different viewpoint produces the following images:

Render 2

Index 2

Note that, from this viewpoint, the cone is almost entirely occluded by the cube.

Now, unfortunately, due to a limitation in Blender, the EEVEE realtime renderer doesn't have an option to output object IDs directly as a render pass. I'm impatient and don't want to spend the time waiting for Cycles to render my scenes, so instead we can achieve the same result by using an AOV Output in each scene material.

First, create a new Value-typed AOV Output in the scene's View Layer properties:

AOV 0

Now, for each material used in the scene, declare an AOV output node and attach the object's (confusingly named) Object Index to it:

AOV 1

Finally, in the Compositor settings, attach the AOV output to a Viewer Node:

AOV 2

The reason for this last step is that we're about to write a script to examine the AOV image, and Blender has an infuriating internal limitation where it won't give us direct access to the pixels of a rendered image unless that rendered image came from a Viewer Node.
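
If the scene has many materials, the same setup can be scripted instead of clicked through. The following is a rough sketch of the idea, not the exact setup used in the example scene, and the node and socket names may need adjusting depending on the Blender version; the compositor hookup to the Viewer Node is still done as shown above:

import bpy

# Declare a Value-typed AOV named "ObjectIndex" on the active view layer.
_aov = bpy.context.view_layer.aovs.add()
_aov.name = "ObjectIndex"
_aov.type = 'VALUE'

# For each node-based material, feed the Object Index output of an
# Object Info node into an AOV Output node with the same AOV name.
for _material in bpy.data.materials:
    if not _material.use_nodes:
        continue
    _nodes = _material.node_tree.nodes
    _links = _material.node_tree.links
    _info = _nodes.new('ShaderNodeObjectInfo')
    _output = _nodes.new('ShaderNodeOutputAOV')
    _output.aov_name = "ObjectIndex"
    _links.new(_info.outputs['Object Index'], _output.inputs['Value'])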

We now have images that can effectively tell us which objects are visible in the corresponding coloured render images. Note that we have the same limitation here as in traditional deferred rendering: We can't handle translucent or transparent objects. This hasn't, so far, turned out to be much of a problem in practice.

Producing The Prompt

We now need to write a small Python script that can take the rendered images and produce a prompt. The complete script is included as a text object in the included Blender scene, but I'll walk through the code here.

The first step is to get a reference to the currently active camera in the scene, and set up the various output files we'll be using:

import bpy
import math
import numpy

_output_directory = "/shared-tmp"

class ClassifiedObject:
    # The pass index, distance from the active camera, text description,
    # and underlying Blender object for a single annotated object.
    index: int
    distance: float
    description: str
    object: bpy.types.Object


_objects = []
_camera = bpy.context.scene.camera

print('# Camera: {0}'.format(_camera.name))

_output_props = '{0}/{1}.properties'.format(_output_directory, _camera.name)
_output_prompt = '{0}/{1}.prompt.txt'.format(_output_directory, _camera.name)
_output_samples = '{0}/{1}.npy'.format(_output_directory, _camera.name)

print('# Properties: {0}'.format(_output_props))
print('# Prompt:     {0}'.format(_output_prompt))
print('# Samples:    {0}'.format(_output_samples))

We'll hold references to visible objects in values of type ClassifiedObject.

We then iterate over all objects in the scene and calculate the Euclidean distance between the origin of each object and the camera.

def _distance_to_camera(_camera, _object):
    _object_pos = _object.location
    _camera_pos = _camera.location
    _delta_x = _camera_pos.x - _object_pos.x
    _delta_y = _camera_pos.y - _object_pos.y
    _delta_z = _camera_pos.z - _object_pos.z
    _delta_x_sq = _delta_x * _delta_x
    _delta_y_sq = _delta_y * _delta_y
    _delta_z_sq = _delta_z * _delta_z
    _sum = _delta_x_sq + _delta_y_sq + _delta_z_sq
    return math.sqrt(_sum)

for _o in bpy.data.objects:
    if _o.pass_index:
        _r = ClassifiedObject()
        _r.distance = _distance_to_camera(_camera, _o)
        _r.object = _o
        _r.index = _o.pass_index
        _r.description = _o['Description']
        _objects.append(_r)
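
As an aside (my own shortcut, not part of the script in the example scene), Blender's mathutils vectors would let the distance computation collapse to a one-liner:

# Equivalent to _distance_to_camera above: object and camera locations
# are mathutils.Vector values, so subtraction and .length work directly.
def _distance_to_camera_short(_camera, _object):
    return (_camera.location - _object.location).length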

We then sort the objects we saved in ascending order of distance, so that objects nearest to the camera appear earliest in the list, and we save the objects to an external properties file in case we want to process them later:

_objects.sort(key=lambda o: o.distance)

with open(_output_props, "w") as f:
    for _o in _objects:
        f.write('objects.{0}.index = {0}\n'.format(_o.index))
        f.write('objects.{0}.description = {1}\n'.format(_o.index, _o.description))
        f.write('objects.{0}.name = {1}\n'.format(_o.index, _o.object.name))
        f.write('objects.{0}.distance = {1}\n'.format(_o.index, _o.distance))

We then get a reference to the image containing our integer object indices, and save the pixels to a numpy array. Normal Python is too slow to process images this large, so we need to do the processing using numpy to avoid having to wait twenty or so minutes each time we want to process an image. Using numpy, the operation is completed in a couple of seconds. The raw pixel array returned from the viewer node is actually an RGBA image where each pixel consists of four 32-bit floating point values. Our AOV output is automatically expanded such that the same value will be placed into each of the four colour channels, so we can actually avoid sampling all four channels and just use the values in the red channel. The [::4] expression indicates that we start from element 0, process the full length of the array, but skip forward by 4 components each time.

_image = bpy.data.images['Viewer Node']
_data = numpy.array(_image.pixels)[::4]

We now need to iterate over all the pixels in the image and count the number of times each object index appears. Astute readers might be wondering why we don't just maintain an int -> boolean map where we set the value of a key to True whenever we encounter that object index in the image. There are two reasons we can't do this.

Firstly, if an object is almost entirely occluded by another object, but a few stray pixels of the object are still visible, we don't want to be mentioning that object in the prompt. Doing so would introduce the problem discussed earlier where an object is erroneously injected into the scene by Stable Diffusion because the prompt said one should be in the scene somewhere.

Secondly, there will be pixels in the image that do not correspond to any object index. Why does this happen? Inspect the edges of objects in the AOV image from earlier:

Sampling

Many pixels in the image will be some linear blend of the neighbouring pixels due to antialiasing. This will occur even if the Viewport and Render samples are set to 1 (although it will be reduced). This is not a problem in practice; it just means that we need to round pixel values to integers, and consider a threshold value under which objects are assumed not to be visible. We can accomplish this as follows:

_visibility = {}
_visibility_max = _image.size[0] * _image.size[1]
_visibility_threshold = 0.005

for _x in _data:
    _xi = int(_x)
    if _xi:
        _c = _visibility.get(_xi) or 0
        _visibility[_xi] = _c + 1

print("# Processed.")
print('# Visibility: {0}'.format(_visibility))

_visible = set()

for _i in _visibility:
    _c = _visibility[_i]
    _f = float(_c) / _visibility_max
    if _f > _visibility_threshold:
        _visible.add(_i)

print('# Visible: {0}'.format(_visible))

We calculate _visibility_max as the total number of pixels in the image. We declare _visibility_threshold to be the threshold value at which an object is considered to be visible. Essentially, if the pixels of an object cover more than 0.5% of the screen, then the object is considered to be visible. Each time we encounter a pixel that has value xi, we increment the count for object xi in the _visibility dictionary.

Finally, we iterate over the _visibility dictionary and extract the indices of all objects that were considered to be visible, given the declared threshold.
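
As a side note (my own variation, not part of the script in the example scene), the per-pixel Python loop above can be vectorised entirely; here's a sketch of the same counting and thresholding step using numpy.unique, assuming _data, _visibility_max, and _visibility_threshold as defined above:

# Count how many pixels each object index covers, then keep only the
# non-background indices that cover more than the visibility threshold.
_indices, _counts = numpy.unique(_data.round().astype(int), return_counts=True)
_visible = {
    int(_i) for _i, _c in zip(_indices, _counts)
    if _i != 0 and (_c / _visibility_max) > _visibility_threshold
}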

The last step simply generates the prompt. We already have the list of objects in distance order, and we know which of those objects are actually visible, so we can simply iterate over the list and write the object descriptions to a file:

with open(_output_prompt, 'w') as f:
    for _o in _objects:
        if _o.index in _visible:
            f.write('{0},\n'.format(_o.description))

This prompt file can be consumed directly by Stable Diffusion in ComfyUI using a text file node.

To run the script (assuming the script is in a text block in Blender's text editor), we simply render the scene and then, with the text editor selected, press ALT-P to run the script.

This image:

Render 2

... Produces this output on Blender's console:

# Camera: Camera.002
# Properties: /shared-tmp/Camera.002.properties
# Prompt:     /shared-tmp/Camera.002.prompt.txt
# Samples:    /shared-tmp/Camera.002.npy
# Processing 2073600 elements...
# Processed.
# Visibility: {1: 453103, 2: 481314, 3: 616150, 4: 85}
# Visible: {1, 2, 3}
# Saved.

Note that it correctly determined that objects 1, 2, and 3 are visible (the floor, the sphere, and the cube, respectively). The cone was not considered to be visible. The resulting prompt was:

A sphere,
A cube,
A wooden floor,

From the other viewpoint:

Render

... The following output and prompt is produced:

# Camera: Camera.000
# Properties: /shared-tmp/Camera.000.properties
# Prompt:     /shared-tmp/Camera.000.prompt.txt
# Samples:    /shared-tmp/Camera.000.npy
# Processing 2073600 elements...
# Processed.
# Visibility: {1: 1144784, 2: 287860, 3: 408275, 4: 232240}
# Visible: {1, 2, 3, 4}
# Saved.
A sphere,
A cone,
A cube,
A wooden floor,

Note how the cube and floor are mentioned last, as they are furthest away.

In Practice

Using the above techniques, I built a small house in Blender, populated it with objects, and took the following renders:

House Plan

House 0

House 1

House 2

No attempt was made at photorealism. I took care to only provide objects with minimal forms and textures (often just flat colours). I ran the images through an image-to-image workflow at approximately 52% denoising, and received the following images:

House 0

House 1

House 2

While by no means perfect, the room is at least recognizable as being the same room from different angles. With some more precise prompting and some minor inpainting, I believe the room could be made completely consistent between shots.

We must negate the machines-that-think. Humans must set their own guidelines. This is not something machines can do. Reasoning depends upon programming, not on hardware, and we are the ultimate program! Our Jihad is a "dump program." We dump the things which destroy us as humans!

— Frank Herbert, Children of Dune