I have an idiotic number of GitHub repositories. There are, at the time of
writing, 536 repositories in my account, and I'm the "source" for 454 of those.
At last count, I'm maintaining 159 of my own projects, plus
less-than-a-hundred-but-more-than-ten open source and closed source projects
for third parties.
When I registered on GitHub back in something like 2011, there was no concept of an "organization". People had personal accounts and that was that.
GitHub exposes many configuration options that can be applied on a per-organization basis. For example, if a CI process requires access to secrets, those secrets can be configured in a single place and shared across all builds in an organization. If the same CI process is used in a personal account, the secrets it uses have to be configured in every repository that uses the CI process. This is obviously unmaintainable.
I've decided to finally set up an organization, and I'll be transferring actively maintained ongoing projects there.
I recently moved the components of an existing server into a new SC-316 case. I ordered a Seasonic SS-600H2U power supply to replace the somewhat elderly Seasonic ATX power supply that was in the old case.
Unfortunately, when I turned on the power supply for the first time, I was greeted with an angry vacuum cleaner sound that suggested bad fan bearings.
See the following video:
I filed a support request with ServerCase, sent them the video, and they wordlessly sent me a new power supply without even asking for the old one back. Thanks very much!
Edit: https://github.com/io7m/com.io7m.visual.comfyui.wideoutpaint
One of the acknowledged limitations with Stable Diffusion is that the
underlying models used for image generation have a definite preferred image
size. For example, the Stable Diffusion 1.5 model prefers to generate images
that are 512x512 pixels in size. SDXL, on the other hand, prefers to generate
images that are 1024x1024 pixels in size. Models will typically also tolerate
some variation in the width and height values, as long as the product of the
dimensions yields roughly the same number of pixels as the "native" image
size. For example, there's a list of "permitted" dimensions that will work
for SDXL: 1024x1024, 1152x896, 1216x832, and so on. The product of each of
these pairs of dimensions is approximately 1048576.
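To make the "roughly the same pixel budget" idea concrete, here's a small Python sketch that enumerates candidate resolutions near the SDXL budget. The multiple-of-64 step and the 10% tolerance are assumptions for illustration, not values taken from the SDXL documentation:

import itertools

# Native SDXL pixel budget.
NATIVE_PIXELS = 1024 * 1024

def candidate_resolutions(step=64, tolerance=0.10):
    """Return (width, height) pairs whose pixel count is within `tolerance`
    of the native budget. The step and tolerance are illustrative guesses."""
    results = []
    for width, height in itertools.product(range(512, 2049, step), repeat=2):
        pixels = width * height
        if abs(pixels - NATIVE_PIXELS) / NATIVE_PIXELS <= tolerance:
            results.append((width, height))
    return results

for (w, h) in candidate_resolutions():
    print(f"{w}x{h} = {w * h} pixels")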
This is all fine, except that often we want to generate images larger than the native size. There are methods such as model-based upscaling (Real-ESRGAN and SwinIR being the current favourites), but these don't really result in more detail being created in the image; they more or less just sharpen lines and give definition to forms.
It would obviously be preferable if we could just ask Stable Diffusion to generate larger images in the first place. Unfortunately, if we ask any of the models to generate images larger than their native image size, the results are often problematic, to say the least. Here's a prompt:
high tech office courtyard, spring, mexico, sharp focus, fine detail
From Stable Diffusion 1.5 using the
Realistic Vision
checkpoint, this yields the following image at 512x512
resolution:
That's fine. There's nothing really wrong with it. But what happens if we
want the same image but at 1536x512 instead? Here are the exact same sampling
settings and prompt, but with the image size set to 1536x512:
It's clearly not just the same image but wider. It's an entirely different location. Worse, specifying large initial image sizes like this can lead to subject duplication depending on the prompt. Here's another prompt:
woman sitting in office, spain, business suit, sharp focus, fine detail
Now here's the exact same prompt and sampling setup, but at 1536x512:
We get subject duplication.
It turns out that outpainting can be used quite successfully to generate images much larger than the native model image size, and can avoid problems such as subject duplication.
The process of outpainting is merely a special case of the process of inpainting. For inpainting, we mark a region of an image with a mask and instruct Stable Diffusion to regenerate the contents of that region without affecting anything outside the region. For outpainting, we simply pad the edges of an existing image to make it larger, and mark those padded regions with a mask for inpainting.
The basic approach is to select a desired output size such that the dimensions
are a multiple of the native image dimension size for the model being used.
For example, I'll use 1536x512: three times the width of the Stable Diffusion
1.5 native size.
The workflow is as follows:
1. Generate a 512x512 starting image C. This is the "center" image.
2. Pad C rightwards by 512 pixels, yielding a padded image D. (A sketch of
   this padding step follows the list.)
3. Feed D into an inpainting sampling stage. This will fill in the padded
   area using contextual information from C, without rewriting any of the
   original image content from C. This will yield an image E.
4. Pad E leftwards by 512 pixels, yielding a padded image F.
5. Feed F into an inpainting sampling stage. This will fill in the padded
   area using contextual information from E, without rewriting any of the
   original image content from E. This will yield an image G.
6. Optionally, run G through another sampling step set to resample the entire
   image at a low denoising value. This can help bring out fine details, and
   can correct any visible seams between image regions that might have been
   created by the above steps.
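To make the padding step concrete, here's a minimal sketch of step 2 using Pillow, outside of ComfyUI. The grey fill colour, the file names, and the mask convention (white marks the region to be rewritten) are assumptions for illustration; the included ComfyUI workflow does the equivalent with its PadRight and PadLeft stages.

from PIL import Image

def pad_right(image, pad=512):
    """Pad an image rightwards and return the padded image plus an
    inpainting mask where white (255) marks the padded region."""
    width, height = image.size
    padded = Image.new("RGB", (width + pad, height), (128, 128, 128))
    padded.paste(image, (0, 0))

    mask = Image.new("L", (width + pad, height), 0)
    mask.paste(255, (width, 0, width + pad, height))
    return padded, mask

# "center.png" is a hypothetical 512x512 starting image.
center = Image.open("center.png").convert("RGB")
padded, mask = pad_right(center)
padded.save("padded.png")
mask.save("mask.png")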
By breaking the generation of the full size image into separate steps, we can apply different prompts to different sections of the image and effectively avoid even the possibility of subject duplication. For example, let's try this prompt again:
woman sitting in office, spain, business suit, sharp focus, fine detail
We'll use this prompt for the center image, but we'll adjust the prompt for the left and right images:
office, spain, sharp focus, fine detail
We magically obtain the following image:
I've included a ComfyUI workflow that implements the above steps. Execution should be assumed to proceed downwards and rightwards. Execution starts at the top left.
The initial groups of nodes at the top left allow for specifying a base image size, prompts for each image region, and the checkpoints used for image generation. Note that the checkpoint for the center image must be a non-inpainting model, whilst the left and right image checkpoints must be inpainting models. ComfyUI will fail with a nonsensical error at generation time if the wrong kinds of checkpoints are used.
Execution proceeds to the CenterSample
stage which, unsurprisingly, generates
the center image:
Execution then proceeds to the PadRight
and RightSample
stages, which
pad the image rightwards and then produce the rightmost image, respectively:
Execution then proceeds to the PadLeft
and LeftSample
stages, which
pad the image leftwards and then produce the leftmost image, respectively:
Execution then proceeds to a FinalCleanup
stage that runs a ~50%
denoising
pass over the entire image using a non-inpainting checkpoint. As mentioned,
this can fine-tune details in the image, and eliminate any potential visual
seams. Note that we must use a Set Latent Noise Mask
node; the latent image
will be arriving from the previous LeftSample
stage with a mask that
restricts denoising to the region covered by the leftmost image. If we want
to denoise the entire image, we must reset this mask to one that covers the
entire image.
The output stage displays the preview images for each step of the process:
Mostly a post just to remind my future self.
I run a Forgejo installation locally. I'd been
seeing the following error on trying to git push
to some repositories:
$ git push --all
Enumerating objects: 12, done.
Counting objects: 100% (12/12), done.
Delta compression using up to 12 threads
Compressing objects: 100% (8/8), done.
send-pack: unexpected disconnect while reading sideband packet
Writing objects: 100% (8/8), 3.00 MiB | 9.56 MiB/s, done.
Total 8 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
fatal: the remote end hung up unexpectedly
Everything up-to-date
The forgejo
server is running in a container behind an
nginx proxy.
The logs for nginx
showed every request returning a 200
status code, so
that was no help.
The logs for forgejo
showed:
podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 401 Unauthorized in 1.0ms @ repo/githttp.go:532(repo.GetInfoRefs)
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 401 Unauthorized in 1.0ms @ repo/githttp.go:532(repo.GetInfoRefs)
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 200 OK in 145.1ms @ repo/githttp.go:532(repo.GetInfoRefs)
podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed GET /git/example.git/info/refs?service=git-receive-pack for 10.0.2.100:0, 200 OK in 145.1ms @ repo/githttp.go:532(repo.GetInfoRefs)
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0 in 143.8ms @ repo/githttp.go:500(repo.ServiceReceivePack)
podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0 in 143.8ms @ repo/githttp.go:500(repo.ServiceReceivePack)
forge01[1712171]: 2024/04/13 09:36:59 .../web/repo/githttp.go:485:serviceRPC() [E] Fail to serve RPC(receive-pack) in /var/lib/gitea/git/repositories/git/example.git: exit status 128 - fatal: the remote end hung up unexpectedly
forge01[1712171]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0 in 116.1ms @ repo/githttp.go:500(repo.ServiceReceivePack)
podman[1712113]: 2024/04/13 09:36:59 .../web/repo/githttp.go:485:serviceRPC() [E] Fail to serve RPC(receive-pack) in /var/lib/gitea/git/repositories/git/example.git: exit status 128 - fatal: the remote end hung up unexpectedly
podman[1712113]: 2024/04/13 09:36:59 ...eb/routing/logger.go:102:func1() [I] router: completed POST /git/example.git/git-receive-pack for 10.0.2.100:0, 0 in 116.1ms @ repo/githttp.go:500(repo.ServiceReceivePack)
Less than helpful, but it does seem to place blame on the client.
Apparently, what I needed was:
$ git config http.postBuffer 157286400
After doing that, everything magically worked:
$ git push
Enumerating objects: 12, done.
Counting objects: 100% (12/12), done.
Delta compression using up to 12 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 3.00 MiB | 153.50 MiB/s, done.
Total 8 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
remote: . Processing 1 references
remote: Processed 1 references in total
To https://forgejo/git/example.git
   bf6389..5f216b  master -> master
The git documentation says:
http.postBuffer
Maximum size in bytes of the buffer used by smart HTTP transports when POSTing data to the remote system. For requests larger than this buffer size, HTTP/1.1 and Transfer-Encoding: chunked is used to avoid creating a massive pack file locally. Default is 1 MiB, which is sufficient for most requests.
I don't know why this fixes the problem.
I've been working with Stable Diffusion lately.
One acknowledged "problem" with these kinds of text-to-image systems is that it's difficult to generate consistent subjects and/or background environments between images. Consider the case of a comic book artist using the system to create background images for each panel: If the characters are all standing in a room, then it's probably desirable for the room to look the same in each panel.
I've been working on a mostly-automated system for generating consistent scenes by combining an image-to-image workflow, 3D rendered images produced from Blender, and some scripting and techniques lifted from deferred rendering to produce a camera-dependent text prompt automatically.
Image generation systems like Stable Diffusion aren't intelligent. The system doesn't really understand what it's generating, and there's no way to tell it "give me the same room again but from a different camera angle". It has no concept of what a room is, or what a camera is.
It's somewhat difficult to describe processes in Stable Diffusion, because it's very tempting to talk about the system in terms of intelligence. We can talk about Stable Diffusion "improvising" objects and so on, but there's really nothing like human improvisation and cognition occurring. During generation, the system is taking manually-guided paths through a huge set of random numbers according to a huge, cleverly-constructed database of weight values. Nevertheless, for the sake of making this article readable, I'll talk about Stable Diffusion as if it's an intelligent system making choices, and how we can guide that system into making the choices we want.
In order to produce a consistent scene, it appears to be necessary to start with consistent source images, and to provide an accurate text prompt.
Stable Diffusion works with a text prompt. A text prompt describes the objects that should be generated in a scene. Without going into a ton of detail, the Stable Diffusion models were trained on a dataset called LAION-5b. The dataset consists of around 5.8 billion images, each of which was annotated with text describing the image. When you provide Stable Diffusion with a text prompt consisting of "A photograph of a ginger cat", it will sample from images in the original dataset that were annotated with photograph, ginger cat, and so on.
The generation process is incremental, and there appears to be a certain degree of feedback in the generation process; the system appears to inspect its own output during generation and will effectively try to "improvise" objects that it believes it might be generating if nothing in the text prompt appears to describe the random pile of pixels it is in the process of refining. More concretely, if the generation process starts producing something that looks orange and furry, and nothing in the prompt says ginger cat, then the system might produce an image of a dog with ginger hair. It might just as easily produce an image of a wicker basket!
It's therefore necessary, when trying to generate a consistent scene, to describe every important object visible in the scene in the text prompt. This requirement can lead to problems when trying to generate a scene from multiple different camera viewpoints in a room:
- If an object O is visible in camera viewpoint A but not visible in camera
  viewpoint B, and the artist forgets to remove the description of O from the
  text prompt when generating an image from B, Stable Diffusion might replace
  one of the other objects in the scene with an improvised version of O,
  because it believes that the description of O matches something else that
  is visible in B.
- If an object O is visible in both camera viewpoint A and camera viewpoint B,
  and the artist forgets to include a description of O in the prompt when
  generating an image from B, then O might appear completely different in A
  and B because, lacking a description of O, Stable Diffusion will essentially
  just improvise O.
Additionally, the way the text prompt works in Stable Diffusion is that text that appears earlier in the prompt is effectively treated as more significant when generating the image. It's therefore recommended that objects that are supposed to be more prominently displayed in the foreground be described first, and less significant objects in the background be described last.
We would therefore like to have a system that, for each object visible in the current camera viewpoint, returns a one-sentence textual description of that object. If an object isn't visible, then no description is returned. The textual descriptions must be returned in ascending order of distance from the camera; the nearest objects have their descriptions returned first.
Stable Diffusion is capable of re-contextualizing existing images. The diffusion process, in a pure text-to-image workflow, essentially takes an image consisting entirely of gaussian noise and progressively denoises the image until the result (hopefully) looks like an image described by the text prompt.
In an image-to-image workflow, the process is exactly the same except for the fact that the starting point isn't an image that is 100% noise. The input image is typically an existing drawing or photograph, and the workflow is instructed to introduce, say, 50% noise into the image, and then start the denoising process from that noisy image. This allows Stable Diffusion to essentially redraw an existing image, replacing forms and subjects by an artist-configurable amount.
For example, we can take the following rather badly-generated image:
We can feed this image into Stable Diffusion, and specify a text prompt that
reads two cups of coffee next to a coffee pot
. We can specify that we want
the process to introduce 100% noise into the image before starting the
denoising process. We get the following result:
While the image is certainly coffee-themed, it bears almost no relation to the original image. This is because we specified 100% noise as the starting point; it's essentially as if we didn't bother to supply a starting image at all. If we instead specify that we want 75% noise introduced, we get the following image:
This is starting to get closer to the original image. The composition is a little closer, and the colours are starting to match. Once we specify that we want 50% noise introduced, we get the following image:
The image is very close to the original image, but the details are somewhat more refined. If we continue by going down to, say, 25% denoising, we get:
The result is, at first glance, almost indistinguishable from the original image. Now, this would be fine if we were trying to achieve photorealism and the source image we started with was already photorealistic. However, if what we have as a source image isn't photorealistic, then going below about 50% denoising constrains the generation process far too much; the resulting image will be just as unrealistic as the source image.
We have a tension between specifying a noise amount that is low enough so that the system doesn't generate an image that's completely different from our source image, but high enough to give the system enough room to improvise and (hopefully) produce a photorealistic image from a non-photorealistic input.
To produce a consistent set of source images to render a room from multiple viewpoints, we'll need to build a 3D model in Blender that contains basic colours, forms, and some lighting, but that makes no real attempt to be photorealistic. We'll depend on Stable Diffusion to take those images and produce a consistent but photorealistic output. We assume that we can't start out with photorealistic images; if we already had photographs taken in a room from all the camera angles we wanted, why would we be using Stable Diffusion in the first place?
In practice, I've found that noise values in the range 50-55% will work. We'll come back to this later on.
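This article uses ComfyUI for the image-to-image pass, but as an illustration of the same idea, here's a rough sketch using the diffusers library instead. The model name, the file names, and the 0.5 strength value are assumptions for illustration; strength plays the role of the noise percentage discussed above.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load a Stable Diffusion 1.5 image-to-image pipeline (assumed model id).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# A hypothetical non-photorealistic Blender render used as the source image.
source = Image.open("blender-render.png").convert("RGB")

# strength is roughly the fraction of noise introduced before denoising:
# 1.0 effectively ignores the source image, 0.25 barely changes it.
result = pipe(
    prompt="two cups of coffee next to a coffee pot",
    image=source,
    strength=0.5,
).images[0]

result.save("img2img-result.png")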
As mentioned earlier, we want to be able to have Blender produce the bulk of our text prompt for us in an automated manner. We need to do the following:
1. Compute the set S of the objects that are actually visible from the current viewpoint.
2. Sort S such that the objects are ordered from nearest to furthest from the camera.
3. For each object in S, extract the assigned text description and concatenate the descriptions into a comma-separated list for direct use in Stable Diffusion.

An example objectClass.blend scene is provided that implements all of the above.
For each of the objects in the scene, we have to create a custom string-typed
Description
property on the object that contains our text prompt for that
object:
Note that we also specify a numeric, non-zero pass index for each object. Every object that has the same text prompt should have the same pass index, for reasons that will become clear shortly.
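These properties are set through the UI in the screenshot, but the same thing can be done from Blender's Python console. A small sketch, assuming the scene contains an object named "Sphere":

import bpy

# Assign the custom "Description" property (read by the script later in this
# article) and a non-zero pass index. "Sphere" is an assumed object name.
_sphere = bpy.data.objects["Sphere"]
_sphere["Description"] = "A sphere"  # custom string-typed property
_sphere.pass_index = 2               # written to the object index / AOV output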
Blender, unfortunately, doesn't provide an API for determining object visibility. My original line of thought was to take the frustum of the currently active camera in the scene, and calculate the set of objects that intersect that frustum. Unfortunately, this is more work than I'm willing to attempt in a clumsy language like Python, and it also wouldn't take into account whether objects in the frustum are occluded by other objects. If an object is completely hidden behind another object in the foreground, then it shouldn't appear in the text prompt.
Instead, I turned to a technique used in some forms of deferred rendering. Essentially, we render an image of the scene where each pixel in the image contains exactly one integer value denoting the numeric index of the object that produced the surface at that pixel. More specifically, the pass index that we specified for each object earlier is written directly to the output image.
For example, whilst the standard colour render output of the example scene looks like this:
The "object index pass" output looks like this (text added for clarity):
The pixels that make up the cone all have integer value 4. The pixels that
make up the sphere all have integer value 2, and so on. The pixels that make
up the background have integer value 0 (there's a tiny bit of the background
visible in the top-left corner of the image). We'll effectively ignore all
pixels with value 0.
The same scene rendered from a different viewpoint produces the following images:
Note that, from this viewpoint, the cone is almost entirely occluded by the cube.
Now, unfortunately, due to a limitation in Blender, the EEVEE realtime renderer doesn't have an option to output object IDs directly as a render pass. I'm impatient and don't want to spend the time waiting for Cycles to render my scenes, so instead we can achieve the same result by using an AOV Output in each scene material.
First, create a new Value-typed AOV Output
in the scene's View Layer
properties:
Now, for each material used in the scene, declare an AOV output node and attach the object's (confusingly named) Object Index to it:
Finally, in the Compositor settings, attach the AOV output to a Viewer Node:
The reason for this last step is that we're about to write a script to examine the AOV image, and Blender has an infuriating internal limitation where it won't give us direct access to the pixels of a rendered image unless that rendered image came from a Viewer Node.
We now have images that can effectively tell us which objects are visible in the corresponding coloured render images. Note that we have the same limitation here as in traditional deferred rendering: We can't handle translucent or transparent objects. This hasn't, so far, turned out to be much of a problem in practice.
We now need to write a small Python script that can take the rendered images and produce a prompt. The complete script is included as a text object in the included Blender scene, but I'll walk through the code here.
The first step is to get a reference to the currently active camera in the scene, and set up the various output files we'll be using:
import bpy
import math
import numpy

_output_directory = "/shared-tmp"

class ClassifiedObject:
    index: int
    distance: float
    description: str

_objects = []
_camera = bpy.context.scene.camera

print('# Camera: {0}'.format(_camera.name))

_output_props = '{0}/{1}.properties'.format(_output_directory, _camera.name)
_output_prompt = '{0}/{1}.prompt.txt'.format(_output_directory, _camera.name)
_output_samples = '{0}/{1}.npy'.format(_output_directory, _camera.name)

print('# Properties: {0}'.format(_output_props))
print('# Prompt: {0}'.format(_output_prompt))
print('# Samples: {0}'.format(_output_samples))
We'll hold references to visible objects in values of type ClassifiedObject.
We then iterate over all the objects in the scene and calculate the euclidean distance between the origin of each object and the camera.
def _distance_to_camera(_camera, _object):
    _object_pos = _object.location
    _camera_pos = _camera.location
    _delta_x = _camera_pos.x - _object_pos.x
    _delta_y = _camera_pos.y - _object_pos.y
    _delta_z = _camera_pos.z - _object_pos.z
    _delta_x_sq = _delta_x * _delta_x
    _delta_y_sq = _delta_y * _delta_y
    _delta_z_sq = _delta_z * _delta_z
    _sum = _delta_x_sq + _delta_y_sq + _delta_z_sq
    return math.sqrt(_sum)

for _o in bpy.data.objects:
    if _o.pass_index:
        _r = ClassifiedObject()
        _r.distance = _distance_to_camera(_camera, _o)
        _r.object = _o
        _r.index = _o.pass_index
        _r.description = _o['Description']
        _objects.append(_r)
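As an aside, Blender's mathutils vector type could express the same computation more compactly; this is just an equivalent sketch, not what the included script does:

# Equivalent distance computation using vector subtraction and .length.
def _distance_to_camera(_camera, _object):
    return (_camera.location - _object.location).length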
We then sort the objects we saved in ascending order of distance, so that
objects nearest to the camera appear earliest in the list, and we save the
objects to an external properties
file in case we want to process them
later:
_objects.sort(key=lambda o: o.distance)

with open(_output_props, "w") as f:
    for _o in _objects:
        f.write('objects.{0}.index = {0}\n'.format(_o.index))
        f.write('objects.{0}.description = {1}\n'.format(_o.index, _o.description))
        f.write('objects.{0}.name = {1}\n'.format(_o.index, _o.object.name))
        f.write('objects.{0}.distance = {1}\n'.format(_o.index, _o.distance))
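For the example scene, the resulting properties file looks something like the following; the object names and distances shown here are purely illustrative, not taken from a real render:

objects.2.index = 2
objects.2.description = A sphere
objects.2.name = Sphere
objects.2.distance = 6.8
objects.3.index = 3
objects.3.description = A cube
objects.3.name = Cube
objects.3.distance = 9.4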
We then get a reference to the image containing our integer object indices,
and save the pixels to a numpy array. Normal Python is too slow to process
images this large, so we need to do the processing using numpy to avoid
having to wait twenty or so minutes each time we want to process an image.
Using numpy, the operation is completed in a couple of seconds. The raw pixel
array returned from the viewer node is actually an RGBA image where each pixel
consists of four 32-bit floating point values. Our AOV output is automatically
expanded such that the same value will be placed into each of the four colour
channels, so we can actually avoid sampling all four channels and just use
the values in the red channel. The [::4] expression indicates that we start
from element 0, process the full length of the array, but skip forward by 4
components each time.
_image = bpy.data.images['Viewer Node']
_data = numpy.array(_image.pixels)[::4]
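To make the stride explicit, here's a tiny self-contained example (not part of the script) showing what [::4] does to a flattened RGBA buffer:

import numpy

# A 2x2 RGBA image flattened into 16 floats; every fourth element is a red value.
_rgba = numpy.arange(16.0)  # [R0, G0, B0, A0, R1, G1, B1, A1, ...]
print(_rgba[::4])           # [ 0.  4.  8. 12.]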
We now need to iterate over all the pixels in the image and count the number
of times each object index appears. Astute readers might be wondering why we
don't just maintain an int -> boolean
map where we set the value of a key
to True
whenever we encounter that object index in the image. There are two
reasons we can't do this.
Firstly, if an object is almost entirely occluded by another object, but a few stray pixels of the object are still visible, we don't want to be mentioning that object in the prompt. Doing so would introduce the problem discussed earlier where an object is erroneously injected into the scene by Stable Diffusion because the prompt said one should be in the scene somewhere.
Secondly, there will be pixels in the image that do not correspond to any object index. Why does this happen? Inspect the edges of objects in the AOV image from earlier:
Many pixels in the image will be some linear blend of the neighbouring pixels
due to antialiasing. This will occur even if the Viewport and Render samples
are set to 1 (although it will be reduced). This is not a problem in practice;
it just means that we need to round pixel values to integers, and consider
a threshold value under which objects are assumed not to be visible. We can
accomplish this as follows:
_visibility = {}
_visibility_max = _image.size[0] * _image.size[1]
_visibility_threshold = 0.005

for _x in _data:
    _xi = int(_x)
    if _xi:
        _c = _visibility.get(_xi) or 0
        _visibility[_xi] = _c + 1

print("# Processed.")
print('# Visibility: {0}'.format(_visibility))

_visible = set()
for _i in _visibility:
    _c = _visibility[_i]
    _f = float(_c) / _visibility_max
    if _f > _visibility_threshold:
        _visible.add(_i)

print('# Visible: {0}'.format(_visible))
We calculate _visibility_max as the total number of pixels in the image.
We declare _visibility_threshold to be the threshold value at which an object
is considered to be visible. Essentially, if the pixels of an object cover more
than 0.5% of the screen, then the object is considered to be visible. Each
time we encounter a pixel that has value xi, we increment the count for
object xi in the _visibility dictionary.

Finally, we iterate over the _visibility dictionary and extract, into the
_visible set, the indices of all objects that were considered to be visible,
given the declared threshold.
The last step simply generates the prompt. We already have the list of objects in distance order, and we know which of those objects are actually visible, so we can simply iterate over the list and write the object descriptions to a file:
with open(_output_prompt, 'w') as f:
    for _o in _objects:
        if _o.index in _visible:
            f.write('{0},\n'.format(_o.description))
This prompt file can be consumed directly by Stable Diffusion in ComfyUI using a text file node.
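For anyone not using a text file node, the prompt file can also be read and collapsed into a single prompt string with a few lines of Python; the camera name here is just the one from the example output below:

# Read the generated prompt file and join the lines into one comma-separated
# prompt string suitable for pasting into a prompt field.
with open("/shared-tmp/Camera.002.prompt.txt") as f:
    _parts = [line.strip().rstrip(",") for line in f if line.strip()]
print(", ".join(_parts))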
To run the script (assuming the script is in a text block in Blender's text
editor), we simply render the scene and then, with the text editor selected,
press ALT-P
to run the script.
This image:
... Produces this output on Blender's console:
# Camera: Camera.002
# Properties: /shared-tmp/Camera.002.properties
# Prompt: /shared-tmp/Camera.002.prompt.txt
# Samples: /shared-tmp/Camera.002.npy
# Processing 2073600 elements...
# Processed.
# Visibility: {1: 453103, 2: 481314, 3: 616150, 4: 85}
# Visible: {1, 2, 3}
# Saved.
Note that it correctly determined that objects 1, 2, and 3 are visible
(the floor, the sphere, and the cube, respectively). The cone was not considered
to be visible. The resulting prompt was:
A sphere, A cube, A wooden floor,
From the other viewpoint:
... The following output and prompt are produced:
# Camera: Camera.000
# Properties: /shared-tmp/Camera.000.properties
# Prompt: /shared-tmp/Camera.000.prompt.txt
# Samples: /shared-tmp/Camera.000.npy
# Processing 2073600 elements...
# Processed.
# Visibility: {1: 1144784, 2: 287860, 3: 408275, 4: 232240}
# Visible: {1, 2, 3, 4}
# Saved.
A sphere, A cone, A cube, A wooden floor,
Note how the cube and floor are mentioned last, as they are furthest away.
Using the above techniques, I built a small house in Blender, populated it with objects, and took the following renders:
No attempt was made at photorealism. I took care to only provide objects with minimal forms and textures (often just flat colours). I ran the images through an image-to-image workflow at approximately 52% denoising, and received the following images:
While by no means perfect, the room is at least recognizable as being the same room from different angles. With some more precise prompting and some minor inpainting, I believe the room could be made completely consistent between shots.
We must negate the machines-that-think. Humans must set their own guidelines. This is not something machines can do. Reasoning depends upon programming, not on hardware, and we are the ultimate program! Our Jihad is a "dump program." We dump the things which destroy us as humans!
— Frank Herbert, Children of Dune