crush depth

A Low Budget Render Farm

I do a bit of 3D rendering now and then using Blender. Blender ships with two main rendering engines: Blender Internal and Cycles. Blender Internal is a 1990s-style CPU-based raytracer, and Cycles is a physically-based renderer that can be executed on either CPUs or GPUs. Cycles is technologically superior to Blender Internal in every way except one: its performance when running on CPUs is essentially hopeless. To get reasonable render times, you really need a powerful GPU with an appropriate level of driver support. I refuse to use proprietary hardware drivers, for both practical reasons and reasons of principle. This has meant that, to date, I haven't been able to use Cycles with GPU acceleration and have therefore stuck to using Blender Internal. As I don't aim for photorealism in anything I render (I prefer a sort of obviously-rendered pseudo-impressionist style), Blender Internal has been mostly sufficient.

However, I've recently moved to rendering 1080p video at 60fps. A 3.5 minute video at 60fps is 12,600 frames, so even if it only takes ~10 seconds to render a single frame, that's still about 35 hours of rendering time.

I have a few machines around the place here that spend a fair amount of time mostly idle. I decided to look for ways to distribute rendering tasks across machines to use up the idle CPU time. The first thing I looked at was netrender. Unfortunately, netrender is both broken and abandonware. I spent a good few hours spinning up VM instances and trying to get it to work, but it didn't happen.

I've been using Jenkins for a while now to build and test code from hundreds of repositories that I maintain. I decided to see if it could be useful here... After a day or so of experimentation, it turns out that it makes an acceptable driver for a render farm!

I created a set of nodes used for rendering tasks. A node, in Jenkins-speak, is an agent program running on a computer that accepts commands such as "check out this code now", "build this code now", etc. I won't bother to go into detail on this, as setting up nodes is something anyone with Jenkins experience already knows how to do. Nodes can be given labels so that jobs can be restricted to specific machines. For example, you could add a linux label to machines running Linux, and then a job for software that only builds on Linux could be set to run only on nodes labelled with linux. Basically, I created one node for each idle machine here and labelled them all with a blender label to distinguish them from the nodes I use to build code.

I then placed my Blender project and all of the required assets into a Git repository.

I created a new job in Jenkins that checks out the git repository above and runs the following pipeline definition included in the repository:

#!groovy

// Required: https://plugins.jenkins.io/pipeline-utility-steps

node {
  // Find all currently online nodes carrying the "blender" label.
  def nodes = nodesByLabel label:"blender"
  def nodesSorted = nodes.sort().toList()
  def nodeTasks = [:]
  def nodeCount = nodesSorted.size()

  // Build one task per node: each task checks out the repository on that
  // node and runs render.sh with the node's index and the total node count.
  for (int i = 0; i < nodeCount; ++i) {
    def nodeName = nodesSorted[i]
    def thisNodeIndex = i

    nodeTasks[nodeName] = {
      node(nodeName) {
        stage(nodeName) {
          checkout scm
          sh "./render.sh ${thisNodeIndex} ${nodeCount}"
        }
      }
    }
  }

  // Run all of the tasks in parallel; the job completes when every task has finished.
  parallel nodeTasks
}

This uses the pipeline-utility-steps plugin to fetch the list of online nodes with a particular label from Jenkins. I make the simplifying assumption that all online nodes with the blender label will be participating in render tasks. I assign each node an index, and for each node I create a task that checks out the repository and runs the render.sh shell script from it. The tasks are executed in parallel, and the job is complete once all of the subtasks have run to completion.

The render.sh shell script is responsible for executing Blender. Blender has a full command-line interface for rendering images without opening a user interface. The main command line parameters we're interested in are:

blender \
  --background \
  coldplanet_lofi.blend \
  --scene Scene \
  --render-output "${OUTPUT}/########.png" \
  --render-format PNG \
  --frame-start "${NODE_INDEX}" \
  --frame-end "${FRAME_COUNT}" \
  --frame-jump "${NODE_COUNT}" \
  --render-anim

The --frame-start parameter indicates at which frame the node should start rendering. The --frame-end parameter indicates the last frame to render. The --frame-jump parameter indicates the number of frames to step forward each time. We pass in the node index (starting at 0) as the starting frame, and the number of nodes that are participating in rendering as the frame jump. Let's say there are 4 nodes rendering: Node 0 will start at frame 0, and will then render frame 4, then frame 8, and so on. Node 1 will start at frame 1, then render frame 5, and so on. This means that the work will be divided up equally between the nodes. There are no data dependencies between frames, so the work parallelizes easily.
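
For completeness, here's a minimal sketch of what render.sh might look like. The node index and node count arrive as the two arguments passed by the pipeline above; the frame count and output directory are hypothetical values that depend on the project:

#!/bin/sh -e

# Arguments passed by the Jenkins pipeline: this node's index (starting at 0)
# and the total number of nodes participating in rendering.
NODE_INDEX="$1"
NODE_COUNT="$2"

# Hypothetical values; the real frame count and output directory depend
# on the project.
FRAME_COUNT=12600
OUTPUT="frames"

mkdir -p "${OUTPUT}"

blender \
  --background \
  coldplanet_lofi.blend \
  --scene Scene \
  --render-output "${OUTPUT}/########.png" \
  --render-format PNG \
  --frame-start "${NODE_INDEX}" \
  --frame-end "${FRAME_COUNT}" \
  --frame-jump "${NODE_COUNT}" \
  --render-anim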

When Jenkins is instructed to run the job, all of the machines pull the git repository and start rendering. At the end of the task, they upload all of their rendered frames to a server here in order to be stitched together into a video using ffmpeg. Initial results seem promising. I'm rendering with three nodes, and rendering times are roughly 30% of what they were with just the one node. I can't really ask for more than that.
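
For reference, the stitching step is roughly the following ffmpeg invocation (a sketch: the eight-digit %08d pattern matches the ######## placeholder used above, and the codec and output filename are just examples):

ffmpeg \
  -framerate 60 \
  -i 'frames/%08d.png' \
  -c:v libx264 \
  -pix_fmt yuv420p \
  coldplanet_lofi.mp4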

There are some weaknesses to this approach:

  1. Rendering can't easily be resumed if the job is stopped. This could possibly be mitigated by building more intelligence into the render.sh file (see the sketch after this list), but maybe not.

  2. Work is divided up equally, but nodes are not equal. I initially tried adding a node that was much slower than the others to the pool. It was assigned the same amount of work as the other nodes, but took far longer to finish. As a result, the work actually took longer than it would have if that node hadn't been involved at all. In fact, in the initial tests, it took longer than rendering on a single node!

  3. There's not really a pleasant way to estimate how much longer rendering is going to take. It's pretty hard to read the console output from each machine in the Jenkins UI.
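
On the first point: one possible way to build that intelligence into render.sh would be to drop --render-anim and instead loop over this node's frames explicitly, skipping any frame whose output file already exists. A minimal sketch, assuming the previously rendered frames are still present on the node and using the same variables as the render.sh sketch above:

frame="${NODE_INDEX}"
while [ "${frame}" -le "${FRAME_COUNT}" ]
do
  file=$(printf "%s/%08d.png" "${OUTPUT}" "${frame}")
  if [ ! -f "${file}" ]
  then
    blender \
      --background \
      coldplanet_lofi.blend \
      --scene Scene \
      --render-output "${OUTPUT}/########.png" \
      --render-format PNG \
      --render-frame "${frame}"
  fi
  frame=$((frame + NODE_COUNT))
done

Launching Blender once per frame does add scene-loading overhead, so this trades a bit of speed for resumability.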

Finally: All of this work is probably going to be irrelevant very soon. Blender 2.8 has a new real-time rendering engine, EEVEE, which should presumably mean that rendering 3.5 minutes of video will take me 3.5 minutes.