Multi-Node Installs with Chef

While using Chef, I've come across a few workflows that don't seem to "just fit." One of those is coordinating installation of applications that require multiple nodes. By that I mean those situations that some are now calling "legacy" environments: clustered (or something similar) application stacks that are designed to run on multiple servers, and which have not been modernized into something like microservices. They exist in great numbers still, and are widely used whether we like it or not.

What I mean by those things not fitting into the Chef workflow is that Chef is agent based, and is concerned with the defined configuration of the thing on which it is running (i.e. a server). Since multi-node application installations often rely on a very strict set of steps taken on each server, each often dependent on the previous step having been completed, the standard idea of a chef-client run falls short. For instance, consider a generic workflow for installing an IBM Websphere cell at a fairly basic level (keep in mind the specific application isn't the point, and the words thing, doodad, whatchamacallit, or any other generic term can be substituted for brand specific nomenclature). Also keep in mind that this installation is part of the initial OS provisioning, so that the OS and application installations are done "at build time."

  1. serverA: Deployment Manager is installed, this is the cell's controller
    a. Websphere is installed
    b. a Deployment Manager profile is created and started, listening on the desired port
  2. serverB-serverE:
    a. Websphere is installed
    b. an application profile is created and federated, or linked to the deployment manager of the cell (dependent on 1a)
  3. serverA: cell security is established
    a. global security rules are applied (this requires that 2b is complete on all application nodes)
  4. serverF: Webserver is installed, not dependent on any other step

The above flow is one that is easy to imagine being done manually. You can even see it being pretty simple to accomplish with a single, local script executing some other scripts on the servers remotely in a controlled manner. Doubtless the words Ansible, Salt, or [preferred agentless configuration manager] will come up during a discussion of how best to automate. What isn't always obvious is how to accomplish this with Chef as the preferred automation platform. Indeed, most conversations around how to do such things with Chef end up at the phrase "eventual convergence." I am not a fan of relying on eventual convergence. To me, it sounds like making the best out of an imperfect situation, or "just let it run a few times and it'll work out." It's selling the product and the person writing the code short. It creates a situation where environments might start using multiple products (i.e. Chef for the node specific stuff, and Salt for the multi-node coordination) where one product will do. I am a total believer in using the right tool for the task, but I also think that if you have a tool that will do the job just fine, there is no need to make things more complicated by using something else as well.

In order to overcome the immediate issue of Chef not handling multi-node dependencies super well out of the box, I've had to use a few tricks which turned out to be really easy once I was able to wrap my head around how to use them. Also, I broke them down into two categories: waiting on, and waiting for.

Waiting On

I look at 'waiting on' actions as those things which are dependent on something being remotely available. Here are a few examples:

  • A port on a remote server being open/closed
  • A 200 response on a webpage
  • Specific content on a webpage
  • A successful SSH connection to a remote server
  • A remote server no longer responding to pings

There could be many other examples, but it boils down to waiting for some action to happen, a resource to become available, or content to be present on a remote server or other actor. By definition, this is something which is totally out of the control of the local chef-client run, though it may all be part of a multi-node application install (as is the topic of this post). The important distinction is that these things cannot be assumed present because of the resource convergence happening in the recipe(s) running locally, and so there must be some stop and wait procedures worked into the code in some way. More on that in a bit.
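To make one of the bullets above concrete, here is a minimal sketch of a "waiting on" check for a 200 response on a webpage. It is plain Ruby using only the standard library. The method name `remote_http_ok?` and its `attempts`, `delay`, and `timeout` parameters are made up for illustration, and the defaults are arbitrary, so treat it as a starting point rather than a finished helper.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: poll a URL until it answers with HTTP 200.
# Returns true on the first 200 response, false once every attempt is used up.
def remote_http_ok?(url, attempts: 5, delay: 10, timeout: 2)
  uri = URI.parse(url)
  attempts.times do |i|
    begin
      response = Net::HTTP.start(uri.host, uri.port,
                                 use_ssl: uri.scheme == 'https',
                                 open_timeout: timeout,
                                 read_timeout: timeout) do |http|
        http.get(uri.request_uri)
      end
      return true if response.code == '200'
    rescue SystemCallError, SocketError, Timeout::Error, EOFError
      # Connection refused, host unreachable, DNS failure, or timeout:
      # the remote side is simply not there yet, so fall through and retry.
    end
    # Pause between attempts, but not after the last one.
    sleep delay unless i == attempts - 1
  end
  false
end
```

Checking for specific content on the page would be a small variation: compare `response.body` against the expected text instead of (or in addition to) the status code.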

Waiting For

I see 'waiting for' actions as those things which are dependent on something that will become available on the server where the chef-client run is happening. The list of things that this can encompass could contain:

  • A file being present, which is the result of a remote machine placing it there
  • A command which will only show the proper status after some outside resource has interacted with the underlying service. Websphere application nodes federating with the Deployment Manager (DMGR) node, or step 2 from above, is an example of this. A command can be run on the DMGR to show how many application nodes are federated, and that can be used to delay actions which require all nodes be properly federated before they are applied.
  • Anything else that is the result of a remote thing interacting with the local server

I want to break the list there, because it's worth pointing out that even in a multi-node install, most things not requiring the action of an outside resource should be taken care of in the flow of the recipe(s) itself. For instance, if I want to create and start a DMGR profile on a Websphere server, I have to first install Websphere and then I can do the profile work. The installation step should be reasonably idempotent, such that when my recipe needs to create the profile, it can assume that Websphere is installed just by virtue of making it to that step, because the install was an idempotent resource placed before the profile creation.

That being (long-windedly) said, there are also other instances where a 'waiting for' pause might be needed:

  • A service responding in a specific way. While it's not ideal, not all services are actually functional when they report started, and so getting the proper response might be more important than the start script finishing and returning 0.
  • Waiting on asynchronous configuration items to be legitimately complete. Similar to the one above, some configuration commands initiate an asynchronous action, but immediately return true (cough IBM HMC cough)
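A "waiting for" check usually comes down to re-running a local test until it passes or a deadline is reached. The following is a rough sketch of that idea in plain Ruby. The name `wait_for`, its parameters, and the marker file path in the comment are all hypothetical and not part of Chef or of any cookbook mentioned here.

```ruby
# Hypothetical helper: re-evaluate the given block until it returns a truthy
# value or the timeout (in seconds) has elapsed. Returns true or false.
def wait_for(timeout: 300, interval: 10)
  # A monotonic clock is not affected by system clock adjustments.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Example use (made-up path): wait up to 10 minutes for a file that a remote
# machine is expected to place on this server.
# wait_for(timeout: 600, interval: 15) { File.exist?('/tmp/remote_done.marker') }
```

The block can hold any local test, such as parsing the output of a status command, so the same helper covers the file and command cases in the lists above.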

Stop and wait

So then, in order to do multi-node coordinated cookbooks, there has to be a mechanism to stop and wait for the desired thing to happen. Given that, there are some tips to help in deciding if we can turn these uncontrollable (from the perspective of the running recipe) actions into idempotent checks:

  • The thing must be queryable in some way. The tests can be pretty much anything you imagine. Testing connectivity to a port, HTTP response codes, SSH connectivity validation, pinging a remote host, and parsing content of a remote file/webpage are all good examples of queryable stuff.
  • There must be some timeframe for availability. A chef-client will not run indefinitely, and even if it would, that is a bad way to code things. There must be some reasonable timeframe for availability of the resource in question. If there is no way to know when the resource will be available, then eventual convergence becomes your friend.
  • It must flow logically. That may be a statement that is taken for granted, but when checking for resources out of the control of the recipe, there must be some logical place where that resource querying fits. If, for any reason, the check seems random or ambiguous, the code suffers in readability. If this happens, check if the resource really has to be in a specific state. Remember, a human can wait on something out of habit, when that thing really isn't necessary. Code, on the other hand, should be more logical than human brains! For instance, I may want to wait until I am sitting at the table to pour water into my glass, because that's how I've always done it. But if there is no other reason than habit, it's not logical to make someone else do the same thing, and so I can go ahead and let the "pour water" code proceed without creating a "stop and wait for person to sit down" block.

Once there is a valid situation for stop and wait, how can that be integrated into Chef code? Thankfully, the answer to that is "lots of ways." These types of checks are going to be ruby code inserted somewhere into the recipe(s), and given that, they can be inserted wherever ruby code blocks are available in Chef:

  • ruby_block resource
  • not_if
  • only_if
  • lazy
  • custom resource

There may be others, but those are the big ones. Here are a couple of quick examples. They aren't complex, or even necessarily the best solution to the scenarios. The intent is just to show a few things that can be done.

  1. I know that I have to wait for exactly 30 seconds after a service is started to continue on to the next step. So I can do something like a ruby_block that just does the wait. It can come after the resource requiring the wait, but would more logically be a notified or subscribed resource link.
service 'sluggish_service' do
  action :start
  notifies :run, 'ruby_block[sleep 30 seconds]', :immediately
end
	
ruby_block 'sleep 30 seconds' do
  block do
    sleep(30)
  end
  action :nothing
end
  2. In order to successfully perform an action on the local server, I need to wait on a port that should be available on a remote server first. Since this imaginary problem happens to come up in a custom resource, I can create a function that can be pretty easily used inline, like so:
...
if remote_port_open?(new_resource.remote_host, new_resource.remote_port)
  [some ruby code that does stuff]
end
...
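require 'socket'
require 'timeout'

# Try to open a TCP connection to host:port up to 5 times, 30 seconds apart.
# Returns true as soon as a connection succeeds, false if every attempt fails.
def remote_port_open?(host, port)
  attempts = 5
  attempts.times do |i|
    begin
      # Give each connection attempt at most 1 second.
      Timeout.timeout(1) do
        TCPSocket.new(host, port).close
        return true
      end
    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Timeout::Error
      Chef::Log.warn("#{host}:#{port} not yet available")
    end
    # Pause before retrying, but not after the final attempt.
    sleep 30 unless i == attempts - 1
  end

  false
end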

Those are just some basic examples with one possible solution each. The trick is to use the method which is best for each situation. For instance, the code for waiting on a port could be embedded in an only_if {} block, maybe something like the following, which would run a script only when the local node can confirm the remote port is open. The actual code could be a single check, or a looping check like above.

execute 'a registration script' do
  command '/usr/local/bin/register_to_remote.sh'
  only_if { [ruby code to check port] }
  not_if '/usr/local/bin/check_registration.sh'
end
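For the single-check variant, one possibility is a small helper built on `Socket.tcp` from the Ruby standard library, which accepts a `connect_timeout` option. This is only a sketch: the name `port_open_now?` is invented here, and the host and port in the comment are placeholders, not values from the installation described above.

```ruby
require 'socket'

# Hypothetical single-shot check: true if a TCP connection to host:port can be
# opened within the timeout (in seconds), false otherwise. No retries.
def port_open_now?(host, port, timeout: 2)
  # Socket.tcp closes the socket when the block returns and hands back
  # the block's value.
  Socket.tcp(host, port, connect_timeout: timeout) { |_socket| true }
rescue SystemCallError, SocketError
  # Refused, unreachable, timed out, or the name did not resolve.
  false
end

# In a guard this could read (placeholder host and port):
# only_if { port_open_now?('remote.example.com', 8080) }
```

Because it never sleeps, a check like this keeps the guard cheap and leaves the retrying to the next chef-client run.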

Fail only when necessary

As a final note, regardless of where I put the code, I always try to avoid outright fails when a condition isn't met. It's best not to make a chef-client run fail if it isn't necessary. Since the idea is to coordinate installs in a short time frame, there are times when employing the "stop and wait" method is very helpful. However, if the timeframes in the waits are overrun for some reason, it's better to allow the convergence to fall back into the eventual.

As an example, if I am waiting on another server to complete its steps before I can register my local node to it, I may want to stop and wait for that port to come open. However, since this may be the first chef-client run, I don't want it to fail if the remote server is slow. A failed first run can leave things like the node data not being saved back to the Chef server, which messes up future runs as well. If the run is allowed to finish instead, the local portion of the installation will be delayed, but will pick back up where it left off (assuming I've coded my cookbook correctly) the next time chef-client runs on the local node. If you have lots of back and forth with dependencies, as in the example installation flow at the top of this post, an eventual method could take some significant time, but it's better than failing completely if things are slow at one step or another for some reason. The Websphere installation mentioned takes about 15 minutes altogether (including OS provisioning), but if things are slow and it goes into the eventual convergence method, it may not finish for another 30 minutes or so (because of the chef-client run interval), but it will eventually finish without further human intervention.
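One way to express "don't fail, just defer" in code is to wrap the readiness check so that both a false result and an unexpected error become a logged warning and a `false` return value. The sketch below is plain Ruby, with the standard `Logger` standing in for `Chef::Log`. The name `ready_or_defer` is an assumption for illustration, not an existing Chef helper.

```ruby
require 'logger'

# Hypothetical wrapper: run a readiness check and never raise.
# Returns true when the check passes. Otherwise logs a warning and returns
# false so the caller can skip the dependent step until the next run.
def ready_or_defer(description, logger: Logger.new($stderr))
  return true if yield

  logger.warn("#{description} not ready; deferring to the next chef-client run")
  false
rescue StandardError => e
  # A broken check should delay the step, not fail the whole run.
  logger.warn("#{description} check raised #{e.class}: #{e.message}; deferring")
  false
end
```

Used inside an only_if block, a wrapper like this lets the dependent resource be skipped on a slow run and picked up on a later one, which is the fallback to eventual convergence described above.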

That is my take on coordinating multiple nodes with Chef. Hopefully it helps, or at least doesn't hurt! Let me know in the comments or on Twitter what you think.