We've already talked about why we chose Amazon Web Services to host the beta of GOV.UK, and how we approached some of the decisions that came with it. Cloud hosting gives us the flexibility to rapidly iterate our infrastructure as the product develops and to test how it performs, but it brings its own set of challenges.
When building a system that you're going to need to support, it's vital that you can keep track of how it's configured, so you can repeat its setup as you scale up, or recover if something goes wrong. In an environment where you're regularly adding and removing servers you need to be able to do that especially quickly and reliably.
The beta of GOV.UK is most definitely a beta; the requirements for handling scaling and resilience are looser than they'd be in a full "production" system. But we still need to have started down a path of understanding how we're going to run this thing for real and build on what we learned in the alpha phase.
In developing Alpha.gov.uk we made use of Puppet, a configuration management system that lets us describe in code how we want our servers configured. A developer can lay out the set of components that might run on a server, along with the specific configuration to apply, and then group those components into classes of machines.
As the team and the number of servers required grow, this kind of system becomes a vital part of preserving knowledge: the server setup is all described in code, so a new team member can read that rather than dig around the servers looking for all the relevant configuration files. It also means that as configurations change or are tuned, those changes can quickly be applied to all the relevant servers.
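As a sketch of what that looks like in practice (the names here are illustrative, not GOV.UK's actual manifests), a Puppet class describing a single component might read:

```puppet
# Illustrative only: a simple component class, not GOV.UK's real manifests.
class nginx {
  package { 'nginx':
    ensure => installed,
  }

  # Template the main config; a change here rolls out to every server
  # that includes this class on its next Puppet run.
  file { '/etc/nginx/nginx.conf':
    ensure  => file,
    content => template('nginx/nginx.conf.erb'),
    require => Package['nginx'],
    notify  => Service['nginx'],
  }

  service { 'nginx':
    ensure  => running,
    enable  => true,
    require => Package['nginx'],
  }
}
```

Because the manifest declares the desired end state rather than a sequence of commands, reading it tells a new team member what a server should look like, and Puppet does the work of making reality match.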
Setting out to work on the beta we knew that we wanted to continue with Puppet, but that we needed to be more disciplined in our use of it than we had been up to that point. There's a bigger team, more moving parts, and the potential for a lot more traffic, which will likely mean more servers. Where previously we had used some cobbled-together scripts to manually apply Puppet updates, we now have 'puppet masters': servers storing the latest configuration, which all the others consult for updates. And the way we define the components has been more cleanly divided up, so that we can be much more specific in targeting individual servers' configuration.
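That cleaner division means component classes can be composed into machine classes on the puppet master. A sketch of how node definitions target servers by name (again, the class names are illustrative rather than our real configuration):

```puppet
# Illustrative only: grouping component classes into machine classes,
# matched against server hostnames by the puppet master.
node /^cache-\d+$/ {
  include varnish
  include monitoring_client
}

node /^frontend-\d+$/ {
  include nginx
  include app_server
  include monitoring_client
}
```

A new server named `cache-3` would automatically pick up the cache configuration on its first check-in with the master.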
As well as the 'puppet masters' we've built GDS Provisioner, a provisioning tool that quickly launches a new server instance, declares the type of server it should be (e.g. "cache", "frontend", "support", "backend") and registers it with the puppet master so it gets configured properly. It's built on top of the excellent Fog library, which provides a standard interface to lots of different hosting options, but is customised to provide focused commands that let us get up and running very quickly. Again, it's worth stressing that the provisioning tool doesn't lock us in to our current hosting provider.
We've been working a lot on the resilience of the infrastructure recently (a task that will, of course, continue for a long time to come) and it became apparent that we should have at least two more cache servers. Setting them up was a few minutes' work: set the provisioner running, then check that the new servers were running properly. Unfortunately, owing to the way the provisioning code has evolved, it's one of the few pieces of work that we can't yet open source. However, it's very simple to use and we're hoping that before long we'll find the time to tidy it up and release it to the world at large.
The combination of Puppet and our provisioning tool makes it very easy for us to manually build out our infrastructure, but that's really just the tip of the iceberg. Over the public beta period we'll be thinking about how these tools tie in with monitoring so the system can scale seamlessly, starting and stopping server instances as demand requires and providing a detailed audit trail of changes. We'll also need to spend time automating the "orchestration" of our servers to ensure that we can respond very rapidly to new requirements and situations.
James Stewart is a Technical Architect at GDS. You can follow @jystewart on Twitter.