Cloud deployment nirvana through Chef
I have a lot of appreciation for the amount of engineering that goes into building sophisticated tools for software deployment. A few years back, when I was working with Amazon, I learnt how large companies manage tasks like building and deploying complex, highly distributed software systems. The build system was called Bob and the deployment system was called Apollo. Though a little arcane in the beginning, once I got the hang of it I realized how critical it was for Amazon to have such a system in place. These two systems integrated with the developer's environment and exposed a seamless workflow from code change to production deployment. The developer checked in code with the right comments, scheduled a build and happily went on with more serious business (like playing TT!). Bob took responsibility for identifying all the packages that depended on this code change and building them all on specific hardware. If nothing broke, a shiny new version number would be given to the changed package. After this, the (poor) guy in charge of the production deployment had to ensure that he created a version-set with this version number for the changed package. That's it!
Well, obviously I have cut a lot of details to save space (and to not get sued by Amazon!), but the behaviour was roughly that. It seems like a straightforward job, but let me tell you that when you get to Amazon's scale of software development, keeping your systems working all the time is a mind-boggling exercise. Note that by scale, I mean the sheer amount of code that Amazon has to maintain. The scale of the problem that these libraries solved is a different story altogether.
I think you can say that a company has reached "escape velocity" when it has a good, scalable build+deploy system in place. Once it reaches this escape velocity, it is in a much better position to develop, maintain and grow complex code. I think Chef is helping Cloud computing reach that "escape velocity", the one that is needed to get onto the Cloud. First, some definitions. As far as I am concerned, Cloud computing comes with two guarantees: 1) Provisioning and releasing hardware is extremely quick, so scaling up to hundreds of machines or down to 1 (or 0) machines takes very little time. 2) The hardware cost to the company is exactly equal to what the company uses. This basically means that a company on the Cloud does not pay for 'unused' machines.
Let's take an example. Suppose I own a media house with a popular website. On a normal day, I get about 1 lakh (100,000) visits to my site. That's about 1 hit per second. However, on a day when one of my editors has broken a sensational story, the number of visits may jump to, say, 50 lakh (5 million), or even more if the news is exclusive! So, how should I provision hardware for my website? Let's say we begin with an architecture that consists of a front-end load-balancer, 2 app servers, and a database server, each residing on a dedicated host. This is our "normal" day setup. Now, assuming that the database is never going to be a bottleneck (a huge assumption), we scale the system horizontally by throwing more app servers between the load-balancer and the database server.
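To put rough numbers on the example above, here is a small shell sketch of the arithmetic. It assumes visits are spread evenly over the day (real traffic never is, so these are averages, not peaks), and the per-server capacity of 10 hits per second is a made-up figure used only to show the calculation.

```shell
#!/bin/sh
# Back-of-the-envelope traffic arithmetic for the media-house example.
# 1 lakh = 100,000. Assumes visits are spread evenly across the day.

SECONDS_PER_DAY=86400

# hits_per_second VISITS_PER_DAY -> average hits per second, two decimals
hits_per_second() {
    awk -v v="$1" -v s="$SECONDS_PER_DAY" 'BEGIN { printf "%.2f\n", v / s }'
}

# servers_needed VISITS_PER_DAY CAPACITY -> app servers needed, rounded up.
# CAPACITY (hits/sec one app server can sustain) is a hypothetical input.
servers_needed() {
    awk -v v="$1" -v s="$SECONDS_PER_DAY" -v c="$2" 'BEGIN {
        n = v / s / c
        r = int(n)
        if (r < n) r++      # round up to a whole server
        if (r < 1) r = 1    # always keep at least one server
        print r
    }'
}

echo "normal day:   $(hits_per_second 100000) hits/sec"
echo "breaking day: $(hits_per_second 5000000) hits/sec"
echo "app servers at 10 hits/sec each: $(servers_needed 5000000 10)"
```

The jump from roughly 1 hit per second to nearly 58 is a fifty-fold increase, which is why the number of app servers, and not just their size, has to change on a big news day.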
The only problem is that I need to provision extra hardware. This is when the men are separated from the boys! The Amazons and the Googles of the world do this with a snap of the fingers, but what about me? I place a request with my provider and he says that I will get it within 24 hrs, like I am supposed to feel good about it! And even if I get the hardware, by the time I wake up my ops guy and have him reconfigure my system, the traffic will have died down! Today's Cloud computing providers like Amazon and Rackspace solve the first problem (getting the hardware quickly) pretty well for me. But the second problem (configuring the new machines) is still a pain in the neck, unless I want my ops team to label me a "slave driver"! This is where tools like Chef come in. If you had configured your system using Chef, this is what you would do:
$> knife node create new.machine1
$> knife node create new.machine2
$> knife node run_list add new.machine1 "role[appsrv]"
$> knife node run_list add new.machine2 "role[appsrv]"
$> knife bootstrap new.machine1
$> knife bootstrap new.machine2
$> ssh root@new.machine1 "chef-client"
$> ssh root@new.machine2 "chef-client"
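With more than a couple of machines, the same four steps can be wrapped in a loop. The sketch below repeats the commands listed above for each host. The DRY_RUN switch and the function names are my own additions for illustration, not part of knife; by default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch: add several app servers in one pass, using the same commands
# as listed above. DRY_RUN is a hypothetical switch for this sketch:
# 1 (the default) prints each command, 0 actually executes it.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Register one host, give it the app server role, bootstrap it,
# and trigger a Chef run on it.
add_app_server() {
    host="$1"
    run knife node create "$host"
    run knife node run_list add "$host" "role[appsrv]"
    run knife bootstrap "$host"
    run ssh "root@$host" chef-client
}

for host in new.machine1 new.machine2; do
    add_app_server "$host"
done
```

Run it once as is to check the printed commands, then again with DRY_RUN=0 to execute them. The options knife needs for SSH users and keys depend on your setup and Chef version, so treat this as a starting point.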
That's it! Now, all that needs to be done is to tell the load balancer that there are two more app servers available to share the load. If you have a physical load balancer (proxy) doing this job for you, all you need to do is run "chef-client" on the load-balancer after changing the required configuration file.
An alternative approach would be to create machine images, like Amazon's AMI for EC2, after having configured a working system. But I find it tedious for the following reasons:
- It binds me to EC2 (or whoever has the image)
- I need to take a snapshot every time my configuration files change
- An AMI is of little use to fix problems with an existing system (e.g., if I want to "re-install" only the MySQL server)
- I need to take a snapshot every time I upgrade some part of my software (e.g., if I upgrade to Rails ver 3.2 from ver 3.0)
Chef takes a system admin problem and converts it into a development problem. This is because Chef does its stuff using what are called "Cookbooks". These Cookbooks have "Recipes" that describe how to go about building the system. The best part is that recipes can depend on other recipes, much like the way one library can depend on another in a software application. Recipes are written in a Ruby DSL, which means that you can insert programming logic when cooking up recipes!! Better still, Opscode (the company behind Chef) has created a platform where people can share and use each other's recipes. I think that Cookbooks and Recipes are the right abstractions for sharing system setup information. They are much more fine-grained than AMIs and thus more reusable. "Configuration as Code" and sharable Cookbooks create an ecosystem that, I think, lends itself to some very sophisticated configuration options. The relative ease with which this sophistication is achieved will, I think, accelerate the adoption of the Cloud.



