Back in 2012, Quid decided to invest in automating our infrastructure with Chef. A few years later, we have around 40 machine types spread over 400 machines powering our development and production environments. Thanks to our now well-developed cookbooks, we can rest easy should any machine need to be recreated. With over 3 years of experience under our belts, we’d like to share some of the ups and downs of cooking with Chef.

The ups

Happy chef

Attributed to Elliott Brown (https://www.flickr.com/photos/ell-r-brown/)

  • The community artifacts
    When Chef first started gaining real popularity, the technology was rough around the edges. A community latched on anyway and has produced methodologies and artifacts that influenced not only Chef development but likely most configuration management tools to come. The Chef community produces and maintains an impressive number of cookbooks covering anything remotely popular. Resources like Bryan Berry’s Food Fight Show, Jamie Winsor’s environment pattern, and Chef’s conference talks heavily shaped Quid’s Chef practices. We are forever grateful.
  • Flexible and powerful fundamentals
    Even if you do not appreciate all the cooking analogies, Chef has done a very good job of providing flexible yet powerful capabilities with its server, environment, node, and cookbook setup. We can easily pull from all of Chef’s resources, such as the community cookbooks, without compromising on our desired workflow and infrastructure setup. We have built in-house concepts like stacks, clusters, and environment-aware data bags on top of Chef. Our recipe setup will easily map to containers as we start exploring technologies like Docker. Chef also provides many ways to dynamically override nearly anything, from a single node to an entire environment, allowing us to provision individual machines or whole clusters just the way we like.
  • Documentation
    Chef has very thorough and well-organized documentation. We reference it many times each day. Questions the documentation does not address have often already been answered on a site like Stack Overflow by the Chef developers themselves.
  • Stability
    We have had hundreds of nodes hitting our Chef server every 20 minutes for years with barely a hiccup.  Some companies have many thousands.  We worry about a lot of things in our infrastructure, but the reliability of the Chef server and client is not one of them.

The middle

Mediocre chef

Attributed to Mike Mozart (https://www.flickr.com/photos/jeepersmedia/)

  • Outgrowing the community cookbooks
    Using the community cookbooks gave us a huge jump start on provisioning the wide variety of software we use at Quid. However, each community cookbook we have pulled into our Chef server has added a small amount of tech debt. We are slowly replacing each community cookbook with our own equivalent. The most common (but not only) reasons we replace a cookbook:

    1. The community cookbook does not have the configuration flexibility we require.
    2. We increasingly dislike relying on flaky external internet resources which the community cookbook makes difficult to override.
    3. The community cookbook pulls in too many dependencies or dependencies which are incompatible with other cookbooks in our infrastructure.
    4. The conventions of the community cookbook deviate enough from our now well-established in-house conventions that debugging becomes confusing for developers who are less familiar with Chef.
  • Berkshelf
    Berkshelf fundamentally transformed the way we develop cookbooks, and for the better. Thanks to Berkshelf, we can treat each cookbook as a completely independent repository and development environment. This is important for allowing each cookbook to iterate somewhat independently. We cannot imagine working on cookbooks without something like Berkshelf. However, Berkshelf is a quirky and sometimes awkward middleman between the Chef client and server. Chef clearly recognizes the importance of Berkshelf by incorporating the software into the ChefDK, but we feel Chef should take this one step further and make the useful features of Berkshelf native to Chef while ditching the quirks which have caused our developers trouble and confusion (e.g. the separate API server, local cookbook caching, lock files, and differing dependency resolution).

The downs

Attributed to Kenny Louie (https://www.flickr.com/photos/kwl/)

  • Cookbook dependencies
    Each cookbook can declare a set of cookbook dependencies in its metadata. Dependencies can only be declared for the cookbook as a whole, not for individual recipes, and that restriction ends up being incredibly frustrating.

    1. Most community cookbooks try to work on a wide variety of environments. This means they may depend on the ‘yum’ and ‘apt’ cookbooks or the ‘aws’ and ‘rackspace’ cookbooks to cover all scenarios. Even though you may only be targeting CentOS and Rackspace, using the community cookbooks as is means pulling in all of their dependencies. And all of those cookbooks’ dependencies. This adds up quickly and invites dependency conflicts over cookbooks you technically do not need.
    2. Not having dependencies at a more granular level prevents us from moving cookbooks with more independent/modular recipes forward at different rates. When a cookbook dependency is updated, every recipe in the cookbook needs to be compatible with the new version. In a production setting, this can create overhead at the worst of times.
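
To get a feel for how quickly cookbook-level dependencies add up, here is a minimal sketch in plain Ruby that walks a dependency graph. The graph is entirely made up for illustration (the `my_app` and `database` cookbooks and every edge are assumptions, not real cookbook metadata), and `transitive_dependencies` is a hypothetical helper, not part of Chef or Berkshelf.

```ruby
require 'set'

# Hypothetical dependency graph: cookbook name => cookbooks it depends on.
# Every edge here is invented for illustration only.
DEPENDENCIES = {
  'my_app'    => ['database'],
  'database'  => ['yum', 'apt', 'aws', 'rackspace'],
  'yum'       => [],
  'apt'       => [],
  'aws'       => ['ohai'],
  'rackspace' => [],
  'ohai'      => []
}.freeze

# Collects every cookbook needed to use `cookbook`, following
# dependencies of dependencies depth-first.
def transitive_dependencies(cookbook, graph, seen = Set.new)
  graph.fetch(cookbook, []).each do |dep|
    next if seen.include?(dep)

    seen << dep
    transitive_dependencies(dep, graph, seen)
  end
  seen
end

needed = transitive_dependencies('my_app', DEPENDENCIES)
puts "my_app pulls in #{needed.size} cookbooks: #{needed.to_a.sort.join(', ')}"
```

In this made-up graph, `my_app` declares a single dependency yet ends up requiring six cookbooks, including `apt` and `aws`, which a node targeting only CentOS and Rackspace would never use.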
  • Learning curve
    Spend a year or so in Chef’s world and you become incredibly proficient.  There is no configuration challenge which intimidates you!   Unfortunately, we have learned this proficiency comes with one hell of a learning curve.  Some of the leading sources of confusion:

    • Ruby
      Ruby has enough syntactical quirks (e.g. `my_hash[:my_key]` vs. `my_hash['my_key']`, or `my_function arg1` vs. `my_function(arg1)`) to leave a developer new to Ruby scratching their head. Throw in learning Chef’s DSL and cooking analogies at the same time, and you end up with frustrated developers having a hard time distinguishing between convention, necessity, and magic.
    • Two pass Chef run
      A Chef run works through recipes in two passes. The first pass compiles all the Ruby code and DSL into in-memory executable resources. The second pass executes the compiled resources in order (convergence). This is a gotcha minefield, as even experienced Chef developers regularly overlook whether code runs at compilation or at convergence. Newer developers have little chance of immediately absorbing the two passes and will mess up ordering at least a handful of times.
    • Debugging
      Even once you get the hang of Ruby and start pumping out resources, the next challenge arrives when a Chef run fails. The promise of something as simple as infrastructure as code falls apart when a resource fails to compile or execute. The errors coming from Chef are often cryptic or outright misleading. If something like a service fails to start, you are digging through logs in a terminal and need an understanding of a supervisor like runit. This is all well and good for somebody with sysadmin experience, but absolutely miserable for a developer just trying to automate their stack.
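
The compile-versus-converge gotcha can be reproduced without Chef. The sketch below is a toy model in plain Ruby, not Chef's implementation: `ToyRun`, `resource`, `compile`, and `converge` are hypothetical names, and `state` is a hash standing in for the machine being configured.

```ruby
# Toy model of a two-pass run. This is NOT how Chef is implemented;
# all class and method names here are hypothetical.
class ToyRun
  attr_reader :log

  def initialize
    @resources = [] # the collection of resources built during compilation
    @log = []       # records the order in which things actually happen
  end

  # Declaring a resource only queues its action; nothing executes yet.
  def resource(name, &action)
    @resources << [name, action]
  end

  # Pass 1: evaluate the recipe. Plain Ruby in the recipe runs right now.
  def compile(&recipe)
    instance_eval(&recipe)
  end

  # Pass 2: execute the queued resources in declaration order.
  def converge
    @resources.each { |_name, action| action.call }
  end
end

state = {} # stands in for the machine being configured
run = ToyRun.new

run.compile do
  resource('create config') do
    state[:config] = 'written'
    log << 'converge: config written'
  end

  # Plain Ruby: evaluated during compilation, before any resource has run.
  log << "compile: config present? #{state.key?(:config)}"

  resource('start service') do
    log << "converge: config present? #{state.key?(:config)}"
  end
end

run.converge
puts run.log
```

The plain Ruby line reports `false` even though it appears after the resource that writes the config, because it runs while the recipe is still being evaluated. The second resource reports `true` because it executes during convergence. In real recipes, the usual remedies are to move such code into a `ruby_block` resource or to wrap a property value in `lazy`, both of which defer evaluation to convergence.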
  • Drive-by developers struggle
    At Quid, our Infrastructure team strives to enable our developers to handle their stacks from inception to production. This means we want each team to write and maintain their own cookbooks. This allows Infrastructure to focus on enabling others and providing tools rather than bottlenecking a release to production. Despite the aforementioned Chef documentation and our own heavy investment in examples, conventions, tutorials, and constant question answering, it’s often not enough. Our application developers often take an order of magnitude more time on changes than somebody from the Infrastructure team. The application developers simply do not have the time or the need to get super familiar with Chef. We continue to look for more newbie-friendly approaches to Chef, system provisioning, and configuration management.
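
The Ruby quirks listed under the learning curve are a typical stumbling block for developers who only touch cookbooks occasionally. Here is a minimal plain-Ruby sketch of the two mentioned there; `my_hash` and `my_function` are the placeholder names from that example, and the values are made up.

```ruby
# In a plain Ruby Hash, a symbol key and a string key are two different keys.
my_hash = { my_key: 'from symbol', 'my_key' => 'from string' }
puts my_hash[:my_key]   # looks up the symbol key
puts my_hash['my_key']  # looks up the string key, a separate entry
puts my_hash.size       # both entries coexist

# Parentheses around method arguments are optional.
def my_function(arg1)
  "got #{arg1}"
end

puts my_function 'a'
puts my_function('a')
```

Chef's node attributes accept either key form, which is part of why the distinction is easy to miss until a plain Hash shows up in a recipe or library.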

Overall, we’re relatively happy with Chef, especially given the choices we had at the start of our investment.   We also feel like Chef’s flexibility has put us in a good position to explore and incorporate all of the recent excitement in developer operations.  We’re looking forward to seeing how technologies like Docker, Mesos, and Kubernetes can take our Chef work to the next level.  Here’s to cooking up awesome in 2016!


Interested in helping us solve awesome problems? If so, then head over to our careers page!