Recently, I interviewed with Airbnb for a Site Reliability Engineer role. Airbnb are a really interesting and exciting company with one of the most beautifully designed websites I’ve ever seen. I’ve been aware of them for quite some time as I currently work for Lonely Planet and also travel quite often. Part of the interview process was a technical challenge. This is the writeup I submitted along with my solution.
You received a hostname and a username/password for an EC2 box. Make sure the credentials work. The user can sudo.
Please install java, git and maven on the host if they’re not already installed. (Completed with Chef.)
Build a script that can start / stop / restart the service as a background process on the provided machine. (Runit, via Chef)
Capture the sysout & syserr from the process while it’s running, and redirect them to Syslog. (Runit logger and rsyslog file monitoring)
Build a tool that periodically checks the service for its health. If a health check fails, the tool should trigger an alert via appropriate channels (e.g. email). (Runit, Nagios)
Make sure the service endpoint returns the expected response. (Nagios)
Parse this page and trigger an alert if:
- Daemon thread count > 10
- de.leibert.ExampleResource/p999 > 5ms
- percent-4xx-15m > 0.4
Bonus points: everything is configurable and it takes very little time to add monitoring for new metrics (Chef, Nagios, Graphite)
Nagios will work, but it’s not the right choice. Ideally I wanted to add the http://metrics.codahale.com/manual/graphite/ module to the webapp so that it reports its metrics to Graphite directly. Unfortunately, I’m not skilled in Scala. Given access to a developer, this would be my solution.
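For illustration, the three alert conditions above can be sketched as a small Nagios-style check. This is not the check I submitted. The layout of the metrics JSON (the key names and nesting in THRESHOLDS) is an assumption made for the example and would need adjusting to match the real metrics page. The exit codes follow the standard Nagios plugin convention.

```python
import json

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Hypothetical thresholds: (path into the metrics JSON, upper limit).
# The key names and nesting are assumptions; adjust to the real page.
THRESHOLDS = [
    (("jvm", "daemon_thread_count"), 10),
    (("de.leibert.ExampleResource", "p999"), 5),  # milliseconds
    (("percent-4xx-15m",), 0.4),
]


def lookup(metrics, path):
    """Walk a nested dict along the given sequence of keys."""
    node = metrics
    for key in path:
        node = node[key]
    return node


def check(metrics, thresholds=THRESHOLDS):
    """Return (exit_code, message) for a parsed metrics document."""
    breaches = []
    for path, limit in thresholds:
        name = "/".join(path)
        try:
            value = float(lookup(metrics, path))
        except (KeyError, TypeError, ValueError):
            return UNKNOWN, "UNKNOWN: metric %s not found" % name
        if value > limit:
            breaches.append("%s=%s > %s" % (name, value, limit))
    if breaches:
        return CRITICAL, "CRITICAL: " + "; ".join(breaches)
    return OK, "OK: all metrics within thresholds"


if __name__ == "__main__":
    # Sample document in the assumed layout, standing in for the real page.
    sample = json.loads(
        '{"jvm": {"daemon_thread_count": 12},'
        ' "de.leibert.ExampleResource": {"p999": 1.2},'
        ' "percent-4xx-15m": 0.1}'
    )
    code, message = check(sample)
    print(message)
    raise SystemExit(code)
```

Adding a new metric is then a one-line change to THRESHOLDS, which is the kind of configurability the bonus point asks for.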
Choice of tools and design
I used Nagios for the monitoring and alerting framework as it’s very easy to add additional checks. Graphite interfaces well with the metrics library in use and allows easy creation of new graphs. Ideally, Graphite would be installed on a separate server. As the application is further developed, I’d encourage the developers to implement statsd or similar and output application metrics to Graphite too.
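As a rough sketch of what outputting application metrics to Graphite involves: Graphite's carbon daemon accepts a plaintext protocol of one "path value timestamp" line per metric, by default on TCP port 2003. The metric name and host below are hypothetical, and a real application would more likely use the metrics library's Graphite reporter or a statsd client than raw sockets.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return "%s %s %d\n" % (path, value, ts)


def send_to_graphite(lines, host="localhost", port=2003):
    """Send pre-formatted lines to a carbon plaintext listener.

    The host is an assumption; 2003 is carbon's default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

For example, send_to_graphite([graphite_line("webapp.example.p999", 1.2)]) would record a single data point under a hypothetical webapp.example.p999 series.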
I opted for a simple chef-solo solution for general server configuration as I like to represent infrastructure as code. When scaling this infrastructure, a chef-client and a dedicated centralised chef-server setup would be preferred. Not all software was installed via Chef, as the ‘solo’ solution doesn’t support some key features and I felt the full server install was too bulky for this project.
To run Chef, after making changes to a cookbook:
sudo chef-solo -j node.json -c solo.rb
The major potential problem with this system is a severe lack of redundancy due to the limitation of one server.
At a minimum, I’d implement an Amazon ELB and a second app server. EC2 Auto-scaling would be a quick win too.
As there are no sessions or databases, caching is very straightforward, so I’d quickly add a Varnish layer.
I’d ideally centralise logging to a syslog server to allow correlation of data across nodes and layers of the web stack. Splunk or Loggly would be great solutions for this, and they would also minimise operational overhead.
Location of code and configuration
A copy of all relevant configuration files and code is available in /home/ubuntu/MarkBarger.tar