What Hardware Powers Etsy.com?
Traditionally, discussing the hardware configurations behind a large website is something done inside private circles, and usually to complain about how vendor X did something very poorly, or how much vendor Y’s support sucks.
With the advent of the “cloud”, this has changed slightly. Suddenly people are talking about how big their instances are, and how many of them they’re running. And I think this is a great practice to get into with physical servers in datacenters too. After all, none of this is intended to be some sort of competition; it’s about helping out people in situations similar to ours, and broadcasting solutions that others may not know about… pretty much like everything else we post on this blog.
The great folk at 37signals started this trend recently by posting about their hardware configurations after attending the Velocity conference… one of the aforementioned places where hardware gossiping takes place.
So, in the interest of continuing this trend, here are the classes of machine we use to power over $69.5 million of sales for our sellers in July:
Database Class
As you may already know, we have quite a few pairs of MySQL machines to store our data, and as such we’re relying on them heavily for performance and (a certain amount of) reliability.
For any job that requires an all-round performant box, with good storage, good processing power, and a good level of redundancy, we utilise HP DL380 servers. These clock in at 2U of rack space, 2x 8 core Intel E5630 CPUs (@ 2.53ghz), 96GB of RAM (for that all-important MySQL buffer cache), and 16x 15,000 RPM 146GB hard disks. This gives us the right balance of disk space to store user data, and spindles/RAM to retrieve it quickly enough. The machines have 4x 1gbit ethernet ports, but we only use one.
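As a back-of-the-envelope illustration of that balance (we haven’t spelled out the RAID layout here, so treating the 16 spindles as RAID 10 is purely an assumption for the example, as is the common ~80% rule of thumb for a dedicated MySQL buffer pool):

# Rough numbers for a database-class DL380. RAID 10 across the 16
# spindles and the ~80% buffer pool rule of thumb are assumptions for
# illustration only, not a description of our exact setup.
DISKS, DISK_GB, RAM_GB = 16, 146, 96

raw_gb = DISKS * DISK_GB          # 2336 GB of raw disk
usable_gb = raw_gb / 2            # ~1168 GB if mirrored (assumed RAID 10)
buffer_pool_gb = RAM_GB * 0.8     # common sizing for a dedicated MySQL host

print(f"~{usable_gb:.0f} GB usable disk, ~{buffer_pool_gb:.0f} GB for the buffer pool")
print(f"buffer pool as a share of usable disk: {buffer_pool_gb / usable_gb:.0%}")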
Why not SSDs?
We’re just starting to test our first round of database-class machines with SSDs. Traditionally we’ve had other issues to solve first, such as getting the right balance between the amount of user data on a machine (i.e. the disk space used) and the CPU and memory. However, as you’ll see in our other configs, we have plenty of SSDs throughout the infrastructure, so we’re certainly going to give them a good testing for databases too.
Web/Gearman Worker/Memcache/Utility/Job Class
This is a pretty wide catch-all, but in general we try to favour as few machine classes as possible. For most of our tasks, from handling web traffic (Apache/PHP) to any role where there are many machines and redundancy is solved at the app level, we use one type of machine. This way hardware re-use is promoted and machines can change roles quickly and easily. Having said that, there are some slightly different configurations in this category for components that are easy to change, e.g. the amount of memory and disks.
We’re pretty much in love with this 2U Supermicro chassis, which allows for 4x nodes that share two power supplies and 12x 3.5″ disks on the front of the chassis.
A general configuration for these would be 2x 8 core Intel E5620 CPUs (@ 2.40ghz), 12GB-96GB of RAM, and either a 600GB 7200rpm hard disk or an Intel 160GB SSD.
Note the lack of RAID on these configurations; we’re pretty heavily reliant on Cobbler and Chef, which means rebuilding a system from scratch takes just 10 minutes. In our view, why power two drives when our datacenter staff can replace the drive, rebuild the machine, and have it back in production in under 20 minutes? Obviously this only works where it is appropriate: clusters of machines where the data on each individual machine is not important. Web servers, for example, have no important data, since logs are constantly sent to our centralised logging host and the web code is easily deployed back on to the machine.
We have Nagios checks to let us know when the filesystem becomes un-writeable (as well as SMART checks), so we know when a machine needs a new disk.
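For the curious, a minimal sketch of what a Nagios-style writeability check can look like is below; this is an illustrative example rather than the exact plugin we run, and the mount point is just a placeholder.

#!/usr/bin/env python3
# Illustrative Nagios-style check: can we still write to a filesystem?
# Not our actual plugin; the mount point below is a placeholder.
import os
import sys
import tempfile

MOUNT_POINT = "/var/tmp"   # placeholder: point this at the filesystem you care about

try:
    # Write, sync, and remove a small temp file on the target filesystem.
    with tempfile.NamedTemporaryFile(dir=MOUNT_POINT) as f:
        f.write(b"write test")
        f.flush()
        os.fsync(f.fileno())
except OSError as e:
    print(f"CRITICAL: {MOUNT_POINT} not writeable ({e})")
    sys.exit(2)   # Nagios exit code for CRITICAL

print(f"OK: {MOUNT_POINT} writeable")
sys.exit(0)       # Nagios exit code for OK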
Each machine has 2x 1gbit ethernet ports; in this case we’re only using one.
Hadoop
In the last 12 months we’ve been working on building up our Hadoop cluster, and after evaluating a few hardware configurations we ended up with a chassis design very similar to the one used above. However, we’re using a chassis with 24x 2.5″ disk slots on the front, instead of the 12x 3.5″ design.
Each node (with 4 in a 2U chassis) has 2x 12 core Intel E5646 CPUs (@ 2.40ghz), 96GB of RAM, and 6x 1TB 2.5″ 7200rpm disks. That’s 96 cores, 384GB of RAM, and 24TB of disk per 2U of rack space.
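Spelling out that per-chassis arithmetic:

# Per-chassis totals for the Hadoop nodes: 4 nodes in a 2U chassis,
# using the per-node figures quoted above.
NODES = 4
CORES_PER_NODE = 2 * 12       # 2x E5646, counted the same way as in the post
RAM_GB_PER_NODE = 96
DISK_TB_PER_NODE = 6 * 1      # 6x 1TB 2.5" drives

print(NODES * CORES_PER_NODE, "cores per 2U")        # 96
print(NODES * RAM_GB_PER_NODE, "GB of RAM per 2U")   # 384
print(NODES * DISK_TB_PER_NODE, "TB of disk per 2U") # 24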
Our Hadoop jobs are very CPU heavy, and storage and disk throughput are less of an issue, hence the small amount of disk space per node. We also considered 2U Supermicro servers with 12x 3.5″ disks per node, which we would have gone with had we needed more I/O and storage.
As with the above chassis, each node has 2x 1gbit ethernet ports, but we’re only utilising one at the minute.
Search/Solr
Just a month ago, this would’ve been grouped into the general utility boxes above, but we’ve got something new and exciting for our search stack. We’re using the same chassis as in our general example, but this time with the awesome new Sandy Bridge line of Intel CPUs. We’ve got 2x 16 core Intel E5-2690 CPUs in these nodes, clocked at 2.90ghz, which gives us machines that can handle over 4 times the workload of the generic nodes above, whilst using the same density configuration and not that much more power. That’s 128x 2.9ghz CPU cores per 2U (granted, that includes HyperThreading).
This works so well because search is really CPU bound; we’ve been using SSDs to get around I/O issues in these machines for a few years now. The nodes have 96GB of RAM and a single 800GB SSD for the indexes. This follows the same pattern of not bothering with RAID: the SSD is perfectly fast enough on its own, and we have BitTorrent index distribution, which means getting the indexes onto the machine is super fast.
Fewer machines = less to manage, less power, and less space.
Backups
Supermicro wins this game too. We’re using the catchily named 6047R-E1R36N. The 36 in this model number is the important part… this is a 4U chassis with 36x 3.5″ disks. We load up these chassis with 2TB 7200rpm drives, which, when coupled with an LSI RAID controller with 1GB of battery-backed write-back cache, gives a blistering 1.2 gigabytes/second of sequential write throughput and a total of 60TB of usable disk space across two RAID6 volumes.
Why two RAID6 volumes? Well, it means a little more waste (4 drives for parity instead of 2), but as a result you get a bit more resiliency against losing a number of drives, and rebuild times are halved if you lose a single drive. Obviously RAID monitoring is pretty important, and we have Nagios checks for either SMART (on single-disk machines) or the various RAID utilities on all our other machines.
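As a rough sanity check on those capacity numbers (exactly how the 36 bays are split isn’t stated above, so two even 18-drive RAID6 groups is an assumption):

# Approximate capacity for the 36-bay backup chassis, assuming two
# even 18-drive RAID6 groups (2 parity drives each).
DRIVES, DRIVE_TB, GROUPS, PARITY_PER_GROUP = 36, 2, 2, 2

data_drives = DRIVES - GROUPS * PARITY_PER_GROUP    # 32 data drives
raw_tb = data_drives * DRIVE_TB                     # 64 TB (decimal) after parity
usable_tib = raw_tb * 1e12 / 2**40                  # ~58 TiB before filesystem overhead

print(f"{raw_tb} TB after parity, roughly {usable_tib:.0f} TiB usable")

That lands in the same ballpark as the 60TB figure above once you account for the decimal-vs-binary conversion and filesystem overhead.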
In this case we’re taking advantage of the 2x 1gbit ethernet connections, bonded together to the switch to give us redundancy and the extra bandwidth we need. In the future we may even run fiber to these machines, to get the full potential out of the disks, but right now we don’t get above a gigabit/second for all our backups.
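To give a sense of the mismatch that fiber would eventually address, compare the array’s sequential write rate with what the bonded links can theoretically carry (ignoring protocol overhead):

# Disk throughput vs bonded network throughput on the backup hosts.
ARRAY_WRITE_GB_S = 1.2        # sequential write rate of the RAID6 volumes
BOND_GBIT = 2                 # 2x 1gbit links bonded together

bond_gb_s = BOND_GBIT / 8.0   # ~0.25 GB/s, ignoring protocol overhead
print(f"array: {ARRAY_WRITE_GB_S} GB/s, bonded links: ~{bond_gb_s:.2f} GB/s")
print(f"the disks could absorb roughly {ARRAY_WRITE_GB_S / bond_gb_s:.0f}x what the network delivers")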
Special cases
Of course, there are always exceptions to the rules. The only other hardware profile we have is the HP DL360 server (1U, 4x 2.5″ 15,000rpm 146GB SAS disks), which is for roles that don’t need much horsepower but that we deem important enough to have RAID. For example, DNS servers, LDAP servers, and our Hadoop Namenodes all require little disk space, but need the extra data safety of RAID compared to our regular single-disk configurations.
Networking
I didn’t go into too much detail on the networking side of things in this post. Consider this part 1, and watch this space for our networking gurus to take you through our packet shuffling infrastructure at a later date.