Luk Brownlee
About
-
Posted Answers
Answer
Not too fancy, right? Maybe even … a little janky? This is building a computer the Google way:
This rack is now immortalized in the National Museum of American History. Urs Hölzle posted lots more juicy behind the scenes details, including the exact specifications:
When I left Stack Exchange (sorry, Stack Overflow) one of the things that excited me most was embarking on a new project using 100% open source tools. That project is, of course, Discourse.
Inspired by Google and their use of cheap, commodity x86 hardware to scale on top of the open source Linux OS, I also built our own servers. When I get stressed out, when I feel the world weighing heavy on my shoulders and I don't know where to turn … I build servers. It's therapeutic.
Don't judge me, man.
But more seriously, with the release of Intel's latest Skylake architecture, it's finally time to upgrade our 2013 era Discourse servers to the latest and greatest, something reflective of 2016 – which means building even more servers.
Discourse runs on a Ruby stack and one thing we learned early on is that Ruby demands exceptional single threaded performance, aka, a CPU running as fast as possible. Throwing umptazillion CPU cores at Ruby doesn't buy you a whole lot other than being able to handle more requests at the same time. Which is nice, but doesn't get you speed per se. Someone made a helpful technical video to illustrate exactly how this all works:
This is by no means exclusive to Ruby; other languages like JavaScript and Python also share this trait. And Discourse itself is a JavaScript application delivered through the browser, which exercises the mobile / laptop / desktop client CPU. Mobile devices reaching near-parity with desktop performance in single threaded performance is something we're betting on in a big way with Discourse.
So, good news! Although PC performance has been incremental at best in the last 5 years, between Haswell and Skylake, Intel managed to deliver a respectable per-thread performance bump. Since we are upgrading our servers from Ivy Bridge (very similar to the i7-3770k), the generation before Haswell, I'd expect a solid 33% performance improvement at minimum.
Even worse, the more cores they pack on a single chip, the slower they all go. From Intel's current Xeon E5 lineup:
Sad, isn't it? Which brings me to the following build for our core web tiers, which optimizes for "lots of inexpensive, fast boxes"
So, about 10% cheaper than what we spent in 2013, with 2× the memory, 2× the storage (probably 50-100% faster too), and at least ~33% faster CPU. With lower power draw, to boot! Pretty good. Pretty, pretty, pretty, pretty good.
(Note that the memory bump is only possible thanks to Intel finally relaxing their iron fist of maximum allowed RAM at the low end; that's new to the Skylake generation.)
One thing is conspicuously missing in our 2016 build: Xeons, and ECC Ram. In my defense, this isn't intentional – we wanted the fastest per-thread performance and no Intel Xeon, either currently available or announced, goes to 4.0 GHz with Skylake. Paying half the price for a CPU with better per-thread performance than any Xeon, well, I'm not going to kid you, that's kind of a nice perk too.
So what is ECC all about?
It's received wisdom in the sysadmin community that you always build servers with ECC RAM because, well, you build servers to be reliable, right? Why would anyone intentionally build a server that isn't reliable? Are you crazy, man? Well, looking at that cobbled together Google 1999 server rack, which also utterly lacked any form of ECC RAM, I'm inclined to think that reliability measured by "lots of redundant boxes" is more worthwhile and easier to achieve than the platonic ideal of making every individual server bulletproof.
Being the type of guy who likes to question stuff… I began to question. Why is it that ECC is so essential anyway? If ECC was so important, so critical to the reliable function of computers, why isn't it built in to every desktop, laptop, and smartphone in the world by now? Why is it optional? This smells awfully… enterprisey to me.
Now, before everyone stops reading and I get permanently branded as "that crazy guy who hates ECC", I think ECC RAM is fine:
I am not anti-insurance, nor am I anti-ECC. But I do seriously question whether ECC is as operationally critical as we have been led to believe, and I think the data shows modern, non-ECC RAM is already extremely reliable.
First, let's look at the Puget Systems reliability stats. These guys build lots of commodity x86 gamer PCs, burn them in, and ship them. They helpfully track statistics on how many parts fail either from burn-in or later in customer use. Go ahead and read through the stats.
Modern commodity computer parts from reputable vendors are amazingly reliable. And their trends show from 2012 onward essential PC parts have gotten more reliable, not less. (I can also vouch for the improvement in SSD reliability as we have had zero server SSD failures in 3 years across our 12 servers with 24+ drives, whereas in 2011 I was writing about the Hot/Crazy SSD Scale.) And doesn't this make sense from a financial standpoint? How does it benefit you as a company to ship unreliable parts? That's money right out of your pocket and the reseller's pocket, plus time spent dealing with returns.
We had a, uh, "spirited" discussion about this internally on our private Discourse instance.
This is not a new debate by any means, but I was frustrated by the lack of data out there. In particular, I'm really questioning the difference between "soft" and "hard" memory errors:
I absolutely believe that hard errors are reasonably common. RAM DIMMS can have bugs, or the chips on the DIMM can fail, or there's a design flaw in circuitry on the DIMM that only manifests in certain corner cases or under extreme loads. I've seen it plenty. But a soft error where a bit of memory randomly flips?
Outside of airplanes and spacecraft, I have a difficult time believing that soft errors happen with any frequency, otherwise most of the computing devices on the planet would be crashing left and right. I deeply distrust the anecdotal voodoo behind "but one of your computer's memory bits could flip, you'd never know, and corrupted data would be written!" It'd be one thing if we observed this regularly, but I've been unhealthily obsessed with computers since birth and I have never found random memory corruption to be a real, actual problem on any computers I have either owned or had access to.
But who gives a damn what I think. What does the data say?
A 2007 study found that the observed soft error rate in live servers was two orders of magnitude lower than previously predicted:
A 2009 study on Google's server farm notes that soft errors were difficult to find:
Yet another large scale study from 2012 discovered that RAM errors were dominated by permanent failure modes typical of hard errors:
In the end, we decided the non-ECC RAM risk was acceptable for every tier of service except our databases. Which is kind of a bummer since higher end Skylake Xeons got pushed back to the big Purley platform upgrade in 2017. Regardless, we burn in every server we build with a complete run of memtestx86 and overnight prime95/mprime, and you should too. There's one whirring away through endless memory tests right behind me as I write this.
I find it very, very suspicious that ECC – if it is so critical to preventing these random, memory corrupting bit flips – has not already been built into every type of RAM that we ship in the ubiquitous computing devices all around the world as a cost of doing business. But I am by no means opposed to paying a small insurance premium for server farms, either. You'll have to look at the data and decide for yourself. Mostly I wanted to collect all this information in one place so people who are also evaluating the cost/benefit of ECC RAM for themselves can read the studies and decide what they want to do.
Please feel free to leave comments if you have other studies to cite, or significant measured data to share.
Answer is posted for the following question.
Answer
Soho Farmhouse is a members' club set in 100 acres of Oxfordshire countryside, with bedrooms, a pool, spa and gym Find out more here
Answer is posted for the following question.
When did soho farmhouse open?
Answer
Drivers use the " heel and toe " method to smoothly combine braking and downshifting as they approach a corner Good drivers know that blipping
Answer is posted for the following question.
How to heel toe downshift?