When is ECC memory necessary?
Not too fancy, right? Maybe even … a little janky? This is building a computer the Google way:
This rack is now immortalized in the National Museum of American History. Urs Hölzle posted lots more juicy behind the scenes details, including the exact specifications:
When I left Stack Exchange (sorry, Stack Overflow) one of the things that excited me most was embarking on a new project using 100% open source tools. That project is, of course, Discourse.
Inspired by Google and their use of cheap, commodity x86 hardware to scale on top of the open source Linux OS, I also built our own servers. When I get stressed out, when I feel the world weighing heavy on my shoulders and I don't know where to turn … I build servers. It's therapeutic.
Don't judge me, man.
But more seriously, with the release of Intel's latest Skylake architecture, it's finally time to upgrade our 2013 era Discourse servers to the latest and greatest, something reflective of 2016 – which means building even more servers.
Discourse runs on a Ruby stack and one thing we learned early on is that Ruby demands exceptional single threaded performance, aka, a CPU running as fast as possible. Throwing umptazillion CPU cores at Ruby doesn't buy you a whole lot other than being able to handle more requests at the same time. Which is nice, but doesn't get you speed per se. Someone made a helpful technical video to illustrate exactly how this all works:
This is by no means exclusive to Ruby; other languages like JavaScript and Python also share this trait. And Discourse itself is a JavaScript application delivered through the browser, which exercises the mobile / laptop / desktop client CPU. Mobile devices reaching near-parity with desktop performance in single threaded performance is something we're betting on in a big way with Discourse.
So, good news! Although PC performance has been incremental at best in the last 5 years, between Haswell and Skylake, Intel managed to deliver a respectable per-thread performance bump. Since we are upgrading our servers from Ivy Bridge (very similar to the i7-3770k), the generation before Haswell, I'd expect a solid 33% performance improvement at minimum.
Even worse, the more cores they pack on a single chip, the slower they all go. From Intel's current Xeon E5 lineup:
Sad, isn't it? Which brings me to the following build for our core web tiers, which optimizes for "lots of inexpensive, fast boxes"
So, about 10% cheaper than what we spent in 2013, with 2× the memory, 2× the storage (probably 50-100% faster too), and at least ~33% faster CPU. With lower power draw, to boot! Pretty good. Pretty, pretty, pretty, pretty good.
(Note that the memory bump is only possible thanks to Intel finally relaxing their iron fist of maximum allowed RAM at the low end; that's new to the Skylake generation.)
One thing is conspicuously missing in our 2016 build: Xeons, and ECC RAM. In my defense, this isn't intentional – we wanted the fastest per-thread performance and no Intel Xeon, either currently available or announced, goes to 4.0 GHz with Skylake. Paying half the price for a CPU with better per-thread performance than any Xeon, well, I'm not going to kid you, that's kind of a nice perk too.
So what is ECC all about?
It's received wisdom in the sysadmin community that you always build servers with ECC RAM because, well, you build servers to be reliable, right? Why would anyone intentionally build a server that isn't reliable? Are you crazy, man? Well, looking at that cobbled together Google 1999 server rack, which also utterly lacked any form of ECC RAM, I'm inclined to think that reliability measured by "lots of redundant boxes" is more worthwhile and easier to achieve than the platonic ideal of making every individual server bulletproof.
Being the type of guy who likes to question stuff… I began to question. Why is it that ECC is so essential anyway? If ECC was so important, so critical to the reliable function of computers, why isn't it built in to every desktop, laptop, and smartphone in the world by now? Why is it optional? This smells awfully… enterprisey to me.
Now, before everyone stops reading and I get permanently branded as "that crazy guy who hates ECC", I think ECC RAM is fine:
I am not anti-insurance, nor am I anti-ECC. But I do seriously question whether ECC is as operationally critical as we have been led to believe, and I think the data shows modern, non-ECC RAM is already extremely reliable.
First, let's look at the Puget Systems reliability stats. These guys build lots of commodity x86 gamer PCs, burn them in, and ship them. They helpfully track statistics on how many parts fail either from burn-in or later in customer use. Go ahead and read through the stats.
Modern commodity computer parts from reputable vendors are amazingly reliable. And their trends show from 2012 onward essential PC parts have gotten more reliable, not less. (I can also vouch for the improvement in SSD reliability as we have had zero server SSD failures in 3 years across our 12 servers with 24+ drives, whereas in 2011 I was writing about the Hot/Crazy SSD Scale.) And doesn't this make sense from a financial standpoint? How does it benefit you as a company to ship unreliable parts? That's money right out of your pocket and the reseller's pocket, plus time spent dealing with returns.
We had a, uh, "spirited" discussion about this internally on our private Discourse instance.
This is not a new debate by any means, but I was frustrated by the lack of data out there. In particular, I'm really questioning the difference between "soft" and "hard" memory errors:
I absolutely believe that hard errors are reasonably common. RAM DIMMS can have bugs, or the chips on the DIMM can fail, or there's a design flaw in circuitry on the DIMM that only manifests in certain corner cases or under extreme loads. I've seen it plenty. But a soft error where a bit of memory randomly flips?
Outside of airplanes and spacecraft, I have a difficult time believing that soft errors happen with any frequency, otherwise most of the computing devices on the planet would be crashing left and right. I deeply distrust the anecdotal voodoo behind "but one of your computer's memory bits could flip, you'd never know, and corrupted data would be written!" It'd be one thing if we observed this regularly, but I've been unhealthily obsessed with computers since birth and I have never found random memory corruption to be a real, actual problem on any computers I have either owned or had access to.
But who gives a damn what I think. What does the data say?
A 2007 study found that the observed soft error rate in live servers was two orders of magnitude lower than previously predicted:
A 2009 study on Google's server farm notes that soft errors were difficult to find:
Yet another large scale study from 2012 discovered that RAM errors were dominated by permanent failure modes typical of hard errors:
In the end, we decided the non-ECC RAM risk was acceptable for every tier of service except our databases. Which is kind of a bummer since higher end Skylake Xeons got pushed back to the big Purley platform upgrade in 2017. Regardless, we burn in every server we build with a complete run of memtest86 and overnight prime95/mprime, and you should too. There's one whirring away through endless memory tests right behind me as I write this.
I find it very, very suspicious that ECC – if it is so critical to preventing these random, memory corrupting bit flips – has not already been built into every type of RAM that we ship in the ubiquitous computing devices all around the world as a cost of doing business. But I am by no means opposed to paying a small insurance premium for server farms, either. You'll have to look at the data and decide for yourself. Mostly I wanted to collect all this information in one place so people who are also evaluating the cost/benefit of ECC RAM for themselves can read the studies and decide what they want to do.
Please feel free to leave comments if you have other studies to cite, or significant measured data to share.
Physically, ECC memory differs from non-ECC memory (like what consumer laptop / desktop RAM uses) in that it has 9 memory chips instead of 8 (memory chips are used to store data that is sent to the CPU when summoned). ECC RAM’s bonus memory chip is used for error detection and correction among the other eight memory chips.
Systems running ECC memory are supposed to crash less. In 2014, Puget Systems reported that ECC memory had a 0.09 percent failure rate, compared to non-ECC memory’s 0.6 percent failure rate.
ECC memory targets enterprise-grade workloads, so most consumer PC motherboards either won’t support ECC RAM or will run it without its ECC function. To actually enjoy the benefits of ECC memory, you'll need a workstation / server level motherboard. ECC memory is also more expensive than non-ECC RAM because of its extra memory chip.
Again, ECC memory is geared toward enterprise-grade workstations and servers. As such, a similarly heavy-duty CPU is needed to support ECC memory. For Intel CPUs, only the Xeon line supports ECC, in an attempt to differentiate its enthusiast-level processors from enterprise-level ones. Meanwhile, AMD’s core-abundant Threadripper line supports ECC memory.
Perhaps surprisingly, ECC RAM is a touch slower than non-ECC RAM, since it takes extra time to check for errors. In that same 2014 study cited above, Puget found that ECC RAM was 0.25 percent slower than non-ECC RAM, with Registered ECC RAM being 0.44 percent slower (however, they determined the performance difference in non-ECC’s favor is “tiny.”)
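To put the Puget figures quoted above side by side, here is a minimal Python sketch of the arithmetic. The fleet size and job length are made-up illustration values, and reading "0.25 percent slower" as "the job takes 0.25 percent longer" is an assumption rather than something the study states.

```python
# Figures quoted above from Puget Systems (2014), all in percent.
ECC_FAILURE_PCT = 0.09
NON_ECC_FAILURE_PCT = 0.6
ECC_SLOWDOWN_PCT = 0.25
REGISTERED_ECC_SLOWDOWN_PCT = 0.44


def expected_failures(modules: int, failure_pct: float) -> float:
    """Expected number of failed modules in a batch of the given size."""
    return modules * failure_pct / 100


def extra_seconds(job_seconds: float, slowdown_pct: float) -> float:
    """Extra run time if a job takes slowdown_pct percent longer (assumed reading)."""
    return job_seconds * slowdown_pct / 100


# How many times more often non-ECC modules failed than ECC modules.
failure_ratio = NON_ECC_FAILURE_PCT / ECC_FAILURE_PCT

if __name__ == "__main__":
    modules = 1000      # hypothetical batch size
    job_seconds = 3600  # hypothetical one-hour job
    print(f"non-ECC failed about {failure_ratio:.1f}x as often as ECC")
    print("expected ECC failures:", expected_failures(modules, ECC_FAILURE_PCT))
    print("expected non-ECC failures:", expected_failures(modules, NON_ECC_FAILURE_PCT))
    print("extra seconds with ECC:", extra_seconds(job_seconds, ECC_SLOWDOWN_PCT))
    print("extra seconds with registered ECC:",
          extra_seconds(job_seconds, REGISTERED_ECC_SLOWDOWN_PCT))
```

On these numbers the reliability gap is a factor of several, while the speed penalty amounts to seconds per hour of work.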
Error correction code is a mathematical process that ensures the data stored in memory is correct. In the case of an error, ECC also allows the system to recreate the correct data in real time.
ECC uses a more advanced form of parity, which is a method of using a single bit of data (a parity bit) to detect errors in larger groups of data, such as the typical eight bits of data used to represent values in a computer memory system. Unfortunately, while a parity bit allows the system to detect an error, it doesn't provide enough information to correct the data error.
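The parity idea can be sketched in a few lines of Python. This is an illustration only (real parity memory does this in hardware), and the function names are made up for the example. It uses even parity: the extra bit is chosen so that the total number of 1 bits is even.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for one byte: 1 if the byte holds an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def passes_parity(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


data = 0b10110010              # four 1 bits, so the parity bit is 0
stored = parity_bit(data)

one_flip = data ^ (1 << 3)     # a single-bit error
two_flips = data ^ 0b00000011  # two bits flipped at once

print(passes_parity(data, stored))       # True: data intact
print(passes_parity(one_flip, stored))   # False: error detected, location unknown
print(passes_parity(two_flips, stored))  # True: a double flip slips through
```

The check can say that something changed but not which bit, and an even number of flips goes unnoticed, which is why parity alone cannot repair data.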
Most computing systems move data in larger chunks of 64 bits (referred to as a "word"). Instead of generating one extra parity bit for every eight bits of data, ECC generates a set of check bits for each 64-bit word: seven are enough to pinpoint a single flipped bit, and ECC DIMMs typically store an eighth, making each word 72 bits wide, so that double-bit errors can be detected as well. The system runs an algorithm over the check bits to verify that the 64 data bits are correct. If a single bit is incorrect (a single-bit error), the ECC algorithm can reconstitute the data, but larger errors (two or more bits) can at best be reported to the system, not corrected.
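Where the number of check bits comes from can be shown with a short calculation. The sketch below assumes a Hamming-style code, which the text above does not name: to locate one bad bit among the data bits and the check bits themselves, r check bits must satisfy 2^r >= data_bits + r + 1.

```python
def check_bits_needed(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error-correcting Hamming code)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


for width in (8, 16, 32, 64):
    r = check_bits_needed(width)
    # One further overall parity bit lets the code also detect double-bit errors.
    print(f"{width:2d} data bits: {r} check bits to correct one error, "
          f"{r + 1} to also detect two")
```

For a 64-bit word this gives seven check bits for correction alone and eight once double-bit detection is included, which matches the 72-bit width of an ECC module.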
ECC memory is not always registered / buffered. However, all registered memory is ECC memory.
Well, it depends.
Are you looking for memory that has access to higher speeds and is compatible with more platforms? Or are you looking for endurance memory that can work 24/7, catch more errors, but sacrifices a bit of speed to do it?
RAM (Random Access Memory) modules are a crucial part of every system, but not all modules are the same. Aside from the capacity, frequency, and latency, the modules can either be Error-Correcting Code (ECC) modules, or non-ECC modules.
The difference between the two is that ECC memory will protect your system from a potential crash by correcting any errors in the data, while non-ECC memory doesn’t correct such errors.
Think of non-ECC memory as your speed-oriented memory, while ECC is your endurance / reliability memory.
Since not all platforms support ECC memory, and not every system needs it, let’s discuss what ECC memory is, how it works, and whether you need it.
To understand how the Error-Correcting Code (ECC) works, first you need to understand what a single-bit error is. Because that’s the major problem that ECC was made to handle.
A single-bit error is when a single bit (a binary 0 or 1) of the data held in RAM is accidentally changed to the opposite value.
This kind of error is tiny, and a computer may not recognize it as incorrect automatically, which can lead to many problems.
You can think of single-bit errors like metaphorical weeds in your lawn. Your lawn is your memory, and the ECC part of your memory is the choice of herbicide that’s available.
Non-ECC memory won’t take out any of the “weeds.”
ECC will annihilate all weeds, but a bit slower.
Single-bit errors can occur because of magnetic or electrical interference inside the computer, or because of background radiation, which is present in every system.
Voltage stress, temperature variation, impact shock, or even data being read or written in a different way than originally intended can also lead to a single-bit error.
ECC memory will take care of these errors and fix them before they turn into a bigger problem.
ECC memory is very similar to non-ECC memory. The biggest difference between the two is that ECC memory usually has a bit of extra memory dedicated solely to making sure that the actual memory doesn’t crash and burn if something bad happens.
ECC is basically just a little extra chip on a normal stick of RAM that makes sure every bit of data going in and out is exactly what it’s supposed to be.
What it does is compute a short check code from the data being written into the main memory (this is added redundancy, not encryption) and store that code in the extra bit of memory I told you about.
When the data stored in the main memory needs to be accessed, it then creates a new code and checks that piece of code against the code that was previously generated.
If it finds that they’re both the same, and that the data has not been tampered with in any way, it allows the data to be read.
But if it finds that the new code differs from the stored code, it tries to fix the issue by using the difference between the two codes to work out exactly which bit is wrong.
And if it can’t, it at the very least makes sure that you know that something has gone wrong instead of silently continuing to work.
It’s similar to comparing MD5 hashes when downloading a program to make sure that what you downloaded was what you actually wanted and not a different rogue undercover file.
This is how ECC becomes slightly slower—because it has to create those extra codes—but a more efficient option to take care of your metaphorical lawn.
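The write, read, compare, and repair cycle described above can be sketched in Python. This is a minimal illustration, not how any particular memory controller works: it assumes an extended Hamming code (the text does not name the code), operates on an 8-bit value rather than a 64-bit word, and the names `encode` and `decode` are made up for the example.

```python
from __future__ import annotations


def encode(data: int, k: int = 8) -> list[int]:
    """Build a codeword for k data bits. Index 0 holds an overall parity bit,
    indexes 1..n hold Hamming positions (powers of two are check bits)."""
    r = 0
    while 2 ** r < k + r + 1:
        r += 1
    n = k + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: a data position
            code[pos] = (data >> bit) & 1
            bit += 1
    syndrome = 0
    for pos in range(1, n + 1):          # XOR of the positions holding a 1
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                   # choose check bits so that XOR becomes 0
        code[1 << i] = (syndrome >> i) & 1
    code[0] = sum(code) % 2              # make the total number of 1 bits even
    return code


def decode(code: list[int]) -> tuple[str, int | None]:
    """Recompute the check on read. Returns (status, data)."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    overall_ok = sum(code) % 2 == 0
    fixed = list(code)
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok and syndrome <= n:
        fixed[syndrome] ^= 1             # one flipped bit: the syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None     # report instead of silently carrying on
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= fixed[pos] << bit
            bit += 1
    return status, data


word = 0b10110010                        # 178
stored = encode(word)
print(decode(stored))                    # ('ok', 178)

stored[5] ^= 1                           # simulate a single-bit error
print(decode(stored))                    # ('corrected', 178)

stored[9] ^= 1                           # a second flipped bit in the same word
print(decode(stored))                    # ('uncorrectable', None)
```

The three outcomes line up with the description: matching codes let the data through, a single flipped bit is located and repaired, and anything worse is reported rather than passed along as good data.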
According to Intelligent Memory, the chances of such an error occurring are one single-bit error every 14 to 40 hours per gigabit (125 MB) of RAM.
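To see what that rate would mean for a whole machine, here is a rough Python sketch. It takes the quoted figure at face value and assumes it scales linearly with capacity. Both are assumptions, and the 16 GB system is a hypothetical example.

```python
def errors_per_day(ram_gigabytes: float, hours_per_error_per_gigabit: float) -> float:
    """Expected single-bit errors per day, scaling the quoted rate linearly."""
    gigabits = ram_gigabytes * 8          # 1 gigabyte = 8 gigabits
    return gigabits * 24 / hours_per_error_per_gigabit


ram_gb = 16                               # hypothetical system
low = errors_per_day(ram_gb, 40)          # optimistic end of the quoted range
high = errors_per_day(ram_gb, 14)         # pessimistic end of the quoted range
print(f"{ram_gb} GB: roughly {low:.0f} to {high:.0f} single-bit errors per day")
```

Scaled this way the figure implies dozens of bit flips a day on an ordinary desktop, which is far more than the field studies summarized earlier observed, so treat it as an upper-end vendor estimate.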
In a system used for everyday browsing and gaming, error correction isn’t a necessity. However, the stakes are higher in the world of servers and professional workstations.
If you’re running a business that specializes in finance, a single-bit error can lead to a server crash that could potentially wipe transactions from your server.
Such a memory error can also lead to a data transcription error, which could lead to misplacing a decimal or changing a number.
This is obviously detrimental to the integrity and trustworthiness of businesses. If you went to purchase your cat a new toy and ended up paying $100.00 instead of $10.00, that would be quite tragic.
If you visited your doctor and the bill came out to $56,987, instead of $59.45, you’re an American or just became the victim of an ECC-related computer error, or both. And you’d most likely never use their services again.
Not to mention the terrible day you’d have after getting a bill that size for a mere checkup.
When calculating an airflow simulation over an airplane’s wing, you don’t want memory errors messing up the results, because there are lives at stake.
Wasted time investment is another thing that ECC Memory can prevent. If you’re rendering high-res, complex images or training a deep learning model, having to start over after weeks of processing, because of some memory errors, is a huge waste of time and money for your business.
ECC memory is a necessity in the medical industry too. With patient care, the accuracy of records is critical, and a single-bit error can lead to a wrong diagnosis, which could be fatal. The choice in memory can make or break a business’s records and reputation.
Without ECC memory, not only is there a chance of errors like this happening, but you also won’t know they happened until someone reviews the data and finds a mistake. And sometimes, that can be too late.
Make sure you double-check the receipt on that cat toy.
In the server world, both Intel’s Xeon CPUs and AMD’s Epyc server lineups support ECC memory. Note that for you to use ECC memory, both the processor and the motherboard you’re using will need to support ECC memory.
Remember the lawn and herbicide analogy? Think of your processor and motherboard as the tools you use to work efficiently with ECC. You need the proper tools to spread herbicide where you want it in your lawn.
In the mainstream platforms used today, most of Intel’s processors (even some budget-oriented Celeron models) will support ECC memory, provided you use a motherboard that is compatible with such memory.
With AMD, all Ryzen processors support ECC memory with a compatible X570 chipset motherboard, whereas the B550 chipset doesn’t support ECC memory with Ryzen 2000 processors. Ryzen processors with integrated graphics, known as accelerated processing units (APUs), are the exception: in the 3000 G-Series and 4000 G-Series you need a PRO processor for ECC support.
Here’s an overview Table from ASUS showing Ryzen CPU ECC Memory Support:
The most common issue with ECC memory is compatibility. Even though most modern platforms support it, you must make sure that the specific processor and motherboard combination works with ECC memory.
This is less of a problem in the server world, where ECC memory is commonly used. So server hardware usually supports ECC by default.
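On Linux, one way to check whether ECC is actually active on a given processor and motherboard combination is to look at the kernel's EDAC counters in sysfs. The sketch below assumes a Linux system with an EDAC driver loaded for the memory controller. The function name is made up for the example, and an empty result can simply mean the driver is not loaded, not that ECC is absent.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return corrected / uncorrected error counts per memory controller.

    Reads the ce_count and ue_count files the Linux EDAC subsystem exposes
    for each controller (mc0, mc1, ...). Returns {} if nothing is found.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                      # skip controllers we cannot read
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found (ECC inactive or driver not loaded).")
    for name, c in found.items():
        print(f"{name}: {c['corrected']} corrected, {c['uncorrected']} uncorrected")
```

A corrected count that keeps climbing on one controller is also a useful early warning that a module is on its way out.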
Pricing is also different: ECC memory modules are more expensive than non-ECC modules because of the added functionality. Depending on the capacity you’re buying, the price difference is about 10-20%.
There is also a slight performance hit, because of the additional time that ECC memory takes to check for any errors. According to Corsair, you can expect a hit of about 2%.
When you’re building a professional workstation or a server that needs to run 24/7, ECC memory is a must.
To go without ECC in this scenario would be like using a greyhound to pull your wagon when what you really need is a sturdy workhorse.
Both the price difference and performance hit are worth it when you consider that you won’t have to worry about the possibility of a single-bit error causing you headaches.
You wouldn’t want to double-check every receipt to make sure it was correct, right? Extend the same courtesy to your clients and customers.
The main reason ECC memory is favored is that it prevents data errors in server memory, which keeps the system stable. Because the place where these errors need to be caught is the server RAM that temporarily stores data, ECC memory is also called ECC RAM. In general, ECC memory differs from non-ECC memory in that it uses error correction codes to correct memory data. That raises the question: which is more suitable for your environment, ECC or non-ECC memory? Let's explore ECC vs. non-ECC memory.
ECC is a type of server memory that monitors memory data for errors to protect your system from potential threats. The main idea is to add a ninth computer chip to server RAM. The main function of this ninth chip is that it is exclusively responsible for checking for errors and correcting them. Non-ECC memory has only eight chips and does not perform data monitoring and error correction, which is the biggest difference between ECC and non-ECC memory.
But why are there errors? Memory errors arise when electrical and magnetic disturbances inside the computer cause a DRAM cell to flip spontaneously to the opposite state. The most common kind is the single-bit error, in which one bit of a byte (a binary 1 or 0) changes to the opposite value without the system being aware of it. Single-bit errors are subtle and alter very little data, but they can still affect the operating system. They are further divided into hard and soft errors. Hard errors stem mostly from physical factors such as voltage stress, shock, and temperature changes. Soft errors occur when data is read or written differently than intended, so that some corruption happens as data moves in and out of server RAM.
ECC memory builds on parity to detect errors. As described in the previous section, server RAM is checked and corrected with the help of the ninth chip. Parity itself works by appending a 0 or 1 to the end of a byte so that the total number of 1 bits is always even (or always odd, depending on the scheme). For example, with even parity, if a byte contains seven 1 bits, the parity bit is set to 1, bringing the total to eight. If the parity bit is 0 and the byte turns out to contain an odd number of 1 bits, the byte is corrupt.
Naturally, ECC memory is not limited to one parity bit per 8-bit byte; it can also generate a 7-bit code for each 64-bit word using a binary error correction code. Every time the system reads 64 bits of data, it regenerates the 7-bit code and checks whether it matches the stored one. A mismatch means an error has occurred, and the ECC memory corrects it immediately.
When you apply ECC memory to your server, it monitors memory data and corrects errors in a timely manner. First, this somewhat reduces the number of crashes, especially in devices that cannot withstand memory data corruption, such as computing applications or servers in the scientific and financial industries. Secondly, its data error correction can maintain data integrity and enhance system stability. In the data center, ECC is more reliable than non-ECC memory.
However, ECC memory also has some disadvantages. Compared to non-ECC memory, it is more expensive because of the extra memory chips and their complexity. Not all computers need ECC memory, either; it is mainly in important and complex work environments that server and workstation motherboards need to be configured with it. What's more, in terms of read speed, ECC memory is about 2% slower than non-ECC memory because of the extra time required to check memory data for errors.
There is no absolute standard for saying whether ECC or non-ECC memory is better; it depends on the environment. If you work in finance, medicine, or another industry built on critical data, you should consider configuring ECC memory in your data center servers. Why? Because it can reduce security breaches and data transcription errors, prevent information corruption and system crashes, and so support data accuracy and system stability. In such industries, the impact of data errors can be fatal. In finance, data can be encoded incorrectly or corrupted, directly affecting the business. In medicine, records can be mismatched, with serious consequences. ECC relies on support from both the CPU and the memory module itself, and UDIMMs are available with ECC support.
If you're just a regular PC user, or don't plan to use mission-critical equipment for major projects, you can choose DRAM or non-parity SDRAM.
In contrast to non-ECC RAM, which cannot correct memory errors, ECC RAM can immediately detect and fix them before they cause data corruption or even system crashes. This is why ECC memory is utilized in numerous enterprise applications, especially for mission-critical applications.