xen-users
Re: [Xen-users] Home Xen hypervisor for master's project
Mechanical and power supply issues aside, ECC errors are the most common
reason for component replacement in the field. Memory does fail.
Furthermore, cosmic radiation, as far-fetched as it sounds, is a real
problem, and a single bit flip on a non-ECC system can trigger a panic
(if you're lucky, only the VM fails; if you're unlucky, the whole
system reboots). Analyzing crash dumps to find the offending module is
unpleasant, to say the least (how hard depends on how the OS handles memory
access), and you won't know whether the problem was faulty hardware or a
random occurrence.
I've seen systems which had a constant zero or one on one of the memory
module pins (lines) due to vibration, resulting in constant error
messages. With ECC, which reported the exact bit, it was trivial to
diagnose and resolve. With standard memory, it would have panicked every
time the server was booted -- the frustration would have been unbelievable.
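
To illustrate how ECC can name the exact failing bit, here is a minimal
sketch of a single-error-correcting, double-error-detecting (SECDED) Hamming
code over 4 data bits. It is a toy: real ECC DIMMs protect 64 data bits with
8 check bits, and the function names `encode` and `decode` are made up for
this example, not taken from any hardware or library.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                     # index 0 holds the overall parity bit
    # Hamming(7,4) layout: positions 1, 2 and 4 are parity, the rest are data.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2        # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble, bad_position) for an 8-bit codeword."""
    # XOR of the positions of all set bits; zero for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_broken = sum(code) % 2 == 1
    if parity_broken:
        # Odd number of flips, assumed to be one: the syndrome is its position
        # (0 means the overall parity bit itself flipped).
        fixed = list(code)
        fixed[syndrome] ^= 1
        status, bad = "corrected", syndrome
    elif syndrome != 0:
        # Parity looks fine but the syndrome does not: two bits flipped.
        return ("uncorrectable", None, None)
    else:
        fixed, status, bad = list(code), "ok", None
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return (status, nibble, bad)
```

Flipping one bit of `encode(0b1011)` and calling `decode` gives back the
original value together with the position of the flipped bit, which is the
property that makes a stuck line easy to track down. Flipping two bits is
reported as uncorrectable instead of silently returning wrong data.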
ECC does NOT enhance performance. If anything, you'll get lower
performance, as I've seen no ECC modules that go beyond the JEDEC-specified
DDR standards (for frequency and latency). I.e., the fastest ECC DDR3
memory is 1333 MHz CL9. You can get 1866 MHz CL7 and something like
2400 MHz CL9, but only as non-ECC, so when it comes to raw performance, ECC
limits it (leaving aside the obvious time lost to outages when non-ECC
memory fails).
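
For a rough sense of what those numbers mean, the CAS latency in nanoseconds
can be worked out from the speed rating and the CL value. The sketch below
assumes the "MHz" figures above are data rates in MT/s (as in DDR3-1333), so
the memory clock is half of that. The helper name `cas_latency_ns` is made up
for this example.

```python
def cas_latency_ns(data_rate_mts, cl):
    """CAS latency in ns for a DDR module rated at data_rate_mts MT/s."""
    # DDR transfers twice per clock, so one clock period is 2000 / rate ns.
    return cl * 2000.0 / data_rate_mts


for rate, cl in [(1333, 9), (1866, 7), (2400, 9)]:
    print(f"DDR3-{rate} CL{cl}: {cas_latency_ns(rate, cl):.1f} ns")
```

This prints about 13.5 ns for 1333 CL9 and about 7.5 ns for both 1866 CL7
and 2400 CL9. CAS latency is only one of several timings, so treat this as a
first-order comparison and not a benchmark.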
One thing to check before springing for expensive equipment is how memory
errors are reported on the system you are building. FWIW, I know that ECC
errors with AMD under Linux are very readable (when building the kernel,
you have the option to choose human-readable ECC error reporting, which is
available for AMD only). Sync floods on HyperTransport (HT) are also
diagnosable. I don't know how it looks with Intel under Linux, although I
can confirm that memory errors are, for the most part, easy to diagnose
under Solaris, both with Intel and with AMD.
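
One way to check error reporting on a Linux box is to read the counters the
kernel's EDAC subsystem exposes in sysfs. The sketch below assumes an EDAC
driver for your memory controller is loaded and that the counters live under
`/sys/devices/system/edac/mc`. The function name `read_edac_counts` is made
up for this example, and the `base` parameter exists only so the function
can be pointed at a different directory.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}} from sysfs."""
    results = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        counts = {}
        # ce_count = corrected errors, ue_count = uncorrected errors.
        for name in ("ce_count", "ue_count"):
            try:
                counts[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                counts[name] = None    # file missing or unreadable
        results[mc.name] = counts
    return results


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("no EDAC memory controllers found (no driver loaded, or no ECC)")
    for mc, c in found.items():
        print(f"{mc}: corrected={c['ce_count']} uncorrected={c['ue_count']}")
```

An empty result does not prove the board lacks ECC; it may only mean that no
EDAC driver is loaded for that chipset.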
Just so you know, if you go with AMD, you don't have to get Opteron to get
ECC. All 890FX system boards support ECC (some mainstream boards based on
other chipsets might not allow it in BIOS), and nearly all support IOMMU.
With Intel, you have to get a Xeon, paying more in the process. I do use
an AMD-based PC at home, but I'm not an AMD employee. If I were going with
brand loyalty based on the companies I work for, I'd be recommending Intel.
Marek
On 20-02-2011 at 09:17:56, James Harper <james.harper@xxxxxxxxxxxxxxxx>
wrote:
>> Hi Joseph, guys,
>> I must say that ECC is kinda market bull. I have used ECC-enabled servers
>> and regular servers built from consumer parts. I've noticed no
>> difference whatsoever.
>
> You do understand what ECC is, right? It's not a performance thing, it's
> an error detection/recovery mechanism.
> When everything is working properly you won't notice a thing. When you
> get a single-bit memory error you shouldn't notice any problems apart
> from a message about a faulty memory module, which you can then replace.
> When you get a double-bit memory error you'll know that you've had a
> memory error instead of getting a random crash or data corruption
> problem with no idea of the cause.
> I've seen ECC catch memory errors a few times, so people aren't just
> making this stuff up.
>
> James
--
I'm using the Opera Mail client: http://www.opera.com/mail/