On 2015-11-04 23:06, Duncan wrote:
> (Tho I should mention, while not on zfs, I've actually had my own
> problems with ECC RAM too. In my case, the RAM was certified to run at
> speeds faster than it was actually reliable at, such that actually stored
> data, what the ECC protects, was fine, the data was actually getting
> damaged in transit to/from the RAM. On a lightly loaded system, such as
> one running many memory tests or under normal desktop usage conditions,
> the RAM was generally fine, no problems. But on a heavily loaded system,
> such as when doing parallel builds (I run gentoo, which builds from
> sources in order to get the higher level of option flexibility that
> comes only when you can toggle build-time options), I'd often have memory
> faults and my builds would fail.
>
> The most common failure, BTW, was on tarball decompression, bunzip2 or
> the like, since the tarballs contained checksums that were verified on
> data decompression, and often they'd fail to verify.
>
> Once I updated the BIOS to one that would let me set the memory speed
> instead of using the speed the modules themselves reported, and I
> declocked the memory just one notch (this was DDR1, IIRC I declocked from
> the PC3200 it was rated, to PC3000 speeds), not only was the memory then
> 100% reliable, but I could and did actually reduce the number of
> wait-states for various operations, and it was STILL 100% reliable. It
> simply couldn't handle the raw speeds it was certified to run, is all,
> tho it did handle it well enough, enough of the time, to make the problem
> far more difficult to diagnose and confirm than it would have been had
> the problem appeared at low load as well.
>
> As it happens, I was running reiserfs at the time, and it handled both
> that hardware issue, and a number of others I've had, far better than I'd
> have expected of /any/ filesystem, when the memory feeding it is simply
> not reliable.
> Reiserfs metadata, in particular, seems incredibly
> resilient in the face of hardware issues, and I lost far less data than I
> might have expected, tho without checksums and with bad memory, I imagine
> I had occasional undetected bitflip corruption in files here or there,
> but generally nothing I detected. I still use reiserfs on my spinning
> rust today, but it's not well suited to SSD, which is where I run btrfs.
>
> But the point for this discussion is that just because it's ECC RAM
> doesn't mean you can't have memory related errors, just that if you do,
> they're likely to be different errors, "transit errors", that will tend
> to be undetected by many memory checkers, at least the ones that don't
> tend to run full out memory bandwidth if they're simply checking that
> what was stored in a cell can be read back, unchanged.)

I've actually seen similar issues with both ECC and non-ECC memory
myself. Any time I'm getting RAM for a system that I can afford to
over-spec, I get the next higher speed and under-clock it (which in turn
means I can lower the timing parameters and usually get a faster system
than if I were running it at the rated speed).

FWIW, I also make a point of doing multiple memtest86+ runs (at a
minimum, one running single core, and one with forced SMP) when I get new
RAM. I even have a run-level configured on my Gentoo-based home server
where it boots Xen and fires up twice as many VMs running memtest86+ as I
have CPU cores, which is usually enough to fully saturate memory
bandwidth and check for the type of issues you mentioned above. That
said, the BOINC client I run usually does a good job of triggering those
kinds of issues fast, since distributed computing apps tend to be memory
bound and use a lot of memory bandwidth.
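
If anyone wants a rough userspace illustration of the "more workers than
cores" idea, something like the Python sketch below is the general shape.
To be clear, this is not what memtest86+ or my Xen setup does, and it is
no substitute for either: it runs on top of the OS, only touches memory
the kernel hands it, and the function names, buffer size and round count
are all made up for the example. It just starts twice as many worker
processes as there are cores, has each one repeatedly copy a buffer
larger than a typical CPU cache, and counts copies that don't compare
equal to the original.

```python
import os
from multiprocessing import Pool


def copy_and_compare(args):
    """Copy a random buffer `rounds` times; return how many copies differ."""
    size, rounds = args
    data = os.urandom(size)          # reference pattern
    mismatches = 0
    for _ in range(rounds):
        copy = bytearray(data)       # forces a full read and write of the buffer
        if copy != data:             # full read of both buffers
            mismatches += 1
    return mismatches


def stress(workers=None, size=8 * 1024 * 1024, rounds=4):
    """Run copy_and_compare in parallel and return the total mismatch count.

    The default of two workers per core mirrors the "twice as many VMs as
    cores" setup; the other defaults are arbitrary illustration values.
    """
    workers = workers or 2 * (os.cpu_count() or 1)
    with Pool(workers) as pool:
        return sum(pool.map(copy_and_compare, [(size, rounds)] * workers))


if __name__ == "__main__":
    print("mismatched copies:", stress())
```

On healthy hardware this should report zero mismatches. A non-zero count
would be worth following up with a proper memtest86+ run, and a zero
count from a short run like this doesn't prove much either way.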