From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs/RAID5 became unmountable after SATA cable fault
Date: Thu, 5 Nov 2015 07:30:35 -0500 [thread overview]
Message-ID: <563B4BEB.105@gmail.com> (raw)
In-Reply-To: <pan$f33d7$d1be1dd2$81273c9e$505ff3bb@cox.net>
On 2015-11-04 23:06, Duncan wrote:
> (Tho I should mention, while not on zfs, I've actually had my own
> problems with ECC RAM too. In my case, the RAM was certified to run at
> speeds faster than it was actually reliable at, such that the stored
> data, which is what the ECC protects, was fine, but the data was
> getting damaged in transit to/from the RAM. On a lightly loaded
> system, such as
> one running many memory tests or under normal desktop usage conditions,
> the RAM was generally fine, no problems. But on a heavily loaded system,
> such as when doing parallel builds (I run gentoo, which builds from
> sources in order to get the higher level of option flexibility that
> comes only when you can toggle build-time options), I'd often have memory
> faults and my builds would fail.
>
> The most common failure, BTW, was on tarball decompression, bunzip2 or
> the like, since the tarballs contained checksums that were verified
> during decompression, and often they'd fail to verify.
>
> Once I updated the BIOS to one that would let me set the memory speed
> instead of using the speed the modules themselves reported, and I
> declocked the memory just one notch (this was DDR1, IIRC I declocked from
> the PC3200 it was rated, to PC3000 speeds), not only was the memory then
> 100% reliable, but I could and did actually reduce the number of
> wait-states for various operations, and it was STILL 100% reliable. It
> simply couldn't handle the raw speeds it was certified to run at, is
> all, tho it
> did handle it well enough, enough of the time, to make the problem far
> more difficult to diagnose and confirm than it would have been had the
> problem appeared at low load as well.
>
> As it happens, I was running reiserfs at the time, and it handled both
> that hardware issue, and a number of others I've had, far better than I'd
> have expected of /any/ filesystem, when the memory feeding it is simply
> not reliable. Reiserfs metadata, in particular, seems incredibly
> resilient in the face of hardware issues, and I lost far less data than I
> might have expected, tho without checksums and with bad memory, I
> imagine the occasional bitflip silently corrupted a file here or
> there, generally nothing I detected. I still use reiserfs on my spinning
> rust today, but it's not well suited to SSD, which is where I run btrfs.
>
> But the point for this discussion is that just because it's ECC RAM
> doesn't mean you can't have memory related errors, just that if you do,
> they're likely to be different errors, "transit errors", that tend to
> go undetected by many memory checkers, at least the ones that don't
> run at full memory bandwidth when simply checking that what was
> stored in a cell can be read back, unchanged.)
I've actually seen similar issues with both ECC and non-ECC memory
myself. Any time I'm getting RAM for a system that I can afford to
over-spec, I get the next higher speed and under-clock it (which in
turn means I can lower the timing parameters and usually end up with a
faster system than if I ran it at the rated speed). FWIW, I also make
a point of doing multiple memtest86+ runs (at a minimum, one running
single-core and one with forced SMP) when I get new RAM, and I even
have a run-level configured on my Gentoo based home server system that
boots Xen and fires up twice as many VMs running memtest86+ as I have
CPU cores, which is usually enough to fully saturate memory bandwidth
and check for the type of issues you mentioned above (although the
BOINC client I run usually does a good job of triggering those kinds
of issues quickly, since distributed computing apps tend to be
memory-bound and use a lot of memory bandwidth).
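The same load-the-buses idea can be sketched in user space, short of a
full Xen-plus-memtest86+ setup: run one write-pattern/read-verify worker
per CPU so all cores are hitting memory at once. This is only an
illustrative sketch (the buffer size and iteration count are arbitrary
choices, and pure Python has too much interpreter overhead to truly
saturate memory bandwidth the way memtest86+ or stress-ng can), but it
shows the shape of the test:

```python
# Sketch: saturate memory traffic from user space by running one
# pattern-write/read-verify worker per CPU core, so transit errors that
# only appear under heavy load have a chance to show up as mismatches.
# Buffer size and iteration count are illustrative, not tuned values.
import multiprocessing
import array

BUF_WORDS = 1 << 20  # 8 MiB of 64-bit words per worker


def stress_worker(iterations: int) -> int:
    """Fill a buffer with a pattern, read it back, count mismatches."""
    buf = array.array('Q', [0] * BUF_WORDS)
    errors = 0
    for it in range(iterations):
        pattern = (0xA5A5A5A5A5A5A5A5 ^ it) & 0xFFFFFFFFFFFFFFFF
        for i in range(BUF_WORDS):
            buf[i] = pattern
        for i in range(BUF_WORDS):
            if buf[i] != pattern:
                errors += 1
    return errors


if __name__ == '__main__':
    ncpus = multiprocessing.cpu_count()
    with multiprocessing.Pool(ncpus) as pool:
        results = pool.map(stress_worker, [1] * ncpus)
    # On healthy hardware every worker should report zero mismatches.
    print('total mismatches:', sum(results))
```

For real testing, purpose-built tools (memtest86+, memtester, stress-ng)
do this properly with varied patterns and much higher bandwidth.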
Thread overview: 17+ messages
[not found] <g7loe3red3ksp64hmb0vsbs2.1445476794489@email.android.com>
2015-11-04 18:01 ` Btrfs/RAID5 became unmountable after SATA cable fault Janos Toth F.
2015-11-04 18:45 ` Austin S Hemmelgarn
2015-11-05 4:06 ` Duncan
2015-11-05 12:30 ` Austin S Hemmelgarn [this message]
2015-11-06 3:19 ` Zoiled
2015-11-06 9:03 ` Janos Toth F.
2015-11-06 10:23 ` Patrik Lundquist
2016-07-23 13:20 Janos Toth F.
-- strict thread matches above, loose matches on Subject: below --
2015-10-22 1:18 János Tóth F.
2015-10-19 8:39 Janos Toth F.
2015-10-20 14:59 ` Duncan
2015-10-21 16:09 ` Janos Toth F.
2015-10-21 16:44 ` ronnie sahlberg
2015-10-21 17:42 ` ronnie sahlberg
2015-10-21 18:40 ` Janos Toth F.
2015-10-21 17:46 ` Janos Toth F.
2015-10-21 20:26 ` Chris Murphy