From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs/RAID5 became unmountable after SATA cable fault
Date: Thu, 5 Nov 2015 07:30:35 -0500 [thread overview]
Message-ID: <563B4BEB.105@gmail.com> (raw)
In-Reply-To: <pan$f33d7$d1be1dd2$81273c9e$505ff3bb@cox.net>
On 2015-11-04 23:06, Duncan wrote:
> (Tho I should mention, while not on zfs, I've actually had my own
> problems with ECC RAM too. In my case, the RAM was certified to run at
> speeds faster than it was actually reliable at, such that the stored
> data, which is what the ECC protects, was fine, but the data was
> getting damaged in transit to/from the RAM. On a lightly loaded
> system, such as
> one running many memory tests or under normal desktop usage conditions,
> the RAM was generally fine, no problems. But on a heavily loaded system,
> such as when doing parallel builds (I run gentoo, which builds from
> sources in order to get the higher level of option flexibility that
> comes only when you can toggle build-time options), I'd often have memory
> faults and my builds would fail.
>
> The most common failure, BTW, was on tarball decompression, bunzip2 or
> the like, since the tarballs contained checksums that were verified
> during decompression, and often they'd fail to verify.
>
> Once I updated the BIOS to one that would let me set the memory speed
> instead of using the speed the modules themselves reported, and I
> declocked the memory just one notch (this was DDR1, IIRC I declocked from
> the PC3200 it was rated, to PC3000 speeds), not only was the memory then
> 100% reliable, but I could and did actually reduce the number of
> wait-states for various operations, and it was STILL 100% reliable. It
> simply couldn't handle the raw speeds it was certified to run at, is
> all, tho it
> did handle it well enough, enough of the time, to make the problem far
> more difficult to diagnose and confirm than it would have been had the
> problem appeared at low load as well.
>
> As it happens, I was running reiserfs at the time, and it handled both
> that hardware issue, and a number of others I've had, far better than I'd
> have expected of /any/ filesystem, when the memory feeding it is simply
> not reliable. Reiserfs metadata, in particular, seems incredibly
> resilient in the face of hardware issues, and I lost far less data than I
> might have expected, tho without checksums and with bad memory, I
> imagine the occasional bitflip silently corrupted a file here or
> there, generally nothing I detected. I still use reiserfs on my spinning
> rust today, but it's not well suited to SSD, which is where I run btrfs.
>
> But the point for this discussion is that just because it's ECC RAM
> doesn't mean you can't have memory related errors, just that if you do,
> they're likely to be different errors, "transit errors", that tend to
> go undetected by many memory checkers, at least the ones that don't
> run at full memory bandwidth when simply checking that what was
> stored in a cell can be read back, unchanged.)
I've actually seen similar issues with both ECC and non-ECC memory
myself. Any time I'm getting RAM for a system that I can afford to
over-spec, I get the next higher speed and under-clock it (which in
turn means I can lower the timing parameters and usually end up with a
faster system than if I ran it at the rated speed). FWIW, I also make
a point of doing multiple memtest86+ runs (at a minimum, one running
single-core and one with forced SMP) when I get new RAM, and I even
have a run-level configured on my Gentoo based home server system that
boots Xen and fires up twice as many VMs running memtest86+ as I have
CPU cores, which is usually enough to fully saturate memory bandwidth
and check for the type of issues you mentioned above (although the
BOINC client I run usually does a good job of triggering those kinds
of issues quickly, since distributed computing apps tend to be
memory-bound and use a lot of memory bandwidth).
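The same load-the-buses idea can be sketched in user space, short of a
full Xen-plus-memtest86+ setup: run one write-pattern/read-verify worker
per CPU so all cores are hitting memory at once. This is only an
illustrative sketch (the buffer size and iteration count are arbitrary
choices, and pure Python has too much interpreter overhead to truly
saturate memory bandwidth the way memtest86+ or stress-ng can), but it
shows the shape of the test:

```python
# Sketch: saturate memory traffic from user space by running one
# pattern-write/read-verify worker per CPU core, so transit errors that
# only appear under heavy load have a chance to show up as mismatches.
# Buffer size and iteration count are illustrative, not tuned values.
import multiprocessing
import array

BUF_WORDS = 1 << 20  # 8 MiB of 64-bit words per worker


def stress_worker(iterations: int) -> int:
    """Fill a buffer with a pattern, read it back, count mismatches."""
    buf = array.array('Q', [0] * BUF_WORDS)
    errors = 0
    for it in range(iterations):
        pattern = (0xA5A5A5A5A5A5A5A5 ^ it) & 0xFFFFFFFFFFFFFFFF
        for i in range(BUF_WORDS):
            buf[i] = pattern
        for i in range(BUF_WORDS):
            if buf[i] != pattern:
                errors += 1
    return errors


if __name__ == '__main__':
    ncpus = multiprocessing.cpu_count()
    with multiprocessing.Pool(ncpus) as pool:
        results = pool.map(stress_worker, [1] * ncpus)
    # On healthy hardware every worker should report zero mismatches.
    print('total mismatches:', sum(results))
```

For real testing, purpose-built tools (memtest86+, memtester, stress-ng)
do this properly with varied patterns and much higher bandwidth.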
Thread overview: 17+ messages
[not found] <g7loe3red3ksp64hmb0vsbs2.1445476794489@email.android.com>
2015-11-04 18:01 ` Btrfs/RAID5 became unmountable after SATA cable fault Janos Toth F.
2015-11-04 18:45 ` Austin S Hemmelgarn
2015-11-05 4:06 ` Duncan
2015-11-05 12:30 ` Austin S Hemmelgarn [this message]
2015-11-06 3:19 ` Zoiled
2015-11-06 9:03 ` Janos Toth F.
2015-11-06 10:23 ` Patrik Lundquist
2016-07-23 13:20 Janos Toth F.
-- strict thread matches above, loose matches on Subject: below --
2015-10-22 1:18 János Tóth F.
2015-10-19 8:39 Janos Toth F.
2015-10-20 14:59 ` Duncan
2015-10-21 16:09 ` Janos Toth F.
2015-10-21 16:44 ` ronnie sahlberg
2015-10-21 17:42 ` ronnie sahlberg
2015-10-21 18:40 ` Janos Toth F.
2015-10-21 17:46 ` Janos Toth F.
2015-10-21 20:26 ` Chris Murphy