From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs/RAID5 became unmountable after SATA cable fault
Date: Thu, 5 Nov 2015 07:30:35 -0500
Message-ID: <563B4BEB.105@gmail.com>
In-Reply-To: <pan$f33d7$d1be1dd2$81273c9e$505ff3bb@cox.net>

On 2015-11-04 23:06, Duncan wrote:
> (Tho I should mention, while not on zfs, I've actually had my own
> problems with ECC RAM too.  In my case, the RAM was certified to run at
> speeds faster than it was actually reliable at, such that actually stored
> data, what the ECC protects, was fine, the data was actually getting
> damaged in transit to/from the RAM.  On a lightly loaded system, such as
> one running many memory tests or under normal desktop usage conditions,
> the RAM was generally fine, no problems.  But on a heavily loaded system,
> such as when doing parallel builds (I run gentoo, which builds from
> sources in order to get the higher level of option flexibility that
> comes only when you can toggle build-time options), I'd often have memory
> faults and my builds would fail.
>
> The most common failure, BTW, was on tarball decompression, bunzip2 or
> the like, since the tarballs contained checksums that were verified on
> data decompression, and often they'd fail to verify.
>
> Once I updated the BIOS to one that would let me set the memory speed
> instead of using the speed the modules themselves reported, and I
> declocked the memory just one notch (this was DDR1, IIRC I declocked from
> the PC3200 it was rated, to PC3000 speeds), not only was the memory then
> 100% reliable, but I could and did actually reduce the number of wait-
> states for various operations, and it was STILL 100% reliable.  It simply
> couldn't handle the raw speeds it was certified to run, is all, tho it
> did handle it well enough, enough of the time, to make the problem far
> more difficult to diagnose and confirm than it would have been had the
> problem appeared at low load as well.
>
> As it happens, I was running reiserfs at the time, and it handled both
> that hardware issue, and a number of others I've had, far better than I'd
> have expected of /any/ filesystem, when the memory feeding it is simply
> not reliable.  Reiserfs metadata, in particular, seems incredibly
> resilient in the face of hardware issues, and I lost far less data than I
> might have expected, tho without checksums and with bad memory, I imagine
> I had occasional undetected bitflip corruption in files here or there,
> but generally nothing I detected.  I still use reiserfs on my spinning
> rust today, but it's not well suited to SSD, which is where I run btrfs.
>
> But the point for this discussion is that just because it's ECC RAM
> doesn't mean you can't have memory related errors, just that if you do,
> they're likely to be different errors, "transit errors", that will tend
> to be undetected by many memory checkers, at least the ones that don't
> tend to run full out memory bandwidth if they're simply checking that
> what was stored in a cell can be read back, unchanged.)
I've actually seen similar issues with both ECC and non-ECC memory
myself.  Any time I'm getting RAM for a system that I can afford to
over-spec, I get the next higher speed and under-clock it (which in turn
means I can lower the timing parameters and usually get a faster system
than if I were running it at the rated speed).

FWIW, I also make a point of doing multiple memtest86+ runs (at a
minimum, one running single core and one with forced SMP) when I get
new RAM.  I even have a run-level configured on my Gentoo-based home
server where it boots Xen and fires up twice as many VMs running
memtest86+ as I have CPU cores, which is usually enough to fully
saturate memory bandwidth and check for the type of issues you
mentioned above.  That said, the BOINC client I run usually triggers
those kinds of issues quickly, since distributed computing apps tend to
be memory bound and use a lot of memory bandwidth.
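
As a rough userspace illustration of the same idea (a sketch only, not
what either setup described above actually runs): start twice as many
worker processes as CPU cores, have each one copy a large buffer over
and over, and verify a checksum after every copy, much like the tarball
checksum failures mentioned in the quoted text.  The names
copy_and_verify and stress, the buffer size, and the round count are all
made up for this example.  Hashing is CPU-heavy, so this will not
saturate the memory bus the way parallel memtest86+ instances can, and
it cannot test memory the OS itself is using.

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor


def copy_and_verify(args):
    """Copy a random buffer repeatedly; return how many copies failed to verify."""
    size_bytes, rounds = args
    source = os.urandom(size_bytes)
    expected = hashlib.sha256(source).digest()
    mismatches = 0
    for _ in range(rounds):
        # bytearray(source) forces a full read of source and a full write
        # of a new buffer, so the data makes a round trip through memory.
        copy = bytearray(source)
        if hashlib.sha256(copy).digest() != expected:
            mismatches += 1
    return mismatches


def stress(workers=None, size_bytes=64 * 1024 * 1024, rounds=20):
    """Run copy_and_verify in parallel; default is two workers per CPU core."""
    if workers is None:
        workers = 2 * (os.cpu_count() or 1)
    jobs = [(size_bytes, rounds)] * workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(copy_and_verify, jobs))


if __name__ == "__main__":
    # Any nonzero count means data changed somewhere between write and read.
    print("checksum mismatches:", stress())
```

On healthy hardware the mismatch count should stay at zero; a nonzero
count only says something went wrong somewhere between CPU, bus, and
RAM, not which component is at fault.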

