From: Russell Coker <russell@coker.com.au>
To: Duncan <1i5t5.duncan@cox.net>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change.
Date: Fri, 11 Jul 2014 22:33:17 +1000
Message-ID: <2422537.XRs88ffYHU@xev>
In-Reply-To: <pan$47a6b$a35f9006$c974e846$4fb869f0@cox.net>

On Fri, 11 Jul 2014 10:38:22 Duncan wrote:
> > I've moved all drives and move those to my main rig which got a nice
> > 16GB of ecc ram, so errors of ram, cpu, controller should be kept
> > theoretically eliminated.
> 
> It's worth noting that ECC RAM doesn't necessarily help when it's an in-
> transit bus error.  Some years ago I had one of the original 3-digit 
> Opteron machines, which of course required registered and thus ECC RAM.  
> The first RAM I purchased for that board was apparently borderline on its 
> timing certifications, and while it worked fine when the system wasn't 
> too stressed, including with memtest, which passed with flying colors, 
> under medium memory activity it would very occasionally give me, for 
> instance, a bad bzip2 csum, and with intensive memory activity, the 
> problem would be worse (more bz2 decompress errors, gcc would error out 
> too sometimes and I'd have to restart my build, very occasionally the 
> system would crash).

If bad RAM causes corrupt memory but no ECC error reports then it probably 
wouldn't be a bus error.  A bus error SHOULD give ECC reports.

One problem is that RAM errors aren't random.  From memory, the Hamming codes 
used fix 100% of single-bit errors, detect 100% of 2-bit errors, and let some 
3-bit errors through.  If you have a memory module with 3 chips on it (the 
later generation of DIMM for any given size) then an error in 1 chip can 
change 4 bits.
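
To make the 1/2/3-bit behaviour concrete, here is a minimal sketch using a
toy extended Hamming(8,4) SECDED code.  This is an assumption chosen for
illustration, not the code any particular memory controller uses; real ECC
DIMMs protect 64 data bits with 8 check bits, and the names encode, decode
and survey are made up for this example.

```python
from itertools import combinations


def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Bits are stored in a list indexed 0..7: positions 1, 2 and 4 hold the
    Hamming parity bits, positions 3, 5, 6 and 7 hold data, and position 0
    holds an overall parity bit.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: a double error, detect only.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


def survey(nflips):
    """Count decoder outcomes over every nibble and every nflips-bit error."""
    counts = {"right": 0, "detected": 0, "silently_wrong": 0}
    for nibble in range(16):
        for positions in combinations(range(8), nflips):
            word = encode(nibble)
            for p in positions:
                word[p] ^= 1
            status, value = decode(word)
            if status == "uncorrectable":
                counts["detected"] += 1
            elif value == nibble:
                counts["right"] += 1
            else:
                counts["silently_wrong"] += 1
    return counts


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "bit(s) flipped:", survey(n))
```

In this toy code every 1-bit error decodes correctly, every 2-bit error is
flagged, and every 3-bit error is "corrected" into wrong data without any
error report, because the decoder treats it as a single-bit error and flips
a fourth bit.  As far as I know, the wider shortened codes used on real
DIMMs catch some 3-bit patterns (the syndrome points at no valid bit), which
would match "some 3 bit errors through" rather than all of them.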

The other main problem is that if a read or write goes to the wrong address 
then you lose, as AFAIK there's no ECC on address lines.

But I still recommend ECC RAM; it just decreases the scope for problems.  
About half the serious problems I've had with BTRFS have been caused by a 
faulty DIMM...

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/



Thread overview: 4+ messages
2014-07-10 23:32 Btrfs transaction checksum corruption & losing root of the tree & bizarre UUID change Tomasz Kusmierz
2014-07-11  1:57 ` Austin S Hemmelgarn
2014-07-11 10:38 ` Duncan
2014-07-11 12:33   ` Russell Coker [this message]
