Re: BTRFS Data at Rest File Corruption

From: "Richard A. Lochner" <lochner@clone1.com>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: BTRFS Data at Rest File Corruption
Date: Thu, 12 May 2016 18:15:03 -0500
Message-ID: <1463094903.3636.129.camel@clone1.com>
In-Reply-To: <ebe609bb-3ce6-b929-97ef-ad323a254dc7@gmail.com>

Austin,

Ah, the idea of rewriting the "bad" data block is very interesting. I
had not thought of that.  Interestingly, the corrupted file is a raw
backup image of a btrfs file system partition. I can mount it as a loop
device.  I suppose I could rewrite that data block, mount it and run a
scrub on that mounted loop device to find out if it is truly fixed.
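
A minimal sketch of that rewrite step, assuming a verified copy of the
4096-byte block is already available in a separate file. The helper name
patch_block and the file name verified_block.bin are hypothetical; the
offset in the usage comment is the one the kernel reports in the quoted
log below. This is only an illustration of an in-place block overwrite
with dd, not a tested repair procedure.

```shell
# Hypothetical helper: overwrite exactly one block of TARGET in place.
# usage: patch_block GOOD_BLOCK TARGET BYTE_OFFSET [BLOCK_SIZE]
patch_block() {
    good=$1 target=$2 offset=$3 bs=${4:-4096}

    # dd seeks in whole blocks, so refuse offsets that are not aligned.
    if [ $((offset % bs)) -ne 0 ]; then
        echo "offset $offset is not a multiple of $bs" >&2
        return 1
    fi

    # The replacement must be exactly one block long.
    if [ $(($(wc -c < "$good"))) -ne "$bs" ]; then
        echo "$good is not exactly $bs bytes" >&2
        return 1
    fi

    # conv=notrunc leaves everything outside the block untouched.
    dd if="$good" of="$target" bs="$bs" count=1 \
       seek=$((offset / bs)) conv=notrunc 2>/dev/null
}

# Example (file names are placeholders):
# patch_block verified_block.bin Rick/sda4.img 75754369024
```

Because the write goes through the filesystem, btrfs would compute a fresh
checksum for the rewritten block, which is why the image would still need
to be loop-mounted and scrubbed afterwards to confirm its contents.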

I should also mention that this data is not critical to me.  I only
brought this issue up because I thought it might be of interest.  

I can think of ways to protect against most manifestations of this type
of error (since metadata is checksummed in btrfs), but I cannot argue
that it would be worth the development effort, increased code
complexity or the additional cpu cycles required to implement such a
"defensive" algorithm for an "edge case" like this.  Even with a
defensive algorithm, these errors could still occur, but I believe you
could shrink the time window in which they could occur enough to
significantly reduce their probability.

That said, I happen to have experienced this particular error twice
(over a period of about 7 months) with btrfs on this system.  I do
believe that both were due to memory errors and I plan to upgrade soon
to a Haswell system with ECC memory because of this. 

However, I wonder if my "commodity hardware" is that unique?

In any event, thank you very much for your time and insight.

Rick Lochner


On Thu, 2016-05-12 at 14:29 -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 13:49, Richard A. Lochner wrote:
> > 
> > Austin,
> > 
> > I rebooted the computer and reran the scrub to no avail.  The error
> > is consistent.
> > 
> > The reason I brought this question to the mailing list is because it
> > seemed like a situation that might be of interest to the developers.
> > Perhaps, there might be a way to "defend" against this type of
> > corruption.
> > 
> > I suspected, and I still suspect that the error occurred upon a
> > metadata update that corrupted the checksum for the file, probably
> > due to silent memory corruption.  If the checksum was silently
> > corrupted, it would be simply written to both drives causing this
> > type of error.
> That does seem to be the most likely cause, and sadly, is not
> something any filesystem can protect reliably against on any
> commodity hardware.
> > 
> > 
> > With that in mind, I proved (see below) that the data blocks match
> > on both mirrors.  This I expected since the data blocks should not
> > have been touched as the file has not been written.
> > 
> > This is the sequence of events as I see them that I think might be
> > of interest to the developers.
> > 
> > 1. A block containing a checksum for the file was read into memory.
> > The block read would have been checksummed, so the checksum for the
> > file must have been good at that moment.
> It's worth noting that BTRFS doesn't verify all the checksums in a
> metadata block when it loads that metadata block; only the ones for
> the reads that triggered the metadata block being loaded will get
> verified.
> > 
> > 
> > 2. The checksum block was then altered in memory (perhaps to add or
> > change a value).
> > 
> > 3. A new checksum would then have been calculated for the checksum
> > block.
> > 
> > 4. The checksum block would have been written to both mirrors.
> > 
> > Presumably, in the case that I am experiencing, an undetected memory
> > error must have occurred after step 1 and before step 3 was
> > completed.
> > 
> > I wonder if there is a way to correct or detect that situation.
> The closest we could get is to provide an option to handle this in
> scrub, preferably with a big scary warning on it as this same
> situation can be easily caused by someone modifying the disks
> themselves (we can't reasonably protect against that, but we
> shouldn't make it trivial for people to inject arbitrary data that
> way either).
> > 
> > 
> > As I stated previously, the machine on which this occurred does not
> > have ECC memory, however, I would not think that the majority of
> > users running btrfs do either.  If it has happened to me, it likely
> > has happened to others.
> > 
> > Rick Lochner
> > 
> > btrfs dmesg(s):
> > 
> > [16510.334020] BTRFS warning (device sdb1): checksum error at logical 3037444042752 on dev /dev/sdb1, sector 4988789496, root 259, inode 1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
> > [16510.334043] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
> > [16510.345662] BTRFS error (device sdb1): unable to fixup (regular) error at logical 3037444042752 on dev /dev/sdb1
> > 
> > [17606.978439] BTRFS warning (device sdb1): checksum error at logical 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259, inode 1437377, offset 75754369024, length 4096, links 1 (path: Rick/sda4.img)
> > [17606.978460] BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 13, flush 0, corrupt 4, gen 0
> > [17606.989497] BTRFS error (device sdb1): unable to fixup (regular) error at logical 3037444042752 on dev /dev/sdc1
> > 
> > How I compared the data blocks:
> > 
> > #btrfs-map-logical -l 3037444042752  /dev/sdc1
> > mirror 1 logical 3037444042752 physical 2554240299008 device /dev/sdc1
> > mirror 1 logical 3037444046848 physical 2554240303104 device /dev/sdc1
> > mirror 2 logical 3037444042752 physical 2554260221952 device /dev/sdb1
> > mirror 2 logical 3037444046848 physical 2554260226048 device /dev/sdb1
> > 
> > #dd if=/dev/sdc1 bs=1 skip=2554240299008 count=4096 of=c1
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0292201 s, 140 kB/s
> > 
> > #dd if=/dev/sdc1 bs=1 skip=2554240303104 count=4096 of=c2
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0142381 s, 288 kB/s
> > 
> > #dd if=/dev/sdb1 bs=1 skip=2554260221952 count=4096 of=b1
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0293211 s, 140 kB/s
> > 
> > #dd if=/dev/sdb1 bs=1 skip=2554260226048 count=4096 of=b2
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0151947 s, 270 kB/s
> > 
> > #diff b1 c1
> > #diff b2 c2
> Excellent thinking here.
> 
> Now, if you can find some external method to verify that that block
> is in fact correct, you can just write it back into the file itself
> at the correct offset, and fix the issue.
>
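
The mirror comparison quoted above can be sketched more compactly. The
helper names below (sector_to_bytes, read_block, same_block) are
hypothetical. The sketch assumes the "sector" values in the kernel log
are 512-byte units, which is consistent with the physical offsets printed
by btrfs-map-logical, and it relies on those offsets being multiples of
4096 so that dd can read one whole block instead of 4096 single bytes.

```shell
# Convert a sector number from the kernel log to a byte offset,
# assuming 512-byte sectors.
sector_to_bytes() {
    echo $(($1 * 512))
}

# Copy LEN bytes starting at byte OFFSET of SRC into OUT.
# usage: read_block SRC OFFSET LEN OUT
read_block() {
    src=$1 offset=$2 len=$3 out=$4
    # dd skips in whole blocks, so the offset must be a multiple of LEN.
    if [ $((offset % len)) -ne 0 ]; then
        echo "offset $offset is not a multiple of $len" >&2
        return 1
    fi
    dd if="$src" of="$out" bs="$len" skip=$((offset / len)) count=1 2>/dev/null
}

# Succeed only if the two blocks are byte-for-byte identical.
# usage: same_block SRC_A OFFSET_A SRC_B OFFSET_B LEN
same_block() {
    a=$(mktemp) b=$(mktemp)
    if read_block "$1" "$2" "$5" "$a" &&
       read_block "$3" "$4" "$5" "$b" &&
       cmp -s "$a" "$b"; then
        rc=0
    else
        rc=1
    fi
    rm -f "$a" "$b"
    return $rc
}

# Values taken from the quoted log and btrfs-map-logical output:
# sector_to_bytes 4988789496    prints 2554260221952 (mirror on /dev/sdb1)
# sector_to_bytes 4988750584    prints 2554240299008 (mirror on /dev/sdc1)
# same_block /dev/sdc1 2554240299008 /dev/sdb1 2554260221952 4096
```

Identical blocks on both mirrors only show that the two copies agree with
each other, not that either matches what was originally written, which is
why an external reference for the block is still needed.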
