Subject: Re: BTRFS Data at Rest File Corruption
From: "Richard A. Lochner"
To: "Austin S. Hemmelgarn", Btrfs BTRFS
Date: Thu, 12 May 2016 18:15:03 -0500

Austin,

Ah, the idea of rewriting the "bad" data block is very interesting.  I
had not thought of that.  Interestingly, the corrupted file is a raw
backup image of a btrfs file system partition, so I can mount it as a
loop device.  I suppose I could rewrite that data block, mount the
image, and run a scrub on the mounted loop device to find out whether
it is truly fixed.

I should also mention that this data is not critical to me.  I only
brought the issue up because I thought it might be of interest.  I can
think of ways to protect against most manifestations of this type of
error (since metadata is checksummed in btrfs), but I cannot argue that
it would be worth the development effort, the increased code
complexity, or the additional CPU cycles required to implement such a
"defensive" algorithm for an "edge case" like this.  Even with a
defensive algorithm, these errors could still occur, but I believe the
time window in which they can occur could be shrunk enough to
significantly reduce their probability.

That said, I happen to have experienced this particular error twice
(over a period of about seven months) with btrfs on this system.  I do
believe both were due to memory errors, and I plan to upgrade soon to a
Haswell system with ECC memory because of this.  However, I wonder if
my "commodity hardware" is really that unique.

In any event, thank you very much for your time and insight.

Rick Lochner

On Thu, 2016-05-12 at 14:29 -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 13:49, Richard A. Lochner wrote:
> >
> > Austin,
> >
> > I rebooted the computer and reran the scrub to no avail.  The error
> > is consistent.
> >
> > The reason I brought this question to the mailing list is that it
> > seemed like a situation that might be of interest to the developers.
> > Perhaps there might be a way to "defend" against this type of
> > corruption.
> >
> > I suspected, and I still suspect, that the error occurred upon a
> > metadata update that corrupted the checksum for the file, probably
> > due to silent memory corruption.  If the checksum was silently
> > corrupted, it would simply be written to both drives, causing this
> > type of error.
> That does seem to be the most likely cause, and sadly, it is not
> something any filesystem can reliably protect against on commodity
> hardware.
> >
> > With that in mind, I proved (see below) that the data blocks match
> > on both mirrors.  This I expected, since the data blocks should not
> > have been touched as the file has not been written.
> >
> > This is the sequence of events, as I see them, that I think might be
> > of interest to the developers.
> >
> > 1. A block containing a checksum for the file was read into memory.
> > The block read would have been checksummed, so the checksum for the
> > file must have been good at that moment.
> It's worth noting that BTRFS doesn't verify all the checksums in a
> metadata block when it loads that metadata block; only the ones for
> the reads that triggered the metadata block being loaded get
> verified.
> >
> > 2. The checksum block was then altered in memory (perhaps to add or
> > change a value).
> >
> > 3. A new checksum would then have been calculated for the checksum
> > block.
> >
> > 4. The checksum block would have been written to both mirrors.
> >
> > Presumably, in the case that I am experiencing, an undetected memory
> > error must have occurred after step 1 and before step 3 was
> > completed.
> >
> > I wonder if there is a way to correct or detect that situation.
> The closest we could get is to provide an option to handle this in
> scrub, preferably with a big scary warning on it, as this same
> situation can easily be caused by someone modifying the disks
> themselves (we can't reasonably protect against that, but we
> shouldn't make it trivial for people to inject arbitrary data that
> way either).
> >
> > As I stated previously, the machine on which this occurred does not
> > have ECC memory; however, I would not think that the majority of
> > users running btrfs do either.  If it has happened to me, it likely
> > has happened to others.
> >
> > Rick Lochner
> >
> > btrfs dmesg(s):
> >
> > [16510.334020] BTRFS warning (device sdb1): checksum error at
> > logical 3037444042752 on dev /dev/sdb1, sector 4988789496, root 259,
> > inode 1437377, offset 75754369024, length 4096, links 1 (path:
> > Rick/sda4.img)
> > [16510.334043] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0,
> > rd 0, flush 0, corrupt 5, gen 0
> > [16510.345662] BTRFS error (device sdb1): unable to fixup (regular)
> > error at logical 3037444042752 on dev /dev/sdb1
> >
> > [17606.978439] BTRFS warning (device sdb1): checksum error at
> > logical 3037444042752 on dev /dev/sdc1, sector 4988750584, root 259,
> > inode 1437377, offset 75754369024, length 4096, links 1 (path:
> > Rick/sda4.img)
> > [17606.978460] BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0,
> > rd 13, flush 0, corrupt 4, gen 0
> > [17606.989497] BTRFS error (device sdb1): unable to fixup (regular)
> > error at logical 3037444042752 on dev /dev/sdc1
> >
> > How I compared the data blocks:
> >
> > #btrfs-map-logical -l 3037444042752 /dev/sdc1
> > mirror 1 logical 3037444042752 physical 2554240299008 device /dev/sdc1
> > mirror 1 logical 3037444046848 physical 2554240303104 device /dev/sdc1
> > mirror 2 logical 3037444042752 physical 2554260221952 device /dev/sdb1
> > mirror 2 logical 3037444046848 physical 2554260226048 device /dev/sdb1
> >
> > #dd if=/dev/sdc1 bs=1 skip=2554240299008 count=4096 of=c1
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0292201 s, 140 kB/s
> >
> > #dd if=/dev/sdc1 bs=1 skip=2554240303104 count=4096 of=c2
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0142381 s, 288 kB/s
> >
> > #dd if=/dev/sdb1 bs=1 skip=2554260221952 count=4096 of=b1
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0293211 s, 140 kB/s
> >
> > #dd if=/dev/sdb1 bs=1 skip=2554260226048 count=4096 of=b2
> > 4096+0 records in
> > 4096+0 records out
> > 4096 bytes (4.1 kB) copied, 0.0151947 s, 270 kB/s
> >
> > #diff b1 c1
> > #diff b2 c2
> Excellent thinking here.
>
> Now, if you can find some external method to verify that that block is
> in fact correct, you can just write it back into the file itself at
> the correct offset, and fix the issue.
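
For reference, the rewrite I have in mind would look roughly like the
following.  It assumes the b1 block I extracted above really is the
correct data (verified by some external means), and it assumes the
filesystem holding the image is mounted at /mnt/pool and a spare mount
point /mnt/img exists; both paths are stand-ins.  Per the dmesg output,
the bad block sits at file offset 75754369024 (= 18494719 * 4096) in
Rick/sda4.img, length 4096, so writing it back in place without
truncating the image would be something like:

#dd if=b1 of=/mnt/pool/Rick/sda4.img bs=4096 seek=18494719 count=1 conv=notrunc

Since that goes through the normal write path, btrfs should CoW the
block and generate a fresh checksum for it.  To check whether the data
is actually good, I could then loop-mount the image and scrub it,
substituting whatever loop device losetup reports:

#losetup -f --show /mnt/pool/Rick/sda4.img
#mount /dev/loop0 /mnt/img
#btrfs scrub start -B /mnt/img

A scrub of the parent filesystem afterwards should confirm that the
original checksum error is gone.  Untested on my end, so take it as a
sketch rather than a recipe.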