* Recovering from csum errors
From: Rain Maker @ 2013-09-02 21:41 UTC
To: linux-btrfs
Hello list,
I was greeted by the following errors in my syslog:
Sep 2 23:06:08 laptop kernel: [ 7340.809551] btrfs: checksum error at
logical 271008116736 on dev /dev/dm-0, sector 540863448, root 442,
inode 1508, offset 10128658432, length 4096, links 1 (path:
Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.809562] btrfs: bdev /dev/dm-0
errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.809565] btrfs: unable to fixup
(regular) error at logical 271008116736 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.814266] btrfs: checksum error at
logical 271008120832 on dev /dev/dm-0, sector 540863456, root 442,
inode 1508, offset 10128662528, length 4096, links 1 (path:
Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.814278] btrfs: bdev /dev/dm-0
errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.814283] btrfs: unable to fixup
(regular) error at logical 271008120832 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.815205] btrfs: checksum error at
logical 271008124928 on dev /dev/dm-0, sector 540863464, root 442,
inode 1508, offset 10128666624, length 4096, links 1 (path:
Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.815212] btrfs: bdev /dev/dm-0
errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.815214] btrfs: unable to fixup
(regular) error at logical 271008124928 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.816107] btrfs: checksum error at
logical 271008129024 on dev /dev/dm-0, sector 540863472, root 442,
inode 1508, offset 10128670720, length 4096, links 1 (path:
Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.816111] btrfs: bdev /dev/dm-0
errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.816113] btrfs: unable to fixup
(regular) error at logical 271008129024 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.816882] btrfs: checksum error at
logical 271008133120 on dev /dev/dm-0, sector 540863480, root 442,
inode 1508, offset 10128674816, length 4096, links 1 (path:
Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.816887] btrfs: bdev /dev/dm-0
errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.816889] btrfs: unable to fixup
(regular) error at logical 271008133120 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.817672] btrfs: bdev /dev/dm-0
errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.817676] btrfs: unable to fixup
(regular) error at logical 271008137216 on dev /dev/dm-0
So I ran a full scrub and, luckily, it found only these 6 csum
errors. The damage therefore seems to be contained to "just" one
file.
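(For reference, that check is roughly the following -- the mount point
is an example, and "btrfs device stats" needs reasonably recent
btrfs-progs:)
  $ btrfs scrub start /mnt/data    # kick off a full scrub
  $ btrfs scrub status /mnt/data   # poll until it reports error totals
  $ btrfs device stats /mnt/data   # per-device counters, matching the
                                   # "errs: wr 0, rd 0, ..." lines above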
Now, I removed the offending file. But is there something else I
should have done to recover the data in this file? Can it be
recovered?
I'm running 3.11-rc7. It is a single-disk btrfs filesystem. I have
several subvolumes defined, one of which is for VMware Workstation
(on which the corruption took place).
I checked the SMART values; they all seem OK. The hard disks in this
machine are less than a month old. I replaced them after seeing
similar messages on the "old" disks.
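(E.g. with smartctl -- the device name is an example; note that
/dev/dm-0 is a device-mapper node, so SMART has to be read from the
underlying physical disk:)
  $ smartctl -H -A /dev/sda   # overall health plus raw attribute table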
Is the only logical explanation for this some kind of hardware failure
(SATA controller, power supply...), or could there be something more
to this?
Sincerely,
Roel Brook
* Re: Recovering from csum errors
From: Hugo Mills @ 2013-09-02 22:00 UTC
To: Rain Maker; +Cc: linux-btrfs
On Mon, Sep 02, 2013 at 11:41:12PM +0200, Rain Maker wrote:
> Hello list,
>
> So I ran a full scrub and, luckily, it found only these 6 csum
> errors. The damage therefore seems to be contained to "just" one
> file.
>
> Now, I removed the offending file. But is there something else I
> should have done to recover the data in this file? Can it be
> recovered?
No, and no. The data's failing a checksum, so it's basically
broken. If you had a btrfs RAID-1 configuration, the FS would be able
to recover from one broken copy using the other (good) copy.
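(For the record, a two-device btrfs RAID-1 is made along these lines --
device names are examples -- and a scrub can then rewrite a failing
copy from the good mirror:)
  $ mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc   # mirror metadata + data
  $ mount /dev/sdb /mnt
  $ btrfs scrub start /mnt   # csum failures get repaired from the mirror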
> I'm running 3.11-rc7. It is a single-disk btrfs filesystem. I have
> several subvolumes defined, one of which is for VMware Workstation
> (on which the corruption took place).
Aaah, the VM workload could explain this. There are some (known,
won't-fix) issues with (I think) direct-IO in VM guests that can cause
bad checksums to be written under some circumstances.
I'm not 100% certain, but I _think_ that making your VM images
nocow (create an empty file with touch; use chattr +C; extend the file
to the right size) may help prevent these problems.
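(Concretely, something like this -- filename and size are examples;
the +C has to be set while the file is still empty:)
  $ touch disk.vmdk             # create the file with no data in it yet
  $ chattr +C disk.vmdk         # mark it nocow before anything is written
  $ truncate -s 60G disk.vmdk   # extend it to the size the VM needs
  $ lsattr disk.vmdk            # the 'C' flag should now be listed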
> I checked the SMART values; they all seem OK. The hard disks in this
> machine are less than a month old. I replaced them after seeing
> similar messages on the "old" disks.
>
> Is the only logical explanation for this some kind of hardware failure
> (SATA controller, power supply...), or could there be something more
> to this?
As above, there are some direct-IO problems with data changing
in-flight that can lead to bad checksums. Fixing the issue would cause
some fairly serious slow-downs in performance for that case, which is
rather against what direct-IO is trying to do, so I think it's
unlikely the behaviour will be changed.
Of course, I could be completely wrong about all this, and you've
got bad RAM, a bad PSU, or something...
Hugo.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- "What are we going to do tonight?" "The same thing we do ---
every night, Pinky. Try to take over the world!"
* Re: Recovering from csum errors
From: Rain Maker @ 2013-09-02 22:28 UTC
To: Hugo Mills, Rain Maker, linux-btrfs
First of all, thanks for the quick response. Reply inline.
2013/9/3 Hugo Mills <hugo@carfax.org.uk>:
> On Mon, Sep 02, 2013 at 11:41:12PM +0200, Rain Maker wrote:
>> Now, I removed the offending file. But is there something else I
>> should have done to recover the data in this file? Can it be
>> recovered?
>
> No, and no. The data's failing a checksum, so it's basically
> broken. If you had a btrfs RAID-1 configuration, the FS would be able
> to recover from one broken copy using the other (good) copy.
>
Of course, this makes sense.
I know filesystem recovery in BTRFS is incomplete. I'm arguing for an
override for these use cases. I mean, the filesystem still knows the
checksum. There are 2 possibilities:
- The checksum is wrong
- The data is wrong
If the checksum is wrong, why is there no way to recalculate the
checksum and continue with the file (accepting small corruptions)? In
this case (and, I believe, in more cases), it's a VM. I could have run
Windows chkdsk from the VM to see what I could have salvaged.
If the data is wrong, a reverse CRC32 algorithm could be implemented.
Most likely only a few bytes got "flipped". On modern hardware, it
shouldn't take much time to brute-force the checksum, especially
considering we have a good starting guess (the raw, corrupted data).
Now, the VM I removed did not have any special data in it (+ I make
backups), but it could've been much worse.
>> I'm running 3.11-rc7. It is a single-disk btrfs filesystem. I have
>> several subvolumes defined, one of which is for VMware Workstation
>> (on which the corruption took place).
>
> Aaah, the VM workload could explain this. There are some (known,
> won't-fix) issues with (I think) direct-IO in VM guests that can cause
> bad checksums to be written under some circumstances.
>
> I'm not 100% certain, but I _think_ that making your VM images
> nocow (create an empty file with touch; use chattr +C; extend the file
> to the right size) may help prevent these problems.
>
Hmm, could try that. Thanks for the tip.
I could also disable writeback cache on the VM. But VMware uses its
own "vmblock" kernel module for I/O, so I'm not sure if this would do
any good. Then of course, there's the performance hit.
>> Is the only logical explanation for this some kind of hardware failure
>> (SATA controller, power supply...), or could there be something more
>> to this?
>
> As above, there are some direct-IO problems with data changing
> in-flight that can lead to bad checksums. Fixing the issue would cause
> some fairly serious slow-downs in performance for that case, which is
> rather against what direct-IO is trying to do, so I think it's
> unlikely the behaviour will be changed.
>
> Of course, I could be completely wrong about all this, and you've
> got bad RAM, a bad PSU, or something...
>
> Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
> PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
> --- "What are we going to do tonight?" "The same thing we do ---
> every night, Pinky. Try to take over the world!"
* Re: Recovering from csum errors
From: Duncan @ 2013-09-03 8:54 UTC
To: linux-btrfs
Rain Maker posted on Tue, 03 Sep 2013 00:28:30 +0200 as excerpted:
> 2013/9/3 Hugo Mills <hugo@carfax.org.uk>:
>> On Mon, Sep 02, 2013 at 11:41:12PM +0200, Rain Maker wrote:
>>> Now, I removed the offending file. But is there something else I
>>> should have done to recover the data in this file? Can it be
>>> recovered?
>>
>> No, and no. The data's failing a checksum, so it's basically
>> broken. If you had a btrfs RAID-1 configuration, the FS would be able
>> to recover from one broken copy using the other (good) copy.
>>
> Of course, this makes sense.
>
> I know filesystem recovery in BTRFS is incomplete. I'm arguing for an
> override for these use cases. I mean, the filesystem still knows the
> checksum. There are 2 possibilities:
> - The checksum is wrong
> - The data is wrong
>
> If the checksum is wrong, why is there no way to recalculate the
> checksum and continue with the file (accepting small corruptions)? In
> this case (and, I believe, in more cases), it's a VM. I could have run
> Windows chkdsk from the VM to see what I could have salvaged.
AFAIK chkdsk wouldn't have returned an error, because from its point of
view, the data is probably correct. The issue, as stated, is (AFAIK
proprietary, blackbox-unpatchable from a freedomware perspective) VMware
changing data under direct-IO "in-flight", which breaks the intent and
rules of direct-IO, at least as defined for Linux. The previous
discussion I've seen of the problem indicates that MS allows such
changes, apparently choosing to take the speed hit for doing so. So it's
an impedance mismatch between the VM and physical-machine layers: one is
proprietary and thus unfixable from a FLOSS perspective, and the other
is unwilling to take a general-case slowdown for a proprietary special
case that breaks the intent of direct-IO, and thus the rules for it, in
the first place.
It's worth noting that in the normal non-direct-IO case, there's no
problem; the data is allowed to change and the checksum is simply
recalculated. But the entire purpose of direct-IO is to short-cut a lot
of the care taken in the normal path in the interest of performance, when
the user knows it can guarantee certain conditions are met. The problem
here is that direct-IO is being used, but the user breaks the guarantee
it made by choosing direct-IO in the first place: it changes, in-flight,
data that is supposed to be stable once committed to the direct-IO path.
(Just because it happened to work with ext3/4, etc, because they didn't
do checksums and thus didn't actually rely on the level of guarantee
being made, doesn't obligate other filesystems to do the same,
particularly when one of their major features is checksummed data
integrity, as is the case with btrfs.)
So because the data under direct-IO was changed in-flight, after the
btrfs checksum had already been calculated, the MS side should indeed
show it to be correct -- only the btrfs side will show as wrong, since
the data changed after it calculated its checksum, thus breaking the
rules for direct-IO under Linux.
The "proper" fix would thus be in vmware or possibly in the MS software
running on top of it. It should either not change the data in-flight if
it's going to use direct-IO and by doing so make the guarantee that the
data won't change in-flight, or should not use direct-IO if it's going to
be changing the data in-flight and thus can't make that guarantee. But
of course that's not within the Linux/FLOSS world's control.
> If the data is wrong, a reverse CRC32 algorithm could be implemented.
> Most likely only a few bytes got "flipped". On modern hardware, it
> shouldn't take much time to brute-force the checksum, especially
> considering we have a good starting guess (the raw, corrupted data).
But... that flips the entire reason for choosing direct-IO in the first
place -- performance -- on its head, incurring a **HUGE** slowdown just
to fix up a broken program that can't keep the guarantees it chose to
make, to try to gain just a bit of performance.
By analogy, normal-IO might be considered surface shipping from China
to the US, with direct-IO shipping by air. The packages/data arrive by
air, but they're found to be broken in shipping because the packer
didn't use the padding specified by the air carrier. Instead of
proposing the problem be fixed by actually padding as specified by the
carrier, or by choosing the slower but more careful surface carrier,
you're now proposing we send them to Mars (!!) and back to be fixed!
> Now, the VM I removed did not have any special data in it (+ I make
> backups), but it could've been much worse.
>
>>> I have several subvolumes defined, one of which is for VMware
>>> Workstation (on which the corruption took place).
>>
>> Aaah, the VM workload could explain this. There are some (known,
>> won't-fix) issues with (I think) direct-IO in VM guests that can cause
>> bad checksums to be written under some circumstances.
>>
>> I'm not 100% certain, but I _think_ that making your VM images
>> nocow (create an empty file with touch; use chattr +C; extend the file
>> to the right size) may help prevent these problems.
>>
> Hmm, could try that. Thanks for the tip.
I'm similarly not 100% certain, but from (I believe accurate) memory, it
was indeed nocow (nodatacow in terms of mount options). The actual
desired feature would be nodatasum, but AFAIK that's only available as a
mount option, not as a per-file attribute. And since those mount options
currently apply to the entire filesystem, not just a subvolume, and
checksumming is one of the big reasons you'd use btrfs in the first
place, turning it off for the entire filesystem probably isn't what you
want. But since nodatacow/nocow implies nodatasum, turning off COW on
the file also turns off checksumming, so it should do what you need, even
if it does a bit more as well.
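(For completeness, the filesystem-wide form is just a mount option --
device and mount point are examples -- but as said, that turns off
checksumming everywhere:)
  $ mount -o nodatacow /dev/sdb /mnt   # whole filesystem, every subvolume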
But nocow for a file containing a VM is almost certainly a good idea
anyway, since the file-internal write pattern of VMs is such that the
file would very likely otherwise end up hugely fragmented over time. So
it's probably what you want in the first place. =:^)
Of course, you could look up the previous thread in the list archives
if you want the original discussion.
Meanwhile, as an alternative to the touch/chattr/extend routine
(ordinarily necessary since nocow won't fix data that's already written),
you can set nodatacow on the subdir the file will be created in, and
(based on what I've read, I'm an admin not a developer myself and thus
haven't actually read the code) all new files in that subdir should
automatically inherit the nocow attribute. That's what I'd probably do.
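(Something like the following -- the path is an example, and only
files created *after* the attribute is set should inherit it:)
  $ mkdir /mnt/vms        # directory that will hold the VM images
  $ chattr +C /mnt/vms    # new files in here should inherit nocow
  $ lsattr -d /mnt/vms    # verify the 'C' flag on the directory itself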
> I could also disable writeback cache on the VM. But VMware uses its
> own "vmblock" kernel module for I/O, so I'm not sure if this would do
> any good. Then of course, there's the performance hit.
Well, considering that by analogy you've proposed after-the-fact shipping
to Mars and back to fix the breakage, choosing surface shipping vs. air
shipment should be entirely insignificant, performance-wise. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Recovering from csum errors
From: David MacKinnon @ 2013-09-03 9:26 UTC
To: linux-btrfs
On 3 September 2013 18:54, Duncan <1i5t5.duncan@cox.net> wrote:
>
> > If the data is wrong, a reverse CRC32 algorithm could be
> > implemented. Most likely only a few bytes got "flipped".
>
...
>
> But... that flips the entire reason for choosing direct-IO in the first
> place -- performance -- on its head, incurring a **HUGE** slowdown just
Not wanting to put words in the original poster's mouth, but I read
that as an offline recovery method (scrub?) rather than a real-time
recovery attempt. If the frequency of errors is low, then for certain
purposes, accepting a few errors when you have a recovery option might
be acceptable.
As mentioned, nocow is probably best for VM images anyhow, but still :)
-David
* Re: Recovering from csum errors
From: Duncan @ 2013-09-03 16:08 UTC
To: linux-btrfs
David MacKinnon posted on Tue, 03 Sep 2013 19:26:10 +1000 as excerpted:
> On 3 September 2013 18:54, Duncan <1i5t5.duncan@cox.net> wrote:
>>
>> > If the data is wrong, a reverse CRC32 algorithm could be
>> > implemented. Most likely only a few bytes got "flipped".
>>
>> But... that flips the entire reason for choosing direct-IO in the first
>> place -- performance -- on its head, incurring a **HUGE** slowdown just
>
> Not wanting to put words in the original poster's mouth, but I read
> that as an offline recovery method (scrub?) rather than a real-time
> recovery attempt. If the frequency of errors is low, then for certain
> purposes, accepting a few errors when you have a recovery option
> might be acceptable.
You might be right. Tho there's already scrub available... it just
requires a second, hopefully valid, copy to work from. Which is what
btrfs raid1 mode is all about, and why I chose to run it. =:^)
It would be nice to be able to say "accept the invalid data" if it's
not deemed critical and isn't so corrupted it's entirely invalid, which
was something the poster suggested. And in a way, that's what nocow
does, by way of nodatasum; it just has to be set up before the fact;
there's (currently) no way to make it work after the damage has occurred.
But I don't believe brute-forcing a correct CRC match is as feasible as
the poster suggested. And even if a proper match is found, what's to
say it's the /correct/ match?
Meanwhile, even if brute-forcing a match /is/ possible, in this
particular case, it'd likely crash the VM or otherwise cause at the very
least invalid results if not horrible VM corruption, because the written
data was very likely correct, just changed after btrfs calculated the
checksum. So changing it back to what btrfs calculated the checksum on,
even if possible, would actually corrupt the data from the VM's
perspective, and then the VM would be acting on that corrupt data, which
would certainly have unexpected and very possibly horribly bad results.
> As mentioned, nocow is probably best for VM images anyhow, but still :)
Agreed on that. If the VM insists on breaking the rules and scribbling
over its own data, just don't do the checksumming and LET it scribble
over its own data if that's what it wants to do and as long as it doesn't
try to scribble over anything that's NOT its data to scribble over. If
it breaks in pieces as a result, it gets to keep 'em. =:^\
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman