From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Recovering from csum errors
Date: Tue, 3 Sep 2013 08:54:01 +0000 (UTC)
Message-ID: <pan$97a5$71cfb7f0$286a61f9$62ed2e15@cox.net>
In-Reply-To: <CAD+_0YqsQRO90Lx1R64h8EE-L=4zrE6CEQGDKy-h=92hLLptWw@mail.gmail.com>

Rain Maker posted on Tue, 03 Sep 2013 00:28:30 +0200 as excerpted:
> 2013/9/3 Hugo Mills <hugo@carfax.org.uk>:
>> On Mon, Sep 02, 2013 at 11:41:12PM +0200, Rain Maker wrote:
>>> Now, I removed the offending file. But is there something else I
>>> should have done to recover the data in this file? Can it be
>>> recovered?
>>
>> No, and no. The data's failing a checksum, so it's basically
>> broken. If you had a btrfs RAID-1 configuration, the FS would be able
>> to recover from one broken copy using the other (good) copy.
>>
> Ofcourse, this makes sense.
>
> I know filesystem recovery in BTRFS is incomplete. I'm opting for a
> override for these usecases. I mean; the filesystem still knows the
> checksum. There are 2 possibilities:
> - The checksum is wrong
> - The data is wrong
>
> In case the checksum is wrong, why is there no possibility to
> recalculate the checksum and continue with the file (taking small
> corruptions for granted)? In this case (and, I believe, in more cases),
> it's a VM. I could have run Windows chkdsk from the VM to see what I
> could have salvaged.

AFAIK chkdsk wouldn't have returned an error, because from its point of
view the data is probably correct. The issue, as stated, is vmware
(AFAIK proprietary, and thus a black box that can't be patched from a
freedomware perspective) changing data under direct-IO "in-flight", which
breaks the intent and rules of direct-IO, at least as defined for Linux.
The previous discussion I've seen of the problem indicates that MS allows
such changes, apparently choosing to take the speed hit for doing so. So
it's an impedance mismatch between the VM and physical-machine layers.
One layer is proprietary and thus unfixable from a FLOSS perspective. The
other is unwilling to take the general-case slowdown for a proprietary
special case that breaks the intent of direct-IO, and thus the rules for
it, in the first place.

It's worth noting that in the normal non-direct-IO case, there's no
problem; the data is allowed to change and the checksum is simply
recalculated. But the entire purpose of direct-IO is to short-cut a lot
of the care taken in the normal path in the interest of performance, when
the user knows it can guarantee certain conditions are met. The problem
here is that direct-IO is being used, but the user is breaking the
guarantee it chose to make by choosing to use direct-IO in the first
place, then changing data in-flight that is guaranteed to be stable once
committed to the direct-IO path.

(Just because it happened to work with ext3/4, etc, because they didn't
do checksums and thus didn't actually rely on the level of guarantee
being made, doesn't obligate other filesystems to do the same,
particularly when one of their major features is checksummed data
integrity, as is the case with btrfs.)

So because the data under direct-IO was changed in-flight, after the
btrfs checksum had already been calculated, the MS side should indeed
show it to be correct -- only the btrfs side will show as wrong, since
the data changed after it calculated its checksum, thus breaking the
rules for direct-IO under Linux.
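
To make that sequence concrete, here is a minimal sketch in Python. It is
an illustration only, not btrfs code: zlib.crc32 stands in for the CRC32C
that btrfs actually uses, and the 4096-byte block size and the buffer
contents are assumptions picked for the example.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def checksum(block: bytes) -> int:
    # Stand-in checksum: btrfs uses CRC32C, which is a different polynomial.
    return zlib.crc32(block)


# The writer hands this buffer to the filesystem via direct-IO.
buf = bytearray(b"A" * BLOCK_SIZE)

# 1. The filesystem checksums the buffer when the write is submitted.
stored_csum = checksum(bytes(buf))

# 2. The writer breaks its direct-IO promise and changes the buffer
#    while the write is still in flight.
buf[100] = ord("B")

# 3. The device receives whatever is in the buffer at transfer time.
on_disk = bytes(buf)

# 4. On a later read, the recomputed checksum no longer matches the stored one.
csum_ok = checksum(on_disk) == stored_csum
print("checksum matches on read:", csum_ok)
```

Note that the bytes on disk are exactly what the writer last put in the
buffer, which is why the guest side sees nothing wrong.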

The "proper" fix would thus be in vmware or possibly in the MS software
running on top of it. It should either not change the data in-flight if
it's going to use direct-IO and by doing so make the guarantee that the
data won't change in-flight, or should not use direct-IO if it's going to
be changing the data in-flight and thus can't make that guarantee. But
of course that's not within the Linux/FLOSS world's control.

> In case the data is wrong, there may be a reverse CRC32 algorithm
> implemented. Most likely it's only several bytes which got "flipped".
> On modern hardware, it shouldn't take that much time to brute-force the
> checksum, especially considering we have a good guess (the raw,
> corrupted data).

But... that flips the entire reason for choosing direct-IO in the first
place -- performance -- on its head, incurring a **HUGE** slowdown just
to fix up a broken program that can't keep the guarantees it chose to
make, to try to gain just a bit of performance.

By analogy, normal-IO might be considered surface shipping from China to
the US, with direct-IO being shipping by air. The packages/data arrive by
air and are found to be broken, because the packer didn't use the padding
the air carrier specifies. But instead of proposing that the packer pad
as specified, or choose the slower but more careful surface carrier,
you're now proposing we send the packages to Mars (!!) and back to be
fixed!
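
For a rough idea of what the proposed brute-force would involve, here is
a hedged sketch of the simplest case only: exactly one flipped bit in a
block. As above, zlib.crc32 is a stand-in for the CRC32C btrfs uses, and
find_single_bit_fix is a hypothetical helper written for this example,
not anything that exists in btrfs or btrfs-progs.

```python
import zlib


def find_single_bit_fix(block: bytes, expected_csum: int):
    """Try every single-bit flip; return the repaired block, or None."""
    candidate = bytearray(block)
    for bit in range(len(candidate) * 8):
        byte_index, offset = divmod(bit, 8)
        candidate[byte_index] ^= 1 << offset  # flip one bit
        if zlib.crc32(candidate) == expected_csum:
            return bytes(candidate)
        candidate[byte_index] ^= 1 << offset  # no match, undo the flip
    return None


# Assumed test data: a 4096-byte block with one bit flipped after checksumming.
original = bytes(range(256)) * 16
good_csum = zlib.crc32(original)

corrupted = bytearray(original)
corrupted[1543] ^= 0x02

repaired = find_single_bit_fix(bytes(corrupted), good_csum)
print("repaired:", repaired == original)
```

Even this simplest case costs up to 32768 full checksum passes for one
4096-byte block. Allowing two flipped bits means roughly half a billion
candidates per block, and an in-flight rewrite isn't limited to a couple
of bits in the first place.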

> Now, the VM I removed did not have any special data in it (+ I make
> backups), but it could've been much worse.
>
>>> I have several subvolumes defined, one of which for VMWare
>>> Workstation (on which the corruption took place).
>>
>> Aaah, the VM workload could explain this. There's some (known,
>> won't-fix) issues with (I think) direct-IO in VM guests that can cause
>> bad checksums to be written under some circumstances.
>>
>> I'm not 100% certain, but I _think_ that making your VM images
>> nocow (create an empty file with touch; use chattr +C; extend the file
>> to the right size) may help prevent these problems.
>>
> Hmm, could try that. Thanks for the tip.

I'm similarly not 100% certain, but from (I believe accurate) memory, it
was indeed nocow (nodatacow in terms of mount options). The actual
desired feature would be nodatasum, but AFAIK that's only available as a
mount option, not as a per-file attribute. And since those mount options
currently apply to the entire filesystem, not just a subvolume, and
checksumming is one of the big reasons you'd use btrfs in the first
place, turning it off for the entire filesystem probably isn't what you
want. But since nodatacow/nocow implies nodatasum, turning off COW on
the file also turns off checksumming, so it should do what you need, even
if it does a bit more as well.

But nocow for a file containing a VM is almost certainly a good idea
anyway, since the file-internal write pattern of VMs is such that the
file would very likely otherwise end up hugely fragmented over time. So
it's probably what you want in the first place. =:^)

Of course, you can look up the previous discussion in the list archives
if you want the details.

Meanwhile, as an alternative to the touch/chattr/extend routine
(ordinarily necessary since setting nocow won't affect data that's
already written), you can set the nocow attribute (chattr +C) on the
subdir the file will be created in. Based on what I've read (I'm an
admin, not a developer, and thus haven't actually read the code), all new
files in that subdir should then automatically inherit the nocow
attribute. That's what I'd probably do.
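
Both routines can be sketched as follows. This is a minimal sketch in
Python that only builds and prints the commands rather than running
them. The paths and the 40G size are made up for the example, and using
truncate for the "extend" step is my assumption; any method that grows
the file after the attribute is set should do.

```python
import shlex


def nocow_image_commands(image_path: str, size: str) -> list:
    """Commands for the touch / chattr +C / extend routine, in order."""
    return [
        ["touch", image_path],                 # create the file while still empty
        ["chattr", "+C", image_path],          # set nocow before any data is written
        ["truncate", "-s", size, image_path],  # assumption: extend with truncate
        ["lsattr", image_path],                # attribute list should now show 'C'
    ]


def nocow_directory_commands(directory: str) -> list:
    """Commands for setting nocow on a directory so new files inherit it."""
    return [
        ["mkdir", "-p", directory],
        ["chattr", "+C", directory],
    ]


# Hypothetical paths and size, printed rather than executed.
commands = nocow_directory_commands("/srv/vm") + nocow_image_commands(
    "/srv/vm/guest.img", "40G"
)
for argv in commands:
    print(shlex.join(argv))
```

The order matters in the per-file routine: the attribute has to be set
while the file is still empty, which is why the extend step comes last.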

> I could also disable writeback cache on the VM. But, VMWare uses it's
> own "vmblock" kernel module for I/O, so I'm not sure if this would do
> any good. Then ofcourse, there's the performance hit.

Well, considering that by analogy you've proposed after-the-fact shipping
to Mars and back to fix the breakage, choosing surface shipping vs. air
shipment should be entirely insignificant, performance-wise. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman