linux-btrfs.vger.kernel.org archive mirror
From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: btrfs-transaction blocked for more than 120 seconds
Date: Sun, 5 Jan 2014 19:57:38 +0000 (UTC)
Message-ID: <pan$95dc3$6f31a244$d85ae144$1545dedf@cox.net>
In-Reply-To: 52C99C64.2010209@jrs-s.net

Jim Salter posted on Sun, 05 Jan 2014 12:54:44 -0500 as excerpted:


> On 01/05/2014 12:09 PM, Chris Murphy wrote:
>> I haven't read anything so far indicating defrag applies to the VM
>> container use case, rather nodatacow via xattr +C is the way to go. At
>> least for now.

Well, NOCOW from the get-go would certainly be better, but given that the 
file is already there and heavily fragmented, my idea was to get it 
defragmented and then set the +C, to prevent it reoccurring.

But I do very little snapshotting here, and as a result hadn't considered 
the knock-on effect of 100K-plus extents in perhaps 1000 snapshots.  I 
guess that's what's killing the defrag, however it's initiated.  The only 
way to get rid of the problem, then, would be to move the file away and 
then back.  But that still leaves all those snapshots with the crazy 
fragmentation, and killing that would require either deleting all those 
snapshots, or setting them writable and doing the same move out, move 
back, on each one!  OUCH.  I guess that's why it seems impossible to deal 
with the fragmentation on these things, whether via autodefrag, named-file 
defrag, or the whole move-out-and-back routine: there are always all 
those snapshots to worry about.
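
For anyone wanting to put a number on that fragmentation, here is a 
minimal Python sketch that pulls the extent count out of one line of 
filefrag output.  It assumes only the usual "N extents found" summary 
format; the sample path and count are made up for illustration.

```python
import re

# Matches the summary line filefrag prints, e.g. "/path/file: 3 extents found".
_SUMMARY = re.compile(r"^(?P<path>.+): (?P<count>\d+) extents? found$")


def parse_extent_count(line: str) -> int:
    """Return the extent count from one filefrag summary line."""
    match = _SUMMARY.match(line.strip())
    if match is None:
        raise ValueError(f"not a filefrag summary line: {line!r}")
    return int(match.group("count"))


# Hypothetical sample line; the path and the count are invented.
sample = "/var/lib/libvirt/images/win2008r2.img: 104857 extents found"
print(parse_extent_count(sample))
```

Feeding it the real output of filefrag on the VM image, before and after 
any defrag attempt, would show whether the attempt actually helped.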

Still, I'd guess ultimately it'll need to be done, whether that means 
wiping the filesystem and restoring from backup or something else.

> Can you elaborate on the rationale behind database or VM binaries being
> set nodatacow? I experimented with this*, and found no significant (to me,
> anyway) performance enhancement with nodatacow on - maybe 10% at best,
> and if I understand correctly, that implies losing the live per-block
> checksumming of the data that's set nodatacow, meaning you won't get
> automatic correction if you're on a redundant array.
> 
> All I've heard so far is "better performance" without any more detailed
> explanation, and if the only benefit is an added MAYBE 10%ish
> performance... I'd rather take the hit, personally.
> 
> * "experimented with this" == set up a Win2008R2 test VM and ran
> HDTunePro for several runs on binaries stored with and without nodatacow
> set, 5G of random and sequential read and write access per run.

Well, the problem isn't just performance, it's that in most such cases 
the apps actually have their own data integrity checking and management, 
and sometimes the app's integrity management and that of btrfs end up 
fighting each other, destroying the data as a result.

In normal operation, everything's fine.  But should the system crash at 
the wrong moment, btrfs' atomic commit and data integrity mechanisms can 
roll back to a slightly earlier version of the file.

Which is normally fine.  But hardware is known to often lie about having 
committed writes that may actually still be only in its buffer.  If the 
power outage/crash occurs at the wrong moment, the ordinary write-barrier 
ordering guarantees may not hold (particularly on large files with 
finite-seek-speed devices), and the app's own integrity checksum may have 
been updated before the data it was supposed to be a checksum of actually 
got to disk.  If btrfs ends up rolling back to that state, btrfs will 
likely consider the file fine, but the app's own integrity management 
will consider it corrupted, which it actually is.

But if btrfs only stays out of the way, the application often can fix 
whatever minor corruption it detects, doing its own roll-backs to an 
earlier checkpoint, because it's /designed/ to be able to handle such 
problems on filesystems that don't have integrity management.
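
To make that failure mode concrete, here is a toy Python simulation.  It 
is not how any real database or btrfs works internally: the ToyStore 
class, its method names, and the use of CRC32 are all assumptions chosen 
for illustration.  It shows a checksum reaching "disk" without its data, 
the app detecting the mismatch, and the app rolling back to its own 
earlier checkpoint.

```python
import zlib


def checksum(data: bytes) -> int:
    """App-level integrity checksum (CRC32, picked only for the demo)."""
    return zlib.crc32(data)


class ToyStore:
    """Toy app store: on-disk data, its checksum, and an older checkpoint."""

    def __init__(self, initial: bytes) -> None:
        self.checkpoint = (initial, checksum(initial))
        self.on_disk_data = initial
        self.on_disk_sum = checksum(initial)

    def torn_update(self, new: bytes) -> None:
        # Simulated crash: the new checksum lands on disk,
        # but the data it describes never makes it out of the buffer.
        self.on_disk_sum = checksum(new)

    def is_consistent(self) -> bool:
        return checksum(self.on_disk_data) == self.on_disk_sum

    def recover(self) -> None:
        # The app's own recovery: fall back to the last good checkpoint.
        if not self.is_consistent():
            self.on_disk_data, self.on_disk_sum = self.checkpoint


store = ToyStore(b"v1")
store.torn_update(b"v2")
print("consistent after crash:", store.is_consistent())
store.recover()
print("consistent after app recovery:", store.is_consistent())
```

The point of the sketch is only that the detection and the rollback both 
live in the app; the filesystem underneath has no idea which of the two 
values belongs with which.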

So having btrfs trying to manage integrity too on such data where the app 
already handles it is self-defeating, because neither knows about nor 
considers what the other one is doing, and the two end up undoing each 
other's careful work.

Again, this isn't something you'll see in normal operation, but several 
people have reported exactly that sort of problem in the general scenario 
of large, internally rewritten files whose integrity the application 
manages itself.  In those cases, the best thing btrfs can do is simply get 
out of the way and let the application handle its own integrity 
management.  The way to tell btrfs to do that, as well as to do in-place 
rewrites instead of COW-based rewrites, is with the NOCOW file attribute, 
chattr +C, and that must be done before the file gets so fragmented (and 
multi-snapshotted in its fragmented state) in the first place.
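
As a sketch of the move-out-and-back idea with NOCOW set from the start: 
chattr(1) says that on btrfs +C should be set on new or empty files, so 
the usual sequence is to create an empty file, flag it, and only then 
copy the data in.  The Python helper below just builds that command list 
and runs nothing.  The function name, the ".nocow" suffix, and the 
example path are hypothetical, and it assumes the VM is shut down while 
the commands are run.

```python
from pathlib import Path
from typing import List


def nocow_rewrite_plan(image: str, suffix: str = ".nocow") -> List[List[str]]:
    """Build (but do not run) the commands to rewrite a file as NOCOW."""
    src = Path(image)
    dst = src.with_name(src.name + suffix)  # temporary name, same directory
    return [
        ["touch", str(dst)],                 # create the new, empty file
        ["chattr", "+C", str(dst)],          # set NOCOW while it is still empty
        # Copy the data into the already-flagged file without replacing it.
        ["dd", f"if={src}", f"of={dst}", "bs=1M", "conv=notrunc"],
        ["mv", str(dst), str(src)],          # swap the new copy into place
    ]


# Hypothetical image path, for illustration only.
for argv in nocow_rewrite_plan("/var/lib/libvirt/images/win2008r2.img"):
    print(" ".join(argv))
```

Existing snapshots would still hold the old, fragmented extents, so this 
only helps the live copy.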


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

