From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Are nocow files snapshot-aware
Date: Fri, 7 Feb 2014 07:06:40 +0000 (UTC)
Message-ID: <pan$851d5$f707f4ca$8ec9d1c3$9613615f@cox.net>
In-Reply-To: <r5mdsa-hng.ln1@hurikhan77.spdns.de>
Kai Krakow posted on Fri, 07 Feb 2014 01:32:27 +0100 as excerpted:
> Duncan <1i5t5.duncan@cox.net> schrieb:
>
>> That also explains the report of a NOCOW VM-image still triggering the
>> snapshot-aware-defrag-related pathology. It was a _heavily_ auto-
>> snapshotted btrfs (thousands of snapshots, something like every 30
>> seconds or more frequent, without thinning them down right away), and
>> the continuing VM writes would nearly guarantee that many of those
>> snapshots had unique blocks, so the effect was nearly as bad as if it
>> wasn't NOCOW at all!
>
> The question here is: does it really make sense to create such snapshots
> of disk images that are currently online and running a system? They will
> probably be broken anyway after a rollback - or at least I'd not fully
> trust their contents.
>
> VM images should not live in a subvolume that is snapshotted at short,
> regular intervals. The problem will go away if you follow this rule.
>
> The same applies to probably any kind of file you'd make nocow -
> e.g. database files. The only use case is taking _controlled_ snapshots
> - and doing it every 30 seconds is by all means NOT controlled, it's
> completely nondeterministic.

I'd absolutely agree -- and that wasn't my report, I'm just recalling it.
At the time I didn't understand the interaction between NOCOW and
snapshots, and couldn't quite see how a NOCOW file was still triggering
the snapshot-aware-defrag pathology -- a pathology we were only just
beginning to recognize from such reports.

In fact, at the time I assumed it was because NOCOW had been set after
the file was originally written, such that btrfs couldn't honor it
properly. That might still have been part of it, but now that I
understand the interaction between snapshots and NOCOW, I can see that
such heavy snapshotting of an actively written VM image could trigger
the same issue even if the file was created properly and was indeed
NOCOW when its content was first written into it.
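
(For reference, since getting this wrong is exactly the trap I assumed
above: on btrfs, NOCOW only takes effect reliably if it's set while the
file is still empty, so the usual trick is to set it on the directory
first and let new files inherit it. A quick sketch, with the path made
up purely for illustration:

  mkdir /srv/vm-images               # hypothetical location
  chattr +C /srv/vm-images           # new files here inherit NOCOW
  lsattr -d /srv/vm-images           # should show the 'C' attribute
  cp reference.img /srv/vm-images/   # cp creates the target empty, so
                                     # it inherits NOCOW before any
                                     # data is written into it

Setting +C on a file that already has data in it isn't reliable, which
is why I suspected the set-it-later scenario in the first place.)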

But definitely agreed. Thirty-second snapshotting, against btrfs'
30-second commit interval, is pretty much off the deep end regardless
of the content. I'd even argue that one-minute snapshotting is too
extreme to be practical unless the snapshots are thinned down to, say,
one every 5 or 10 minutes after an hour or so. Even a couple days of
the unthinned version, and how are you going to manage the thousands of
snapshots, or know which precise snapshot to roll back to if you had to?

That's why, in the example I posted here some days ago of what I
considered the extreme-but-still-practical end of the range, IIRC I had
it take one-minute snapshots but thin them to one every 5 or 10 minutes
after a couple hours, to one every half hour after a couple days, and
eventually out to one every 90 days, retained for a decade. Even that I
considered extreme, although at least reasonably so. The point was that
even starting from one-minute snapshots and keeping some for a decade,
with reasonable thinning the result was still very manageable --
something like 250 snapshots total, well below the thousands or tens of
thousands we're sometimes seeing in reports. Those are hardly practical
no matter how you slice it: how likely are you to know the exact minute
to roll back to, even a month out? And even if you do, if you survived
a month before detecting the problem, how important can rolling back to
precisely the last minute before it really be? A month out, perhaps the
hour matters, but the minute?
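
Just to put rough numbers on that (the daily tier in the middle is my
back-of-envelope filler, so treat the exact windows as illustrative,
not as a quote of the earlier post):

   60  one-minute snaps covering the most recent hour
   94  half-hour snaps covering the next two days
   88  daily snaps out to 90 days
   40  90-day snaps out to a decade
  ---
 ~280  snapshots total, the same order as the ~250 figure above

Two-hundred-some snapshots a decade deep is navigable; tens of
thousands a month deep is not.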

But some of the snapshotting scripts out there, and the admins running
them, seem to work on the idea that just because something is possible
it must be done, taking snapshots every minute or more frequently with
no automated thinning at all. IMO that's pathology run amok even if
btrfs /were/ stable and mature and /could/ handle it properly. That
holds regardless of the content, so it comes at the problem from a
different angle than yours... but if admins can't recognize the problem
with per-minute snapshots left unthinned for days, weeks, months on end,
I doubt they'll be any better at recognizing that VMs, databases, etc.,
should have a dedicated subvolume, as sketched below.
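
The dedicated-subvolume point is worth spelling out, because btrfs
makes it cheap: snapshots are not recursive, so a nested subvolume
simply doesn't appear in snapshots of its parent. Something along these
lines (path hypothetical, as usual):

  btrfs subvolume create /var/lib/vm-images   # nested subvolume
  chattr +C /var/lib/vm-images                # new images NOCOW too

With that in place, timeline snapshots of the parent subvolume never
touch the images at all, whatever the admin's snapshotting habits.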

Taking the long view, with a bit of luck we'll get to the point where
database and VM setup scripts and/or documentation recommend setting
NOCOW on the directory the VMs/DBs/etc will live in. But even that's
pushing it in practice, and will take some time (2-5 years) as btrfs
stabilizes and mainstreams, taking over from ext4 as the assumed Linux
default. Other than that, I guess it'll be handled case by case as
people report problems here. But with a snapshot-aware defrag that
actually scales, hopefully there won't be so many people reporting
problems. True, they might not have the best-optimized systems and may
have some minor pathologies in their admin practices, but as long as
btrfs can deal with those better than it does now, they'll remain
/minor/ pathologies instead of growing into /major/ ones.

But be that as it may: since such extreme snapshotting /is/ possible,
and with automation and downloadable snapper scripts somebody WILL be
doing it, btrfs should scale to it if it is to be considered mature and
stable. People don't want a filesystem that falls over and loses data,
or simply becomes unworkably live-locked, just because they didn't know
what they were doing when they set up the snapper script with one-minute
snaps and no corresponding thinning after an hour or a day or whatever.
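
For what it's worth, the thinning knob does exist in the common tooling.
Snapper's timeline cleanup, for instance, is driven by limits along
these lines (an excerpt from memory with illustrative values, so check
your version's documentation for the exact keys):

  # /etc/snapper/configs/root (excerpt)
  TIMELINE_CREATE="yes"
  TIMELINE_CLEANUP="yes"
  TIMELINE_LIMIT_HOURLY="10"
  TIMELINE_LIMIT_DAILY="7"
  TIMELINE_LIMIT_MONTHLY="6"
  TIMELINE_LIMIT_YEARLY="2"

The pathological setups are the ones that turn the creation interval
way down without ever enabling the cleanup side.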

Anyway, the commit temporarily disabling snapshot-aware defrag is now in
mainline, committed shortly after 3.14-rc1 so it'll be in rc2, giving
the devs some breathing room to work out a solution that scales rather
better than what we had. So defragging is (hopefully temporarily) not
snapshot-aware again ATM, but the pathological snapshot-aware-defrag
scaling issues are at least confined to a bounded set of kernel releases
now, and the immediately critical problem should die down to some extent
as the related commits (the patches apparently needed some backporting
rework) hit the stable series.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman