Subject: Re: btrfs autodefrag?
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
References: <56227910.7000208@gmail.com>
 <pan$6489f$54c8fba2$f1fc6c81$81eac@cox.net>
From: Xavier Gnata <xavier.gnata@gmail.com>
Message-ID: <56239420.4040500@gmail.com>
Date: Sun, 18 Oct 2015 14:44:16 +0200
MIME-Version: 1.0
In-Reply-To: <pan$6489f$54c8fba2$f1fc6c81$81eac@cox.net>
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID: <linux-btrfs.vger.kernel.org>


On 18/10/2015 07:46, Duncan wrote:
> Xavier Gnata posted on Sat, 17 Oct 2015 18:36:32 +0200 as excerpted:
>
>> Hi,
>>
>> On a desktop equipped with an ssd with one 100GB virtual image used
>> frequently, what do you recommend?
>> 1) nothing special, it is all fine as long as you have a recent kernel
>> (which I do)
>> 2) Disabling copy-on-write for just the VM image directory.
>> 3) autodefrag as a mount option.
>> 4) something else.
>>
>> I don't think this usecase is well documented therefore I asked this
>> question.
>
> You are correct.  The VM images on ssd use-case /isn't/ particularly well
> documented, I'd guess because people have differing opinions, and,
> indeed, actual observed behavior, and thus recommendations even in the
> ideal case, may well be different depending on the specs and firmware of
> the ssd.  The documentation tends to be aimed at the spinning rust case.
>
> There's one detail of the use-case (besides ssd specs), however, that you
> didn't mention, that could have a big impact on the recommendation.  What
> sort of btrfs snapshotting are you planning to do, and if you're doing
> snapshots, does your use-case really need them to include the VM image
> file?
>
> Snapshots are a big issue for anything that you might set nocow, because
> snapshot functionality assumes and requires cow, and thus conflicts, to
> some extent, with nocow.  A snapshot locks in place the existing extents,
> so they can no longer be modified.  On a normal btrfs cow-based file,
> that's not an issue, since any modifications would be cowed elsewhere
> anyway -- that's how btrfs normally works.  On a nocow file, however,
> there's a problem, because once the snapshot locks in place the existing
> version, the first change to a specific block (normally 4 KiB) *MUST* be
> cowed, despite the nocow attribute, because to rewrite in-place would
> alter the snapshot.  The nocow attribute remains in place, however, and
> further writes to the same block will again be nocow... to the new block
> location established by that first post-snapshot write... until the next
> snapshot comes along and locks that too in-place, of course.  This sort
> of cow-only-once behavior is sometimes called cow1.
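
A toy model may help make cow1 concrete. The sketch below is not btrfs code: the class and function names are made up for illustration, and it only counts how many writes are forced to a new location versus how many stay in place, under the rules described above.

```python
# Toy model of the cow1 behavior described above; not btrfs code.
# All names here (NocowFile, run) are hypothetical.

class NocowFile:
    def __init__(self, num_blocks):
        self.pinned = [False] * num_blocks  # True: block is locked by a snapshot
        self.relocations = 0                # writes forced to a new location (cow1)
        self.in_place = 0                   # writes that overwrote in place

    def snapshot(self):
        # A snapshot locks every existing block in place.
        self.pinned = [True] * len(self.pinned)

    def write(self, block):
        if self.pinned[block]:
            # First write after a snapshot must be cowed despite nocow.
            self.relocations += 1
            self.pinned[block] = False
        else:
            # Further writes to the same block are nocow again.
            self.in_place += 1


def run(num_blocks, writes, snapshot_every):
    """Replay block writes, snapshotting before every Nth write (0 = never)."""
    f = NocowFile(num_blocks)
    for i, block in enumerate(writes):
        if snapshot_every and i % snapshot_every == 0:
            f.snapshot()
        f.write(block)
    return f.relocations, f.in_place
```

Replaying the same write pattern with `snapshot_every` set to 0, then to a large value, then to a small one shows the share of relocated writes growing as snapshots become more frequent, which is the "nocow is effectively nullified" case.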
>
> If you only do very occasional snapshots, probably manually, this cow1
> behavior isn't /so/ bad, tho the file will still fragment over time as
> more and more bits of it are written and rewritten after the few
> snapshots that are taken.  However, for people doing frequent, generally
> schedule-automated snapshots, the nocow attribute is effectively
> nullified as all those snapshots force cow1s over and over again.
>
> So ssd or spinning rust, there's serious conflicts between nocow and
> snapshotting that really must be taken into consideration if you're
> planning to both snapshot and nocow.
>
> For use-cases that don't require snapshotting of the nocow files, the
> simplest workaround is to put any nocow files on dedicated subvolumes.
> Since snapshots stop at subvolume boundaries, having nocow files on
> dedicated subvolume(s) stops snapshots of the parent from including them,
> thus avoiding the cow1 situation entirely.
>
> If the use-case requires snapshotting of nocow files, the workaround that
> has been reported (mostly on spinning rust, where fragmentation is a far
> worse problem due to non-zero seek-times) to work is first to reduce
> snapshotting to a minimum -- if it was going to be hourly, consider daily
> or every 12 hours, if you can get away with it, if it was going to be
> daily, consider every other day or weekly.  Less snapshotting means fewer
> cow1s and thus directly affects how quickly fragmentation becomes a
> problem.  Again, dedicated subvolumes can help here, allowing you to
> snapshot the nocow files on a different schedule than you do the up-
> hierarchy parent subvolume.  Second, schedule periodic manual defrags of
> the nocow files, so the fragmentation that does occur is at least kept
> manageable.  If the snapshotting is daily, consider weekly or monthly
> defrags.  If it's weekly, consider monthly or quarterly defrags.  Again,
> various people who do need to snapshot their nocow files have reported
> that this really does help, keeping fragmentation to at least some sanely
> managed level.
>
> That's the snapshot vs. nocow problem in general.  With luck, however,
> you can avoid snapshotting the files in question entirely, thus factoring
> this issue out of the equation entirely.
>
> Now to the ssd issue.
>
> On ssds in general, there are two very major differences we need to
> consider vs. spinning rust.  One, fragmentation isn't as much of a
> problem as it is on spinning rust.  It's still worth keeping to a
> minimum, because as the number of fragments increases, so does both btrfs
> and device overhead, but it's not the nearly everything-overriding
> consideration that it is on spinning rust.
>
> Two, ssds have a limited write-cycle factor to consider, where with
> spinning rust the write-cycle limit is effectively infinite... at least
> compared to the much lower limit of ssds.
>
> The weighing of these two overriding ssd factors one against the other,
> along with the simple fact that ssds are new enough technology and
> behavior differs enough between them that people simply haven't had time
> to come to agreement yet on best-practices, is why recommendations here
> differ far more than on spinning rust, where fragmentation really is the
> single most important overriding factor compared to very nearly
> everything else.  The fact of the matter is, on ssds, people strongly
> emphasizing the limited write-cycle count will tend not to worry, perhaps
> at all, about fragmentation, since its negative effects are so much
> lower on ssds.  Others (including me) emphasize the remaining negative
> effects of fragmentation, including scaling issues should it get too
> bad.  There is also an effect that's harder to reduce to a universal
> rule, because devices and firmwares differ in major ways here: the
> larger erase block size interacts with sub-erase-block-size
> fragmentation and write-amplification, so fragmentation may itself
> trigger more write cycles than the defrag would.  People in this camp
> still tend to recommend at least taking fragmentation into account, and
> may even consider autodefrag worth enabling, at least for use-cases with
> small enough internal-rewrite-pattern files.
>
> So let's address autodefrag...
>
> It's worth noting that I have autodefrag enabled here, on my ssds, and
> have from the first mount where I put content on them, so it has been
> enabled for every write on every file.  However, it's not ideal in all
> cases; my use-case simply is one where autodefrag works well, so...
>
> Here's the deal with autodefrag.  First of all, if a file isn't
> constantly being rewritten, or if its rewrite pattern is append-only
> (like most log files, but *not* systemd journal files!), it doesn't tend
> to get particularly fragmented in the first place, especially with a
> filesystem that itself isn't highly fragmented, so free-space blocks tend
> to be large enough that a file doesn't tend to be fragmented as initially
> written.  So fragmentation tends to be worst on internal-rewrite-pattern
> files, where a block here and a block there are rewritten, normally
> triggering cow on a cow-based filesystem such as btrfs.
>
> But, consider that rewriting the entire file to avoid fragmentation,
> which is what autodefrag does, takes time: the larger the file, the
> more time.  And at some point, as filesizes increase, rewrites can be
> coming in faster than the file can be rewritten.  So autodefrag works
> best on internal-rewrite-pattern files (as we've already established),
> but also on smaller files.
>
> On spinning rust, autodefrag tends to work best at file sizes under 256
> MiB, a quarter GiB, where they rewrite fast enough that there's generally
> no problems at all.  But on most spinning rust, people will begin to see
> performance issues with autodefrag, at somewhere between half a GiB and
> 3/4 GiB (512-768 MiB), and nearly everyone on spinning rust reports
> performance issues at 1 GiB file sizes and larger.
>
> As it happens, this quarter-GiB or so spinning-rust autodefrag limit is
> close to that of common desktop-only database uses such as the sqlite
> files firefox and thunderbird use, so this is the use-case for which
> autodefrag is really recommended and tuned ATM.  That's really useful,
> since it means most desktop-only users can simply enable autodefrag and
> forget about it, as it'll "just work".
>
> People optimizing larger databases and GiB+ VM image files, however, are
> going to need to do rather more detailed optimization, which sucks, but
> in contrast with normal desktop users, they're generally used to doing
> various optimization things, at least to some extent, already, so at
> least the problem is hitting those generally more technically prepared to
> deal with it.
>
> But that's for spinning rust.  On ssds, particularly fast ssds, write
> speeds tend to be high enough that autodefrag can work effectively with
> much larger files.  The rub, however, is that ssd speeds vary enough, and
> there's few enough reports from people actually testing autodefrag with
> larger internal-rewrite-pattern files on ssds, that we don't have nicely
> wrapped up numbers for our ssd autodefrag filesize limitation
> recommendations, as we do for spinning rust.
>
> I'd suggest based on my own experience and the reports we /do/ have, that
> on most ssds, autodefrag, provided people are inclined to enable it in
> the first place (see above discussion of the two major ssd factors here
> and how emphasis on one or the other tends to put people in one of two
> camps regarding even worrying about fragmentation at all on ssds), should
> work well enough on files up to a gig in size, at least.  I wouldn't be
> surprised to see 2 GiB work fine, particularly on fast ssds, tho I'd
> guess people will begin to see performance issues at the 4 GiB to 8 GiB
> size.
>
> You say your image file, while on ssd, is 100 GiB.  Please do your own
> tests and report as it's possible my EWAG (educated but wild-ass-guess)
> is wrong, but I'm predicting that's well above the good performance limit
> for autodefrag, even on SSD.
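
The size figures quoted above can be collected into a small lookup. This is only a sketch of the numbers as stated in the quoted text: the ssd figures are explicitly a guess there, the "borderline" label for the ranges in between is an assumption of this sketch, and the names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# (ok up to, issues expected from) in bytes, per the figures quoted above.
# hdd: fine under about 256 MiB, nearly everyone reports issues at 1 GiB+.
# ssd: guessed fine up to 1 GiB, issues guessed to begin around 4 GiB.
LIMITS = {
    "hdd": (256 * MIB, 1 * GIB),
    "ssd": (1 * GIB, 4 * GIB),
}


def autodefrag_outlook(size_bytes, media):
    """Classify a file size for autodefrag on the given media type."""
    ok_up_to, issues_from = LIMITS[media]
    if size_bytes <= ok_up_to:
        return "ok"
    if size_bytes < issues_from:
        return "borderline"  # label chosen for this sketch, not from the text
    return "issues expected"
```

Called as `autodefrag_outlook(100 * GIB, "ssd")`, this lands in the last category, matching the prediction above for a 100 GiB image.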
>
> That said, performance may still be good /enough/ that you can deal with
> it, if it sufficiently simplifies the situation for you regarding /other/
> files, and your balance of use tilts sufficiently toward those other
> files as opposed to this single very large image file.
>
> Tho at 100 GiB, the repeated rewriting of autodefrag is definitely likely
> to cut into your write-cycle allowance, arguably rather heavily.  So I
> really can't recommend autodefrag, despite how very much I wish it would
> work for your case, since it does dramatically simplify things where it
> works and you can then simply forget about other alternatives and all
> their relative complications.  Maybe someday they'll optimize it to
> handle such large files better, but until then, I really don't think it's
> a good match to your requirements.
>
> So with autodefrag out for that file, and with the previous issues
> discussed, here's some reasonable options to try.
>
> 1) The nothing special option.  With a bit of luck, the 0-seek-time of
> ssd will mean that the fragmentation you're likely to see won't
> dramatically affect you, and the "do nothing" option will work acceptably.
>
> The biggest thing I'm worried about here is that fragmentation may well
> get bad enough that it affects btrfs maintenance times, etc, due to
> scaling issues.  Btrfs balance, scrub, and check, could end up taking far
> longer than you might expect on ssd and than they'd take were it not for
> the fragmentation on this single file.
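
To see how fragmented the image actually gets under the "do nothing" option, its extent count can be tracked over time. The sketch below assumes `filefrag` (from e2fsprogs) is installed and prints its usual summary line of the form `<path>: N extents found`; the function names are hypothetical.

```python
import re
import subprocess

# Matches the summary line, singular or plural ("1 extent" / "12 extents").
_EXTENTS = re.compile(r": (\d+) extents? found")


def parse_filefrag(output):
    """Extract the extent count from filefrag's summary line."""
    match = _EXTENTS.search(output)
    if match is None:
        raise ValueError("no extent summary found in filefrag output")
    return int(match.group(1))


def count_extents(path):
    """Run filefrag on path and return the reported number of extents."""
    result = subprocess.run(
        ["filefrag", path], check=True, capture_output=True, text=True
    )
    return parse_filefrag(result.stdout)
```

Logging `count_extents()` for the image once a week or so would show whether the extent count is growing steadily or leveling off.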
>
> And if you're keeping snapshots around, be aware that simply defragging
> the file isn't likely to solve the btrfs maintenance times issue, because
> while btrfs did have snapshot-aware-defrag for a few kernels, it did not
> scale well *AT* *ALL* and the snapshot awareness was disabled again,
> until the scaling issues could be worked thru (which they're gradually
> doing, but it's an exceedingly complex problem, with many sub-issues that
> must be solved before scaling itself can be considered solved).  So
> defragging a file that's already highly fragmented in various snapshots
> of differing ages, will defrag it in the subvolume/snapshot you run the
> defrag in, but won't affect it in the other snapshots, so isn't likely to
> do much at all for the overall btrfs maintenance scaling issue.  You'd
> have to delete all those snapshots (or not take them in the first place,
> if your use-case doesn't require them) to eliminate the scaling issue, if
> it's due to fragmentation of this file in all those snapshots as well as
> the working copy.
>
> So watch out for the maintenance scaling (maybe run a scrub and/or read-
> only check periodically, just to ensure the execution times aren't
> running away on you), but if it works well enough for you, this is by far
> the most uncomplicated option.
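
A minimal way to watch for runaway maintenance times is to time a foreground scrub and keep the numbers. This sketch assumes the `btrfs scrub start -B` form, where `-B` keeps the scrub in the foreground so the elapsed time covers the whole run; the helper names are hypothetical and nothing is executed until `timed_run` is called.

```python
import subprocess
import time


def scrub_argv(mountpoint):
    # -B: do not background, so the command returns when the scrub is done.
    return ["btrfs", "scrub", "start", "-B", mountpoint]


def timed_run(argv):
    """Run a command and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    result = subprocess.run(argv)
    return result.returncode, time.monotonic() - start
```

Running `timed_run(scrub_argv("/"))` as root every month or so and comparing the elapsed times against earlier runs should make a scaling problem visible well before it becomes painful.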
>
> 2) If your use-case doesn't involve snapshotting the image file, setting
> nocow on the dir before creation of the file, such that the file inherits
> the nocow, should be a reasonably uncomplicated option.
>
> If you do plan on snapshotting the parent but don't actually need to
> snapshot the nocow subdir and its nocow inheriting images, then use the
> dedicated subvolume trick to keep the image file out of your snapshots
> and avoid the cow1 complications.
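
The nocow setup, with or without the dedicated subvolume, comes down to a few commands. The sketch below only builds the argument lists and does not run anything; the function name and the example path are hypothetical. As noted above, the attribute has to be set on the directory before the image is created or copied in, so that the file inherits it.

```python
def nocow_setup_commands(image_dir, dedicated_subvolume=True):
    """Return the commands to prepare a nocow directory for VM images."""
    commands = []
    if dedicated_subvolume:
        # A subvolume keeps the images out of snapshots of the parent.
        commands.append(["btrfs", "subvolume", "create", image_dir])
    else:
        commands.append(["mkdir", "-p", image_dir])
    # +C on the directory: files created in it afterwards inherit nocow.
    commands.append(["chattr", "+C", image_dir])
    # Show the directory's own attributes to confirm that C is set.
    commands.append(["lsattr", "-d", image_dir])
    return commands
```

For example, `nocow_setup_commands("/home/xavier/vm-images")` yields the subvolume variant, and passing `dedicated_subvolume=False` gives the plain-directory variant for setups that take no snapshots at all.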
>
> 3) As an idea taking the dedicated subvolume idea even further, consider
> an entirely separate dedicated filesystem for this image file.  That
> gives you much more flexibility, because then you can, for instance,
> still set autodefrag on the main filesystem, if it'd be useful there,
> without worrying about how that huge image file and autodefrag interact.
>
> Additionally, that lets you use something other than btrfs for the image
> file's filesystem, if you want, while still using btrfs for the rest of
> the system.  If you're nocowing the file, you're already killing many of
> the features that btrfs generally brings, and provided the additional
> overhead of managing the separate partition and filesystem isn't too
> much, you might /as/ /well/ simply use something other than btrfs for
> that particular file, thus avoiding the whole image file cowing
> complications scenario in the first place.
>
> I'd strongly consider the separate filesystem option here, as I already
> use multiple separate filesystems in order to avoid having my data eggs
> all in the same single filesystem basket (subvolumes don't cut it in
> terms of separation safety, for me).  But some people are far more averse
> to partitioning and similar solutions, for reasons that aren't entirely
> clear to me.  If you'd prefer to avoid the complexity of managing an
> entirely separate filesystem just for your image file, fine, just cross
> this option off your list and don't consider it further.
>
> 4) If the "do nothing" option doesn't cut it and your use-case involves
> snapshotting the image file, then things get much more complex.
>
> As mentioned above, the recommendation for this sort of use-case isn't
> going to give you a simple ideal, but others have reported it to work
> acceptably, even surprisingly well, once it's all set up, and if that's
> the situation on spinning rust, it should be even better on ssd, since
> the "controlled amount of fragmentation" should be even further within
> acceptable levels on ssd with its zero-seek-times, than it is on spinning
> rust.
>
> Again, the recommendation for this use-case is to set nocow on the image-
> file's dir so it inherits, and aim for the low end of your acceptable
> snapshotting frequency range for the image file, weekly instead of daily,
> or daily instead of hourly.  If necessary, use the separate subvolume
> trick to separate the image file from the rest of the content you're
> snapshotting, so you can use a higher frequency snapshot schedule on the
> other stuff, while keeping it as low frequency as you can manage on the
> image file.
>
> Then do scheduled periodic targeted defrag of the image file, at a
> frequency some fraction of the snapshot frequency, perhaps monthly or
> quarterly for weekly snapshots, etc.
>
> Keep in mind that defrag will only affect the working copy, not existing
> snapshots, but provided you do it at some reasonable fraction of the
> snapshotting interval, you should reset the fragmentation for further
> snapshots often enough that it doesn't get out of hand for them, either.
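
The "defrag at some fraction of the snapshot frequency" rule can be written down as a small helper. The factor of 7 below is an assumption chosen so that daily snapshots give weekly defrags and weekly snapshots give a defrag roughly every seven weeks, which falls between the monthly and quarterly suggestions above. The names are hypothetical, and the defrag command is only built, not run.

```python
from datetime import timedelta


def defrag_interval(snapshot_interval, snapshots_per_defrag=7):
    """Suggest how often to defrag, given how often snapshots are taken."""
    return snapshot_interval * snapshots_per_defrag


def defrag_argv(image_path):
    # Targeted defrag of the single image file, not the whole filesystem.
    return ["btrfs", "filesystem", "defragment", image_path]
```

The resulting interval can be fed to whatever scheduler is already in use (cron, a systemd timer) alongside the command from `defrag_argv()`.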
>
>
> Finally, orthogonal to the original fragmentation question, but
> particularly important if you /are/ doing scheduled snapshots...
>
> For scheduled snapshots in particular, it's very important that you set up
> a reasonable snapshot thinning schedule as well, the object of which
> should be to keep the number of snapshots as low as possible, again, for
> scaling reasons.  At this point anyway, btrfs maintenance operations
> simply do /not/ scale well with snapshot numbers in the tens or hundreds
> of thousands range, as people often find themselves with if they aren't
> doing scheduled snapshot thinning as well.
>
> With reasonable thinning, it's quite possible to keep per-subvolume
> snapshots to 250 or so, reasonably under 300, even if starting with
> incredibly high snapshot frequency such as every half-hour or even every
> minute (tho the latter tends to be impractical because while snapshots
> are fast, very nearly instantaneous, removing them is rather more complex
> and definitely not instantaneous!).  With 250 snapshots per subvolume,
> you keep it to 1000 snapshots per filesystem if you're snapshotting four
> subvolumes, 2000 per filesystem if you're doing eight, etc.  Ideally,
> you'll target 1000 or less, possibly by thinning more drastically on some
> subvolume snapshots than others, but 2000 or even 3000 isn't out of hand,
> tho by 2500 to 3000, you'll probably notice increased maintenance times.
> By 10k snapshots, however, things are starting to go south, and above
> that, things go unreasonable pretty fast.
>
> So do try to keep to "a few thousand, at most" snapshots, or expect
> btrfs balance and other maintenance tasks to take "unreasonable" amounts
> of time, should you need to run them.  And if you can keep to under 1000,
> so much the better; your improved maintenance times will reward you for
> it. =:^)
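
A thinning schedule of the kind described above can be sketched as age tiers, each keeping one snapshot per time window. The tier values in the usage note are purely illustrative and are not taken from the quoted text; the function name is hypothetical, and the sketch only decides which timestamps to keep, leaving the actual `btrfs subvolume delete` calls to the caller.

```python
from datetime import datetime, timedelta


def thin(snapshots, now, tiers):
    """Return the snapshot timestamps to keep, oldest first.

    tiers is a list of (max_age, window) pairs ordered by max_age. A snapshot
    falls into the first tier whose max_age covers its age, and within a tier
    only the newest snapshot per window is kept. Snapshots older than the
    last tier are dropped.
    """
    keep = {}
    for ts in sorted(snapshots, reverse=True):  # newest first
        age = now - ts
        for max_age, window in tiers:
            if age <= max_age:
                key = (window, age // window)
                keep.setdefault(key, ts)  # first seen is the newest in window
                break
    return sorted(keep.values())
```

With half-hourly snapshots kept for 30 days and tiers of hourly for a day, daily for a week, and weekly for four weeks, this keeps a few dozen snapshots per subvolume, far below the 250 or so mentioned above; real schedules would add longer tiers to taste.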
>
> Also, as you may have already seen, my recommendation for quotas is
> simply leave them off on btrfs.  They're broken and dramatically increase
> the scaling issues.  You either rely on quotas working or you don't.  If
> you don't, leave them off and avoid the issues.  If you do, use a more
> stable and mature filesystem where they're known to work reliably.
> Unless of course you're specifically working with the devs to test,
> report and trace down quota problems and test possible fixes.  In that
> case, please continue, as it's your tolerance for the present pain that's
> helping to make the feature actually usable for the rest of us, someday
> hopefully soon. =:^)
>

Thanks for the very detailed answer! Your text should find its way to the 
btrfs wiki/doc.

I never have more than a few snapshots of my home dir.
I don't *need* to snapshot the VM image therefore I intended to use 
nocow. However and thanks to your answer, I'm going to try the "do 
nothing special" option. If things are getting too slow then I will 
report and probably switch back to the nocow option (and a good 
old-fashioned backup of the VM image every night on old-fashioned ext4 on 
spinning rust).

Xavier