Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Date: Mon, 14 Dec 2015 10:51:11 +0000 (UTC)
Message-ID: <pan$ebb43$1588cbbe$750076d0$4848c51f@cox.net>
In-Reply-To: 1450057495.2388.40.camel@scientia.net

Christoph Anton Mitterer posted on Mon, 14 Dec 2015 02:44:55 +0100 as
excerpted:

> Two more on these:
> 
> On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
>> > 3) When I would actually disable datacow for e.g. a subvolume that
>> > holds VMs or DBs... what are all the implications?

>> After snapshotting, modifications are CoWed precisely once, and
>> then it reverts to nodatacow again. This means that making a snapshot
>> of a nodatacow object will cause it to fragment as writes are made to
>> it.

> AFAIU, the one the get's fragmented then is the snapshot, right, and the
> "original" will stay in place where it was? (Which is of course good,
> because one probably marked it nodatacow, to avoid that fragmentation
> problem on internal writes).

No.  Or more precisely, keep in mind that from btrfs' perspective, once a 
reflink is made, there's no "original" that gets special treatment; all 
references to the extent are treated the same.

What a snapshot actually does is create another reference (reflink) to an 
extent.  What btrfs normally does on change as a cow-based filesystem is 
of course copy-on-write the change.  What nocow does, in the absence of 
other references to that extent, is rewrite the change in-place.

But if there's another reference to that extent, the change can't be in-
place because that would change the file reached by that other reference 
as well, and the change was only to be made to one of them.  So in the 
case of nocow, a cow1 (one-time-cow) exception must be made, rewriting 
the changed data to a new location, as the old location continues to be 
referenced by at least one other reflink.

So the one that gets the changed fragment written elsewhere, and thus 
gets fragmented, is the one that changed, whether that's the working copy 
or the snapshot of that working copy.  (Snapshots can be writable, so it 
can just as well be the snapshot that changed, if that's what was written 
to.)
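
A toy model may make the cow1 rule easier to follow.  This is only a 
sketch of the reference-counting logic described above, not btrfs code: 
the class names, the write() helper, and the numeric "locations" are all 
invented for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

_next_location = count(100)  # hypothetical locations for newly written extents


@dataclass
class Extent:
    location: int
    refs: int = 1


@dataclass
class NocowFile:
    name: str
    extents: list = field(default_factory=list)

    def snapshot(self, name):
        # A snapshot (or reflink copy) only adds references; no data moves.
        for ext in self.extents:
            ext.refs += 1
        return NocowFile(name, list(self.extents))

    def write(self, index):
        ext = self.extents[index]
        if ext.refs == 1:
            return "in-place"  # sole reference: nocow rewrites in place
        # Shared extent: one-time cow.  Only the file being written moves.
        ext.refs -= 1
        self.extents[index] = Extent(next(_next_location))
        return "cow1"


working = NocowFile("working", [Extent(0), Extent(1), Extent(2)])
snap = working.snapshot("snap")

first = working.write(1)   # extent is shared with snap, so it is relocated
second = working.write(1)  # now exclusively owned, so nocow applies again
working_layout = [e.location for e in working.extents]
snap_layout = [e.location for e in snap.extents]
print(first, second, working_layout, snap_layout)
```

In this model the side that was written to ends up with the out-of-place 
extent, and the roles simply swap if the write goes to snap instead.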

> I'd assume the same happens when I do a reflink cp.

Yes.  It's the same reflinking mechanism, after all.  If there are other 
reflinks to the extent, snapshot or otherwise, changes must be written 
elsewhere, even if they'd otherwise be nocow.

> Can one make a copy, where one still has atomicity (which I guess
> implies CoW) but where the destination file isn't heavily fragmented
> afterwards,... i.e. there's some pre-allocation, and then cp really does
> copy each block (just everything's at the state of time where I stared
> cp, not including any other internal changes made on the source in
> between).

The way that's handled is via read-only (ro) snapshots which are then 
copied, which is in effect what btrfs send does (at least in 
non-incremental mode; incremental mode still relies on the ro snapshot 
part to get atomicity).
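
As a rough sketch of that workflow, the helper below only builds the 
command lines for a read-only snapshot followed by a full (non-
incremental) send/receive; it does not run them.  The function name and 
all three paths are hypothetical, and the receive target is assumed to be 
a directory on a btrfs filesystem.

```python
import shlex


def atomic_copy_commands(subvol, snapshot, dest_dir):
    """Return shell commands for a snapshot-based full copy of a subvolume."""
    # Step 1: freeze a point-in-time view as a read-only snapshot.
    make_snapshot = ["btrfs", "subvolume", "snapshot", "-r", subvol, snapshot]
    # Step 2: stream the frozen snapshot and write it out at the destination.
    send = ["btrfs", "send", snapshot]
    receive = ["btrfs", "receive", dest_dir]
    return [
        shlex.join(make_snapshot),
        shlex.join(send) + " | " + shlex.join(receive),
    ]


# Hypothetical paths, for illustration only.
commands = atomic_copy_commands("/mnt/vms", "/mnt/snapshots/vms-ro", "/backup")
for cmd in commands:
    print(cmd)
```

The source can keep changing while the second command runs, since the 
copy is taken from the frozen snapshot rather than from the live 
subvolume.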

> And one more:
> You both said, auto-defrag is generally recommended.
> Does that also apply for SSDs (where we want to avoid unnecessary
> writes)?
> It does seem to get enabled, when SSD mode is detected.
> What would it actually do on an SSD?

Did you mean it does _not_ seem to get (automatically) enabled, when SSD 
mode is detected, or that it _does_ seem to get enabled, when 
specifically included in the mount options, even on SSDs?

Or did you actually mean it the way you wrote it, that it seems to be 
enabled (implying automatically, along with ssd), when ssd mode is 
detected?

Because the latter would be a shock to me, as that behavior isn't 
documented anywhere.  I can't imagine that it's actually doing that, or 
that you meant exactly what you wrote.
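
One way to settle which of those readings matches a given system is to 
look at the options the kernel reports for the mounted filesystem.  A 
minimal sketch, assuming the usual /proc/mounts field layout; the function 
name and the sample text are made up for illustration.

```python
def btrfs_mount_options(mounts_text):
    """Map each btrfs mount point to the set of its active mount options."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "btrfs":
            result[fields[1]] = set(fields[3].split(","))
    return result


# Hypothetical sample; on a real system, read the text from /proc/mounts.
sample = (
    "/dev/sda2 / btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/ 0 0\n"
    "/dev/sdb1 /data btrfs rw,relatime,ssd,autodefrag,subvolid=5,subvol=/ 0 0\n"
    "tmpfs /tmp tmpfs rw,nosuid,nodev 0 0\n"
)
options = btrfs_mount_options(sample)
for mount_point, opts in sorted(options.items()):
    print(mount_point, "ssd:", "ssd" in opts, "autodefrag:", "autodefrag" in opts)
```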

If you look waaayyy back to shortly before I did my first more or less 
permanent deployment (I had initially posted some questions and did an 
initial experimental deployment several months earlier, but it didn't 
last long, because $reasons), you'll see a post I made to the list with 
pretty much the same general question, autodefrag on ssd, or not.

I believe the most accurate short answer is that the benefit of 
autodefrag on SSD is fuzzy, and thus left to local choice/policy, without 
an official recommendation either way.

There are two points that we know for certain: (1) the near-zero seek 
time of SSD effectively nullifies the biggest and most direct cost 
associated with fragmentation on spinning rust, thereby lessening the 
advantage of autodefrag as seen on spinning rust by an equally large 
degree, and (2) autodefrag will without question lead to a relatively 
limited number of near-term additional writes, as the rewrite is queued 
and eventually processed.

To the extent that an admin considers these undisputed factors alone, or 
weighs them more heavily than the more controversial factors below, 
they're likely to consider autodefrag on ssd a net negative and leave it 
off.

But I was persuaded by the discussion when I asked the question, to 
enable autodefrag on my all-ssd btrfs deployment here.  Why?  Because of 
those other factors, which are less direct and arguably less directly 
measurable (except possibly by actual detail benchmarking or a/b 
deployment testing over long periods).

There are three factors I'm aware of here as well, all favoring 
autodefrag, just as the two above favored leaving it off.

1) IOPS, Input/Output Operations Per Second.  SSDs typically have both an 
IOPS and a throughput rating.  And unlike spinning rust, where raw non-
sequential-write IOPS are generally bottlenecked by seek times, on SSDs 
with their zero seek-times, IOPS can actually be the bottleneck.

Now I'm /far/ from a hardware storage device expert and thus may be badly 
misconstruing things here, but at least as I understand things, reading/
writing a single extent/fragment is typically issued as a single IO 
operation (to some maximum size), and particularly at the higher 
throughput speeds ssds commonly have and with their zero-seek-times, it's 
quite possible to bottleneck on the number of such operations, hitting 
the IOPS ceiling on either the device itself or its controller, if files 
are highly fragmented and/or there's multiple tasks doing IO to the same 
device at once.

Back when I first set up btrfs on my then new SSDs, I didn't know a whole 
lot about SSDs and this was my primary reason for choosing autodefrag; 
less fragmentation means larger IO operations so fewer of them are 
necessary to complete the data transfer, placing a lower stress on the 
device controllers and making it less likely to bottleneck on the IOPS 
limits.
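
The arithmetic behind that reasoning can be sketched as follows.  This is 
a deliberately crude model, assuming each extent costs at least one IO 
operation and that elapsed time is set by whichever ceiling (throughput or 
IOPS) is hit first; the function name and every device figure are invented 
round numbers, not measurements of any real SSD.

```python
import math


def estimate_read(file_bytes, extent_count, max_io_bytes,
                  throughput_bytes_per_s, iops_limit):
    """Return (operations, seconds) for reading a file split into extents."""
    avg_extent = file_bytes / extent_count
    # Each extent needs at least one operation; big extents need several.
    ops = extent_count * math.ceil(avg_extent / max_io_bytes)
    by_throughput = file_bytes / throughput_bytes_per_s
    by_iops = ops / iops_limit
    # Whichever ceiling is hit first determines the elapsed time.
    return ops, max(by_throughput, by_iops)


GIB = 2**30
# All device figures below are made-up round numbers, not measurements.
device = dict(max_io_bytes=512 * 1024,
              throughput_bytes_per_s=512 * 2**20,
              iops_limit=50_000)

few_ops, few_secs = estimate_read(GIB, 16, **device)
many_ops, many_secs = estimate_read(GIB, 262_144, **device)
print(f"16 extents:     {few_ops} ops, {few_secs:.2f} s")
print(f"262144 extents: {many_ops} ops, {many_secs:.2f} s")
```

With these invented numbers the 16-extent file is limited by throughput, 
while the same file in 4 KiB fragments is limited by the IOPS ceiling 
instead.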

2) SSD physical write and erase block sizes as multiples of the logical/
read block size.  To the extent that extent sizes are multiples of the 
write and/or erase-block size, writing larger extents will reduce the 
write amplification caused by writing blocks smaller than the write or 
erase block size.

While the initial autodefrag rewrite is a second-cycle write after a 
fragmented write, spending a write cycle for the autodefrag, consistent 
use of autodefrag should help keep file fragmentation and thus ultimately 
space fragmentation to a minimum, so initial writes, where there's enough 
data to write an initially large extent, won't be forced to be broken 
into smaller extents because there's simply no large free-space extents 
left due to space fragmentation.

IOW, autodefrag used consistently should reduce space fragmentation as 
well as file fragmentation, and this reduced space fragmentation will 
lead to the possibility of writing larger extents initially, where the 
amount of data to be written allows it, thereby reducing initial file 
write fragmentation and the need for autodefrag as a result.

This one dawned on me somewhat later, after I understood a bit more about 
SSDs and write amplification due to physical write and erase block 
sizing. I was in the process of explaining (in the context of spinning 
rust) how autodefrag used consistently should help manage space 
fragmentation as well, when I suddenly realized the implications that had 
on SSDs as well, due to their larger physical write and erase block sizes.
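
The write-amplification part of point 2 can be put into numbers with a 
very simplified model.  It assumes each extent is written separately and 
is rounded up to whole physical write units, and it ignores whatever 
coalescing the SSD firmware may do; the 16 KiB unit size and the function 
names are assumptions chosen for illustration.

```python
def physical_bytes(extent_sizes, unit_bytes):
    """Bytes programmed if every extent is rounded up to whole write units."""
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // unit_bytes) * unit_bytes for size in extent_sizes)


def write_amplification(extent_sizes, unit_bytes):
    """Ratio of physically written bytes to logically written bytes."""
    return physical_bytes(extent_sizes, unit_bytes) / sum(extent_sizes)


UNIT = 16 * 1024             # hypothetical physical write unit
one_extent = [1024 * 1024]   # 1 MiB written as a single extent
fragments = [4096] * 256     # the same 1 MiB as 256 separate 4 KiB extents

large_wa = write_amplification(one_extent, UNIT)
small_wa = write_amplification(fragments, UNIT)
print(large_wa, small_wa)
```

Under these assumptions the single large extent is written with no 
rounding loss, while each 4 KiB fragment occupies a full 16 KiB unit.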

3) Btrfs metadata management overhead.  While btrfs tracks things like 
checksums at fixed sizes, other metadata is per extent.  Obviously, the 
more extents a file has, the harder btrfs has to work to track them all.  
Maintenance tasks such as balance and check already have scaling issues; 
do we really want to make them worse by forcing them to track thousands 
or tens of thousands of extents per (large) file where they could be 
tracking a dozen or two?

Autodefrag helps keep the work btrfs itself has to do under control, and 
in some contexts, that alone can be worth any write-amplification costs.
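
To see how many extents a given file actually has, the filefrag tool from 
e2fsprogs can be run on it.  The sketch below only parses its summary 
lines, assuming the usual "<path>: <N> extents found" form; the function 
name and the sample output are made up for illustration.

```python
import re

# Matches summary lines such as "vm.img: 12 extents found".
_FILEFRAG_LINE = re.compile(r"^(?P<path>.+): (?P<count>\d+) extents? found")


def parse_filefrag(output):
    """Map each path in filefrag summary output to its extent count."""
    counts = {}
    for line in output.splitlines():
        match = _FILEFRAG_LINE.match(line)
        if match:
            counts[match.group("path")] = int(match.group("count"))
    return counts


# Hypothetical output, standing in for the result of: filefrag vm.img db.sqlite
sample_output = (
    "vm.img: 48213 extents found\n"
    "db.sqlite: 1 extent found\n"
)
extent_counts = parse_filefrag(sample_output)
heavily_fragmented = [p for p, n in extent_counts.items() if n > 1000]
print(extent_counts, heavily_fragmented)
```

The threshold of 1000 extents is arbitrary; what counts as too fragmented 
depends on file size and workload.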

On balance, I was persuaded to use autodefrag on my own btrfs filesystems 
on SSDs, and believe the near-term write-cycle damage may in fact be 
largely counteracted by the indirect free-space defrag effect and the 
effect that in turn has on the ability to even find large areas of 
cohesive free space to write into in the first place.  With that largely 
counteracted, the other benefits in my mind again outweigh the negatives, 
so autodefrag continues to be worth it in general, even on SSDs.

But I can definitely see how someone could logically take the opposing 
position.  Unless someone actually does either some pretty complex 
benchmarks or some longer term a/b testing, where autodefrag's longer term 
effect on free space fragmentation can come into play on one side, against 
just letting things fragment as they will on the other, and does so in 
enough different usage scenarios to be convincing for the general purpose 
case as well, it's unlikely the debate will ever be properly resolved.

I suppose someone will eventually do that sort of testing, but of course 
even if they did it now, with btrfs code still to be optimized and 
various scaling work still to be done, it's anyone's guess if the test 
results would still apply a few years down the road, after that scaling 
and optimization work.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
