From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Did btrfs filesystem defrag just make things worse?
Date: Mon, 13 Jul 2015 07:41:06 -0400
Message-ID: <55A3A3D2.9020306@gmail.com>
In-Reply-To: <pan$a54a$19bc33f8$104eb18a$cf97db25@cox.net>

On 2015-07-11 11:24, Duncan wrote:
> I'm not a coder, only a list regular and btrfs user, and I'm not sure on
> this, but there have been several reports of this nature on the list
> recently, and I have a theory.  Maybe the devs can step in and either
> confirm or shoot it down.
While I am a coder, I'm not a BTRFS developer, so what I say below may 
still be incorrect.
>
[...trimmed for brevity...]
> Of course during normal use, files get deleted as well, thereby clearing
> space in existing chunks.  But this space will be fragmented, with a mix
> of free extents and still-remaining files.  The allocator will I
> /believe/ (this is where people who can actually read the code come in)
> try to use up space in existing chunks before allocating additional
> space, possibly subject to some reasonable minimum extent size, below
> which btrfs will simply allocate another chunk.
AFAICT, this is in fact the case.
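You can actually watch this from userspace by comparing total chunk 
allocation against actual usage.  For example (with /mnt standing in 
for your mount point):

   btrfs filesystem df /mnt       # per-chunk-type total vs. used
   btrfs filesystem usage /mnt    # also shows unallocated device space

If the 'total' figures hold steady while 'used' climbs, writes are 
being packed into existing chunks rather than triggering new chunk 
allocations.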
>
> 1) Prioritize reduced fragmentation, at the expense of higher data chunk
> allocation.  In the extreme, this would mean always choosing to allocate
> a new chunk and use it if the file (or remainder of the file not yet
> defragged) was larger than the largest free extent in existing data
> chunks.
>
> The problem with this is that over time, the number of partially used
> data chunks goes up as new ones are allocated to defrag into, but sub-1
> GiB files that are already defragged are left where they are.  Of course
> a balance can help here, by combining multiple partial chunks into fewer
> full chunks, but unless a balance is run...
>
> 2) Prioritize chunk utilization, at the expense of leaving some
> fragmentation, despite massive amounts of unallocated space.
>
> This is what I've begun to suspect defrag does.  With a bunch of free but
> fragmented space in existing chunks, defrag could actually increase
> fragmentation, as the space in existing chunks is so fragmented that a
> rewrite is forced to use more, smaller extents, because that's all
> there is free, until another chunk is allocated.
>
> As I mentioned above for normal file allocation, it's quite possible that
> there's some minimum extent size (greater than the bare minimum 4 KiB
> block size) where the allocator will give up and allocate a new data
> chunk, but if so, perhaps this size needs to be bumped upward, as it
> seems a bit low today.
If I'm reading the code correctly, defrag does indeed try to avoid 
allocating a new chunk if at all possible.
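A quick way to sanity-check an individual defrag is filefrag from 
e2fsprogs, which works on btrfs too (with the caveat that it 
miscounts compressed files, since those are tracked in 128 KiB 
pieces).  Something like this, with the path as a placeholder:

   filefrag /mnt/data/somefile      # e.g. "...: 213 extents found"
   btrfs filesystem defragment /mnt/data/somefile
   sync
   filefrag /mnt/data/somefile      # ideally far fewer extents

If the extent count goes up instead of down, you're seeing exactly 
the behavior described above.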
>
>
> Meanwhile, there's a number of exacerbating factors to consider as well.
>
> * Snapshots and other shared references lock extents in place.
>
> Defrag doesn't touch anything but the subvolume it's actually pointed
> at.  Other subvolumes and shared-reference files will continue to keep
> the extents they reference locked in place.  And COW will rewrite
> blocks of a file, but the old extent remains locked until all
> references to it are cleared -- the entire file (or at least all blocks
> that were in that extent) must be rewritten, and no snapshots or other
> references to it may remain, before the extent can be freed.
>
> For a few kernel cycles btrfs had snapshot-aware-defrag, but that
> implementation didn't scale well at all, so it was disabled until it
> could be rewritten, and that rewrite hasn't occurred yet.  So snapshot-
> aware-defrag remains disabled, and defrag only works on the subvolume
> it's actually pointed at.
>
> As a result, if defrag rewrites a snapshotted file, it actually doubles
> the space that file takes, as it makes a new copy, breaking the reference
> link between it and the copy in the snapshot.
>
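This is easy to demonstrate.  Roughly, from memory (with /mnt as a 
placeholder mount point and /mnt/data a subvolume on it):

   dd if=/dev/urandom of=/mnt/data/bigfile bs=1M count=512
   sync
   btrfs filesystem df /mnt                      # note Data 'used'
   btrfs subvolume snapshot /mnt/data /mnt/snap
   btrfs filesystem defragment /mnt/data/bigfile
   sync
   btrfs filesystem df /mnt                      # grows by ~512 MiB

That's assuming defrag decides to rewrite the file at all; it may 
skip files it already considers contiguous.
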
> Of course, with those extents pinned, whatever space does get freed
> tends to become even more heavily fragmented over time.
To mitigate this, one can run offline data deduplication (duperemove is 
the tool I'd suggest for this), although there are caveats to doing that 
as well.
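For reference, a typical invocation looks something like this (-d 
does the actual dedupe rather than just reporting, -r recurses into 
subdirectories; the path is a placeholder):

   duperemove -dr /mnt/data

The main caveats are that hashing everything can take a long time on 
large data sets, and deduplicated extents are shared extents, so a 
later defrag will simply split them apart again.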
>
> * Chunk reclamation.
>
> This is the relatively new development that I think is triggering the
> surge in defrag-not-defragging reports we're seeing now.
>
> Until quite recently, btrfs could allocate new chunks, but it couldn't,
> on its own, deallocate empty chunks.  What tended to happen over time was
> that people would find all the filesystem space taken up by empty or
> mostly empty data chunks, and btrfs would start spitting ENOSPC errors
> when it needed to allocate new metadata chunks but couldn't, as all the
> space was in empty data chunks.  A balance could fix it, often
> relatively quickly with a -dusage=0 or -dusage=10 filter or the like,
> but it was a manual process; btrfs wouldn't do it on its own.
>
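For anyone still hitting this on an older kernel, the manual cleanup 
looks like this (again with /mnt as a placeholder):

   btrfs balance start -dusage=0 /mnt    # drop completely empty chunks
   btrfs balance start -dusage=10 /mnt   # repack chunks under 10% used

The usage=0 pass is nearly instant, since empty chunks need no data 
moved; higher thresholds take progressively longer.
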
> Recently the devs (mostly) fixed that, and btrfs will automatically
> reclaim entirely empty chunks on its own now.  It still doesn't reclaim
> partially empty chunks automatically; a manual rebalance must still be
> used to combine multiple partially empty chunks into fewer full chunks;
> but it does well enough to make the previous problem pretty rare -- we
> don't see the hundreds of GiB of empty data chunks allocated any more,
> like we used to.
>
> Which fixed the one problem, but if my theory is correct, it exacerbated
> the defrag issue, which I think was there before but seldom triggered,
> so it generally wasn't noticed.
>
> What I believe is happening now compared to before, based on the rash of
> reports we're seeing, is that before, space fragmentation in allocated
> data chunks seldom became an issue, because people tended to accumulate
> all these extra empty data chunks, leaving defrag plenty of
> unfragmented empty space to rewrite the new extents into.
>
> But now, all those empty data chunks are reclaimed, leaving defrag only
> the heavily space-fragmented partially used chunks.  So now we're getting
> all these reports of defrag actually making the problem worse, not better!
I believe that this is in fact the root cause.  Personally, I would love 
to be able to turn this off without having to patch the kernel.  Since 
it went in, not only does it (apparently) cause issues with defrag, but 
DISCARD/TRIM support is broken, and most of my (heavily rewritten) 
filesystems are running noticeably slower as well.  I'm going to start a 
discussion regarding this in another thread, however, as it doesn't just 
affect defrag.




