public inbox for linux-btrfs@vger.kernel.org
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Remi Gauvin <remi@georgianit.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
Date: Tue, 15 Mar 2022 14:51:23 -0400	[thread overview]
Message-ID: <YjDgKzAx/tawKHCz@hungrycats.org> (raw)
In-Reply-To: <eda21cae-4825-458a-dd69-1e2740955dc0@georgianit.com>

On Tue, Mar 15, 2022 at 10:14:01AM -0400, Remi Gauvin wrote:
> On 2022-03-14 7:39 p.m., Zygo Blaxell wrote:
> > If we're adding a mount option for this (I'm not opposed to it, I'm
> > pointing out that it's not the first tool to reach for), then ideally
> > we'd overload it for the compressed batch size (currently hardcoded
> > at 512K).
> 
> Are there any advantages to extents larger than 256K on SSD media?

The main advantage of larger extents is smaller metadata, and it doesn't
matter very much whether it's SSD or HDD.  Adjacent extents will be in
the same metadata page, so not much is lost with 256K extents even on HDD,
as long as they are physically allocated adjacent to each other.

There is a CPU hit for every extent, and when snapshot pages become
unshared, every distinct extent on the page needs its reference count
updated for the new page.  The costs of small extents add up during
balances, resizes, and snapshot deletes, but on a small filesystem you'd
want smaller extents so that balances and resizes are possible at all
(this is why there's a 128M limit now--previously, extents of multiple
GB were possible).

Averaged across my filesystems, half of the data blocks are in extents
below 512K, and only 1% of extents are 1M or larger.  Capping the extent
size at 256K wouldn't make much difference--the total extent count would
increase by less than 5%.
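The mechanics of that estimate can be sketched with a toy calculation: capping the maximum extent size splits each oversized extent into ceiling(size / cap) pieces, so the count increase depends entirely on how much data lives in large extents. The distribution below is illustrative only, not the real filesystem data described above.

```python
import math

CAP = 256 * 1024  # proposed maximum extent size (256K)

def extent_count_after_cap(extent_sizes, cap=CAP):
    """Number of extents after splitting each extent into pieces of at most `cap`."""
    return sum(math.ceil(size / cap) for size in extent_sizes)

# Toy distribution (NOT the measured data quoted above): mostly small
# extents, a minority of large ones that get split by the cap.
sizes = [64 * 1024] * 900 + [512 * 1024] * 90 + [4 * 1024 * 1024] * 10
before = len(sizes)                    # 1000 extents
after = extent_count_after_cap(sizes)  # 900 + 90*2 + 10*16 = 1240
print(before, after)                   # 1000 1240
```

With real-world distributions like the one described above (1% of extents at 1M or larger), the large-extent term contributes far less, which is why the overall increase stays small.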

In my defrag experiments, the Pareto limit kicks in at a target extent
size of 100K-200K (anything larger than this doesn't get better when
defragged; anything smaller kills performance if it's _not_ defragged).
256K may already be larger than optimal for some workloads.

> Even if a much needed garbage collection process were to be created,
> the smaller extents would mean less data would need to be re-written,
> (and potentially duplicated due to snapshots and ref copies.)

GC has to take all references into account when computing block
reachability, and it has to eliminate all references to remove garbage,
so there should not be any new duplicate data.  Currently GC has to
be implemented by copying the data and then using dedupe to replace
references to the original data individually, but that could be optimized
with a new kernel ioctl that handles all the references at once with a
lock, instead of comparing the data bytes for each one.
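The userspace side of that copy-then-dedupe step is the FIDEDUPERANGE ioctl from linux/fs.h (struct file_dedupe_range plus one file_dedupe_range_info per destination). The sketch below shows the struct packing and the call; the layout and ioctl number are from the kernel UAPI headers, but actually succeeding requires a dedupe-capable filesystem such as btrfs, and everything else here (function names, the single-destination shape) is just illustration.

```python
import fcntl
import struct

# _IOWR(0x94, 54, struct file_dedupe_range) from linux/fs.h
FIDEDUPERANGE = 0xC0189436
FILE_DEDUPE_RANGE_SAME = 0

def pack_dedupe_range(src_offset, length, dest_fd, dest_offset):
    """Build struct file_dedupe_range with one file_dedupe_range_info entry."""
    # header: u64 src_offset, u64 src_length, u16 dest_count, u16 + u32 reserved
    hdr = struct.pack("=QQHHI", src_offset, length, 1, 0, 0)
    # info: s64 dest_fd, u64 dest_offset, u64 bytes_deduped, s32 status, u32 reserved
    info = struct.pack("=qQQiI", dest_fd, dest_offset, 0, 0, 0)
    return hdr + info  # 24 + 32 = 56 bytes

def dedupe_one(src_fd, src_offset, length, dest_fd, dest_offset):
    """Ask the kernel to replace dest's blocks with shared references to
    src's blocks, but only if the byte ranges compare equal."""
    buf = bytearray(pack_dedupe_range(src_offset, length, dest_fd, dest_offset))
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
    status = struct.unpack_from("=i", buf, 48)[0]  # info.status
    return status == FILE_DEDUPE_RANGE_SAME

print(len(pack_dedupe_range(0, 128 * 1024, 3, 0)))  # 56
```

The locked byte-by-byte comparison inside this ioctl is exactly the per-reference cost that a dedicated GC ioctl, handling all references under one lock, could avoid.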

GC could create smaller extents intentionally, by creating new extents
in units of 256K, but reflinking them in reverse order over the original
large extents to prevent coalescing extents in writeback.

GC would also have to figure out whether the IO cost of splitting the
extent is worth the space saving (e.g. don't relocate 100MB of data to
save 4K of disk space, wait until it's at least 1MB of space saved).
That's a sysadmin policy input.
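That policy input might boil down to two knobs: an absolute floor on the space reclaimed, and a cap on how much live data you are willing to rewrite per byte reclaimed. A minimal sketch, with entirely made-up threshold names and defaults:

```python
def should_relocate(extent_bytes, reclaimable_bytes,
                    min_save=1 << 20, max_io_per_byte_saved=100):
    """Sysadmin policy sketch: relocate/split an extent only when the
    space reclaimed is worthwhile both in absolute terms and relative
    to the IO cost of rewriting the still-referenced data.
    Thresholds are illustrative, not anything btrfs exposes today."""
    if reclaimable_bytes < min_save:
        return False  # e.g. don't rewrite 100MB to reclaim 4K
    live_bytes = extent_bytes - reclaimable_bytes
    return live_bytes <= reclaimable_bytes * max_io_per_byte_saved

# 100MB extent with only 4K of garbage: leave it alone.
print(should_relocate(100 << 20, 4 << 10))   # False
# 100MB extent with 50MB of garbage: worth rewriting.
print(should_relocate(100 << 20, 50 << 20))  # True
```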

GC is not autodefrag.  If it sees that it has to carve up 100M extents
for sub-64K writes, GC can create 400x 256K extents to replace the large
extents, and only defrag when there's a contiguous range of modified
extents with length 64K or less.  Or whatever sizes turn out to be the
right ones--setting the sizes isn't the hard thing to do here.

Obviously, in that scenario it is more efficient if there's a way to
not write the 100M extents in the first place, but it quickly reaches
a steady state with relatively little wasted space, and doesn't require
tuning knobs in the kernel.

GC + autodefrag could go the other way, too:  make the default extent
size small, but allow autodefrag to request very large extents for files
that have not been modified in a while.  That's inefficient too, but
in the other direction, so it would be a better match for the steady
state of some workloads (e.g. video recording or log files).

Ideally there'd be an "optimum extent size" inheritable inode property,
so we can have databases use tiny extents and video recorders use huge
extents on the same filesystem.  But maybe that's overengineering,
and 256K (128K?  512K?) is within the range of values for most.
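The inheritance semantics would presumably mirror other inheritable inode properties (like compression): an inode's own hint wins, otherwise walk up the parents, otherwise fall back to a filesystem-wide default. This is a model of a hypothetical property, not anything btrfs implements; the names and the 256K default are assumptions.

```python
# Hypothetical inheritable "optimum extent size" property.
FS_DEFAULT_HINT = 256 * 1024  # the middle-ground default discussed above

def effective_hint(inode, tree):
    """tree maps inode -> (parent, hint_or_None); the root's parent is None.
    Returns the nearest explicitly-set hint walking toward the root,
    else the filesystem default."""
    node = inode
    while node is not None:
        parent, hint = tree[node]
        if hint is not None:
            return hint
        node = parent
    return FS_DEFAULT_HINT

tree = {
    "/":         (None, None),
    "/db":       ("/", 64 * 1024),   # databases: tiny extents
    "/db/table": ("/db", None),      # inherits 64K from /db
    "/video":    ("/", 128 << 20),   # recordings: huge extents
    "/home":     ("/", None),        # falls back to the fs default
}
print(effective_hint("/db/table", tree))  # 65536
print(effective_hint("/home", tree))      # 262144
```

Both workloads coexist on one filesystem, which is the whole point of making the limit per-inode rather than a single mount-wide knob.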

> The fine details of how to implement all of this are way over my head,
> but it seems to me that the logic to keep extents small is already
> more or less there, and would need relatively little work to manifest.

There's a #define for maximum new extent length.  It wouldn't be too
difficult to look up that number in fs_info instead, slightly harder
to look it up in an inode.  The limit applies only to new extents,
so there's no backward compatibility issue with the on-disk format.

