public inbox for linux-btrfs@vger.kernel.org
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Jan Ziak <0xe2.0x9a.0x9b@gmail.com>
Cc: Phillip Susi <phill@thesusis.net>,
	Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
Date: Wed, 16 Mar 2022 13:02:46 -0400	[thread overview]
Message-ID: <YjIYNjq8b7Ar/+Gt@hungrycats.org> (raw)
In-Reply-To: <CAODFU0pwch49XB4oGX0GKvuRyrp+JEYBbrHvHcXTnWapPBQ8Aw@mail.gmail.com>

On Tue, Mar 15, 2022 at 11:20:09PM +0100, Jan Ziak wrote:
> On Tue, Mar 15, 2022 at 10:06 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > This is what makes
> > nodatacow and prealloc slow--on every write, they have to check whether
> > the blocks being written are shared or not, and that check is expensive
> > because it's a linear search of every reference for overlapping block
> > ranges, and it can't exit the search early until it has proven there
> > are no shared references.  Contrast with datacow, which allocates a new
> > unshared extent that it knows it can write to, and only has to check
> > overwritten extents when they are completely overwritten (and only has
> > to check for the existence of one reference, not enumerate them all).
> 
> Some questions:
> 
> - Linear nodatacow search: Do you mean that write(fd1, buf1, 4096) to
> a larger nodatacow file is slower compared to write(fd2, buf2, 4096)
> to a smaller nodatacow file?

Size doesn't matter; the number and position of references do.  It's true
that large extents tend to end up with higher average reference counts
than small extents, but that's only a spurious correlation--the "large
extent" and "many references" cases are independent.  An 8K nodatacow
extent, where the first 4K block has exactly one reference and the second
4K block has 32767 references, requires 32768 times more CPU work to
write than a 128M extent with a single reference.

In sane cases, there's only one reference to a nodatacow/prealloc extent,
because multiple references will turn off nodatacow and multiple writes
will turn off prealloc, defeating both features.  When there's only one
reference, the linear search for overlapping blocks ends quickly.

In insane cases (after hole punching, snapshots, reflinks, or writes
to prealloc files) there exist multiple references to the extent,
each covering distinct byte ranges of the extent.  The btrfs trees
only index references from leaf metadata pages to the entire extent,
so to calculate the number of times an individual block is referenced,
we have to iterate over every existing reference to see if it happens to
overlap the blocks of interest.  That's O(N) in the number of references
(roughly--e.g. we don't need to examine different snapshots sharing a
metadata page, because every snapshot sharing a metadata page references
the same bytes in the data extent, but I don't know if btrfs implements
that optimization).
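A toy model of that overlap scan makes the O(N) cost concrete.  This is
not btrfs code--extent references are reduced to (offset, length) pairs
within one extent, and the function name is made up--but it shows why
every reference must be examined to answer a per-block question:

```python
# Toy model of btrfs per-block reference counting (not actual btrfs code).
# Each reference to a data extent covers a byte range [offset, offset+length)
# within that extent.  Block-level sharing is not indexed directly, so the
# only way to count references to one block is to scan every reference.

def block_ref_count(refs, block_start, block_len):
    """Count how many references overlap the given block range.

    refs: list of (offset, length) tuples, one per extent reference.
    Runs in O(len(refs)): any reference might overlap the block of
    interest, so proving a block unshared requires scanning them all.
    """
    count = 0
    for off, length in refs:
        if off < block_start + block_len and block_start < off + length:
            count += 1
    return count

# 8K extent: first 4K block has 1 reference, second 4K block has 32767.
refs = [(0, 8192)] + [(4096, 4096)] * 32766
assert block_ref_count(refs, 0, 4096) == 1         # unshared: safe to overwrite
assert block_ref_count(refs, 4096, 4096) == 32767  # shared: must CoW
# Either answer required visiting all 32767 references.

# 128M extent with a single reference: one comparison settles it.
assert block_ref_count([(0, 128 * 1024 * 1024)], 0, 4096) == 1
```

(A real write path can stop as soon as a second overlapping reference is
found, but when the block is in fact unshared there is no early exit.)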

We can't simply read the reference count on the extent, for a couple
of reasons.  One is that we don't know what the true reference count
is without walking all parent tree nodes toward the root to check for
snapshots: the extent may be referenced by only one metadata page, so its
recorded reference count is 1, but if that metadata page is shared by
multiple tree roots, the true reference count is higher.  Another is that
a hole punched into the middle of an extent creates two references from
the same file, where each reference covers a distinct set of blocks.
None of the individual blocks is shared, but the extent's reference
count is 2.
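The hole-punch case can be sketched in the same toy representation
(again, made-up names, not btrfs code): the extent-level count says 2,
yet no single block is referenced more than once.

```python
# Toy model (not btrfs code): punching a hole in the middle of a file
# whose data lives in one extent splits the single extent reference into
# two references covering disjoint ranges of the same extent.

EXTENT_LEN = 131072                      # one 128K data extent

refs = [(0, EXTENT_LEN)]                 # whole file -> one reference

# Punch a 4K hole at byte 65536: the one reference becomes two.
hole_start, hole_len = 65536, 4096
refs = [(0, hole_start),
        (hole_start + hole_len, EXTENT_LEN - hole_start - hole_len)]

extent_refcount = len(refs)              # 2: looks shared at extent level

def block_ref_count(refs, start, length):
    return sum(1 for off, ln in refs
               if off < start + length and start < off + ln)

assert extent_refcount == 2
# Yet no individual 4K block is referenced more than once:
assert all(block_ref_count(refs, b, 4096) <= 1
           for b in range(0, EXTENT_LEN, 4096))
```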

> - Linear nodatacow search: Does the search happen only with uncached
> metadata, or also with metadata cached in RAM?

All metadata is cached in RAM prior to searching.  I think I missed
where you were going with this question.

> - Extent tree v2 + nodatacow: V2 also features the linear search (like
> v1) or has the search been redesigned to be logarithmic?

I haven't seen the implementation, but the design implies a linear
search over the adjacent range of extent physical addresses that is up
to 2 * max_extent_len wide.  It could be made faster with a clever data
structure, which is implied in the project description, but I haven't
seen details.

There are simple ways to make nodatacow fast, but btrfs doesn't implement
them.  e.g. nodatacow could be a subvol property, where reflink and
snapshot are prohibited over the entire subvol when nodatacow is enabled.
That would eliminate the need to ever search extent references on
write--nodatacow writes could safely assume nothing in the subvol is
ever shared--and it would match the expectations of people who prefer
that nodatacow take precedence over all incompatible btrfs features.
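That suggested design is easy to sketch.  This is hypothetical--btrfs
has no such subvol flag today, and every name below is invented for
illustration--but it shows where the O(N) scan would disappear:

```python
# Hypothetical design sketch (NOT implemented in btrfs; all names made up):
# a subvol-wide nodatacow flag whose invariant--no reflinks or snapshots
# ever allowed in this subvol--lets overwrites skip the reference search.

def can_overwrite_in_place(subvol, extent_refs, block_start, block_len):
    if subvol.get("nodatacow_no_sharing"):
        # Invariant enforced when reflink/snapshot are refused: nothing
        # in this subvol is ever shared, so no per-block scan is needed.
        return True
    # Otherwise fall back to today's O(N) scan over all extent references.
    overlaps = sum(1 for off, ln in extent_refs
                   if off < block_start + block_len and block_start < off + ln)
    return overlaps <= 1

# Flagged subvol: constant-time answer, no references examined.
assert can_overwrite_in_place({"nodatacow_no_sharing": True}, [], 0, 4096)
# Unflagged subvol with a shared block: must CoW.
assert not can_overwrite_in_place({}, [(0, 8192), (0, 8192)], 0, 4096)
```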

> -Jan

Thread overview: 71+ messages
2022-03-06 15:59 Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit Jan Ziak
2022-03-07  0:48 ` Qu Wenruo
2022-03-07  2:23   ` Jan Ziak
2022-03-07  2:39     ` Qu Wenruo
2022-03-07  7:31       ` Qu Wenruo
2022-03-10  1:10         ` Jan Ziak
2022-03-10  1:26           ` Qu Wenruo
2022-03-10  4:33             ` Jan Ziak
2022-03-10  6:42               ` Qu Wenruo
2022-03-10 21:31                 ` Jan Ziak
2022-03-10 23:27                   ` Qu Wenruo
2022-03-11  2:42                     ` Jan Ziak
2022-03-11  2:59                       ` Qu Wenruo
2022-03-11  5:04                         ` Jan Ziak
2022-03-11 16:31                           ` Jan Ziak
2022-03-11 20:02                             ` Jan Ziak
2022-03-11 23:04                             ` Qu Wenruo
2022-03-11 23:28                               ` Jan Ziak
2022-03-11 23:39                                 ` Qu Wenruo
2022-03-12  0:01                                   ` Jan Ziak
2022-03-12  0:15                                     ` Qu Wenruo
2022-03-12  3:16                                     ` Zygo Blaxell
2022-03-12  2:43                                 ` Zygo Blaxell
2022-03-12  3:24                                   ` Qu Wenruo
2022-03-12  3:48                                     ` Zygo Blaxell
2022-03-14 20:09                         ` Phillip Susi
2022-03-14 22:59                           ` Zygo Blaxell
2022-03-15 18:28                             ` Phillip Susi
2022-03-15 19:28                               ` Jan Ziak
2022-03-15 21:06                               ` Zygo Blaxell
2022-03-15 22:20                                 ` Jan Ziak
2022-03-16 17:02                                   ` Zygo Blaxell [this message]
2022-03-16 17:48                                     ` Jan Ziak
2022-03-17  2:11                                       ` Zygo Blaxell
2022-03-16 18:46                                 ` Phillip Susi
2022-03-16 19:59                                   ` Zygo Blaxell
2022-03-20 17:50                             ` Forza
2022-03-20 21:15                               ` Zygo Blaxell
2022-03-08 21:57       ` Jan Ziak
2022-03-08 23:40         ` Qu Wenruo
2022-03-09 22:22           ` Jan Ziak
2022-03-09 22:44             ` Qu Wenruo
2022-03-09 22:55               ` Jan Ziak
2022-03-09 23:00                 ` Jan Ziak
2022-03-09  4:48         ` Zygo Blaxell
2022-03-07 14:30 ` Phillip Susi
2022-03-08 21:43   ` Jan Ziak
2022-03-09 18:46     ` Phillip Susi
2022-03-09 21:35       ` Jan Ziak
2022-03-14 20:02         ` Phillip Susi
2022-03-14 21:53           ` Jan Ziak
2022-03-14 22:24             ` Remi Gauvin
2022-03-14 22:51               ` Zygo Blaxell
2022-03-14 23:07                 ` Remi Gauvin
2022-03-14 23:39                   ` Zygo Blaxell
2022-03-15 14:14                     ` Remi Gauvin
2022-03-15 18:51                       ` Zygo Blaxell
2022-03-15 19:22                         ` Remi Gauvin
2022-03-15 21:08                           ` Zygo Blaxell
2022-03-15 18:15             ` Phillip Susi
2022-03-16 16:52           ` Andrei Borzenkov
2022-03-16 18:28             ` Jan Ziak
2022-03-16 18:31             ` Phillip Susi
2022-03-16 18:43               ` Andrei Borzenkov
2022-03-16 18:46               ` Jan Ziak
2022-03-16 19:04               ` Zygo Blaxell
2022-03-17 20:34                 ` Phillip Susi
2022-03-17 22:06                   ` Zygo Blaxell
2022-03-16 12:47 ` Kai Krakow
2022-03-16 18:18   ` Jan Ziak
