From: Boris Burkov <boris@bur.io>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 1/5] btrfs: 1G falloc extents
Date: Thu, 6 Oct 2022 11:38:35 -0700 [thread overview]
Message-ID: <Yz8gq6ErCqeMGUO1@zen> (raw)
In-Reply-To: <CAL3q7H4tYntEXyjbQ+9UMj1PjXOFsnvpxEOSBfEXNGjS7k3HVA@mail.gmail.com>
On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > On 2022/10/6 03:49, Boris Burkov wrote:
> > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > extents. Our data block groups are 1GiB, so 1GiB is a more natural
> > > maximum extent size: we then tend to allocate and fully use whole block
> > > groups rather than fragmenting the file across several of them.
> > >
> > > This is especially useful if large fallocates tend to be for "round"
> > > amounts, which strikes me as a reasonable assumption.
> > >
> > > While moving to size classes reduces the value of this change, it is
> > > also good to compare potential allocator algorithms against just 1G
> > > extents.
> >
> > Btrfs extent bookending already causes a lot of wasted space, so is this
> > larger extent size really a good idea?
> >
> > E.g. after a lot of random writes, we may have only a very small part of
> > the original 1G extent still referenced.
> > (The first write into the preallocated range will not be COWed, but any
> > later write over the same range will be.)
> >
> > But the full 1G can only be freed once none of its sectors is referenced.
> > This makes preallocated space much harder to free, and
> > snapshots/reflinks can make it even worse.
> >
> > So wouldn't such an enlarged preallocated extent size cause more space pressure?
>
> I agree, increasing the max extent size here does not seem like a good
> thing to do.
>
> If an application fallocates space, then it generally expects to write to all
> that space. However future random partial writes may not rewrite the entire
> extent for a very long time, therefore making us keep a 1G extent for a very
> long time (or forever in the worst case).
>
> Even for NOCOW files, it's still an issue if snapshots are used.
>
I see your point, and I agree that 1GiB is worse with respect to bookend
extents. Since the patchset doesn't really rely on this change, I don't
mind dropping it. I was mostly trying to rule it out as a trivial fix that
would obviate the need for the other changes.

However, I'm not completely convinced by the argument, for two reasons.

The first is essentially Qu's last comment: if you are right, then 256MiB
is probably a really bad value as well, and we should be reaping wins by
making it smaller.

The second is that I'm not convinced this regression will actually happen
in practice. Say someone does a 2GiB fallocate and writes the file out
once. With the old code that is 8 256MiB extents; with the new code, 2
1GiB extents. For this to regress, the user would then have to fully
overwrite one of the 256MiB extents, but not a full 1GiB one. Are there
many workloads that don't use nocow and randomly overwrite the entirety of
a 256MiB extent of a larger file? Maybe..

Since I'm the one making the change, it's incumbent on me to show it's
safe, so with that in mind, I'll reiterate that I'm fine dropping it.
> >
> > In fact, the original 256M already seems too large to me.
> >
> > Thanks,
> > Qu
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > > fs/btrfs/inode.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 45ebef8d3ea8..fd66586ae2fc 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> > > if (trans)
> > > own_trans = false;
> > > while (num_bytes > 0) {
> > > - cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > > + cur_bytes = min_t(u64, num_bytes, SZ_1G);
> > > cur_bytes = max(cur_bytes, min_size);
> > > /*
> > > * If we are severely fragmented we could end up with really