From: Boris Burkov <boris@bur.io>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 1/5] btrfs: 1G falloc extents
Date: Thu, 6 Oct 2022 13:41:32 -0700
Message-ID: <Yz89fEqWnh+iwA9/@zen>
In-Reply-To: <CAL3q7H6Bn47CFL0tOsjCZ3iLgEPRm9_ZXV7duUSMZ2H-g0JhgQ@mail.gmail.com>

On Thu, Oct 06, 2022 at 08:56:03PM +0100, Filipe Manana wrote:
> On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> > > On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > > >
> > > >
> > > >
> > > > On 2022/10/6 03:49, Boris Burkov wrote:
> > > > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > > > extents. Our data block groups are 1GiB, so 1GiB is a more natural
> > > > > maximum size; that way we tend to allocate and fully use block groups
> > > > > rather than scattering the file across many of them.
> > > > >
> > > > > This is especially useful if large fallocates tend to be for "round"
> > > > > amounts, which strikes me as a reasonable assumption.
> > > > >
> > > > > While moving to size classes reduces the value of this change, it is
> > > > > also good to compare potential allocator algorithms against just 1G
> > > > > extents.
> > > >
> > > > Btrfs extent booking already causes a lot of wasted space; is this
> > > > larger extent size really a good idea?
> > > >
> > > > E.g. after a lot of random writes, we may have only a very small part
> > > > of the original 1G still referenced.
> > > > (The first write into the pre-allocated range will not be COWed, but the
> > > > next one over the same range will be COWed)
> > > >
> > > > But the full 1G can only be freed if none of its sectors is referenced.
> > > > Thus this would make preallocated space much harder to free, and
> > > > snapshots/reflinks can make it even worse.
> > > >
> > > > So wouldn't such an enlarged preallocated extent size cause more pressure?
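For anyone following along, here is a minimal userspace sketch of the
scenario Qu describes. The path is made up; run it against a scratch
btrfs mount and compare `filefrag -v` and `btrfs fi df` before and after
step 3 to see the bookend effect:

/*
 * Bookend-extent sketch: preallocate one large extent, fill it once
 * (those first writes land in the prealloc space and are not COWed),
 * then rewrite a small range (COWed to a new extent). The untouched
 * remainder keeps the whole original extent pinned on disk.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SZ (256ULL << 20)  /* one preallocated extent at today's cap */
#define CHUNK   (1 << 20)

int main(void)
{
        const char *path = "/mnt/scratch/prealloc-demo";  /* hypothetical */
        int fd = open(path, O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(CHUNK);
        memset(buf, 0xaa, CHUNK);

        /* 1) Preallocate one large extent. */
        if (fallocate(fd, 0, 0, FILE_SZ) < 0) { perror("fallocate"); return 1; }

        /* 2) First fill: these writes go into the preallocated extent. */
        for (off_t off = 0; off < (off_t)FILE_SZ; off += CHUNK)
                pwrite(fd, buf, CHUNK, off);
        fsync(fd);

        /* 3) Rewrite one small range: this write is COWed to a new extent,
         *    while the rest of the file still references the original 256M
         *    extent, so none of it can be freed. */
        memset(buf, 0xbb, CHUNK);
        pwrite(fd, buf, CHUNK, 0);
        fsync(fd);

        free(buf);
        close(fd);
        return 0;
}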
> > >
> > > I agree; increasing the max extent size here does not seem like a good
> > > thing to do.
> > >
> > > If an application fallocates space, it generally expects to write to all
> > > that space. However, future random partial writes may not rewrite the
> > > entire extent for a very long time, forcing us to keep a 1G extent around
> > > for a very long time (or forever in the worst case).
> > >
> > > Even for NOCOW files, it's still an issue if snapshots are used.
> > >
> >
> > I see your point, and agree 1GiB is worse with respect to bookend
> > extents. Since the patchset doesn't really rely on this, I don't mind
> > dropping the change. I was mostly trying to rule this out as a trivial
> > fix that would obviate the need for other changes.
> >
> > However, I'm not completely convinced by the argument, for two reasons.
> >
> > The first is essentially Qu's last comment. If you guys are right, then
> > 256MiB is probably a really bad value for this as well, and we should be
> > reaping the wins of making it smaller.
>
> Well, that's not a reason to increase the size, quite the opposite.
Agreed. My point was more: why is it 256MiB? There is something we trade
off against bookend waste (based on your later comment, I take it to be
performance for big files). I believe that with 256MiB we have already
crossed the bookend Rubicon, so we might as well accept it.
>
> >
> > The second is that I'm not convinced the regression will actually
> > happen here in practice.
>
> Over the years, every now and then there are users reporting that their
> free space is mysteriously going away and they have no clue why, even
> when not using snapshots. More experienced users provide help and
> eventually notice it's caused by many bookend extents, and the given
> workaround is to defrag the files.
>
> You can frequently see such reports in threads on this mailing list or
> in the IRC channel.
That's not what I was asking, really. Who overwrites 256MiB but not
1GiB? Is that common? Overwriting less than 256MiB (i.e., not fully
covering an extent) is exactly as bad in the status quo, as far as I can
see.
>
> > Let's say someone does a 2GiB falloc and writes the file out once. In
> > the old code that will be eight 256MiB extents; in the new code, two
> > 1GiB extents. Then, to have this be a regression, the user would have
> > to fully overwrite one of the 256MiB extents, but not a full 1GiB. Are
> > there a lot of workloads that don't use nocow, and which randomly
> > overwrite all of a 256MiB extent of a larger file? Maybe...
>
> Assuming that all or most workloads that use fallocate also set nocow on
> the files is quite a big stretch, IMO.
Agreed. I won't argue based on nocow.
> It's true that for performance-critical applications like traditional
> relational databases, people usually set nocow, or the application or
> some other software does it automatically for them.
>
> But there are also many applications that use fallocate which people are
> not aware of, and nothing sets nocow on their files, either manually or
> automatically. Especially if they are not IO intensive, they may go
> unnoticed, and so space wasted due to bookend extents is more likely to
> happen.
>
> Having a bunch of very small extents, say 4K to 512K, is clearly bad
> compared to having just a few 256M extents. But having a few 1G extents
> instead of a group of 256M extents probably doesn't make as huge a
> difference as in the former case.
>
> Do the 1G extents benefit some Facebook-specific workload, or some
> well-known workload?
I observed that we have highly fragmented file systems (before
autorelocate) and that, after a balance, they regress back to being
fragmented in a day or so. The frontier of new block group allocations is
all from large fallocates (mostly from installing binary packages for
starting or updating services, getting configuration/data blobs, etc.).
If you do two 1GiB fallocates in a fs that looks like:
      BG1              BG2              BG3
|xxxx............|................|................|
you get something like:
|xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHH............|
then you do some random little writes:
|xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHHy...........|
then you delete the second fallocate because you're done with that
container:
|xxxxAAAABBBBCCCC|DDDD............|HHHHy...........|
then you do another small write:
|xxxxAAAABBBBCCCC|DDDDz...........|HHHHy...........|
then you fallocate the next package:
|xxxxAAAABBBBCCCC|DDDDzIIIIJJJJ...|HHHHyKKKK.......|
and so on until you blow through 100 block groups in a day.
I don't think this is particularly Facebook-specific, but it does seem
specific to an "installs/upgrades lots of containers" type of workload.
>
>
> >
> > Since I'm making the change, it's incumbent on me to prove it's safe, so
> > with that in mind, I would reiterate that I'm fine with dropping it.
> >
> > > >
> > > > In fact, the original 256M already seems too large to me.
> > > >
> > > > Thanks,
> > > > Qu
> > > > >
> > > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > > ---
> > > > > fs/btrfs/inode.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > > index 45ebef8d3ea8..fd66586ae2fc 100644
> > > > > --- a/fs/btrfs/inode.c
> > > > > +++ b/fs/btrfs/inode.c
> > > > > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> > > > >          if (trans)
> > > > >                  own_trans = false;
> > > > >          while (num_bytes > 0) {
> > > > > -                cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > > > > +                cur_bytes = min_t(u64, num_bytes, SZ_1G);
> > > > >                  cur_bytes = max(cur_bytes, min_size);
> > > > >                  /*
> > > > >                   * If we are severely fragmented we could end up with really
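For reference, here is a minimal FIEMAP dump I use to check how a given
fallocate ended up split into extents (essentially what `filefrag -v`
shows; a sketch only, with minimal error handling):

/*
 * Print the extent layout of a file via FS_IOC_FIEMAP. On btrfs, a
 * large fallocate shows up as a series of "prealloc" extents capped at
 * the limit above (256M before this patch, up to 1G after).
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 256

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        size_t sz = sizeof(struct fiemap) +
                    MAX_EXTENTS * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);
        if (!fm) return 1;
        fm->fm_start = 0;
        fm->fm_length = ~0ULL;          /* map the whole file */
        fm->fm_extent_count = MAX_EXTENTS;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *e = &fm->fm_extents[i];
                printf("extent %u: logical %llu, len %llu MiB%s\n", i,
                       (unsigned long long)e->fe_logical,
                       (unsigned long long)e->fe_length >> 20,
                       (e->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
                               " (prealloc)" : "");
        }
        free(fm);
        close(fd);
        return 0;
}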