From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Omar Sandoval <osandov@osandov.com>,
linux-xfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] xfs: fix internal error from AGFL exhaustion
Date: Wed, 25 Oct 2023 20:21:02 -0700 [thread overview]
Message-ID: <20231026032102.GI3195650@frogsfrogsfrogs> (raw)
In-Reply-To: <ZTmmw7jDuEeBFbd3@dread.disaster.area>
On Thu, Oct 26, 2023 at 10:37:39AM +1100, Dave Chinner wrote:
> On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > > potential leaf split in each of the bnobt and cntbt.
> > > >
> > > > I see how the cntbt can split because changing the length of a freespace
> > > > extent can require the record to move to a totally different part of the
> > > > cntbt. The reinsertion causes the split.
> > > >
> > > > How does the bnobt split during a refresh of the AGFL? Will we ever
> > > > allocate a block from the /middle/ of a free space extent?
> > >
> > > That's the case I was worried about for the bnobt. I see two ways that
> > > can happen:
> > >
> > > 1. alignment, prod, or mod requirements, which the freelist doesn't
> > > have.
> > > 2. Busy extents. I don't know enough about XFS to say whether or not
> > > this is applicable, but at first glance I don't see why not.
> >
> > Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> > the beginning of that extent is still busy due to slow discard, so
> > xfs_alloc_compute_aligned gives us a block from the middle of the free
> > space. Since AGFL filling has no particular alignment/prod/mod, any
> > number of blocks are good enough, so we end up taking the blocks from
> > the middle of the extent.
>
> *nod*
>
> That's exactly the conclusion I came to when I wondered how it was
> possible...
>
> > > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > > handle a single level leaf split before path intersection occurs.
> > > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > > away.
> > > > > >
> > > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > > >
> > > > > > Worst split after a full split is up to the level below the old root
> > > > > > block. the root block was split, so it won't need a split again, so
> > > > > > only it's children could split again. OK, so that's (height - 1)
> > > > > > needed for a max split to refill the AGFL after a full height split
> > > > > > occurred.
> > > > > >
> > > > > > Hence it looks like the minimum AGFL space for any given
> > > > > > allocation/free operation needs to be:
> > > > > >
> > > > > > (full height split reservation) + (AGFL refill split height)
> > > > > >
> > > > > > which is:
> > > > > >
> > > > > > = (new height) + (new height - 2)
> > > > > > = 2 * new height - 2
> > > > > > = 2 * (current height + 1) - 2
> > > > > > = 2 * current height
> > > > > >
> > > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > > transaction that has exhausted it's AGFL reservation.
> > > > > >
> > > > > > Now, did I get that right?
> > > > >
> > > > > That sounds right, but I'll have to think about it more after some sleep
> > > > > :)
> > > >
> > > > I think that's correct, assuming that (2 * current_height) is the
> > > > per-btree calcluation.
> > >
> > > Agreed, except when the tree is already the maximum level. In that case,
> > > the worst case is splitting up to but not including the root node twice,
> > > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > > height with current height + 1). So I think we want:
> > >
> > > min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > > mp->m_alloc_maxlevels) * 2 - 2;
> > >
> > > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > > rather than such a drastic change to the AGFL size.
> > >
> > > Come to think of it, if there is nothing <= the desired size but there
> > > is something >, we have no choice but to do the split, so that doesn't
> > > work.
> >
> > Yeah, I was kind of wondering that myself.
>
> Ok, I'm glad to see there are enough eyes on this to point out the
> things I missed when musing about it as a possible solution...
>
> > > > The absolute maximum height of a free space btree is 7 blocks, according
> > > > to xfs_db:
> > > >
> > > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > > bnobt: 7
> > > > cntbt: 7
> > > > inobt: 6
> > > > finobt: 6
> > > > bmapbt: 14
> > > > refcountbt: 6
> > > > rmapbt: 10
> > > >
> > > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > > think it's /that/ drastic. For a filesystem with rmapbt (V5, 1k blocks)
> > > > that minimum jumps to 121 blocks.
>
> Yup. keep in mind this comment in xfs_alloc_fix_freelist():
>
> * XXX (dgc): When we have lots of free space, does this buy us
> * anything other than extra overhead when we need to put more blocks
> * back on the free list? Maybe we should only do this when space is
> * getting low or the AGFL is more than half full?
>
> IOWs, I've previously mused about keeping the AGFL much fuller than
> the absolute minimum. There doesn't seem to be any real gain to
> keeping it at just the minimum size when we are nowhere near ENOSPC,
> it just means we are doing lots of management operations like
> reducing it immediately after a transaction that has released blocks
> to the freelist, then only to have to extend it again when we do an
> operation that consumes blocks from the free list.
>
> i.e. should we add also hysteresis to limit the number of AGFL
> size manipulations we need to do in mixed workloads? i.e. we don't free
> blocks from the AGFL until it might get overfilled by an operation
> (i.e. max size - max merge free blocks) or we are nearing ENOSPC,
> and we don't fill till we are at the minimum. In either case, the
> target for empty or fill operations is "half full"...
Hmm. What if we kept the AGFL completely full?
On that 512b-fsblock V4 filesystem, that'd use up 128 blocks instead of 7
blocks. 128 blocks == 64K per AG.
On a 64K-fsblock V5 filesystem, that'd use up 16370 64K blocks per AG, or
~1GB per AG. Ok, that might be too much.
On a 4K-fsblock V5 fs, that's 1010 4K blocks per AG, or ~4MB per AG.
A gigabyte is a bit ridiculous, but what if we always did the maximum
bnobt/cntbt/rmapbt height instead of current_height + 1?
For that 4K-fsblock filesystem, that's 24 blocks per AG, or 96K. Given
that we only support 300M+ filesystems now, that's 75M per AG. I feel
like we could go crazy and try to keep 640K in the AGFL.
640K ought to be enough for anyone.
(Numbers chosen arbitrarily, of course.)
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2023-10-26 3:21 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-24 23:37 [PATCH] xfs: fix internal error from AGFL exhaustion Omar Sandoval
2023-10-24 23:37 ` [PATCH fstests] xfs: test refilling AGFL after lots of btree splits Omar Sandoval
2023-10-25 15:27 ` Darrick J. Wong
2023-10-25 20:15 ` Omar Sandoval
2023-10-25 22:05 ` Darrick J. Wong
2023-10-25 4:35 ` [PATCH] xfs: fix internal error from AGFL exhaustion Dave Chinner
2023-10-25 5:14 ` Omar Sandoval
2023-10-25 15:50 ` Darrick J. Wong
2023-10-25 20:03 ` Omar Sandoval
2023-10-25 22:39 ` Darrick J. Wong
2023-10-25 23:37 ` Dave Chinner
2023-10-26 3:21 ` Darrick J. Wong [this message]
2023-10-26 20:56 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231026032102.GI3195650@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=kernel-team@fb.com \
--cc=linux-xfs@vger.kernel.org \
--cc=osandov@osandov.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox