From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Omar Sandoval <osandov@osandov.com>,
linux-xfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] xfs: fix internal error from AGFL exhaustion
Date: Thu, 26 Oct 2023 10:37:39 +1100 [thread overview]
Message-ID: <ZTmmw7jDuEeBFbd3@dread.disaster.area> (raw)
In-Reply-To: <20231025223909.GG3195650@frogsfrogsfrogs>
On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > potential leaf split in each of the bnobt and cntbt.
> > >
> > > I see how the cntbt can split because changing the length of a freespace
> > > extent can require the record to move to a totally different part of the
> > > cntbt. The reinsertion causes the split.
> > >
> > > How does the bnobt split during a refresh of the AGFL? Will we ever
> > > allocate a block from the /middle/ of a free space extent?
> >
> > That's the case I was worried about for the bnobt. I see two ways that
> > can happen:
> >
> > 1. alignment, prod, or mod requirements, which the freelist doesn't
> > have.
> > 2. Busy extents. I don't know enough about XFS to say whether or not
> > this is applicable, but at first glance I don't see why not.
>
> Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> the beginning of that extent is still busy due to slow discard, so
> xfs_alloc_compute_aligned gives us a block from the middle of the free
> space. Since AGFL filling has no particular alignment/prod/mod, any
> number of blocks are good enough, so we end up taking the blocks from
> the middle of the extent.
*nod*
That's exactly the conclusion I came to when I wondered how it was
possible...
> > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > handle a single level leaf split before path intersection occurs.
> > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > away.
> > > > >
> > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > >
> > > > > Worst split after a full split is up to the level below the old root
> > > > > block. the root block was split, so it won't need a split again, so
> > > > > only it's children could split again. OK, so that's (height - 1)
> > > > > needed for a max split to refill the AGFL after a full height split
> > > > > occurred.
> > > > >
> > > > > Hence it looks like the minimum AGFL space for any given
> > > > > allocation/free operation needs to be:
> > > > >
> > > > > (full height split reservation) + (AGFL refill split height)
> > > > >
> > > > > which is:
> > > > >
> > > > > = (new height) + (new height - 2)
> > > > > = 2 * new height - 2
> > > > > = 2 * (current height + 1) - 2
> > > > > = 2 * current height
> > > > >
> > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > transaction that has exhausted it's AGFL reservation.
> > > > >
> > > > > Now, did I get that right?
> > > >
> > > > That sounds right, but I'll have to think about it more after some sleep
> > > > :)
> > >
> > > I think that's correct, assuming that (2 * current_height) is the
> > > per-btree calcluation.
> >
> > Agreed, except when the tree is already the maximum level. In that case,
> > the worst case is splitting up to but not including the root node twice,
> > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > height with current height + 1). So I think we want:
> >
> > min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > mp->m_alloc_maxlevels) * 2 - 2;
> >
> > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > rather than such a drastic change to the AGFL size.
> >
> > Come to think of it, if there is nothing <= the desired size but there
> > is something >, we have no choice but to do the split, so that doesn't
> > work.
>
> Yeah, I was kind of wondering that myself.
Ok, I'm glad to see there are enough eyes on this to point out the
things I missed when musing about it as a possible solution...
> > > The absolute maximum height of a free space btree is 7 blocks, according
> > > to xfs_db:
> > >
> > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > bnobt: 7
> > > cntbt: 7
> > > inobt: 6
> > > finobt: 6
> > > bmapbt: 14
> > > refcountbt: 6
> > > rmapbt: 10
> > >
> > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > think it's /that/ drastic. For a filesystem with rmapbt (V5, 1k blocks)
> > > that minimum jumps to 121 blocks.
Yup. keep in mind this comment in xfs_alloc_fix_freelist():
* XXX (dgc): When we have lots of free space, does this buy us
* anything other than extra overhead when we need to put more blocks
* back on the free list? Maybe we should only do this when space is
* getting low or the AGFL is more than half full?
IOWs, I've previously mused about keeping the AGFL much fuller than
the absolute minimum. There doesn't seem to be any real gain to
keeping it at just the minimum size when we are nowhere near ENOSPC,
it just means we are doing lots of management operations like
reducing it immediately after a transaction that has released blocks
to the freelist, then only to have to extend it again when we do an
operation that consumes blocks from the free list.
i.e. should we add also hysteresis to limit the number of AGFL
size manipulations we need to do in mixed workloads? i.e. we don't free
blocks from the AGFL until it might get overfilled by an operation
(i.e. max size - max merge free blocks) or we are nearing ENOSPC,
and we don't fill till we are at the minimum. In either case, the
target for empty or fill operations is "half full"...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2023-10-25 23:37 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-24 23:37 [PATCH] xfs: fix internal error from AGFL exhaustion Omar Sandoval
2023-10-24 23:37 ` [PATCH fstests] xfs: test refilling AGFL after lots of btree splits Omar Sandoval
2023-10-25 15:27 ` Darrick J. Wong
2023-10-25 20:15 ` Omar Sandoval
2023-10-25 22:05 ` Darrick J. Wong
2023-10-25 4:35 ` [PATCH] xfs: fix internal error from AGFL exhaustion Dave Chinner
2023-10-25 5:14 ` Omar Sandoval
2023-10-25 15:50 ` Darrick J. Wong
2023-10-25 20:03 ` Omar Sandoval
2023-10-25 22:39 ` Darrick J. Wong
2023-10-25 23:37 ` Dave Chinner [this message]
2023-10-26 3:21 ` Darrick J. Wong
2023-10-26 20:56 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZTmmw7jDuEeBFbd3@dread.disaster.area \
--to=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=kernel-team@fb.com \
--cc=linux-xfs@vger.kernel.org \
--cc=osandov@osandov.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.