Re: [PATCH] xfs: fix internal error from AGFL exhaustion

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Omar Sandoval <osandov@osandov.com>,
	linux-xfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] xfs: fix internal error from AGFL exhaustion
Date: Wed, 25 Oct 2023 20:21:02 -0700	[thread overview]
Message-ID: <20231026032102.GI3195650@frogsfrogsfrogs> (raw)
In-Reply-To: <ZTmmw7jDuEeBFbd3@dread.disaster.area>

On Thu, Oct 26, 2023 at 10:37:39AM +1100, Dave Chinner wrote:
> On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > > potential leaf split in each of the bnobt and cntbt.
> > > > 
> > > > I see how the cntbt can split because changing the length of a freespace
> > > > extent can require the record to move to a totally different part of the
> > > > cntbt.  The reinsertion causes the split.
> > > > 
> > > > How does the bnobt split during a refresh of the AGFL?  Will we ever
> > > > allocate a block from the /middle/ of a free space extent?
> > > 
> > > That's the case I was worried about for the bnobt. I see two ways that
> > > can happen:
> > > 
> > > 1. alignment, prod, or mod requirements, which the freelist doesn't
> > >    have.
> > > 2. Busy extents. I don't know enough about XFS to say whether or not
> > >    this is applicable, but at first glance I don't see why not.
> > 
> > Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> > the beginning of that extent is still busy due to slow discard, so
> > xfs_alloc_compute_aligned gives us a block from the middle of the free
> > space.  Since AGFL filling has no particular alignment/prod/mod, any
> > number of blocks are good enough, so we end up taking the blocks from
> > the middle of the extent.
> 
> *nod*
> 
> That's exactly the conclusion I came to when I wondered how it was
> possible...
> 
> > > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > > handle a single level leaf split before path intersection occurs.
> > > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > > away.
> > > > > > 
> > > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > > > 
> > > > > > Worst split after a full split is up to the level below the old root
> > > > > > block. the root block was split, so it won't need a split again, so
> > > > > > only it's children could split again. OK, so that's (height - 1)
> > > > > > needed for a max split to refill the AGFL after a full height split
> > > > > > occurred.
> > > > > > 
> > > > > > Hence it looks like the minimum AGFL space for any given
> > > > > > allocation/free operation needs to be:
> > > > > > 
> > > > > > 	(full height split reservation) + (AGFL refill split height)
> > > > > > 
> > > > > > which is:
> > > > > > 
> > > > > > 	= (new height) + (new height - 2)
> > > > > > 	= 2 * new height - 2
> > > > > > 	= 2 * (current height + 1) - 2
> > > > > > 	= 2 * current height
> > > > > > 
> > > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > > transaction that has exhausted it's AGFL reservation.
> > > > > > 
> > > > > > Now, did I get that right?
> > > > > 
> > > > > That sounds right, but I'll have to think about it more after some sleep
> > > > > :)
> > > > 
> > > > I think that's correct, assuming that (2 * current_height) is the
> > > > per-btree calcluation.
> > > 
> > > Agreed, except when the tree is already the maximum level. In that case,
> > > the worst case is splitting up to but not including the root node twice,
> > > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > > height with current height + 1). So I think we want:
> > > 
> > > min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > > 			       mp->m_alloc_maxlevels) * 2 - 2;
> > > 
> > > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > > rather than such a drastic change to the AGFL size.
> > > 
> > > Come to think of it, if there is nothing <= the desired size but there
> > > is something >, we have no choice but to do the split, so that doesn't
> > > work.
> > 
> > Yeah, I was kind of wondering that myself.
> 
> Ok, I'm glad to see there are enough eyes on this to point out the
> things I missed when musing about it as a possible solution...
> 
> > > > The absolute maximum height of a free space btree is 7 blocks, according
> > > > to xfs_db:
> > > > 
> > > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > > bnobt: 7
> > > > cntbt: 7
> > > > inobt: 6
> > > > finobt: 6
> > > > bmapbt: 14
> > > > refcountbt: 6
> > > > rmapbt: 10
> > > > 
> > > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > > > that minimum jumps to 121 blocks.
> 
> Yup. keep in mind this comment in xfs_alloc_fix_freelist():
> 
> 	 * XXX (dgc): When we have lots of free space, does this buy us
>          * anything other than extra overhead when we need to put more blocks
>          * back on the free list? Maybe we should only do this when space is
>          * getting low or the AGFL is more than half full?
> 
> IOWs, I've previously mused about keeping the AGFL much fuller than
> the absolute minimum. There doesn't seem to be any real gain to
> keeping it at just the minimum size when we are nowhere near ENOSPC,
> it just means we are doing lots of management operations like
> reducing it immediately after a transaction that has released blocks
> to the freelist, then only to have to extend it again when we do an
> operation that consumes blocks from the free list.
> 
> i.e. should we add also hysteresis to limit the number of AGFL
> size manipulations we need to do in mixed workloads? i.e. we don't free
> blocks from the AGFL until it might get overfilled by an operation
> (i.e. max size - max merge free blocks) or we are nearing ENOSPC,
> and we don't fill till we are at the minimum. In either case, the
> target for empty or fill operations is "half full"...

Hmm.   What if we kept the AGFL completely full?

On that 512b-fsblock V4 filesystem, that'd use up 128 blocks instead of 7
blocks.  128 blocks == 64K per AG.

On a 64K-fsblock V5 filesystem, that'd use up 16370 64K blocks per AG, or
~1GB per AG.  Ok, that might be too much.

On a 4K-fsblock V5 fs, that's 1010 4K blocks per AG, or ~4MB per AG.

A gigabyte is a bit ridiculous, but what if we always did the maximum
bnobt/cntbt/rmapbt height instead of current_height + 1?

For that 4K-fsblock filesystem, that's 24 blocks per AG, or 96K.  Given
that we only support 300M+ filesystems now, that's 75M per AG.  I feel
like we could go crazy and try to keep 640K in the AGFL.

640K ought to be enough for anyone.

(Numbers chosen arbitrarily, of course.)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

next prev parent reply	other threads:[~2023-10-26  3:21 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-24 23:37 [PATCH] xfs: fix internal error from AGFL exhaustion Omar Sandoval
2023-10-24 23:37 ` [PATCH fstests] xfs: test refilling AGFL after lots of btree splits Omar Sandoval
2023-10-25 15:27   ` Darrick J. Wong
2023-10-25 20:15     ` Omar Sandoval
2023-10-25 22:05       ` Darrick J. Wong
2023-10-25  4:35 ` [PATCH] xfs: fix internal error from AGFL exhaustion Dave Chinner
2023-10-25  5:14   ` Omar Sandoval
2023-10-25 15:50     ` Darrick J. Wong
2023-10-25 20:03       ` Omar Sandoval
2023-10-25 22:39         ` Darrick J. Wong
2023-10-25 23:37           ` Dave Chinner
2023-10-26  3:21             ` Darrick J. Wong [this message]
2023-10-26 20:56               ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231026032102.GI3195650@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=david@fromorbit.com \
    --cc=kernel-team@fb.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=osandov@osandov.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox