All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Omar Sandoval <osandov@osandov.com>,
	linux-xfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] xfs: fix internal error from AGFL exhaustion
Date: Wed, 25 Oct 2023 20:21:02 -0700	[thread overview]
Message-ID: <20231026032102.GI3195650@frogsfrogsfrogs> (raw)
In-Reply-To: <ZTmmw7jDuEeBFbd3@dread.disaster.area>

On Thu, Oct 26, 2023 at 10:37:39AM +1100, Dave Chinner wrote:
> On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > > potential leaf split in each of the bnobt and cntbt.
> > > > 
> > > > I see how the cntbt can split because changing the length of a freespace
> > > > extent can require the record to move to a totally different part of the
> > > > cntbt.  The reinsertion causes the split.
> > > > 
> > > > How does the bnobt split during a refresh of the AGFL?  Will we ever
> > > > allocate a block from the /middle/ of a free space extent?
> > > 
> > > That's the case I was worried about for the bnobt. I see two ways that
> > > can happen:
> > > 
> > > 1. alignment, prod, or mod requirements, which the freelist doesn't
> > >    have.
> > > 2. Busy extents. I don't know enough about XFS to say whether or not
> > >    this is applicable, but at first glance I don't see why not.
> > 
> > Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> > the beginning of that extent is still busy due to slow discard, so
> > xfs_alloc_compute_aligned gives us a block from the middle of the free
> > space.  Since AGFL filling has no particular alignment/prod/mod, any
> > number of blocks are good enough, so we end up taking the blocks from
> > the middle of the extent.
> 
> *nod*
> 
> That's exactly the conclusion I came to when I wondered how it was
> possible...
> 
> > > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > > handle a single level leaf split before path intersection occurs.
> > > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > > away.
> > > > > > 
> > > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > > > 
> > > > > > Worst split after a full split is up to the level below the old root
> > > > > > block. the root block was split, so it won't need a split again, so
> > > > > > only it's children could split again. OK, so that's (height - 1)
> > > > > > needed for a max split to refill the AGFL after a full height split
> > > > > > occurred.
> > > > > > 
> > > > > > Hence it looks like the minimum AGFL space for any given
> > > > > > allocation/free operation needs to be:
> > > > > > 
> > > > > > 	(full height split reservation) + (AGFL refill split height)
> > > > > > 
> > > > > > which is:
> > > > > > 
> > > > > > 	= (new height) + (new height - 2)
> > > > > > 	= 2 * new height - 2
> > > > > > 	= 2 * (current height + 1) - 2
> > > > > > 	= 2 * current height
> > > > > > 
> > > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > > transaction that has exhausted it's AGFL reservation.
> > > > > > 
> > > > > > Now, did I get that right?
> > > > > 
> > > > > That sounds right, but I'll have to think about it more after some sleep
> > > > > :)
> > > > 
> > > > I think that's correct, assuming that (2 * current_height) is the
> > > > per-btree calcluation.
> > > 
> > > Agreed, except when the tree is already the maximum level. In that case,
> > > the worst case is splitting up to but not including the root node twice,
> > > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > > height with current height + 1). So I think we want:
> > > 
> > > min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > > 			       mp->m_alloc_maxlevels) * 2 - 2;
> > > 
> > > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > > rather than such a drastic change to the AGFL size.
> > > 
> > > Come to think of it, if there is nothing <= the desired size but there
> > > is something >, we have no choice but to do the split, so that doesn't
> > > work.
> > 
> > Yeah, I was kind of wondering that myself.
> 
> Ok, I'm glad to see there are enough eyes on this to point out the
> things I missed when musing about it as a possible solution...
> 
> > > > The absolute maximum height of a free space btree is 7 blocks, according
> > > > to xfs_db:
> > > > 
> > > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > > bnobt: 7
> > > > cntbt: 7
> > > > inobt: 6
> > > > finobt: 6
> > > > bmapbt: 14
> > > > refcountbt: 6
> > > > rmapbt: 10
> > > > 
> > > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > > > that minimum jumps to 121 blocks.
> 
> Yup. keep in mind this comment in xfs_alloc_fix_freelist():
> 
> 	 * XXX (dgc): When we have lots of free space, does this buy us
>          * anything other than extra overhead when we need to put more blocks
>          * back on the free list? Maybe we should only do this when space is
>          * getting low or the AGFL is more than half full?
> 
> IOWs, I've previously mused about keeping the AGFL much fuller than
> the absolute minimum. There doesn't seem to be any real gain to
> keeping it at just the minimum size when we are nowhere near ENOSPC,
> it just means we are doing lots of management operations like
> reducing it immediately after a transaction that has released blocks
> to the freelist, then only to have to extend it again when we do an
> operation that consumes blocks from the free list.
> 
> i.e. should we add also hysteresis to limit the number of AGFL
> size manipulations we need to do in mixed workloads? i.e. we don't free
> blocks from the AGFL until it might get overfilled by an operation
> (i.e. max size - max merge free blocks) or we are nearing ENOSPC,
> and we don't fill till we are at the minimum. In either case, the
> target for empty or fill operations is "half full"...

Hmm.   What if we kept the AGFL completely full?

On that 512b-fsblock V4 filesystem, that'd use up 128 blocks instead of 7
blocks.  128 blocks == 64K per AG.

On a 64K-fsblock V5 filesystem, that'd use up 16370 64K blocks per AG, or
~1GB per AG.  Ok, that might be too much.

On a 4K-fsblock V5 fs, that's 1010 4K blocks per AG, or ~4MB per AG.

A gigabyte is a bit ridiculous, but what if we always did the maximum
bnobt/cntbt/rmapbt height instead of current_height + 1?

For that 4K-fsblock filesystem, that's 24 blocks per AG, or 96K.  Given
that we only support 300M+ filesystems now, that's 75M per AG.  I feel
like we could go crazy and try to keep 640K in the AGFL.

640K ought to be enough for anyone.

(Numbers chosen arbitrarily, of course.)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2023-10-26  3:21 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-24 23:37 [PATCH] xfs: fix internal error from AGFL exhaustion Omar Sandoval
2023-10-24 23:37 ` [PATCH fstests] xfs: test refilling AGFL after lots of btree splits Omar Sandoval
2023-10-25 15:27   ` Darrick J. Wong
2023-10-25 20:15     ` Omar Sandoval
2023-10-25 22:05       ` Darrick J. Wong
2023-10-25  4:35 ` [PATCH] xfs: fix internal error from AGFL exhaustion Dave Chinner
2023-10-25  5:14   ` Omar Sandoval
2023-10-25 15:50     ` Darrick J. Wong
2023-10-25 20:03       ` Omar Sandoval
2023-10-25 22:39         ` Darrick J. Wong
2023-10-25 23:37           ` Dave Chinner
2023-10-26  3:21             ` Darrick J. Wong [this message]
2023-10-26 20:56               ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231026032102.GI3195650@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=david@fromorbit.com \
    --cc=kernel-team@fb.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=osandov@osandov.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.