Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
Date: Wed, 16 Mar 2022 13:29:01 -0400	[thread overview]
Message-ID: <YjIeXX6XeX36bmXx@bfoster> (raw)
In-Reply-To: <20220316163216.GU8224@magnolia>

On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > xfs_reserve_blocks controls the size of the user-visible free space
> > > reserve pool.  Given the difference between the current and requested
> > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > the amount requested from fdblocks is also constrained by the amount of
> > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > 
> > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > out the "free space" used by the free space btrees, because some portion
> > > of the free space btrees hold in reserve space for future btree
> > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > 
> > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > loop so that it will not retry infinitely.
> > > 
> > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > with an fstests configuration that specifies a 1k blocksize and a
> > > specially crafted log size that will consume 7/8 of the space (17920
> > > blocks, specifically) in that AG.
> > > 
> > > Cc: Brian Foster <bfoster@redhat.com>
> > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 33e26690a8c4..78b6982ea5b0 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > >  	int64_t			fdblks_delta = 0;
> > >  	uint64_t		request;
> > >  	int64_t			free;
> > > +	unsigned int		tries;
> > >  	int			error = 0;
> > >  
> > >  	/* If inval is null, report current values and return */
> > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > >  	 * perform a partial reservation if the request exceeds free space.
> > >  	 */
> > >  	error = -ENOSPC;
> > > -	do {
> > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > -						mp->m_alloc_set_aside;
> > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > 
> > Any reason for the magic number of retries as opposed to perhaps just
> > not retrying at all?
> 
> I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> block reservation mechanism."), where I guess Dave decided that we
> should loop forever trying to satisfy a request from userspace to
> increase the reserve pool.  OFC you and I have been patching this
> function to fix all its horrible warts over the years, so maybe you're
> right that this should only try once...
> 
> (For the mount time default reservation, we should only iterate the loop
> once (provided the accounting is correct ;) since nobody else is
> touching the free space counters.)
> 
> > This seems a little odd when you think about it
> > given that the request is already intended to take available space into
> > account and modify the request from userspace. OTOH, another
> > consideration could be to retry some (really large?) number of times and
> > then bail out if we happen to iterate without an observable change in
> > free space (i.e., something is wrong), however I suppose that could be
> > racy as well. *shrug*
> 
> ...but if you're the sysadmin desperately trying to increase the size of
> the reserve pool when the fs is running near ENOSPC, you're going to be
> racing with fdblocks bouncing up and down.  The @free samples that we
> take here in the loop body are indeed racy since we can't tell the
> difference between @free being unchanged from the last iteration because
> someone freed a block and someone else immediately consumed it, or a
> totally idle system.
> 
> Either way, it's better than hanging the whole system. :)
> 

Yeah.. I'm not bothered much by whether we retry once, 42 times or
forever. I think what this boils down to for me is whether it's worth
the risk of a behavior change of an -ENOSPC return causing something
unexpected for some random user or use case. Could we just do this in
two separate patches? Patch 1 fixes the calculation and targets stable,
patch 2 does whatever to the retry loop that potentially changes retry
semantics (and doesn't really need backporting)..?

> What if I augment the loop control with a comment capturing some of this:
> 
> 	/*
> 	 * The loop body estimates how many blocks it can request from
> 	 * fdblocks to stash in the reserve pool.  This is a classic
> 	 * TOCTOU race since fdblocks updates are not always coordinated
> 	 * via m_sb_lock.  We also cannot tell if @free remaining
> 	 * unchanged between iterations is due to an idle system or
> 	 * freed blocks being consumed immediately, so we'll try a
> 	 * finite number of times to satisfy the request.
> 	 */
> 	for (tries = 0; tries < 30...) {
> 
> > 
> > > +		/*
> > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > +		 * will not give us.  This includes the per-AG set-aside space
> > > +		 * and free space btree blocks that are not available for
> > > +		 * allocation due to per-AG metadata reservations.
> > > +		 */
> > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > +		free -= mp->m_alloc_set_aside;
> > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > 
> > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > Not sure how much it matters for space reporting purposes, but if so, it
> > might also be worth reconsidering the usefulness of a static field and
> > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> 
> When I was writing this patch, I very nearly decided to make those three
> lines above their own helper.  I didn't see any other spots that looked
> like obvious candidates for such a calculation outside of statfs.
> 

Indeed..

> Subtracting m_allocbt_blks from statfs' avail field is a behavior
> change, since we always used to consider bnobt blocks as available.  We
> don't have an exact count of how many blocks are needed to hide the
> per-ag reserved extents, so in the end we have to decide whether we want
> to hear complaints about over- or under-estimation of available blocks.
> 
> So I think the statfs stuff is a separate patch. :)
> 

Similar deal as above.. I'm more interested in a potential cleanup of
the code that helps prevent this sort of buglet for the next user of
->m_alloc_set_aside that will (expectedly) have no idea about this
subtle quirk than I am about what's presented in the free space
counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
or something that just does the (agcount << 3) thing, and then define a
new xfs_alloc_set_aside() that combines the macro calculation with
->m_allocbt_blks. Then the whole "set aside" concept is calculated and
documented in one place. Hm?

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  		if (free <= 0)
> > >  			break;
> > >  
> > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > >  		spin_unlock(&mp->m_sb_lock);
> > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > >  		spin_lock(&mp->m_sb_lock);
> > > -	} while (error == -ENOSPC);
> > > +	}
> > >  
> > >  	/*
> > >  	 * Update the reserve counters if blocks have been successfully
> > > 
> > 
>

next prev parent reply	other threads:[~2022-03-16 17:29 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-14 18:08 [PATCH] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
2022-03-16 11:28 ` Brian Foster
2022-03-16 16:32   ` Darrick J. Wong
2022-03-16 17:29     ` Brian Foster [this message]
2022-03-16 18:17       ` Darrick J. Wong
2022-03-16 18:48         ` Brian Foster
2022-03-16 19:17           ` Brian Foster
2022-03-16 21:15             ` Darrick J. Wong
2022-03-17  2:19               ` Dave Chinner
2022-03-17 12:53               ` Brian Foster
2022-03-17  2:05         ` Dave Chinner
2022-03-17 12:56           ` Brian Foster
2022-03-17 15:46             ` Darrick J. Wong
2022-03-17 17:05           ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YjIeXX6XeX36bmXx@bfoster \
    --to=bfoster@redhat.com \
    --cc=djwong@kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.