From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
Date: Tue, 1 Mar 2022 09:55:56 +1100 [thread overview]
Message-ID: <20220228225556.GS59715@dread.disaster.area> (raw)
In-Reply-To: <Yh1CjRONamUG0k1C@bfoster>
On Mon, Feb 28, 2022 at 04:45:49PM -0500, Brian Foster wrote:
> On Wed, Feb 23, 2022 at 06:00:58PM +1100, Dave Chinner wrote:
> > i.e. as long as we track whether we've allocated a new inode chunk
> > or not, we can bound the finobt search to a single retry. If we
> > allocated a new chunk before entering the finobt search, then all we
> > need is a log force because the busy inodes, by definition, are
> > XFS_ISTALE at this point and waiting for a CIL push before they can
> > be reclaimed. At this point an retry of the finobt scan will find
> > those inodes that were busy now available for allocation.
>
> Remind me what aspect of the prospective VFS changes prevents inode
> allocation from seeing a free inode with not-yet-elapsed RCU grace
> period..? Will this delay freeing (or evicting) the inode until a grace
> period elapses from when the last reference was dropped?
It will delay it until the inode has been reclaimed, in which case
reallocating it will result in a new struct xfs_inode being
allocated from slab, and we're all good. Any lookup that finds the
old inode will see either I_WILL_FREE, I_FREEING, I_CLEAR,
XFS_NEED_INACTIVE, XFS_IRECLAIM or ip->i_ino == 0 depending on what
point the lookup occurs. Hence the lookup will be unable to take a
new reference to the unlinked inode, and the lookup retry should
then get the newly allocated xfs_inode once it has been inserted
into the radix tree...
IOWs, it gets rid of recycling altogether, and that's how we avoid
the RCU grace period issues with recycled inodes.
> > > Perhaps a simple enough short term option is to use the existing block
> > > alloc min/max range mechanisms (as mentioned on IRC). For example:
> > >
> > > - Use the existing min/max_agbno allocation arg input values to attempt
> > > one or two chunk allocs outside of the known range of busy inodes for
> > > the AG (i.e., allocate blocks higher than the max busy agino or lower
> > > than the min busy agino).
> > > - If success, then we know we've got a chunk w/o busy inodes.
> > > - If failure, fall back to the existing chunk alloc calls, take whatever
> > > we get and retry the finobt scan (perhaps more aggressively checking
> > > each record) hoping we got a usable new record.
> > > - If that still fails, then fall back to synchronize_rcu() as a last
> > > resort and grab one of the previously busy inodes.
> > >
> > > I couldn't say for sure if that would be effective enough without
> > > playing with it a bit, but that would sort of emulate an algorithm that
> > > filtered chunk block allocations with at least one busy inode without
> > > having to modify block allocation code. If it avoids an RCU sync in the
> > > majority of cases it might be effective enough as a stopgap until
> > > background freeing exists. Thoughts?
> >
> > It might work, but I'm not a fan of how many hoops we are considering
> > jumping through to avoid getting tangled up in the RCU requirements
> > for VFS inode life cycles. I'd much prefer just being able to say
> > "all inodes busy, log force, try again" like we do with busy extent
> > limited block allocation...
> >
>
> Well presumably that works fine for your implementation of busy inodes,
> but not so much for the variant intended to work around the RCU inode
> reuse problem. ISTM this all just depends on the goals and plan here,
> and I'm not seeing clear enough reasoning to grok what that is. A
> summary of the options discussed so far:
>
> - deferred inode freeing - ideal but too involved, want something sooner
> and more simple
> - hinted non-busy chunk allocation - simple but jumping through too many
> RCU requirement hoops
> - sync RCU and retry - most analogous to longer term busy inode
> allocation handling (i.e. just replace rcu sync w/ log force and
> retry), but RCU sync is too performance costly as a direct fallback
>
> ISTM the options here range from simple and slow, to moderately simple
> and effective (TBD), to complex and ideal. So what is the goal for this
> series?
To find out if we can do something fast and effective (and without
hacks that might bite us unexepectedly) with RCU grace periods that
is less intrusive than changing inode life cycles. If it turns out
that we can change the inode life cycle faster than we can come up
with a RCU grace period based solution, then we should just go with
inode life cycle changes and forget about the interim RCU grace
period detection changes...
> My understanding to this point is that VFS changes are a ways
> out, so the first step was busy inodes == inodes with pending grace
> periods and a rework of the busy inode definition comes later with the
> associated VFS changes. That essentially means this series gets fixed up
> and posted as a mergeable component in the meantime, albeit with a
> "general direction" that is compatible with longer term changes.
It seems to me that the RCU detection code is also a ways out,
so I'm trying to keep our options open and not have us duplicate
work unnecessarily.
> However, every RCU requirement/characteristic that this series has to
> deal with is never going to be 100% perfectly aligned with the longer
> term busy inode requirements because the implementations/definitions
> will differ, particularly if the RCU handling is pushed up into the VFS.
> So if we're concerned about that, the alternative approach is to skip
> incrementally addressing the RCU inode reuse problem and just fold
> whatever bits of useful logic from here into your inode lifecycle work
> as needed and drop this as a mergeable component.
*nod*
The signs are pointing somewhat that way, but I'm still finding
the occasional unexpected gotcha in the code that I have to work
around. I think I've got them all, but until I've got it working it
pays to keep our options open....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2022-02-28 22:56 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-17 17:25 [PATCH RFC 0/4] xfs: track and skip realloc of busy inodes Brian Foster
2022-02-17 17:25 ` [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle Brian Foster
2022-02-17 17:25 ` [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy Brian Foster
2022-02-17 23:16 ` Dave Chinner
2022-02-18 14:19 ` Brian Foster
2022-02-17 17:25 ` [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism Brian Foster
2022-02-17 23:20 ` Dave Chinner
2022-02-18 14:21 ` Brian Foster
2022-02-18 22:54 ` Dave Chinner
2022-02-20 18:48 ` Brian Foster
2022-02-23 7:00 ` Dave Chinner
2022-02-28 21:45 ` Brian Foster
2022-02-28 22:55 ` Dave Chinner [this message]
2022-03-01 15:05 ` Brian Foster
2022-02-17 17:25 ` [PATCH RFC 4/4] xfs: skip busy inodes on finobt inode allocation Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220228225556.GS59715@dread.disaster.area \
--to=david@fromorbit.com \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.