* [PATCH] xfs: require an rcu grace period before inode recycle
From: Brian Foster @ 2022-01-21 14:24 UTC
To: linux-xfs; +Cc: Dave Chinner, Al Viro, Ian Kent, rcu
The XFS inode allocation algorithm aggressively reuses recently
freed inodes. This is historical behavior that has been in place
since XFS was imported to mainline Linux. Once the VFS adopted
RCU-walk path lookups (also some time ago), this behavior became
subtly incompatible because the inode recycle path doesn't isolate
the inode from concurrent VFS access.
This has recently manifested as problems in the VFS when XFS happens
to change the type or properties of a recently unlinked inode while
it is still involved in an RCU lookup. For example, if the VFS refers
to a previous incarnation of a symlink inode and obtains the
->get_link() callback from its inode_operations, and that inode then
changes to a non-symlink type via a recycle event, the ->get_link()
callback pointer is reset to NULL and the lookup results in a crash.
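A simplified picture of the race (exact VFS call sites elided):

  rcu-walk lookup                        XFS
  ---------------                        ---
  rcu_read_lock()
  finds symlink inode
                                         inode unlinked, evicted, freed
                                         same struct inode recycled for
                                           a new, non-symlink allocation;
                                           i_op and i_mode change
  loads inode->i_op->get_link (now NULL)
  calls it -> crash
  rcu_read_unlock()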
To avoid this class of problem, isolate in-core inodes for recycling
with an RCU grace period. This is the same level of protection the
VFS expects for inactivated inodes that are never reused, and so
guarantees no further concurrent access before the type or
properties of the inode change. We don't want an unconditional
synchronize_rcu() event here because that would result in a
significant performance impact on mixed inode allocation workloads.
Fortunately, we can take advantage of the recently added deferred
inactivation mechanism to mitigate the need for an RCU wait in most
cases. Deferred inactivation queues and batches the on-disk freeing
of recently destroyed inodes, and so significantly increases the
likelihood that a grace period has elapsed by the time an inode is
freed and observable by the allocation code as a reuse candidate.
Capture the current RCU grace period cookie at inode destroy time
and refer to it at allocation time to conditionally wait for an RCU
grace period if one hasn't already elapsed in the meantime. Since only
unlinked inodes are recycle candidates and unlinked inodes always
require inactivation, we only need to poll and assign RCU state in
the inactivation codepath. Slightly adjust struct xfs_inode to fit
the new field into padding holes that conveniently preexist in the
same cacheline as the deferred inactivation list.
Finally, note that the ideal long term solution here is to
rearchitect bits of XFS' internal inode lifecycle management such
that this additional stall point is not required, but this requires
more thought, time and work to address. This approach restores
functional correctness in the meantime.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
Hi all,
Here's the RCU fixup patch for inode reuse that I've been playing with,
re: the vfs patch discussion [1]. I've put it in pretty much the most
basic form, but I think there are a couple aspects worth thinking about:
1. Use and frequency of start_poll_synchronize_rcu() (vs.
get_state_synchronize_rcu()). The former is a bit more active than the
latter in that it triggers the start of a grace period, when necessary.
This is currently invoked per inode, which is the ideal frequency in
theory, but could be reduced, associated with the xfs_inodegc
thresholds in some manner, etc., if there is good reason to do that
(the first sketch below illustrates one way to do this).
2. The rcu cookie lifecycle. This variant updates the cookie when the
inode is queued for inactivation and nowhere else, because the RCU
docs imply that counter rollover is not a significant problem. In
practice, I think this means that if an
inode is stamped at least once, and the counter rolls over, future
(non-inactivation, non-unlinked) eviction -> repopulation cycles could
trigger rcu syncs. I think this would require repeated
eviction/reinstantiation cycles within a small window to be noticeable,
so I'm not sure how likely this is to occur. We could be more defensive
by resetting or refreshing the cookie. E.g., refresh (or reset to zero)
at recycle time, unconditionally refresh at destroy time (using
get_state_synchronize_rcu() for non-inactivation), etc. The second
sketch below shows roughly how that could look.
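As a strawman for the frequency question in (1), a reduced-frequency
variant could look roughly like the sketch below. This is untested and
not part of the patch; XFS_DESTROY_GP_BATCH, the per-cpu counter and
the helper name are all invented for illustration:

/*
 * Hypothetical sketch: start a grace period for every Nth queued inode
 * and otherwise just sample existing grace period state. Needs
 * <linux/percpu.h> and <linux/rcupdate.h>.
 */
#define XFS_DESTROY_GP_BATCH	32

static DEFINE_PER_CPU(unsigned int, xfs_destroy_gp_count);

static unsigned long
xfs_destroy_gp_cookie(void)
{
	if (this_cpu_inc_return(xfs_destroy_gp_count) %
	    XFS_DESTROY_GP_BATCH == 0)
		return start_poll_synchronize_rcu();
	/*
	 * A cookie sampled here only completes once something else
	 * drives a grace period, which is the trade-off in question.
	 */
	return get_state_synchronize_rcu();
}

xfs_inodegc_queue() would then assign ip->i_destroy_gp from this helper
rather than calling start_poll_synchronize_rcu() directly.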
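And for the more defensive cookie handling in (2), the shape might be
something like the fragments below (also just a sketch, not what this
patch does; it assumes zero never collides with a real cookie value
and so can serve as "unset"):

/*
 * Hypothetical sketch: refresh the cookie on every eviction and clear
 * it once consumed at recycle time, so a stale value can't survive a
 * grace period counter rollover. need_inactive stands in for the
 * existing XFS_NEED_INACTIVE decision.
 */

/* at destroy/eviction time */
if (need_inactive)
	ip->i_destroy_gp = start_poll_synchronize_rcu();
else
	ip->i_destroy_gp = get_state_synchronize_rcu();

/* at recycle time, in xfs_iget_recycle() */
if (ip->i_destroy_gp) {
	cond_synchronize_rcu(ip->i_destroy_gp);
	ip->i_destroy_gp = 0;
}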
Otherwise testing is ongoing, but this version at least survives an
fstests regression run.
Brian
[1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
fs/xfs/xfs_icache.c | 11 +++++++++++
fs/xfs/xfs_inode.h | 3 ++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d019c98eb839..4931daa45ca4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -349,6 +349,16 @@ xfs_iget_recycle(
spin_unlock(&ip->i_flags_lock);
rcu_read_unlock();
+ /*
+ * VFS RCU pathwalk lookups dictate the same lifecycle rules for an
+ * inode recycle as for freeing an inode. I.e., we cannot repurpose the
+ * inode until a grace period has elapsed from the time the previous
+ * version of the inode was destroyed. In most cases a grace period has
+ * already elapsed if the inode was (deferred) inactivated, but
+ * synchronize here as a last resort to guarantee correctness.
+ */
+ cond_synchronize_rcu(ip->i_destroy_gp);
+
ASSERT(!rwsem_is_locked(&inode->i_rwsem));
error = xfs_reinit_inode(mp, inode);
if (error) {
@@ -2019,6 +2029,7 @@ xfs_inodegc_queue(
trace_xfs_inode_set_need_inactive(ip);
spin_lock(&ip->i_flags_lock);
ip->i_flags |= XFS_NEED_INACTIVE;
+ ip->i_destroy_gp = start_poll_synchronize_rcu();
spin_unlock(&ip->i_flags_lock);
gc = get_cpu_ptr(mp->m_inodegc);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c447bf04205a..2153e3edbb86 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -40,8 +40,9 @@ typedef struct xfs_inode {
/* Transaction and locking information. */
struct xfs_inode_log_item *i_itemp; /* logging information */
mrlock_t i_lock; /* inode lock */
- atomic_t i_pincount; /* inode pin count */
struct llist_node i_gclist; /* deferred inactivation list */
+ unsigned long i_destroy_gp; /* destroy rcugp cookie */
+ atomic_t i_pincount; /* inode pin count */
/*
* Bitsets of inode metadata that have been checked and/or are sick.
--
2.31.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster @ 2022-01-21 17:26 ` Darrick J. Wong 2022-01-21 18:33 ` Brian Foster 2022-01-23 22:43 ` Dave Chinner 2022-01-24 15:02 ` Brian Foster 2 siblings, 1 reply; 36+ messages in thread From: Darrick J. Wong @ 2022-01-21 17:26 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > The XFS inode allocation algorithm aggressively reuses recently > freed inodes. This is historical behavior that has been in place for > quite some time, since XFS was imported to mainline Linux. Once the > VFS adopted RCUwalk path lookups (also some time ago), this behavior > became slightly incompatible because the inode recycle path doesn't > isolate concurrent access to the inode from the VFS. > > This has recently manifested as problems in the VFS when XFS happens > to change the type or properties of a recently unlinked inode while > still involved in an RCU lookup. For example, if the VFS refers to a > previous incarnation of a symlink inode, obtains the ->get_link() > callback from inode_operations, and the latter happens to change to > a non-symlink type via a recycle event, the ->get_link() callback > pointer is reset to NULL and the lookup results in a crash. Hmm, so I guess what you're saying is that if the memory buffer allocation in ->get_link is slow enough, some other thread can free the inode, drop it, reallocate it, and reinstantiate it (not as a symlink this time) all before ->get_link's memory allocation call returns, after which Bad Things Happen(tm)? Can the lookup thread end up with the wrong inode->i_ops too? > To avoid this class of problem, isolate in-core inodes for recycling > with an RCU grace period. This is the same level of protection the > VFS expects for inactivated inodes that are never reused, and so > guarantees no further concurrent access before the type or > properties of the inode change. We don't want an unconditional > synchronize_rcu() event here because that would result in a > significant performance impact to mixed inode allocation workloads. > > Fortunately, we can take advantage of the recently added deferred > inactivation mechanism to mitigate the need for an RCU wait in most > cases. Deferred inactivation queues and batches the on-disk freeing > of recently destroyed inodes, and so significantly increases the > likelihood that a grace period has elapsed by the time an inode is > freed and observable by the allocation code as a reuse candidate. > Capture the current RCU grace period cookie at inode destroy time > and refer to it at allocation time to conditionally wait for an RCU > grace period if one hadn't expired in the meantime. Since only > unlinked inodes are recycle candidates and unlinked inodes always > require inactivation, Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise idle) but I think your point here is that unlinked inodes that become recycling candidates can cause lookup threads to trip over symlinks, and that's why we need to assign RCU state and poll on it, right? (That wasn't a challenge, I'm just making sure I understand this correctly.) > we only need to poll and assign RCU state in > the inactivation codepath. 
Slightly adjust struct xfs_inode to fit > the new field into padding holes that conveniently preexist in the > same cacheline as the deferred inactivation list. > > Finally, note that the ideal long term solution here is to > rearchitect bits of XFS' internal inode lifecycle management such > that this additional stall point is not required, but this requires > more thought, time and work to address. This approach restores > functional correctness in the meantime. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > > Hi all, > > Here's the RCU fixup patch for inode reuse that I've been playing with, > re: the vfs patch discussion [1]. I've put it in pretty much the most > basic form, but I think there are a couple aspects worth thinking about: > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > get_state_synchronize_rcu()). The former is a bit more active than the > latter in that it triggers the start of a grace period, when necessary. > This currently invokes per inode, which is the ideal frequency in > theory, but could be reduced, associated with the xfs_inogegc thresholds > in some manner, etc., if there is good reason to do that. If you rm -rf $path, do each of the inodes get a separate rcu state, or do they share? > 2. The rcu cookie lifecycle. This variant updates it on inactivation > queue and nowhere else because the RCU docs imply that counter rollover > is not a significant problem. In practice, I think this means that if an > inode is stamped at least once, and the counter rolls over, future > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > trigger rcu syncs. I think this would require repeated > eviction/reinstantiation cycles within a small window to be noticeable, > so I'm not sure how likely this is to occur. We could be more defensive > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > at recycle time, unconditionally refresh at destroy time (using > get_state_synchronize_rcu() for non-inactivation), etc. > > Otherwise testing is ongoing, but this version at least survives an > fstests regression run. > > Brian > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > fs/xfs/xfs_icache.c | 11 +++++++++++ > fs/xfs/xfs_inode.h | 3 ++- > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index d019c98eb839..4931daa45ca4 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -349,6 +349,16 @@ xfs_iget_recycle( > spin_unlock(&ip->i_flags_lock); > rcu_read_unlock(); > > + /* > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > + * inode until a grace period has elapsed from the time the previous > + * version of the inode was destroyed. In most cases a grace period has > + * already elapsed if the inode was (deferred) inactivated, but > + * synchronize here as a last resort to guarantee correctness. > + */ > + cond_synchronize_rcu(ip->i_destroy_gp); > + > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > error = xfs_reinit_inode(mp, inode); > if (error) { > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > trace_xfs_inode_set_need_inactive(ip); > spin_lock(&ip->i_flags_lock); > ip->i_flags |= XFS_NEED_INACTIVE; > + ip->i_destroy_gp = start_poll_synchronize_rcu(); Hmm. 
The description says that we only need the rcu synchronization when we're freeing an inode after its link count drops to zero, because that's the vector for (say) the VFS inode ops actually changing due to free/inactivate/reallocate/recycle while someone else is doing a lookup. I'm a bit puzzled why this unconditionally starts an rcu grace period, instead of done only if i_nlink==0; and why we call cond_synchronize_rcu above unconditionally instead of checking for i_mode==0 (or whatever state the cached inode is left in after it's freed)? --D > spin_unlock(&ip->i_flags_lock); > > gc = get_cpu_ptr(mp->m_inodegc); > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index c447bf04205a..2153e3edbb86 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > /* Transaction and locking information. */ > struct xfs_inode_log_item *i_itemp; /* logging information */ > mrlock_t i_lock; /* inode lock */ > - atomic_t i_pincount; /* inode pin count */ > struct llist_node i_gclist; /* deferred inactivation list */ > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > + atomic_t i_pincount; /* inode pin count */ > > /* > * Bitsets of inode metadata that have been checked and/or are sick. > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 17:26 ` Darrick J. Wong @ 2022-01-21 18:33 ` Brian Foster 2022-01-22 5:30 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-21 18:33 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > The XFS inode allocation algorithm aggressively reuses recently > > freed inodes. This is historical behavior that has been in place for > > quite some time, since XFS was imported to mainline Linux. Once the > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > became slightly incompatible because the inode recycle path doesn't > > isolate concurrent access to the inode from the VFS. > > > > This has recently manifested as problems in the VFS when XFS happens > > to change the type or properties of a recently unlinked inode while > > still involved in an RCU lookup. For example, if the VFS refers to a > > previous incarnation of a symlink inode, obtains the ->get_link() > > callback from inode_operations, and the latter happens to change to > > a non-symlink type via a recycle event, the ->get_link() callback > > pointer is reset to NULL and the lookup results in a crash. > > Hmm, so I guess what you're saying is that if the memory buffer > allocation in ->get_link is slow enough, some other thread can free the > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > this time) all before ->get_link's memory allocation call returns, after > which Bad Things Happen(tm)? > > Can the lookup thread end up with the wrong inode->i_ops too? > We really don't need to even get into the XFS symlink code to reason about the fundamental form of this issue. Consider that an RCU walk starts, locates a symlink inode, meanwhile XFS recycles that inode into something completely different, then the VFS loads and calls ->get_link() (which is now NULL) on said inode and explodes. So the presumption is that the VFS uses RCU protection to rely on some form of stability of the inode (i.e., that the inode memory isn't freed, callback vectors don't change, etc.). Validity of the symlink content is a variant of that class of problem, likely already addressed by the recent inline symlink change, but that doesn't address the broader issue. > > To avoid this class of problem, isolate in-core inodes for recycling > > with an RCU grace period. This is the same level of protection the > > VFS expects for inactivated inodes that are never reused, and so > > guarantees no further concurrent access before the type or > > properties of the inode change. We don't want an unconditional > > synchronize_rcu() event here because that would result in a > > significant performance impact to mixed inode allocation workloads. > > > > Fortunately, we can take advantage of the recently added deferred > > inactivation mechanism to mitigate the need for an RCU wait in most > > cases. Deferred inactivation queues and batches the on-disk freeing > > of recently destroyed inodes, and so significantly increases the > > likelihood that a grace period has elapsed by the time an inode is > > freed and observable by the allocation code as a reuse candidate. 
> > Capture the current RCU grace period cookie at inode destroy time > > and refer to it at allocation time to conditionally wait for an RCU > > grace period if one hadn't expired in the meantime. Since only > > unlinked inodes are recycle candidates and unlinked inodes always > > require inactivation, > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > idle) but I think your point here is that unlinked inodes that become > recycling candidates can cause lookup threads to trip over symlinks, and > that's why we need to assign RCU state and poll on it, right? > Good point. When I wrote the commit log I was thinking of recycled inodes as "reincarnated" inodes, so that wording could probably be improved. But yes, the code is written minimally/simply so I was trying to document that it's unlinked -> freed -> reallocated inodes that we really care about here. WRT to symlinks, I was trying to use that as an example and not necessarily as the general reason for the patch. I.e., the general reason is that the VFS uses rcu protection for inode stability (just as for the inode free path), and the symlink thing is just an example of how things can go wrong in the current implementation without it. > (That wasn't a challenge, I'm just making sure I understand this > correctly.) > > > we only need to poll and assign RCU state in > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > the new field into padding holes that conveniently preexist in the > > same cacheline as the deferred inactivation list. > > > > Finally, note that the ideal long term solution here is to > > rearchitect bits of XFS' internal inode lifecycle management such > > that this additional stall point is not required, but this requires > > more thought, time and work to address. This approach restores > > functional correctness in the meantime. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > > > Hi all, > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > basic form, but I think there are a couple aspects worth thinking about: > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > get_state_synchronize_rcu()). The former is a bit more active than the > > latter in that it triggers the start of a grace period, when necessary. > > This currently invokes per inode, which is the ideal frequency in > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > in some manner, etc., if there is good reason to do that. > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > do they share? > My previous experiments on a teardown grace period had me thinking batching would occur, but I don't recall which RCU call I was using at the time so I'd probably have to throw a tracepoint in there to dump some of the grace period values and double check to be sure. (If this is not the case, that might be a good reason to tweak things as discussed above). > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > queue and nowhere else because the RCU docs imply that counter rollover > > is not a significant problem. In practice, I think this means that if an > > inode is stamped at least once, and the counter rolls over, future > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > trigger rcu syncs. 
I think this would require repeated > > eviction/reinstantiation cycles within a small window to be noticeable, > > so I'm not sure how likely this is to occur. We could be more defensive > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > at recycle time, unconditionally refresh at destroy time (using > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > Otherwise testing is ongoing, but this version at least survives an > > fstests regression run. > > > > Brian > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > fs/xfs/xfs_inode.h | 3 ++- > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index d019c98eb839..4931daa45ca4 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > spin_unlock(&ip->i_flags_lock); > > rcu_read_unlock(); > > > > + /* > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > + * inode until a grace period has elapsed from the time the previous > > + * version of the inode was destroyed. In most cases a grace period has > > + * already elapsed if the inode was (deferred) inactivated, but > > + * synchronize here as a last resort to guarantee correctness. > > + */ > > + cond_synchronize_rcu(ip->i_destroy_gp); > > + > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > error = xfs_reinit_inode(mp, inode); > > if (error) { > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > trace_xfs_inode_set_need_inactive(ip); > > spin_lock(&ip->i_flags_lock); > > ip->i_flags |= XFS_NEED_INACTIVE; > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > Hmm. The description says that we only need the rcu synchronization > when we're freeing an inode after its link count drops to zero, because > that's the vector for (say) the VFS inode ops actually changing due to > free/inactivate/reallocate/recycle while someone else is doing a lookup. > Right.. > I'm a bit puzzled why this unconditionally starts an rcu grace period, > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > above unconditionally instead of checking for i_mode==0 (or whatever > state the cached inode is left in after it's freed)? > Just an attempt to start simple and/or make any performance test/problems more blatant. I probably could have tagged this RFC. My primary goal with this patch was to establish whether the general approach is sane/viable/acceptable or we need to move in another direction. That aside, I think it's reasonable to have explicit logic around the unlinked case if we want to keep it restricted to that, though I would probably implement that as a conditional i_destroy_gp assignment and let the consumer context key off whether that field is set rather than attempt to infer unlinked logic (and then I guess reset it back to zero so it doesn't leak across reincarnation). 
That also probably facilitates a meaningful tracepoint to track the cases that do end up syncing, which helps with your earlier question around batching, so I'll look into those changes once I get through broader testing Brian > --D > > > spin_unlock(&ip->i_flags_lock); > > > > gc = get_cpu_ptr(mp->m_inodegc); > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index c447bf04205a..2153e3edbb86 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > /* Transaction and locking information. */ > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > mrlock_t i_lock; /* inode lock */ > > - atomic_t i_pincount; /* inode pin count */ > > struct llist_node i_gclist; /* deferred inactivation list */ > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > + atomic_t i_pincount; /* inode pin count */ > > > > /* > > * Bitsets of inode metadata that have been checked and/or are sick. > > -- > > 2.31.1 > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 18:33 ` Brian Foster @ 2022-01-22 5:30 ` Paul E. McKenney 2022-01-22 16:55 ` Paul E. McKenney 2022-01-24 15:12 ` Brian Foster 0 siblings, 2 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-01-22 5:30 UTC (permalink / raw) To: Brian Foster Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: > On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > The XFS inode allocation algorithm aggressively reuses recently > > > freed inodes. This is historical behavior that has been in place for > > > quite some time, since XFS was imported to mainline Linux. Once the > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > became slightly incompatible because the inode recycle path doesn't > > > isolate concurrent access to the inode from the VFS. > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > to change the type or properties of a recently unlinked inode while > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > callback from inode_operations, and the latter happens to change to > > > a non-symlink type via a recycle event, the ->get_link() callback > > > pointer is reset to NULL and the lookup results in a crash. > > > > Hmm, so I guess what you're saying is that if the memory buffer > > allocation in ->get_link is slow enough, some other thread can free the > > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > > this time) all before ->get_link's memory allocation call returns, after > > which Bad Things Happen(tm)? > > > > Can the lookup thread end up with the wrong inode->i_ops too? > > > > We really don't need to even get into the XFS symlink code to reason > about the fundamental form of this issue. Consider that an RCU walk > starts, locates a symlink inode, meanwhile XFS recycles that inode into > something completely different, then the VFS loads and calls > ->get_link() (which is now NULL) on said inode and explodes. So the > presumption is that the VFS uses RCU protection to rely on some form of > stability of the inode (i.e., that the inode memory isn't freed, > callback vectors don't change, etc.). > > Validity of the symlink content is a variant of that class of problem, > likely already addressed by the recent inline symlink change, but that > doesn't address the broader issue. > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > with an RCU grace period. This is the same level of protection the > > > VFS expects for inactivated inodes that are never reused, and so > > > guarantees no further concurrent access before the type or > > > properties of the inode change. We don't want an unconditional > > > synchronize_rcu() event here because that would result in a > > > significant performance impact to mixed inode allocation workloads. > > > > > > Fortunately, we can take advantage of the recently added deferred > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > cases. 
Deferred inactivation queues and batches the on-disk freeing > > > of recently destroyed inodes, and so significantly increases the > > > likelihood that a grace period has elapsed by the time an inode is > > > freed and observable by the allocation code as a reuse candidate. > > > Capture the current RCU grace period cookie at inode destroy time > > > and refer to it at allocation time to conditionally wait for an RCU > > > grace period if one hadn't expired in the meantime. Since only > > > unlinked inodes are recycle candidates and unlinked inodes always > > > require inactivation, > > > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > > idle) but I think your point here is that unlinked inodes that become > > recycling candidates can cause lookup threads to trip over symlinks, and > > that's why we need to assign RCU state and poll on it, right? > > > > Good point. When I wrote the commit log I was thinking of recycled > inodes as "reincarnated" inodes, so that wording could probably be > improved. But yes, the code is written minimally/simply so I was trying > to document that it's unlinked -> freed -> reallocated inodes that we > really care about here. > > WRT to symlinks, I was trying to use that as an example and not > necessarily as the general reason for the patch. I.e., the general > reason is that the VFS uses rcu protection for inode stability (just as > for the inode free path), and the symlink thing is just an example of > how things can go wrong in the current implementation without it. > > > (That wasn't a challenge, I'm just making sure I understand this > > correctly.) > > > > > we only need to poll and assign RCU state in > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > the new field into padding holes that conveniently preexist in the > > > same cacheline as the deferred inactivation list. > > > > > > Finally, note that the ideal long term solution here is to > > > rearchitect bits of XFS' internal inode lifecycle management such > > > that this additional stall point is not required, but this requires > > > more thought, time and work to address. This approach restores > > > functional correctness in the meantime. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > > > > Hi all, > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > latter in that it triggers the start of a grace period, when necessary. > > > This currently invokes per inode, which is the ideal frequency in > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > in some manner, etc., if there is good reason to do that. > > > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > > do they share? > > My previous experiments on a teardown grace period had me thinking > batching would occur, but I don't recall which RCU call I was using at > the time so I'd probably have to throw a tracepoint in there to dump > some of the grace period values and double check to be sure. (If this is > not the case, that might be a good reason to tweak things as discussed > above). 
An RCU grace period typically takes some milliseconds to complete, so a great many inodes would end up being tagged for the same grace period. For example, if "rm -rf" could delete one file per microsecond, the first few thousand files would be tagged with one grace period, the next few thousand with the next grace period, and so on. In the unlikely event that RCU was totally idle when the "rm -rf" started, the very first file might get its own grace period, but they would batch in the thousands thereafter. On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if there is always other RCU update activity, get_state_synchronize_rcu() is just fine. So if XFS does a call_rcu() or synchronize_rcu() every so often, all you need here is get_state_synchronize_rcu()(). Another approach is to do a start_poll_synchronize_rcu() every 1,000 events, and use get_state_synchronize_rcu() otherwise. And there are a lot of possible variations on that theme. But why not just try always doing start_poll_synchronize_rcu() and only bother with get_state_synchronize_rcu() if that turns out to be too slow? > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > queue and nowhere else because the RCU docs imply that counter rollover > > > is not a significant problem. In practice, I think this means that if an > > > inode is stamped at least once, and the counter rolls over, future > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > trigger rcu syncs. I think this would require repeated > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > so I'm not sure how likely this is to occur. We could be more defensive > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > at recycle time, unconditionally refresh at destroy time (using > > > get_state_synchronize_rcu() for non-inactivation), etc. Even on a 32-bit system that is running RCU grace periods as fast as they will go, it will take about 12 days to overflow that counter. But if you have an inode sitting on the list for that long, yes, you could see unnecessary synchronous grace-period waits. Would it help if there was an API that gave you a special cookie value that cond_synchronize_rcu() and friends recognized as "already expired"? That way if poll_state_synchronize_rcu() says that original cookie has expired, you could replace that cookie value with one that would stay expired. Maybe a get_expired_synchronize_rcu() or some such? Thanx, Paul > > > Otherwise testing is ongoing, but this version at least survives an > > > fstests regression run. > > > > > > Brian > > > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > > fs/xfs/xfs_inode.h | 3 ++- > > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index d019c98eb839..4931daa45ca4 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > > spin_unlock(&ip->i_flags_lock); > > > rcu_read_unlock(); > > > > > > + /* > > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > > + * inode until a grace period has elapsed from the time the previous > > > + * version of the inode was destroyed. 
In most cases a grace period has > > > + * already elapsed if the inode was (deferred) inactivated, but > > > + * synchronize here as a last resort to guarantee correctness. > > > + */ > > > + cond_synchronize_rcu(ip->i_destroy_gp); > > > + > > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > > error = xfs_reinit_inode(mp, inode); > > > if (error) { > > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > > trace_xfs_inode_set_need_inactive(ip); > > > spin_lock(&ip->i_flags_lock); > > > ip->i_flags |= XFS_NEED_INACTIVE; > > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > > > Hmm. The description says that we only need the rcu synchronization > > when we're freeing an inode after its link count drops to zero, because > > that's the vector for (say) the VFS inode ops actually changing due to > > free/inactivate/reallocate/recycle while someone else is doing a lookup. > > > > Right.. > > > I'm a bit puzzled why this unconditionally starts an rcu grace period, > > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > > above unconditionally instead of checking for i_mode==0 (or whatever > > state the cached inode is left in after it's freed)? > > > > Just an attempt to start simple and/or make any performance > test/problems more blatant. I probably could have tagged this RFC. My > primary goal with this patch was to establish whether the general > approach is sane/viable/acceptable or we need to move in another > direction. > > That aside, I think it's reasonable to have explicit logic around the > unlinked case if we want to keep it restricted to that, though I would > probably implement that as a conditional i_destroy_gp assignment and let > the consumer context key off whether that field is set rather than > attempt to infer unlinked logic (and then I guess reset it back to zero > so it doesn't leak across reincarnation). That also probably facilitates > a meaningful tracepoint to track the cases that do end up syncing, which > helps with your earlier question around batching, so I'll look into > those changes once I get through broader testing > > Brian > > > --D > > > > > spin_unlock(&ip->i_flags_lock); > > > > > > gc = get_cpu_ptr(mp->m_inodegc); > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > index c447bf04205a..2153e3edbb86 100644 > > > --- a/fs/xfs/xfs_inode.h > > > +++ b/fs/xfs/xfs_inode.h > > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > > /* Transaction and locking information. */ > > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > > mrlock_t i_lock; /* inode lock */ > > > - atomic_t i_pincount; /* inode pin count */ > > > struct llist_node i_gclist; /* deferred inactivation list */ > > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > > + atomic_t i_pincount; /* inode pin count */ > > > > > > /* > > > * Bitsets of inode metadata that have been checked and/or are sick. > > > -- > > > 2.31.1 > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-22 5:30 ` Paul E. McKenney @ 2022-01-22 16:55 ` Paul E. McKenney 2022-01-24 15:12 ` Brian Foster 1 sibling, 0 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-01-22 16:55 UTC (permalink / raw) To: Brian Foster Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:30:19PM -0800, Paul E. McKenney wrote: > On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: [ . . . ] > > My previous experiments on a teardown grace period had me thinking > > batching would occur, but I don't recall which RCU call I was using at > > the time so I'd probably have to throw a tracepoint in there to dump > > some of the grace period values and double check to be sure. (If this is > > not the case, that might be a good reason to tweak things as discussed > > above). > > An RCU grace period typically takes some milliseconds to complete, so a > great many inodes would end up being tagged for the same grace period. > For example, if "rm -rf" could delete one file per microsecond, the > first few thousand files would be tagged with one grace period, > the next few thousand with the next grace period, and so on. > > In the unlikely event that RCU was totally idle when the "rm -rf" > started, the very first file might get its own grace period, but > they would batch in the thousands thereafter. > > On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if > there is always other RCU update activity, get_state_synchronize_rcu() > is just fine. So if XFS does a call_rcu() or synchronize_rcu() every > so often, all you need here is get_state_synchronize_rcu()(). > > Another approach is to do a start_poll_synchronize_rcu() every 1,000 > events, and use get_state_synchronize_rcu() otherwise. And there are > a lot of possible variations on that theme. > > But why not just try always doing start_poll_synchronize_rcu() and > only bother with get_state_synchronize_rcu() if that turns out to > be too slow? Plus there are a few optimizations I could apply that would speed up get_state_synchronize_rcu(), for example, reducing lock contention. But I would of course have to see a need before increasing complexity. Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-22 5:30 ` Paul E. McKenney 2022-01-22 16:55 ` Paul E. McKenney @ 2022-01-24 15:12 ` Brian Foster 2022-01-24 16:40 ` Paul E. McKenney 1 sibling, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-24 15:12 UTC (permalink / raw) To: Paul E. McKenney Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:30:19PM -0800, Paul E. McKenney wrote: > On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: > > On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > > The XFS inode allocation algorithm aggressively reuses recently > > > > freed inodes. This is historical behavior that has been in place for > > > > quite some time, since XFS was imported to mainline Linux. Once the > > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > > became slightly incompatible because the inode recycle path doesn't > > > > isolate concurrent access to the inode from the VFS. > > > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > > to change the type or properties of a recently unlinked inode while > > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > > callback from inode_operations, and the latter happens to change to > > > > a non-symlink type via a recycle event, the ->get_link() callback > > > > pointer is reset to NULL and the lookup results in a crash. > > > > > > Hmm, so I guess what you're saying is that if the memory buffer > > > allocation in ->get_link is slow enough, some other thread can free the > > > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > > > this time) all before ->get_link's memory allocation call returns, after > > > which Bad Things Happen(tm)? > > > > > > Can the lookup thread end up with the wrong inode->i_ops too? > > > > > > > We really don't need to even get into the XFS symlink code to reason > > about the fundamental form of this issue. Consider that an RCU walk > > starts, locates a symlink inode, meanwhile XFS recycles that inode into > > something completely different, then the VFS loads and calls > > ->get_link() (which is now NULL) on said inode and explodes. So the > > presumption is that the VFS uses RCU protection to rely on some form of > > stability of the inode (i.e., that the inode memory isn't freed, > > callback vectors don't change, etc.). > > > > Validity of the symlink content is a variant of that class of problem, > > likely already addressed by the recent inline symlink change, but that > > doesn't address the broader issue. > > > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > > with an RCU grace period. This is the same level of protection the > > > > VFS expects for inactivated inodes that are never reused, and so > > > > guarantees no further concurrent access before the type or > > > > properties of the inode change. We don't want an unconditional > > > > synchronize_rcu() event here because that would result in a > > > > significant performance impact to mixed inode allocation workloads. > > > > > > > > Fortunately, we can take advantage of the recently added deferred > > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > > cases. 
Deferred inactivation queues and batches the on-disk freeing > > > > of recently destroyed inodes, and so significantly increases the > > > > likelihood that a grace period has elapsed by the time an inode is > > > > freed and observable by the allocation code as a reuse candidate. > > > > Capture the current RCU grace period cookie at inode destroy time > > > > and refer to it at allocation time to conditionally wait for an RCU > > > > grace period if one hadn't expired in the meantime. Since only > > > > unlinked inodes are recycle candidates and unlinked inodes always > > > > require inactivation, > > > > > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > > > idle) but I think your point here is that unlinked inodes that become > > > recycling candidates can cause lookup threads to trip over symlinks, and > > > that's why we need to assign RCU state and poll on it, right? > > > > > > > Good point. When I wrote the commit log I was thinking of recycled > > inodes as "reincarnated" inodes, so that wording could probably be > > improved. But yes, the code is written minimally/simply so I was trying > > to document that it's unlinked -> freed -> reallocated inodes that we > > really care about here. > > > > WRT to symlinks, I was trying to use that as an example and not > > necessarily as the general reason for the patch. I.e., the general > > reason is that the VFS uses rcu protection for inode stability (just as > > for the inode free path), and the symlink thing is just an example of > > how things can go wrong in the current implementation without it. > > > > > (That wasn't a challenge, I'm just making sure I understand this > > > correctly.) > > > > > > > we only need to poll and assign RCU state in > > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > > the new field into padding holes that conveniently preexist in the > > > > same cacheline as the deferred inactivation list. > > > > > > > > Finally, note that the ideal long term solution here is to > > > > rearchitect bits of XFS' internal inode lifecycle management such > > > > that this additional stall point is not required, but this requires > > > > more thought, time and work to address. This approach restores > > > > functional correctness in the meantime. > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > > > > > Hi all, > > > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > > latter in that it triggers the start of a grace period, when necessary. > > > > This currently invokes per inode, which is the ideal frequency in > > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > > in some manner, etc., if there is good reason to do that. > > > > > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > > > do they share? > > > > My previous experiments on a teardown grace period had me thinking > > batching would occur, but I don't recall which RCU call I was using at > > the time so I'd probably have to throw a tracepoint in there to dump > > some of the grace period values and double check to be sure. 
(If this is > > not the case, that might be a good reason to tweak things as discussed > > above). > > An RCU grace period typically takes some milliseconds to complete, so a > great many inodes would end up being tagged for the same grace period. > For example, if "rm -rf" could delete one file per microsecond, the > first few thousand files would be tagged with one grace period, > the next few thousand with the next grace period, and so on. > > In the unlikely event that RCU was totally idle when the "rm -rf" > started, the very first file might get its own grace period, but > they would batch in the thousands thereafter. > Great, thanks for the info. > On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if > there is always other RCU update activity, get_state_synchronize_rcu() > is just fine. So if XFS does a call_rcu() or synchronize_rcu() every > so often, all you need here is get_state_synchronize_rcu()(). > > Another approach is to do a start_poll_synchronize_rcu() every 1,000 > events, and use get_state_synchronize_rcu() otherwise. And there are > a lot of possible variations on that theme. > > But why not just try always doing start_poll_synchronize_rcu() and > only bother with get_state_synchronize_rcu() if that turns out to > be too slow? > Ack, that makes sense to me. We use call_rcu() to free inode memory and obviously will have a sync in the lookup path after this patch, but that is a consequence of the polling we add at the same time. I'm not sure that's enough activity on our own so I'd probably prefer to keep things simple, use the start_poll_*() variant from the start, and then consider further start/get filtering like you describe above if it ever becomes a problem. > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > > queue and nowhere else because the RCU docs imply that counter rollover > > > > is not a significant problem. In practice, I think this means that if an > > > > inode is stamped at least once, and the counter rolls over, future > > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > > trigger rcu syncs. I think this would require repeated > > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > > so I'm not sure how likely this is to occur. We could be more defensive > > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > > at recycle time, unconditionally refresh at destroy time (using > > > > get_state_synchronize_rcu() for non-inactivation), etc. > > Even on a 32-bit system that is running RCU grace periods as fast as they > will go, it will take about 12 days to overflow that counter. But if > you have an inode sitting on the list for that long, yes, you could > see unnecessary synchronous grace-period waits. > > Would it help if there was an API that gave you a special cookie value > that cond_synchronize_rcu() and friends recognized as "already expired"? > That way if poll_state_synchronize_rcu() says that original cookie > has expired, you could replace that cookie value with one that would > stay expired. Maybe a get_expired_synchronize_rcu() or some such? > Hmm.. so I think this would be helpful if we were to stamp the inode conditionally (i.e. unlinked inodes only) on eviction because then we wouldn't have to worry about clearing the cookie if said inode happens to be reallocated and then run through one or more eviction -> recycle sequences after a rollover of the grace period counter. 
With that sort of scheme, the inode could be sitting in cache for who knows how long with a counter that was conditionally synced against many days (or weeks?) prior, from whenever it was initially reallocated. However, as Dave points out that we probably want to poll RCU state on every inode eviction, I suspect that means this is less of an issue. An inode must be evicted for it to become a recycle candidate, and so if we update the inode unconditionally on every eviction, then I think the recycle code should always see the most recent cookie value and we don't have to worry much about clearing it. I think it's technically possible for an inode to sit in an inactivation queue for that sort of time period, but that would probably require the filesystem go idle or drop to low enough activity that a spurious rcu sync here or there is probably not a big deal. So all in all, I suspect if we already had such a special cookie variant of the API that was otherwise functionally equivalent, I'd probably use it to cover that potential case, but it's not clear to me atm that this use case necessarily warrants introduction of such an API... Brian > Thanx, Paul > > > > > Otherwise testing is ongoing, but this version at least survives an > > > > fstests regression run. > > > > > > > > Brian > > > > > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > > > fs/xfs/xfs_inode.h | 3 ++- > > > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index d019c98eb839..4931daa45ca4 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > > > spin_unlock(&ip->i_flags_lock); > > > > rcu_read_unlock(); > > > > > > > > + /* > > > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > > > + * inode until a grace period has elapsed from the time the previous > > > > + * version of the inode was destroyed. In most cases a grace period has > > > > + * already elapsed if the inode was (deferred) inactivated, but > > > > + * synchronize here as a last resort to guarantee correctness. > > > > + */ > > > > + cond_synchronize_rcu(ip->i_destroy_gp); > > > > + > > > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > > > error = xfs_reinit_inode(mp, inode); > > > > if (error) { > > > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > > > trace_xfs_inode_set_need_inactive(ip); > > > > spin_lock(&ip->i_flags_lock); > > > > ip->i_flags |= XFS_NEED_INACTIVE; > > > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > > > > > Hmm. The description says that we only need the rcu synchronization > > > when we're freeing an inode after its link count drops to zero, because > > > that's the vector for (say) the VFS inode ops actually changing due to > > > free/inactivate/reallocate/recycle while someone else is doing a lookup. > > > > > > > Right.. > > > > > I'm a bit puzzled why this unconditionally starts an rcu grace period, > > > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > > > above unconditionally instead of checking for i_mode==0 (or whatever > > > state the cached inode is left in after it's freed)? > > > > > > > Just an attempt to start simple and/or make any performance > > test/problems more blatant. I probably could have tagged this RFC. 
My > > primary goal with this patch was to establish whether the general > > approach is sane/viable/acceptable or we need to move in another > > direction. > > > > That aside, I think it's reasonable to have explicit logic around the > > unlinked case if we want to keep it restricted to that, though I would > > probably implement that as a conditional i_destroy_gp assignment and let > > the consumer context key off whether that field is set rather than > > attempt to infer unlinked logic (and then I guess reset it back to zero > > so it doesn't leak across reincarnation). That also probably facilitates > > a meaningful tracepoint to track the cases that do end up syncing, which > > helps with your earlier question around batching, so I'll look into > > those changes once I get through broader testing > > > > Brian > > > > > --D > > > > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > gc = get_cpu_ptr(mp->m_inodegc); > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > index c447bf04205a..2153e3edbb86 100644 > > > > --- a/fs/xfs/xfs_inode.h > > > > +++ b/fs/xfs/xfs_inode.h > > > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > > > /* Transaction and locking information. */ > > > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > > > mrlock_t i_lock; /* inode lock */ > > > > - atomic_t i_pincount; /* inode pin count */ > > > > struct llist_node i_gclist; /* deferred inactivation list */ > > > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > > > + atomic_t i_pincount; /* inode pin count */ > > > > > > > > /* > > > > * Bitsets of inode metadata that have been checked and/or are sick. > > > > -- > > > > 2.31.1 > > > > > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 15:12 ` Brian Foster @ 2022-01-24 16:40 ` Paul E. McKenney 0 siblings, 0 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-01-24 16:40 UTC (permalink / raw) To: Brian Foster Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 10:12:45AM -0500, Brian Foster wrote: > On Fri, Jan 21, 2022 at 09:30:19PM -0800, Paul E. McKenney wrote: > > On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: > > > On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > > > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > > > The XFS inode allocation algorithm aggressively reuses recently > > > > > freed inodes. This is historical behavior that has been in place for > > > > > quite some time, since XFS was imported to mainline Linux. Once the > > > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > > > became slightly incompatible because the inode recycle path doesn't > > > > > isolate concurrent access to the inode from the VFS. > > > > > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > > > to change the type or properties of a recently unlinked inode while > > > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > > > callback from inode_operations, and the latter happens to change to > > > > > a non-symlink type via a recycle event, the ->get_link() callback > > > > > pointer is reset to NULL and the lookup results in a crash. > > > > > > > > Hmm, so I guess what you're saying is that if the memory buffer > > > > allocation in ->get_link is slow enough, some other thread can free the > > > > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > > > > this time) all before ->get_link's memory allocation call returns, after > > > > which Bad Things Happen(tm)? > > > > > > > > Can the lookup thread end up with the wrong inode->i_ops too? > > > > > > > > > > We really don't need to even get into the XFS symlink code to reason > > > about the fundamental form of this issue. Consider that an RCU walk > > > starts, locates a symlink inode, meanwhile XFS recycles that inode into > > > something completely different, then the VFS loads and calls > > > ->get_link() (which is now NULL) on said inode and explodes. So the > > > presumption is that the VFS uses RCU protection to rely on some form of > > > stability of the inode (i.e., that the inode memory isn't freed, > > > callback vectors don't change, etc.). > > > > > > Validity of the symlink content is a variant of that class of problem, > > > likely already addressed by the recent inline symlink change, but that > > > doesn't address the broader issue. > > > > > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > > > with an RCU grace period. This is the same level of protection the > > > > > VFS expects for inactivated inodes that are never reused, and so > > > > > guarantees no further concurrent access before the type or > > > > > properties of the inode change. We don't want an unconditional > > > > > synchronize_rcu() event here because that would result in a > > > > > significant performance impact to mixed inode allocation workloads. 
> > > > > > > > > > Fortunately, we can take advantage of the recently added deferred > > > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > > > cases. Deferred inactivation queues and batches the on-disk freeing > > > > > of recently destroyed inodes, and so significantly increases the > > > > > likelihood that a grace period has elapsed by the time an inode is > > > > > freed and observable by the allocation code as a reuse candidate. > > > > > Capture the current RCU grace period cookie at inode destroy time > > > > > and refer to it at allocation time to conditionally wait for an RCU > > > > > grace period if one hadn't expired in the meantime. Since only > > > > > unlinked inodes are recycle candidates and unlinked inodes always > > > > > require inactivation, > > > > > > > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > > > > idle) but I think your point here is that unlinked inodes that become > > > > recycling candidates can cause lookup threads to trip over symlinks, and > > > > that's why we need to assign RCU state and poll on it, right? > > > > > > > > > > Good point. When I wrote the commit log I was thinking of recycled > > > inodes as "reincarnated" inodes, so that wording could probably be > > > improved. But yes, the code is written minimally/simply so I was trying > > > to document that it's unlinked -> freed -> reallocated inodes that we > > > really care about here. > > > > > > WRT to symlinks, I was trying to use that as an example and not > > > necessarily as the general reason for the patch. I.e., the general > > > reason is that the VFS uses rcu protection for inode stability (just as > > > for the inode free path), and the symlink thing is just an example of > > > how things can go wrong in the current implementation without it. > > > > > > > (That wasn't a challenge, I'm just making sure I understand this > > > > correctly.) > > > > > > > > > we only need to poll and assign RCU state in > > > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > > > the new field into padding holes that conveniently preexist in the > > > > > same cacheline as the deferred inactivation list. > > > > > > > > > > Finally, note that the ideal long term solution here is to > > > > > rearchitect bits of XFS' internal inode lifecycle management such > > > > > that this additional stall point is not required, but this requires > > > > > more thought, time and work to address. This approach restores > > > > > functional correctness in the meantime. > > > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > > --- > > > > > > > > > > Hi all, > > > > > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > > > latter in that it triggers the start of a grace period, when necessary. > > > > > This currently invokes per inode, which is the ideal frequency in > > > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > > > in some manner, etc., if there is good reason to do that. > > > > > > > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > > > > do they share? 
> > > > > > My previous experiments on a teardown grace period had me thinking > > > batching would occur, but I don't recall which RCU call I was using at > > > the time so I'd probably have to throw a tracepoint in there to dump > > > some of the grace period values and double check to be sure. (If this is > > > not the case, that might be a good reason to tweak things as discussed > > > above). > > > > An RCU grace period typically takes some milliseconds to complete, so a > > great many inodes would end up being tagged for the same grace period. > > For example, if "rm -rf" could delete one file per microsecond, the > > first few thousand files would be tagged with one grace period, > > the next few thousand with the next grace period, and so on. > > > > In the unlikely event that RCU was totally idle when the "rm -rf" > > started, the very first file might get its own grace period, but > > they would batch in the thousands thereafter. > > > > Great, thanks for the info. > > > On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if > > there is always other RCU update activity, get_state_synchronize_rcu() > > is just fine. So if XFS does a call_rcu() or synchronize_rcu() every > > so often, all you need here is get_state_synchronize_rcu()(). > > > > Another approach is to do a start_poll_synchronize_rcu() every 1,000 > > events, and use get_state_synchronize_rcu() otherwise. And there are > > a lot of possible variations on that theme. > > > > But why not just try always doing start_poll_synchronize_rcu() and > > only bother with get_state_synchronize_rcu() if that turns out to > > be too slow? > > > > Ack, that makes sense to me. We use call_rcu() to free inode memory and > obviously will have a sync in the lookup path after this patch, but that > is a consequence of the polling we add at the same time. I'm not sure > that's enough activity on our own so I'd probably prefer to keep things > simple, use the start_poll_*() variant from the start, and then consider > further start/get filtering like you describe above if it ever becomes a > problem. > > > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > > > queue and nowhere else because the RCU docs imply that counter rollover > > > > > is not a significant problem. In practice, I think this means that if an > > > > > inode is stamped at least once, and the counter rolls over, future > > > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > > > trigger rcu syncs. I think this would require repeated > > > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > > > so I'm not sure how likely this is to occur. We could be more defensive > > > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > > > at recycle time, unconditionally refresh at destroy time (using > > > > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > Even on a 32-bit system that is running RCU grace periods as fast as they > > will go, it will take about 12 days to overflow that counter. But if > > you have an inode sitting on the list for that long, yes, you could > > see unnecessary synchronous grace-period waits. > > > > Would it help if there was an API that gave you a special cookie value > > that cond_synchronize_rcu() and friends recognized as "already expired"? > > That way if poll_state_synchronize_rcu() says that original cookie > > has expired, you could replace that cookie value with one that would > > stay expired. 
Maybe a get_expired_synchronize_rcu() or some such? > > > > Hmm.. so I think this would be helpful if we were to stamp the inode > conditionally (i.e. unlinked inodes only) on eviction because then we > wouldn't have to worry about clearing the cookie if said inode happens > to be reallocated and then run through one or more eviction -> recycle > sequences after a rollover of the grace period counter. With that sort > of scheme, the inode could be sitting in cache for who knows how long > with a counter that was conditionally synced against many days (or > weeks?) prior, from whenever it was initially reallocated. > > However, as Dave points out that we probably want to poll RCU state on > every inode eviction, I suspect that means this is less of an issue. An > inode must be evicted for it to become a recycle candidate, and so if we > update the inode unconditionally on every eviction, then I think the > recycle code should always see the most recent cookie value and we don't > have to worry much about clearing it. > > I think it's technically possible for an inode to sit in an inactivation > queue for that sort of time period, but that would probably require the > filesystem go idle or drop to low enough activity that a spurious rcu > sync here or there is probably not a big deal. So all in all, I suspect > if we already had such a special cookie variant of the API that was > otherwise functionally equivalent, I'd probably use it to cover that > potential case, but it's not clear to me atm that this use case > necessarily warrants introduction of such an API... If you need it, it happens to be easy to provide. If you don't need it, I am of course happy to avoid adding another RCU API member. ;-) Thanx, Paul > > > > > Otherwise testing is ongoing, but this version at least survives an > > > > > fstests regression run. > > > > > > > > > > Brian > > > > > > > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > > > > fs/xfs/xfs_inode.h | 3 ++- > > > > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > index d019c98eb839..4931daa45ca4 100644 > > > > > --- a/fs/xfs/xfs_icache.c > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > > > > spin_unlock(&ip->i_flags_lock); > > > > > rcu_read_unlock(); > > > > > > > > > > + /* > > > > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > > > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > > > > + * inode until a grace period has elapsed from the time the previous > > > > > + * version of the inode was destroyed. In most cases a grace period has > > > > > + * already elapsed if the inode was (deferred) inactivated, but > > > > > + * synchronize here as a last resort to guarantee correctness. > > > > > + */ > > > > > + cond_synchronize_rcu(ip->i_destroy_gp); > > > > > + > > > > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > > > > error = xfs_reinit_inode(mp, inode); > > > > > if (error) { > > > > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > > > > trace_xfs_inode_set_need_inactive(ip); > > > > > spin_lock(&ip->i_flags_lock); > > > > > ip->i_flags |= XFS_NEED_INACTIVE; > > > > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > > > > > > > Hmm. 
The description says that we only need the rcu synchronization > > > > when we're freeing an inode after its link count drops to zero, because > > > > that's the vector for (say) the VFS inode ops actually changing due to > > > > free/inactivate/reallocate/recycle while someone else is doing a lookup. > > > > > > > > > > Right.. > > > > > > > I'm a bit puzzled why this unconditionally starts an rcu grace period, > > > > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > > > > above unconditionally instead of checking for i_mode==0 (or whatever > > > > state the cached inode is left in after it's freed)? > > > > > > > > > > Just an attempt to start simple and/or make any performance > > > test/problems more blatant. I probably could have tagged this RFC. My > > > primary goal with this patch was to establish whether the general > > > approach is sane/viable/acceptable or we need to move in another > > > direction. > > > > > > That aside, I think it's reasonable to have explicit logic around the > > > unlinked case if we want to keep it restricted to that, though I would > > > probably implement that as a conditional i_destroy_gp assignment and let > > > the consumer context key off whether that field is set rather than > > > attempt to infer unlinked logic (and then I guess reset it back to zero > > > so it doesn't leak across reincarnation). That also probably facilitates > > > a meaningful tracepoint to track the cases that do end up syncing, which > > > helps with your earlier question around batching, so I'll look into > > > those changes once I get through broader testing > > > > > > Brian > > > > > > > --D > > > > > > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > > > gc = get_cpu_ptr(mp->m_inodegc); > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > > index c447bf04205a..2153e3edbb86 100644 > > > > > --- a/fs/xfs/xfs_inode.h > > > > > +++ b/fs/xfs/xfs_inode.h > > > > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > > > > /* Transaction and locking information. */ > > > > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > > > > mrlock_t i_lock; /* inode lock */ > > > > > - atomic_t i_pincount; /* inode pin count */ > > > > > struct llist_node i_gclist; /* deferred inactivation list */ > > > > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > > > > + atomic_t i_pincount; /* inode pin count */ > > > > > > > > > > /* > > > > > * Bitsets of inode metadata that have been checked and/or are sick. > > > > > -- > > > > > 2.31.1 > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
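As a concrete example of the batching Paul suggests above (start a grace period only every N events, use the cheaper state read otherwise), a sketch follows. The helper name and the 1000-event threshold are made up for illustration; start_poll_synchronize_rcu() and get_state_synchronize_rcu() are the real RCU APIs being discussed.

	/*
	 * Illustrative only: force a grace period to start on every
	 * 1000th destroy and just sample the current grace period state
	 * the rest of the time.
	 */
	static unsigned long
	xfs_destroy_gp_cookie(void)
	{
		static atomic_t	destroys = ATOMIC_INIT(0);

		if (atomic_inc_return(&destroys) % 1000 == 0)
			return start_poll_synchronize_rcu();
		return get_state_synchronize_rcu();
	}

Either cookie can later be passed to cond_synchronize_rcu() in xfs_iget_recycle() exactly as in the posted patch.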
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster 2022-01-21 17:26 ` Darrick J. Wong @ 2022-01-23 22:43 ` Dave Chinner 2022-01-24 15:06 ` Brian Foster 2022-01-24 15:02 ` Brian Foster 2 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-23 22:43 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > The XFS inode allocation algorithm aggressively reuses recently > freed inodes. This is historical behavior that has been in place for > quite some time, since XFS was imported to mainline Linux. Once the > VFS adopted RCUwalk path lookups (also some time ago), this behavior > became slightly incompatible because the inode recycle path doesn't > isolate concurrent access to the inode from the VFS. > > This has recently manifested as problems in the VFS when XFS happens > to change the type or properties of a recently unlinked inode while > still involved in an RCU lookup. For example, if the VFS refers to a > previous incarnation of a symlink inode, obtains the ->get_link() > callback from inode_operations, and the latter happens to change to > a non-symlink type via a recycle event, the ->get_link() callback > pointer is reset to NULL and the lookup results in a crash. > > To avoid this class of problem, isolate in-core inodes for recycling > with an RCU grace period. This is the same level of protection the > VFS expects for inactivated inodes that are never reused, and so > guarantees no further concurrent access before the type or > properties of the inode change. We don't want an unconditional > synchronize_rcu() event here because that would result in a > significant performance impact to mixed inode allocation workloads. > > Fortunately, we can take advantage of the recently added deferred > inactivation mechanism to mitigate the need for an RCU wait in most > cases. Deferred inactivation queues and batches the on-disk freeing > of recently destroyed inodes, and so significantly increases the > likelihood that a grace period has elapsed by the time an inode is > freed and observable by the allocation code as a reuse candidate. > Capture the current RCU grace period cookie at inode destroy time > and refer to it at allocation time to conditionally wait for an RCU > grace period if one hadn't expired in the meantime. Since only > unlinked inodes are recycle candidates and unlinked inodes always > require inactivation, we only need to poll and assign RCU state in > the inactivation codepath. I think this assertion is incorrect. Recycling can occur on any inode that has been evicted from the VFS cache. i.e. while the inode is sitting in XFS_IRECLAIMABLE state waiting for the background inodegc to run (every ~5s by default) a ->lookup from the VFS can occur and we find that same inode sitting there in XFS_IRECLAIMABLE state. This lookup then hits the recycle path. In this case, even though we re-instantiate the inode into the same identity, it goes through a transient state where the inode has it's identity returned to the default initial "just allocated" VFS state and this transient state can be visible from RCU lookups within the RCU grace period the inode was evicted from. This means the RCU lookup could see the inode with i_ops having been reset to &empty_ops, which means any method called on the inode at this time (e.g. ->get_link) will hit a NULL pointer dereference. 
This requires multiple concurrent lookups on the same inode that just got evicted, some which the RCU pathwalk finds the old stale dentry/inode pair, and others that don't find that old pair. This is much harder to trip over but, IIRC, we used to see this quite a lot with NFS server workloads when multiple operations on a single inode could come in from multiple clients and be processed in parallel by knfsd threads. This was quite a hot path before the NFS server had an open-file cache added to it, and it probably still is if the NFS server OFC is not large enough for the working set of files being accessed... Hence we have to ensure that RCU lookups can't find an evicted inode through anything other than xfs_iget() while we are re-instantiating the VFS inode state in xfs_iget_recycle(). Hence the RCU state sampling needs to be done unconditionally for all inodes going through ->destroy_inode so we can ensure grace periods expire for all inodes being recycled, not just those that required inactivation... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
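A sketch of what Dave describes here: move the sampling so that every inode passing through ->destroy_inode is covered, rather than only those queued for inactivation. xfs_fs_destroy_inode() is the existing XFS ->destroy_inode callback; where exactly the assignment would sit relative to the existing logic is an assumption.

	static void
	xfs_fs_destroy_inode(
		struct inode		*inode)
	{
		struct xfs_inode	*ip = XFS_I(inode);

		/* ... existing tracing and assertions ... */

		/*
		 * Record the current grace period for every inode being
		 * torn down so that xfs_iget_recycle() can wait on it if
		 * the inode is recycled before that grace period expires.
		 */
		ip->i_destroy_gp = start_poll_synchronize_rcu();

		/* ... hand the inode off to inodegc/reclaim as before ... */
	}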
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-23 22:43 ` Dave Chinner @ 2022-01-24 15:06 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-24 15:06 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 09:43:46AM +1100, Dave Chinner wrote: > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > The XFS inode allocation algorithm aggressively reuses recently > > freed inodes. This is historical behavior that has been in place for > > quite some time, since XFS was imported to mainline Linux. Once the > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > became slightly incompatible because the inode recycle path doesn't > > isolate concurrent access to the inode from the VFS. > > > > This has recently manifested as problems in the VFS when XFS happens > > to change the type or properties of a recently unlinked inode while > > still involved in an RCU lookup. For example, if the VFS refers to a > > previous incarnation of a symlink inode, obtains the ->get_link() > > callback from inode_operations, and the latter happens to change to > > a non-symlink type via a recycle event, the ->get_link() callback > > pointer is reset to NULL and the lookup results in a crash. > > > > To avoid this class of problem, isolate in-core inodes for recycling > > with an RCU grace period. This is the same level of protection the > > VFS expects for inactivated inodes that are never reused, and so > > guarantees no further concurrent access before the type or > > properties of the inode change. We don't want an unconditional > > synchronize_rcu() event here because that would result in a > > significant performance impact to mixed inode allocation workloads. > > > > Fortunately, we can take advantage of the recently added deferred > > inactivation mechanism to mitigate the need for an RCU wait in most > > cases. Deferred inactivation queues and batches the on-disk freeing > > of recently destroyed inodes, and so significantly increases the > > likelihood that a grace period has elapsed by the time an inode is > > freed and observable by the allocation code as a reuse candidate. > > Capture the current RCU grace period cookie at inode destroy time > > and refer to it at allocation time to conditionally wait for an RCU > > grace period if one hadn't expired in the meantime. Since only > > unlinked inodes are recycle candidates and unlinked inodes always > > require inactivation, we only need to poll and assign RCU state in > > the inactivation codepath. > > I think this assertion is incorrect. > > Recycling can occur on any inode that has been evicted from the VFS > cache. i.e. while the inode is sitting in XFS_IRECLAIMABLE state > waiting for the background inodegc to run (every ~5s by default) a > ->lookup from the VFS can occur and we find that same inode sitting > there in XFS_IRECLAIMABLE state. This lookup then hits the recycle > path. > See my reply to Darrick wrt to the poor wording. I'm aware of the eviction -> recycle case, just didn't think we needed to deal with it here. > In this case, even though we re-instantiate the inode into the same > identity, it goes through a transient state where the inode has it's > identity returned to the default initial "just allocated" VFS state > and this transient state can be visible from RCU lookups within the > RCU grace period the inode was evicted from. 
This means the RCU > lookup could see the inode with i_ops having been reset to > &empty_ops, which means any method called on the inode at this time > (e.g. ->get_link) will hit a NULL pointer dereference. > Hmm, good point. > This requires multiple concurrent lookups on the same inode that > just got evicted, some which the RCU pathwalk finds the old stale > dentry/inode pair, and others that don't find that old pair. This is > much harder to trip over but, IIRC, we used to see this quite a lot > with NFS server workloads when multiple operations on a single inode > could come in from multiple clients and be processed in parallel by > knfsd threads. This was quite a hot path before the NFS server had an > open-file cache added to it, and it probably still is if the NFS > server OFC is not large enough for the working set of files being > accessed... > > Hence we have to ensure that RCU lookups can't find an evicted inode > through anything other than xfs_iget() while we are re-instantiating > the VFS inode state in xfs_iget_recycle(). Hence the RCU state > sampling needs to be done unconditionally for all inodes going > through ->destroy_inode so we can ensure grace periods expire for > all inodes being recycled, not just those that required > inactivation... > Yeah, that makes sense. So this means we don't want to filter to unlinked inodes, but OTOH Paul's feedback suggests the RCU calls should be fairly efficient on a per-inode basis. On top of that, the non-unlinked eviction case doesn't have such a direct impact on a mixed workload the way the unlinked case does (i.e. inactivation populating a free inode record for the next inode allocation to discover), so this is probably less significant of a change. Personally, my general takeaway from the just posted test results is that we really should be thinking about how to shift the allocation path cost away into the inactivation side, even if not done from the start. This changes things a bit because we know we need an rcu sync in the iget path for the (non-unlinnked) eviction case regardless, so perhaps the right approach is to get the basic functional fix in place to start, then revisit potential optimizations in the inactivation path for the unlinked inode case. IOW, a conditional, asynchronous rcu delay in the inactivation path (only) for unlinked inodes doesn't remove the need for an iget rcu sync in general, but it would still improve inode allocation performance if we ensure those inodes aren't reallocatable until a grace period has elapsed. We just have to implement it in a way that doesn't unreasonably impact sustained removal performance. Thoughts? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
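One way the "conditional, asynchronous rcu delay in the inactivation path" idea above could be expressed, purely as a sketch: rather than blocking anywhere, poll the cookie when a gc batch is processed and leave not-yet-expired inodes queued for a later pass. Nothing like this requeue exists in the posted patch; poll_state_synchronize_rcu() is the real non-blocking counterpart to cond_synchronize_rcu().

	/* in the inodegc worker, for each queued unlinked inode */
	if (!poll_state_synchronize_rcu(ip->i_destroy_gp)) {
		/*
		 * Grace period still pending: skip this inode and leave
		 * it queued for a later pass instead of stalling the
		 * worker (requeue mechanics hand-waved here).
		 */
		continue;
	}
	/* ...proceed with inactivation and on-disk freeing as normal... */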
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster 2022-01-21 17:26 ` Darrick J. Wong 2022-01-23 22:43 ` Dave Chinner @ 2022-01-24 15:02 ` Brian Foster 2022-01-24 22:08 ` Dave Chinner 2 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-24 15:02 UTC (permalink / raw) To: linux-xfs; +Cc: Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > The XFS inode allocation algorithm aggressively reuses recently > freed inodes. This is historical behavior that has been in place for > quite some time, since XFS was imported to mainline Linux. Once the > VFS adopted RCUwalk path lookups (also some time ago), this behavior > became slightly incompatible because the inode recycle path doesn't > isolate concurrent access to the inode from the VFS. > > This has recently manifested as problems in the VFS when XFS happens > to change the type or properties of a recently unlinked inode while > still involved in an RCU lookup. For example, if the VFS refers to a > previous incarnation of a symlink inode, obtains the ->get_link() > callback from inode_operations, and the latter happens to change to > a non-symlink type via a recycle event, the ->get_link() callback > pointer is reset to NULL and the lookup results in a crash. > > To avoid this class of problem, isolate in-core inodes for recycling > with an RCU grace period. This is the same level of protection the > VFS expects for inactivated inodes that are never reused, and so > guarantees no further concurrent access before the type or > properties of the inode change. We don't want an unconditional > synchronize_rcu() event here because that would result in a > significant performance impact to mixed inode allocation workloads. > > Fortunately, we can take advantage of the recently added deferred > inactivation mechanism to mitigate the need for an RCU wait in most > cases. Deferred inactivation queues and batches the on-disk freeing > of recently destroyed inodes, and so significantly increases the > likelihood that a grace period has elapsed by the time an inode is > freed and observable by the allocation code as a reuse candidate. > Capture the current RCU grace period cookie at inode destroy time > and refer to it at allocation time to conditionally wait for an RCU > grace period if one hadn't expired in the meantime. Since only > unlinked inodes are recycle candidates and unlinked inodes always > require inactivation, we only need to poll and assign RCU state in > the inactivation codepath. Slightly adjust struct xfs_inode to fit > the new field into padding holes that conveniently preexist in the > same cacheline as the deferred inactivation list. > > Finally, note that the ideal long term solution here is to > rearchitect bits of XFS' internal inode lifecycle management such > that this additional stall point is not required, but this requires > more thought, time and work to address. This approach restores > functional correctness in the meantime. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > > Hi all, > > Here's the RCU fixup patch for inode reuse that I've been playing with, > re: the vfs patch discussion [1]. I've put it in pretty much the most > basic form, but I think there are a couple aspects worth thinking about: > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > get_state_synchronize_rcu()). 
The former is a bit more active than the > latter in that it triggers the start of a grace period, when necessary. > This currently invokes per inode, which is the ideal frequency in > theory, but could be reduced, associated with the xfs_inogegc thresholds > in some manner, etc., if there is good reason to do that. > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > queue and nowhere else because the RCU docs imply that counter rollover > is not a significant problem. In practice, I think this means that if an > inode is stamped at least once, and the counter rolls over, future > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > trigger rcu syncs. I think this would require repeated > eviction/reinstantiation cycles within a small window to be noticeable, > so I'm not sure how likely this is to occur. We could be more defensive > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > at recycle time, unconditionally refresh at destroy time (using > get_state_synchronize_rcu() for non-inactivation), etc. > > Otherwise testing is ongoing, but this version at least survives an > fstests regression run. > FYI, I modified my repeated alloc/free test to do some batching and form it into something more able to measure the potential side effect / cost of the grace period sync. The test is a single threaded, file alloc/free loop using a variable per iteration batch size. The test runs for ~60s and reports how many total files were allocated/freed in that period with the specified batch size. Note that this particular test ran without any background workload. Results are as follows: files baseline test 1 38480 38437 4 126055 111080 8 218299 134469 16 306619 141968 32 397909 152267 64 418603 200875 128 469077 289365 256 684117 566016 512 931328 878933 1024 1126741 1118891 The first column shows the batch size of the test run while the second and third show results (averaged across three test runs) for the baseline (5.16.0-rc5) and test kernels. This basically shows that as the inactivation queue more efficiently batches removals, the number of stalls on the allocation side increase accordingly and thus slow the task down. This becomes significant by around 8 files per alloc/free iteration and seems to recover at around 512 files per iteration. Outside of those values, the additional overhead appears to be mostly masked. I'm not sure how realistic this sort of symmetric/predictable workload is in the wild, but this is more designed to show potential impact of the change. The delay cost can be shifted to the remove side to some degree if we wanted to go that route. E.g., a quick experiment to add an rcu sync in the inactivation path right before the inode is freed allows this test to behave much more in line with baseline up through about the 256 file mark, after which point results start to fall off as I suspect we start to measure stalls in the remove side. That's just a test of a quick hack, however. Since there is no real urgency to inactivate an unlinked inode (it has no potential users until it's freed), I suspect that result can be further optimized to absorb the cost of an rcu delay by deferring the steps that make the inode available for reallocation in the first place. In theory if that can be made completely asynchronous, then there is no real latency cost at all because nothing can use the inode until it's ultimately free on disk. 
However in reality we must have thresholds and whatnot to ensure the outstanding queue cannot grow out of control. My previous experiments suggest that an RCU delay on the inactivation side is measureable via a simple 'rm -rf' with the current thresholds, but can be mitigated if the pipeline/thresholds are tuned up a bit to accomodate the added delay. This has more complexity and tradeoffs, but IMO, this is something we should be thinking about at least as a next step to something like this patch. Brian > Brian > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > fs/xfs/xfs_icache.c | 11 +++++++++++ > fs/xfs/xfs_inode.h | 3 ++- > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index d019c98eb839..4931daa45ca4 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -349,6 +349,16 @@ xfs_iget_recycle( > spin_unlock(&ip->i_flags_lock); > rcu_read_unlock(); > > + /* > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > + * inode until a grace period has elapsed from the time the previous > + * version of the inode was destroyed. In most cases a grace period has > + * already elapsed if the inode was (deferred) inactivated, but > + * synchronize here as a last resort to guarantee correctness. > + */ > + cond_synchronize_rcu(ip->i_destroy_gp); > + > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > error = xfs_reinit_inode(mp, inode); > if (error) { > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > trace_xfs_inode_set_need_inactive(ip); > spin_lock(&ip->i_flags_lock); > ip->i_flags |= XFS_NEED_INACTIVE; > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > spin_unlock(&ip->i_flags_lock); > > gc = get_cpu_ptr(mp->m_inodegc); > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index c447bf04205a..2153e3edbb86 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > /* Transaction and locking information. */ > struct xfs_inode_log_item *i_itemp; /* logging information */ > mrlock_t i_lock; /* inode lock */ > - atomic_t i_pincount; /* inode pin count */ > struct llist_node i_gclist; /* deferred inactivation list */ > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > + atomic_t i_pincount; /* inode pin count */ > > /* > * Bitsets of inode metadata that have been checked and/or are sick. > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 36+ messages in thread
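The "quick experiment" mentioned above amounts to moving the wait from the allocation side to the inactivation side, i.e. something along these lines just before the inode is marked free on disk in xfs_inactive_ifree() (the exact placement is a guess at what the experiment did):

	/*
	 * Wait out the destroy-time grace period before the inode is
	 * marked free on disk and becomes visible to the allocator as a
	 * reuse candidate, instead of waiting in xfs_iget_recycle().
	 */
	cond_synchronize_rcu(ip->i_destroy_gp);

	error = xfs_ifree(tp, ip);

With this, the allocation path only ever sees inodes whose previous incarnation is already past a grace period, at the cost of a potential stall in the background inactivation worker.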
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 15:02 ` Brian Foster @ 2022-01-24 22:08 ` Dave Chinner 2022-01-24 23:29 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-24 22:08 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 10:02:27AM -0500, Brian Foster wrote: > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > The XFS inode allocation algorithm aggressively reuses recently > > freed inodes. This is historical behavior that has been in place for > > quite some time, since XFS was imported to mainline Linux. Once the > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > became slightly incompatible because the inode recycle path doesn't > > isolate concurrent access to the inode from the VFS. > > > > This has recently manifested as problems in the VFS when XFS happens > > to change the type or properties of a recently unlinked inode while > > still involved in an RCU lookup. For example, if the VFS refers to a > > previous incarnation of a symlink inode, obtains the ->get_link() > > callback from inode_operations, and the latter happens to change to > > a non-symlink type via a recycle event, the ->get_link() callback > > pointer is reset to NULL and the lookup results in a crash. > > > > To avoid this class of problem, isolate in-core inodes for recycling > > with an RCU grace period. This is the same level of protection the > > VFS expects for inactivated inodes that are never reused, and so > > guarantees no further concurrent access before the type or > > properties of the inode change. We don't want an unconditional > > synchronize_rcu() event here because that would result in a > > significant performance impact to mixed inode allocation workloads. > > > > Fortunately, we can take advantage of the recently added deferred > > inactivation mechanism to mitigate the need for an RCU wait in most > > cases. Deferred inactivation queues and batches the on-disk freeing > > of recently destroyed inodes, and so significantly increases the > > likelihood that a grace period has elapsed by the time an inode is > > freed and observable by the allocation code as a reuse candidate. > > Capture the current RCU grace period cookie at inode destroy time > > and refer to it at allocation time to conditionally wait for an RCU > > grace period if one hadn't expired in the meantime. Since only > > unlinked inodes are recycle candidates and unlinked inodes always > > require inactivation, we only need to poll and assign RCU state in > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > the new field into padding holes that conveniently preexist in the > > same cacheline as the deferred inactivation list. > > > > Finally, note that the ideal long term solution here is to > > rearchitect bits of XFS' internal inode lifecycle management such > > that this additional stall point is not required, but this requires > > more thought, time and work to address. This approach restores > > functional correctness in the meantime. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > > > Hi all, > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > basic form, but I think there are a couple aspects worth thinking about: > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > get_state_synchronize_rcu()). 
The former is a bit more active than the > > latter in that it triggers the start of a grace period, when necessary. > > This currently invokes per inode, which is the ideal frequency in > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > in some manner, etc., if there is good reason to do that. > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > queue and nowhere else because the RCU docs imply that counter rollover > > is not a significant problem. In practice, I think this means that if an > > inode is stamped at least once, and the counter rolls over, future > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > trigger rcu syncs. I think this would require repeated > > eviction/reinstantiation cycles within a small window to be noticeable, > > so I'm not sure how likely this is to occur. We could be more defensive > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > at recycle time, unconditionally refresh at destroy time (using > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > Otherwise testing is ongoing, but this version at least survives an > > fstests regression run. > > > > FYI, I modified my repeated alloc/free test to do some batching and form > it into something more able to measure the potential side effect / cost > of the grace period sync. The test is a single threaded, file alloc/free > loop using a variable per iteration batch size. The test runs for ~60s > and reports how many total files were allocated/freed in that period > with the specified batch size. Note that this particular test ran > without any background workload. Results are as follows: > > files baseline test > > 1 38480 38437 > 4 126055 111080 > 8 218299 134469 > 16 306619 141968 > 32 397909 152267 > 64 418603 200875 > 128 469077 289365 > 256 684117 566016 > 512 931328 878933 > 1024 1126741 1118891 Can you post the test code, because 38,000 alloc/unlinks in 60s is extremely slow for a single tight open-unlink-close loop. I'd be expecting at least ~10,000 alloc/unlink iterations per second, not 650/second. A quick test here with "batch size == 1" main loop on a vanilla 5.17-rc1 kernel: for (i = 0; i < iters; i++) { int fd = open(file, O_CREAT|O_RDWR, 0777); if (fd < 0) { perror("open"); exit(1); } unlink(file); close(fd); } $ time ./open-unlink 10000 /mnt/scratch/blah real 0m0.962s user 0m0.022s sys 0m0.775s Shows pretty much 10,000 alloc/unlinks a second without any specific batching on my slow machine. And my "fast" machine (3yr old 2.1GHz Xeons) $ time sudo ./open-unlink 40000 /mnt/scratch/foo real 0m0.958s user 0m0.033s sys 0m0.770s Runs single loop iterations at 40,000 alloc/unlink iterations per second. So I'm either not understanding the test you are running and/or the kernel/patches that you are comparing here. Is the "baseline" just a vanilla, unmodified upstream kernel, or something else? > That's just a test of a quick hack, however. Since there is no real > urgency to inactivate an unlinked inode (it has no potential users until > it's freed), On the contrary, there is extreme urgency to inactivate inodes quickly. Darrick made the original assumption that we could delay inactivation indefinitely and so he allowed really deep queues of up to 64k deferred inactivations. But with queues this deep, we could never get that background inactivation code to perform anywhere near the original synchronous background inactivation code. e.g. 
I measured 60-70% performance degradations on my scalability tests, and nothing stood out in the profiles until I started looking at CPU data cache misses. What we found was that if we don't run the background inactivation while the inodes are still hot in the CPU cache, the cost of bringing the inodes back into the CPU cache at a later time is extremely expensive and cannot be avoided. That's where all the performance was lost and so this is exactly what the current per-cpu background inactivation implementation avoids. i.e. we have shallow queues, early throttling and CPU affinity to ensure that the inodes are processed before they are evicted from the CPU caches and ensure we don't take a performance hit. IOWs, the deferred inactivation queues are designed to minimise inactivation delay, generally trying to delay inactivation for a couple of milliseconds at most during typical fast-path inactivations (i.e. an extent or two per inode needing to be freed, plus maybe the inode itself). Such inactivations generally take 50-100us of CPU time each to process, and we try to keep the inactivation batch size down to 32 inodes... > I suspect that result can be further optimized to absorb > the cost of an rcu delay by deferring the steps that make the inode > available for reallocation in the first place. A typical RCU grace period delay is longer than the latency we require to keep the inodes hot in cache for efficient background inactivation. We can't move the "we need to RCU delay inactivation" overhead to the background inactivation code without taking a global performance hit to the filesystem performance due to the CPU cache thrashing it will introduce.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
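For completeness, Dave's "batch size == 1" loop above filled out into a compilable program. The includes and argument handling are additions; the core loop is as posted.

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		long i, iters;
		const char *file;

		if (argc != 3) {
			fprintf(stderr, "usage: %s iters file\n", argv[0]);
			exit(1);
		}
		iters = atol(argv[1]);
		file = argv[2];

		for (i = 0; i < iters; i++) {
			int fd = open(file, O_CREAT|O_RDWR, 0777);

			if (fd < 0) {
				perror("open");
				exit(1);
			}
			unlink(file);
			close(fd);
		}
		return 0;
	}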
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 22:08 ` Dave Chinner @ 2022-01-24 23:29 ` Brian Foster 2022-01-25 0:31 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-24 23:29 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > On Mon, Jan 24, 2022 at 10:02:27AM -0500, Brian Foster wrote: > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > The XFS inode allocation algorithm aggressively reuses recently > > > freed inodes. This is historical behavior that has been in place for > > > quite some time, since XFS was imported to mainline Linux. Once the > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > became slightly incompatible because the inode recycle path doesn't > > > isolate concurrent access to the inode from the VFS. > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > to change the type or properties of a recently unlinked inode while > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > callback from inode_operations, and the latter happens to change to > > > a non-symlink type via a recycle event, the ->get_link() callback > > > pointer is reset to NULL and the lookup results in a crash. > > > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > with an RCU grace period. This is the same level of protection the > > > VFS expects for inactivated inodes that are never reused, and so > > > guarantees no further concurrent access before the type or > > > properties of the inode change. We don't want an unconditional > > > synchronize_rcu() event here because that would result in a > > > significant performance impact to mixed inode allocation workloads. > > > > > > Fortunately, we can take advantage of the recently added deferred > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > cases. Deferred inactivation queues and batches the on-disk freeing > > > of recently destroyed inodes, and so significantly increases the > > > likelihood that a grace period has elapsed by the time an inode is > > > freed and observable by the allocation code as a reuse candidate. > > > Capture the current RCU grace period cookie at inode destroy time > > > and refer to it at allocation time to conditionally wait for an RCU > > > grace period if one hadn't expired in the meantime. Since only > > > unlinked inodes are recycle candidates and unlinked inodes always > > > require inactivation, we only need to poll and assign RCU state in > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > the new field into padding holes that conveniently preexist in the > > > same cacheline as the deferred inactivation list. > > > > > > Finally, note that the ideal long term solution here is to > > > rearchitect bits of XFS' internal inode lifecycle management such > > > that this additional stall point is not required, but this requires > > > more thought, time and work to address. This approach restores > > > functional correctness in the meantime. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > > > > Hi all, > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > re: the vfs patch discussion [1]. 
I've put it in pretty much the most > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > latter in that it triggers the start of a grace period, when necessary. > > > This currently invokes per inode, which is the ideal frequency in > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > in some manner, etc., if there is good reason to do that. > > > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > queue and nowhere else because the RCU docs imply that counter rollover > > > is not a significant problem. In practice, I think this means that if an > > > inode is stamped at least once, and the counter rolls over, future > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > trigger rcu syncs. I think this would require repeated > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > so I'm not sure how likely this is to occur. We could be more defensive > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > at recycle time, unconditionally refresh at destroy time (using > > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > > > Otherwise testing is ongoing, but this version at least survives an > > > fstests regression run. > > > > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > it into something more able to measure the potential side effect / cost > > of the grace period sync. The test is a single threaded, file alloc/free > > loop using a variable per iteration batch size. The test runs for ~60s > > and reports how many total files were allocated/freed in that period > > with the specified batch size. Note that this particular test ran > > without any background workload. Results are as follows: > > > > files baseline test > > > > 1 38480 38437 > > 4 126055 111080 > > 8 218299 134469 > > 16 306619 141968 > > 32 397909 152267 > > 64 418603 200875 > > 128 469077 289365 > > 256 684117 566016 > > 512 931328 878933 > > 1024 1126741 1118891 > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > extremely slow for a single tight open-unlink-close loop. I'd be > expecting at least ~10,000 alloc/unlink iterations per second, not > 650/second. > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm <files>' loop. I know there was application overhead because if I tweaked the script to open an fd directly rather than use touch, the single file performance jumped up a bit, but it seemed to wash away as I increased the file count so I kept running it with larger sizes. This seems off so I'll port it over to C code and see how much the numbers change. > A quick test here with "batch size == 1" main loop on a vanilla > 5.17-rc1 kernel: > > for (i = 0; i < iters; i++) { > int fd = open(file, O_CREAT|O_RDWR, 0777); > > if (fd < 0) { > perror("open"); > exit(1); > } > unlink(file); > close(fd); > } > > > $ time ./open-unlink 10000 /mnt/scratch/blah > > real 0m0.962s > user 0m0.022s > sys 0m0.775s > > Shows pretty much 10,000 alloc/unlinks a second without any specific > batching on my slow machine. 
And my "fast" machine (3yr old 2.1GHz > Xeons) > > $ time sudo ./open-unlink 40000 /mnt/scratch/foo > > real 0m0.958s > user 0m0.033s > sys 0m0.770s > > Runs single loop iterations at 40,000 alloc/unlink iterations per > second. > > So I'm either not understanding the test you are running and/or the > kernel/patches that you are comparing here. Is the "baseline" just a > vanilla, unmodified upstream kernel, or something else? > Yeah, the baseline was just the XFS for-next branch. > > That's just a test of a quick hack, however. Since there is no real > > urgency to inactivate an unlinked inode (it has no potential users until > > it's freed), > > On the contrary, there is extreme urgency to inactivate inodes > quickly. > Ok, I think we're talking about slightly different things. What I mean above is that if a task removes a file and goes off doing unrelated $work, that inode will just sit on the percpu queue indefinitely. That's fine, as there's no functional need for us to process it immediately unless we're around -ENOSPC thresholds or some such that demand reclaim of the inode. It sounds like what you're talking about is specifically the behavior/performance of sustained file removal (which is important obviously), where apparently there is a notable degradation if the queues become deep enough to push the inode batches out of CPU cache. So that makes sense... > Darrick made the original assumption that we could delay > inactivation indefinitely and so he allowed really deep queues of up > to 64k deferred inactivations. But with queues this deep, we could > never get that background inactivation code to perform anywhere near > the original synchronous background inactivation code. e.g. I > measured 60-70% performance degradataions on my scalability tests, > and nothing stood out in the profiles until I started looking at > CPU data cache misses. > ... but could you elaborate on the scalability tests involved here so I can get a better sense of it in practice and perhaps observe the impact of changes in this path? Brian > What we found was that if we don't run the background inactivation > while the inodes are still hot in the CPU cache, the cost of bring > the inodes back into the CPU cache at a later time is extremely > expensive and cannot be avoided. That's where all the performance > was lost and so this is exactly what the current per-cpu background > inactivation implementation avoids. i.e. we have shallow queues, > early throttling and CPU affinity to ensure that the inodes are > processed before they are evicted from the CPU caches and ensure we > don't take a performance hit. > > IOWs, the deferred inactivation queues are designed to minimise > inactivation delay, generally trying to delay inactivation for a > couple of milliseconds at most during typical fast-path > inactivations (i.e. an extent or two per inode needing to be freed, > plus maybe the inode itself). Such inactivations generally take > 50-100us of CPU time each to process, and we try to keep the > inactivation batch size down to 32 inodes... > > > I suspect that result can be further optimized to absorb > > the cost of an rcu delay by deferring the steps that make the inode > > available for reallocation in the first place. > > A typical RCU grace period delay is longer than the latency we > require to keep the inodes hot in cache for efficient background > inactivation. 
We can't move the "we need to RCU delay inactivation" > overhead to the background inactivation code without taking a > global performance hit to the filesystem performance due to the CPU > cache thrashing it will introduce.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
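For reference, a C version of the batched alloc/free test Brian describes porting above. This is a sketch only: the file naming, the fixed 60 second cutoff and the error handling are assumptions, and the numbers quoted earlier in the thread came from the shell script, not from this program.

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char path[4096];
		long total = 0;
		time_t stop;
		int batch;
		const char *dir;

		if (argc != 3) {
			fprintf(stderr, "usage: %s batch dir\n", argv[0]);
			exit(1);
		}
		batch = atoi(argv[1]);		/* files per iteration */
		dir = argv[2];			/* target directory */
		stop = time(NULL) + 60;

		while (time(NULL) < stop) {
			/* allocate a batch of files... */
			for (int i = 0; i < batch; i++) {
				snprintf(path, sizeof(path), "%s/f%d", dir, i);
				int fd = open(path, O_CREAT | O_RDWR, 0644);
				if (fd < 0) {
					perror("open");
					exit(1);
				}
				close(fd);
			}
			/* ...then free them all */
			for (int i = 0; i < batch; i++) {
				snprintf(path, sizeof(path), "%s/f%d", dir, i);
				unlink(path);
			}
			total += batch;
		}
		printf("%d-file batches: %ld files allocated/freed in 60s\n",
		       batch, total);
		return 0;
	}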
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 23:29 ` Brian Foster @ 2022-01-25 0:31 ` Dave Chinner 2022-01-25 14:40 ` Paul E. McKenney 2022-01-25 18:30 ` Brian Foster 0 siblings, 2 replies; 36+ messages in thread From: Dave Chinner @ 2022-01-25 0:31 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > it into something more able to measure the potential side effect / cost > > > of the grace period sync. The test is a single threaded, file alloc/free > > > loop using a variable per iteration batch size. The test runs for ~60s > > > and reports how many total files were allocated/freed in that period > > > with the specified batch size. Note that this particular test ran > > > without any background workload. Results are as follows: > > > > > > files baseline test > > > > > > 1 38480 38437 > > > 4 126055 111080 > > > 8 218299 134469 > > > 16 306619 141968 > > > 32 397909 152267 > > > 64 418603 200875 > > > 128 469077 289365 > > > 256 684117 566016 > > > 512 931328 878933 > > > 1024 1126741 1118891 > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > extremely slow for a single tight open-unlink-close loop. I'd be > > expecting at least ~10,000 alloc/unlink iterations per second, not > > 650/second. > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > <files>' loop. I know there was application overhead because if I > tweaked the script to open an fd directly rather than use touch, the > single file performance jumped up a bit, but it seemed to wash away as I > increased the file count so I kept running it with larger sizes. This > seems off so I'll port it over to C code and see how much the numbers > change. Yeah, using touch/rm becomes fork/exec bound very quickly. You'll find that using "echo > <file>" is much faster than "touch <file>" because it runs a shell built-in operation without fork/exec overhead to create the file. But you can't play tricks like that to replace rm: $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done real 0m2.653s user 0m0.910s sys 0m2.051s $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done real 0m1.260s user 0m0.452s sys 0m0.913s $ time ./open-unlink 1000 /mnt/scratch/foo real 0m0.037s user 0m0.001s sys 0m0.030s $ Note the difference in system time between the three operations - almost all the difference in system CPU time is the overhead of fork/exec to run the touch/rm binaries, not do the filesystem operations.... > > > That's just a test of a quick hack, however. Since there is no real > > > urgency to inactivate an unlinked inode (it has no potential users until > > > it's freed), > > > > On the contrary, there is extreme urgency to inactivate inodes > > quickly. > > > > Ok, I think we're talking about slightly different things. What I mean > above is that if a task removes a file and goes off doing unrelated > $work, that inode will just sit on the percpu queue indefinitely. That's > fine, as there's no functional need for us to process it immediately > unless we're around -ENOSPC thresholds or some such that demand reclaim > of the inode. Yup, an occasional unlink sitting around for a while on an unlinked list isn't going to cause a performance problem. 
Indeed, such workloads are more likely to benefit from the reduced unlink() syscall overhead and won't even notice the increase in background CPU overhead for inactivation of those occasional inodes. > It sounds like what you're talking about is specifically > the behavior/performance of sustained file removal (which is important > obviously), where apparently there is a notable degradation if the > queues become deep enough to push the inode batches out of CPU cache. So > that makes sense... Yup, sustained bulk throughput is where cache residency really matters. And for unlink, sustained unlink workloads are quite common; they often are something people wait for on the command line or make up a performance critical component of a highly concurrent workload so it's pretty important to get this part right. > > Darrick made the original assumption that we could delay > > inactivation indefinitely and so he allowed really deep queues of up > > to 64k deferred inactivations. But with queues this deep, we could > > never get that background inactivation code to perform anywhere near > > the original synchronous background inactivation code. e.g. I > > measured 60-70% performance degradataions on my scalability tests, > > and nothing stood out in the profiles until I started looking at > > CPU data cache misses. > > > > ... but could you elaborate on the scalability tests involved here so I > can get a better sense of it in practice and perhaps observe the impact > of changes in this path? The same concurrent fsmark create/traverse/unlink workloads I've been running for the past decade+ demonstrate it pretty simply. I also saw regressions with dbench (both op latency and throughput) as the client count (concurrency) increased, and with compilebench. I didn't look much further because all the common benchmarks I ran showed perf degradations with arbitrary delays that went away with the current code we have. ISTR that parts of aim7/reaim scalability workloads that the intel zero-day infrastructure runs are quite sensitive to background inactivation delays as well because that's a CPU bound workload and hence any reduction in cache residency results in a reduction of the number of concurrent jobs that can be run. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
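For reference, the open-unlink test program Dave times above was not posted to the thread; a minimal sketch of what such a loop might look like (illustrative only, not the actual test code):

/*
 * open-unlink: repeatedly create and unlink a file without any
 * fork/exec overhead.  Sketch only; the real test was not posted.
 *
 * Usage: ./open-unlink <count> <path>
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long i, count;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <count> <path>\n", argv[0]);
		return 1;
	}
	count = strtol(argv[1], NULL, 10);

	for (i = 0; i < count; i++) {
		/* allocate the inode */
		fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);

		/* and immediately free it again */
		if (unlink(argv[2]) < 0) {
			perror("unlink");
			return 1;
		}
	}
	return 0;
}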
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 0:31 ` Dave Chinner @ 2022-01-25 14:40 ` Paul E. McKenney 2022-01-25 22:36 ` Dave Chinner 2022-01-25 18:30 ` Brian Foster 1 sibling, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-01-25 14:40 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > > it into something more able to measure the potential side effect / cost > > > > of the grace period sync. The test is a single threaded, file alloc/free > > > > loop using a variable per iteration batch size. The test runs for ~60s > > > > and reports how many total files were allocated/freed in that period > > > > with the specified batch size. Note that this particular test ran > > > > without any background workload. Results are as follows: > > > > > > > > files baseline test > > > > > > > > 1 38480 38437 > > > > 4 126055 111080 > > > > 8 218299 134469 > > > > 16 306619 141968 > > > > 32 397909 152267 > > > > 64 418603 200875 > > > > 128 469077 289365 > > > > 256 684117 566016 > > > > 512 931328 878933 > > > > 1024 1126741 1118891 > > > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > > extremely slow for a single tight open-unlink-close loop. I'd be > > > expecting at least ~10,000 alloc/unlink iterations per second, not > > > 650/second. > > > > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > > <files>' loop. I know there was application overhead because if I > > tweaked the script to open an fd directly rather than use touch, the > > single file performance jumped up a bit, but it seemed to wash away as I > > increased the file count so I kept running it with larger sizes. This > > seems off so I'll port it over to C code and see how much the numbers > > change. > > Yeah, using touch/rm becomes fork/exec bound very quickly. You'll > find that using "echo > <file>" is much faster than "touch <file>" > because it runs a shell built-in operation without fork/exec > overhead to create the file. But you can't play tricks like that to > replace rm: > > $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m2.653s > user 0m0.910s > sys 0m2.051s > $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m1.260s > user 0m0.452s > sys 0m0.913s > $ time ./open-unlink 1000 /mnt/scratch/foo > > real 0m0.037s > user 0m0.001s > sys 0m0.030s > $ > > Note the difference in system time between the three operations - > almost all the difference in system CPU time is the overhead of > fork/exec to run the touch/rm binaries, not do the filesystem > operations.... > > > > > That's just a test of a quick hack, however. Since there is no real > > > > urgency to inactivate an unlinked inode (it has no potential users until > > > > it's freed), > > > > > > On the contrary, there is extreme urgency to inactivate inodes > > > quickly. > > > > > > > Ok, I think we're talking about slightly different things. What I mean > > above is that if a task removes a file and goes off doing unrelated > > $work, that inode will just sit on the percpu queue indefinitely. 
That's > > fine, as there's no functional need for us to process it immediately > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > of the inode. > > Yup, an occasional unlink sitting around for a while on an unlinked > list isn't going to cause a performance problem. Indeed, such > workloads are more likely to benefit from the reduced unlink() > syscall overhead and won't even notice the increase in background > CPU overhead for inactivation of those occasional inodes. > > > It sounds like what you're talking about is specifically > > the behavior/performance of sustained file removal (which is important > > obviously), where apparently there is a notable degradation if the > > queues become deep enough to push the inode batches out of CPU cache. So > > that makes sense... > > Yup, sustained bulk throughput is where cache residency really > matters. And for unlink, sustained unlink workloads are quite > common; they often are something people wait for on the command line > or make up a performance critical component of a highly concurrent > workload so it's pretty important to get this part right. > > > > Darrick made the original assumption that we could delay > > > inactivation indefinitely and so he allowed really deep queues of up > > > to 64k deferred inactivations. But with queues this deep, we could > > > never get that background inactivation code to perform anywhere near > > > the original synchronous background inactivation code. e.g. I > > > measured 60-70% performance degradataions on my scalability tests, > > > and nothing stood out in the profiles until I started looking at > > > CPU data cache misses. > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > can get a better sense of it in practice and perhaps observe the impact > > of changes in this path? > > The same conconrrent fsmark create/traverse/unlink workloads I've > been running for the past decade+ demonstrates it pretty simply. I > also saw regressions with dbench (both op latency and throughput) as > the clinet count (concurrency) increased, and with compilebench. I > didn't look much further because all the common benchmarks I ran > showed perf degradations with arbitrary delays that went away with > the current code we have. ISTR that parts of aim7/reaim scalability > workloads that the intel zero-day infrastructure runs are quite > sensitive to background inactivation delays as well because that's a > CPU bound workload and hence any reduction in cache residency > results in a reduction of the number of concurrent jobs that can be > run. Curiosity and all that, but has this work produced any intuition on the sensitivity of the performance/scalability to the delays? As in the effect of microseconds vs. tens of microsecond vs. hundreds of microseconds? Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 14:40 ` Paul E. McKenney @ 2022-01-25 22:36 ` Dave Chinner 2022-01-26 5:29 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-25 22:36 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Brian Foster, linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote: > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > > Ok, I think we're talking about slightly different things. What I mean > > > above is that if a task removes a file and goes off doing unrelated > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > fine, as there's no functional need for us to process it immediately > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > of the inode. > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > list isn't going to cause a performance problem. Indeed, such > > workloads are more likely to benefit from the reduced unlink() > > syscall overhead and won't even notice the increase in background > > CPU overhead for inactivation of those occasional inodes. > > > > > It sounds like what you're talking about is specifically > > > the behavior/performance of sustained file removal (which is important > > > obviously), where apparently there is a notable degradation if the > > > queues become deep enough to push the inode batches out of CPU cache. So > > > that makes sense... > > > > Yup, sustained bulk throughput is where cache residency really > > matters. And for unlink, sustained unlink workloads are quite > > common; they often are something people wait for on the command line > > or make up a performance critical component of a highly concurrent > > workload so it's pretty important to get this part right. > > > > > > Darrick made the original assumption that we could delay > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > never get that background inactivation code to perform anywhere near > > > > the original synchronous background inactivation code. e.g. I > > > > measured 60-70% performance degradataions on my scalability tests, > > > > and nothing stood out in the profiles until I started looking at > > > > CPU data cache misses. > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > can get a better sense of it in practice and perhaps observe the impact > > > of changes in this path? > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > been running for the past decade+ demonstrates it pretty simply. I > > also saw regressions with dbench (both op latency and throughput) as > > the clinet count (concurrency) increased, and with compilebench. I > > didn't look much further because all the common benchmarks I ran > > showed perf degradations with arbitrary delays that went away with > > the current code we have. ISTR that parts of aim7/reaim scalability > > workloads that the intel zero-day infrastructure runs are quite > > sensitive to background inactivation delays as well because that's a > > CPU bound workload and hence any reduction in cache residency > > results in a reduction of the number of concurrent jobs that can be > > run. 
> > Curiosity and all that, but has this work produced any intuition on > the sensitivity of the performance/scalability to the delays? As in > the effect of microseconds vs. tens of microsecond vs. hundreds of > microseconds? Some, yes. The upper delay threshold where performance is measurably impacted is in the order of single digit milliseconds, not microseconds. What I saw was that as the batch processing delay goes beyond ~5ms, IPC starts to fall. The CPU usage profile does not change shape, nor do the proportions of where CPU time is spent change. All I see is that data cache misses go up substantially and IPC drops substantially. If I read my notes correctly, typical change from "fast" to "slow" in IPC was 0.82 to 0.39 and LLC-load-misses from 3% to 12%. The IPC degradation was all done by the time the background batch processing times were longer than a typical scheduler tick (10ms). Now, I've been testing on Xeon CPUs with 36-76MB of L2-L3 caches, so there's a fair amount of data that these can hold. I expect that with smaller caches, the inflection point will be at smaller batch sizes rather than larger ones. Hence while I could have used larger batches for background processing (e.g. 64-128 inodes rather than 32), I chose smaller batch sizes by default so that CPUs with smaller caches are less likely to be adversely affected by the batch size being too large. OTOH, I started to measure noticeable degradation at batch sizes of 256 inodes on my machines, which is why the hard queue limit got set to 256 inodes. Scaling the delay/batch size down towards single inode queuing also resulted in perf degradation. This was largely because of all the extra scheduling overhead that trying to switch between the user task and the kernel worker task for every inode entailed. Context switch rate went from a couple of thousand/sec to over 100,000/s for single inode batches, and performance went backwards in proportion with the amount of CPU then spent on context switches. It also led to increases in buffer lock contention (hence context switches) as both user task and kworker try to access the same buffers... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
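As an aside, a much-simplified sketch of the per-cpu batched inactivation being described here; the structure and function names are illustrative rather than the actual fs/xfs implementation, and the thresholds simply mirror the 32/256 batch sizes Dave mentions:

#include <linux/llist.h>
#include <linux/workqueue.h>

#define EXAMPLE_INODEGC_BATCH	32	/* kick the worker at this depth */
#define EXAMPLE_INODEGC_MAX	256	/* throttle callers beyond this */

struct example_inode {
	struct llist_node	gclist;		/* entry on the per-cpu list */
	unsigned long		destroy_gp;	/* RCU cookie from inactivation */
};

struct example_inodegc {
	struct llist_head	list;	/* lockless per-cpu inode list */
	struct work_struct	work;	/* background inactivation worker */
	unsigned int		items;	/* approximate queue depth */
};

/* Called from inode destroy context to defer the expensive ifree work. */
static void example_inodegc_queue(struct example_inodegc *gc,
				  struct example_inode *ip)
{
	llist_add(&ip->gclist, &gc->list);

	/*
	 * Small batches stay resident in the CPU cache; the worker
	 * resets 'items' when it drains the list (not shown).  Past
	 * EXAMPLE_INODEGC_MAX the caller would block until the worker
	 * catches up.
	 */
	if (++gc->items >= EXAMPLE_INODEGC_BATCH)
		queue_work(system_unbound_wq, &gc->work);
}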
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 22:36 ` Dave Chinner @ 2022-01-26 5:29 ` Paul E. McKenney 2022-01-26 13:21 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-01-26 5:29 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Al Viro, Ian Kent, rcu On Wed, Jan 26, 2022 at 09:36:07AM +1100, Dave Chinner wrote: > On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote: > > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > > > Ok, I think we're talking about slightly different things. What I mean > > > > above is that if a task removes a file and goes off doing unrelated > > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > > fine, as there's no functional need for us to process it immediately > > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > > of the inode. > > > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > > list isn't going to cause a performance problem. Indeed, such > > > workloads are more likely to benefit from the reduced unlink() > > > syscall overhead and won't even notice the increase in background > > > CPU overhead for inactivation of those occasional inodes. > > > > > > > It sounds like what you're talking about is specifically > > > > the behavior/performance of sustained file removal (which is important > > > > obviously), where apparently there is a notable degradation if the > > > > queues become deep enough to push the inode batches out of CPU cache. So > > > > that makes sense... > > > > > > Yup, sustained bulk throughput is where cache residency really > > > matters. And for unlink, sustained unlink workloads are quite > > > common; they often are something people wait for on the command line > > > or make up a performance critical component of a highly concurrent > > > workload so it's pretty important to get this part right. > > > > > > > > Darrick made the original assumption that we could delay > > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > > never get that background inactivation code to perform anywhere near > > > > > the original synchronous background inactivation code. e.g. I > > > > > measured 60-70% performance degradataions on my scalability tests, > > > > > and nothing stood out in the profiles until I started looking at > > > > > CPU data cache misses. > > > > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > > can get a better sense of it in practice and perhaps observe the impact > > > > of changes in this path? > > > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > > been running for the past decade+ demonstrates it pretty simply. I > > > also saw regressions with dbench (both op latency and throughput) as > > > the clinet count (concurrency) increased, and with compilebench. I > > > didn't look much further because all the common benchmarks I ran > > > showed perf degradations with arbitrary delays that went away with > > > the current code we have. 
ISTR that parts of aim7/reaim scalability > > > workloads that the intel zero-day infrastructure runs are quite > > > sensitive to background inactivation delays as well because that's a > > > CPU bound workload and hence any reduction in cache residency > > > results in a reduction of the number of concurrent jobs that can be > > > run. > > > > Curiosity and all that, but has this work produced any intuition on > > the sensitivity of the performance/scalability to the delays? As in > > the effect of microseconds vs. tens of microsecond vs. hundreds of > > microseconds? > > Some, yes. > > The upper delay threshold where performance is measurably > impacted is in the order of single digit milliseconds, not > microseconds. > > What I saw was that as the batch processing delay goes beyond ~5ms, > IPC starts to fall. The CPU usage profile does not change shape, nor > does the proportions of where CPU time is spent change. All I see if > data cache misses go up substantially and IPC drop substantially. If > I read my notes correctly, typical change from "fast" to "slow" in > IPC was 0.82 to 0.39 and LLC-load-misses from 3% to 12%. The IPC > degradation was all done by the time the background batch processing > times were longer than a typical scheduler tick (10ms). > > Now, I've been testing on Xeon CPUs with 36-76MB of l2-l3 caches, so > there's a fair amount of data that these can hold. I expect that > with smaller caches, the inflection point will be at smaller batch > sizes rather than more. Hence while I could have used larger batches > for background processing (e.g. 64-128 inodes rather than 32), I > chose smaller batch sizes by default so that CPUs with smaller > caches are less likely to be adversely affected by the batch size > being too large. OTOH, I started to measure noticable degradation by > batch sizes of 256 inodes on my machines, which is why the hard > queue limit got set to 256 inodes. > > Scaling the delay/batch size down towards single inode queuing also > resulted in perf degradation. This was largely because of all the > extra scheduling overhead that trying to switching between user task > and kernel worker task for every inode entailed. Context switch rate > went from a couple of thousand/sec to over 100,000/s for single > inode batches, and performance went backwards in proportion with the > amount of CPU then spent on context switches. It also lead to > increases in buffer lock contention (hence context switches) as both > user task and kworker try to access the same buffers... Makes sense. Never a guarantee of easy answers. ;-) If it would help, I could create expedited-grace-period counterparts of get_state_synchronize_rcu(), start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and cond_synchronize_rcu(). These would provide sub-millisecond grace periods, in fact, sub-100-microsecond grace periods on smaller systems. Of course, nothing comes for free. Although expedited grace periods are way way cheaper than they used to be, they still IPI non-idle non-nohz_full-userspace CPUs, which translates to roughly the CPU overhead of a wakeup on each IPIed CPU. And of course disruption to aggressive non-nohz_full real-time applications. Shorter latencies also translates to fewer updates over which to amortize grace-period overhead. But it should get well under your single-digit milliseconds of delay. Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
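For readers following along, the non-expedited polled grace-period calls already exist and fit the patch's scheme roughly as sketched below; the destroy_gp field name is illustrative, and the expedited counterparts Paul offers would slot into the same call sites:

#include <linux/rcupdate.h>

/* Inactivation side: record (and if needed start) a grace period. */
static void example_record_gp(struct example_inode *ip)
{
	ip->destroy_gp = start_poll_synchronize_rcu();
}

/* Allocation side: prefer inodes whose grace period has already elapsed. */
static bool example_can_recycle(struct example_inode *ip)
{
	return poll_state_synchronize_rcu(ip->destroy_gp);
}

/* If no other free inode is available, wait out whatever remains. */
static void example_force_recycle(struct example_inode *ip)
{
	cond_synchronize_rcu(ip->destroy_gp);
}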
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-26 5:29 ` Paul E. McKenney @ 2022-01-26 13:21 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-26 13:21 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 09:29:10PM -0800, Paul E. McKenney wrote: > On Wed, Jan 26, 2022 at 09:36:07AM +1100, Dave Chinner wrote: > > On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote: > > > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > > > > Ok, I think we're talking about slightly different things. What I mean > > > > > above is that if a task removes a file and goes off doing unrelated > > > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > > > fine, as there's no functional need for us to process it immediately > > > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > > > of the inode. > > > > > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > > > list isn't going to cause a performance problem. Indeed, such > > > > workloads are more likely to benefit from the reduced unlink() > > > > syscall overhead and won't even notice the increase in background > > > > CPU overhead for inactivation of those occasional inodes. > > > > > > > > > It sounds like what you're talking about is specifically > > > > > the behavior/performance of sustained file removal (which is important > > > > > obviously), where apparently there is a notable degradation if the > > > > > queues become deep enough to push the inode batches out of CPU cache. So > > > > > that makes sense... > > > > > > > > Yup, sustained bulk throughput is where cache residency really > > > > matters. And for unlink, sustained unlink workloads are quite > > > > common; they often are something people wait for on the command line > > > > or make up a performance critical component of a highly concurrent > > > > workload so it's pretty important to get this part right. > > > > > > > > > > Darrick made the original assumption that we could delay > > > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > > > never get that background inactivation code to perform anywhere near > > > > > > the original synchronous background inactivation code. e.g. I > > > > > > measured 60-70% performance degradataions on my scalability tests, > > > > > > and nothing stood out in the profiles until I started looking at > > > > > > CPU data cache misses. > > > > > > > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > > > can get a better sense of it in practice and perhaps observe the impact > > > > > of changes in this path? > > > > > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > > > been running for the past decade+ demonstrates it pretty simply. I > > > > also saw regressions with dbench (both op latency and throughput) as > > > > the clinet count (concurrency) increased, and with compilebench. I > > > > didn't look much further because all the common benchmarks I ran > > > > showed perf degradations with arbitrary delays that went away with > > > > the current code we have. 
ISTR that parts of aim7/reaim scalability > > > > workloads that the intel zero-day infrastructure runs are quite > > > > sensitive to background inactivation delays as well because that's a > > > > CPU bound workload and hence any reduction in cache residency > > > > results in a reduction of the number of concurrent jobs that can be > > > > run. > > > > > > Curiosity and all that, but has this work produced any intuition on > > > the sensitivity of the performance/scalability to the delays? As in > > > the effect of microseconds vs. tens of microsecond vs. hundreds of > > > microseconds? > > > > Some, yes. > > > > The upper delay threshold where performance is measurably > > impacted is in the order of single digit milliseconds, not > > microseconds. > > > > What I saw was that as the batch processing delay goes beyond ~5ms, > > IPC starts to fall. The CPU usage profile does not change shape, nor > > does the proportions of where CPU time is spent change. All I see if > > data cache misses go up substantially and IPC drop substantially. If > > I read my notes correctly, typical change from "fast" to "slow" in > > IPC was 0.82 to 0.39 and LLC-load-misses from 3% to 12%. The IPC > > degradation was all done by the time the background batch processing > > times were longer than a typical scheduler tick (10ms). > > > > Now, I've been testing on Xeon CPUs with 36-76MB of l2-l3 caches, so > > there's a fair amount of data that these can hold. I expect that > > with smaller caches, the inflection point will be at smaller batch > > sizes rather than more. Hence while I could have used larger batches > > for background processing (e.g. 64-128 inodes rather than 32), I > > chose smaller batch sizes by default so that CPUs with smaller > > caches are less likely to be adversely affected by the batch size > > being too large. OTOH, I started to measure noticable degradation by > > batch sizes of 256 inodes on my machines, which is why the hard > > queue limit got set to 256 inodes. > > > > Scaling the delay/batch size down towards single inode queuing also > > resulted in perf degradation. This was largely because of all the > > extra scheduling overhead that trying to switching between user task > > and kernel worker task for every inode entailed. Context switch rate > > went from a couple of thousand/sec to over 100,000/s for single > > inode batches, and performance went backwards in proportion with the > > amount of CPU then spent on context switches. It also lead to > > increases in buffer lock contention (hence context switches) as both > > user task and kworker try to access the same buffers... > > Makes sense. Never a guarantee of easy answers. ;-) > > If it would help, I could create expedited-grace-period counterparts > of get_state_synchronize_rcu(), start_poll_synchronize_rcu(), > poll_state_synchronize_rcu(), and cond_synchronize_rcu(). These would > provide sub-millisecond grace periods, in fact, sub-100-microsecond > grace periods on smaller systems. > If you have something with enough basic functionality, I'd be interested in converting this patch over to an expedited variant to run some tests/experiments. As it is, it seems the current approach is kind of playing wack-a-mole between disrupting allocation performance by populating the free inode pool with too many free but "pending rcu grace period" inodes and sustained remove performance by pushing the internal inactivation queues too deep and thus losing CPU cache, as Dave describes above. 
So if an expedited grace period is possible that fits within the time window on paper, it certainly seems worthwhile to test. Otherwise the only thing that comes to mind right now is to start playing around with the physical inode allocation algorithm to avoid such pending inodes. I think a scanning approach may ultimately run into the same problems with the right workload (i.e. such that all free inodes are pending), so I suspect what this really means is either figuring a nice enough way to efficiently locate expired inodes (maybe via our own internal rcu callback to explicitly tag now expired inodes as good allocation candidates), or to determine when to proceed with inode chunk allocations when scanning is unlikely to succeed, or something similar along those general lines.. > Of course, nothing comes for free. Although expedited grace periods > are way way cheaper than they used to be, they still IPI non-idle > non-nohz_full-userspace CPUs, which translates to roughly the CPU overhead > of a wakeup on each IPIed CPU. And of course disruption to aggressive > non-nohz_full real-time applications. Shorter latencies also translates > to fewer updates over which to amortize grace-period overhead. > > But it should get well under your single-digit milliseconds of delay. > If the expedited variant were sufficient for the fast path case, I suppose it might be interesting to see if we could throttle down to non-expedited variants either based on heuristic or feedback from allocation side stalls. Brian > Thanx, Paul > ^ permalink raw reply [flat|nested] 36+ messages in thread
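A rough sketch of the "tag expired inodes via our own RCU callback" idea Brian floats above; all names here are illustrative, and the dedicated rcu_head and flag bit do not exist in the current code:

#include <linux/bitops.h>
#include <linux/rcupdate.h>

#define EXAMPLE_IRECYCLE_SAFE	0	/* grace period has elapsed */

struct example_tagged_inode {
	unsigned long	state;
	struct rcu_head	destroy_rcu;
};

/* Runs once a full grace period has elapsed for this inode. */
static void example_inode_gp_expired(struct rcu_head *head)
{
	struct example_tagged_inode *ip =
		container_of(head, struct example_tagged_inode, destroy_rcu);

	/* the allocator may now hand this inode out without waiting */
	set_bit(EXAMPLE_IRECYCLE_SAFE, &ip->state);
}

/* Called when the inode is queued for inactivation. */
static void example_inode_inactivated(struct example_tagged_inode *ip)
{
	clear_bit(EXAMPLE_IRECYCLE_SAFE, &ip->state);
	call_rcu(&ip->destroy_rcu, example_inode_gp_expired);
}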
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 0:31 ` Dave Chinner 2022-01-25 14:40 ` Paul E. McKenney @ 2022-01-25 18:30 ` Brian Foster 2022-01-25 20:07 ` Brian Foster 2022-01-25 22:45 ` Dave Chinner 1 sibling, 2 replies; 36+ messages in thread From: Brian Foster @ 2022-01-25 18:30 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > > it into something more able to measure the potential side effect / cost > > > > of the grace period sync. The test is a single threaded, file alloc/free > > > > loop using a variable per iteration batch size. The test runs for ~60s > > > > and reports how many total files were allocated/freed in that period > > > > with the specified batch size. Note that this particular test ran > > > > without any background workload. Results are as follows: > > > > > > > > files baseline test > > > > > > > > 1 38480 38437 > > > > 4 126055 111080 > > > > 8 218299 134469 > > > > 16 306619 141968 > > > > 32 397909 152267 > > > > 64 418603 200875 > > > > 128 469077 289365 > > > > 256 684117 566016 > > > > 512 931328 878933 > > > > 1024 1126741 1118891 > > > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > > extremely slow for a single tight open-unlink-close loop. I'd be > > > expecting at least ~10,000 alloc/unlink iterations per second, not > > > 650/second. > > > > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > > <files>' loop. I know there was application overhead because if I > > tweaked the script to open an fd directly rather than use touch, the > > single file performance jumped up a bit, but it seemed to wash away as I > > increased the file count so I kept running it with larger sizes. This > > seems off so I'll port it over to C code and see how much the numbers > > change. > > Yeah, using touch/rm becomes fork/exec bound very quickly. You'll > find that using "echo > <file>" is much faster than "touch <file>" > because it runs a shell built-in operation without fork/exec > overhead to create the file. But you can't play tricks like that to > replace rm: > I had used 'exec' to open an fd (same idea) in the single file case and tested with that, saw that the increase was consistent and took that along with the increasing performance as batch sizes increased to mean that the application overhead wasn't a factor as the test scaled. That was clearly wrong, because if I port the whole thing to a C program the baseline numbers are way off. I think what also threw me off is that the single file test kernel case is actually fairly accurate between the two tests. Anyways, here's a series of (single run, no averaging, etc.) test runs with the updated test. Note that I reduced the runtime to 10s here since the test was running so much faster. 
Otherwise this is the same batched open/close -> unlink behavior:

                 baseline           test
batch: 1      files: 893579     files: 41841
batch: 2      files: 912502     files: 41922
batch: 4      files: 930424     files: 42084
batch: 8      files: 932072     files: 41536
batch: 16     files: 930624     files: 41616
batch: 32     files: 777088     files: 41120
batch: 64     files: 567936     files: 57216
batch: 128    files: 579840     files: 96256
batch: 256    files: 548608     files: 174080
batch: 512    files: 546816     files: 246784
batch: 1024   files: 509952     files: 328704
batch: 2048   files: 505856     files: 399360
batch: 4096   files: 479232     files: 438272

So this shows that the performance delta is actually massive from the start. For reference, a single threaded, empty file, non syncing, fs_mark workload stabilizes at around ~55k files/sec on this fs. Both kernels sort of converge to that rate as the batch size increases, only the baseline kernel starts much faster and normalizes while the test kernel starts much slower and improves (and still really doesn't hit the mark even at a 4k batch size). My takeaway from this is that we may need to find a way to mitigate this overhead somewhat better than what the current patch does. Otherwise, this is a significant dropoff from even a pure allocation workload in simple mixed workload scenarios... > $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m2.653s > user 0m0.910s > sys 0m2.051s > $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m1.260s > user 0m0.452s > sys 0m0.913s > $ time ./open-unlink 1000 /mnt/scratch/foo > > real 0m0.037s > user 0m0.001s > sys 0m0.030s > $ > > Note the difference in system time between the three operations - > almost all the difference in system CPU time is the overhead of > fork/exec to run the touch/rm binaries, not do the filesystem > operations.... > > > > > That's just a test of a quick hack, however. Since there is no real > > > > urgency to inactivate an unlinked inode (it has no potential users until > > > > it's freed), > > > > > > On the contrary, there is extreme urgency to inactivate inodes > > > quickly. > > > > > > > Ok, I think we're talking about slightly different things. What I mean > > above is that if a task removes a file and goes off doing unrelated > > $work, that inode will just sit on the percpu queue indefinitely. That's > > fine, as there's no functional need for us to process it immediately > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > of the inode. > > Yup, an occasional unlink sitting around for a while on an unlinked > list isn't going to cause a performance problem. Indeed, such > workloads are more likely to benefit from the reduced unlink() > syscall overhead and won't even notice the increase in background > CPU overhead for inactivation of those occasional inodes. > > > It sounds like what you're talking about is specifically > > the behavior/performance of sustained file removal (which is important > > obviously), where apparently there is a notable degradation if the > > queues become deep enough to push the inode batches out of CPU cache. So > > that makes sense... > > Yup, sustained bulk throughput is where cache residency really > matters. And for unlink, sustained unlink workloads are quite > common; they often are something people wait for on the command line > or make up a performance critical component of a highly concurrent > workload so it's pretty important to get this part right.
> > > > Darrick made the original assumption that we could delay > > > inactivation indefinitely and so he allowed really deep queues of up > > > to 64k deferred inactivations. But with queues this deep, we could > > > never get that background inactivation code to perform anywhere near > > > the original synchronous background inactivation code. e.g. I > > > measured 60-70% performance degradataions on my scalability tests, > > > and nothing stood out in the profiles until I started looking at > > > CPU data cache misses. > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > can get a better sense of it in practice and perhaps observe the impact > > of changes in this path? > > The same conconrrent fsmark create/traverse/unlink workloads I've > been running for the past decade+ demonstrates it pretty simply. I > also saw regressions with dbench (both op latency and throughput) as > the clinet count (concurrency) increased, and with compilebench. I > didn't look much further because all the common benchmarks I ran > showed perf degradations with arbitrary delays that went away with > the current code we have. ISTR that parts of aim7/reaim scalability > workloads that the intel zero-day infrastructure runs are quite > sensitive to background inactivation delays as well because that's a > CPU bound workload and hence any reduction in cache residency > results in a reduction of the number of concurrent jobs that can be > run. > Ok, so if I (single threaded) create (via fs_mark), sync and remove 5m empty files, the remove takes about a minute. If I just bump out the current queue and block thresholds by 10x and repeat, that time increases to about ~1m24s. If I hack up a kernel to disable queueing entirely (i.e. fully synchronous inactivation), then I'm back to about a minute again. So I'm not producing any performance benefit with queueing/batching in this single threaded scenario, but I suspect the 10x threshold delta is at least measuring the negative effect of poor caching..? (Any decent way to confirm that..?). And of course if I take the baseline kernel and stick a cond_synchronize_rcu() in xfs_inactive_ifree() it brings the batch test numbers right back but slows the removal test way down. What I find interesting however is that if I hack up something more mild like invoke cond_synchronize_rcu() on the oldest inode in the current inactivation batch, bump out the blocking threshold as above (but leave the queueing threshold at 32), and leave the iget side cond_sync_rcu() to catch whatever falls through, my 5m file remove test now completes ~5-10s faster than baseline and I see the following results from the batched alloc/free test: batch: 1 files: 731923 batch: 2 files: 693020 batch: 4 files: 750948 batch: 8 files: 743296 batch: 16 files: 738720 batch: 32 files: 746240 batch: 64 files: 598464 batch: 128 files: 672896 batch: 256 files: 633856 batch: 512 files: 605184 batch: 1024 files: 569344 batch: 2048 files: 555008 batch: 4096 files: 524288 Hm? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
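A sketch of the experiment described above, reusing the illustrative example_inodegc/example_inode types from the earlier sketches in this thread; the actual hack was not posted, so details are assumptions:

/*
 * One conditional RCU wait per inactivation batch, keyed off the
 * oldest inode in the batch.  Anything newer that slips through is
 * still caught by the existing check on the iget/recycle side.
 */
static void example_inodegc_worker(struct work_struct *work)
{
	struct example_inodegc *gc =
		container_of(work, struct example_inodegc, work);
	struct llist_node *list = llist_del_all(&gc->list);
	struct example_inode *ip, *next;
	unsigned long oldest_gp = 0;

	if (!list)
		return;

	/*
	 * llist_del_all() hands back entries newest-first, so the last
	 * entry is the oldest queued inode; remember its GP cookie.
	 */
	llist_for_each_entry(ip, list, gclist)
		oldest_gp = ip->destroy_gp;

	/* Only blocks if the oldest grace period has not yet elapsed. */
	cond_synchronize_rcu(oldest_gp);

	llist_for_each_entry_safe(ip, next, list, gclist)
		example_inactivate(ip);	/* placeholder for the real ifree work */
}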
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 18:30 ` Brian Foster @ 2022-01-25 20:07 ` Brian Foster 2022-01-25 22:45 ` Dave Chinner 1 sibling, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-25 20:07 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 01:30:36PM -0500, Brian Foster wrote: > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > > > it into something more able to measure the potential side effect / cost > > > > > of the grace period sync. The test is a single threaded, file alloc/free > > > > > loop using a variable per iteration batch size. The test runs for ~60s > > > > > and reports how many total files were allocated/freed in that period > > > > > with the specified batch size. Note that this particular test ran > > > > > without any background workload. Results are as follows: > > > > > > > > > > files baseline test > > > > > > > > > > 1 38480 38437 > > > > > 4 126055 111080 > > > > > 8 218299 134469 > > > > > 16 306619 141968 > > > > > 32 397909 152267 > > > > > 64 418603 200875 > > > > > 128 469077 289365 > > > > > 256 684117 566016 > > > > > 512 931328 878933 > > > > > 1024 1126741 1118891 > > > > > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > > > extremely slow for a single tight open-unlink-close loop. I'd be > > > > expecting at least ~10,000 alloc/unlink iterations per second, not > > > > 650/second. > > > > > > > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > > > <files>' loop. I know there was application overhead because if I > > > tweaked the script to open an fd directly rather than use touch, the > > > single file performance jumped up a bit, but it seemed to wash away as I > > > increased the file count so I kept running it with larger sizes. This > > > seems off so I'll port it over to C code and see how much the numbers > > > change. > > > > Yeah, using touch/rm becomes fork/exec bound very quickly. You'll > > find that using "echo > <file>" is much faster than "touch <file>" > > because it runs a shell built-in operation without fork/exec > > overhead to create the file. But you can't play tricks like that to > > replace rm: > > > > I had used 'exec' to open an fd (same idea) in the single file case and > tested with that, saw that the increase was consistent and took that > along with the increasing performance as batch sizes increased to mean > that the application overhead wasn't a factor as the test scaled. That > was clearly wrong, because if I port the whole thing to a C program the > baseline numbers are way off. I think what also threw me off is that the > single file test kernel case is actually fairly accurate between the two > tests. Anyways, here's a series of (single run, no averaging, etc.) test > runs with the updated test. Note that I reduced the runtime to 10s here > since the test was running so much faster. 
Otherwise this is the same > batched open/close -> unlink behavior: > > baseline test > batch: 1 files: 893579 files: 41841 > batch: 2 files: 912502 files: 41922 > batch: 4 files: 930424 files: 42084 > batch: 8 files: 932072 files: 41536 > batch: 16 files: 930624 files: 41616 > batch: 32 files: 777088 files: 41120 > batch: 64 files: 567936 files: 57216 > batch: 128 files: 579840 files: 96256 > batch: 256 files: 548608 files: 174080 > batch: 512 files: 546816 files: 246784 > batch: 1024 files: 509952 files: 328704 > batch: 2048 files: 505856 files: 399360 > batch: 4096 files: 479232 files: 438272 > > So this shows that the performance delta is actually massive from the > start. For reference, a single threaded, empty file, non syncing, > fs_mark workload stabilizes at around ~55k files/sec on this fs. Both > kernels sort of converge to that rate as the batch size increases, only > the baseline kernel starts much faster and normalizes while the test > kernel starts much slower and improves (and still really doesn't hit the > mark even at a 4k batch size). > > My takeaway from this is that we may need to find a way to mitigate this > overhead somewhat better than what the current patch does. Otherwise, > this is a significant dropoff from even a pure allocation workload in > simple mixed workload scenarios... > > > $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > > > real 0m2.653s > > user 0m0.910s > > sys 0m2.051s > > $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > > > real 0m1.260s > > user 0m0.452s > > sys 0m0.913s > > $ time ./open-unlink 1000 /mnt/scratch/foo > > > > real 0m0.037s > > user 0m0.001s > > sys 0m0.030s > > $ > > > > Note the difference in system time between the three operations - > > almost all the difference in system CPU time is the overhead of > > fork/exec to run the touch/rm binaries, not do the filesystem > > operations.... > > > > > > > That's just a test of a quick hack, however. Since there is no real > > > > > urgency to inactivate an unlinked inode (it has no potential users until > > > > > it's freed), > > > > > > > > On the contrary, there is extreme urgency to inactivate inodes > > > > quickly. > > > > > > > > > > Ok, I think we're talking about slightly different things. What I mean > > > above is that if a task removes a file and goes off doing unrelated > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > fine, as there's no functional need for us to process it immediately > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > of the inode. > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > list isn't going to cause a performance problem. Indeed, such > > workloads are more likely to benefit from the reduced unlink() > > syscall overhead and won't even notice the increase in background > > CPU overhead for inactivation of those occasional inodes. > > > > > It sounds like what you're talking about is specifically > > > the behavior/performance of sustained file removal (which is important > > > obviously), where apparently there is a notable degradation if the > > > queues become deep enough to push the inode batches out of CPU cache. So > > > that makes sense... > > > > Yup, sustained bulk throughput is where cache residency really > > matters. 
And for unlink, sustained unlink workloads are quite > > common; they often are something people wait for on the command line > > or make up a performance critical component of a highly concurrent > > workload so it's pretty important to get this part right. > > > > > > Darrick made the original assumption that we could delay > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > never get that background inactivation code to perform anywhere near > > > > the original synchronous background inactivation code. e.g. I > > > > measured 60-70% performance degradataions on my scalability tests, > > > > and nothing stood out in the profiles until I started looking at > > > > CPU data cache misses. > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > can get a better sense of it in practice and perhaps observe the impact > > > of changes in this path? > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > been running for the past decade+ demonstrates it pretty simply. I > > also saw regressions with dbench (both op latency and throughput) as > > the clinet count (concurrency) increased, and with compilebench. I > > didn't look much further because all the common benchmarks I ran > > showed perf degradations with arbitrary delays that went away with > > the current code we have. ISTR that parts of aim7/reaim scalability > > workloads that the intel zero-day infrastructure runs are quite > > sensitive to background inactivation delays as well because that's a > > CPU bound workload and hence any reduction in cache residency > > results in a reduction of the number of concurrent jobs that can be > > run. > > > > Ok, so if I (single threaded) create (via fs_mark), sync and remove 5m > empty files, the remove takes about a minute. If I just bump out the > current queue and block thresholds by 10x and repeat, that time > increases to about ~1m24s. If I hack up a kernel to disable queueing > entirely (i.e. fully synchronous inactivation), then I'm back to about a > minute again. So I'm not producing any performance benefit with > queueing/batching in this single threaded scenario, but I suspect the > 10x threshold delta is at least measuring the negative effect of poor > caching..? (Any decent way to confirm that..?). > > And of course if I take the baseline kernel and stick a > cond_synchronize_rcu() in xfs_inactive_ifree() it brings the batch test > numbers right back but slows the removal test way down. What I find > interesting however is that if I hack up something more mild like invoke > cond_synchronize_rcu() on the oldest inode in the current inactivation > batch, bump out the blocking threshold as above (but leave the queueing > threshold at 32), and leave the iget side cond_sync_rcu() to catch > whatever falls through, my 5m file remove test now completes ~5-10s > faster than baseline and I see the following results from the batched > alloc/free test: > > batch: 1 files: 731923 > batch: 2 files: 693020 > batch: 4 files: 750948 > batch: 8 files: 743296 > batch: 16 files: 738720 > batch: 32 files: 746240 > batch: 64 files: 598464 > batch: 128 files: 672896 > batch: 256 files: 633856 > batch: 512 files: 605184 > batch: 1024 files: 569344 > batch: 2048 files: 555008 > batch: 4096 files: 524288 > This experiment had a bug that was dropping some inactivations on the floor. With that fixed, the numbers aren't quite as good. 
The batch test numbers still improve significantly from the posted patch (i.e. up in the range of 38-45k files/sec), but still lag the normal allocation rate, and the large rm test goes up to 1m40s (instead of 1m on baseline). Brian > Hm? > > Brian > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 18:30 ` Brian Foster 2022-01-25 20:07 ` Brian Foster @ 2022-01-25 22:45 ` Dave Chinner 2022-01-27 4:19 ` Al Viro 1 sibling, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-25 22:45 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 01:30:36PM -0500, Brian Foster wrote: > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > ... but could you elaborate on the scalability tests involved here so I > > > can get a better sense of it in practice and perhaps observe the impact > > > of changes in this path? > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > been running for the past decade+ demonstrates it pretty simply. I > > also saw regressions with dbench (both op latency and throughput) as > > the clinet count (concurrency) increased, and with compilebench. I > > didn't look much further because all the common benchmarks I ran > > showed perf degradations with arbitrary delays that went away with > > the current code we have. ISTR that parts of aim7/reaim scalability > > workloads that the intel zero-day infrastructure runs are quite > > sensitive to background inactivation delays as well because that's a > > CPU bound workload and hence any reduction in cache residency > > results in a reduction of the number of concurrent jobs that can be > > run. > > > > Ok, so if I (single threaded) create (via fs_mark), sync and remove 5m > empty files, the remove takes about a minute. If I just bump out the > current queue and block thresholds by 10x and repeat, that time > increases to about ~1m24s. If I hack up a kernel to disable queueing > entirely (i.e. fully synchronous inactivation), then I'm back to about a > minute again. So I'm not producing any performance benefit with > queueing/batching in this single threaded scenario, but I suspect the > 10x threshold delta is at least measuring the negative effect of poor > caching..? (Any decent way to confirm that..?). Right, background inactivation does not improve performance - it's necessary to get the transactions out of the evict() path. All we wanted was to ensure that there were no performance degradations as a result of background inactivation, not that it was faster. If you want to confirm that there is an increase in cold cache access when the batch size is increased, cpu profiles with 'perf top'/'perf record/report' and CPU cache performance metric reporting via 'perf stat -dddd' are your friend. See elsewhere in the thread where I mention those things to Paul. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 22:45 ` Dave Chinner @ 2022-01-27 4:19 ` Al Viro 2022-01-27 5:26 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Al Viro @ 2022-01-27 4:19 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Ian Kent, rcu On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > Right, background inactivation does not improve performance - it's > necessary to get the transactions out of the evict() path. All we > wanted was to ensure that there were no performance degradations as > a result of background inactivation, not that it was faster. > > If you want to confirm that there is an increase in cold cache > access when the batch size is increased, cpu profiles with 'perf > top'/'perf record/report' and CPU cache performance metric reporting > via 'perf stat -dddd' are your friend. See elsewhere in the thread > where I mention those things to Paul. Dave, do you see a plausible way to eventually drop Ian's bandaid? I'm not asking for that to happen this cycle and for backports Ian's patch is obviously fine. What I really want to avoid is the situation when we are stuck with keeping that bandaid in fs/namei.c, since all ways to avoid seeing reused inodes would hurt XFS too badly. And the benchmarks in this thread do look like that. Are there any realistic prospects of having xfs_iget() deal with reuse case by allocating new in-core inode and flipping whatever references you've got in XFS journalling data structures to the new copy? If I understood what you said on IRC correctly, that is... Again, I'm not asking if it can be done this cycle; having a realistic path to doing that eventually would be fine by me. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 4:19 ` Al Viro @ 2022-01-27 5:26 ` Dave Chinner 2022-01-27 19:01 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-27 5:26 UTC (permalink / raw) To: Al Viro; +Cc: Brian Foster, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > Right, background inactivation does not improve performance - it's > > necessary to get the transactions out of the evict() path. All we > > wanted was to ensure that there were no performance degradations as > > a result of background inactivation, not that it was faster. > > > > If you want to confirm that there is an increase in cold cache > > access when the batch size is increased, cpu profiles with 'perf > > top'/'perf record/report' and CPU cache performance metric reporting > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > where I mention those things to Paul. > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > I'm not asking for that to happen this cycle and for backports Ian's > patch is obviously fine. Yes, but not in the near term. > What I really want to avoid is the situation when we are stuck with > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > reused inodes would hurt XFS too badly. And the benchmarks in this > thread do look like that. The simplest way I think is to have the XFS inode allocation track "busy inodes" in the same way we track "busy extents". A busy extent is an extent that has been freed by the user, but is not yet marked free in the journal/on disk. If we try to reallocate that busy extent, we either select a different free extent to allocate, or if we can't find any we force the journal to disk, wait for it to complete (hence unbusying the extents) and retry the allocation again. We can do something similar for inode allocation - it's actually a lockless tag lookup on the radix tree entry for the candidate inode number. If we find the reclaimable radix tree tag set, then we select a different inode. If we can't allocate a new inode, then we kick synchronize_rcu() and retry the allocation, allowing inodes to be recycled this time. > Are there any realistic prospects of having xfs_iget() deal with > reuse case by allocating new in-core inode and flipping whatever > references you've got in XFS journalling data structures to the > new copy? If I understood what you said on IRC correctly, that is... That's ... much harder. One of the problems is that once an inode has a log item attached to it, it assumes that it can be accessed without specific locking, etc.; see xfs_inode_clean(), for example. So there's some life-cycle stuff that needs to be taken care of in XFS first, and the inode <-> log item relationship is tangled. I've been working towards removing that tangle - but that stuff is quite a distance down my logging rework patch queue. That queue has been stuck now for a year trying to get the first handful of rework and scalability modifications reviewed and merged, so I'm not holding my breath as to how long a more substantial rework of internal logging code will take to review and merge. Really, though, we need the inactivation stuff to be done as part of the VFS inode lifecycle.
I have some ideas on what to do here, but I suspect we'll need some changes to iput_final()/evict() to allow us to process final unlinks in the background and then call evict() ourselves when the unlink completes. That way ->destroy_inode() can just call xfs_reclaim_inode() to free it directly, which also helps us get rid of background inode freeing and hence inode recycling from XFS altogether. I think we _might_ be able to do this without needing to change any of the logging code in XFS, but I haven't looked any further than this into it as yet. > Again, I'm not asking if it can be done this cycle; having a > realistic path to doing that eventually would be fine by me. We're talking a year at least, probably two, before we get there... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
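[A minimal sketch of the allocation-time "busy inode" probe Dave describes in the mail above. This is not actual XFS code: xfs_inode_is_busy() is a hypothetical helper, though the per-AG radix tree, the reclaim tag and the macros it touches are the existing interfaces being referred to.]

/*
 * Sketch: is this candidate inode number still "busy", i.e. does it
 * still have an in-core inode sitting in the reclaimable state?  A
 * lockless gang tag lookup on the per-AG inode cache radix tree
 * answers that without touching the inode itself in the common case.
 */
static bool
xfs_inode_is_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	struct xfs_inode	*ip;
	bool			busy = false;

	rcu_read_lock();
	if (radix_tree_gang_lookup_tag(&pag->pag_ici_root, (void **)&ip,
			agino, 1, XFS_ICI_RECLAIM_TAG))
		busy = (XFS_INO_TO_AGINO(pag->pag_mount, ip->i_ino) == agino);
	rcu_read_unlock();
	return busy;
}

[A caller in the allocation path would skip candidates that probe busy; only if nothing else can be found would it fall back to synchronize_rcu() and recycle, as described above.]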
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 5:26 ` Dave Chinner @ 2022-01-27 19:01 ` Brian Foster 2022-01-27 22:18 ` Dave Chinner 2022-01-28 21:39 ` Paul E. McKenney 0 siblings, 2 replies; 36+ messages in thread From: Brian Foster @ 2022-01-27 19:01 UTC (permalink / raw) To: Dave Chinner; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > Right, background inactivation does not improve performance - it's > > > necessary to get the transactions out of the evict() path. All we > > > wanted was to ensure that there were no performance degradations as > > > a result of background inactivation, not that it was faster. > > > > > > If you want to confirm that there is an increase in cold cache > > > access when the batch size is increased, cpu profiles with 'perf > > > top'/'perf record/report' and CPU cache performance metric reporting > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > where I mention those things to Paul. > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > I'm not asking for that to happen this cycle and for backports Ian's > > patch is obviously fine. > > Yes, but not in the near term. > > > What I really want to avoid is the situation when we are stuck with > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > reused inodes would hurt XFS too badly. And the benchmarks in this > > thread do look like that. > > The simplest way I think is to have the XFS inode allocation track > "busy inodes" in the same way we track "busy extents". A busy extent > is an extent that has been freed by the user, but is not yet marked > free in the journal/on disk. If we try to reallocate that busy > extent, we either select a different free extent to allocate, or if > we can't find any we force the journal to disk, wait for it to > complete (hence unbusying the extents) and retry the allocation > again. > > We can do something similar for inode allocation - it's actually a > lockless tag lookup on the radix tree entry for the candidate inode > number. If we find the reclaimable radix tree tag set, the we select > a different inode. If we can't allocate a new inode, then we kick > synchronize_rcu() and retry the allocation, allowing inodes to be > recycled this time. > I'm starting to poke around this area since it's become clear that the currently proposed scheme just involves too much latency (unless Paul chimes in with his expedited grace period variant, at which point I will revisit) in the fast allocation/recycle path. ISTM so far that a simple "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will have pretty much the same pattern of behavior as this patch: one synchronize_rcu() per batch. IOW, background reclaim only kicks in after 30s by default, so the pool of free inodes pretty much always consists of 100% reclaimable inodes. On top of that, at smaller batch sizes, the pool tends to have a uniform (!elapsed) grace period cookie, so a stall is required to be able to allocate any of them. As the batch size increases, I do see the population of free inodes start to contain a mix of expired and non-expired grace period cookies. 
It's fairly easy to hack up an internal icwalk scan to locate already expired inodes, but the problem is that the recycle rate is so much faster than the grace period latency that it doesn't really matter. We'll still have to stall by the time we get to the non-expired inodes, and so we're back to one stall per batch and the same general performance characteristic of this patch. So given all of this, I'm wondering about something like the following high level inode allocation algorithm: 1. If the AG has any reclaimable inodes, scan for one with an expired grace period. If found, target that inode for physical allocation. 2. If the AG free inode count == the AG reclaimable count and we know all reclaimable inodes are most likely pending a grace period (because the previous step failed), allocate a new inode chunk (and target it in this allocation). 3. If the AG free inode count > the reclaimable count, scan the finobt for an inode that is not present in the radix tree (i.e. Dave's logic above). Each of those steps could involve some heuristics to maintain predictable behavior and avoid large scans and such, but the general idea is that the repeated alloc/free inode workload naturally populates the AG with enough physical inodes to always be able to satisfy an allocation without waiting on a grace period. IOW, this is effectively similar behavior to if physical inode freeing was delayed to an rcu callback, with the tradeoff of complicating the allocation path rather than stalling in the inactivation pipeline. Thoughts? This of course is more involved than this patch (or similarly simple variants of RCU delaying preexisting bits of code) and requires some more investigation, but certainly shouldn't be a multi-year thing. The question is probably more of whether it's enough complexity to justify in the meantime... > > Are there any realistic prospects of having xfs_iget() deal with > > reuse case by allocating new in-core inode and flipping whatever > > references you've got in XFS journalling data structures to the > > new copy? If I understood what you said on IRC correctly, that is... > > That's ... much harder. > > One of the problems is that once an inode has a log item attached to > it, it assumes that it can be accessed without specific locking, > etc. see xfs_inode_clean(), for example. So there's some life-cycle > stuff that needs to be taken care of in XFS first, and the inode <-> > log item relationship is tangled. > > I've been working towards removing that tangle - but taht stuff is > quite a distance down my logging rework patch queue. THat queue has > been stuck now for a year trying to get the first handful of rework > and scalability modifications reviewed and merged, so I'm not > holding my breathe as to how long a more substantial rework of > internal logging code will take to review and merge. > > Really, though, we need the inactivation stuff to be done as part of > the VFS inode lifecycle. I have some ideas on what to do here, but I > suspect we'll need some changes to iput_final()/evict() to allow us > to process final unlinks in the bakground and then call evict() > ourselves when the unlink completes. That way ->destroy_inode() can > just call xfs_reclaim_inode() to free it directly, which also helps > us get rid of background inode freeing and hence inode recycling > from XFS altogether. I think we _might_ be able to do this without > needing to change any of the logging code in XFS, but I haven't > looked any further than this into it as yet. > ... 
of whatever this ends up looking like. Can you elaborate on what you mean by processing unlinks in the background? I can see the value of being able to eliminate the recycle code in XFS, but wouldn't we still have to limit and throttle against background work to maintain sustained removal performance? IOW, what's the general teardown behavior you're getting at here, aside from what parts push into the vfs or not? Brian > > Again, I'm not asking if it can be done this cycle; having a > > realistic path to doing that eventually would be fine by me. > > We're talking a year at least, probably two, before we get there... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
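[For reference, the three-step policy Brian outlines in the mail above might shake out into something like the sketch below. Every helper named here is hypothetical; only pagi_freecount and pag_ici_reclaimable are existing per-AG counters, and the step numbers match the list in the mail.]

/*
 * Loose sketch of the proposed allocation policy -- not a real
 * implementation.  Step 1 would internally test a destroy-time RCU
 * cookie with poll_state_synchronize_rcu(); steps 2 and 3 fall back
 * to growing the inode pool or searching the finobt.
 */
static int
xfs_dialloc_select(
	struct xfs_perag	*pag,
	xfs_ino_t		*inop)
{
	/* 1. Prefer a reclaimable inode whose grace period has expired. */
	if (pag->pag_ici_reclaimable &&
	    xfs_pick_expired_reclaimable(pag, inop) == 0)
		return 0;

	/* 2. Free pool is all RCU-busy reclaimable inodes: grow it. */
	if (pag->pagi_freecount == pag->pag_ici_reclaimable)
		return xfs_ialloc_new_chunk(pag, inop);

	/* 3. Otherwise find a finobt inode with no in-core counterpart. */
	return xfs_pick_uncached_finobt(pag, inop);
}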
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 19:01 ` Brian Foster @ 2022-01-27 22:18 ` Dave Chinner 2022-01-28 14:11 ` Brian Foster 2022-01-28 21:39 ` Paul E. McKenney 1 sibling, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-27 22:18 UTC (permalink / raw) To: Brian Foster; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > Right, background inactivation does not improve performance - it's > > > > necessary to get the transactions out of the evict() path. All we > > > > wanted was to ensure that there were no performance degradations as > > > > a result of background inactivation, not that it was faster. > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > access when the batch size is increased, cpu profiles with 'perf > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > where I mention those things to Paul. > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > I'm not asking for that to happen this cycle and for backports Ian's > > > patch is obviously fine. > > > > Yes, but not in the near term. > > > > > What I really want to avoid is the situation when we are stuck with > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > thread do look like that. > > > > The simplest way I think is to have the XFS inode allocation track > > "busy inodes" in the same way we track "busy extents". A busy extent > > is an extent that has been freed by the user, but is not yet marked > > free in the journal/on disk. If we try to reallocate that busy > > extent, we either select a different free extent to allocate, or if > > we can't find any we force the journal to disk, wait for it to > > complete (hence unbusying the extents) and retry the allocation > > again. > > > > We can do something similar for inode allocation - it's actually a > > lockless tag lookup on the radix tree entry for the candidate inode > > number. If we find the reclaimable radix tree tag set, the we select > > a different inode. If we can't allocate a new inode, then we kick > > synchronize_rcu() and retry the allocation, allowing inodes to be > > recycled this time. > > > > I'm starting to poke around this area since it's become clear that the > currently proposed scheme just involves too much latency (unless Paul > chimes in with his expedited grace period variant, at which point I will > revisit) in the fast allocation/recycle path. ISTM so far that a simple > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > have pretty much the same pattern of behavior as this patch: one > synchronize_rcu() per batch. That's not really what I proposed - what I suggested was that if we can't allocate a usable inode from the finobt, and we can't allocate a new inode cluster from the AG (i.e. populate the finobt with more inodes), only then call synchronise_rcu() and recycle an inode. We don't need to scan the inode cache or the finobt to determine if there are reclaimable inodes immediately available - do a gang tag lookup on the radix tree for newino. 
If it comes back with an inode number that is not equal to the inode number we looked up, then we can allocate newino immediately. If it comes back with newino, then check the first inode in the finobt. If that comes back with an inode that is not the first inode in the finobt, we can immediately allocate the first inode in the finobt. If not, check the last inode. If that fails, assume all inodes in the finobt need recycling and allocate a new cluster, pointing newino at it. Then we get another 64 inodes starting at the newino cursor we can allocate from while we wait for the current RCU grace period to expire for inodes already in the reclaimable state. An algorithm like this will allow the free inode pool to resize automatically based on the unlink frequency of the workload and RCU grace period latency... > IOW, background reclaim only kicks in after 30s by default, 5 seconds, by default, not 30s. > so the pool > of free inodes pretty much always consists of 100% reclaimable inodes. > On top of that, at smaller batch sizes, the pool tends to have a uniform > (!elapsed) grace period cookie, so a stall is required to be able to > allocate any of them. As the batch size increases, I do see the > population of free inodes start to contain a mix of expired and > non-expired grace period cookies. It's fairly easy to hack up an > internal icwalk scan to locate already expired inodes, We don't want or need to do exhaustive, exactly correct scans here. We want *fast and loose* because this is a critical performance fast path. We don't care if we skip the occasional recyclable inode, what we need to do is minimise the CPU overhead and search latency for the case where recycling will never occur. > but the problem > is that the recycle rate is so much faster than the grace period latency > that it doesn't really matter. We'll still have to stall by the time we > get to the non-expired inodes, and so we're back to one stall per batch > and the same general performance characteristic of this patch. Yes, but that's why I suggested that we allocate a new inode cluster rather than calling synchronise_rcu() when we don't have a recyclable inode candidate. > So given all of this, I'm wondering about something like the following > high level inode allocation algorithm: > > 1. If the AG has any reclaimable inodes, scan for one with an expired > grace period. If found, target that inode for physical allocation. How do you efficiently discriminate between "reclaimable w/ nlink > 0" and "reclaimable w/ nlink == 0" so we don't get hung up searching millions of reclaimable inodes for the one that has been unlinked and has an expired grace period? Also, this will need to be done on every inode allocation when we have inodes in reclaimable state (which is almost always on a busy system). Workloads with sequential allocation (as per untar, rsync, git checkout, cp -r, etc) will do this scan unnecessarily as they will almost never hit this inode recycle path as there aren't a lot of unlinks occurring while they are working. > 2. If the AG free inode count == the AG reclaimable count and we know > all reclaimable inodes are most likely pending a grace period (because > the previous step failed), allocate a new inode chunk (and target it in > this allocation). That's good for the allocation that allocates the chunk, but... > 3. If the AG free inode count > the reclaimable count, scan the finobt > for an inode that is not present in the radix tree (i.e. Dave's logic > above). ...
now we are repeating the radix tree walk that we've already done in #1 to find the newly allocated inodes we allocated in #2. We don't need to walk the inodes in the inode radix tree to look at individual inode state - we can use the reclaimable radix tree tag to shortcut those walks and minimise the number of actual lookups we need to do. By definition, an inode in the finobt and XFS_IRECLAIMABLE state is an inode that needs recycling, so we can just use the finobt and the inode radix tree tags to avoid inodes that need recycling altogether. i.e. If we fail a tag lookup, we have no reclaimable inodes in the range we asked the lookup to search so we can immediately allocate - we don't actually need to look at the inode in the fast path no-recycling case at all. Keep in mind that the fast path we really care about is not the unlink/allocate looping case, it's the allocation case where no recycling will ever occur and so that's the one we really have to try hard to minimise the overhead for. The moment we get into reclaimable inodes within the finobt range we're hitting the "lots of temp files" use case, so we can detect that and keep the overhead of that algorithm as separate as we possibly can. Hence we need the initial "can we allocate this inode number" decision to be as fast and as low overhead as possible so we can determine which algorithm we need to run. A lockless radix tree gang tag lookup will give us that and if the lookup finds a reclaimable inode only then do we move into the "recycle RCU avoidance" algorithm path.... > > > Are there any realistic prospects of having xfs_iget() deal with > > > reuse case by allocating new in-core inode and flipping whatever > > > references you've got in XFS journalling data structures to the > > > new copy? If I understood what you said on IRC correctly, that is... > > > > That's ... much harder. > > > > One of the problems is that once an inode has a log item attached to > > it, it assumes that it can be accessed without specific locking, > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > stuff that needs to be taken care of in XFS first, and the inode <-> > > log item relationship is tangled. > > > > I've been working towards removing that tangle - but taht stuff is > > quite a distance down my logging rework patch queue. THat queue has > > been stuck now for a year trying to get the first handful of rework > > and scalability modifications reviewed and merged, so I'm not > > holding my breathe as to how long a more substantial rework of > > internal logging code will take to review and merge. > > > > Really, though, we need the inactivation stuff to be done as part of > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > suspect we'll need some changes to iput_final()/evict() to allow us > > to process final unlinks in the bakground and then call evict() > > ourselves when the unlink completes. That way ->destroy_inode() can > > just call xfs_reclaim_inode() to free it directly, which also helps > > us get rid of background inode freeing and hence inode recycling > > from XFS altogether. I think we _might_ be able to do this without > > needing to change any of the logging code in XFS, but I haven't > > looked any further than this into it as yet. > > > > ... of whatever this ends up looking like. > > Can you elaborate on what you mean by processing unlinks in the > background?
I can see the value of being able to eliminate the recycle > code in XFS, but wouldn't we still have to limit and throttle against > background work to maintain sustained removal performance? Yes, but that's irrelevant because all we would be doing is slightly changing where that throttling occurs (i.e. in iput_final->drop_inode instead of iput_final->evict->destroy_inode). However, moving the throttling up the stack is a good thing because it gets rid of the current problem with the inactivation throttling blocking the shrinker via shrinker->super_cache_scan-> prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on full inactivation queue because all the inodes need EOF block trimming to be done. > IOW, what's > the general teardown behavior you're getting at here, aside from what > parts push into the vfs or not? ->drop_inode() triggers background inactivation for both blockgc and inode unlink. For unlink, we set I_WILL_FREE so the VFS will not attempt to re-use it, add the inode # to the internal AG "busy inode" tree and return drop = true and the VFS then stops processing that inode. For blockgc, we queue the work and return drop = false and the VFS puts it onto the LRU. Now we have asynchronous inactivation while the inode is still present and visible at the VFS level. For background blockgc - that now happens while the inode is idle on the LRU before it gets reclaimed by the shrinker. i.e. we trigger block gc when the last reference to the inode goes away instead of when it gets removed from memory by the shrinker. For unlink, that now runs in the background until the inode unlink has been journalled and the cleared inode written to the backing inode cluster buffer. The inode is then no longer visible to the journal and it can't be reallocated because it is still busy. We then change the inode state from I_WILL_FREE to I_FREEING and call evict(). The inode then gets torn down, and in ->destroy_inode we remove the inode from the radix tree, clear the per-ag busy record and free the inode via RCU as expected by the VFS. Another possible mechanism instead of exporting evict() is that background inactivation takes a new reference to the inode from ->drop_inode so that even if we put it on the LRU the inode cache shrinker will skip it while we are doing background inactivation. That would mean that when background inactivation is done, we call iput_final() again. The inode will either then be left on the LRU or go through the normal evict() path. This also gets the memory demand and overhead of EOF block trimming out of the memory reclaim path, and it also gets rid of the need for the special superblock shrinker hooks that XFS has for reclaiming its internal inode cache. Overall, lifting this stuff up to the VFS is full of "less complexity in XFS" wins if we can make it work... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
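[Strung together, the "newino, then first, then last finobt record" probing Dave describes in the mail above might look roughly like the sketch below. None of this is actual XFS code: xfs_finobt_first_ino()/xfs_finobt_last_ino() are hypothetical helpers returning the lowest and highest finobt records, and xfs_inode_is_busy() is the tag-lookup probe sketched earlier in the thread.]

/*
 * Sketch: decide with a handful of O(1) probes whether the finobt is
 * likely full of RCU-busy inodes.  Returns true if the caller should
 * allocate a new inode cluster instead of picking from the finobt.
 */
static bool
xfs_finobt_all_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		newino,
	xfs_agino_t		*allocino)
{
	xfs_agino_t		first = xfs_finobt_first_ino(pag);
	xfs_agino_t		last = xfs_finobt_last_ino(pag);

	if (!xfs_inode_is_busy(pag, newino)) {
		*allocino = newino;	/* newino can be allocated right away */
		return false;
	}
	if (!xfs_inode_is_busy(pag, first)) {
		*allocino = first;
		return false;
	}
	if (!xfs_inode_is_busy(pag, last)) {
		*allocino = last;
		return false;
	}
	/* Assume the whole finobt needs recycling; allocate a new cluster. */
	return true;
}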
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 22:18 ` Dave Chinner @ 2022-01-28 14:11 ` Brian Foster 2022-01-28 23:53 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-28 14:11 UTC (permalink / raw) To: Dave Chinner; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Fri, Jan 28, 2022 at 09:18:17AM +1100, Dave Chinner wrote: > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > necessary to get the transactions out of the evict() path. All we > > > > > wanted was to ensure that there were no performance degradations as > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > where I mention those things to Paul. > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > patch is obviously fine. > > > > > > Yes, but not in the near term. > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > thread do look like that. > > > > > > The simplest way I think is to have the XFS inode allocation track > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > is an extent that has been freed by the user, but is not yet marked > > > free in the journal/on disk. If we try to reallocate that busy > > > extent, we either select a different free extent to allocate, or if > > > we can't find any we force the journal to disk, wait for it to > > > complete (hence unbusying the extents) and retry the allocation > > > again. > > > > > > We can do something similar for inode allocation - it's actually a > > > lockless tag lookup on the radix tree entry for the candidate inode > > > number. If we find the reclaimable radix tree tag set, the we select > > > a different inode. If we can't allocate a new inode, then we kick > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > recycled this time. > > > > > > > I'm starting to poke around this area since it's become clear that the > > currently proposed scheme just involves too much latency (unless Paul > > chimes in with his expedited grace period variant, at which point I will > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > have pretty much the same pattern of behavior as this patch: one > > synchronize_rcu() per batch. > > That's not really what I proposed - what I suggested was that if we > can't allocate a usable inode from the finobt, and we can't allocate > a new inode cluster from the AG (i.e. populate the finobt with more > inodes), only then call synchronise_rcu() and recycle an inode. > That's not how I read it... 
Regardless, that was my suggestion as well, so we're on the same page on that front. > We don't need to scan the inode cache or the finobt to determine if > there are reclaimable inodes immediately available - do a gang tag > lookup on the radix tree for newino. > If it comes back with an inode number that is not > equal to the node number we looked up, then we can allocate an > newino immediately. > > If it comes back with newino, then check the first inode in the > finobt. If that comes back with an inode that is not the first inode > in the finobt, we can immediately allocate the first inode in the > finobt. If not, check the last inode. if that fails, assume all > inodes in the finobt need recycling and allocate a new cluster, > pointing newino at it. > Hrm, I'll have to think about this some more. I don't mind something like this as a possible scanning allocation algorithm, but I don't love the idea of doing a few predictable btree/radix tree lookups and inferring broader AG state from that, particularly when I think it's possible to get more accurate information in a way that's easier and probably more efficient. For example, we already have counts of the number of reclaimable and free inodes in the perag. We could fairly easily add a counter to track the subset of reclaimable inodes that are unlinked. With something like that, it's easier to make higher level decisions like when to just allocate a new inode chunk (because the free inode pool consists mostly of reclaimable inodes) or just scanning through the finobt for a good candidate (because there are none or very few unlinked reclaimable inodes relative to the number of free inodes in the btree). So in general I think the two obvious ends of the spectrum (i.e. the repeated alloc/free workload I'm testing above vs. the tar/cp use case where there are many allocs and few unlinks) are probably the most straightforward to handle and don't require major search algorithm changes. It's the middle ground (i.e. a large number of free inodes with half or whatever more sitting in the radix tree) that I think requires some more thought and I don't quite have an answer for atm. I don't want to go off allocating new inode chunks too aggressively, but also don't want to turn the finobt allocation algorithm into something like the historical inobt search algorithm with poor worst case behavior. > Then we get another 64 inodes starting at the newino cursor we can > allocate from while we wait for the current RCU grace period to > expire for inodes already in the reclaimable state. An algorithm > like this will allow the free inode pool to resize automatically > based on the unlink frequency of the workload and RCU grace period > latency... > > > IOW, background reclaim only kicks in after 30s by default, > > 5 seconds, by default, not 30s. > xfs_reclaim_work_queue() keys off xfs_syncd_centisecs, which corresponds to xfs_params.syncd_timer, which is initialized as: .syncd_timer = { 1*100, 30*100, 7200*100}, Am I missing something? Not that it really matters much for this discussion anyways. Whether it's 30s or 5s, either way the reallocation workload is going to pretty much always recycle these inodes long before background reclaim gets to them. > > so the pool > > of free inodes pretty much always consists of 100% reclaimable inodes. > > On top of that, at smaller batch sizes, the pool tends to have a uniform > > (!elapsed) grace period cookie, so a stall is required to be able to > > allocate any of them. 
As the batch size increases, I do see the > > population of free inodes start to contain a mix of expired and > > non-expired grace period cookies. It's fairly easy to hack up an > > internal icwalk scan to locate already expired inodes, > > We don't want or need to do exhaustive, exactly correct scans here. > We want *fast and loose* because this is a critical performance fast > path. We don't care if we skip the occasional recyclable inode, what > we need to to is minimise the CPU overhead and search latency for > the case where recycling will never occur. > Agreed. That's what I meant by my comment about having heuristics to avoid large/long scans. > > but the problem > > is that the recycle rate is so much faster than the grace period latency > > that it doesn't really matter. We'll still have to stall by the time we > > get to the non-expired inodes, and so we're back to one stall per batch > > and the same general performance characteristic of this patch. > > Yes, but that's why I suggested that we allocate a new inode cluster > rather than calling synchronise_rcu() when we don't have a > recyclable inode candidate. > Ok. > > So given all of this, I'm wondering about something like the following > > high level inode allocation algorithm: > > > > 1. If the AG has any reclaimable inodes, scan for one with an expired > > grace period. If found, target that inode for physical allocation. > > How do you efficiently discriminate between "reclaimable w/ nlink > > 0" and "reclaimable w/ nlink == 0" so we don't get hung up searching > millions of reclaimable inodes for the one that has been unlinked > and has an expired grace period? > A counter or some other form of hinting structure.. > Also, this will need to be done on every inode allocation when we > have inodes in reclaimable state (which is almost always on a busy > system). Workloads with sequential allocation (as per untar, rsync, > git checkout, cp -r, etc) will do this scan unnecessarily as they > will almost never hit this inode recycle path as there aren't a lot > of unlinks occurring while they are working. > I'm not necessarily suggesting a full radix tree scan per inode allocation. I was more thinking about an occasionally updated hinting structure to efficiently locate the least recently freed inode numbers, or something similar. This would serve no purpose in scenarios where it just makes more sense to allocate new chunks, but otherwise could just serve as an allocation target, a metric to determine likelihood of reclaimable inodes w/ expired grace periods being present, or just a starting point for a finobt search algorithm like what you describe above, etc. > > 2. If the AG free inode count == the AG reclaimable count and we know > > all reclaimable inodes are most likely pending a grace period (because > > the previous step failed), allocate a new inode chunk (and target it in > > this allocation). > > That's good for the allocation that allocates the chunk, but... > > > 3. If the AG free inode count > the reclaimable count, scan the finobt > > for an inode that is not present in the radix tree (i.e. Dave's logic > > above). > > ... now we are repeating the radix tree walk that we've already done > in #1 to find the newly allocated inodes we allocated in #2. > > We don't need to walk the inodes in the inode radix tree to look at > individual inode state - we can use the reclaimable radix tree tag > to shortcut those walks and minimise the number of actual lookups we > need to do. 
By definition, and inode in the finobt and > XFS_IRECLAIMABLE state is an inode that needs recycling, so we can > just use the finobt and the inode radix tree tags to avoid inodes > that need recycling altogether. i.e. If we fail a tag lookup, we > have no reclaimable inodes in the range we asked the lookup to > search so we can immediately allocate - we don't need to actually > need to look at the inode in the fast path no-recycling case at all. > This is starting to make some odd (to me) assumptions about thus far undefined implementation details. For example, the very little amount of code I have already for experimentation purposes only scans tagged reclaimable inodes, so that you suggest doing exactly that instead of full radix tree scans suggests to me that there are some details here that are clearly not getting across in email. ;) That's fine, I'm not trying to cover details. Details are easier to work through with code, and TBH I don't have enough concrete ideas to hash through details in email just yet anyways. The primary concepts in my previous description were that we should prioritize allocation of new chunks over taking RCU stalls whenever possible, and that there might be ways to use existing radix tree state to maintain predictable worst case performance for finobt searches (TBD). With regard to the general principles you mention of avoiding repeated large scans, maintaing common workload and fast path performance, etc., I think we're pretty much on the same page. > Keep in mind that the fast path we really care about is not the > unlink/allocate looping case, it's the allocation case where no > recycling will ever occur and so that's the one we really have to > try hard to minimise the overhead for. The moment we get into > reclaimable inodes within the finobt range we're hitting the "lots > of temp files" use case, so we can detect that and keep the overhead > of that algorithm as separate as we possibly can. > > Hence we need the initial "can we allocate this inode number" > decision to be as fast and as low overhead as possible so we can > determine which algorithm we need to run. A lockless radix tree gang > tag lookup will give us that and if the lookup finds a reclaimable > inode only then do we move into the "recycle RCU avoidance" > algorithm path.... > > > > > Are there any realistic prospects of having xfs_iget() deal with > > > > reuse case by allocating new in-core inode and flipping whatever > > > > references you've got in XFS journalling data structures to the > > > > new copy? If I understood what you said on IRC correctly, that is... > > > > > > That's ... much harder. > > > > > > One of the problems is that once an inode has a log item attached to > > > it, it assumes that it can be accessed without specific locking, > > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > > stuff that needs to be taken care of in XFS first, and the inode <-> > > > log item relationship is tangled. > > > > > > I've been working towards removing that tangle - but taht stuff is > > > quite a distance down my logging rework patch queue. THat queue has > > > been stuck now for a year trying to get the first handful of rework > > > and scalability modifications reviewed and merged, so I'm not > > > holding my breathe as to how long a more substantial rework of > > > internal logging code will take to review and merge. > > > > > > Really, though, we need the inactivation stuff to be done as part of > > > the VFS inode lifecycle. 
I have some ideas on what to do here, but I > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > to process final unlinks in the bakground and then call evict() > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > us get rid of background inode freeing and hence inode recycling > > > from XFS altogether. I think we _might_ be able to do this without > > > needing to change any of the logging code in XFS, but I haven't > > > looked any further than this into it as yet. > > > > > > > ... of whatever this ends up looking like. > > > > Can you elaborate on what you mean by processing unlinks in the > > background? I can see the value of being able to eliminate the recycle > > code in XFS, but wouldn't we still have to limit and throttle against > > background work to maintain sustained removal performance? > > Yes, but that's irrelevant because all we would be doing is slightly > changing where that throttling occurs (i.e. in > iput_final->drop_inode instead of iput_final->evict->destroy_inode). > > However, moving the throttling up the stack is a good thing because > it gets rid of the current problem with the inactivation throttling > blocking the shrinker via shrinker->super_cache_scan-> > prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on > full inactivation queue because all the inodes need EOF block > trimming to be done. > What I'm trying to understand is whether inodes will have cycled through the requisite grace period before ->destroy_inode() or not, and if so, how that is done to avoid the sustained removal performance problem we've run into here (caused by the extra latency leading to increasing cacheline misses)..? > > IOW, what's > > the general teardown behavior you're getting at here, aside from what > > parts push into the vfs or not? > > ->drop_inode() triggers background inactivation for both blockgc and > inode unlink. For unlink, we set I_WILL_FREE so the VFS will not > attempt to re-use it, add the inode # to the internal AG "busy > inode" tree and return drop = true and the VFS then stops processing > that inode. For blockgc, we queue the work and return drop = false > and the VFS puts it onto the LRU. Now we have asynchronous > inactivation while the inode is still present and visible at the VFS > level. > > For background blockgc - that now happens while the inode is idle on > the LRU before it gets reclaimed by the shrinker. i.e. we trigger > block gc when the last reference to the inode goes away instead of > when it gets removed from memory by the shrinker. > > For unlink, that now runs in the bacgrkoud until the inode unlink > has been journalled and the cleared inode written to the backing > inode cluster buffer. The inode is then no longer visisble to the > journal and it can't be reallocated because it is still busy. We > then change the inode state from I_WILL_FREE to I_FREEING and call > evict(). The inode then gets torn down, and in ->destroy_inode we > remove the inode from the radix tree, clear the per-ag busy record > and free the inode via RCU as expected by the VFS. > Ok, so this sort of sounds like these are separate things. I'm all for creating more flexibility with the VFS to allow XFS to remove or simplify codepaths, but this still depends on some form of grace period tracking to avoid allocation of inodes that are free in the btrees but still might have in-core struct inode's laying around, yes? 
The reason I'm asking about this is because as this patch to avoid recycling non-expired inodes becomes more complex in order to satisfy performance requirements, longer term usefulness becomes more relevant. I don't want us to come up with some complex scheme to avoid RCU stalls when there's already a plan to rip it out and replace it in a year or so. OTOH if the resulting logic is part of that longer term strategy, then this is less of a concern. Brian > Another possible mechanism instead of exporting evict() is that > background inactivation takes a new reference to the inode from > ->drop_inode so that even if we put it on the LRU the inode cache > shrinker will skip it while we are doing background inactivation. > That would mean that when background inactivation is done, we call > iput_final() again. The inode will either then be left on the LRU or > go through the normal evict() path. > > This also it gets the memory demand and overhead of EOF block > trimming out of the memory reclaim path, and it also gets rid of > the need for the special superblock shrinker hooks that XFS has for > reclaiming it's internal inode cache. > > Overall, lifting this stuff up to the VFS is full of "less > complexity in XFS" wins if we can make it work... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
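[The per-AG counter Brian floats earlier in this mail could be as small as the sketch below. pag_ici_unlinked is a hypothetical field, the 50% threshold at the end is an arbitrary illustration, and the update sites would be wherever the existing reclaimable-inode accounting already happens.]

/* Sketch: count reclaimable inodes that are unlinked, i.e. recycle candidates. */
static inline void
xfs_perag_account_unlinked(
	struct xfs_perag	*pag,
	struct xfs_inode	*ip,
	int			delta)
{
	if (VFS_I(ip)->i_nlink == 0)
		atomic_add(delta, &pag->pag_ici_unlinked);	/* hypothetical field */
}

/* Allocation-time hint: a mostly-unlinked free pool suggests allocating a new chunk. */
static inline bool
xfs_perag_prefer_new_chunk(
	struct xfs_perag	*pag)
{
	return atomic_read(&pag->pag_ici_unlinked) * 2 > pag->pagi_freecount;
}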
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-28 14:11 ` Brian Foster @ 2022-01-28 23:53 ` Dave Chinner 2022-01-31 13:28 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-28 23:53 UTC (permalink / raw) To: Brian Foster; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Fri, Jan 28, 2022 at 09:11:07AM -0500, Brian Foster wrote: > On Fri, Jan 28, 2022 at 09:18:17AM +1100, Dave Chinner wrote: > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > where I mention those things to Paul. > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > patch is obviously fine. > > > > > > > > Yes, but not in the near term. > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > thread do look like that. > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > is an extent that has been freed by the user, but is not yet marked > > > > free in the journal/on disk. If we try to reallocate that busy > > > > extent, we either select a different free extent to allocate, or if > > > > we can't find any we force the journal to disk, wait for it to > > > > complete (hence unbusying the extents) and retry the allocation > > > > again. > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > a different inode. If we can't allocate a new inode, then we kick > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > recycled this time. > > > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > currently proposed scheme just involves too much latency (unless Paul > > > chimes in with his expedited grace period variant, at which point I will > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > have pretty much the same pattern of behavior as this patch: one > > > synchronize_rcu() per batch. 
> > > > That's not really what I proposed - what I suggested was that if we > > can't allocate a usable inode from the finobt, and we can't allocate > > a new inode cluster from the AG (i.e. populate the finobt with more > > inodes), only then call synchronise_rcu() and recycle an inode. > > > > That's not how I read it... Regardless, that was my suggestion as well, > so we're on the same page on that front. > > > We don't need to scan the inode cache or the finobt to determine if > > there are reclaimable inodes immediately available - do a gang tag > > lookup on the radix tree for newino. > > If it comes back with an inode number that is not > > equal to the node number we looked up, then we can allocate an > > newino immediately. > > > > If it comes back with newino, then check the first inode in the > > finobt. If that comes back with an inode that is not the first inode > > in the finobt, we can immediately allocate the first inode in the > > finobt. If not, check the last inode. if that fails, assume all > > inodes in the finobt need recycling and allocate a new cluster, > > pointing newino at it. > > > > Hrm, I'll have to think about this some more. I don't mind something > like this as a possible scanning allocation algorithm, but I don't love > the idea of doing a few predictable btree/radix tree lookups and > inferring broader AG state from that, particularly when I think it's > possible to get more accurate information in a way that's easier and > probably more efficient. > > For example, we already have counts of the number of reclaimable and > free inodes in the perag. We could fairly easily add a counter to track > the subset of reclaimable inodes that are unlinked. With something like > that, it's easier to make higher level decisions like when to just > allocate a new inode chunk (because the free inode pool consists mostly > of reclaimable inodes) or just scanning through the finobt for a good > candidate (because there are none or very few unlinked reclaimable > inodes relative to the number of free inodes in the btree). > > So in general I think the two obvious ends of the spectrum (i.e. the > repeated alloc/free workload I'm testing above vs. the tar/cp use case > where there are many allocs and few unlinks) are probably the most > straightforward to handle and don't require major search algorithm > changes. It's the middle ground (i.e. a large number of free inodes > with half or whatever more sitting in the radix tree) that I think > requires some more thought and I don't quite have an answer for atm. I > don't want to go off allocating new inode chunks too aggressively, but > also don't want to turn the finobt allocation algorithm into something > like the historical inobt search algorithm with poor worst case > behavior. > > > Then we get another 64 inodes starting at the newino cursor we can > > allocate from while we wait for the current RCU grace period to > > expire for inodes already in the reclaimable state. An algorithm > > like this will allow the free inode pool to resize automatically > > based on the unlink frequency of the workload and RCU grace period > > latency... > > > > > IOW, background reclaim only kicks in after 30s by default, > > > > 5 seconds, by default, not 30s. > > > > xfs_reclaim_work_queue() keys off xfs_syncd_centisecs, which corresponds > to xfs_params.syncd_timer, which is initialized as: > > .syncd_timer = { 1*100, 30*100, 7200*100}, > > Am I missing something? 
static void
xfs_reclaim_work_queue(
	struct xfs_mount	*mp)
{

	rcu_read_lock();
	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
		queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work,
			msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
	}
	rcu_read_unlock();
}
.... > > > > Really, though, we need the inactivation stuff to be done as part of > > > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > > to process final unlinks in the bakground and then call evict() > > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > > us get rid of background inode freeing and hence inode recycling > > > > from XFS altogether. I think we _might_ be able to do this without > > > > needing to change any of the logging code in XFS, but I haven't > > > > looked any further than this into it as yet. > > > > > > > > ... of whatever this ends up looking like. > > > > > > Can you elaborate on what you mean by processing unlinks in the > > > background? I can see the value of being able to eliminate the recycle > > > code in XFS, but wouldn't we still have to limit and throttle against > > > background work to maintain sustained removal performance? > > > > Yes, but that's irrelevant because all we would be doing is slightly > > changing where that throttling occurs (i.e. in > > iput_final->drop_inode instead of iput_final->evict->destroy_inode). > > > > However, moving the throttling up the stack is a good thing because > > it gets rid of the current problem with the inactivation throttling > > blocking the shrinker via shrinker->super_cache_scan-> > > prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on > > full inactivation queue because all the inodes need EOF block > > trimming to be done. > > > > What I'm trying to understand is whether inodes will have cycled through > > the requisite grace period before ->destroy_inode() or not, and if so, The whole point of moving stuff up in the VFS is that inodes don't get recycled by XFS at all so we don't even have to think about RCU grace periods anywhere inside XFS. > > how that is done to avoid the sustained removal performance problem > > we've run into here (caused by the extra latency leading to increasing > > cacheline misses)..? The background work is done _before_ evict() is called by the VFS to get the inode freed via RCU callbacks. The perf constraints are unchanged, we just change the layer at which the background work is performed. > > > > IOW, what's > > > > the general teardown behavior you're getting at here, aside from what > > > > parts push into the vfs or not? > > > > > > ->drop_inode() triggers background inactivation for both blockgc and > > > inode unlink. For unlink, we set I_WILL_FREE so the VFS will not > > > attempt to re-use it, add the inode # to the internal AG "busy > > > inode" tree and return drop = true and the VFS then stops processing > > > that inode. For blockgc, we queue the work and return drop = false > > > and the VFS puts it onto the LRU. Now we have asynchronous > > > inactivation while the inode is still present and visible at the VFS > > > level. > > > > > > For background blockgc - that now happens while the inode is idle on > > > the LRU before it gets reclaimed by the shrinker. i.e. we trigger > > > block gc when the last reference to the inode goes away instead of > > > when it gets removed from memory by the shrinker.
> > > > For unlink, that now runs in the bacgrkoud until the inode unlink > > has been journalled and the cleared inode written to the backing > > inode cluster buffer. The inode is then no longer visisble to the > > journal and it can't be reallocated because it is still busy. We > > then change the inode state from I_WILL_FREE to I_FREEING and call > > evict(). The inode then gets torn down, and in ->destroy_inode we > > remove the inode from the radix tree, clear the per-ag busy record > > and free the inode via RCU as expected by the VFS. > > > > Ok, so this sort of sounds like these are separate things. I'm all for > creating more flexibility with the VFS to allow XFS to remove or > simplify codepaths, but this still depends on some form of grace period > tracking to avoid allocation of inodes that are free in the btrees but > still might have in-core struct inode's laying around, yes? > The reason I'm asking about this is because as this patch to avoid > recycling non-expired inodes becomes more complex in order to satisfy > performance requirements, longer term usefulness becomes more relevant. You say this like I haven't already thought about this.... > I don't want us to come up with some complex scheme to avoid RCU stalls > when there's already a plan to rip it out and replace it in a year or > so. OTOH if the resulting logic is part of that longer term strategy, > then this is less of a concern. .... and so maybe you haven't realised why I keep suggesting something along the lines of a busy inode mechanism similar to busy extent tracking? Essentially, we can't reallocate the inode until the previous use has been retired. Which means we'd create the busy inode record in xfs_inactive() before we free the inode and xfs_reclaim_inode() would remove the inode from the busy tree when it reclaims the inode and removes it from the radix tree after marking it dead for RCU lookup purposes. That would prevent reallocation of the inode until we can allocate a new in-core inode structure for the inode. In the lifted VFS case I describe, ->drop_inode() would result in background inactivation inserting the inode into the busy tree. Once that is all done and we call evict() on the inode, ->destroy_inode calls xfs_reclaim_inode() directly. IOWs, the busy inode mechanism works for both existing and future inactivation mechanisms. Now, let's take a step further back from this, and consider the current inode cache implementation. The fast and dirty method for tracking busy inodes is to use the fact that a busy inode is defined as being in the finobt whilst the in-core inode is in an IRECLAIMABLE state. Hence, at least initially, we don't need a separate tree to determine if an inode is "busy" efficiently. The allocation policy that selects the inode to allocate doesn't care what mechanism we use to determine if an inode is busy - it's just concerned with finding a non-busy inode efficiently. Hence we can use a simple "best, first, last" heuristic to determine if the finobt is likely to be largely made up of busy inodes and decide to allocate new inode chunks instead of searching the finobt for an unbusy inode. IOWs, the "busy inode tracking" implementation will need to change to be something more explicit as we move inactivation up in the VFS because the IRECLAIMABLE state goes away, but that doesn't change the allocation algorithm or heuristics that are based on detecting busy inodes at allocation time. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
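[For what it's worth, the explicit busy-inode record Dave sketches above (inserted from xfs_inactive(), cleared from xfs_reclaim_inode()) could start out as small as the sketch below. pag_busy_inodes and the helper names are hypothetical, mirroring the busy extent tree in spirit rather than implementation; initialisation and error handling are omitted.]

/* Sketch: explicit per-AG busy inode tracking -- not actual XFS code. */
static void
xfs_inode_mark_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	/* called from xfs_inactive() before the on-disk free is committed */
	xa_store(&pag->pag_busy_inodes, agino, xa_mk_value(1), GFP_NOFS);
}

static void
xfs_inode_clear_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	/* called from xfs_reclaim_inode() once RCU lookups can no longer see it */
	xa_erase(&pag->pag_busy_inodes, agino);
}

static bool
xfs_inode_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	/* allocation-time check, replacing the implicit IRECLAIMABLE test */
	return xa_load(&pag->pag_busy_inodes, agino) != NULL;
}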
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-28 23:53 ` Dave Chinner @ 2022-01-31 13:28 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-31 13:28 UTC (permalink / raw) To: Dave Chinner; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Sat, Jan 29, 2022 at 10:53:13AM +1100, Dave Chinner wrote: > On Fri, Jan 28, 2022 at 09:11:07AM -0500, Brian Foster wrote: > > On Fri, Jan 28, 2022 at 09:18:17AM +1100, Dave Chinner wrote: > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > patch is obviously fine. > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > thread do look like that. > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > extent, we either select a different free extent to allocate, or if > > > > > we can't find any we force the journal to disk, wait for it to > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > again. > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > recycled this time. > > > > > > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > currently proposed scheme just involves too much latency (unless Paul > > > > chimes in with his expedited grace period variant, at which point I will > > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > have pretty much the same pattern of behavior as this patch: one > > > > synchronize_rcu() per batch. 
> > > > > > That's not really what I proposed - what I suggested was that if we > > > can't allocate a usable inode from the finobt, and we can't allocate > > > a new inode cluster from the AG (i.e. populate the finobt with more > > > inodes), only then call synchronise_rcu() and recycle an inode. > > > > > > > That's not how I read it... Regardless, that was my suggestion as well, > > so we're on the same page on that front. > > > > > We don't need to scan the inode cache or the finobt to determine if > > > there are reclaimable inodes immediately available - do a gang tag > > > lookup on the radix tree for newino. > > > If it comes back with an inode number that is not > > > equal to the node number we looked up, then we can allocate an > > > newino immediately. > > > > > > If it comes back with newino, then check the first inode in the > > > finobt. If that comes back with an inode that is not the first inode > > > in the finobt, we can immediately allocate the first inode in the > > > finobt. If not, check the last inode. if that fails, assume all > > > inodes in the finobt need recycling and allocate a new cluster, > > > pointing newino at it. > > > > > > > Hrm, I'll have to think about this some more. I don't mind something > > like this as a possible scanning allocation algorithm, but I don't love > > the idea of doing a few predictable btree/radix tree lookups and > > inferring broader AG state from that, particularly when I think it's > > possible to get more accurate information in a way that's easier and > > probably more efficient. > > > > For example, we already have counts of the number of reclaimable and > > free inodes in the perag. We could fairly easily add a counter to track > > the subset of reclaimable inodes that are unlinked. With something like > > that, it's easier to make higher level decisions like when to just > > allocate a new inode chunk (because the free inode pool consists mostly > > of reclaimable inodes) or just scanning through the finobt for a good > > candidate (because there are none or very few unlinked reclaimable > > inodes relative to the number of free inodes in the btree). > > > > So in general I think the two obvious ends of the spectrum (i.e. the > > repeated alloc/free workload I'm testing above vs. the tar/cp use case > > where there are many allocs and few unlinks) are probably the most > > straightforward to handle and don't require major search algorithm > > changes. It's the middle ground (i.e. a large number of free inodes > > with half or whatever more sitting in the radix tree) that I think > > requires some more thought and I don't quite have an answer for atm. I > > don't want to go off allocating new inode chunks too aggressively, but > > also don't want to turn the finobt allocation algorithm into something > > like the historical inobt search algorithm with poor worst case > > behavior. > > > > > Then we get another 64 inodes starting at the newino cursor we can > > > allocate from while we wait for the current RCU grace period to > > > expire for inodes already in the reclaimable state. An algorithm > > > like this will allow the free inode pool to resize automatically > > > based on the unlink frequency of the workload and RCU grace period > > > latency... > > > > > > > IOW, background reclaim only kicks in after 30s by default, > > > > > > 5 seconds, by default, not 30s. 
> > > > > > > xfs_reclaim_work_queue() keys off xfs_syncd_centisecs, which corresponds > > to xfs_params.syncd_timer, which is initialized as: > > > > .syncd_timer = { 1*100, 30*100, 7200*100}, > > > > Am I missing something? > > static void > xfs_reclaim_work_queue( > struct xfs_mount *mp) > { > > rcu_read_lock(); > if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) { > queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work, > msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10)); > } > rcu_read_unlock(); > } > Ah, thanks. > .... > > > > > > Really, though, we need the inactivation stuff to be done as part of > > > > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > > > to process final unlinks in the bakground and then call evict() > > > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > > > us get rid of background inode freeing and hence inode recycling > > > > > from XFS altogether. I think we _might_ be able to do this without > > > > > needing to change any of the logging code in XFS, but I haven't > > > > > looked any further than this into it as yet. > > > > > > > > > > > > > ... of whatever this ends up looking like. > > > > > > > > Can you elaborate on what you mean by processing unlinks in the > > > > background? I can see the value of being able to eliminate the recycle > > > > code in XFS, but wouldn't we still have to limit and throttle against > > > > background work to maintain sustained removal performance? > > > > > > Yes, but that's irrelevant because all we would be doing is slightly > > > changing where that throttling occurs (i.e. in > > > iput_final->drop_inode instead of iput_final->evict->destroy_inode). > > > > > > However, moving the throttling up the stack is a good thing because > > > it gets rid of the current problem with the inactivation throttling > > > blocking the shrinker via shrinker->super_cache_scan-> > > > prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on > > > full inactivation queue because all the inodes need EOF block > > > trimming to be done. > > > > > > > What I'm trying to understand is whether inodes will have cycled through > > the requisite grace period before ->destroy_inode() or not, and if so, > > The whole point of moving stuff up in the VFS is that inodes > don't get recycled by XFS at all so we don't even have to think > about RCU grace periods anywhere inside XFS. > > > how that is done to avoid the sustained removal performance problem > > we've run into here (caused by the extra latency leading to increasing > > cacheline misses)..? > > The background work is done _before_ evict() is called by the VFS to > get the inode freed via RCU callbacks. The perf constraints are > unchanged, we just change the layer at which the background work is > performance. > Ok. > > > > IOW, what's > > > > the general teardown behavior you're getting at here, aside from what > > > > parts push into the vfs or not? > > > > > > ->drop_inode() triggers background inactivation for both blockgc and > > > inode unlink. For unlink, we set I_WILL_FREE so the VFS will not > > > attempt to re-use it, add the inode # to the internal AG "busy > > > inode" tree and return drop = true and the VFS then stops processing > > > that inode. 
For blockgc, we queue the work and return drop = false > > > and the VFS puts it onto the LRU. Now we have asynchronous > > > inactivation while the inode is still present and visible at the VFS > > > level. > > > > > > For background blockgc - that now happens while the inode is idle on > > > the LRU before it gets reclaimed by the shrinker. i.e. we trigger > > > block gc when the last reference to the inode goes away instead of > > > when it gets removed from memory by the shrinker. > > > > > > For unlink, that now runs in the background until the inode unlink > > > has been journalled and the cleared inode written to the backing > > > inode cluster buffer. The inode is then no longer visible to the > > > journal and it can't be reallocated because it is still busy. We > > > then change the inode state from I_WILL_FREE to I_FREEING and call > > > evict(). The inode then gets torn down, and in ->destroy_inode we > > > remove the inode from the radix tree, clear the per-ag busy record > > > and free the inode via RCU as expected by the VFS. > > > > > > > Ok, so this sort of sounds like these are separate things. I'm all for > > creating more flexibility with the VFS to allow XFS to remove or > > simplify codepaths, but this still depends on some form of grace period > > tracking to avoid allocation of inodes that are free in the btrees but > > still might have in-core struct inodes lying around, yes? > > > The reason I'm asking about this is because as this patch to avoid > > recycling non-expired inodes becomes more complex in order to satisfy > > performance requirements, longer term usefulness becomes more relevant. > > You say this like I haven't already thought about this.... > > > I don't want us to come up with some complex scheme to avoid RCU stalls > > when there's already a plan to rip it out and replace it in a year or > > so. OTOH if the resulting logic is part of that longer term strategy, > > then this is less of a concern. > > .... and so maybe you haven't realised why I keep suggesting > something along the lines of a busy inode mechanism similar to busy > extent tracking? > > Essentially, we can't reallocate the inode until the previous use > has been retired. Which means we'd create the busy inode record in > xfs_inactive() before we free the inode and xfs_reclaim_inode() > would remove the inode from the busy tree when it reclaims the inode > and removes it from the radix tree after marking it dead for RCU > lookup purposes. That would prevent reallocation of the inode until > we can allocate a new in-core inode structure for the inode. > > In the lifted VFS case I describe, ->drop_inode() would result in > background inactivation inserting the inode into the busy tree. Once > that is all done and we call evict() on the inode, ->destroy_inode > calls xfs_reclaim_inode() directly. IOWs, the busy inode mechanism > works for both existing and future inactivation mechanisms. > This is what I was trying to understand. The discussion to this point around eventually moving lifecycle bits into the VFS gave the impression that the grace period sequence would essentially be hidden from XFS, so that's why I've been asking how we expect to accomplish that. ISTM that's not necessarily the case... the notion of a free (on disk) inode that cannot be used due to a pending grace period still exists, it's just abstracted as a "busy inode" and used to implement a rule that such inodes cannot be reallocated until the VFS indicates so.
At that point we reclaim the struct inode so this presumably eliminates the need for the recycling logic and perhaps various other lifecycle related bits (that I've not thought through) in XFS, providing further simplification opportunities, etc. If I'm following the general idea correctly, this makes more sense to me. Thanks. Brian > Now, let's take a step further back from this, and consider the > current inode cache implementation. The fast and dirty method for > tracking busy inodes is to use the fact that a busy inode is defined > as being in the finobt whilst the in-core inode is in an > IRECLAIMABLE state. > > Hence, at least initially, we don't need a separate tree to > determine if an inode is "busy" efficiently. The allocation policy > that selects the inode to allocate doesn't care what mechanism we > use to determine if an inode is busy - it's just concerned with > finding a non-busy inode efficiently. Hence we can use a simple > "best, first, last" heuristic to determine if the finobt is likely > to be largely made up of busy inodes and decide to allocate new > inode chunks instead of searching the finobt for an unbusy inode. > > IOWs, the "busy extent tracking" implementation will need to change > to be something more explicit as we move inactivation up in the VFS > because the IRECLAIMABLE state goes away, but that doesn't change > the allocation algorithm or heuristics that are based on detecting > busy inodes at allocation time. > > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
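As a rough illustration of the "best, first, last" idea quoted above (glossing over the fact that finobt records cover inode chunks with free masks rather than single inode numbers), a decision helper might look like the following sketch. The two xfs_finobt_*_agino() lookups are hypothetical stand-ins for finobt cursor lookups; the radix tree tag check is the existing XFS_ICI_RECLAIM_TAG mechanism.

	/*
	 * Sketch: probe both ends of the finobt. If both are still tagged
	 * reclaimable in the per-AG inode radix tree, assume the finobt is
	 * dominated by busy inodes and prefer allocating a new inode chunk.
	 */
	static bool
	xfs_finobt_mostly_busy(
		struct xfs_perag	*pag)
	{
		xfs_agino_t		first, last;
		bool			first_busy, last_busy;

		if (xfs_finobt_first_agino(pag, &first))	/* hypothetical */
			return false;			/* empty finobt */
		if (xfs_finobt_last_agino(pag, &last))		/* hypothetical */
			last = first;

		rcu_read_lock();
		first_busy = radix_tree_tag_get(&pag->pag_ici_root, first,
						XFS_ICI_RECLAIM_TAG);
		last_busy = (first == last) ? first_busy :
			    radix_tree_tag_get(&pag->pag_ici_root, last,
					       XFS_ICI_RECLAIM_TAG);
		rcu_read_unlock();

		return first_busy && last_busy;
	}

A caller that sees true here would point newino at a freshly allocated cluster rather than walking the finobt record by record looking for an unbusy inode.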
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 19:01 ` Brian Foster 2022-01-27 22:18 ` Dave Chinner @ 2022-01-28 21:39 ` Paul E. McKenney 2022-01-31 13:22 ` Brian Foster 1 sibling, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-01-28 21:39 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > Right, background inactivation does not improve performance - it's > > > > necessary to get the transactions out of the evict() path. All we > > > > wanted was to ensure that there were no performance degradations as > > > > a result of background inactivation, not that it was faster. > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > access when the batch size is increased, cpu profiles with 'perf > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > where I mention those things to Paul. > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > I'm not asking for that to happen this cycle and for backports Ian's > > > patch is obviously fine. > > > > Yes, but not in the near term. > > > > > What I really want to avoid is the situation when we are stuck with > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > thread do look like that. > > > > The simplest way I think is to have the XFS inode allocation track > > "busy inodes" in the same way we track "busy extents". A busy extent > > is an extent that has been freed by the user, but is not yet marked > > free in the journal/on disk. If we try to reallocate that busy > > extent, we either select a different free extent to allocate, or if > > we can't find any we force the journal to disk, wait for it to > > complete (hence unbusying the extents) and retry the allocation > > again. > > > > We can do something similar for inode allocation - it's actually a > > lockless tag lookup on the radix tree entry for the candidate inode > > number. If we find the reclaimable radix tree tag set, the we select > > a different inode. If we can't allocate a new inode, then we kick > > synchronize_rcu() and retry the allocation, allowing inodes to be > > recycled this time. > > I'm starting to poke around this area since it's become clear that the > currently proposed scheme just involves too much latency (unless Paul > chimes in with his expedited grace period variant, at which point I will > revisit) in the fast allocation/recycle path. ISTM so far that a simple > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > have pretty much the same pattern of behavior as this patch: one > synchronize_rcu() per batch. Apologies for being slow, but there have been some distractions. One of the distractions was trying to put together atheoretically attractive but massively overcomplicated implementation of poll_state_synchronize_rcu_expedited(). It currently looks like a somewhat suboptimal but much simpler approach is available. This assumes that XFS is not in the picture until after both the scheduler and workqueues are operational. 
And yes, the complicated version might prove necessary, but let's see if this whole thing is even useful first. ;-) In the meantime, if you want to look at an extremely unbaked view, here you go: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Thanx, Paul > IOW, background reclaim only kicks in after 30s by default, so the pool > of free inodes pretty much always consists of 100% reclaimable inodes. > On top of that, at smaller batch sizes, the pool tends to have a uniform > (!elapsed) grace period cookie, so a stall is required to be able to > allocate any of them. As the batch size increases, I do see the > population of free inodes start to contain a mix of expired and > non-expired grace period cookies. It's fairly easy to hack up an > internal icwalk scan to locate already expired inodes, but the problem > is that the recycle rate is so much faster than the grace period latency > that it doesn't really matter. We'll still have to stall by the time we > get to the non-expired inodes, and so we're back to one stall per batch > and the same general performance characteristic of this patch. > > So given all of this, I'm wondering about something like the following > high level inode allocation algorithm: > > 1. If the AG has any reclaimable inodes, scan for one with an expired > grace period. If found, target that inode for physical allocation. > > 2. If the AG free inode count == the AG reclaimable count and we know > all reclaimable inodes are most likely pending a grace period (because > the previous step failed), allocate a new inode chunk (and target it in > this allocation). > > 3. If the AG free inode count > the reclaimable count, scan the finobt > for an inode that is not present in the radix tree (i.e. Dave's logic > above). > > Each of those steps could involve some heuristics to maintain > predictable behavior and avoid large scans and such, but the general > idea is that the repeated alloc/free inode workload naturally populates > the AG with enough physical inodes to always be able to satisfy an > allocation without waiting on a grace period. IOW, this is effectively > similar behavior to if physical inode freeing was delayed to an rcu > callback, with the tradeoff of complicating the allocation path rather > than stalling in the inactivation pipeline. Thoughts? > > This of course is more involved than this patch (or similarly simple > variants of RCU delaying preexisting bits of code) and requires some > more investigation, but certainly shouldn't be a multi-year thing. The > question is probably more of whether it's enough complexity to justify > in the meantime... > > > > Are there any realistic prospects of having xfs_iget() deal with > > > reuse case by allocating new in-core inode and flipping whatever > > > references you've got in XFS journalling data structures to the > > > new copy? If I understood what you said on IRC correctly, that is... > > > > That's ... much harder. > > > > One of the problems is that once an inode has a log item attached to > > it, it assumes that it can be accessed without specific locking, > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > stuff that needs to be taken care of in XFS first, and the inode <-> > > log item relationship is tangled. > > > > I've been working towards removing that tangle - but taht stuff is > > quite a distance down my logging rework patch queue. 
THat queue has > > been stuck now for a year trying to get the first handful of rework > > and scalability modifications reviewed and merged, so I'm not > > holding my breathe as to how long a more substantial rework of > > internal logging code will take to review and merge. > > > > Really, though, we need the inactivation stuff to be done as part of > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > suspect we'll need some changes to iput_final()/evict() to allow us > > to process final unlinks in the bakground and then call evict() > > ourselves when the unlink completes. That way ->destroy_inode() can > > just call xfs_reclaim_inode() to free it directly, which also helps > > us get rid of background inode freeing and hence inode recycling > > from XFS altogether. I think we _might_ be able to do this without > > needing to change any of the logging code in XFS, but I haven't > > looked any further than this into it as yet. > > > > ... of whatever this ends up looking like. > > Can you elaborate on what you mean by processing unlinks in the > background? I can see the value of being able to eliminate the recycle > code in XFS, but wouldn't we still have to limit and throttle against > background work to maintain sustained removal performance? IOW, what's > the general teardown behavior you're getting at here, aside from what > parts push into the vfs or not? > > Brian > > > > Again, I'm not asking if it can be done this cycle; having a > > > realistic path to doing that eventually would be fine by me. > > > > We're talking a year at least, probably two, before we get there... > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
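For reference, the three-step allocation policy Brian sketches in the quoted text above could be expressed along these lines. The perag already carries free and reclaimable inode counts, as noted earlier in the thread, though the field names below are approximate; the enum, the expired-cookie scan helper, and the lock-free reads of the counters are hypothetical simplifications.

	enum xfs_ialloc_hint {
		XFS_IALLOC_RECYCLE_EXPIRED,	/* step 1: reuse an expired inode */
		XFS_IALLOC_NEW_CHUNK,		/* step 2: everything free is busy */
		XFS_IALLOC_SCAN_FINOBT,		/* step 3: plenty of unbusy inodes */
	};

	static enum xfs_ialloc_hint
	xfs_ialloc_pick_strategy(
		struct xfs_perag	*pag,
		xfs_agino_t		*aginop)
	{
		/*
		 * 1. If there are reclaimable inodes, look for one whose
		 *    grace period cookie has already expired and recycle it.
		 */
		if (pag->pag_ici_reclaimable &&		/* existing counter, name approximate */
		    !xfs_icwalk_find_expired(pag, aginop))	/* hypothetical */
			return XFS_IALLOC_RECYCLE_EXPIRED;

		/*
		 * 2. Every free inode is also reclaimable and, since step 1
		 *    failed, still waiting on a grace period: grow the pool.
		 */
		if (pag->pagi_freecount <= pag->pag_ici_reclaimable)
			return XFS_IALLOC_NEW_CHUNK;

		/*
		 * 3. There are free inodes with no in-core footprint: search
		 *    the finobt for one not present in the radix tree.
		 */
		return XFS_IALLOC_SCAN_FINOBT;
	}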
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-28 21:39 ` Paul E. McKenney @ 2022-01-31 13:22 ` Brian Foster 2022-02-01 22:00 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-31 13:22 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > necessary to get the transactions out of the evict() path. All we > > > > > wanted was to ensure that there were no performance degradations as > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > where I mention those things to Paul. > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > patch is obviously fine. > > > > > > Yes, but not in the near term. > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > thread do look like that. > > > > > > The simplest way I think is to have the XFS inode allocation track > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > is an extent that has been freed by the user, but is not yet marked > > > free in the journal/on disk. If we try to reallocate that busy > > > extent, we either select a different free extent to allocate, or if > > > we can't find any we force the journal to disk, wait for it to > > > complete (hence unbusying the extents) and retry the allocation > > > again. > > > > > > We can do something similar for inode allocation - it's actually a > > > lockless tag lookup on the radix tree entry for the candidate inode > > > number. If we find the reclaimable radix tree tag set, the we select > > > a different inode. If we can't allocate a new inode, then we kick > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > recycled this time. > > > > I'm starting to poke around this area since it's become clear that the > > currently proposed scheme just involves too much latency (unless Paul > > chimes in with his expedited grace period variant, at which point I will > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > have pretty much the same pattern of behavior as this patch: one > > synchronize_rcu() per batch. > > Apologies for being slow, but there have been some distractions. > One of the distractions was trying to put together atheoretically > attractive but massively overcomplicated implementation of > poll_state_synchronize_rcu_expedited(). 
It currently looks like a > somewhat suboptimal but much simpler approach is available. This > assumes that XFS is not in the picture until after both the scheduler > and workqueues are operational. > No worries.. I don't think that would be a roadblock for us. ;) > And yes, the complicated version might prove necessary, but let's > see if this whole thing is even useful first. ;-) > Indeed. This patch only really requires a single poll/sync pair of calls, so assuming the expedited grace period usage plays nice enough with typical !expedited usage elsewhere in the kernel for some basic tests, it would be fairly trivial to port this over and at least get an idea of what the worst case behavior might be with expedited grace periods, whether it satisfies the existing latency requirements, etc. Brian > In the meantime, if you want to look at an extremely unbaked view, > here you go: > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > Thanx, Paul > > > IOW, background reclaim only kicks in after 30s by default, so the pool > > of free inodes pretty much always consists of 100% reclaimable inodes. > > On top of that, at smaller batch sizes, the pool tends to have a uniform > > (!elapsed) grace period cookie, so a stall is required to be able to > > allocate any of them. As the batch size increases, I do see the > > population of free inodes start to contain a mix of expired and > > non-expired grace period cookies. It's fairly easy to hack up an > > internal icwalk scan to locate already expired inodes, but the problem > > is that the recycle rate is so much faster than the grace period latency > > that it doesn't really matter. We'll still have to stall by the time we > > get to the non-expired inodes, and so we're back to one stall per batch > > and the same general performance characteristic of this patch. > > > > So given all of this, I'm wondering about something like the following > > high level inode allocation algorithm: > > > > 1. If the AG has any reclaimable inodes, scan for one with an expired > > grace period. If found, target that inode for physical allocation. > > > > 2. If the AG free inode count == the AG reclaimable count and we know > > all reclaimable inodes are most likely pending a grace period (because > > the previous step failed), allocate a new inode chunk (and target it in > > this allocation). > > > > 3. If the AG free inode count > the reclaimable count, scan the finobt > > for an inode that is not present in the radix tree (i.e. Dave's logic > > above). > > > > Each of those steps could involve some heuristics to maintain > > predictable behavior and avoid large scans and such, but the general > > idea is that the repeated alloc/free inode workload naturally populates > > the AG with enough physical inodes to always be able to satisfy an > > allocation without waiting on a grace period. IOW, this is effectively > > similar behavior to if physical inode freeing was delayed to an rcu > > callback, with the tradeoff of complicating the allocation path rather > > than stalling in the inactivation pipeline. Thoughts? > > > > This of course is more involved than this patch (or similarly simple > > variants of RCU delaying preexisting bits of code) and requires some > > more investigation, but certainly shouldn't be a multi-year thing. The > > question is probably more of whether it's enough complexity to justify > > in the meantime... 
> > > > > > Are there any realistic prospects of having xfs_iget() deal with > > > > reuse case by allocating new in-core inode and flipping whatever > > > > references you've got in XFS journalling data structures to the > > > > new copy? If I understood what you said on IRC correctly, that is... > > > > > > That's ... much harder. > > > > > > One of the problems is that once an inode has a log item attached to > > > it, it assumes that it can be accessed without specific locking, > > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > > stuff that needs to be taken care of in XFS first, and the inode <-> > > > log item relationship is tangled. > > > > > > I've been working towards removing that tangle - but taht stuff is > > > quite a distance down my logging rework patch queue. THat queue has > > > been stuck now for a year trying to get the first handful of rework > > > and scalability modifications reviewed and merged, so I'm not > > > holding my breathe as to how long a more substantial rework of > > > internal logging code will take to review and merge. > > > > > > Really, though, we need the inactivation stuff to be done as part of > > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > to process final unlinks in the bakground and then call evict() > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > us get rid of background inode freeing and hence inode recycling > > > from XFS altogether. I think we _might_ be able to do this without > > > needing to change any of the logging code in XFS, but I haven't > > > looked any further than this into it as yet. > > > > > > > ... of whatever this ends up looking like. > > > > Can you elaborate on what you mean by processing unlinks in the > > background? I can see the value of being able to eliminate the recycle > > code in XFS, but wouldn't we still have to limit and throttle against > > background work to maintain sustained removal performance? IOW, what's > > the general teardown behavior you're getting at here, aside from what > > parts push into the vfs or not? > > > > Brian > > > > > > Again, I'm not asking if it can be done this cycle; having a > > > > realistic path to doing that eventually would be fine by me. > > > > > > We're talking a year at least, probably two, before we get there... > > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
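For context, the "single poll/sync pair of calls" Brian refers to above is the shape of the RCU usage in the patch under discussion: one call on the inactivation/destroy side to capture (and if necessary start) a grace period, and one on the recycle side that waits only if that grace period has not yet completed. The XFS hook names and the i_destroy_gp field name in the sketch below are placeholders; only the two RCU calls are the point.

	/* Destroy/inactivation side (placeholder hook name). */
	static void
	xfs_inodegc_note_gp(
		struct xfs_inode	*ip)
	{
		/* snapshot the current grace period, starting one if needed */
		ip->i_destroy_gp = start_poll_synchronize_rcu();
	}

	/* Allocation/recycle side (placeholder hook name). */
	static void
	xfs_iget_recycle_wait_gp(
		struct xfs_inode	*ip)
	{
		/* wait only if that grace period has not already elapsed */
		cond_synchronize_rcu(ip->i_destroy_gp);
	}

With the series Paul posts below, the same pair becomes start_poll_synchronize_rcu_expedited()/cond_synchronize_rcu_expedited(), which is why the port is expected to be trivial.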
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-31 13:22 ` Brian Foster @ 2022-02-01 22:00 ` Paul E. McKenney 2022-02-03 18:49 ` Paul E. McKenney 2022-02-07 13:30 ` Brian Foster 0 siblings, 2 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-02-01 22:00 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > where I mention those things to Paul. > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > patch is obviously fine. > > > > > > > > Yes, but not in the near term. > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > thread do look like that. > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > is an extent that has been freed by the user, but is not yet marked > > > > free in the journal/on disk. If we try to reallocate that busy > > > > extent, we either select a different free extent to allocate, or if > > > > we can't find any we force the journal to disk, wait for it to > > > > complete (hence unbusying the extents) and retry the allocation > > > > again. > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > a different inode. If we can't allocate a new inode, then we kick > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > recycled this time. > > > > > > I'm starting to poke around this area since it's become clear that the > > > currently proposed scheme just involves too much latency (unless Paul > > > chimes in with his expedited grace period variant, at which point I will > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > have pretty much the same pattern of behavior as this patch: one > > > synchronize_rcu() per batch. > > > > Apologies for being slow, but there have been some distractions. 
> > One of the distractions was trying to put together atheoretically > > attractive but massively overcomplicated implementation of > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > somewhat suboptimal but much simpler approach is available. This > > assumes that XFS is not in the picture until after both the scheduler > > and workqueues are operational. > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > And yes, the complicated version might prove necessary, but let's > > see if this whole thing is even useful first. ;-) > > > > Indeed. This patch only really requires a single poll/sync pair of > calls, so assuming the expedited grace period usage plays nice enough > with typical !expedited usage elsewhere in the kernel for some basic > tests, it would be fairly trivial to port this over and at least get an > idea of what the worst case behavior might be with expedited grace > periods, whether it satisfies the existing latency requirements, etc. > > Brian > > > In the meantime, if you want to look at an extremely unbaked view, > > here you go: > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing And here is a version that passes moderate rcutorture testing. So no obvious bugs. Probably a few non-obvious ones, though! ;-) This commit is on -rcu's "dev" branch along with this rcutorture addition: cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") I will carry these in -rcu's "dev" branch until at least the upcoming merge window, fixing bugs as and when they becom apparent. If I don't hear otherwise by that time, I will create a tag for it and leave it behind. The backport to v5.17-rc2 just requires removing: mutex_init(&rnp->boost_kthread_mutex); From rcu_init_one(). This line is added by this -rcu commit: 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") Please let me know how it goes! Thanx, Paul ------------------------------------------------------------------------ commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d Author: Paul E. McKenney <paulmck@kernel.org> Date: Mon Jan 31 16:55:52 2022 -0800 EXP rcu: Add polled expedited grace-period primitives This is an experimental proof of concept of polled expedited grace-period functions. These functions are get_state_synchronize_rcu_expedited(), start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), and cond_synchronize_rcu_expedited(), which are similar to get_state_synchronize_rcu(), start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. One limitation is that start_poll_synchronize_rcu_expedited() cannot be invoked before workqueues are initialized. Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. 
McKenney <paulmck@kernel.org> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 858f4d429946d..ca139b4b2d25f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) might_sleep(); } +static inline unsigned long get_state_synchronize_rcu_expedited(void) +{ + return get_state_synchronize_rcu(); +} + +static inline unsigned long start_poll_synchronize_rcu_expedited(void) +{ + return start_poll_synchronize_rcu(); +} + +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + return poll_state_synchronize_rcu(oldstate); +} + +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + cond_synchronize_rcu(oldstate); +} + extern void rcu_barrier(void); static inline void synchronize_rcu_expedited(void) diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 76665db179fa1..eb774e9be21bf 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); void rcu_momentary_dyntick_idle(void); void kfree_rcu_scheduler_running(void); bool rcu_gp_might_be_stalled(void); +unsigned long get_state_synchronize_rcu_expedited(void); +unsigned long start_poll_synchronize_rcu_expedited(void); +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); +void cond_synchronize_rcu_expedited(unsigned long oldstate); unsigned long get_state_synchronize_rcu(void); unsigned long start_poll_synchronize_rcu(void); bool poll_state_synchronize_rcu(unsigned long oldstate); diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 24b5f2c2de87b..5b61cf20c91e9 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -23,6 +23,13 @@ #define RCU_SEQ_CTR_SHIFT 2 #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) +/* + * Low-order bit definitions for polled grace-period APIs. + */ +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 +#define RCU_GET_STATE_USE_NORMAL 0x2 +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) + /* * Return the counter portion of a sequence number previously returned * by rcu_seq_snap() or rcu_seq_current(). diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e6ad532cffe78..5de36abcd7da1 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); */ bool poll_state_synchronize_rcu(unsigned long oldstate) { - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ return true; } @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); */ void cond_synchronize_rcu(unsigned long oldstate) { - if (!poll_state_synchronize_rcu(oldstate)) + if (!poll_state_synchronize_rcu(oldstate) && + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) synchronize_rcu(); } EXPORT_SYMBOL_GPL(cond_synchronize_rcu); @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) init_waitqueue_head(&rnp->exp_wq[3]); spin_lock_init(&rnp->exp_lock); mutex_init(&rnp->boost_kthread_mutex); + raw_spin_lock_init(&rnp->exp_poll_lock); + rnp->exp_seq_poll_rq = 0x1; + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); } } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 926673ebe355f..19fc9acce3ce2 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -128,6 +128,10 @@ struct rcu_node { wait_queue_head_t exp_wq[4]; struct rcu_exp_work rew; bool exp_need_flush; /* Need to flush workitem? */ + raw_spinlock_t exp_poll_lock; + /* Lock and data for polled expedited grace periods. */ + unsigned long exp_seq_poll_rq; + struct work_struct exp_poll_wq; } ____cacheline_internodealigned_in_smp; /* @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); static void check_cpu_stall(struct rcu_data *rdp); static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, const unsigned long gpssdelay); + +/* Forward declarations for tree_exp.h. */ +static void sync_rcu_do_polled_gp(struct work_struct *wp); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 1a45667402260..728896f374fee 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) destroy_work_on_stack(&rew.rew_work); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); + +/** + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + */ +unsigned long get_state_synchronize_rcu_expedited(void) +{ + if (rcu_gp_is_normal()) + return get_state_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + // Any prior manipulation of RCU-protected data must happen + // before the load from ->expedited_sequence. + smp_mb(); /* ^^^ */ + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); + +/* + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited + * RCU grace periods that it needs. 
+ */ +static void sync_rcu_do_polled_gp(struct work_struct *wp) +{ + unsigned long flags; + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); + unsigned long s; + + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + if (s & 0x1) + return; + while (!sync_exp_work_done(s)) + synchronize_rcu_expedited(); + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + if (!(s & 0x1) && !sync_exp_work_done(s)) + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + else + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); +} + +/** + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + * If the needed grace period is not already slated to start, initiates + * that grace period. + */ + +unsigned long start_poll_synchronize_rcu_expedited(void) +{ + unsigned long flags; + struct rcu_data *rdp; + struct rcu_node *rnp; + unsigned long s; + + if (rcu_gp_is_normal()) + return start_poll_synchronize_rcu_expedited() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + s = rcu_exp_gp_seq_snap(); + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); + rnp = rdp->mynode; + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { + rnp->exp_seq_poll_rq = s; + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + } + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + + return s | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); + +/** + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier call + * from which oldstate was obtained, return @true, otherwise return @false. + * If @false is returned, it is the caller's responsibility to invoke + * this function later on until it does return @true. Alternatively, + * the caller can explicitly wait for a grace period, for example, by + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly + * invoking synchronize_rcu_expedited(). + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!). + * Those needing to keep oldstate values for very long time periods + * (several hours even on 32-bit systems) should check them occasionally + * and either refresh them or set a flag indicating that the grace period + * has completed. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. 
+ */ +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (oldstate & RCU_GET_STATE_USE_NORMAL) + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) + return false; + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); + +/** + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier + * call from which oldstate was obtained, just return. Otherwise, invoke + * synchronize_rcu_expedited() to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. But + * counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!), + * so waiting for one additional grace period should be just fine. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. + */ +void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (poll_state_synchronize_rcu_expedited(oldstate)) + return; + if (oldstate & RCU_GET_STATE_USE_NORMAL) + synchronize_rcu_expedited(); + else + synchronize_rcu(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); ^ permalink raw reply related [flat|nested] 36+ messages in thread
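As a usage illustration of the four primitives added above (independent of XFS, and assuming only the semantics given in the kerneldoc comments), the caller-side pattern is the familiar cookie round trip:

	unsigned long cookie;

	/* Snapshot expedited grace-period state; start one if needed. */
	cookie = start_poll_synchronize_rcu_expedited();

	/* ... retire the object; do other useful work ... */

	/* Later, before reusing the object: */
	if (!poll_state_synchronize_rcu_expedited(cookie)) {
		/* Not expired yet: block for the remainder. */
		cond_synchronize_rcu_expedited(cookie);
	}

On kernels where expedited grace periods are disabled (rcupdate.rcu_normal=1), the cookie carries the RCU_GET_STATE_USE_NORMAL bit and the same calls fall back to ordinary grace periods, per the rcu_gp_is_normal() branches in the patch.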
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-01 22:00 ` Paul E. McKenney @ 2022-02-03 18:49 ` Paul E. McKenney 2022-02-07 13:30 ` Brian Foster 1 sibling, 0 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-02-03 18:49 UTC (permalink / raw) To: Brian Foster Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu, quic_neeraju On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote: > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > patch is obviously fine. > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > thread do look like that. > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > extent, we either select a different free extent to allocate, or if > > > > > we can't find any we force the journal to disk, wait for it to > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > again. > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > recycled this time. > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > currently proposed scheme just involves too much latency (unless Paul > > > > chimes in with his expedited grace period variant, at which point I will > > > > revisit) in the fast allocation/recycle path. 
ISTM so far that a simple > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > have pretty much the same pattern of behavior as this patch: one > > > > synchronize_rcu() per batch. > > > > > > Apologies for being slow, but there have been some distractions. > > > One of the distractions was trying to put together atheoretically > > > attractive but massively overcomplicated implementation of > > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > > somewhat suboptimal but much simpler approach is available. This > > > assumes that XFS is not in the picture until after both the scheduler > > > and workqueues are operational. > > > > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > > > And yes, the complicated version might prove necessary, but let's > > > see if this whole thing is even useful first. ;-) > > > > > > > Indeed. This patch only really requires a single poll/sync pair of > > calls, so assuming the expedited grace period usage plays nice enough > > with typical !expedited usage elsewhere in the kernel for some basic > > tests, it would be fairly trivial to port this over and at least get an > > idea of what the worst case behavior might be with expedited grace > > periods, whether it satisfies the existing latency requirements, etc. > > > > Brian > > > > > In the meantime, if you want to look at an extremely unbaked view, > > > here you go: > > > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > And here is a version that passes moderate rcutorture testing. So no > obvious bugs. Probably a few non-obvious ones, though! ;-) > > This commit is on -rcu's "dev" branch along with this rcutorture > addition: > > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") > > I will carry these in -rcu's "dev" branch until at least the upcoming > merge window, fixing bugs as and when they becom apparent. If I don't > hear otherwise by that time, I will create a tag for it and leave > it behind. > > The backport to v5.17-rc2 just requires removing: > > mutex_init(&rnp->boost_kthread_mutex); > > From rcu_init_one(). This line is added by this -rcu commit: > > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") And with some alleged fixes of issues Neeraj found when reviewing this, perhaps most notably the ability to run on real-time kernels booted with rcupdate.rcu_normal=1. This version passes reasonably heavy-duty rcutorture testing. Must mean bugs in rcutorture... :-/ f93fa07011bd ("EXP rcu: Add polled expedited grace-period primitives") Again, please let me know how it goes! Thanx, Paul ------------------------------------------------------------------------ commit f93fa07011bd2460f222e570d17968baff21fa90 Author: Paul E. McKenney <paulmck@kernel.org> Date: Mon Jan 31 16:55:52 2022 -0800 EXP rcu: Add polled expedited grace-period primitives This is an experimental proof of concept of polled expedited grace-period functions. These functions are get_state_synchronize_rcu_expedited(), start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), and cond_synchronize_rcu_expedited(), which are similar to get_state_synchronize_rcu(), start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. One limitation is that start_poll_synchronize_rcu_expedited() cannot be invoked before workqueues are initialized. 
Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 858f4d429946d..ca139b4b2d25f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) might_sleep(); } +static inline unsigned long get_state_synchronize_rcu_expedited(void) +{ + return get_state_synchronize_rcu(); +} + +static inline unsigned long start_poll_synchronize_rcu_expedited(void) +{ + return start_poll_synchronize_rcu(); +} + +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + return poll_state_synchronize_rcu(oldstate); +} + +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + cond_synchronize_rcu(oldstate); +} + extern void rcu_barrier(void); static inline void synchronize_rcu_expedited(void) diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 76665db179fa1..eb774e9be21bf 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); void rcu_momentary_dyntick_idle(void); void kfree_rcu_scheduler_running(void); bool rcu_gp_might_be_stalled(void); +unsigned long get_state_synchronize_rcu_expedited(void); +unsigned long start_poll_synchronize_rcu_expedited(void); +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); +void cond_synchronize_rcu_expedited(unsigned long oldstate); unsigned long get_state_synchronize_rcu(void); unsigned long start_poll_synchronize_rcu(void); bool poll_state_synchronize_rcu(unsigned long oldstate); diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 24b5f2c2de87b..5b61cf20c91e9 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -23,6 +23,13 @@ #define RCU_SEQ_CTR_SHIFT 2 #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) +/* + * Low-order bit definitions for polled grace-period APIs. + */ +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 +#define RCU_GET_STATE_USE_NORMAL 0x2 +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) + /* * Return the counter portion of a sequence number previously returned * by rcu_seq_snap() or rcu_seq_current(). diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e6ad532cffe78..135d5e2bce879 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); */ bool poll_state_synchronize_rcu(unsigned long oldstate) { - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ return true; } @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); */ void cond_synchronize_rcu(unsigned long oldstate) { - if (!poll_state_synchronize_rcu(oldstate)) + if (!poll_state_synchronize_rcu(oldstate) || + WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) synchronize_rcu(); } EXPORT_SYMBOL_GPL(cond_synchronize_rcu); @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) init_waitqueue_head(&rnp->exp_wq[3]); spin_lock_init(&rnp->exp_lock); mutex_init(&rnp->boost_kthread_mutex); + raw_spin_lock_init(&rnp->exp_poll_lock); + rnp->exp_seq_poll_rq = 0x1; + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); } } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 926673ebe355f..19fc9acce3ce2 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -128,6 +128,10 @@ struct rcu_node { wait_queue_head_t exp_wq[4]; struct rcu_exp_work rew; bool exp_need_flush; /* Need to flush workitem? */ + raw_spinlock_t exp_poll_lock; + /* Lock and data for polled expedited grace periods. */ + unsigned long exp_seq_poll_rq; + struct work_struct exp_poll_wq; } ____cacheline_internodealigned_in_smp; /* @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); static void check_cpu_stall(struct rcu_data *rdp); static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, const unsigned long gpssdelay); + +/* Forward declarations for tree_exp.h. */ +static void sync_rcu_do_polled_gp(struct work_struct *wp); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 1a45667402260..4041988086830 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) destroy_work_on_stack(&rew.rew_work); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); + +/** + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + */ +unsigned long get_state_synchronize_rcu_expedited(void) +{ + if (rcu_gp_is_normal()) + return get_state_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + // Any prior manipulation of RCU-protected data must happen + // before the load from ->expedited_sequence, and this ordering is + // provided by rcu_exp_gp_seq_snap(). + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); + +/* + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited + * RCU grace periods that it needs. 
+ */ +static void sync_rcu_do_polled_gp(struct work_struct *wp) +{ + unsigned long flags; + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); + unsigned long s; + + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + if (s & 0x1) + return; + while (!sync_exp_work_done(s)) + synchronize_rcu_expedited(); + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + if (!(s & 0x1) && !sync_exp_work_done(s)) + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + else + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); +} + +/** + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + * If the needed grace period is not already slated to start, initiates + * that grace period. + */ + +unsigned long start_poll_synchronize_rcu_expedited(void) +{ + unsigned long flags; + struct rcu_data *rdp; + struct rcu_node *rnp; + unsigned long s; + + if (rcu_gp_is_normal()) + return start_poll_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + s = rcu_exp_gp_seq_snap(); + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); + rnp = rdp->mynode; + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { + rnp->exp_seq_poll_rq = s; + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + } + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + + return s | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); + +/** + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier call + * from which oldstate was obtained, return @true, otherwise return @false. + * If @false is returned, it is the caller's responsibility to invoke + * this function later on until it does return @true. Alternatively, + * the caller can explicitly wait for a grace period, for example, by + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly + * invoking synchronize_rcu_expedited(). + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!). + * Those needing to keep oldstate values for very long time periods + * (several hours even on 32-bit systems) should check them occasionally + * and either refresh them or set a flag indicating that the grace period + * has completed. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. 
+ */ +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (oldstate & RCU_GET_STATE_USE_NORMAL) + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) + return false; + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); + +/** + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier + * call from which oldstate was obtained, just return. Otherwise, invoke + * synchronize_rcu_expedited() to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. But + * counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!), + * so waiting for one additional grace period should be just fine. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. + */ +void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (poll_state_synchronize_rcu_expedited(oldstate)) + return; + if (oldstate & RCU_GET_STATE_USE_NORMAL) + synchronize_rcu(); + else + synchronize_rcu_expedited(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); ^ permalink raw reply related [flat|nested] 36+ messages in thread
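For reference, here is a minimal usage sketch of the four interfaces added by the patch above, assuming that patch is applied. Only the *_synchronize_rcu_expedited() calls come from the patch; the object, field, and helper names below are invented purely for illustration.

#include <linux/rcupdate.h>	/* declarations come via the rcutiny.h/rcutree.h hunks above */

/* Hypothetical object that gets torn down and later recycled. */
struct recycle_obj {
	unsigned long	gp_cookie;	/* cookie from start_poll_synchronize_rcu_expedited() */
	/* ... payload ... */
};

static void recycle_obj_teardown(struct recycle_obj *obj)
{
	/* Snapshot expedited GP state and nudge a grace period along. */
	obj->gp_cookie = start_poll_synchronize_rcu_expedited();
	/* ... queue obj on a deferred-free/reuse list (not shown) ... */
}

static void recycle_obj_reuse(struct recycle_obj *obj)
{
	/*
	 * If the grace period started at teardown has already elapsed,
	 * this returns immediately; otherwise it waits for an expedited
	 * (or, with rcupdate.rcu_normal=1, a normal) grace period.
	 */
	cond_synchronize_rcu_expedited(obj->gp_cookie);
	/* ... now safe to reinitialize obj without racing RCU readers ... */
}

The same shape works with the pre-existing get_state/start_poll/poll_state/cond_synchronize_rcu() calls; the expedited variants only shorten the wait taken on the slow path.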
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-01 22:00 ` Paul E. McKenney 2022-02-03 18:49 ` Paul E. McKenney @ 2022-02-07 13:30 ` Brian Foster 2022-02-07 16:36 ` Paul E. McKenney 1 sibling, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-02-07 13:30 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote: > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > patch is obviously fine. > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > thread do look like that. > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > extent, we either select a different free extent to allocate, or if > > > > > we can't find any we force the journal to disk, wait for it to > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > again. > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > recycled this time. > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > currently proposed scheme just involves too much latency (unless Paul > > > > chimes in with his expedited grace period variant, at which point I will > > > > revisit) in the fast allocation/recycle path. 
ISTM so far that a simple > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > have pretty much the same pattern of behavior as this patch: one > > > > synchronize_rcu() per batch. > > > > > > Apologies for being slow, but there have been some distractions. > > > One of the distractions was trying to put together atheoretically > > > attractive but massively overcomplicated implementation of > > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > > somewhat suboptimal but much simpler approach is available. This > > > assumes that XFS is not in the picture until after both the scheduler > > > and workqueues are operational. > > > > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > > > And yes, the complicated version might prove necessary, but let's > > > see if this whole thing is even useful first. ;-) > > > > > > > Indeed. This patch only really requires a single poll/sync pair of > > calls, so assuming the expedited grace period usage plays nice enough > > with typical !expedited usage elsewhere in the kernel for some basic > > tests, it would be fairly trivial to port this over and at least get an > > idea of what the worst case behavior might be with expedited grace > > periods, whether it satisfies the existing latency requirements, etc. > > > > Brian > > > > > In the meantime, if you want to look at an extremely unbaked view, > > > here you go: > > > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > And here is a version that passes moderate rcutorture testing. So no > obvious bugs. Probably a few non-obvious ones, though! ;-) > > This commit is on -rcu's "dev" branch along with this rcutorture > addition: > > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") > > I will carry these in -rcu's "dev" branch until at least the upcoming > merge window, fixing bugs as and when they becom apparent. If I don't > hear otherwise by that time, I will create a tag for it and leave > it behind. > > The backport to v5.17-rc2 just requires removing: > > mutex_init(&rnp->boost_kthread_mutex); > > From rcu_init_one(). This line is added by this -rcu commit: > > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") > > Please let me know how it goes! > Thanks Paul. I gave this a whirl with a ported variant of this patch on top. There is definitely a notable improvement with the expedited grace periods. A few quick runs of the same batched alloc/free test (i.e. 10 sample) I had run against the original version: batch baseline baseline+bg test test+bg 1 889954 210075 552911 25540 4 879540 212740 575356 24624 8 924928 213568 496992 26080 16 922960 211504 518496 24592 32 844832 219744 524672 28608 64 579968 196544 358720 24128 128 667392 195840 397696 22400 256 624896 197888 376320 31232 512 572928 204800 382464 46080 1024 549888 174080 379904 73728 2048 522240 174080 350208 106496 4096 536576 167936 360448 131072 So this shows a major improvement in the case where the system is otherwise idle. We still aren't quite at the baseline numbers, but that's not really the goal here because those numbers are partly driven by the fact that we unsafely reuse recently freed inodes in cases where proper behavior would be to allocate new inode chunks for a period of time. The core test numbers are much closer to the single threaded allocation rate (55k-65k inodes/sec) on this setup, so that is quite positive. 
The "bg" variants are the same tests with 64 tasks doing unrelated pathwalk listings on a kernel source tree (on separate storage) concurrently in the background. The purpose of this was just to generate background (rcu) activity in the form of pathname lookups and whatnot and see how that impacts the results. This clearly affects both kernels, but the test kernel drops down closer to numbers reminiscent of the non-expedited grace period variant. Note that this impact seems to scale with increased background workload. With a similar test running only 8 background tasks, the test kernel is pretty consistently in the 225k-250k (per 10s) range across the set of batch sizes. That's about half the core test rate, so still not as terrible as the original variant. ;) In any event, this probably requires some thought/discussion (and more testing) on whether this is considered an acceptable change or whether we want to explore options to mitigate this further. I am still playing with some ideas to potentially mitigate grace period latency, so it might be worth seeing if anything useful falls out of that as well. Thoughts appreciated... Brian > Thanx, Paul > > ------------------------------------------------------------------------ > > commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d > Author: Paul E. McKenney <paulmck@kernel.org> > Date: Mon Jan 31 16:55:52 2022 -0800 > > EXP rcu: Add polled expedited grace-period primitives > > This is an experimental proof of concept of polled expedited grace-period > functions. These functions are get_state_synchronize_rcu_expedited(), > start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), > and cond_synchronize_rcu_expedited(), which are similar to > get_state_synchronize_rcu(), start_poll_synchronize_rcu(), > poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. > > One limitation is that start_poll_synchronize_rcu_expedited() cannot > be invoked before workqueues are initialized. > > Cc: Brian Foster <bfoster@redhat.com> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Ian Kent <raven@themaw.net> > Signed-off-by: Paul E. 
McKenney <paulmck@kernel.org> > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h > index 858f4d429946d..ca139b4b2d25f 100644 > --- a/include/linux/rcutiny.h > +++ b/include/linux/rcutiny.h > @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) > might_sleep(); > } > > +static inline unsigned long get_state_synchronize_rcu_expedited(void) > +{ > + return get_state_synchronize_rcu(); > +} > + > +static inline unsigned long start_poll_synchronize_rcu_expedited(void) > +{ > + return start_poll_synchronize_rcu(); > +} > + > +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + return poll_state_synchronize_rcu(oldstate); > +} > + > +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + cond_synchronize_rcu(oldstate); > +} > + > extern void rcu_barrier(void); > > static inline void synchronize_rcu_expedited(void) > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h > index 76665db179fa1..eb774e9be21bf 100644 > --- a/include/linux/rcutree.h > +++ b/include/linux/rcutree.h > @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); > void rcu_momentary_dyntick_idle(void); > void kfree_rcu_scheduler_running(void); > bool rcu_gp_might_be_stalled(void); > +unsigned long get_state_synchronize_rcu_expedited(void); > +unsigned long start_poll_synchronize_rcu_expedited(void); > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); > +void cond_synchronize_rcu_expedited(unsigned long oldstate); > unsigned long get_state_synchronize_rcu(void); > unsigned long start_poll_synchronize_rcu(void); > bool poll_state_synchronize_rcu(unsigned long oldstate); > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 24b5f2c2de87b..5b61cf20c91e9 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -23,6 +23,13 @@ > #define RCU_SEQ_CTR_SHIFT 2 > #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) > > +/* > + * Low-order bit definitions for polled grace-period APIs. > + */ > +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 > +#define RCU_GET_STATE_USE_NORMAL 0x2 > +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) > + > /* > * Return the counter portion of a sequence number previously returned > * by rcu_seq_snap() or rcu_seq_current(). > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index e6ad532cffe78..5de36abcd7da1 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); > */ > bool poll_state_synchronize_rcu(unsigned long oldstate) > { > - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { > + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { > smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ > return true; > } > @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); > */ > void cond_synchronize_rcu(unsigned long oldstate) > { > - if (!poll_state_synchronize_rcu(oldstate)) > + if (!poll_state_synchronize_rcu(oldstate) && > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) > synchronize_rcu(); > } > EXPORT_SYMBOL_GPL(cond_synchronize_rcu); > @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) > init_waitqueue_head(&rnp->exp_wq[3]); > spin_lock_init(&rnp->exp_lock); > mutex_init(&rnp->boost_kthread_mutex); > + raw_spin_lock_init(&rnp->exp_poll_lock); > + rnp->exp_seq_poll_rq = 0x1; > + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); > } > } > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index 926673ebe355f..19fc9acce3ce2 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -128,6 +128,10 @@ struct rcu_node { > wait_queue_head_t exp_wq[4]; > struct rcu_exp_work rew; > bool exp_need_flush; /* Need to flush workitem? */ > + raw_spinlock_t exp_poll_lock; > + /* Lock and data for polled expedited grace periods. */ > + unsigned long exp_seq_poll_rq; > + struct work_struct exp_poll_wq; > } ____cacheline_internodealigned_in_smp; > > /* > @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); > static void check_cpu_stall(struct rcu_data *rdp); > static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, > const unsigned long gpssdelay); > + > +/* Forward declarations for tree_exp.h. */ > +static void sync_rcu_do_polled_gp(struct work_struct *wp); > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 1a45667402260..728896f374fee 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) > destroy_work_on_stack(&rew.rew_work); > } > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); > + > +/** > + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state > + * > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > + * whether or not a full expedited grace period has elapsed in the meantime. > + */ > +unsigned long get_state_synchronize_rcu_expedited(void) > +{ > + if (rcu_gp_is_normal()) > + return get_state_synchronize_rcu() | > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > + > + // Any prior manipulation of RCU-protected data must happen > + // before the load from ->expedited_sequence. > + smp_mb(); /* ^^^ */ > + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; > +} > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); > + > +/* > + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited > + * RCU grace periods that it needs. 
> + */ > +static void sync_rcu_do_polled_gp(struct work_struct *wp) > +{ > + unsigned long flags; > + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); > + unsigned long s; > + > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > + s = rnp->exp_seq_poll_rq; > + rnp->exp_seq_poll_rq |= 0x1; > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > + if (s & 0x1) > + return; > + while (!sync_exp_work_done(s)) > + synchronize_rcu_expedited(); > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > + s = rnp->exp_seq_poll_rq; > + if (!(s & 0x1) && !sync_exp_work_done(s)) > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > + else > + rnp->exp_seq_poll_rq |= 0x1; > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > +} > + > +/** > + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period > + * > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > + * whether or not a full expedited grace period has elapsed in the meantime. > + * If the needed grace period is not already slated to start, initiates > + * that grace period. > + */ > + > +unsigned long start_poll_synchronize_rcu_expedited(void) > +{ > + unsigned long flags; > + struct rcu_data *rdp; > + struct rcu_node *rnp; > + unsigned long s; > + > + if (rcu_gp_is_normal()) > + return start_poll_synchronize_rcu_expedited() | > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > + > + s = rcu_exp_gp_seq_snap(); > + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); > + rnp = rdp->mynode; > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { > + rnp->exp_seq_poll_rq = s; > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > + } > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > + > + return s | RCU_GET_STATE_FROM_EXPEDITED; > +} > +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); > + > +/** > + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > + * > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > + * > + * If a full expedited RCU grace period has elapsed since the earlier call > + * from which oldstate was obtained, return @true, otherwise return @false. > + * If @false is returned, it is the caller's responsibility to invoke > + * this function later on until it does return @true. Alternatively, > + * the caller can explicitly wait for a grace period, for example, by > + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly > + * invoking synchronize_rcu_expedited(). > + * > + * Yes, this function does not take counter wrap into account. > + * But counter wrap is harmless. If the counter wraps, we have waited for > + * more than 2 billion grace periods (and way more on a 64-bit system!). > + * Those needing to keep oldstate values for very long time periods > + * (several hours even on 32-bit systems) should check them occasionally > + * and either refresh them or set a flag indicating that the grace period > + * has completed. > + * > + * This function provides the same memory-ordering guarantees that would > + * be provided by a synchronize_rcu_expedited() that was invoked at the > + * call to the function that provided @oldstate, and that returned at the > + * end of this function. 
> + */ > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); > + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) > + return false; > + smp_mb(); /* Ensure GP ends before subsequent accesses. */ > + return true; > +} > +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); > + > +/** > + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > + * > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > + * > + * If a full expedited RCU grace period has elapsed since the earlier > + * call from which oldstate was obtained, just return. Otherwise, invoke > + * synchronize_rcu_expedited() to wait for a full grace period. > + * > + * Yes, this function does not take counter wrap into account. But > + * counter wrap is harmless. If the counter wraps, we have waited for > + * more than 2 billion grace periods (and way more on a 64-bit system!), > + * so waiting for one additional grace period should be just fine. > + * > + * This function provides the same memory-ordering guarantees that would > + * be provided by a synchronize_rcu_expedited() that was invoked at the > + * call to the function that provided @oldstate, and that returned at the > + * end of this function. > + */ > +void cond_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > + if (poll_state_synchronize_rcu_expedited(oldstate)) > + return; > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > + synchronize_rcu_expedited(); > + else > + synchronize_rcu(); > +} > +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); > ^ permalink raw reply [flat|nested] 36+ messages in thread
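As a rough illustration of the "skip inodes in the radix tree, sync rcu if unsuccessful" idea referred to above, the allocation loop might take roughly the following shape. Every type and helper here is a hypothetical placeholder; only the overall pattern -- prefer candidates whose grace period has passed, fall back to one synchronize_rcu() per batch -- is taken from the thread.

#include <linux/errno.h>
#include <linux/rcupdate.h>

/* Hypothetical allocation context and candidate-selection helpers. */
struct alloc_ctx;
bool pick_non_busy_candidate(struct alloc_ctx *ctx, unsigned long *inop);
bool any_candidates(struct alloc_ctx *ctx);

static int alloc_inode_number(struct alloc_ctx *ctx, unsigned long *inop)
{
	for (;;) {
		/* Prefer a candidate that is not still "busy" under RCU. */
		if (pick_non_busy_candidate(ctx, inop))
			return 0;

		/* Nothing free at all, busy or otherwise. */
		if (!any_candidates(ctx))
			return -ENOSPC;

		/*
		 * Only busy (recently freed, grace period still pending)
		 * candidates remain: wait out a grace period so they become
		 * safe to recycle, then retry.  This works out to roughly
		 * one synchronize_rcu() per allocation batch.
		 */
		synchronize_rcu();
	}
}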
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-07 13:30 ` Brian Foster @ 2022-02-07 16:36 ` Paul E. McKenney 2022-02-10 4:09 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-02-07 16:36 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote: > > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > > patch is obviously fine. > > > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > > thread do look like that. > > > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > > extent, we either select a different free extent to allocate, or if > > > > > > we can't find any we force the journal to disk, wait for it to > > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > > again. > > > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > > recycled this time. > > > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > > currently proposed scheme just involves too much latency (unless Paul > > > > > chimes in with his expedited grace period variant, at which point I will > > > > > revisit) in the fast allocation/recycle path. 
ISTM so far that a simple > > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > > have pretty much the same pattern of behavior as this patch: one > > > > > synchronize_rcu() per batch. > > > > > > > > Apologies for being slow, but there have been some distractions. > > > > One of the distractions was trying to put together atheoretically > > > > attractive but massively overcomplicated implementation of > > > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > > > somewhat suboptimal but much simpler approach is available. This > > > > assumes that XFS is not in the picture until after both the scheduler > > > > and workqueues are operational. > > > > > > > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > > > > > And yes, the complicated version might prove necessary, but let's > > > > see if this whole thing is even useful first. ;-) > > > > > > > > > > Indeed. This patch only really requires a single poll/sync pair of > > > calls, so assuming the expedited grace period usage plays nice enough > > > with typical !expedited usage elsewhere in the kernel for some basic > > > tests, it would be fairly trivial to port this over and at least get an > > > idea of what the worst case behavior might be with expedited grace > > > periods, whether it satisfies the existing latency requirements, etc. > > > > > > Brian > > > > > > > In the meantime, if you want to look at an extremely unbaked view, > > > > here you go: > > > > > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > > > And here is a version that passes moderate rcutorture testing. So no > > obvious bugs. Probably a few non-obvious ones, though! ;-) > > > > This commit is on -rcu's "dev" branch along with this rcutorture > > addition: > > > > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") > > > > I will carry these in -rcu's "dev" branch until at least the upcoming > > merge window, fixing bugs as and when they becom apparent. If I don't > > hear otherwise by that time, I will create a tag for it and leave > > it behind. > > > > The backport to v5.17-rc2 just requires removing: > > > > mutex_init(&rnp->boost_kthread_mutex); > > > > From rcu_init_one(). This line is added by this -rcu commit: > > > > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") > > > > Please let me know how it goes! > > > > Thanks Paul. I gave this a whirl with a ported variant of this patch on > top. There is definitely a notable improvement with the expedited grace > periods. A few quick runs of the same batched alloc/free test (i.e. 10 > sample) I had run against the original version: > > batch baseline baseline+bg test test+bg > > 1 889954 210075 552911 25540 > 4 879540 212740 575356 24624 > 8 924928 213568 496992 26080 > 16 922960 211504 518496 24592 > 32 844832 219744 524672 28608 > 64 579968 196544 358720 24128 > 128 667392 195840 397696 22400 > 256 624896 197888 376320 31232 > 512 572928 204800 382464 46080 > 1024 549888 174080 379904 73728 > 2048 522240 174080 350208 106496 > 4096 536576 167936 360448 131072 > > So this shows a major improvement in the case where the system is > otherwise idle. 
We still aren't quite at the baseline numbers, but > that's not really the goal here because those numbers are partly driven > by the fact that we unsafely reuse recently freed inodes in cases where > proper behavior would be to allocate new inode chunks for a period of > time. The core test numbers are much closer to the single threaded > allocation rate (55k-65k inodes/sec) on this setup, so that is quite > positive. > > The "bg" variants are the same tests with 64 tasks doing unrelated > pathwalk listings on a kernel source tree (on separate storage) > concurrently in the background. The purpose of this was just to generate > background (rcu) activity in the form of pathname lookups and whatnot > and see how that impacts the results. This clearly affects both kernels, > but the test kernel drops down closer to numbers reminiscent of the > non-expedited grace period variant. Note that this impact seems to scale > with increased background workload. With a similar test running only 8 > background tasks, the test kernel is pretty consistently in the > 225k-250k (per 10s) range across the set of batch sizes. That's about > half the core test rate, so still not as terrible as the original > variant. ;) > > In any event, this probably requires some thought/discussion (and more > testing) on whether this is considered an acceptable change or whether > we want to explore options to mitigate this further. I am still playing > with some ideas to potentially mitigate grace period latency, so it > might be worth seeing if anything useful falls out of that as well. > Thoughts appreciated... So this fixes a bug, but results in many 10s of percent performance degradation? Ouch... Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate reuse of freed memory, but also requires pointer traversals to the memory to do a revalidation operation. (Sorry, no free lunch here!) Thanx, Paul > Brian > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d > > Author: Paul E. McKenney <paulmck@kernel.org> > > Date: Mon Jan 31 16:55:52 2022 -0800 > > > > EXP rcu: Add polled expedited grace-period primitives > > > > This is an experimental proof of concept of polled expedited grace-period > > functions. These functions are get_state_synchronize_rcu_expedited(), > > start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), > > and cond_synchronize_rcu_expedited(), which are similar to > > get_state_synchronize_rcu(), start_poll_synchronize_rcu(), > > poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. > > > > One limitation is that start_poll_synchronize_rcu_expedited() cannot > > be invoked before workqueues are initialized. > > > > Cc: Brian Foster <bfoster@redhat.com> > > Cc: Dave Chinner <david@fromorbit.com> > > Cc: Al Viro <viro@zeniv.linux.org.uk> > > Cc: Ian Kent <raven@themaw.net> > > Signed-off-by: Paul E. 
McKenney <paulmck@kernel.org> > > > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h > > index 858f4d429946d..ca139b4b2d25f 100644 > > --- a/include/linux/rcutiny.h > > +++ b/include/linux/rcutiny.h > > @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) > > might_sleep(); > > } > > > > +static inline unsigned long get_state_synchronize_rcu_expedited(void) > > +{ > > + return get_state_synchronize_rcu(); > > +} > > + > > +static inline unsigned long start_poll_synchronize_rcu_expedited(void) > > +{ > > + return start_poll_synchronize_rcu(); > > +} > > + > > +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + return poll_state_synchronize_rcu(oldstate); > > +} > > + > > +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + cond_synchronize_rcu(oldstate); > > +} > > + > > extern void rcu_barrier(void); > > > > static inline void synchronize_rcu_expedited(void) > > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h > > index 76665db179fa1..eb774e9be21bf 100644 > > --- a/include/linux/rcutree.h > > +++ b/include/linux/rcutree.h > > @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); > > void rcu_momentary_dyntick_idle(void); > > void kfree_rcu_scheduler_running(void); > > bool rcu_gp_might_be_stalled(void); > > +unsigned long get_state_synchronize_rcu_expedited(void); > > +unsigned long start_poll_synchronize_rcu_expedited(void); > > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); > > +void cond_synchronize_rcu_expedited(unsigned long oldstate); > > unsigned long get_state_synchronize_rcu(void); > > unsigned long start_poll_synchronize_rcu(void); > > bool poll_state_synchronize_rcu(unsigned long oldstate); > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > index 24b5f2c2de87b..5b61cf20c91e9 100644 > > --- a/kernel/rcu/rcu.h > > +++ b/kernel/rcu/rcu.h > > @@ -23,6 +23,13 @@ > > #define RCU_SEQ_CTR_SHIFT 2 > > #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) > > > > +/* > > + * Low-order bit definitions for polled grace-period APIs. > > + */ > > +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 > > +#define RCU_GET_STATE_USE_NORMAL 0x2 > > +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) > > + > > /* > > * Return the counter portion of a sequence number previously returned > > * by rcu_seq_snap() or rcu_seq_current(). > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index e6ad532cffe78..5de36abcd7da1 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); > > */ > > bool poll_state_synchronize_rcu(unsigned long oldstate) > > { > > - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { > > + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && > > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { > > smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ > > return true; > > } > > @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); > > */ > > void cond_synchronize_rcu(unsigned long oldstate) > > { > > - if (!poll_state_synchronize_rcu(oldstate)) > > + if (!poll_state_synchronize_rcu(oldstate) && > > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) > > synchronize_rcu(); > > } > > EXPORT_SYMBOL_GPL(cond_synchronize_rcu); > > @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) > > init_waitqueue_head(&rnp->exp_wq[3]); > > spin_lock_init(&rnp->exp_lock); > > mutex_init(&rnp->boost_kthread_mutex); > > + raw_spin_lock_init(&rnp->exp_poll_lock); > > + rnp->exp_seq_poll_rq = 0x1; > > + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); > > } > > } > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > > index 926673ebe355f..19fc9acce3ce2 100644 > > --- a/kernel/rcu/tree.h > > +++ b/kernel/rcu/tree.h > > @@ -128,6 +128,10 @@ struct rcu_node { > > wait_queue_head_t exp_wq[4]; > > struct rcu_exp_work rew; > > bool exp_need_flush; /* Need to flush workitem? */ > > + raw_spinlock_t exp_poll_lock; > > + /* Lock and data for polled expedited grace periods. */ > > + unsigned long exp_seq_poll_rq; > > + struct work_struct exp_poll_wq; > > } ____cacheline_internodealigned_in_smp; > > > > /* > > @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); > > static void check_cpu_stall(struct rcu_data *rdp); > > static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, > > const unsigned long gpssdelay); > > + > > +/* Forward declarations for tree_exp.h. */ > > +static void sync_rcu_do_polled_gp(struct work_struct *wp); > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 1a45667402260..728896f374fee 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) > > destroy_work_on_stack(&rew.rew_work); > > } > > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); > > + > > +/** > > + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state > > + * > > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > > + * whether or not a full expedited grace period has elapsed in the meantime. > > + */ > > +unsigned long get_state_synchronize_rcu_expedited(void) > > +{ > > + if (rcu_gp_is_normal()) > > + return get_state_synchronize_rcu() | > > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > > + > > + // Any prior manipulation of RCU-protected data must happen > > + // before the load from ->expedited_sequence. > > + smp_mb(); /* ^^^ */ > > + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; > > +} > > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); > > + > > +/* > > + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited > > + * RCU grace periods that it needs. 
> > + */ > > +static void sync_rcu_do_polled_gp(struct work_struct *wp) > > +{ > > + unsigned long flags; > > + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); > > + unsigned long s; > > + > > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > > + s = rnp->exp_seq_poll_rq; > > + rnp->exp_seq_poll_rq |= 0x1; > > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > > + if (s & 0x1) > > + return; > > + while (!sync_exp_work_done(s)) > > + synchronize_rcu_expedited(); > > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > > + s = rnp->exp_seq_poll_rq; > > + if (!(s & 0x1) && !sync_exp_work_done(s)) > > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > > + else > > + rnp->exp_seq_poll_rq |= 0x1; > > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > > +} > > + > > +/** > > + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period > > + * > > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > > + * whether or not a full expedited grace period has elapsed in the meantime. > > + * If the needed grace period is not already slated to start, initiates > > + * that grace period. > > + */ > > + > > +unsigned long start_poll_synchronize_rcu_expedited(void) > > +{ > > + unsigned long flags; > > + struct rcu_data *rdp; > > + struct rcu_node *rnp; > > + unsigned long s; > > + > > + if (rcu_gp_is_normal()) > > + return start_poll_synchronize_rcu_expedited() | > > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > > + > > + s = rcu_exp_gp_seq_snap(); > > + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); > > + rnp = rdp->mynode; > > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > > + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { > > + rnp->exp_seq_poll_rq = s; > > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > > + } > > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > > + > > + return s | RCU_GET_STATE_FROM_EXPEDITED; > > +} > > +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); > > + > > +/** > > + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > > + * > > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > > + * > > + * If a full expedited RCU grace period has elapsed since the earlier call > > + * from which oldstate was obtained, return @true, otherwise return @false. > > + * If @false is returned, it is the caller's responsibility to invoke > > + * this function later on until it does return @true. Alternatively, > > + * the caller can explicitly wait for a grace period, for example, by > > + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly > > + * invoking synchronize_rcu_expedited(). > > + * > > + * Yes, this function does not take counter wrap into account. > > + * But counter wrap is harmless. If the counter wraps, we have waited for > > + * more than 2 billion grace periods (and way more on a 64-bit system!). > > + * Those needing to keep oldstate values for very long time periods > > + * (several hours even on 32-bit systems) should check them occasionally > > + * and either refresh them or set a flag indicating that the grace period > > + * has completed. 
> > + * > > + * This function provides the same memory-ordering guarantees that would > > + * be provided by a synchronize_rcu_expedited() that was invoked at the > > + * call to the function that provided @oldstate, and that returned at the > > + * end of this function. > > + */ > > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > > + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); > > + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) > > + return false; > > + smp_mb(); /* Ensure GP ends before subsequent accesses. */ > > + return true; > > +} > > +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); > > + > > +/** > > + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > > + * > > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > > + * > > + * If a full expedited RCU grace period has elapsed since the earlier > > + * call from which oldstate was obtained, just return. Otherwise, invoke > > + * synchronize_rcu_expedited() to wait for a full grace period. > > + * > > + * Yes, this function does not take counter wrap into account. But > > + * counter wrap is harmless. If the counter wraps, we have waited for > > + * more than 2 billion grace periods (and way more on a 64-bit system!), > > + * so waiting for one additional grace period should be just fine. > > + * > > + * This function provides the same memory-ordering guarantees that would > > + * be provided by a synchronize_rcu_expedited() that was invoked at the > > + * call to the function that provided @oldstate, and that returned at the > > + * end of this function. > > + */ > > +void cond_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > > + if (poll_state_synchronize_rcu_expedited(oldstate)) > > + return; > > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > > + synchronize_rcu_expedited(); > > + else > > + synchronize_rcu(); > > +} > > +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
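To make the "revalidation operation" mentioned above concrete, here is a generic sketch of the lookup pattern SLAB_TYPESAFE_BY_RCU requires (the next reply explains why it does not fit the inode code). All names are invented; the point is only that an RCU reader may find memory that has already been recycled into a new object of the same type, so it must take a reference and then recheck identity.

#include <linux/atomic.h>
#include <linux/rcupdate.h>

/* Hypothetical cached object and index helpers. */
struct cached_obj {
	atomic_t	refcount;
	unsigned long	key;
	/* ... payload readers care about ... */
};
struct obj_index;
struct cached_obj *index_lookup(struct obj_index *idx, unsigned long key);
void put_obj(struct cached_obj *obj);

static struct cached_obj *lookup_obj(struct obj_index *idx, unsigned long key)
{
	struct cached_obj *obj;

	rcu_read_lock();
	obj = index_lookup(idx, key);
	/* The object may be mid-free; only proceed if a reference is obtained. */
	if (obj && !atomic_inc_not_zero(&obj->refcount))
		obj = NULL;
	rcu_read_unlock();

	/*
	 * With SLAB_TYPESAFE_BY_RCU the memory may already belong to a
	 * different object, so revalidate identity after taking the
	 * reference and bail out if it no longer matches.
	 */
	if (obj && READ_ONCE(obj->key) != key) {
		put_obj(obj);
		obj = NULL;
	}
	return obj;
}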
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-07 16:36 ` Paul E. McKenney @ 2022-02-10 4:09 ` Dave Chinner 2022-02-10 5:45 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-02-10 4:09 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Brian Foster, Al Viro, linux-xfs, Ian Kent, rcu On Mon, Feb 07, 2022 at 08:36:21AM -0800, Paul E. McKenney wrote: > On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate > reuse of freed memory, but also requires pointer traversals to the memory > to do a revalidation operation. (Sorry, no free lunch here!) Can't do that with inodes - newly allocated/reused inodes have to go through inode_init_always() which is the very function that causes the problems we have now with path-walk tripping over inodes in an intermediate re-initialised state because we recycled it inside a RCU grace period. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-10 4:09 ` Dave Chinner @ 2022-02-10 5:45 ` Paul E. McKenney 2022-02-10 20:47 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-02-10 5:45 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, Al Viro, linux-xfs, Ian Kent, rcu On Thu, Feb 10, 2022 at 03:09:17PM +1100, Dave Chinner wrote: > On Mon, Feb 07, 2022 at 08:36:21AM -0800, Paul E. McKenney wrote: > > On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > > Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate > > reuse of freed memory, but also requires pointer traversals to the memory > > to do a revalidation operation. (Sorry, no free lunch here!) > > Can't do that with inodes - newly allocated/reused inodes have to go > through inode_init_always() which is the very function that causes > the problems we have now with path-walk tripping over inodes in an > intermediate re-initialised state because we recycled it inside a > RCU grace period. So not just no free lunch, but this is also not a lunch that is consistent with the code's dietary restrictions. From what you said earlier in this thread, I am guessing that you have some other fix in mind. Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-10 5:45 ` Paul E. McKenney @ 2022-02-10 20:47 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-02-10 20:47 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Wed, Feb 09, 2022 at 09:45:44PM -0800, Paul E. McKenney wrote: > On Thu, Feb 10, 2022 at 03:09:17PM +1100, Dave Chinner wrote: > > On Mon, Feb 07, 2022 at 08:36:21AM -0800, Paul E. McKenney wrote: > > > On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > > > Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate > > > reuse of freed memory, but also requires pointer traversals to the memory > > > to do a revalidation operation. (Sorry, no free lunch here!) > > > > Can't do that with inodes - newly allocated/reused inodes have to go > > through inode_init_always() which is the very function that causes > > the problems we have now with path-walk tripping over inodes in an > > intermediate re-initialised state because we recycled it inside a > > RCU grace period. > > So not just no free lunch, but this is also not a lunch that is consistent > with the code's dietary restrictions. > > From what you said earlier in this thread, I am guessing that you have > some other fix in mind. > Yeah.. I've got an experiment running that essentially tracks pending inode grace period cookies and attempts to avoid them at allocation time. It's crude atm, but the initial numbers I see aren't that far off from the results produced by your expedited grace period mechanism. I see numbers mostly in the 40-50k cycles per second ballpark. This is somewhat expected because the current baseline behavior relies on unsafe reuse of inodes before a grace period has elapsed. We have to rely on more physical allocations to get around this, so the small batch alloc/free patterns simply won't be able to spin as fast. The difference I do see with this sort of explicit gp tracking is that the results remain much closer to the baseline kernel when background activity is ramped up. However, one of the things I'd like to experiment with is whether the combination of this approach and expedited grace periods provides any sort of opportunity for further optimization. For example, if we can identify that a grace period has elapsed between the time of ->destroy_inode() and when the queue processing ultimately marks the inode reclaimable, that might allow for some optimized allocation behavior. I see this occur occasionally with normal grace periods, but not quite frequent enough to make a difference. What I observe right now is that the same test above runs at much closer to the baseline numbers when using the ikeep mount option, so I may need to look into ways to mitigate the chunk allocation overhead.. Brian > Thanx, Paul > ^ permalink raw reply [flat|nested] 36+ messages in thread
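One hypothetical shape for the cookie-tracking idea described above, using only the existing (non-expedited) polled grace-period API. This is not the actual experiment being run; apart from the two RCU calls, every name here is invented for illustration.

#include <linux/rcupdate.h>

/* Hypothetical tracked object and queueing helpers. */
struct tracked_obj {
	unsigned long	destroy_gp;	/* grace-period cookie from teardown */
	/* ... */
};
void queue_for_deferred_processing(struct tracked_obj *obj);
void mark_safe_for_reuse(struct tracked_obj *obj);
void mark_reuse_deferred(struct tracked_obj *obj);

static void obj_destroy(struct tracked_obj *obj)
{
	/* Remember which grace period this teardown belongs to. */
	obj->destroy_gp = start_poll_synchronize_rcu();
	queue_for_deferred_processing(obj);
}

static void obj_deferred_process(struct tracked_obj *obj)
{
	/*
	 * By the time batched processing runs, the grace period has often
	 * already elapsed; if so, the object can be handed straight back to
	 * the allocator with no waiting.  Otherwise the allocation path is
	 * expected to skip it until the cookie reports completion.
	 */
	if (poll_state_synchronize_rcu(obj->destroy_gp))
		mark_safe_for_reuse(obj);
	else
		mark_reuse_deferred(obj);
}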
end of thread, other threads: [~2022-02-10 20:47 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)

2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster
2022-01-21 17:26 ` Darrick J. Wong
2022-01-21 18:33 ` Brian Foster
2022-01-22  5:30 ` Paul E. McKenney
2022-01-22 16:55 ` Paul E. McKenney
2022-01-24 15:12 ` Brian Foster
2022-01-24 16:40 ` Paul E. McKenney
2022-01-23 22:43 ` Dave Chinner
2022-01-24 15:06 ` Brian Foster
2022-01-24 15:02 ` Brian Foster
2022-01-24 22:08 ` Dave Chinner
2022-01-24 23:29 ` Brian Foster
2022-01-25  0:31 ` Dave Chinner
2022-01-25 14:40 ` Paul E. McKenney
2022-01-25 22:36 ` Dave Chinner
2022-01-26  5:29 ` Paul E. McKenney
2022-01-26 13:21 ` Brian Foster
2022-01-25 18:30 ` Brian Foster
2022-01-25 20:07 ` Brian Foster
2022-01-25 22:45 ` Dave Chinner
2022-01-27  4:19 ` Al Viro
2022-01-27  5:26 ` Dave Chinner
2022-01-27 19:01 ` Brian Foster
2022-01-27 22:18 ` Dave Chinner
2022-01-28 14:11 ` Brian Foster
2022-01-28 23:53 ` Dave Chinner
2022-01-31 13:28 ` Brian Foster
2022-01-28 21:39 ` Paul E. McKenney
2022-01-31 13:22 ` Brian Foster
2022-02-01 22:00 ` Paul E. McKenney
2022-02-03 18:49 ` Paul E. McKenney
2022-02-07 13:30 ` Brian Foster
2022-02-07 16:36 ` Paul E. McKenney
2022-02-10  4:09 ` Dave Chinner
2022-02-10  5:45 ` Paul E. McKenney
2022-02-10 20:47 ` Brian Foster