* [PATCH] xfs: require an rcu grace period before inode recycle
From: Brian Foster @ 2022-01-21 14:24 UTC
To: linux-xfs; +Cc: Dave Chinner, Al Viro, Ian Kent, rcu
The XFS inode allocation algorithm aggressively reuses recently
freed inodes. This is historical behavior that has been in place
since XFS was imported to mainline Linux. Once the VFS adopted
RCU-walk path lookups (also some time ago), this behavior became
subtly incompatible because the inode recycle path doesn't isolate
the inode from concurrent VFS access.
This has recently manifested as problems in the VFS when XFS happens
to change the type or properties of a recently unlinked inode while
it is still involved in an RCU lookup. For example, if the VFS refers
to a previous incarnation of a symlink inode and obtains the
->get_link() callback from its inode_operations, and that inode then
changes to a non-symlink type via a recycle event, the ->get_link()
callback pointer is reset to NULL and the lookup results in a crash.
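A simplified picture of the race (exact VFS call sites elided):

  rcu-walk lookup                        XFS
  ---------------                        ---
  rcu_read_lock()
  finds symlink inode
                                         inode unlinked, evicted, freed
                                         same struct inode recycled for
                                           a new, non-symlink allocation;
                                           i_op and i_mode change
  loads inode->i_op->get_link (now NULL)
  calls it -> crash
  rcu_read_unlock()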
To avoid this class of problem, isolate in-core inodes for recycling
with an RCU grace period. This is the same level of protection the
VFS expects for inactivated inodes that are never reused, and so
guarantees no further concurrent access before the type or
properties of the inode change. We don't want an unconditional
synchronize_rcu() event here because that would result in a
significant performance impact on mixed inode allocation workloads.
Fortunately, we can take advantage of the recently added deferred
inactivation mechanism to mitigate the need for an RCU wait in most
cases. Deferred inactivation queues and batches the on-disk freeing
of recently destroyed inodes, and so significantly increases the
likelihood that a grace period has elapsed by the time an inode is
freed and observable by the allocation code as a reuse candidate.
Capture the current RCU grace period cookie at inode destroy time
and refer to it at allocation time to conditionally wait for an RCU
grace period if one hasn't already elapsed in the meantime. Since only
unlinked inodes are recycle candidates and unlinked inodes always
require inactivation, we only need to poll and assign RCU state in
the inactivation codepath. Slightly adjust struct xfs_inode to fit
the new field into padding holes that conveniently preexist in the
same cacheline as the deferred inactivation list.
Finally, note that the ideal long term solution here is to
rearchitect bits of XFS' internal inode lifecycle management such
that this additional stall point is not required, but this requires
more thought, time and work to address. This approach restores
functional correctness in the meantime.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
Hi all,
Here's the RCU fixup patch for inode reuse that I've been playing with,
re: the vfs patch discussion [1]. I've put it in pretty much the most
basic form, but I think there are a couple aspects worth thinking about:
1. Use and frequency of start_poll_synchronize_rcu() (vs.
get_state_synchronize_rcu()). The former is a bit more active than the
latter in that it triggers the start of a grace period, when necessary.
This is currently invoked per inode, which is the ideal frequency in
theory, but could be reduced, associated with the xfs_inodegc
thresholds in some manner, etc., if there is good reason to do that
(the first sketch below illustrates one way to do this).
2. The rcu cookie lifecycle. This variant updates the cookie when the
inode is queued for inactivation and nowhere else, because the RCU
docs imply that counter rollover is not a significant problem. In
practice, I think this means that if an
inode is stamped at least once, and the counter rolls over, future
(non-inactivation, non-unlinked) eviction -> repopulation cycles could
trigger rcu syncs. I think this would require repeated
eviction/reinstantiation cycles within a small window to be noticeable,
so I'm not sure how likely this is to occur. We could be more defensive
by resetting or refreshing the cookie. E.g., refresh (or reset to zero)
at recycle time, unconditionally refresh at destroy time (using
get_state_synchronize_rcu() for non-inactivation), etc. The second
sketch below shows roughly how that could look.
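As a strawman for the frequency question in (1), a reduced-frequency
variant could look roughly like the sketch below. This is untested and
not part of the patch; XFS_DESTROY_GP_BATCH, the per-cpu counter and
the helper name are all invented for illustration:

/*
 * Hypothetical sketch: start a grace period for every Nth queued inode
 * and otherwise just sample existing grace period state. Needs
 * <linux/percpu.h> and <linux/rcupdate.h>.
 */
#define XFS_DESTROY_GP_BATCH	32

static DEFINE_PER_CPU(unsigned int, xfs_destroy_gp_count);

static unsigned long
xfs_destroy_gp_cookie(void)
{
	if (this_cpu_inc_return(xfs_destroy_gp_count) %
	    XFS_DESTROY_GP_BATCH == 0)
		return start_poll_synchronize_rcu();
	/*
	 * A cookie sampled here only completes once something else
	 * drives a grace period, which is the trade-off in question.
	 */
	return get_state_synchronize_rcu();
}

xfs_inodegc_queue() would then assign ip->i_destroy_gp from this helper
rather than calling start_poll_synchronize_rcu() directly.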
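And for the more defensive cookie handling in (2), the shape might be
something like the fragments below (also just a sketch, not what this
patch does; it assumes zero never collides with a real cookie value
and so can serve as "unset"):

/*
 * Hypothetical sketch: refresh the cookie on every eviction and clear
 * it once consumed at recycle time, so a stale value can't survive a
 * grace period counter rollover. need_inactive stands in for the
 * existing XFS_NEED_INACTIVE decision.
 */

/* at destroy/eviction time */
if (need_inactive)
	ip->i_destroy_gp = start_poll_synchronize_rcu();
else
	ip->i_destroy_gp = get_state_synchronize_rcu();

/* at recycle time, in xfs_iget_recycle() */
if (ip->i_destroy_gp) {
	cond_synchronize_rcu(ip->i_destroy_gp);
	ip->i_destroy_gp = 0;
}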
Otherwise testing is ongoing, but this version at least survives an
fstests regression run.
Brian
[1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
fs/xfs/xfs_icache.c | 11 +++++++++++
fs/xfs/xfs_inode.h | 3 ++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d019c98eb839..4931daa45ca4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -349,6 +349,16 @@ xfs_iget_recycle(
spin_unlock(&ip->i_flags_lock);
rcu_read_unlock();
+ /*
+ * VFS RCU pathwalk lookups dictate the same lifecycle rules for an
+ * inode recycle as for freeing an inode. I.e., we cannot repurpose the
+ * inode until a grace period has elapsed from the time the previous
+ * version of the inode was destroyed. In most cases a grace period has
+ * already elapsed if the inode was (deferred) inactivated, but
+ * synchronize here as a last resort to guarantee correctness.
+ */
+ cond_synchronize_rcu(ip->i_destroy_gp);
+
ASSERT(!rwsem_is_locked(&inode->i_rwsem));
error = xfs_reinit_inode(mp, inode);
if (error) {
@@ -2019,6 +2029,7 @@ xfs_inodegc_queue(
trace_xfs_inode_set_need_inactive(ip);
spin_lock(&ip->i_flags_lock);
ip->i_flags |= XFS_NEED_INACTIVE;
+ ip->i_destroy_gp = start_poll_synchronize_rcu();
spin_unlock(&ip->i_flags_lock);
gc = get_cpu_ptr(mp->m_inodegc);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c447bf04205a..2153e3edbb86 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -40,8 +40,9 @@ typedef struct xfs_inode {
/* Transaction and locking information. */
struct xfs_inode_log_item *i_itemp; /* logging information */
mrlock_t i_lock; /* inode lock */
- atomic_t i_pincount; /* inode pin count */
struct llist_node i_gclist; /* deferred inactivation list */
+ unsigned long i_destroy_gp; /* destroy rcugp cookie */
+ atomic_t i_pincount; /* inode pin count */
/*
* Bitsets of inode metadata that have been checked and/or are sick.
--
2.31.1
^ permalink raw reply related [flat|nested] 36+ messages in thread* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster @ 2022-01-21 17:26 ` Darrick J. Wong 2022-01-21 18:33 ` Brian Foster 2022-01-23 22:43 ` Dave Chinner 2022-01-24 15:02 ` Brian Foster 2 siblings, 1 reply; 36+ messages in thread From: Darrick J. Wong @ 2022-01-21 17:26 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > The XFS inode allocation algorithm aggressively reuses recently > freed inodes. This is historical behavior that has been in place for > quite some time, since XFS was imported to mainline Linux. Once the > VFS adopted RCUwalk path lookups (also some time ago), this behavior > became slightly incompatible because the inode recycle path doesn't > isolate concurrent access to the inode from the VFS. > > This has recently manifested as problems in the VFS when XFS happens > to change the type or properties of a recently unlinked inode while > still involved in an RCU lookup. For example, if the VFS refers to a > previous incarnation of a symlink inode, obtains the ->get_link() > callback from inode_operations, and the latter happens to change to > a non-symlink type via a recycle event, the ->get_link() callback > pointer is reset to NULL and the lookup results in a crash. Hmm, so I guess what you're saying is that if the memory buffer allocation in ->get_link is slow enough, some other thread can free the inode, drop it, reallocate it, and reinstantiate it (not as a symlink this time) all before ->get_link's memory allocation call returns, after which Bad Things Happen(tm)? Can the lookup thread end up with the wrong inode->i_ops too? > To avoid this class of problem, isolate in-core inodes for recycling > with an RCU grace period. This is the same level of protection the > VFS expects for inactivated inodes that are never reused, and so > guarantees no further concurrent access before the type or > properties of the inode change. We don't want an unconditional > synchronize_rcu() event here because that would result in a > significant performance impact to mixed inode allocation workloads. > > Fortunately, we can take advantage of the recently added deferred > inactivation mechanism to mitigate the need for an RCU wait in most > cases. Deferred inactivation queues and batches the on-disk freeing > of recently destroyed inodes, and so significantly increases the > likelihood that a grace period has elapsed by the time an inode is > freed and observable by the allocation code as a reuse candidate. > Capture the current RCU grace period cookie at inode destroy time > and refer to it at allocation time to conditionally wait for an RCU > grace period if one hadn't expired in the meantime. Since only > unlinked inodes are recycle candidates and unlinked inodes always > require inactivation, Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise idle) but I think your point here is that unlinked inodes that become recycling candidates can cause lookup threads to trip over symlinks, and that's why we need to assign RCU state and poll on it, right? (That wasn't a challenge, I'm just making sure I understand this correctly.) > we only need to poll and assign RCU state in > the inactivation codepath. 
Slightly adjust struct xfs_inode to fit > the new field into padding holes that conveniently preexist in the > same cacheline as the deferred inactivation list. > > Finally, note that the ideal long term solution here is to > rearchitect bits of XFS' internal inode lifecycle management such > that this additional stall point is not required, but this requires > more thought, time and work to address. This approach restores > functional correctness in the meantime. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > > Hi all, > > Here's the RCU fixup patch for inode reuse that I've been playing with, > re: the vfs patch discussion [1]. I've put it in pretty much the most > basic form, but I think there are a couple aspects worth thinking about: > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > get_state_synchronize_rcu()). The former is a bit more active than the > latter in that it triggers the start of a grace period, when necessary. > This currently invokes per inode, which is the ideal frequency in > theory, but could be reduced, associated with the xfs_inogegc thresholds > in some manner, etc., if there is good reason to do that. If you rm -rf $path, do each of the inodes get a separate rcu state, or do they share? > 2. The rcu cookie lifecycle. This variant updates it on inactivation > queue and nowhere else because the RCU docs imply that counter rollover > is not a significant problem. In practice, I think this means that if an > inode is stamped at least once, and the counter rolls over, future > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > trigger rcu syncs. I think this would require repeated > eviction/reinstantiation cycles within a small window to be noticeable, > so I'm not sure how likely this is to occur. We could be more defensive > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > at recycle time, unconditionally refresh at destroy time (using > get_state_synchronize_rcu() for non-inactivation), etc. > > Otherwise testing is ongoing, but this version at least survives an > fstests regression run. > > Brian > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > fs/xfs/xfs_icache.c | 11 +++++++++++ > fs/xfs/xfs_inode.h | 3 ++- > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index d019c98eb839..4931daa45ca4 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -349,6 +349,16 @@ xfs_iget_recycle( > spin_unlock(&ip->i_flags_lock); > rcu_read_unlock(); > > + /* > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > + * inode until a grace period has elapsed from the time the previous > + * version of the inode was destroyed. In most cases a grace period has > + * already elapsed if the inode was (deferred) inactivated, but > + * synchronize here as a last resort to guarantee correctness. > + */ > + cond_synchronize_rcu(ip->i_destroy_gp); > + > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > error = xfs_reinit_inode(mp, inode); > if (error) { > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > trace_xfs_inode_set_need_inactive(ip); > spin_lock(&ip->i_flags_lock); > ip->i_flags |= XFS_NEED_INACTIVE; > + ip->i_destroy_gp = start_poll_synchronize_rcu(); Hmm. 
The description says that we only need the rcu synchronization when we're freeing an inode after its link count drops to zero, because that's the vector for (say) the VFS inode ops actually changing due to free/inactivate/reallocate/recycle while someone else is doing a lookup. I'm a bit puzzled why this unconditionally starts an rcu grace period, instead of done only if i_nlink==0; and why we call cond_synchronize_rcu above unconditionally instead of checking for i_mode==0 (or whatever state the cached inode is left in after it's freed)? --D > spin_unlock(&ip->i_flags_lock); > > gc = get_cpu_ptr(mp->m_inodegc); > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index c447bf04205a..2153e3edbb86 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > /* Transaction and locking information. */ > struct xfs_inode_log_item *i_itemp; /* logging information */ > mrlock_t i_lock; /* inode lock */ > - atomic_t i_pincount; /* inode pin count */ > struct llist_node i_gclist; /* deferred inactivation list */ > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > + atomic_t i_pincount; /* inode pin count */ > > /* > * Bitsets of inode metadata that have been checked and/or are sick. > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 17:26 ` Darrick J. Wong @ 2022-01-21 18:33 ` Brian Foster 2022-01-22 5:30 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-21 18:33 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > The XFS inode allocation algorithm aggressively reuses recently > > freed inodes. This is historical behavior that has been in place for > > quite some time, since XFS was imported to mainline Linux. Once the > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > became slightly incompatible because the inode recycle path doesn't > > isolate concurrent access to the inode from the VFS. > > > > This has recently manifested as problems in the VFS when XFS happens > > to change the type or properties of a recently unlinked inode while > > still involved in an RCU lookup. For example, if the VFS refers to a > > previous incarnation of a symlink inode, obtains the ->get_link() > > callback from inode_operations, and the latter happens to change to > > a non-symlink type via a recycle event, the ->get_link() callback > > pointer is reset to NULL and the lookup results in a crash. > > Hmm, so I guess what you're saying is that if the memory buffer > allocation in ->get_link is slow enough, some other thread can free the > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > this time) all before ->get_link's memory allocation call returns, after > which Bad Things Happen(tm)? > > Can the lookup thread end up with the wrong inode->i_ops too? > We really don't need to even get into the XFS symlink code to reason about the fundamental form of this issue. Consider that an RCU walk starts, locates a symlink inode, meanwhile XFS recycles that inode into something completely different, then the VFS loads and calls ->get_link() (which is now NULL) on said inode and explodes. So the presumption is that the VFS uses RCU protection to rely on some form of stability of the inode (i.e., that the inode memory isn't freed, callback vectors don't change, etc.). Validity of the symlink content is a variant of that class of problem, likely already addressed by the recent inline symlink change, but that doesn't address the broader issue. > > To avoid this class of problem, isolate in-core inodes for recycling > > with an RCU grace period. This is the same level of protection the > > VFS expects for inactivated inodes that are never reused, and so > > guarantees no further concurrent access before the type or > > properties of the inode change. We don't want an unconditional > > synchronize_rcu() event here because that would result in a > > significant performance impact to mixed inode allocation workloads. > > > > Fortunately, we can take advantage of the recently added deferred > > inactivation mechanism to mitigate the need for an RCU wait in most > > cases. Deferred inactivation queues and batches the on-disk freeing > > of recently destroyed inodes, and so significantly increases the > > likelihood that a grace period has elapsed by the time an inode is > > freed and observable by the allocation code as a reuse candidate. 
> > Capture the current RCU grace period cookie at inode destroy time > > and refer to it at allocation time to conditionally wait for an RCU > > grace period if one hadn't expired in the meantime. Since only > > unlinked inodes are recycle candidates and unlinked inodes always > > require inactivation, > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > idle) but I think your point here is that unlinked inodes that become > recycling candidates can cause lookup threads to trip over symlinks, and > that's why we need to assign RCU state and poll on it, right? > Good point. When I wrote the commit log I was thinking of recycled inodes as "reincarnated" inodes, so that wording could probably be improved. But yes, the code is written minimally/simply so I was trying to document that it's unlinked -> freed -> reallocated inodes that we really care about here. WRT to symlinks, I was trying to use that as an example and not necessarily as the general reason for the patch. I.e., the general reason is that the VFS uses rcu protection for inode stability (just as for the inode free path), and the symlink thing is just an example of how things can go wrong in the current implementation without it. > (That wasn't a challenge, I'm just making sure I understand this > correctly.) > > > we only need to poll and assign RCU state in > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > the new field into padding holes that conveniently preexist in the > > same cacheline as the deferred inactivation list. > > > > Finally, note that the ideal long term solution here is to > > rearchitect bits of XFS' internal inode lifecycle management such > > that this additional stall point is not required, but this requires > > more thought, time and work to address. This approach restores > > functional correctness in the meantime. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > > > Hi all, > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > basic form, but I think there are a couple aspects worth thinking about: > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > get_state_synchronize_rcu()). The former is a bit more active than the > > latter in that it triggers the start of a grace period, when necessary. > > This currently invokes per inode, which is the ideal frequency in > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > in some manner, etc., if there is good reason to do that. > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > do they share? > My previous experiments on a teardown grace period had me thinking batching would occur, but I don't recall which RCU call I was using at the time so I'd probably have to throw a tracepoint in there to dump some of the grace period values and double check to be sure. (If this is not the case, that might be a good reason to tweak things as discussed above). > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > queue and nowhere else because the RCU docs imply that counter rollover > > is not a significant problem. In practice, I think this means that if an > > inode is stamped at least once, and the counter rolls over, future > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > trigger rcu syncs. 
I think this would require repeated > > eviction/reinstantiation cycles within a small window to be noticeable, > > so I'm not sure how likely this is to occur. We could be more defensive > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > at recycle time, unconditionally refresh at destroy time (using > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > Otherwise testing is ongoing, but this version at least survives an > > fstests regression run. > > > > Brian > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > fs/xfs/xfs_inode.h | 3 ++- > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index d019c98eb839..4931daa45ca4 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > spin_unlock(&ip->i_flags_lock); > > rcu_read_unlock(); > > > > + /* > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > + * inode until a grace period has elapsed from the time the previous > > + * version of the inode was destroyed. In most cases a grace period has > > + * already elapsed if the inode was (deferred) inactivated, but > > + * synchronize here as a last resort to guarantee correctness. > > + */ > > + cond_synchronize_rcu(ip->i_destroy_gp); > > + > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > error = xfs_reinit_inode(mp, inode); > > if (error) { > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > trace_xfs_inode_set_need_inactive(ip); > > spin_lock(&ip->i_flags_lock); > > ip->i_flags |= XFS_NEED_INACTIVE; > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > Hmm. The description says that we only need the rcu synchronization > when we're freeing an inode after its link count drops to zero, because > that's the vector for (say) the VFS inode ops actually changing due to > free/inactivate/reallocate/recycle while someone else is doing a lookup. > Right.. > I'm a bit puzzled why this unconditionally starts an rcu grace period, > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > above unconditionally instead of checking for i_mode==0 (or whatever > state the cached inode is left in after it's freed)? > Just an attempt to start simple and/or make any performance test/problems more blatant. I probably could have tagged this RFC. My primary goal with this patch was to establish whether the general approach is sane/viable/acceptable or we need to move in another direction. That aside, I think it's reasonable to have explicit logic around the unlinked case if we want to keep it restricted to that, though I would probably implement that as a conditional i_destroy_gp assignment and let the consumer context key off whether that field is set rather than attempt to infer unlinked logic (and then I guess reset it back to zero so it doesn't leak across reincarnation). 
That also probably facilitates a meaningful tracepoint to track the cases that do end up syncing, which helps with your earlier question around batching, so I'll look into those changes once I get through broader testing Brian > --D > > > spin_unlock(&ip->i_flags_lock); > > > > gc = get_cpu_ptr(mp->m_inodegc); > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index c447bf04205a..2153e3edbb86 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > /* Transaction and locking information. */ > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > mrlock_t i_lock; /* inode lock */ > > - atomic_t i_pincount; /* inode pin count */ > > struct llist_node i_gclist; /* deferred inactivation list */ > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > + atomic_t i_pincount; /* inode pin count */ > > > > /* > > * Bitsets of inode metadata that have been checked and/or are sick. > > -- > > 2.31.1 > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 18:33 ` Brian Foster @ 2022-01-22 5:30 ` Paul E. McKenney 2022-01-22 16:55 ` Paul E. McKenney 2022-01-24 15:12 ` Brian Foster 0 siblings, 2 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-01-22 5:30 UTC (permalink / raw) To: Brian Foster Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: > On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > The XFS inode allocation algorithm aggressively reuses recently > > > freed inodes. This is historical behavior that has been in place for > > > quite some time, since XFS was imported to mainline Linux. Once the > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > became slightly incompatible because the inode recycle path doesn't > > > isolate concurrent access to the inode from the VFS. > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > to change the type or properties of a recently unlinked inode while > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > callback from inode_operations, and the latter happens to change to > > > a non-symlink type via a recycle event, the ->get_link() callback > > > pointer is reset to NULL and the lookup results in a crash. > > > > Hmm, so I guess what you're saying is that if the memory buffer > > allocation in ->get_link is slow enough, some other thread can free the > > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > > this time) all before ->get_link's memory allocation call returns, after > > which Bad Things Happen(tm)? > > > > Can the lookup thread end up with the wrong inode->i_ops too? > > > > We really don't need to even get into the XFS symlink code to reason > about the fundamental form of this issue. Consider that an RCU walk > starts, locates a symlink inode, meanwhile XFS recycles that inode into > something completely different, then the VFS loads and calls > ->get_link() (which is now NULL) on said inode and explodes. So the > presumption is that the VFS uses RCU protection to rely on some form of > stability of the inode (i.e., that the inode memory isn't freed, > callback vectors don't change, etc.). > > Validity of the symlink content is a variant of that class of problem, > likely already addressed by the recent inline symlink change, but that > doesn't address the broader issue. > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > with an RCU grace period. This is the same level of protection the > > > VFS expects for inactivated inodes that are never reused, and so > > > guarantees no further concurrent access before the type or > > > properties of the inode change. We don't want an unconditional > > > synchronize_rcu() event here because that would result in a > > > significant performance impact to mixed inode allocation workloads. > > > > > > Fortunately, we can take advantage of the recently added deferred > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > cases. 
Deferred inactivation queues and batches the on-disk freeing > > > of recently destroyed inodes, and so significantly increases the > > > likelihood that a grace period has elapsed by the time an inode is > > > freed and observable by the allocation code as a reuse candidate. > > > Capture the current RCU grace period cookie at inode destroy time > > > and refer to it at allocation time to conditionally wait for an RCU > > > grace period if one hadn't expired in the meantime. Since only > > > unlinked inodes are recycle candidates and unlinked inodes always > > > require inactivation, > > > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > > idle) but I think your point here is that unlinked inodes that become > > recycling candidates can cause lookup threads to trip over symlinks, and > > that's why we need to assign RCU state and poll on it, right? > > > > Good point. When I wrote the commit log I was thinking of recycled > inodes as "reincarnated" inodes, so that wording could probably be > improved. But yes, the code is written minimally/simply so I was trying > to document that it's unlinked -> freed -> reallocated inodes that we > really care about here. > > WRT to symlinks, I was trying to use that as an example and not > necessarily as the general reason for the patch. I.e., the general > reason is that the VFS uses rcu protection for inode stability (just as > for the inode free path), and the symlink thing is just an example of > how things can go wrong in the current implementation without it. > > > (That wasn't a challenge, I'm just making sure I understand this > > correctly.) > > > > > we only need to poll and assign RCU state in > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > the new field into padding holes that conveniently preexist in the > > > same cacheline as the deferred inactivation list. > > > > > > Finally, note that the ideal long term solution here is to > > > rearchitect bits of XFS' internal inode lifecycle management such > > > that this additional stall point is not required, but this requires > > > more thought, time and work to address. This approach restores > > > functional correctness in the meantime. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > > > > Hi all, > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > latter in that it triggers the start of a grace period, when necessary. > > > This currently invokes per inode, which is the ideal frequency in > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > in some manner, etc., if there is good reason to do that. > > > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > > do they share? > > My previous experiments on a teardown grace period had me thinking > batching would occur, but I don't recall which RCU call I was using at > the time so I'd probably have to throw a tracepoint in there to dump > some of the grace period values and double check to be sure. (If this is > not the case, that might be a good reason to tweak things as discussed > above). 
An RCU grace period typically takes some milliseconds to complete, so a great many inodes would end up being tagged for the same grace period. For example, if "rm -rf" could delete one file per microsecond, the first few thousand files would be tagged with one grace period, the next few thousand with the next grace period, and so on. In the unlikely event that RCU was totally idle when the "rm -rf" started, the very first file might get its own grace period, but they would batch in the thousands thereafter. On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if there is always other RCU update activity, get_state_synchronize_rcu() is just fine. So if XFS does a call_rcu() or synchronize_rcu() every so often, all you need here is get_state_synchronize_rcu()(). Another approach is to do a start_poll_synchronize_rcu() every 1,000 events, and use get_state_synchronize_rcu() otherwise. And there are a lot of possible variations on that theme. But why not just try always doing start_poll_synchronize_rcu() and only bother with get_state_synchronize_rcu() if that turns out to be too slow? > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > queue and nowhere else because the RCU docs imply that counter rollover > > > is not a significant problem. In practice, I think this means that if an > > > inode is stamped at least once, and the counter rolls over, future > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > trigger rcu syncs. I think this would require repeated > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > so I'm not sure how likely this is to occur. We could be more defensive > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > at recycle time, unconditionally refresh at destroy time (using > > > get_state_synchronize_rcu() for non-inactivation), etc. Even on a 32-bit system that is running RCU grace periods as fast as they will go, it will take about 12 days to overflow that counter. But if you have an inode sitting on the list for that long, yes, you could see unnecessary synchronous grace-period waits. Would it help if there was an API that gave you a special cookie value that cond_synchronize_rcu() and friends recognized as "already expired"? That way if poll_state_synchronize_rcu() says that original cookie has expired, you could replace that cookie value with one that would stay expired. Maybe a get_expired_synchronize_rcu() or some such? Thanx, Paul > > > Otherwise testing is ongoing, but this version at least survives an > > > fstests regression run. > > > > > > Brian > > > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > > fs/xfs/xfs_inode.h | 3 ++- > > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index d019c98eb839..4931daa45ca4 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > > spin_unlock(&ip->i_flags_lock); > > > rcu_read_unlock(); > > > > > > + /* > > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > > + * inode until a grace period has elapsed from the time the previous > > > + * version of the inode was destroyed. 
In most cases a grace period has > > > + * already elapsed if the inode was (deferred) inactivated, but > > > + * synchronize here as a last resort to guarantee correctness. > > > + */ > > > + cond_synchronize_rcu(ip->i_destroy_gp); > > > + > > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > > error = xfs_reinit_inode(mp, inode); > > > if (error) { > > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > > trace_xfs_inode_set_need_inactive(ip); > > > spin_lock(&ip->i_flags_lock); > > > ip->i_flags |= XFS_NEED_INACTIVE; > > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > > > Hmm. The description says that we only need the rcu synchronization > > when we're freeing an inode after its link count drops to zero, because > > that's the vector for (say) the VFS inode ops actually changing due to > > free/inactivate/reallocate/recycle while someone else is doing a lookup. > > > > Right.. > > > I'm a bit puzzled why this unconditionally starts an rcu grace period, > > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > > above unconditionally instead of checking for i_mode==0 (or whatever > > state the cached inode is left in after it's freed)? > > > > Just an attempt to start simple and/or make any performance > test/problems more blatant. I probably could have tagged this RFC. My > primary goal with this patch was to establish whether the general > approach is sane/viable/acceptable or we need to move in another > direction. > > That aside, I think it's reasonable to have explicit logic around the > unlinked case if we want to keep it restricted to that, though I would > probably implement that as a conditional i_destroy_gp assignment and let > the consumer context key off whether that field is set rather than > attempt to infer unlinked logic (and then I guess reset it back to zero > so it doesn't leak across reincarnation). That also probably facilitates > a meaningful tracepoint to track the cases that do end up syncing, which > helps with your earlier question around batching, so I'll look into > those changes once I get through broader testing > > Brian > > > --D > > > > > spin_unlock(&ip->i_flags_lock); > > > > > > gc = get_cpu_ptr(mp->m_inodegc); > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > index c447bf04205a..2153e3edbb86 100644 > > > --- a/fs/xfs/xfs_inode.h > > > +++ b/fs/xfs/xfs_inode.h > > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > > /* Transaction and locking information. */ > > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > > mrlock_t i_lock; /* inode lock */ > > > - atomic_t i_pincount; /* inode pin count */ > > > struct llist_node i_gclist; /* deferred inactivation list */ > > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > > + atomic_t i_pincount; /* inode pin count */ > > > > > > /* > > > * Bitsets of inode metadata that have been checked and/or are sick. > > > -- > > > 2.31.1 > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-22 5:30 ` Paul E. McKenney @ 2022-01-22 16:55 ` Paul E. McKenney 2022-01-24 15:12 ` Brian Foster 1 sibling, 0 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-01-22 16:55 UTC (permalink / raw) To: Brian Foster Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:30:19PM -0800, Paul E. McKenney wrote: > On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: [ . . . ] > > My previous experiments on a teardown grace period had me thinking > > batching would occur, but I don't recall which RCU call I was using at > > the time so I'd probably have to throw a tracepoint in there to dump > > some of the grace period values and double check to be sure. (If this is > > not the case, that might be a good reason to tweak things as discussed > > above). > > An RCU grace period typically takes some milliseconds to complete, so a > great many inodes would end up being tagged for the same grace period. > For example, if "rm -rf" could delete one file per microsecond, the > first few thousand files would be tagged with one grace period, > the next few thousand with the next grace period, and so on. > > In the unlikely event that RCU was totally idle when the "rm -rf" > started, the very first file might get its own grace period, but > they would batch in the thousands thereafter. > > On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if > there is always other RCU update activity, get_state_synchronize_rcu() > is just fine. So if XFS does a call_rcu() or synchronize_rcu() every > so often, all you need here is get_state_synchronize_rcu()(). > > Another approach is to do a start_poll_synchronize_rcu() every 1,000 > events, and use get_state_synchronize_rcu() otherwise. And there are > a lot of possible variations on that theme. > > But why not just try always doing start_poll_synchronize_rcu() and > only bother with get_state_synchronize_rcu() if that turns out to > be too slow? Plus there are a few optimizations I could apply that would speed up get_state_synchronize_rcu(), for example, reducing lock contention. But I would of course have to see a need before increasing complexity. Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-22 5:30 ` Paul E. McKenney 2022-01-22 16:55 ` Paul E. McKenney @ 2022-01-24 15:12 ` Brian Foster 2022-01-24 16:40 ` Paul E. McKenney 1 sibling, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-24 15:12 UTC (permalink / raw) To: Paul E. McKenney Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:30:19PM -0800, Paul E. McKenney wrote: > On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: > > On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > > The XFS inode allocation algorithm aggressively reuses recently > > > > freed inodes. This is historical behavior that has been in place for > > > > quite some time, since XFS was imported to mainline Linux. Once the > > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > > became slightly incompatible because the inode recycle path doesn't > > > > isolate concurrent access to the inode from the VFS. > > > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > > to change the type or properties of a recently unlinked inode while > > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > > callback from inode_operations, and the latter happens to change to > > > > a non-symlink type via a recycle event, the ->get_link() callback > > > > pointer is reset to NULL and the lookup results in a crash. > > > > > > Hmm, so I guess what you're saying is that if the memory buffer > > > allocation in ->get_link is slow enough, some other thread can free the > > > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > > > this time) all before ->get_link's memory allocation call returns, after > > > which Bad Things Happen(tm)? > > > > > > Can the lookup thread end up with the wrong inode->i_ops too? > > > > > > > We really don't need to even get into the XFS symlink code to reason > > about the fundamental form of this issue. Consider that an RCU walk > > starts, locates a symlink inode, meanwhile XFS recycles that inode into > > something completely different, then the VFS loads and calls > > ->get_link() (which is now NULL) on said inode and explodes. So the > > presumption is that the VFS uses RCU protection to rely on some form of > > stability of the inode (i.e., that the inode memory isn't freed, > > callback vectors don't change, etc.). > > > > Validity of the symlink content is a variant of that class of problem, > > likely already addressed by the recent inline symlink change, but that > > doesn't address the broader issue. > > > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > > with an RCU grace period. This is the same level of protection the > > > > VFS expects for inactivated inodes that are never reused, and so > > > > guarantees no further concurrent access before the type or > > > > properties of the inode change. We don't want an unconditional > > > > synchronize_rcu() event here because that would result in a > > > > significant performance impact to mixed inode allocation workloads. > > > > > > > > Fortunately, we can take advantage of the recently added deferred > > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > > cases. 
Deferred inactivation queues and batches the on-disk freeing > > > > of recently destroyed inodes, and so significantly increases the > > > > likelihood that a grace period has elapsed by the time an inode is > > > > freed and observable by the allocation code as a reuse candidate. > > > > Capture the current RCU grace period cookie at inode destroy time > > > > and refer to it at allocation time to conditionally wait for an RCU > > > > grace period if one hadn't expired in the meantime. Since only > > > > unlinked inodes are recycle candidates and unlinked inodes always > > > > require inactivation, > > > > > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > > > idle) but I think your point here is that unlinked inodes that become > > > recycling candidates can cause lookup threads to trip over symlinks, and > > > that's why we need to assign RCU state and poll on it, right? > > > > > > > Good point. When I wrote the commit log I was thinking of recycled > > inodes as "reincarnated" inodes, so that wording could probably be > > improved. But yes, the code is written minimally/simply so I was trying > > to document that it's unlinked -> freed -> reallocated inodes that we > > really care about here. > > > > WRT to symlinks, I was trying to use that as an example and not > > necessarily as the general reason for the patch. I.e., the general > > reason is that the VFS uses rcu protection for inode stability (just as > > for the inode free path), and the symlink thing is just an example of > > how things can go wrong in the current implementation without it. > > > > > (That wasn't a challenge, I'm just making sure I understand this > > > correctly.) > > > > > > > we only need to poll and assign RCU state in > > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > > the new field into padding holes that conveniently preexist in the > > > > same cacheline as the deferred inactivation list. > > > > > > > > Finally, note that the ideal long term solution here is to > > > > rearchitect bits of XFS' internal inode lifecycle management such > > > > that this additional stall point is not required, but this requires > > > > more thought, time and work to address. This approach restores > > > > functional correctness in the meantime. > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > > > > > Hi all, > > > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > > latter in that it triggers the start of a grace period, when necessary. > > > > This currently invokes per inode, which is the ideal frequency in > > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > > in some manner, etc., if there is good reason to do that. > > > > > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > > > do they share? > > > > My previous experiments on a teardown grace period had me thinking > > batching would occur, but I don't recall which RCU call I was using at > > the time so I'd probably have to throw a tracepoint in there to dump > > some of the grace period values and double check to be sure. 
(If this is > > not the case, that might be a good reason to tweak things as discussed > > above). > > An RCU grace period typically takes some milliseconds to complete, so a > great many inodes would end up being tagged for the same grace period. > For example, if "rm -rf" could delete one file per microsecond, the > first few thousand files would be tagged with one grace period, > the next few thousand with the next grace period, and so on. > > In the unlikely event that RCU was totally idle when the "rm -rf" > started, the very first file might get its own grace period, but > they would batch in the thousands thereafter. > Great, thanks for the info. > On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if > there is always other RCU update activity, get_state_synchronize_rcu() > is just fine. So if XFS does a call_rcu() or synchronize_rcu() every > so often, all you need here is get_state_synchronize_rcu()(). > > Another approach is to do a start_poll_synchronize_rcu() every 1,000 > events, and use get_state_synchronize_rcu() otherwise. And there are > a lot of possible variations on that theme. > > But why not just try always doing start_poll_synchronize_rcu() and > only bother with get_state_synchronize_rcu() if that turns out to > be too slow? > Ack, that makes sense to me. We use call_rcu() to free inode memory and obviously will have a sync in the lookup path after this patch, but that is a consequence of the polling we add at the same time. I'm not sure that's enough activity on our own so I'd probably prefer to keep things simple, use the start_poll_*() variant from the start, and then consider further start/get filtering like you describe above if it ever becomes a problem. > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > > queue and nowhere else because the RCU docs imply that counter rollover > > > > is not a significant problem. In practice, I think this means that if an > > > > inode is stamped at least once, and the counter rolls over, future > > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > > trigger rcu syncs. I think this would require repeated > > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > > so I'm not sure how likely this is to occur. We could be more defensive > > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > > at recycle time, unconditionally refresh at destroy time (using > > > > get_state_synchronize_rcu() for non-inactivation), etc. > > Even on a 32-bit system that is running RCU grace periods as fast as they > will go, it will take about 12 days to overflow that counter. But if > you have an inode sitting on the list for that long, yes, you could > see unnecessary synchronous grace-period waits. > > Would it help if there was an API that gave you a special cookie value > that cond_synchronize_rcu() and friends recognized as "already expired"? > That way if poll_state_synchronize_rcu() says that original cookie > has expired, you could replace that cookie value with one that would > stay expired. Maybe a get_expired_synchronize_rcu() or some such? > Hmm.. so I think this would be helpful if we were to stamp the inode conditionally (i.e. unlinked inodes only) on eviction because then we wouldn't have to worry about clearing the cookie if said inode happens to be reallocated and then run through one or more eviction -> recycle sequences after a rollover of the grace period counter. 
With that sort of scheme, the inode could be sitting in cache for who knows how long with a counter that was conditionally synced against many days (or weeks?) prior, from whenever it was initially reallocated. However, as Dave points out that we probably want to poll RCU state on every inode eviction, I suspect that means this is less of an issue. An inode must be evicted for it to become a recycle candidate, and so if we update the inode unconditionally on every eviction, then I think the recycle code should always see the most recent cookie value and we don't have to worry much about clearing it. I think it's technically possible for an inode to sit in an inactivation queue for that sort of time period, but that would probably require the filesystem go idle or drop to low enough activity that a spurious rcu sync here or there is probably not a big deal. So all in all, I suspect if we already had such a special cookie variant of the API that was otherwise functionally equivalent, I'd probably use it to cover that potential case, but it's not clear to me atm that this use case necessarily warrants introduction of such an API... Brian > Thanx, Paul > > > > > Otherwise testing is ongoing, but this version at least survives an > > > > fstests regression run. > > > > > > > > Brian > > > > > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > > > fs/xfs/xfs_inode.h | 3 ++- > > > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index d019c98eb839..4931daa45ca4 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > > > spin_unlock(&ip->i_flags_lock); > > > > rcu_read_unlock(); > > > > > > > > + /* > > > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > > > + * inode until a grace period has elapsed from the time the previous > > > > + * version of the inode was destroyed. In most cases a grace period has > > > > + * already elapsed if the inode was (deferred) inactivated, but > > > > + * synchronize here as a last resort to guarantee correctness. > > > > + */ > > > > + cond_synchronize_rcu(ip->i_destroy_gp); > > > > + > > > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > > > error = xfs_reinit_inode(mp, inode); > > > > if (error) { > > > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > > > trace_xfs_inode_set_need_inactive(ip); > > > > spin_lock(&ip->i_flags_lock); > > > > ip->i_flags |= XFS_NEED_INACTIVE; > > > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > > > > > Hmm. The description says that we only need the rcu synchronization > > > when we're freeing an inode after its link count drops to zero, because > > > that's the vector for (say) the VFS inode ops actually changing due to > > > free/inactivate/reallocate/recycle while someone else is doing a lookup. > > > > > > > Right.. > > > > > I'm a bit puzzled why this unconditionally starts an rcu grace period, > > > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > > > above unconditionally instead of checking for i_mode==0 (or whatever > > > state the cached inode is left in after it's freed)? > > > > > > > Just an attempt to start simple and/or make any performance > > test/problems more blatant. I probably could have tagged this RFC. 
My > > primary goal with this patch was to establish whether the general > > approach is sane/viable/acceptable or we need to move in another > > direction. > > > > That aside, I think it's reasonable to have explicit logic around the > > unlinked case if we want to keep it restricted to that, though I would > > probably implement that as a conditional i_destroy_gp assignment and let > > the consumer context key off whether that field is set rather than > > attempt to infer unlinked logic (and then I guess reset it back to zero > > so it doesn't leak across reincarnation). That also probably facilitates > > a meaningful tracepoint to track the cases that do end up syncing, which > > helps with your earlier question around batching, so I'll look into > > those changes once I get through broader testing > > > > Brian > > > > > --D > > > > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > gc = get_cpu_ptr(mp->m_inodegc); > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > index c447bf04205a..2153e3edbb86 100644 > > > > --- a/fs/xfs/xfs_inode.h > > > > +++ b/fs/xfs/xfs_inode.h > > > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > > > /* Transaction and locking information. */ > > > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > > > mrlock_t i_lock; /* inode lock */ > > > > - atomic_t i_pincount; /* inode pin count */ > > > > struct llist_node i_gclist; /* deferred inactivation list */ > > > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > > > + atomic_t i_pincount; /* inode pin count */ > > > > > > > > /* > > > > * Bitsets of inode metadata that have been checked and/or are sick. > > > > -- > > > > 2.31.1 > > > > > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 15:12 ` Brian Foster @ 2022-01-24 16:40 ` Paul E. McKenney 0 siblings, 0 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-01-24 16:40 UTC (permalink / raw) To: Brian Foster Cc: Darrick J. Wong, linux-xfs, Dave Chinner, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 10:12:45AM -0500, Brian Foster wrote: > On Fri, Jan 21, 2022 at 09:30:19PM -0800, Paul E. McKenney wrote: > > On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote: > > > On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote: > > > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > > > The XFS inode allocation algorithm aggressively reuses recently > > > > > freed inodes. This is historical behavior that has been in place for > > > > > quite some time, since XFS was imported to mainline Linux. Once the > > > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > > > became slightly incompatible because the inode recycle path doesn't > > > > > isolate concurrent access to the inode from the VFS. > > > > > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > > > to change the type or properties of a recently unlinked inode while > > > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > > > callback from inode_operations, and the latter happens to change to > > > > > a non-symlink type via a recycle event, the ->get_link() callback > > > > > pointer is reset to NULL and the lookup results in a crash. > > > > > > > > Hmm, so I guess what you're saying is that if the memory buffer > > > > allocation in ->get_link is slow enough, some other thread can free the > > > > inode, drop it, reallocate it, and reinstantiate it (not as a symlink > > > > this time) all before ->get_link's memory allocation call returns, after > > > > which Bad Things Happen(tm)? > > > > > > > > Can the lookup thread end up with the wrong inode->i_ops too? > > > > > > > > > > We really don't need to even get into the XFS symlink code to reason > > > about the fundamental form of this issue. Consider that an RCU walk > > > starts, locates a symlink inode, meanwhile XFS recycles that inode into > > > something completely different, then the VFS loads and calls > > > ->get_link() (which is now NULL) on said inode and explodes. So the > > > presumption is that the VFS uses RCU protection to rely on some form of > > > stability of the inode (i.e., that the inode memory isn't freed, > > > callback vectors don't change, etc.). > > > > > > Validity of the symlink content is a variant of that class of problem, > > > likely already addressed by the recent inline symlink change, but that > > > doesn't address the broader issue. > > > > > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > > > with an RCU grace period. This is the same level of protection the > > > > > VFS expects for inactivated inodes that are never reused, and so > > > > > guarantees no further concurrent access before the type or > > > > > properties of the inode change. We don't want an unconditional > > > > > synchronize_rcu() event here because that would result in a > > > > > significant performance impact to mixed inode allocation workloads. 
> > > > > > > > > > Fortunately, we can take advantage of the recently added deferred > > > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > > > cases. Deferred inactivation queues and batches the on-disk freeing > > > > > of recently destroyed inodes, and so significantly increases the > > > > > likelihood that a grace period has elapsed by the time an inode is > > > > > freed and observable by the allocation code as a reuse candidate. > > > > > Capture the current RCU grace period cookie at inode destroy time > > > > > and refer to it at allocation time to conditionally wait for an RCU > > > > > grace period if one hadn't expired in the meantime. Since only > > > > > unlinked inodes are recycle candidates and unlinked inodes always > > > > > require inactivation, > > > > > > > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise > > > > idle) but I think your point here is that unlinked inodes that become > > > > recycling candidates can cause lookup threads to trip over symlinks, and > > > > that's why we need to assign RCU state and poll on it, right? > > > > > > > > > > Good point. When I wrote the commit log I was thinking of recycled > > > inodes as "reincarnated" inodes, so that wording could probably be > > > improved. But yes, the code is written minimally/simply so I was trying > > > to document that it's unlinked -> freed -> reallocated inodes that we > > > really care about here. > > > > > > WRT to symlinks, I was trying to use that as an example and not > > > necessarily as the general reason for the patch. I.e., the general > > > reason is that the VFS uses rcu protection for inode stability (just as > > > for the inode free path), and the symlink thing is just an example of > > > how things can go wrong in the current implementation without it. > > > > > > > (That wasn't a challenge, I'm just making sure I understand this > > > > correctly.) > > > > > > > > > we only need to poll and assign RCU state in > > > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > > > the new field into padding holes that conveniently preexist in the > > > > > same cacheline as the deferred inactivation list. > > > > > > > > > > Finally, note that the ideal long term solution here is to > > > > > rearchitect bits of XFS' internal inode lifecycle management such > > > > > that this additional stall point is not required, but this requires > > > > > more thought, time and work to address. This approach restores > > > > > functional correctness in the meantime. > > > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > > --- > > > > > > > > > > Hi all, > > > > > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > > > latter in that it triggers the start of a grace period, when necessary. > > > > > This currently invokes per inode, which is the ideal frequency in > > > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > > > in some manner, etc., if there is good reason to do that. > > > > > > > > If you rm -rf $path, do each of the inodes get a separate rcu state, or > > > > do they share? 
> > > > > > My previous experiments on a teardown grace period had me thinking > > > batching would occur, but I don't recall which RCU call I was using at > > > the time so I'd probably have to throw a tracepoint in there to dump > > > some of the grace period values and double check to be sure. (If this is > > > not the case, that might be a good reason to tweak things as discussed > > > above). > > > > An RCU grace period typically takes some milliseconds to complete, so a > > great many inodes would end up being tagged for the same grace period. > > For example, if "rm -rf" could delete one file per microsecond, the > > first few thousand files would be tagged with one grace period, > > the next few thousand with the next grace period, and so on. > > > > In the unlikely event that RCU was totally idle when the "rm -rf" > > started, the very first file might get its own grace period, but > > they would batch in the thousands thereafter. > > > > Great, thanks for the info. > > > On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if > > there is always other RCU update activity, get_state_synchronize_rcu() > > is just fine. So if XFS does a call_rcu() or synchronize_rcu() every > > so often, all you need here is get_state_synchronize_rcu()(). > > > > Another approach is to do a start_poll_synchronize_rcu() every 1,000 > > events, and use get_state_synchronize_rcu() otherwise. And there are > > a lot of possible variations on that theme. > > > > But why not just try always doing start_poll_synchronize_rcu() and > > only bother with get_state_synchronize_rcu() if that turns out to > > be too slow? > > > > Ack, that makes sense to me. We use call_rcu() to free inode memory and > obviously will have a sync in the lookup path after this patch, but that > is a consequence of the polling we add at the same time. I'm not sure > that's enough activity on our own so I'd probably prefer to keep things > simple, use the start_poll_*() variant from the start, and then consider > further start/get filtering like you describe above if it ever becomes a > problem. > > > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > > > queue and nowhere else because the RCU docs imply that counter rollover > > > > > is not a significant problem. In practice, I think this means that if an > > > > > inode is stamped at least once, and the counter rolls over, future > > > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > > > trigger rcu syncs. I think this would require repeated > > > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > > > so I'm not sure how likely this is to occur. We could be more defensive > > > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > > > at recycle time, unconditionally refresh at destroy time (using > > > > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > Even on a 32-bit system that is running RCU grace periods as fast as they > > will go, it will take about 12 days to overflow that counter. But if > > you have an inode sitting on the list for that long, yes, you could > > see unnecessary synchronous grace-period waits. > > > > Would it help if there was an API that gave you a special cookie value > > that cond_synchronize_rcu() and friends recognized as "already expired"? > > That way if poll_state_synchronize_rcu() says that original cookie > > has expired, you could replace that cookie value with one that would > > stay expired. 
Maybe a get_expired_synchronize_rcu() or some such? > > > > Hmm.. so I think this would be helpful if we were to stamp the inode > conditionally (i.e. unlinked inodes only) on eviction because then we > wouldn't have to worry about clearing the cookie if said inode happens > to be reallocated and then run through one or more eviction -> recycle > sequences after a rollover of the grace period counter. With that sort > of scheme, the inode could be sitting in cache for who knows how long > with a counter that was conditionally synced against many days (or > weeks?) prior, from whenever it was initially reallocated. > > However, as Dave points out that we probably want to poll RCU state on > every inode eviction, I suspect that means this is less of an issue. An > inode must be evicted for it to become a recycle candidate, and so if we > update the inode unconditionally on every eviction, then I think the > recycle code should always see the most recent cookie value and we don't > have to worry much about clearing it. > > I think it's technically possible for an inode to sit in an inactivation > queue for that sort of time period, but that would probably require the > filesystem go idle or drop to low enough activity that a spurious rcu > sync here or there is probably not a big deal. So all in all, I suspect > if we already had such a special cookie variant of the API that was > otherwise functionally equivalent, I'd probably use it to cover that > potential case, but it's not clear to me atm that this use case > necessarily warrants introduction of such an API... If you need it, it happens to be easy to provide. If you don't need it, I am of course happy to avoid adding another RCU API member. ;-) Thanx, Paul > > > > > Otherwise testing is ongoing, but this version at least survives an > > > > > fstests regression run. > > > > > > > > > > Brian > > > > > > > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > > > > > > > > > fs/xfs/xfs_icache.c | 11 +++++++++++ > > > > > fs/xfs/xfs_inode.h | 3 ++- > > > > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > index d019c98eb839..4931daa45ca4 100644 > > > > > --- a/fs/xfs/xfs_icache.c > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > @@ -349,6 +349,16 @@ xfs_iget_recycle( > > > > > spin_unlock(&ip->i_flags_lock); > > > > > rcu_read_unlock(); > > > > > > > > > > + /* > > > > > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > > > > > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > > > > > + * inode until a grace period has elapsed from the time the previous > > > > > + * version of the inode was destroyed. In most cases a grace period has > > > > > + * already elapsed if the inode was (deferred) inactivated, but > > > > > + * synchronize here as a last resort to guarantee correctness. > > > > > + */ > > > > > + cond_synchronize_rcu(ip->i_destroy_gp); > > > > > + > > > > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > > > > error = xfs_reinit_inode(mp, inode); > > > > > if (error) { > > > > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > > > > > trace_xfs_inode_set_need_inactive(ip); > > > > > spin_lock(&ip->i_flags_lock); > > > > > ip->i_flags |= XFS_NEED_INACTIVE; > > > > > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > > > > > > > > Hmm. 
The description says that we only need the rcu synchronization > > > > when we're freeing an inode after its link count drops to zero, because > > > > that's the vector for (say) the VFS inode ops actually changing due to > > > > free/inactivate/reallocate/recycle while someone else is doing a lookup. > > > > > > > > > > Right.. > > > > > > > I'm a bit puzzled why this unconditionally starts an rcu grace period, > > > > instead of done only if i_nlink==0; and why we call cond_synchronize_rcu > > > > above unconditionally instead of checking for i_mode==0 (or whatever > > > > state the cached inode is left in after it's freed)? > > > > > > > > > > Just an attempt to start simple and/or make any performance > > > test/problems more blatant. I probably could have tagged this RFC. My > > > primary goal with this patch was to establish whether the general > > > approach is sane/viable/acceptable or we need to move in another > > > direction. > > > > > > That aside, I think it's reasonable to have explicit logic around the > > > unlinked case if we want to keep it restricted to that, though I would > > > probably implement that as a conditional i_destroy_gp assignment and let > > > the consumer context key off whether that field is set rather than > > > attempt to infer unlinked logic (and then I guess reset it back to zero > > > so it doesn't leak across reincarnation). That also probably facilitates > > > a meaningful tracepoint to track the cases that do end up syncing, which > > > helps with your earlier question around batching, so I'll look into > > > those changes once I get through broader testing > > > > > > Brian > > > > > > > --D > > > > > > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > > > gc = get_cpu_ptr(mp->m_inodegc); > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > > index c447bf04205a..2153e3edbb86 100644 > > > > > --- a/fs/xfs/xfs_inode.h > > > > > +++ b/fs/xfs/xfs_inode.h > > > > > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > > > > > /* Transaction and locking information. */ > > > > > struct xfs_inode_log_item *i_itemp; /* logging information */ > > > > > mrlock_t i_lock; /* inode lock */ > > > > > - atomic_t i_pincount; /* inode pin count */ > > > > > struct llist_node i_gclist; /* deferred inactivation list */ > > > > > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > > > > > + atomic_t i_pincount; /* inode pin count */ > > > > > > > > > > /* > > > > > * Bitsets of inode metadata that have been checked and/or are sick. > > > > > -- > > > > > 2.31.1 > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
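As a concrete example of the batching Paul suggests above (start a grace period only every N events, use the cheaper state read otherwise), a sketch follows. The helper name and the 1000-event threshold are made up for illustration; start_poll_synchronize_rcu() and get_state_synchronize_rcu() are the real RCU APIs being discussed.

	/*
	 * Illustrative only: force a grace period to start on every
	 * 1000th destroy and just sample the current grace period state
	 * the rest of the time.
	 */
	static unsigned long
	xfs_destroy_gp_cookie(void)
	{
		static atomic_t	destroys = ATOMIC_INIT(0);

		if (atomic_inc_return(&destroys) % 1000 == 0)
			return start_poll_synchronize_rcu();
		return get_state_synchronize_rcu();
	}

Either cookie can later be passed to cond_synchronize_rcu() in xfs_iget_recycle() exactly as in the posted patch.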
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster 2022-01-21 17:26 ` Darrick J. Wong @ 2022-01-23 22:43 ` Dave Chinner 2022-01-24 15:06 ` Brian Foster 2022-01-24 15:02 ` Brian Foster 2 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-23 22:43 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > The XFS inode allocation algorithm aggressively reuses recently > freed inodes. This is historical behavior that has been in place for > quite some time, since XFS was imported to mainline Linux. Once the > VFS adopted RCUwalk path lookups (also some time ago), this behavior > became slightly incompatible because the inode recycle path doesn't > isolate concurrent access to the inode from the VFS. > > This has recently manifested as problems in the VFS when XFS happens > to change the type or properties of a recently unlinked inode while > still involved in an RCU lookup. For example, if the VFS refers to a > previous incarnation of a symlink inode, obtains the ->get_link() > callback from inode_operations, and the latter happens to change to > a non-symlink type via a recycle event, the ->get_link() callback > pointer is reset to NULL and the lookup results in a crash. > > To avoid this class of problem, isolate in-core inodes for recycling > with an RCU grace period. This is the same level of protection the > VFS expects for inactivated inodes that are never reused, and so > guarantees no further concurrent access before the type or > properties of the inode change. We don't want an unconditional > synchronize_rcu() event here because that would result in a > significant performance impact to mixed inode allocation workloads. > > Fortunately, we can take advantage of the recently added deferred > inactivation mechanism to mitigate the need for an RCU wait in most > cases. Deferred inactivation queues and batches the on-disk freeing > of recently destroyed inodes, and so significantly increases the > likelihood that a grace period has elapsed by the time an inode is > freed and observable by the allocation code as a reuse candidate. > Capture the current RCU grace period cookie at inode destroy time > and refer to it at allocation time to conditionally wait for an RCU > grace period if one hadn't expired in the meantime. Since only > unlinked inodes are recycle candidates and unlinked inodes always > require inactivation, we only need to poll and assign RCU state in > the inactivation codepath. I think this assertion is incorrect. Recycling can occur on any inode that has been evicted from the VFS cache. i.e. while the inode is sitting in XFS_IRECLAIMABLE state waiting for the background inodegc to run (every ~5s by default) a ->lookup from the VFS can occur and we find that same inode sitting there in XFS_IRECLAIMABLE state. This lookup then hits the recycle path. In this case, even though we re-instantiate the inode into the same identity, it goes through a transient state where the inode has it's identity returned to the default initial "just allocated" VFS state and this transient state can be visible from RCU lookups within the RCU grace period the inode was evicted from. This means the RCU lookup could see the inode with i_ops having been reset to &empty_ops, which means any method called on the inode at this time (e.g. ->get_link) will hit a NULL pointer dereference. 
This requires multiple concurrent lookups on the same inode that just got evicted, some which the RCU pathwalk finds the old stale dentry/inode pair, and others that don't find that old pair. This is much harder to trip over but, IIRC, we used to see this quite a lot with NFS server workloads when multiple operations on a single inode could come in from multiple clients and be processed in parallel by knfsd threads. This was quite a hot path before the NFS server had an open-file cache added to it, and it probably still is if the NFS server OFC is not large enough for the working set of files being accessed... Hence we have to ensure that RCU lookups can't find an evicted inode through anything other than xfs_iget() while we are re-instantiating the VFS inode state in xfs_iget_recycle(). Hence the RCU state sampling needs to be done unconditionally for all inodes going through ->destroy_inode so we can ensure grace periods expire for all inodes being recycled, not just those that required inactivation... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
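A sketch of what Dave describes here: move the sampling so that every inode passing through ->destroy_inode is covered, rather than only those queued for inactivation. xfs_fs_destroy_inode() is the existing XFS ->destroy_inode callback; where exactly the assignment would sit relative to the existing logic is an assumption.

	static void
	xfs_fs_destroy_inode(
		struct inode		*inode)
	{
		struct xfs_inode	*ip = XFS_I(inode);

		/* ... existing tracing and assertions ... */

		/*
		 * Record the current grace period for every inode being
		 * torn down so that xfs_iget_recycle() can wait on it if
		 * the inode is recycled before that grace period expires.
		 */
		ip->i_destroy_gp = start_poll_synchronize_rcu();

		/* ... hand the inode off to inodegc/reclaim as before ... */
	}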
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-23 22:43 ` Dave Chinner @ 2022-01-24 15:06 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-24 15:06 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 09:43:46AM +1100, Dave Chinner wrote: > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > The XFS inode allocation algorithm aggressively reuses recently > > freed inodes. This is historical behavior that has been in place for > > quite some time, since XFS was imported to mainline Linux. Once the > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > became slightly incompatible because the inode recycle path doesn't > > isolate concurrent access to the inode from the VFS. > > > > This has recently manifested as problems in the VFS when XFS happens > > to change the type or properties of a recently unlinked inode while > > still involved in an RCU lookup. For example, if the VFS refers to a > > previous incarnation of a symlink inode, obtains the ->get_link() > > callback from inode_operations, and the latter happens to change to > > a non-symlink type via a recycle event, the ->get_link() callback > > pointer is reset to NULL and the lookup results in a crash. > > > > To avoid this class of problem, isolate in-core inodes for recycling > > with an RCU grace period. This is the same level of protection the > > VFS expects for inactivated inodes that are never reused, and so > > guarantees no further concurrent access before the type or > > properties of the inode change. We don't want an unconditional > > synchronize_rcu() event here because that would result in a > > significant performance impact to mixed inode allocation workloads. > > > > Fortunately, we can take advantage of the recently added deferred > > inactivation mechanism to mitigate the need for an RCU wait in most > > cases. Deferred inactivation queues and batches the on-disk freeing > > of recently destroyed inodes, and so significantly increases the > > likelihood that a grace period has elapsed by the time an inode is > > freed and observable by the allocation code as a reuse candidate. > > Capture the current RCU grace period cookie at inode destroy time > > and refer to it at allocation time to conditionally wait for an RCU > > grace period if one hadn't expired in the meantime. Since only > > unlinked inodes are recycle candidates and unlinked inodes always > > require inactivation, we only need to poll and assign RCU state in > > the inactivation codepath. > > I think this assertion is incorrect. > > Recycling can occur on any inode that has been evicted from the VFS > cache. i.e. while the inode is sitting in XFS_IRECLAIMABLE state > waiting for the background inodegc to run (every ~5s by default) a > ->lookup from the VFS can occur and we find that same inode sitting > there in XFS_IRECLAIMABLE state. This lookup then hits the recycle > path. > See my reply to Darrick wrt to the poor wording. I'm aware of the eviction -> recycle case, just didn't think we needed to deal with it here. > In this case, even though we re-instantiate the inode into the same > identity, it goes through a transient state where the inode has it's > identity returned to the default initial "just allocated" VFS state > and this transient state can be visible from RCU lookups within the > RCU grace period the inode was evicted from. 
This means the RCU > lookup could see the inode with i_ops having been reset to > &empty_ops, which means any method called on the inode at this time > (e.g. ->get_link) will hit a NULL pointer dereference. > Hmm, good point. > This requires multiple concurrent lookups on the same inode that > just got evicted, some which the RCU pathwalk finds the old stale > dentry/inode pair, and others that don't find that old pair. This is > much harder to trip over but, IIRC, we used to see this quite a lot > with NFS server workloads when multiple operations on a single inode > could come in from multiple clients and be processed in parallel by > knfsd threads. This was quite a hot path before the NFS server had an > open-file cache added to it, and it probably still is if the NFS > server OFC is not large enough for the working set of files being > accessed... > > Hence we have to ensure that RCU lookups can't find an evicted inode > through anything other than xfs_iget() while we are re-instantiating > the VFS inode state in xfs_iget_recycle(). Hence the RCU state > sampling needs to be done unconditionally for all inodes going > through ->destroy_inode so we can ensure grace periods expire for > all inodes being recycled, not just those that required > inactivation... > Yeah, that makes sense. So this means we don't want to filter to unlinked inodes, but OTOH Paul's feedback suggests the RCU calls should be fairly efficient on a per-inode basis. On top of that, the non-unlinked eviction case doesn't have such a direct impact on a mixed workload the way the unlinked case does (i.e. inactivation populating a free inode record for the next inode allocation to discover), so this is probably less significant of a change. Personally, my general takeaway from the just posted test results is that we really should be thinking about how to shift the allocation path cost away into the inactivation side, even if not done from the start. This changes things a bit because we know we need an rcu sync in the iget path for the (non-unlinnked) eviction case regardless, so perhaps the right approach is to get the basic functional fix in place to start, then revisit potential optimizations in the inactivation path for the unlinked inode case. IOW, a conditional, asynchronous rcu delay in the inactivation path (only) for unlinked inodes doesn't remove the need for an iget rcu sync in general, but it would still improve inode allocation performance if we ensure those inodes aren't reallocatable until a grace period has elapsed. We just have to implement it in a way that doesn't unreasonably impact sustained removal performance. Thoughts? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
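One way the "conditional, asynchronous rcu delay in the inactivation path" idea above could be expressed, purely as a sketch: rather than blocking anywhere, poll the cookie when a gc batch is processed and leave not-yet-expired inodes queued for a later pass. Nothing like this requeue exists in the posted patch; poll_state_synchronize_rcu() is the real non-blocking counterpart to cond_synchronize_rcu().

	/* in the inodegc worker, for each queued unlinked inode */
	if (!poll_state_synchronize_rcu(ip->i_destroy_gp)) {
		/*
		 * Grace period still pending: skip this inode and leave
		 * it queued for a later pass instead of stalling the
		 * worker (requeue mechanics hand-waved here).
		 */
		continue;
	}
	/* ...proceed with inactivation and on-disk freeing as normal... */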
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster 2022-01-21 17:26 ` Darrick J. Wong 2022-01-23 22:43 ` Dave Chinner @ 2022-01-24 15:02 ` Brian Foster 2022-01-24 22:08 ` Dave Chinner 2 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-24 15:02 UTC (permalink / raw) To: linux-xfs; +Cc: Dave Chinner, Al Viro, Ian Kent, rcu On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > The XFS inode allocation algorithm aggressively reuses recently > freed inodes. This is historical behavior that has been in place for > quite some time, since XFS was imported to mainline Linux. Once the > VFS adopted RCUwalk path lookups (also some time ago), this behavior > became slightly incompatible because the inode recycle path doesn't > isolate concurrent access to the inode from the VFS. > > This has recently manifested as problems in the VFS when XFS happens > to change the type or properties of a recently unlinked inode while > still involved in an RCU lookup. For example, if the VFS refers to a > previous incarnation of a symlink inode, obtains the ->get_link() > callback from inode_operations, and the latter happens to change to > a non-symlink type via a recycle event, the ->get_link() callback > pointer is reset to NULL and the lookup results in a crash. > > To avoid this class of problem, isolate in-core inodes for recycling > with an RCU grace period. This is the same level of protection the > VFS expects for inactivated inodes that are never reused, and so > guarantees no further concurrent access before the type or > properties of the inode change. We don't want an unconditional > synchronize_rcu() event here because that would result in a > significant performance impact to mixed inode allocation workloads. > > Fortunately, we can take advantage of the recently added deferred > inactivation mechanism to mitigate the need for an RCU wait in most > cases. Deferred inactivation queues and batches the on-disk freeing > of recently destroyed inodes, and so significantly increases the > likelihood that a grace period has elapsed by the time an inode is > freed and observable by the allocation code as a reuse candidate. > Capture the current RCU grace period cookie at inode destroy time > and refer to it at allocation time to conditionally wait for an RCU > grace period if one hadn't expired in the meantime. Since only > unlinked inodes are recycle candidates and unlinked inodes always > require inactivation, we only need to poll and assign RCU state in > the inactivation codepath. Slightly adjust struct xfs_inode to fit > the new field into padding holes that conveniently preexist in the > same cacheline as the deferred inactivation list. > > Finally, note that the ideal long term solution here is to > rearchitect bits of XFS' internal inode lifecycle management such > that this additional stall point is not required, but this requires > more thought, time and work to address. This approach restores > functional correctness in the meantime. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > > Hi all, > > Here's the RCU fixup patch for inode reuse that I've been playing with, > re: the vfs patch discussion [1]. I've put it in pretty much the most > basic form, but I think there are a couple aspects worth thinking about: > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > get_state_synchronize_rcu()). 
The former is a bit more active than the > latter in that it triggers the start of a grace period, when necessary. > This currently invokes per inode, which is the ideal frequency in > theory, but could be reduced, associated with the xfs_inogegc thresholds > in some manner, etc., if there is good reason to do that. > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > queue and nowhere else because the RCU docs imply that counter rollover > is not a significant problem. In practice, I think this means that if an > inode is stamped at least once, and the counter rolls over, future > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > trigger rcu syncs. I think this would require repeated > eviction/reinstantiation cycles within a small window to be noticeable, > so I'm not sure how likely this is to occur. We could be more defensive > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > at recycle time, unconditionally refresh at destroy time (using > get_state_synchronize_rcu() for non-inactivation), etc. > > Otherwise testing is ongoing, but this version at least survives an > fstests regression run. > FYI, I modified my repeated alloc/free test to do some batching and form it into something more able to measure the potential side effect / cost of the grace period sync. The test is a single threaded, file alloc/free loop using a variable per iteration batch size. The test runs for ~60s and reports how many total files were allocated/freed in that period with the specified batch size. Note that this particular test ran without any background workload. Results are as follows: files baseline test 1 38480 38437 4 126055 111080 8 218299 134469 16 306619 141968 32 397909 152267 64 418603 200875 128 469077 289365 256 684117 566016 512 931328 878933 1024 1126741 1118891 The first column shows the batch size of the test run while the second and third show results (averaged across three test runs) for the baseline (5.16.0-rc5) and test kernels. This basically shows that as the inactivation queue more efficiently batches removals, the number of stalls on the allocation side increase accordingly and thus slow the task down. This becomes significant by around 8 files per alloc/free iteration and seems to recover at around 512 files per iteration. Outside of those values, the additional overhead appears to be mostly masked. I'm not sure how realistic this sort of symmetric/predictable workload is in the wild, but this is more designed to show potential impact of the change. The delay cost can be shifted to the remove side to some degree if we wanted to go that route. E.g., a quick experiment to add an rcu sync in the inactivation path right before the inode is freed allows this test to behave much more in line with baseline up through about the 256 file mark, after which point results start to fall off as I suspect we start to measure stalls in the remove side. That's just a test of a quick hack, however. Since there is no real urgency to inactivate an unlinked inode (it has no potential users until it's freed), I suspect that result can be further optimized to absorb the cost of an rcu delay by deferring the steps that make the inode available for reallocation in the first place. In theory if that can be made completely asynchronous, then there is no real latency cost at all because nothing can use the inode until it's ultimately free on disk. 
However in reality we must have thresholds and whatnot to ensure the outstanding queue cannot grow out of control. My previous experiments suggest that an RCU delay on the inactivation side is measureable via a simple 'rm -rf' with the current thresholds, but can be mitigated if the pipeline/thresholds are tuned up a bit to accomodate the added delay. This has more complexity and tradeoffs, but IMO, this is something we should be thinking about at least as a next step to something like this patch. Brian > Brian > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/ > > fs/xfs/xfs_icache.c | 11 +++++++++++ > fs/xfs/xfs_inode.h | 3 ++- > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index d019c98eb839..4931daa45ca4 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -349,6 +349,16 @@ xfs_iget_recycle( > spin_unlock(&ip->i_flags_lock); > rcu_read_unlock(); > > + /* > + * VFS RCU pathwalk lookups dictate the same lifecycle rules for an > + * inode recycle as for freeing an inode. I.e., we cannot repurpose the > + * inode until a grace period has elapsed from the time the previous > + * version of the inode was destroyed. In most cases a grace period has > + * already elapsed if the inode was (deferred) inactivated, but > + * synchronize here as a last resort to guarantee correctness. > + */ > + cond_synchronize_rcu(ip->i_destroy_gp); > + > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > error = xfs_reinit_inode(mp, inode); > if (error) { > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue( > trace_xfs_inode_set_need_inactive(ip); > spin_lock(&ip->i_flags_lock); > ip->i_flags |= XFS_NEED_INACTIVE; > + ip->i_destroy_gp = start_poll_synchronize_rcu(); > spin_unlock(&ip->i_flags_lock); > > gc = get_cpu_ptr(mp->m_inodegc); > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index c447bf04205a..2153e3edbb86 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -40,8 +40,9 @@ typedef struct xfs_inode { > /* Transaction and locking information. */ > struct xfs_inode_log_item *i_itemp; /* logging information */ > mrlock_t i_lock; /* inode lock */ > - atomic_t i_pincount; /* inode pin count */ > struct llist_node i_gclist; /* deferred inactivation list */ > + unsigned long i_destroy_gp; /* destroy rcugp cookie */ > + atomic_t i_pincount; /* inode pin count */ > > /* > * Bitsets of inode metadata that have been checked and/or are sick. > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 36+ messages in thread
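The "quick experiment" mentioned above amounts to moving the wait from the allocation side to the inactivation side, i.e. something along these lines just before the inode is marked free on disk in xfs_inactive_ifree() (the exact placement is a guess at what the experiment did):

	/*
	 * Wait out the destroy-time grace period before the inode is
	 * marked free on disk and becomes visible to the allocator as a
	 * reuse candidate, instead of waiting in xfs_iget_recycle().
	 */
	cond_synchronize_rcu(ip->i_destroy_gp);

	error = xfs_ifree(tp, ip);

With this, the allocation path only ever sees inodes whose previous incarnation is already past a grace period, at the cost of a potential stall in the background inactivation worker.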
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 15:02 ` Brian Foster @ 2022-01-24 22:08 ` Dave Chinner 2022-01-24 23:29 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-24 22:08 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 10:02:27AM -0500, Brian Foster wrote: > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > The XFS inode allocation algorithm aggressively reuses recently > > freed inodes. This is historical behavior that has been in place for > > quite some time, since XFS was imported to mainline Linux. Once the > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > became slightly incompatible because the inode recycle path doesn't > > isolate concurrent access to the inode from the VFS. > > > > This has recently manifested as problems in the VFS when XFS happens > > to change the type or properties of a recently unlinked inode while > > still involved in an RCU lookup. For example, if the VFS refers to a > > previous incarnation of a symlink inode, obtains the ->get_link() > > callback from inode_operations, and the latter happens to change to > > a non-symlink type via a recycle event, the ->get_link() callback > > pointer is reset to NULL and the lookup results in a crash. > > > > To avoid this class of problem, isolate in-core inodes for recycling > > with an RCU grace period. This is the same level of protection the > > VFS expects for inactivated inodes that are never reused, and so > > guarantees no further concurrent access before the type or > > properties of the inode change. We don't want an unconditional > > synchronize_rcu() event here because that would result in a > > significant performance impact to mixed inode allocation workloads. > > > > Fortunately, we can take advantage of the recently added deferred > > inactivation mechanism to mitigate the need for an RCU wait in most > > cases. Deferred inactivation queues and batches the on-disk freeing > > of recently destroyed inodes, and so significantly increases the > > likelihood that a grace period has elapsed by the time an inode is > > freed and observable by the allocation code as a reuse candidate. > > Capture the current RCU grace period cookie at inode destroy time > > and refer to it at allocation time to conditionally wait for an RCU > > grace period if one hadn't expired in the meantime. Since only > > unlinked inodes are recycle candidates and unlinked inodes always > > require inactivation, we only need to poll and assign RCU state in > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > the new field into padding holes that conveniently preexist in the > > same cacheline as the deferred inactivation list. > > > > Finally, note that the ideal long term solution here is to > > rearchitect bits of XFS' internal inode lifecycle management such > > that this additional stall point is not required, but this requires > > more thought, time and work to address. This approach restores > > functional correctness in the meantime. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > > > Hi all, > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > re: the vfs patch discussion [1]. I've put it in pretty much the most > > basic form, but I think there are a couple aspects worth thinking about: > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > get_state_synchronize_rcu()). 
The former is a bit more active than the > > latter in that it triggers the start of a grace period, when necessary. > > This currently invokes per inode, which is the ideal frequency in > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > in some manner, etc., if there is good reason to do that. > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > queue and nowhere else because the RCU docs imply that counter rollover > > is not a significant problem. In practice, I think this means that if an > > inode is stamped at least once, and the counter rolls over, future > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > trigger rcu syncs. I think this would require repeated > > eviction/reinstantiation cycles within a small window to be noticeable, > > so I'm not sure how likely this is to occur. We could be more defensive > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > at recycle time, unconditionally refresh at destroy time (using > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > Otherwise testing is ongoing, but this version at least survives an > > fstests regression run. > > > > FYI, I modified my repeated alloc/free test to do some batching and form > it into something more able to measure the potential side effect / cost > of the grace period sync. The test is a single threaded, file alloc/free > loop using a variable per iteration batch size. The test runs for ~60s > and reports how many total files were allocated/freed in that period > with the specified batch size. Note that this particular test ran > without any background workload. Results are as follows: > > files baseline test > > 1 38480 38437 > 4 126055 111080 > 8 218299 134469 > 16 306619 141968 > 32 397909 152267 > 64 418603 200875 > 128 469077 289365 > 256 684117 566016 > 512 931328 878933 > 1024 1126741 1118891 Can you post the test code, because 38,000 alloc/unlinks in 60s is extremely slow for a single tight open-unlink-close loop. I'd be expecting at least ~10,000 alloc/unlink iterations per second, not 650/second. A quick test here with "batch size == 1" main loop on a vanilla 5.17-rc1 kernel: for (i = 0; i < iters; i++) { int fd = open(file, O_CREAT|O_RDWR, 0777); if (fd < 0) { perror("open"); exit(1); } unlink(file); close(fd); } $ time ./open-unlink 10000 /mnt/scratch/blah real 0m0.962s user 0m0.022s sys 0m0.775s Shows pretty much 10,000 alloc/unlinks a second without any specific batching on my slow machine. And my "fast" machine (3yr old 2.1GHz Xeons) $ time sudo ./open-unlink 40000 /mnt/scratch/foo real 0m0.958s user 0m0.033s sys 0m0.770s Runs single loop iterations at 40,000 alloc/unlink iterations per second. So I'm either not understanding the test you are running and/or the kernel/patches that you are comparing here. Is the "baseline" just a vanilla, unmodified upstream kernel, or something else? > That's just a test of a quick hack, however. Since there is no real > urgency to inactivate an unlinked inode (it has no potential users until > it's freed), On the contrary, there is extreme urgency to inactivate inodes quickly. Darrick made the original assumption that we could delay inactivation indefinitely and so he allowed really deep queues of up to 64k deferred inactivations. But with queues this deep, we could never get that background inactivation code to perform anywhere near the original synchronous background inactivation code. e.g. 
I measured 60-70% performance degradations on my scalability tests, and nothing stood out in the profiles until I started looking at CPU data cache misses. What we found was that if we don't run the background inactivation while the inodes are still hot in the CPU cache, the cost of bringing the inodes back into the CPU cache at a later time is extremely expensive and cannot be avoided. That's where all the performance was lost and so this is exactly what the current per-cpu background inactivation implementation avoids. i.e. we have shallow queues, early throttling and CPU affinity to ensure that the inodes are processed before they are evicted from the CPU caches and ensure we don't take a performance hit. IOWs, the deferred inactivation queues are designed to minimise inactivation delay, generally trying to delay inactivation for a couple of milliseconds at most during typical fast-path inactivations (i.e. an extent or two per inode needing to be freed, plus maybe the inode itself). Such inactivations generally take 50-100us of CPU time each to process, and we try to keep the inactivation batch size down to 32 inodes... > I suspect that result can be further optimized to absorb > the cost of an rcu delay by deferring the steps that make the inode > available for reallocation in the first place. A typical RCU grace period delay is longer than the latency we require to keep the inodes hot in cache for efficient background inactivation. We can't move the "we need to RCU delay inactivation" overhead to the background inactivation code without taking a global performance hit to the filesystem performance due to the CPU cache thrashing it will introduce.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
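For completeness, Dave's "batch size == 1" loop above filled out into a compilable program. The includes and argument handling are additions; the core loop is as posted.

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		long i, iters;
		const char *file;

		if (argc != 3) {
			fprintf(stderr, "usage: %s iters file\n", argv[0]);
			exit(1);
		}
		iters = atol(argv[1]);
		file = argv[2];

		for (i = 0; i < iters; i++) {
			int fd = open(file, O_CREAT|O_RDWR, 0777);

			if (fd < 0) {
				perror("open");
				exit(1);
			}
			unlink(file);
			close(fd);
		}
		return 0;
	}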
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 22:08 ` Dave Chinner @ 2022-01-24 23:29 ` Brian Foster 2022-01-25 0:31 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-24 23:29 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > On Mon, Jan 24, 2022 at 10:02:27AM -0500, Brian Foster wrote: > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote: > > > The XFS inode allocation algorithm aggressively reuses recently > > > freed inodes. This is historical behavior that has been in place for > > > quite some time, since XFS was imported to mainline Linux. Once the > > > VFS adopted RCUwalk path lookups (also some time ago), this behavior > > > became slightly incompatible because the inode recycle path doesn't > > > isolate concurrent access to the inode from the VFS. > > > > > > This has recently manifested as problems in the VFS when XFS happens > > > to change the type or properties of a recently unlinked inode while > > > still involved in an RCU lookup. For example, if the VFS refers to a > > > previous incarnation of a symlink inode, obtains the ->get_link() > > > callback from inode_operations, and the latter happens to change to > > > a non-symlink type via a recycle event, the ->get_link() callback > > > pointer is reset to NULL and the lookup results in a crash. > > > > > > To avoid this class of problem, isolate in-core inodes for recycling > > > with an RCU grace period. This is the same level of protection the > > > VFS expects for inactivated inodes that are never reused, and so > > > guarantees no further concurrent access before the type or > > > properties of the inode change. We don't want an unconditional > > > synchronize_rcu() event here because that would result in a > > > significant performance impact to mixed inode allocation workloads. > > > > > > Fortunately, we can take advantage of the recently added deferred > > > inactivation mechanism to mitigate the need for an RCU wait in most > > > cases. Deferred inactivation queues and batches the on-disk freeing > > > of recently destroyed inodes, and so significantly increases the > > > likelihood that a grace period has elapsed by the time an inode is > > > freed and observable by the allocation code as a reuse candidate. > > > Capture the current RCU grace period cookie at inode destroy time > > > and refer to it at allocation time to conditionally wait for an RCU > > > grace period if one hadn't expired in the meantime. Since only > > > unlinked inodes are recycle candidates and unlinked inodes always > > > require inactivation, we only need to poll and assign RCU state in > > > the inactivation codepath. Slightly adjust struct xfs_inode to fit > > > the new field into padding holes that conveniently preexist in the > > > same cacheline as the deferred inactivation list. > > > > > > Finally, note that the ideal long term solution here is to > > > rearchitect bits of XFS' internal inode lifecycle management such > > > that this additional stall point is not required, but this requires > > > more thought, time and work to address. This approach restores > > > functional correctness in the meantime. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > > > > Hi all, > > > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, > > > re: the vfs patch discussion [1]. 
I've put it in pretty much the most > > > basic form, but I think there are a couple aspects worth thinking about: > > > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. > > > get_state_synchronize_rcu()). The former is a bit more active than the > > > latter in that it triggers the start of a grace period, when necessary. > > > This currently invokes per inode, which is the ideal frequency in > > > theory, but could be reduced, associated with the xfs_inogegc thresholds > > > in some manner, etc., if there is good reason to do that. > > > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation > > > queue and nowhere else because the RCU docs imply that counter rollover > > > is not a significant problem. In practice, I think this means that if an > > > inode is stamped at least once, and the counter rolls over, future > > > (non-inactivation, non-unlinked) eviction -> repopulation cycles could > > > trigger rcu syncs. I think this would require repeated > > > eviction/reinstantiation cycles within a small window to be noticeable, > > > so I'm not sure how likely this is to occur. We could be more defensive > > > by resetting or refreshing the cookie. E.g., refresh (or reset to zero) > > > at recycle time, unconditionally refresh at destroy time (using > > > get_state_synchronize_rcu() for non-inactivation), etc. > > > > > > Otherwise testing is ongoing, but this version at least survives an > > > fstests regression run. > > > > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > it into something more able to measure the potential side effect / cost > > of the grace period sync. The test is a single threaded, file alloc/free > > loop using a variable per iteration batch size. The test runs for ~60s > > and reports how many total files were allocated/freed in that period > > with the specified batch size. Note that this particular test ran > > without any background workload. Results are as follows: > > > > files baseline test > > > > 1 38480 38437 > > 4 126055 111080 > > 8 218299 134469 > > 16 306619 141968 > > 32 397909 152267 > > 64 418603 200875 > > 128 469077 289365 > > 256 684117 566016 > > 512 931328 878933 > > 1024 1126741 1118891 > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > extremely slow for a single tight open-unlink-close loop. I'd be > expecting at least ~10,000 alloc/unlink iterations per second, not > 650/second. > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm <files>' loop. I know there was application overhead because if I tweaked the script to open an fd directly rather than use touch, the single file performance jumped up a bit, but it seemed to wash away as I increased the file count so I kept running it with larger sizes. This seems off so I'll port it over to C code and see how much the numbers change. > A quick test here with "batch size == 1" main loop on a vanilla > 5.17-rc1 kernel: > > for (i = 0; i < iters; i++) { > int fd = open(file, O_CREAT|O_RDWR, 0777); > > if (fd < 0) { > perror("open"); > exit(1); > } > unlink(file); > close(fd); > } > > > $ time ./open-unlink 10000 /mnt/scratch/blah > > real 0m0.962s > user 0m0.022s > sys 0m0.775s > > Shows pretty much 10,000 alloc/unlinks a second without any specific > batching on my slow machine. 
And my "fast" machine (3yr old 2.1GHz > Xeons) > > $ time sudo ./open-unlink 40000 /mnt/scratch/foo > > real 0m0.958s > user 0m0.033s > sys 0m0.770s > > Runs single loop iterations at 40,000 alloc/unlink iterations per > second. > > So I'm either not understanding the test you are running and/or the > kernel/patches that you are comparing here. Is the "baseline" just a > vanilla, unmodified upstream kernel, or something else? > Yeah, the baseline was just the XFS for-next branch. > > That's just a test of a quick hack, however. Since there is no real > > urgency to inactivate an unlinked inode (it has no potential users until > > it's freed), > > On the contrary, there is extreme urgency to inactivate inodes > quickly. > Ok, I think we're talking about slightly different things. What I mean above is that if a task removes a file and goes off doing unrelated $work, that inode will just sit on the percpu queue indefinitely. That's fine, as there's no functional need for us to process it immediately unless we're around -ENOSPC thresholds or some such that demand reclaim of the inode. It sounds like what you're talking about is specifically the behavior/performance of sustained file removal (which is important obviously), where apparently there is a notable degradation if the queues become deep enough to push the inode batches out of CPU cache. So that makes sense... > Darrick made the original assumption that we could delay > inactivation indefinitely and so he allowed really deep queues of up > to 64k deferred inactivations. But with queues this deep, we could > never get that background inactivation code to perform anywhere near > the original synchronous background inactivation code. e.g. I > measured 60-70% performance degradataions on my scalability tests, > and nothing stood out in the profiles until I started looking at > CPU data cache misses. > ... but could you elaborate on the scalability tests involved here so I can get a better sense of it in practice and perhaps observe the impact of changes in this path? Brian > What we found was that if we don't run the background inactivation > while the inodes are still hot in the CPU cache, the cost of bring > the inodes back into the CPU cache at a later time is extremely > expensive and cannot be avoided. That's where all the performance > was lost and so this is exactly what the current per-cpu background > inactivation implementation avoids. i.e. we have shallow queues, > early throttling and CPU affinity to ensure that the inodes are > processed before they are evicted from the CPU caches and ensure we > don't take a performance hit. > > IOWs, the deferred inactivation queues are designed to minimise > inactivation delay, generally trying to delay inactivation for a > couple of milliseconds at most during typical fast-path > inactivations (i.e. an extent or two per inode needing to be freed, > plus maybe the inode itself). Such inactivations generally take > 50-100us of CPU time each to process, and we try to keep the > inactivation batch size down to 32 inodes... > > > I suspect that result can be further optimized to absorb > > the cost of an rcu delay by deferring the steps that make the inode > > available for reallocation in the first place. > > A typical RCU grace period delay is longer than the latency we > require to keep the inodes hot in cache for efficient background > inactivation. 
We can't move the "we need to RCU delay inactivation" > overhead to the background inactivation code without taking a > global performance hit to the filesystem performance due to the CPU > cache thrashing it will introduce.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
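For reference, a C version of the batched alloc/free test Brian describes porting above. This is a sketch only: the file naming, the fixed 60 second cutoff and the error handling are assumptions, and the numbers quoted earlier in the thread came from the shell script, not from this program.

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char path[4096];
		long total = 0;
		time_t stop;
		int batch;
		const char *dir;

		if (argc != 3) {
			fprintf(stderr, "usage: %s batch dir\n", argv[0]);
			exit(1);
		}
		batch = atoi(argv[1]);		/* files per iteration */
		dir = argv[2];			/* target directory */
		stop = time(NULL) + 60;

		while (time(NULL) < stop) {
			/* allocate a batch of files... */
			for (int i = 0; i < batch; i++) {
				snprintf(path, sizeof(path), "%s/f%d", dir, i);
				int fd = open(path, O_CREAT | O_RDWR, 0644);
				if (fd < 0) {
					perror("open");
					exit(1);
				}
				close(fd);
			}
			/* ...then free them all */
			for (int i = 0; i < batch; i++) {
				snprintf(path, sizeof(path), "%s/f%d", dir, i);
				unlink(path);
			}
			total += batch;
		}
		printf("%d-file batches: %ld files allocated/freed in 60s\n",
		       batch, total);
		return 0;
	}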
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-24 23:29 ` Brian Foster @ 2022-01-25 0:31 ` Dave Chinner 2022-01-25 14:40 ` Paul E. McKenney 2022-01-25 18:30 ` Brian Foster 0 siblings, 2 replies; 36+ messages in thread From: Dave Chinner @ 2022-01-25 0:31 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > it into something more able to measure the potential side effect / cost > > > of the grace period sync. The test is a single threaded, file alloc/free > > > loop using a variable per iteration batch size. The test runs for ~60s > > > and reports how many total files were allocated/freed in that period > > > with the specified batch size. Note that this particular test ran > > > without any background workload. Results are as follows: > > > > > > files baseline test > > > > > > 1 38480 38437 > > > 4 126055 111080 > > > 8 218299 134469 > > > 16 306619 141968 > > > 32 397909 152267 > > > 64 418603 200875 > > > 128 469077 289365 > > > 256 684117 566016 > > > 512 931328 878933 > > > 1024 1126741 1118891 > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > extremely slow for a single tight open-unlink-close loop. I'd be > > expecting at least ~10,000 alloc/unlink iterations per second, not > > 650/second. > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > <files>' loop. I know there was application overhead because if I > tweaked the script to open an fd directly rather than use touch, the > single file performance jumped up a bit, but it seemed to wash away as I > increased the file count so I kept running it with larger sizes. This > seems off so I'll port it over to C code and see how much the numbers > change. Yeah, using touch/rm becomes fork/exec bound very quickly. You'll find that using "echo > <file>" is much faster than "touch <file>" because it runs a shell built-in operation without fork/exec overhead to create the file. But you can't play tricks like that to replace rm: $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done real 0m2.653s user 0m0.910s sys 0m2.051s $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done real 0m1.260s user 0m0.452s sys 0m0.913s $ time ./open-unlink 1000 /mnt/scratch/foo real 0m0.037s user 0m0.001s sys 0m0.030s $ Note the difference in system time between the three operations - almost all the difference in system CPU time is the overhead of fork/exec to run the touch/rm binaries, not do the filesystem operations.... > > > That's just a test of a quick hack, however. Since there is no real > > > urgency to inactivate an unlinked inode (it has no potential users until > > > it's freed), > > > > On the contrary, there is extreme urgency to inactivate inodes > > quickly. > > > > Ok, I think we're talking about slightly different things. What I mean > above is that if a task removes a file and goes off doing unrelated > $work, that inode will just sit on the percpu queue indefinitely. That's > fine, as there's no functional need for us to process it immediately > unless we're around -ENOSPC thresholds or some such that demand reclaim > of the inode. Yup, an occasional unlink sitting around for a while on an unlinked list isn't going to cause a performance problem. 
Indeed, such workloads are more likely to benefit from the reduced unlink() syscall overhead and won't even notice the increase in background CPU overhead for inactivation of those occasional inodes. > It sounds like what you're talking about is specifically > the behavior/performance of sustained file removal (which is important > obviously), where apparently there is a notable degradation if the > queues become deep enough to push the inode batches out of CPU cache. So > that makes sense... Yup, sustained bulk throughput is where cache residency really matters. And for unlink, sustained unlink workloads are quite common; they often are something people wait for on the command line or make up a performance critical component of a highly concurrent workload so it's pretty important to get this part right. > > Darrick made the original assumption that we could delay > > inactivation indefinitely and so he allowed really deep queues of up > > to 64k deferred inactivations. But with queues this deep, we could > > never get that background inactivation code to perform anywhere near > > the original synchronous background inactivation code. e.g. I > > measured 60-70% performance degradataions on my scalability tests, > > and nothing stood out in the profiles until I started looking at > > CPU data cache misses. > > > > ... but could you elaborate on the scalability tests involved here so I > can get a better sense of it in practice and perhaps observe the impact > of changes in this path? The same concurrent fsmark create/traverse/unlink workloads I've been running for the past decade+ demonstrate it pretty simply. I also saw regressions with dbench (both op latency and throughput) as the client count (concurrency) increased, and with compilebench. I didn't look much further because all the common benchmarks I ran showed perf degradations with arbitrary delays that went away with the current code we have. ISTR that parts of aim7/reaim scalability workloads that the intel zero-day infrastructure runs are quite sensitive to background inactivation delays as well because that's a CPU bound workload and hence any reduction in cache residency results in a reduction of the number of concurrent jobs that can be run. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
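For reference, the open-unlink test program Dave times above was not posted to the thread; a minimal sketch of what such a loop might look like (illustrative only, not the actual test code):

/*
 * open-unlink: repeatedly create and unlink a file without any
 * fork/exec overhead.  Sketch only; the real test was not posted.
 *
 * Usage: ./open-unlink <count> <path>
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long i, count;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <count> <path>\n", argv[0]);
		return 1;
	}
	count = strtol(argv[1], NULL, 10);

	for (i = 0; i < count; i++) {
		/* allocate the inode */
		fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);

		/* and immediately free it again */
		if (unlink(argv[2]) < 0) {
			perror("unlink");
			return 1;
		}
	}
	return 0;
}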
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 0:31 ` Dave Chinner @ 2022-01-25 14:40 ` Paul E. McKenney 2022-01-25 22:36 ` Dave Chinner 2022-01-25 18:30 ` Brian Foster 1 sibling, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-01-25 14:40 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > > it into something more able to measure the potential side effect / cost > > > > of the grace period sync. The test is a single threaded, file alloc/free > > > > loop using a variable per iteration batch size. The test runs for ~60s > > > > and reports how many total files were allocated/freed in that period > > > > with the specified batch size. Note that this particular test ran > > > > without any background workload. Results are as follows: > > > > > > > > files baseline test > > > > > > > > 1 38480 38437 > > > > 4 126055 111080 > > > > 8 218299 134469 > > > > 16 306619 141968 > > > > 32 397909 152267 > > > > 64 418603 200875 > > > > 128 469077 289365 > > > > 256 684117 566016 > > > > 512 931328 878933 > > > > 1024 1126741 1118891 > > > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > > extremely slow for a single tight open-unlink-close loop. I'd be > > > expecting at least ~10,000 alloc/unlink iterations per second, not > > > 650/second. > > > > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > > <files>' loop. I know there was application overhead because if I > > tweaked the script to open an fd directly rather than use touch, the > > single file performance jumped up a bit, but it seemed to wash away as I > > increased the file count so I kept running it with larger sizes. This > > seems off so I'll port it over to C code and see how much the numbers > > change. > > Yeah, using touch/rm becomes fork/exec bound very quickly. You'll > find that using "echo > <file>" is much faster than "touch <file>" > because it runs a shell built-in operation without fork/exec > overhead to create the file. But you can't play tricks like that to > replace rm: > > $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m2.653s > user 0m0.910s > sys 0m2.051s > $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m1.260s > user 0m0.452s > sys 0m0.913s > $ time ./open-unlink 1000 /mnt/scratch/foo > > real 0m0.037s > user 0m0.001s > sys 0m0.030s > $ > > Note the difference in system time between the three operations - > almost all the difference in system CPU time is the overhead of > fork/exec to run the touch/rm binaries, not do the filesystem > operations.... > > > > > That's just a test of a quick hack, however. Since there is no real > > > > urgency to inactivate an unlinked inode (it has no potential users until > > > > it's freed), > > > > > > On the contrary, there is extreme urgency to inactivate inodes > > > quickly. > > > > > > > Ok, I think we're talking about slightly different things. What I mean > > above is that if a task removes a file and goes off doing unrelated > > $work, that inode will just sit on the percpu queue indefinitely. 
That's > > fine, as there's no functional need for us to process it immediately > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > of the inode. > > Yup, an occasional unlink sitting around for a while on an unlinked > list isn't going to cause a performance problem. Indeed, such > workloads are more likely to benefit from the reduced unlink() > syscall overhead and won't even notice the increase in background > CPU overhead for inactivation of those occasional inodes. > > > It sounds like what you're talking about is specifically > > the behavior/performance of sustained file removal (which is important > > obviously), where apparently there is a notable degradation if the > > queues become deep enough to push the inode batches out of CPU cache. So > > that makes sense... > > Yup, sustained bulk throughput is where cache residency really > matters. And for unlink, sustained unlink workloads are quite > common; they often are something people wait for on the command line > or make up a performance critical component of a highly concurrent > workload so it's pretty important to get this part right. > > > > Darrick made the original assumption that we could delay > > > inactivation indefinitely and so he allowed really deep queues of up > > > to 64k deferred inactivations. But with queues this deep, we could > > > never get that background inactivation code to perform anywhere near > > > the original synchronous background inactivation code. e.g. I > > > measured 60-70% performance degradataions on my scalability tests, > > > and nothing stood out in the profiles until I started looking at > > > CPU data cache misses. > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > can get a better sense of it in practice and perhaps observe the impact > > of changes in this path? > > The same conconrrent fsmark create/traverse/unlink workloads I've > been running for the past decade+ demonstrates it pretty simply. I > also saw regressions with dbench (both op latency and throughput) as > the clinet count (concurrency) increased, and with compilebench. I > didn't look much further because all the common benchmarks I ran > showed perf degradations with arbitrary delays that went away with > the current code we have. ISTR that parts of aim7/reaim scalability > workloads that the intel zero-day infrastructure runs are quite > sensitive to background inactivation delays as well because that's a > CPU bound workload and hence any reduction in cache residency > results in a reduction of the number of concurrent jobs that can be > run. Curiosity and all that, but has this work produced any intuition on the sensitivity of the performance/scalability to the delays? As in the effect of microseconds vs. tens of microsecond vs. hundreds of microseconds? Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 14:40 ` Paul E. McKenney @ 2022-01-25 22:36 ` Dave Chinner 2022-01-26 5:29 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-25 22:36 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Brian Foster, linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote: > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > > Ok, I think we're talking about slightly different things. What I mean > > > above is that if a task removes a file and goes off doing unrelated > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > fine, as there's no functional need for us to process it immediately > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > of the inode. > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > list isn't going to cause a performance problem. Indeed, such > > workloads are more likely to benefit from the reduced unlink() > > syscall overhead and won't even notice the increase in background > > CPU overhead for inactivation of those occasional inodes. > > > > > It sounds like what you're talking about is specifically > > > the behavior/performance of sustained file removal (which is important > > > obviously), where apparently there is a notable degradation if the > > > queues become deep enough to push the inode batches out of CPU cache. So > > > that makes sense... > > > > Yup, sustained bulk throughput is where cache residency really > > matters. And for unlink, sustained unlink workloads are quite > > common; they often are something people wait for on the command line > > or make up a performance critical component of a highly concurrent > > workload so it's pretty important to get this part right. > > > > > > Darrick made the original assumption that we could delay > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > never get that background inactivation code to perform anywhere near > > > > the original synchronous background inactivation code. e.g. I > > > > measured 60-70% performance degradataions on my scalability tests, > > > > and nothing stood out in the profiles until I started looking at > > > > CPU data cache misses. > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > can get a better sense of it in practice and perhaps observe the impact > > > of changes in this path? > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > been running for the past decade+ demonstrates it pretty simply. I > > also saw regressions with dbench (both op latency and throughput) as > > the clinet count (concurrency) increased, and with compilebench. I > > didn't look much further because all the common benchmarks I ran > > showed perf degradations with arbitrary delays that went away with > > the current code we have. ISTR that parts of aim7/reaim scalability > > workloads that the intel zero-day infrastructure runs are quite > > sensitive to background inactivation delays as well because that's a > > CPU bound workload and hence any reduction in cache residency > > results in a reduction of the number of concurrent jobs that can be > > run. 
> > Curiosity and all that, but has this work produced any intuition on > the sensitivity of the performance/scalability to the delays? As in > the effect of microseconds vs. tens of microsecond vs. hundreds of > microseconds? Some, yes. The upper delay threshold where performance is measurably impacted is in the order of single digit milliseconds, not microseconds. What I saw was that as the batch processing delay goes beyond ~5ms, IPC starts to fall. The CPU usage profile does not change shape, nor do the proportions of where CPU time is spent change. All I see is that data cache misses go up substantially and IPC drops substantially. If I read my notes correctly, typical change from "fast" to "slow" in IPC was 0.82 to 0.39 and LLC-load-misses from 3% to 12%. The IPC degradation was all done by the time the background batch processing times were longer than a typical scheduler tick (10ms). Now, I've been testing on Xeon CPUs with 36-76MB of L2-L3 caches, so there's a fair amount of data that these can hold. I expect that with smaller caches, the inflection point will be at smaller batch sizes rather than larger ones. Hence while I could have used larger batches for background processing (e.g. 64-128 inodes rather than 32), I chose smaller batch sizes by default so that CPUs with smaller caches are less likely to be adversely affected by the batch size being too large. OTOH, I started to measure noticeable degradation at batch sizes of 256 inodes on my machines, which is why the hard queue limit got set to 256 inodes. Scaling the delay/batch size down towards single inode queuing also resulted in perf degradation. This was largely because of all the extra scheduling overhead that trying to switch between the user task and the kernel worker task for every inode entailed. Context switch rate went from a couple of thousand/sec to over 100,000/s for single inode batches, and performance went backwards in proportion with the amount of CPU then spent on context switches. It also led to increases in buffer lock contention (hence context switches) as both user task and kworker try to access the same buffers... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
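As an aside, a much-simplified sketch of the per-cpu batched inactivation being described here; the structure and function names are illustrative rather than the actual fs/xfs implementation, and the thresholds simply mirror the 32/256 batch sizes Dave mentions:

#include <linux/llist.h>
#include <linux/workqueue.h>

#define EXAMPLE_INODEGC_BATCH	32	/* kick the worker at this depth */
#define EXAMPLE_INODEGC_MAX	256	/* throttle callers beyond this */

struct example_inode {
	struct llist_node	gclist;		/* entry on the per-cpu list */
	unsigned long		destroy_gp;	/* RCU cookie from inactivation */
};

struct example_inodegc {
	struct llist_head	list;	/* lockless per-cpu inode list */
	struct work_struct	work;	/* background inactivation worker */
	unsigned int		items;	/* approximate queue depth */
};

/* Called from inode destroy context to defer the expensive ifree work. */
static void example_inodegc_queue(struct example_inodegc *gc,
				  struct example_inode *ip)
{
	llist_add(&ip->gclist, &gc->list);

	/*
	 * Small batches stay resident in the CPU cache; the worker
	 * resets 'items' when it drains the list (not shown).  Past
	 * EXAMPLE_INODEGC_MAX the caller would block until the worker
	 * catches up.
	 */
	if (++gc->items >= EXAMPLE_INODEGC_BATCH)
		queue_work(system_unbound_wq, &gc->work);
}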
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 22:36 ` Dave Chinner @ 2022-01-26 5:29 ` Paul E. McKenney 2022-01-26 13:21 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-01-26 5:29 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Al Viro, Ian Kent, rcu On Wed, Jan 26, 2022 at 09:36:07AM +1100, Dave Chinner wrote: > On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote: > > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > > > Ok, I think we're talking about slightly different things. What I mean > > > > above is that if a task removes a file and goes off doing unrelated > > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > > fine, as there's no functional need for us to process it immediately > > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > > of the inode. > > > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > > list isn't going to cause a performance problem. Indeed, such > > > workloads are more likely to benefit from the reduced unlink() > > > syscall overhead and won't even notice the increase in background > > > CPU overhead for inactivation of those occasional inodes. > > > > > > > It sounds like what you're talking about is specifically > > > > the behavior/performance of sustained file removal (which is important > > > > obviously), where apparently there is a notable degradation if the > > > > queues become deep enough to push the inode batches out of CPU cache. So > > > > that makes sense... > > > > > > Yup, sustained bulk throughput is where cache residency really > > > matters. And for unlink, sustained unlink workloads are quite > > > common; they often are something people wait for on the command line > > > or make up a performance critical component of a highly concurrent > > > workload so it's pretty important to get this part right. > > > > > > > > Darrick made the original assumption that we could delay > > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > > never get that background inactivation code to perform anywhere near > > > > > the original synchronous background inactivation code. e.g. I > > > > > measured 60-70% performance degradataions on my scalability tests, > > > > > and nothing stood out in the profiles until I started looking at > > > > > CPU data cache misses. > > > > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > > can get a better sense of it in practice and perhaps observe the impact > > > > of changes in this path? > > > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > > been running for the past decade+ demonstrates it pretty simply. I > > > also saw regressions with dbench (both op latency and throughput) as > > > the clinet count (concurrency) increased, and with compilebench. I > > > didn't look much further because all the common benchmarks I ran > > > showed perf degradations with arbitrary delays that went away with > > > the current code we have. 
ISTR that parts of aim7/reaim scalability > > > workloads that the intel zero-day infrastructure runs are quite > > > sensitive to background inactivation delays as well because that's a > > > CPU bound workload and hence any reduction in cache residency > > > results in a reduction of the number of concurrent jobs that can be > > > run. > > > > Curiosity and all that, but has this work produced any intuition on > > the sensitivity of the performance/scalability to the delays? As in > > the effect of microseconds vs. tens of microsecond vs. hundreds of > > microseconds? > > Some, yes. > > The upper delay threshold where performance is measurably > impacted is in the order of single digit milliseconds, not > microseconds. > > What I saw was that as the batch processing delay goes beyond ~5ms, > IPC starts to fall. The CPU usage profile does not change shape, nor > does the proportions of where CPU time is spent change. All I see if > data cache misses go up substantially and IPC drop substantially. If > I read my notes correctly, typical change from "fast" to "slow" in > IPC was 0.82 to 0.39 and LLC-load-misses from 3% to 12%. The IPC > degradation was all done by the time the background batch processing > times were longer than a typical scheduler tick (10ms). > > Now, I've been testing on Xeon CPUs with 36-76MB of l2-l3 caches, so > there's a fair amount of data that these can hold. I expect that > with smaller caches, the inflection point will be at smaller batch > sizes rather than more. Hence while I could have used larger batches > for background processing (e.g. 64-128 inodes rather than 32), I > chose smaller batch sizes by default so that CPUs with smaller > caches are less likely to be adversely affected by the batch size > being too large. OTOH, I started to measure noticable degradation by > batch sizes of 256 inodes on my machines, which is why the hard > queue limit got set to 256 inodes. > > Scaling the delay/batch size down towards single inode queuing also > resulted in perf degradation. This was largely because of all the > extra scheduling overhead that trying to switching between user task > and kernel worker task for every inode entailed. Context switch rate > went from a couple of thousand/sec to over 100,000/s for single > inode batches, and performance went backwards in proportion with the > amount of CPU then spent on context switches. It also lead to > increases in buffer lock contention (hence context switches) as both > user task and kworker try to access the same buffers... Makes sense. Never a guarantee of easy answers. ;-) If it would help, I could create expedited-grace-period counterparts of get_state_synchronize_rcu(), start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and cond_synchronize_rcu(). These would provide sub-millisecond grace periods, in fact, sub-100-microsecond grace periods on smaller systems. Of course, nothing comes for free. Although expedited grace periods are way way cheaper than they used to be, they still IPI non-idle non-nohz_full-userspace CPUs, which translates to roughly the CPU overhead of a wakeup on each IPIed CPU. And of course disruption to aggressive non-nohz_full real-time applications. Shorter latencies also translates to fewer updates over which to amortize grace-period overhead. But it should get well under your single-digit milliseconds of delay. Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
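For readers following along, the non-expedited polled grace-period calls already exist and fit the patch's scheme roughly as sketched below; the destroy_gp field name is illustrative, and the expedited counterparts Paul offers would slot into the same call sites:

#include <linux/rcupdate.h>

/* Inactivation side: record (and if needed start) a grace period. */
static void example_record_gp(struct example_inode *ip)
{
	ip->destroy_gp = start_poll_synchronize_rcu();
}

/* Allocation side: prefer inodes whose grace period has already elapsed. */
static bool example_can_recycle(struct example_inode *ip)
{
	return poll_state_synchronize_rcu(ip->destroy_gp);
}

/* If no other free inode is available, wait out whatever remains. */
static void example_force_recycle(struct example_inode *ip)
{
	cond_synchronize_rcu(ip->destroy_gp);
}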
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-26 5:29 ` Paul E. McKenney @ 2022-01-26 13:21 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-26 13:21 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 09:29:10PM -0800, Paul E. McKenney wrote: > On Wed, Jan 26, 2022 at 09:36:07AM +1100, Dave Chinner wrote: > > On Tue, Jan 25, 2022 at 06:40:44AM -0800, Paul E. McKenney wrote: > > > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > > > > Ok, I think we're talking about slightly different things. What I mean > > > > > above is that if a task removes a file and goes off doing unrelated > > > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > > > fine, as there's no functional need for us to process it immediately > > > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > > > of the inode. > > > > > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > > > list isn't going to cause a performance problem. Indeed, such > > > > workloads are more likely to benefit from the reduced unlink() > > > > syscall overhead and won't even notice the increase in background > > > > CPU overhead for inactivation of those occasional inodes. > > > > > > > > > It sounds like what you're talking about is specifically > > > > > the behavior/performance of sustained file removal (which is important > > > > > obviously), where apparently there is a notable degradation if the > > > > > queues become deep enough to push the inode batches out of CPU cache. So > > > > > that makes sense... > > > > > > > > Yup, sustained bulk throughput is where cache residency really > > > > matters. And for unlink, sustained unlink workloads are quite > > > > common; they often are something people wait for on the command line > > > > or make up a performance critical component of a highly concurrent > > > > workload so it's pretty important to get this part right. > > > > > > > > > > Darrick made the original assumption that we could delay > > > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > > > never get that background inactivation code to perform anywhere near > > > > > > the original synchronous background inactivation code. e.g. I > > > > > > measured 60-70% performance degradataions on my scalability tests, > > > > > > and nothing stood out in the profiles until I started looking at > > > > > > CPU data cache misses. > > > > > > > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > > > can get a better sense of it in practice and perhaps observe the impact > > > > > of changes in this path? > > > > > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > > > been running for the past decade+ demonstrates it pretty simply. I > > > > also saw regressions with dbench (both op latency and throughput) as > > > > the clinet count (concurrency) increased, and with compilebench. I > > > > didn't look much further because all the common benchmarks I ran > > > > showed perf degradations with arbitrary delays that went away with > > > > the current code we have. 
ISTR that parts of aim7/reaim scalability > > > > workloads that the intel zero-day infrastructure runs are quite > > > > sensitive to background inactivation delays as well because that's a > > > > CPU bound workload and hence any reduction in cache residency > > > > results in a reduction of the number of concurrent jobs that can be > > > > run. > > > > > > Curiosity and all that, but has this work produced any intuition on > > > the sensitivity of the performance/scalability to the delays? As in > > > the effect of microseconds vs. tens of microsecond vs. hundreds of > > > microseconds? > > > > Some, yes. > > > > The upper delay threshold where performance is measurably > > impacted is in the order of single digit milliseconds, not > > microseconds. > > > > What I saw was that as the batch processing delay goes beyond ~5ms, > > IPC starts to fall. The CPU usage profile does not change shape, nor > > does the proportions of where CPU time is spent change. All I see if > > data cache misses go up substantially and IPC drop substantially. If > > I read my notes correctly, typical change from "fast" to "slow" in > > IPC was 0.82 to 0.39 and LLC-load-misses from 3% to 12%. The IPC > > degradation was all done by the time the background batch processing > > times were longer than a typical scheduler tick (10ms). > > > > Now, I've been testing on Xeon CPUs with 36-76MB of l2-l3 caches, so > > there's a fair amount of data that these can hold. I expect that > > with smaller caches, the inflection point will be at smaller batch > > sizes rather than more. Hence while I could have used larger batches > > for background processing (e.g. 64-128 inodes rather than 32), I > > chose smaller batch sizes by default so that CPUs with smaller > > caches are less likely to be adversely affected by the batch size > > being too large. OTOH, I started to measure noticable degradation by > > batch sizes of 256 inodes on my machines, which is why the hard > > queue limit got set to 256 inodes. > > > > Scaling the delay/batch size down towards single inode queuing also > > resulted in perf degradation. This was largely because of all the > > extra scheduling overhead that trying to switching between user task > > and kernel worker task for every inode entailed. Context switch rate > > went from a couple of thousand/sec to over 100,000/s for single > > inode batches, and performance went backwards in proportion with the > > amount of CPU then spent on context switches. It also lead to > > increases in buffer lock contention (hence context switches) as both > > user task and kworker try to access the same buffers... > > Makes sense. Never a guarantee of easy answers. ;-) > > If it would help, I could create expedited-grace-period counterparts > of get_state_synchronize_rcu(), start_poll_synchronize_rcu(), > poll_state_synchronize_rcu(), and cond_synchronize_rcu(). These would > provide sub-millisecond grace periods, in fact, sub-100-microsecond > grace periods on smaller systems. > If you have something with enough basic functionality, I'd be interested in converting this patch over to an expedited variant to run some tests/experiments. As it is, it seems the current approach is kind of playing wack-a-mole between disrupting allocation performance by populating the free inode pool with too many free but "pending rcu grace period" inodes and sustained remove performance by pushing the internal inactivation queues too deep and thus losing CPU cache, as Dave describes above. 
So if an expedited grace period is possible that fits within the time window on paper, it certainly seems worthwhile to test. Otherwise the only thing that comes to mind right now is to start playing around with the physical inode allocation algorithm to avoid such pending inodes. I think a scanning approach may ultimately run into the same problems with the right workload (i.e. such that all free inodes are pending), so I suspect what this really means is either figuring a nice enough way to efficiently locate expired inodes (maybe via our own internal rcu callback to explicitly tag now expired inodes as good allocation candidates), or to determine when to proceed with inode chunk allocations when scanning is unlikely to succeed, or something similar along those general lines.. > Of course, nothing comes for free. Although expedited grace periods > are way way cheaper than they used to be, they still IPI non-idle > non-nohz_full-userspace CPUs, which translates to roughly the CPU overhead > of a wakeup on each IPIed CPU. And of course disruption to aggressive > non-nohz_full real-time applications. Shorter latencies also translates > to fewer updates over which to amortize grace-period overhead. > > But it should get well under your single-digit milliseconds of delay. > If the expedited variant were sufficient for the fast path case, I suppose it might be interesting to see if we could throttle down to non-expedited variants either based on heuristic or feedback from allocation side stalls. Brian > Thanx, Paul > ^ permalink raw reply [flat|nested] 36+ messages in thread
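A rough sketch of the "tag expired inodes via our own RCU callback" idea Brian floats above; all names here are illustrative, and the dedicated rcu_head and flag bit do not exist in the current code:

#include <linux/bitops.h>
#include <linux/rcupdate.h>

#define EXAMPLE_IRECYCLE_SAFE	0	/* grace period has elapsed */

struct example_tagged_inode {
	unsigned long	state;
	struct rcu_head	destroy_rcu;
};

/* Runs once a full grace period has elapsed for this inode. */
static void example_inode_gp_expired(struct rcu_head *head)
{
	struct example_tagged_inode *ip =
		container_of(head, struct example_tagged_inode, destroy_rcu);

	/* the allocator may now hand this inode out without waiting */
	set_bit(EXAMPLE_IRECYCLE_SAFE, &ip->state);
}

/* Called when the inode is queued for inactivation. */
static void example_inode_inactivated(struct example_tagged_inode *ip)
{
	clear_bit(EXAMPLE_IRECYCLE_SAFE, &ip->state);
	call_rcu(&ip->destroy_rcu, example_inode_gp_expired);
}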
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 0:31 ` Dave Chinner 2022-01-25 14:40 ` Paul E. McKenney @ 2022-01-25 18:30 ` Brian Foster 2022-01-25 20:07 ` Brian Foster 2022-01-25 22:45 ` Dave Chinner 1 sibling, 2 replies; 36+ messages in thread From: Brian Foster @ 2022-01-25 18:30 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > > it into something more able to measure the potential side effect / cost > > > > of the grace period sync. The test is a single threaded, file alloc/free > > > > loop using a variable per iteration batch size. The test runs for ~60s > > > > and reports how many total files were allocated/freed in that period > > > > with the specified batch size. Note that this particular test ran > > > > without any background workload. Results are as follows: > > > > > > > > files baseline test > > > > > > > > 1 38480 38437 > > > > 4 126055 111080 > > > > 8 218299 134469 > > > > 16 306619 141968 > > > > 32 397909 152267 > > > > 64 418603 200875 > > > > 128 469077 289365 > > > > 256 684117 566016 > > > > 512 931328 878933 > > > > 1024 1126741 1118891 > > > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > > extremely slow for a single tight open-unlink-close loop. I'd be > > > expecting at least ~10,000 alloc/unlink iterations per second, not > > > 650/second. > > > > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > > <files>' loop. I know there was application overhead because if I > > tweaked the script to open an fd directly rather than use touch, the > > single file performance jumped up a bit, but it seemed to wash away as I > > increased the file count so I kept running it with larger sizes. This > > seems off so I'll port it over to C code and see how much the numbers > > change. > > Yeah, using touch/rm becomes fork/exec bound very quickly. You'll > find that using "echo > <file>" is much faster than "touch <file>" > because it runs a shell built-in operation without fork/exec > overhead to create the file. But you can't play tricks like that to > replace rm: > I had used 'exec' to open an fd (same idea) in the single file case and tested with that, saw that the increase was consistent and took that along with the increasing performance as batch sizes increased to mean that the application overhead wasn't a factor as the test scaled. That was clearly wrong, because if I port the whole thing to a C program the baseline numbers are way off. I think what also threw me off is that the single file test kernel case is actually fairly accurate between the two tests. Anyways, here's a series of (single run, no averaging, etc.) test runs with the updated test. Note that I reduced the runtime to 10s here since the test was running so much faster. 
Otherwise this is the same batched open/close -> unlink behavior:

                 baseline           test
batch: 1      files: 893579     files: 41841
batch: 2      files: 912502     files: 41922
batch: 4      files: 930424     files: 42084
batch: 8      files: 932072     files: 41536
batch: 16     files: 930624     files: 41616
batch: 32     files: 777088     files: 41120
batch: 64     files: 567936     files: 57216
batch: 128    files: 579840     files: 96256
batch: 256    files: 548608     files: 174080
batch: 512    files: 546816     files: 246784
batch: 1024   files: 509952     files: 328704
batch: 2048   files: 505856     files: 399360
batch: 4096   files: 479232     files: 438272

So this shows that the performance delta is actually massive from the start. For reference, a single threaded, empty file, non syncing, fs_mark workload stabilizes at around ~55k files/sec on this fs. Both kernels sort of converge to that rate as the batch size increases, only the baseline kernel starts much faster and normalizes while the test kernel starts much slower and improves (and still really doesn't hit the mark even at a 4k batch size). My takeaway from this is that we may need to find a way to mitigate this overhead somewhat better than what the current patch does. Otherwise, this is a significant dropoff from even a pure allocation workload in simple mixed workload scenarios... > $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m2.653s > user 0m0.910s > sys 0m2.051s > $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > real 0m1.260s > user 0m0.452s > sys 0m0.913s > $ time ./open-unlink 1000 /mnt/scratch/foo > > real 0m0.037s > user 0m0.001s > sys 0m0.030s > $ > > Note the difference in system time between the three operations - > almost all the difference in system CPU time is the overhead of > fork/exec to run the touch/rm binaries, not do the filesystem > operations.... > > > > > That's just a test of a quick hack, however. Since there is no real > > > > urgency to inactivate an unlinked inode (it has no potential users until > > > > it's freed), > > > > > > On the contrary, there is extreme urgency to inactivate inodes > > > quickly. > > > > > > > Ok, I think we're talking about slightly different things. What I mean > > above is that if a task removes a file and goes off doing unrelated > > $work, that inode will just sit on the percpu queue indefinitely. That's > > fine, as there's no functional need for us to process it immediately > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > of the inode. > > Yup, an occasional unlink sitting around for a while on an unlinked > list isn't going to cause a performance problem. Indeed, such > workloads are more likely to benefit from the reduced unlink() > syscall overhead and won't even notice the increase in background > CPU overhead for inactivation of those occasional inodes. > > > It sounds like what you're talking about is specifically > > the behavior/performance of sustained file removal (which is important > > obviously), where apparently there is a notable degradation if the > > queues become deep enough to push the inode batches out of CPU cache. So > > that makes sense... > > Yup, sustained bulk throughput is where cache residency really > matters. And for unlink, sustained unlink workloads are quite > common; they often are something people wait for on the command line > or make up a performance critical component of a highly concurrent > workload so it's pretty important to get this part right.
> > > > Darrick made the original assumption that we could delay > > > inactivation indefinitely and so he allowed really deep queues of up > > > to 64k deferred inactivations. But with queues this deep, we could > > > never get that background inactivation code to perform anywhere near > > > the original synchronous background inactivation code. e.g. I > > > measured 60-70% performance degradataions on my scalability tests, > > > and nothing stood out in the profiles until I started looking at > > > CPU data cache misses. > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > can get a better sense of it in practice and perhaps observe the impact > > of changes in this path? > > The same conconrrent fsmark create/traverse/unlink workloads I've > been running for the past decade+ demonstrates it pretty simply. I > also saw regressions with dbench (both op latency and throughput) as > the clinet count (concurrency) increased, and with compilebench. I > didn't look much further because all the common benchmarks I ran > showed perf degradations with arbitrary delays that went away with > the current code we have. ISTR that parts of aim7/reaim scalability > workloads that the intel zero-day infrastructure runs are quite > sensitive to background inactivation delays as well because that's a > CPU bound workload and hence any reduction in cache residency > results in a reduction of the number of concurrent jobs that can be > run. > Ok, so if I (single threaded) create (via fs_mark), sync and remove 5m empty files, the remove takes about a minute. If I just bump out the current queue and block thresholds by 10x and repeat, that time increases to about ~1m24s. If I hack up a kernel to disable queueing entirely (i.e. fully synchronous inactivation), then I'm back to about a minute again. So I'm not producing any performance benefit with queueing/batching in this single threaded scenario, but I suspect the 10x threshold delta is at least measuring the negative effect of poor caching..? (Any decent way to confirm that..?). And of course if I take the baseline kernel and stick a cond_synchronize_rcu() in xfs_inactive_ifree() it brings the batch test numbers right back but slows the removal test way down. What I find interesting however is that if I hack up something more mild like invoke cond_synchronize_rcu() on the oldest inode in the current inactivation batch, bump out the blocking threshold as above (but leave the queueing threshold at 32), and leave the iget side cond_sync_rcu() to catch whatever falls through, my 5m file remove test now completes ~5-10s faster than baseline and I see the following results from the batched alloc/free test: batch: 1 files: 731923 batch: 2 files: 693020 batch: 4 files: 750948 batch: 8 files: 743296 batch: 16 files: 738720 batch: 32 files: 746240 batch: 64 files: 598464 batch: 128 files: 672896 batch: 256 files: 633856 batch: 512 files: 605184 batch: 1024 files: 569344 batch: 2048 files: 555008 batch: 4096 files: 524288 Hm? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
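A sketch of the experiment described above, reusing the illustrative example_inodegc/example_inode types from the earlier sketches in this thread; the actual hack was not posted, so details are assumptions:

/*
 * One conditional RCU wait per inactivation batch, keyed off the
 * oldest inode in the batch.  Anything newer that slips through is
 * still caught by the existing check on the iget/recycle side.
 */
static void example_inodegc_worker(struct work_struct *work)
{
	struct example_inodegc *gc =
		container_of(work, struct example_inodegc, work);
	struct llist_node *list = llist_del_all(&gc->list);
	struct example_inode *ip, *next;
	unsigned long oldest_gp = 0;

	if (!list)
		return;

	/*
	 * llist_del_all() hands back entries newest-first, so the last
	 * entry is the oldest queued inode; remember its GP cookie.
	 */
	llist_for_each_entry(ip, list, gclist)
		oldest_gp = ip->destroy_gp;

	/* Only blocks if the oldest grace period has not yet elapsed. */
	cond_synchronize_rcu(oldest_gp);

	llist_for_each_entry_safe(ip, next, list, gclist)
		example_inactivate(ip);	/* placeholder for the real ifree work */
}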
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 18:30 ` Brian Foster @ 2022-01-25 20:07 ` Brian Foster 2022-01-25 22:45 ` Dave Chinner 1 sibling, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-25 20:07 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 01:30:36PM -0500, Brian Foster wrote: > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > > > FYI, I modified my repeated alloc/free test to do some batching and form > > > > > it into something more able to measure the potential side effect / cost > > > > > of the grace period sync. The test is a single threaded, file alloc/free > > > > > loop using a variable per iteration batch size. The test runs for ~60s > > > > > and reports how many total files were allocated/freed in that period > > > > > with the specified batch size. Note that this particular test ran > > > > > without any background workload. Results are as follows: > > > > > > > > > > files baseline test > > > > > > > > > > 1 38480 38437 > > > > > 4 126055 111080 > > > > > 8 218299 134469 > > > > > 16 306619 141968 > > > > > 32 397909 152267 > > > > > 64 418603 200875 > > > > > 128 469077 289365 > > > > > 256 684117 566016 > > > > > 512 931328 878933 > > > > > 1024 1126741 1118891 > > > > > > > > Can you post the test code, because 38,000 alloc/unlinks in 60s is > > > > extremely slow for a single tight open-unlink-close loop. I'd be > > > > expecting at least ~10,000 alloc/unlink iterations per second, not > > > > 650/second. > > > > > > > > > > Hm, Ok. My test was just a bash script doing a 'touch <files>; rm > > > <files>' loop. I know there was application overhead because if I > > > tweaked the script to open an fd directly rather than use touch, the > > > single file performance jumped up a bit, but it seemed to wash away as I > > > increased the file count so I kept running it with larger sizes. This > > > seems off so I'll port it over to C code and see how much the numbers > > > change. > > > > Yeah, using touch/rm becomes fork/exec bound very quickly. You'll > > find that using "echo > <file>" is much faster than "touch <file>" > > because it runs a shell built-in operation without fork/exec > > overhead to create the file. But you can't play tricks like that to > > replace rm: > > > > I had used 'exec' to open an fd (same idea) in the single file case and > tested with that, saw that the increase was consistent and took that > along with the increasing performance as batch sizes increased to mean > that the application overhead wasn't a factor as the test scaled. That > was clearly wrong, because if I port the whole thing to a C program the > baseline numbers are way off. I think what also threw me off is that the > single file test kernel case is actually fairly accurate between the two > tests. Anyways, here's a series of (single run, no averaging, etc.) test > runs with the updated test. Note that I reduced the runtime to 10s here > since the test was running so much faster. 
Otherwise this is the same > batched open/close -> unlink behavior: > > baseline test > batch: 1 files: 893579 files: 41841 > batch: 2 files: 912502 files: 41922 > batch: 4 files: 930424 files: 42084 > batch: 8 files: 932072 files: 41536 > batch: 16 files: 930624 files: 41616 > batch: 32 files: 777088 files: 41120 > batch: 64 files: 567936 files: 57216 > batch: 128 files: 579840 files: 96256 > batch: 256 files: 548608 files: 174080 > batch: 512 files: 546816 files: 246784 > batch: 1024 files: 509952 files: 328704 > batch: 2048 files: 505856 files: 399360 > batch: 4096 files: 479232 files: 438272 > > So this shows that the performance delta is actually massive from the > start. For reference, a single threaded, empty file, non syncing, > fs_mark workload stabilizes at around ~55k files/sec on this fs. Both > kernels sort of converge to that rate as the batch size increases, only > the baseline kernel starts much faster and normalizes while the test > kernel starts much slower and improves (and still really doesn't hit the > mark even at a 4k batch size). > > My takeaway from this is that we may need to find a way to mitigate this > overhead somewhat better than what the current patch does. Otherwise, > this is a significant dropoff from even a pure allocation workload in > simple mixed workload scenarios... > > > $ time for ((i=0;i<1000;i++)); do touch /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > > > real 0m2.653s > > user 0m0.910s > > sys 0m2.051s > > $ time for ((i=0;i<1000;i++)); do echo > /mnt/scratch/foo; rm /mnt/scratch/foo ; done > > > > real 0m1.260s > > user 0m0.452s > > sys 0m0.913s > > $ time ./open-unlink 1000 /mnt/scratch/foo > > > > real 0m0.037s > > user 0m0.001s > > sys 0m0.030s > > $ > > > > Note the difference in system time between the three operations - > > almost all the difference in system CPU time is the overhead of > > fork/exec to run the touch/rm binaries, not do the filesystem > > operations.... > > > > > > > That's just a test of a quick hack, however. Since there is no real > > > > > urgency to inactivate an unlinked inode (it has no potential users until > > > > > it's freed), > > > > > > > > On the contrary, there is extreme urgency to inactivate inodes > > > > quickly. > > > > > > > > > > Ok, I think we're talking about slightly different things. What I mean > > > above is that if a task removes a file and goes off doing unrelated > > > $work, that inode will just sit on the percpu queue indefinitely. That's > > > fine, as there's no functional need for us to process it immediately > > > unless we're around -ENOSPC thresholds or some such that demand reclaim > > > of the inode. > > > > Yup, an occasional unlink sitting around for a while on an unlinked > > list isn't going to cause a performance problem. Indeed, such > > workloads are more likely to benefit from the reduced unlink() > > syscall overhead and won't even notice the increase in background > > CPU overhead for inactivation of those occasional inodes. > > > > > It sounds like what you're talking about is specifically > > > the behavior/performance of sustained file removal (which is important > > > obviously), where apparently there is a notable degradation if the > > > queues become deep enough to push the inode batches out of CPU cache. So > > > that makes sense... > > > > Yup, sustained bulk throughput is where cache residency really > > matters. 
And for unlink, sustained unlink workloads are quite > > common; they often are something people wait for on the command line > > or make up a performance critical component of a highly concurrent > > workload so it's pretty important to get this part right. > > > > > > Darrick made the original assumption that we could delay > > > > inactivation indefinitely and so he allowed really deep queues of up > > > > to 64k deferred inactivations. But with queues this deep, we could > > > > never get that background inactivation code to perform anywhere near > > > > the original synchronous background inactivation code. e.g. I > > > > measured 60-70% performance degradataions on my scalability tests, > > > > and nothing stood out in the profiles until I started looking at > > > > CPU data cache misses. > > > > > > > > > > ... but could you elaborate on the scalability tests involved here so I > > > can get a better sense of it in practice and perhaps observe the impact > > > of changes in this path? > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > been running for the past decade+ demonstrates it pretty simply. I > > also saw regressions with dbench (both op latency and throughput) as > > the clinet count (concurrency) increased, and with compilebench. I > > didn't look much further because all the common benchmarks I ran > > showed perf degradations with arbitrary delays that went away with > > the current code we have. ISTR that parts of aim7/reaim scalability > > workloads that the intel zero-day infrastructure runs are quite > > sensitive to background inactivation delays as well because that's a > > CPU bound workload and hence any reduction in cache residency > > results in a reduction of the number of concurrent jobs that can be > > run. > > > > Ok, so if I (single threaded) create (via fs_mark), sync and remove 5m > empty files, the remove takes about a minute. If I just bump out the > current queue and block thresholds by 10x and repeat, that time > increases to about ~1m24s. If I hack up a kernel to disable queueing > entirely (i.e. fully synchronous inactivation), then I'm back to about a > minute again. So I'm not producing any performance benefit with > queueing/batching in this single threaded scenario, but I suspect the > 10x threshold delta is at least measuring the negative effect of poor > caching..? (Any decent way to confirm that..?). > > And of course if I take the baseline kernel and stick a > cond_synchronize_rcu() in xfs_inactive_ifree() it brings the batch test > numbers right back but slows the removal test way down. What I find > interesting however is that if I hack up something more mild like invoke > cond_synchronize_rcu() on the oldest inode in the current inactivation > batch, bump out the blocking threshold as above (but leave the queueing > threshold at 32), and leave the iget side cond_sync_rcu() to catch > whatever falls through, my 5m file remove test now completes ~5-10s > faster than baseline and I see the following results from the batched > alloc/free test: > > batch: 1 files: 731923 > batch: 2 files: 693020 > batch: 4 files: 750948 > batch: 8 files: 743296 > batch: 16 files: 738720 > batch: 32 files: 746240 > batch: 64 files: 598464 > batch: 128 files: 672896 > batch: 256 files: 633856 > batch: 512 files: 605184 > batch: 1024 files: 569344 > batch: 2048 files: 555008 > batch: 4096 files: 524288 > This experiment had a bug that was dropping some inactivations on the floor. With that fixed, the numbers aren't quite as good. 
The batch test numbers still improve significantly from the posted patch (i.e. up in the range of 38-45k files/sec), but still lag the normal allocation rate, and the large rm test goes up to 1m40s (instead of 1m on baseline). Brian > Hm? > > Brian > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 18:30 ` Brian Foster 2022-01-25 20:07 ` Brian Foster @ 2022-01-25 22:45 ` Dave Chinner 2022-01-27 4:19 ` Al Viro 1 sibling, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-25 22:45 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Al Viro, Ian Kent, rcu On Tue, Jan 25, 2022 at 01:30:36PM -0500, Brian Foster wrote: > On Tue, Jan 25, 2022 at 11:31:20AM +1100, Dave Chinner wrote: > > On Mon, Jan 24, 2022 at 06:29:18PM -0500, Brian Foster wrote: > > > On Tue, Jan 25, 2022 at 09:08:53AM +1100, Dave Chinner wrote: > > > ... but could you elaborate on the scalability tests involved here so I > > > can get a better sense of it in practice and perhaps observe the impact > > > of changes in this path? > > > > The same conconrrent fsmark create/traverse/unlink workloads I've > > been running for the past decade+ demonstrates it pretty simply. I > > also saw regressions with dbench (both op latency and throughput) as > > the clinet count (concurrency) increased, and with compilebench. I > > didn't look much further because all the common benchmarks I ran > > showed perf degradations with arbitrary delays that went away with > > the current code we have. ISTR that parts of aim7/reaim scalability > > workloads that the intel zero-day infrastructure runs are quite > > sensitive to background inactivation delays as well because that's a > > CPU bound workload and hence any reduction in cache residency > > results in a reduction of the number of concurrent jobs that can be > > run. > > > > Ok, so if I (single threaded) create (via fs_mark), sync and remove 5m > empty files, the remove takes about a minute. If I just bump out the > current queue and block thresholds by 10x and repeat, that time > increases to about ~1m24s. If I hack up a kernel to disable queueing > entirely (i.e. fully synchronous inactivation), then I'm back to about a > minute again. So I'm not producing any performance benefit with > queueing/batching in this single threaded scenario, but I suspect the > 10x threshold delta is at least measuring the negative effect of poor > caching..? (Any decent way to confirm that..?). Right, background inactivation does not improve performance - it's necessary to get the transactions out of the evict() path. All we wanted was to ensure that there were no performance degradations as a result of background inactivation, not that it was faster. If you want to confirm that there is an increase in cold cache access when the batch size is increased, cpu profiles with 'perf top'/'perf record/report' and CPU cache performance metric reporting via 'perf stat -dddd' are your friend. See elsewhere in the thread where I mention those things to Paul. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-25 22:45 ` Dave Chinner @ 2022-01-27 4:19 ` Al Viro 2022-01-27 5:26 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Al Viro @ 2022-01-27 4:19 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Ian Kent, rcu On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > Right, background inactivation does not improve performance - it's > necessary to get the transactions out of the evict() path. All we > wanted was to ensure that there were no performance degradations as > a result of background inactivation, not that it was faster. > > If you want to confirm that there is an increase in cold cache > access when the batch size is increased, cpu profiles with 'perf > top'/'perf record/report' and CPU cache performance metric reporting > via 'perf stat -dddd' are your friend. See elsewhere in the thread > where I mention those things to Paul. Dave, do you see a plausible way to eventually drop Ian's bandaid? I'm not asking for that to happen this cycle and for backports Ian's patch is obviously fine. What I really want to avoid is the situation when we are stuck with keeping that bandaid in fs/namei.c, since all ways to avoid seeing reused inodes would hurt XFS too badly. And the benchmarks in this thread do look like that. Are there any realistic prospects of having xfs_iget() deal with reuse case by allocating new in-core inode and flipping whatever references you've got in XFS journalling data structures to the new copy? If I understood what you said on IRC correctly, that is... Again, I'm not asking if it can be done this cycle; having a realistic path to doing that eventually would be fine by me. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 4:19 ` Al Viro @ 2022-01-27 5:26 ` Dave Chinner 2022-01-27 19:01 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-27 5:26 UTC (permalink / raw) To: Al Viro; +Cc: Brian Foster, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > Right, background inactivation does not improve performance - it's > > necessary to get the transactions out of the evict() path. All we > > wanted was to ensure that there were no performance degradations as > > a result of background inactivation, not that it was faster. > > > > If you want to confirm that there is an increase in cold cache > > access when the batch size is increased, cpu profiles with 'perf > > top'/'perf record/report' and CPU cache performance metric reporting > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > where I mention those things to Paul. > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > I'm not asking for that to happen this cycle and for backports Ian's > patch is obviously fine. Yes, but not in the near term. > What I really want to avoid is the situation when we are stuck with > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > reused inodes would hurt XFS too badly. And the benchmarks in this > thread do look like that. The simplest way I think is to have the XFS inode allocation track "busy inodes" in the same way we track "busy extents". A busy extent is an extent that has been freed by the user, but is not yet marked free in the journal/on disk. If we try to reallocate that busy extent, we either select a different free extent to allocate, or if we can't find any we force the journal to disk, wait for it to complete (hence unbusying the extents) and retry the allocation again. We can do something similar for inode allocation - it's actually a lockless tag lookup on the radix tree entry for the candidate inode number. If we find the reclaimable radix tree tag set, then we select a different inode. If we can't allocate a new inode, then we kick synchronize_rcu() and retry the allocation, allowing inodes to be recycled this time. > Are there any realistic prospects of having xfs_iget() deal with > reuse case by allocating new in-core inode and flipping whatever > references you've got in XFS journalling data structures to the > new copy? If I understood what you said on IRC correctly, that is... That's ... much harder. One of the problems is that once an inode has a log item attached to it, it assumes that it can be accessed without specific locking, etc.; see xfs_inode_clean(), for example. So there's some life-cycle stuff that needs to be taken care of in XFS first, and the inode <-> log item relationship is tangled. I've been working towards removing that tangle - but that stuff is quite a distance down my logging rework patch queue. That queue has been stuck now for a year trying to get the first handful of rework and scalability modifications reviewed and merged, so I'm not holding my breath as to how long a more substantial rework of internal logging code will take to review and merge. Really, though, we need the inactivation stuff to be done as part of the VFS inode lifecycle.
I have some ideas on what to do here, but I suspect we'll need some changes to iput_final()/evict() to allow us to process final unlinks in the background and then call evict() ourselves when the unlink completes. That way ->destroy_inode() can just call xfs_reclaim_inode() to free it directly, which also helps us get rid of background inode freeing and hence inode recycling from XFS altogether. I think we _might_ be able to do this without needing to change any of the logging code in XFS, but I haven't looked any further than this into it as yet. > Again, I'm not asking if it can be done this cycle; having a > realistic path to doing that eventually would be fine by me. We're talking a year at least, probably two, before we get there... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
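[A minimal sketch of the allocation-time "busy inode" probe Dave describes in the mail above. This is not actual XFS code: xfs_inode_is_busy() is a hypothetical helper, though the per-AG radix tree, the reclaim tag and the macros it touches are the existing interfaces being referred to.]

/*
 * Sketch: is this candidate inode number still "busy", i.e. does it
 * still have an in-core inode sitting in the reclaimable state?  A
 * lockless gang tag lookup on the per-AG inode cache radix tree
 * answers that without touching the inode itself in the common case.
 */
static bool
xfs_inode_is_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	struct xfs_inode	*ip;
	bool			busy = false;

	rcu_read_lock();
	if (radix_tree_gang_lookup_tag(&pag->pag_ici_root, (void **)&ip,
			agino, 1, XFS_ICI_RECLAIM_TAG))
		busy = (XFS_INO_TO_AGINO(pag->pag_mount, ip->i_ino) == agino);
	rcu_read_unlock();
	return busy;
}

[A caller in the allocation path would skip candidates that probe busy; only if nothing else can be found would it fall back to synchronize_rcu() and recycle, as described above.]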
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 5:26 ` Dave Chinner @ 2022-01-27 19:01 ` Brian Foster 2022-01-27 22:18 ` Dave Chinner 2022-01-28 21:39 ` Paul E. McKenney 0 siblings, 2 replies; 36+ messages in thread From: Brian Foster @ 2022-01-27 19:01 UTC (permalink / raw) To: Dave Chinner; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > Right, background inactivation does not improve performance - it's > > > necessary to get the transactions out of the evict() path. All we > > > wanted was to ensure that there were no performance degradations as > > > a result of background inactivation, not that it was faster. > > > > > > If you want to confirm that there is an increase in cold cache > > > access when the batch size is increased, cpu profiles with 'perf > > > top'/'perf record/report' and CPU cache performance metric reporting > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > where I mention those things to Paul. > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > I'm not asking for that to happen this cycle and for backports Ian's > > patch is obviously fine. > > Yes, but not in the near term. > > > What I really want to avoid is the situation when we are stuck with > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > reused inodes would hurt XFS too badly. And the benchmarks in this > > thread do look like that. > > The simplest way I think is to have the XFS inode allocation track > "busy inodes" in the same way we track "busy extents". A busy extent > is an extent that has been freed by the user, but is not yet marked > free in the journal/on disk. If we try to reallocate that busy > extent, we either select a different free extent to allocate, or if > we can't find any we force the journal to disk, wait for it to > complete (hence unbusying the extents) and retry the allocation > again. > > We can do something similar for inode allocation - it's actually a > lockless tag lookup on the radix tree entry for the candidate inode > number. If we find the reclaimable radix tree tag set, the we select > a different inode. If we can't allocate a new inode, then we kick > synchronize_rcu() and retry the allocation, allowing inodes to be > recycled this time. > I'm starting to poke around this area since it's become clear that the currently proposed scheme just involves too much latency (unless Paul chimes in with his expedited grace period variant, at which point I will revisit) in the fast allocation/recycle path. ISTM so far that a simple "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will have pretty much the same pattern of behavior as this patch: one synchronize_rcu() per batch. IOW, background reclaim only kicks in after 30s by default, so the pool of free inodes pretty much always consists of 100% reclaimable inodes. On top of that, at smaller batch sizes, the pool tends to have a uniform (!elapsed) grace period cookie, so a stall is required to be able to allocate any of them. As the batch size increases, I do see the population of free inodes start to contain a mix of expired and non-expired grace period cookies. 
It's fairly easy to hack up an internal icwalk scan to locate already expired inodes, but the problem is that the recycle rate is so much faster than the grace period latency that it doesn't really matter. We'll still have to stall by the time we get to the non-expired inodes, and so we're back to one stall per batch and the same general performance characteristic of this patch. So given all of this, I'm wondering about something like the following high level inode allocation algorithm: 1. If the AG has any reclaimable inodes, scan for one with an expired grace period. If found, target that inode for physical allocation. 2. If the AG free inode count == the AG reclaimable count and we know all reclaimable inodes are most likely pending a grace period (because the previous step failed), allocate a new inode chunk (and target it in this allocation). 3. If the AG free inode count > the reclaimable count, scan the finobt for an inode that is not present in the radix tree (i.e. Dave's logic above). Each of those steps could involve some heuristics to maintain predictable behavior and avoid large scans and such, but the general idea is that the repeated alloc/free inode workload naturally populates the AG with enough physical inodes to always be able to satisfy an allocation without waiting on a grace period. IOW, this is effectively similar behavior to if physical inode freeing was delayed to an rcu callback, with the tradeoff of complicating the allocation path rather than stalling in the inactivation pipeline. Thoughts? This of course is more involved than this patch (or similarly simple variants of RCU delaying preexisting bits of code) and requires some more investigation, but certainly shouldn't be a multi-year thing. The question is probably more of whether it's enough complexity to justify in the meantime... > > Are there any realistic prospects of having xfs_iget() deal with > > reuse case by allocating new in-core inode and flipping whatever > > references you've got in XFS journalling data structures to the > > new copy? If I understood what you said on IRC correctly, that is... > > That's ... much harder. > > One of the problems is that once an inode has a log item attached to > it, it assumes that it can be accessed without specific locking, > etc. see xfs_inode_clean(), for example. So there's some life-cycle > stuff that needs to be taken care of in XFS first, and the inode <-> > log item relationship is tangled. > > I've been working towards removing that tangle - but taht stuff is > quite a distance down my logging rework patch queue. THat queue has > been stuck now for a year trying to get the first handful of rework > and scalability modifications reviewed and merged, so I'm not > holding my breathe as to how long a more substantial rework of > internal logging code will take to review and merge. > > Really, though, we need the inactivation stuff to be done as part of > the VFS inode lifecycle. I have some ideas on what to do here, but I > suspect we'll need some changes to iput_final()/evict() to allow us > to process final unlinks in the bakground and then call evict() > ourselves when the unlink completes. That way ->destroy_inode() can > just call xfs_reclaim_inode() to free it directly, which also helps > us get rid of background inode freeing and hence inode recycling > from XFS altogether. I think we _might_ be able to do this without > needing to change any of the logging code in XFS, but I haven't > looked any further than this into it as yet. > ... 
of whatever this ends up looking like. Can you elaborate on what you mean by processing unlinks in the background? I can see the value of being able to eliminate the recycle code in XFS, but wouldn't we still have to limit and throttle against background work to maintain sustained removal performance? IOW, what's the general teardown behavior you're getting at here, aside from what parts push into the vfs or not? Brian > > Again, I'm not asking if it can be done this cycle; having a > > realistic path to doing that eventually would be fine by me. > > We're talking a year at least, probably two, before we get there... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
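[For reference, the three-step policy Brian outlines in the mail above might shake out into something like the sketch below. Every helper named here is hypothetical; only pagi_freecount and pag_ici_reclaimable are existing per-AG counters, and the step numbers match the list in the mail.]

/*
 * Loose sketch of the proposed allocation policy -- not a real
 * implementation.  Step 1 would internally test a destroy-time RCU
 * cookie with poll_state_synchronize_rcu(); steps 2 and 3 fall back
 * to growing the inode pool or searching the finobt.
 */
static int
xfs_dialloc_select(
	struct xfs_perag	*pag,
	xfs_ino_t		*inop)
{
	/* 1. Prefer a reclaimable inode whose grace period has expired. */
	if (pag->pag_ici_reclaimable &&
	    xfs_pick_expired_reclaimable(pag, inop) == 0)
		return 0;

	/* 2. Free pool is all RCU-busy reclaimable inodes: grow it. */
	if (pag->pagi_freecount == pag->pag_ici_reclaimable)
		return xfs_ialloc_new_chunk(pag, inop);

	/* 3. Otherwise find a finobt inode with no in-core counterpart. */
	return xfs_pick_uncached_finobt(pag, inop);
}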
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 19:01 ` Brian Foster @ 2022-01-27 22:18 ` Dave Chinner 2022-01-28 14:11 ` Brian Foster 2022-01-28 21:39 ` Paul E. McKenney 1 sibling, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-27 22:18 UTC (permalink / raw) To: Brian Foster; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > Right, background inactivation does not improve performance - it's > > > > necessary to get the transactions out of the evict() path. All we > > > > wanted was to ensure that there were no performance degradations as > > > > a result of background inactivation, not that it was faster. > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > access when the batch size is increased, cpu profiles with 'perf > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > where I mention those things to Paul. > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > I'm not asking for that to happen this cycle and for backports Ian's > > > patch is obviously fine. > > > > Yes, but not in the near term. > > > > > What I really want to avoid is the situation when we are stuck with > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > thread do look like that. > > > > The simplest way I think is to have the XFS inode allocation track > > "busy inodes" in the same way we track "busy extents". A busy extent > > is an extent that has been freed by the user, but is not yet marked > > free in the journal/on disk. If we try to reallocate that busy > > extent, we either select a different free extent to allocate, or if > > we can't find any we force the journal to disk, wait for it to > > complete (hence unbusying the extents) and retry the allocation > > again. > > > > We can do something similar for inode allocation - it's actually a > > lockless tag lookup on the radix tree entry for the candidate inode > > number. If we find the reclaimable radix tree tag set, the we select > > a different inode. If we can't allocate a new inode, then we kick > > synchronize_rcu() and retry the allocation, allowing inodes to be > > recycled this time. > > > > I'm starting to poke around this area since it's become clear that the > currently proposed scheme just involves too much latency (unless Paul > chimes in with his expedited grace period variant, at which point I will > revisit) in the fast allocation/recycle path. ISTM so far that a simple > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > have pretty much the same pattern of behavior as this patch: one > synchronize_rcu() per batch. That's not really what I proposed - what I suggested was that if we can't allocate a usable inode from the finobt, and we can't allocate a new inode cluster from the AG (i.e. populate the finobt with more inodes), only then call synchronise_rcu() and recycle an inode. We don't need to scan the inode cache or the finobt to determine if there are reclaimable inodes immediately available - do a gang tag lookup on the radix tree for newino. 
If it comes back with an inode number that is not equal to the inode number we looked up, then we can allocate newino immediately. If it comes back with newino, then check the first inode in the finobt. If that comes back with an inode that is not the first inode in the finobt, we can immediately allocate the first inode in the finobt. If not, check the last inode. If that fails, assume all inodes in the finobt need recycling and allocate a new cluster, pointing newino at it. Then we get another 64 inodes starting at the newino cursor we can allocate from while we wait for the current RCU grace period to expire for inodes already in the reclaimable state. An algorithm like this will allow the free inode pool to resize automatically based on the unlink frequency of the workload and RCU grace period latency... > IOW, background reclaim only kicks in after 30s by default, 5 seconds, by default, not 30s. > so the pool > of free inodes pretty much always consists of 100% reclaimable inodes. > On top of that, at smaller batch sizes, the pool tends to have a uniform > (!elapsed) grace period cookie, so a stall is required to be able to > allocate any of them. As the batch size increases, I do see the > population of free inodes start to contain a mix of expired and > non-expired grace period cookies. It's fairly easy to hack up an > internal icwalk scan to locate already expired inodes, We don't want or need to do exhaustive, exactly correct scans here. We want *fast and loose* because this is a critical performance fast path. We don't care if we skip the occasional recyclable inode, what we need to do is minimise the CPU overhead and search latency for the case where recycling will never occur. > but the problem > is that the recycle rate is so much faster than the grace period latency > that it doesn't really matter. We'll still have to stall by the time we > get to the non-expired inodes, and so we're back to one stall per batch > and the same general performance characteristic of this patch. Yes, but that's why I suggested that we allocate a new inode cluster rather than calling synchronise_rcu() when we don't have a recyclable inode candidate. > So given all of this, I'm wondering about something like the following > high level inode allocation algorithm: > > 1. If the AG has any reclaimable inodes, scan for one with an expired > grace period. If found, target that inode for physical allocation. How do you efficiently discriminate between "reclaimable w/ nlink > 0" and "reclaimable w/ nlink == 0" so we don't get hung up searching millions of reclaimable inodes for the one that has been unlinked and has an expired grace period? Also, this will need to be done on every inode allocation when we have inodes in reclaimable state (which is almost always on a busy system). Workloads with sequential allocation (as per untar, rsync, git checkout, cp -r, etc) will do this scan unnecessarily as they will almost never hit this inode recycle path as there aren't a lot of unlinks occurring while they are working. > 2. If the AG free inode count == the AG reclaimable count and we know > all reclaimable inodes are most likely pending a grace period (because > the previous step failed), allocate a new inode chunk (and target it in > this allocation). That's good for the allocation that allocates the chunk, but... > 3. If the AG free inode count > the reclaimable count, scan the finobt > for an inode that is not present in the radix tree (i.e. Dave's logic > above). ...
now we are repeating the radix tree walk that we've already done in #1 to find the newly allocated inodes we allocated in #2. We don't need to walk the inodes in the inode radix tree to look at individual inode state - we can use the reclaimable radix tree tag to shortcut those walks and minimise the number of actual lookups we need to do. By definition, an inode in the finobt and XFS_IRECLAIMABLE state is an inode that needs recycling, so we can just use the finobt and the inode radix tree tags to avoid inodes that need recycling altogether. i.e. If we fail a tag lookup, we have no reclaimable inodes in the range we asked the lookup to search so we can immediately allocate - we don't actually need to look at the inode in the fast path no-recycling case at all. Keep in mind that the fast path we really care about is not the unlink/allocate looping case, it's the allocation case where no recycling will ever occur and so that's the one we really have to try hard to minimise the overhead for. The moment we get into reclaimable inodes within the finobt range we're hitting the "lots of temp files" use case, so we can detect that and keep the overhead of that algorithm as separate as we possibly can. Hence we need the initial "can we allocate this inode number" decision to be as fast and as low overhead as possible so we can determine which algorithm we need to run. A lockless radix tree gang tag lookup will give us that and if the lookup finds a reclaimable inode only then do we move into the "recycle RCU avoidance" algorithm path.... > > > Are there any realistic prospects of having xfs_iget() deal with > > > reuse case by allocating new in-core inode and flipping whatever > > > references you've got in XFS journalling data structures to the > > > new copy? If I understood what you said on IRC correctly, that is... > > > > That's ... much harder. > > > > One of the problems is that once an inode has a log item attached to > > it, it assumes that it can be accessed without specific locking, > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > stuff that needs to be taken care of in XFS first, and the inode <-> > > log item relationship is tangled. > > > > I've been working towards removing that tangle - but taht stuff is > > quite a distance down my logging rework patch queue. THat queue has > > been stuck now for a year trying to get the first handful of rework > > and scalability modifications reviewed and merged, so I'm not > > holding my breathe as to how long a more substantial rework of > > internal logging code will take to review and merge. > > > > Really, though, we need the inactivation stuff to be done as part of > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > suspect we'll need some changes to iput_final()/evict() to allow us > > to process final unlinks in the bakground and then call evict() > > ourselves when the unlink completes. That way ->destroy_inode() can > > just call xfs_reclaim_inode() to free it directly, which also helps > > us get rid of background inode freeing and hence inode recycling > > from XFS altogether. I think we _might_ be able to do this without > > needing to change any of the logging code in XFS, but I haven't > > looked any further than this into it as yet. > > > > ... of whatever this ends up looking like. > > Can you elaborate on what you mean by processing unlinks in the > background?
I can see the value of being able to eliminate the recycle > code in XFS, but wouldn't we still have to limit and throttle against > background work to maintain sustained removal performance? Yes, but that's irrelevant because all we would be doing is slightly changing where that throttling occurs (i.e. in iput_final->drop_inode instead of iput_final->evict->destroy_inode). However, moving the throttling up the stack is a good thing because it gets rid of the current problem with the inactivation throttling blocking the shrinker via shrinker->super_cache_scan-> prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on full inactivation queue because all the inodes need EOF block trimming to be done. > IOW, what's > the general teardown behavior you're getting at here, aside from what > parts push into the vfs or not? ->drop_inode() triggers background inactivation for both blockgc and inode unlink. For unlink, we set I_WILL_FREE so the VFS will not attempt to re-use it, add the inode # to the internal AG "busy inode" tree and return drop = true and the VFS then stops processing that inode. For blockgc, we queue the work and return drop = false and the VFS puts it onto the LRU. Now we have asynchronous inactivation while the inode is still present and visible at the VFS level. For background blockgc - that now happens while the inode is idle on the LRU before it gets reclaimed by the shrinker. i.e. we trigger block gc when the last reference to the inode goes away instead of when it gets removed from memory by the shrinker. For unlink, that now runs in the background until the inode unlink has been journalled and the cleared inode written to the backing inode cluster buffer. The inode is then no longer visible to the journal and it can't be reallocated because it is still busy. We then change the inode state from I_WILL_FREE to I_FREEING and call evict(). The inode then gets torn down, and in ->destroy_inode we remove the inode from the radix tree, clear the per-ag busy record and free the inode via RCU as expected by the VFS. Another possible mechanism instead of exporting evict() is that background inactivation takes a new reference to the inode from ->drop_inode so that even if we put it on the LRU the inode cache shrinker will skip it while we are doing background inactivation. That would mean that when background inactivation is done, we call iput_final() again. The inode will either then be left on the LRU or go through the normal evict() path. This also gets the memory demand and overhead of EOF block trimming out of the memory reclaim path, and it also gets rid of the need for the special superblock shrinker hooks that XFS has for reclaiming its internal inode cache. Overall, lifting this stuff up to the VFS is full of "less complexity in XFS" wins if we can make it work... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
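[Strung together, the "newino, then first, then last finobt record" probing Dave describes in the mail above might look roughly like the sketch below. None of this is actual XFS code: xfs_finobt_first_ino()/xfs_finobt_last_ino() are hypothetical helpers returning the lowest and highest finobt records, and xfs_inode_is_busy() is the tag-lookup probe sketched earlier in the thread.]

/*
 * Sketch: decide with a handful of O(1) probes whether the finobt is
 * likely full of RCU-busy inodes.  Returns true if the caller should
 * allocate a new inode cluster instead of picking from the finobt.
 */
static bool
xfs_finobt_all_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		newino,
	xfs_agino_t		*allocino)
{
	xfs_agino_t		first = xfs_finobt_first_ino(pag);
	xfs_agino_t		last = xfs_finobt_last_ino(pag);

	if (!xfs_inode_is_busy(pag, newino)) {
		*allocino = newino;	/* newino can be allocated right away */
		return false;
	}
	if (!xfs_inode_is_busy(pag, first)) {
		*allocino = first;
		return false;
	}
	if (!xfs_inode_is_busy(pag, last)) {
		*allocino = last;
		return false;
	}
	/* Assume the whole finobt needs recycling; allocate a new cluster. */
	return true;
}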
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 22:18 ` Dave Chinner @ 2022-01-28 14:11 ` Brian Foster 2022-01-28 23:53 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-28 14:11 UTC (permalink / raw) To: Dave Chinner; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Fri, Jan 28, 2022 at 09:18:17AM +1100, Dave Chinner wrote: > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > necessary to get the transactions out of the evict() path. All we > > > > > wanted was to ensure that there were no performance degradations as > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > where I mention those things to Paul. > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > patch is obviously fine. > > > > > > Yes, but not in the near term. > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > thread do look like that. > > > > > > The simplest way I think is to have the XFS inode allocation track > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > is an extent that has been freed by the user, but is not yet marked > > > free in the journal/on disk. If we try to reallocate that busy > > > extent, we either select a different free extent to allocate, or if > > > we can't find any we force the journal to disk, wait for it to > > > complete (hence unbusying the extents) and retry the allocation > > > again. > > > > > > We can do something similar for inode allocation - it's actually a > > > lockless tag lookup on the radix tree entry for the candidate inode > > > number. If we find the reclaimable radix tree tag set, the we select > > > a different inode. If we can't allocate a new inode, then we kick > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > recycled this time. > > > > > > > I'm starting to poke around this area since it's become clear that the > > currently proposed scheme just involves too much latency (unless Paul > > chimes in with his expedited grace period variant, at which point I will > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > have pretty much the same pattern of behavior as this patch: one > > synchronize_rcu() per batch. > > That's not really what I proposed - what I suggested was that if we > can't allocate a usable inode from the finobt, and we can't allocate > a new inode cluster from the AG (i.e. populate the finobt with more > inodes), only then call synchronise_rcu() and recycle an inode. > That's not how I read it... 
Regardless, that was my suggestion as well, so we're on the same page on that front. > We don't need to scan the inode cache or the finobt to determine if > there are reclaimable inodes immediately available - do a gang tag > lookup on the radix tree for newino. > If it comes back with an inode number that is not > equal to the node number we looked up, then we can allocate an > newino immediately. > > If it comes back with newino, then check the first inode in the > finobt. If that comes back with an inode that is not the first inode > in the finobt, we can immediately allocate the first inode in the > finobt. If not, check the last inode. if that fails, assume all > inodes in the finobt need recycling and allocate a new cluster, > pointing newino at it. > Hrm, I'll have to think about this some more. I don't mind something like this as a possible scanning allocation algorithm, but I don't love the idea of doing a few predictable btree/radix tree lookups and inferring broader AG state from that, particularly when I think it's possible to get more accurate information in a way that's easier and probably more efficient. For example, we already have counts of the number of reclaimable and free inodes in the perag. We could fairly easily add a counter to track the subset of reclaimable inodes that are unlinked. With something like that, it's easier to make higher level decisions like when to just allocate a new inode chunk (because the free inode pool consists mostly of reclaimable inodes) or just scanning through the finobt for a good candidate (because there are none or very few unlinked reclaimable inodes relative to the number of free inodes in the btree). So in general I think the two obvious ends of the spectrum (i.e. the repeated alloc/free workload I'm testing above vs. the tar/cp use case where there are many allocs and few unlinks) are probably the most straightforward to handle and don't require major search algorithm changes. It's the middle ground (i.e. a large number of free inodes with half or whatever more sitting in the radix tree) that I think requires some more thought and I don't quite have an answer for atm. I don't want to go off allocating new inode chunks too aggressively, but also don't want to turn the finobt allocation algorithm into something like the historical inobt search algorithm with poor worst case behavior. > Then we get another 64 inodes starting at the newino cursor we can > allocate from while we wait for the current RCU grace period to > expire for inodes already in the reclaimable state. An algorithm > like this will allow the free inode pool to resize automatically > based on the unlink frequency of the workload and RCU grace period > latency... > > > IOW, background reclaim only kicks in after 30s by default, > > 5 seconds, by default, not 30s. > xfs_reclaim_work_queue() keys off xfs_syncd_centisecs, which corresponds to xfs_params.syncd_timer, which is initialized as: .syncd_timer = { 1*100, 30*100, 7200*100}, Am I missing something? Not that it really matters much for this discussion anyways. Whether it's 30s or 5s, either way the reallocation workload is going to pretty much always recycle these inodes long before background reclaim gets to them. > > so the pool > > of free inodes pretty much always consists of 100% reclaimable inodes. > > On top of that, at smaller batch sizes, the pool tends to have a uniform > > (!elapsed) grace period cookie, so a stall is required to be able to > > allocate any of them. 
As the batch size increases, I do see the > > population of free inodes start to contain a mix of expired and > > non-expired grace period cookies. It's fairly easy to hack up an > > internal icwalk scan to locate already expired inodes, > > We don't want or need to do exhaustive, exactly correct scans here. > We want *fast and loose* because this is a critical performance fast > path. We don't care if we skip the occasional recyclable inode, what > we need to to is minimise the CPU overhead and search latency for > the case where recycling will never occur. > Agreed. That's what I meant by my comment about having heuristics to avoid large/long scans. > > but the problem > > is that the recycle rate is so much faster than the grace period latency > > that it doesn't really matter. We'll still have to stall by the time we > > get to the non-expired inodes, and so we're back to one stall per batch > > and the same general performance characteristic of this patch. > > Yes, but that's why I suggested that we allocate a new inode cluster > rather than calling synchronise_rcu() when we don't have a > recyclable inode candidate. > Ok. > > So given all of this, I'm wondering about something like the following > > high level inode allocation algorithm: > > > > 1. If the AG has any reclaimable inodes, scan for one with an expired > > grace period. If found, target that inode for physical allocation. > > How do you efficiently discriminate between "reclaimable w/ nlink > > 0" and "reclaimable w/ nlink == 0" so we don't get hung up searching > millions of reclaimable inodes for the one that has been unlinked > and has an expired grace period? > A counter or some other form of hinting structure.. > Also, this will need to be done on every inode allocation when we > have inodes in reclaimable state (which is almost always on a busy > system). Workloads with sequential allocation (as per untar, rsync, > git checkout, cp -r, etc) will do this scan unnecessarily as they > will almost never hit this inode recycle path as there aren't a lot > of unlinks occurring while they are working. > I'm not necessarily suggesting a full radix tree scan per inode allocation. I was more thinking about an occasionally updated hinting structure to efficiently locate the least recently freed inode numbers, or something similar. This would serve no purpose in scenarios where it just makes more sense to allocate new chunks, but otherwise could just serve as an allocation target, a metric to determine likelihood of reclaimable inodes w/ expired grace periods being present, or just a starting point for a finobt search algorithm like what you describe above, etc. > > 2. If the AG free inode count == the AG reclaimable count and we know > > all reclaimable inodes are most likely pending a grace period (because > > the previous step failed), allocate a new inode chunk (and target it in > > this allocation). > > That's good for the allocation that allocates the chunk, but... > > > 3. If the AG free inode count > the reclaimable count, scan the finobt > > for an inode that is not present in the radix tree (i.e. Dave's logic > > above). > > ... now we are repeating the radix tree walk that we've already done > in #1 to find the newly allocated inodes we allocated in #2. > > We don't need to walk the inodes in the inode radix tree to look at > individual inode state - we can use the reclaimable radix tree tag > to shortcut those walks and minimise the number of actual lookups we > need to do. 
By definition, and inode in the finobt and > XFS_IRECLAIMABLE state is an inode that needs recycling, so we can > just use the finobt and the inode radix tree tags to avoid inodes > that need recycling altogether. i.e. If we fail a tag lookup, we > have no reclaimable inodes in the range we asked the lookup to > search so we can immediately allocate - we don't need to actually > need to look at the inode in the fast path no-recycling case at all. > This is starting to make some odd (to me) assumptions about thus far undefined implementation details. For example, the very little amount of code I have already for experimentation purposes only scans tagged reclaimable inodes, so that you suggest doing exactly that instead of full radix tree scans suggests to me that there are some details here that are clearly not getting across in email. ;) That's fine, I'm not trying to cover details. Details are easier to work through with code, and TBH I don't have enough concrete ideas to hash through details in email just yet anyways. The primary concepts in my previous description were that we should prioritize allocation of new chunks over taking RCU stalls whenever possible, and that there might be ways to use existing radix tree state to maintain predictable worst case performance for finobt searches (TBD). With regard to the general principles you mention of avoiding repeated large scans, maintaing common workload and fast path performance, etc., I think we're pretty much on the same page. > Keep in mind that the fast path we really care about is not the > unlink/allocate looping case, it's the allocation case where no > recycling will ever occur and so that's the one we really have to > try hard to minimise the overhead for. The moment we get into > reclaimable inodes within the finobt range we're hitting the "lots > of temp files" use case, so we can detect that and keep the overhead > of that algorithm as separate as we possibly can. > > Hence we need the initial "can we allocate this inode number" > decision to be as fast and as low overhead as possible so we can > determine which algorithm we need to run. A lockless radix tree gang > tag lookup will give us that and if the lookup finds a reclaimable > inode only then do we move into the "recycle RCU avoidance" > algorithm path.... > > > > > Are there any realistic prospects of having xfs_iget() deal with > > > > reuse case by allocating new in-core inode and flipping whatever > > > > references you've got in XFS journalling data structures to the > > > > new copy? If I understood what you said on IRC correctly, that is... > > > > > > That's ... much harder. > > > > > > One of the problems is that once an inode has a log item attached to > > > it, it assumes that it can be accessed without specific locking, > > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > > stuff that needs to be taken care of in XFS first, and the inode <-> > > > log item relationship is tangled. > > > > > > I've been working towards removing that tangle - but taht stuff is > > > quite a distance down my logging rework patch queue. THat queue has > > > been stuck now for a year trying to get the first handful of rework > > > and scalability modifications reviewed and merged, so I'm not > > > holding my breathe as to how long a more substantial rework of > > > internal logging code will take to review and merge. > > > > > > Really, though, we need the inactivation stuff to be done as part of > > > the VFS inode lifecycle. 
I have some ideas on what to do here, but I > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > to process final unlinks in the bakground and then call evict() > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > us get rid of background inode freeing and hence inode recycling > > > from XFS altogether. I think we _might_ be able to do this without > > > needing to change any of the logging code in XFS, but I haven't > > > looked any further than this into it as yet. > > > > > > > ... of whatever this ends up looking like. > > > > Can you elaborate on what you mean by processing unlinks in the > > background? I can see the value of being able to eliminate the recycle > > code in XFS, but wouldn't we still have to limit and throttle against > > background work to maintain sustained removal performance? > > Yes, but that's irrelevant because all we would be doing is slightly > changing where that throttling occurs (i.e. in > iput_final->drop_inode instead of iput_final->evict->destroy_inode). > > However, moving the throttling up the stack is a good thing because > it gets rid of the current problem with the inactivation throttling > blocking the shrinker via shrinker->super_cache_scan-> > prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on > full inactivation queue because all the inodes need EOF block > trimming to be done. > What I'm trying to understand is whether inodes will have cycled through the requisite grace period before ->destroy_inode() or not, and if so, how that is done to avoid the sustained removal performance problem we've run into here (caused by the extra latency leading to increasing cacheline misses)..? > > IOW, what's > > the general teardown behavior you're getting at here, aside from what > > parts push into the vfs or not? > > ->drop_inode() triggers background inactivation for both blockgc and > inode unlink. For unlink, we set I_WILL_FREE so the VFS will not > attempt to re-use it, add the inode # to the internal AG "busy > inode" tree and return drop = true and the VFS then stops processing > that inode. For blockgc, we queue the work and return drop = false > and the VFS puts it onto the LRU. Now we have asynchronous > inactivation while the inode is still present and visible at the VFS > level. > > For background blockgc - that now happens while the inode is idle on > the LRU before it gets reclaimed by the shrinker. i.e. we trigger > block gc when the last reference to the inode goes away instead of > when it gets removed from memory by the shrinker. > > For unlink, that now runs in the bacgrkoud until the inode unlink > has been journalled and the cleared inode written to the backing > inode cluster buffer. The inode is then no longer visisble to the > journal and it can't be reallocated because it is still busy. We > then change the inode state from I_WILL_FREE to I_FREEING and call > evict(). The inode then gets torn down, and in ->destroy_inode we > remove the inode from the radix tree, clear the per-ag busy record > and free the inode via RCU as expected by the VFS. > Ok, so this sort of sounds like these are separate things. I'm all for creating more flexibility with the VFS to allow XFS to remove or simplify codepaths, but this still depends on some form of grace period tracking to avoid allocation of inodes that are free in the btrees but still might have in-core struct inode's laying around, yes? 
The reason I'm asking about this is because as this patch to avoid recycling non-expired inodes becomes more complex in order to satisfy performance requirements, longer term usefulness becomes more relevant. I don't want us to come up with some complex scheme to avoid RCU stalls when there's already a plan to rip it out and replace it in a year or so. OTOH if the resulting logic is part of that longer term strategy, then this is less of a concern. Brian > Another possible mechanism instead of exporting evict() is that > background inactivation takes a new reference to the inode from > ->drop_inode so that even if we put it on the LRU the inode cache > shrinker will skip it while we are doing background inactivation. > That would mean that when background inactivation is done, we call > iput_final() again. The inode will either then be left on the LRU or > go through the normal evict() path. > > This also it gets the memory demand and overhead of EOF block > trimming out of the memory reclaim path, and it also gets rid of > the need for the special superblock shrinker hooks that XFS has for > reclaiming it's internal inode cache. > > Overall, lifting this stuff up to the VFS is full of "less > complexity in XFS" wins if we can make it work... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
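[The per-AG counter Brian floats earlier in this mail could be as small as the sketch below. pag_ici_unlinked is a hypothetical field, the 50% threshold at the end is an arbitrary illustration, and the update sites would be wherever the existing reclaimable-inode accounting already happens.]

/* Sketch: count reclaimable inodes that are unlinked, i.e. recycle candidates. */
static inline void
xfs_perag_account_unlinked(
	struct xfs_perag	*pag,
	struct xfs_inode	*ip,
	int			delta)
{
	if (VFS_I(ip)->i_nlink == 0)
		atomic_add(delta, &pag->pag_ici_unlinked);	/* hypothetical field */
}

/* Allocation-time hint: a mostly-unlinked free pool suggests allocating a new chunk. */
static inline bool
xfs_perag_prefer_new_chunk(
	struct xfs_perag	*pag)
{
	return atomic_read(&pag->pag_ici_unlinked) * 2 > pag->pagi_freecount;
}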
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-28 14:11 ` Brian Foster @ 2022-01-28 23:53 ` Dave Chinner 2022-01-31 13:28 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-01-28 23:53 UTC (permalink / raw) To: Brian Foster; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Fri, Jan 28, 2022 at 09:11:07AM -0500, Brian Foster wrote: > On Fri, Jan 28, 2022 at 09:18:17AM +1100, Dave Chinner wrote: > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > where I mention those things to Paul. > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > patch is obviously fine. > > > > > > > > Yes, but not in the near term. > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > thread do look like that. > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > is an extent that has been freed by the user, but is not yet marked > > > > free in the journal/on disk. If we try to reallocate that busy > > > > extent, we either select a different free extent to allocate, or if > > > > we can't find any we force the journal to disk, wait for it to > > > > complete (hence unbusying the extents) and retry the allocation > > > > again. > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > a different inode. If we can't allocate a new inode, then we kick > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > recycled this time. > > > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > currently proposed scheme just involves too much latency (unless Paul > > > chimes in with his expedited grace period variant, at which point I will > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > have pretty much the same pattern of behavior as this patch: one > > > synchronize_rcu() per batch. 
> > > > That's not really what I proposed - what I suggested was that if we > > can't allocate a usable inode from the finobt, and we can't allocate > > a new inode cluster from the AG (i.e. populate the finobt with more > > inodes), only then call synchronise_rcu() and recycle an inode. > > > > That's not how I read it... Regardless, that was my suggestion as well, > so we're on the same page on that front. > > > We don't need to scan the inode cache or the finobt to determine if > > there are reclaimable inodes immediately available - do a gang tag > > lookup on the radix tree for newino. > > If it comes back with an inode number that is not > > equal to the node number we looked up, then we can allocate an > > newino immediately. > > > > If it comes back with newino, then check the first inode in the > > finobt. If that comes back with an inode that is not the first inode > > in the finobt, we can immediately allocate the first inode in the > > finobt. If not, check the last inode. if that fails, assume all > > inodes in the finobt need recycling and allocate a new cluster, > > pointing newino at it. > > > > Hrm, I'll have to think about this some more. I don't mind something > like this as a possible scanning allocation algorithm, but I don't love > the idea of doing a few predictable btree/radix tree lookups and > inferring broader AG state from that, particularly when I think it's > possible to get more accurate information in a way that's easier and > probably more efficient. > > For example, we already have counts of the number of reclaimable and > free inodes in the perag. We could fairly easily add a counter to track > the subset of reclaimable inodes that are unlinked. With something like > that, it's easier to make higher level decisions like when to just > allocate a new inode chunk (because the free inode pool consists mostly > of reclaimable inodes) or just scanning through the finobt for a good > candidate (because there are none or very few unlinked reclaimable > inodes relative to the number of free inodes in the btree). > > So in general I think the two obvious ends of the spectrum (i.e. the > repeated alloc/free workload I'm testing above vs. the tar/cp use case > where there are many allocs and few unlinks) are probably the most > straightforward to handle and don't require major search algorithm > changes. It's the middle ground (i.e. a large number of free inodes > with half or whatever more sitting in the radix tree) that I think > requires some more thought and I don't quite have an answer for atm. I > don't want to go off allocating new inode chunks too aggressively, but > also don't want to turn the finobt allocation algorithm into something > like the historical inobt search algorithm with poor worst case > behavior. > > > Then we get another 64 inodes starting at the newino cursor we can > > allocate from while we wait for the current RCU grace period to > > expire for inodes already in the reclaimable state. An algorithm > > like this will allow the free inode pool to resize automatically > > based on the unlink frequency of the workload and RCU grace period > > latency... > > > > > IOW, background reclaim only kicks in after 30s by default, > > > > 5 seconds, by default, not 30s. > > > > xfs_reclaim_work_queue() keys off xfs_syncd_centisecs, which corresponds > to xfs_params.syncd_timer, which is initialized as: > > .syncd_timer = { 1*100, 30*100, 7200*100}, > > Am I missing something? 
static void
xfs_reclaim_work_queue(
	struct xfs_mount	*mp)
{

	rcu_read_lock();
	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
		queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work,
			msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
	}
	rcu_read_unlock();
}
.... > > > > Really, though, we need the inactivation stuff to be done as part of > > > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > > to process final unlinks in the bakground and then call evict() > > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > > us get rid of background inode freeing and hence inode recycling > > > > from XFS altogether. I think we _might_ be able to do this without > > > > needing to change any of the logging code in XFS, but I haven't > > > > looked any further than this into it as yet. > > > > > > > > ... of whatever this ends up looking like. > > > > > > Can you elaborate on what you mean by processing unlinks in the > > > background? I can see the value of being able to eliminate the recycle > > > code in XFS, but wouldn't we still have to limit and throttle against > > > background work to maintain sustained removal performance? > > > > Yes, but that's irrelevant because all we would be doing is slightly > > changing where that throttling occurs (i.e. in > > iput_final->drop_inode instead of iput_final->evict->destroy_inode). > > > > However, moving the throttling up the stack is a good thing because > > it gets rid of the current problem with the inactivation throttling > > blocking the shrinker via shrinker->super_cache_scan-> > > prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on > > full inactivation queue because all the inodes need EOF block > > trimming to be done. > > > > What I'm trying to understand is whether inodes will have cycled through > > the requisite grace period before ->destroy_inode() or not, and if so, The whole point of moving stuff up in the VFS is that inodes don't get recycled by XFS at all so we don't even have to think about RCU grace periods anywhere inside XFS. > > how that is done to avoid the sustained removal performance problem > > we've run into here (caused by the extra latency leading to increasing > > cacheline misses)..? The background work is done _before_ evict() is called by the VFS to get the inode freed via RCU callbacks. The perf constraints are unchanged, we just change the layer at which the background work is performed. > > > > IOW, what's > > > > the general teardown behavior you're getting at here, aside from what > > > > parts push into the vfs or not? > > > > > > ->drop_inode() triggers background inactivation for both blockgc and > > > inode unlink. For unlink, we set I_WILL_FREE so the VFS will not > > > attempt to re-use it, add the inode # to the internal AG "busy > > > inode" tree and return drop = true and the VFS then stops processing > > > that inode. For blockgc, we queue the work and return drop = false > > > and the VFS puts it onto the LRU. Now we have asynchronous > > > inactivation while the inode is still present and visible at the VFS > > > level. > > > > > > For background blockgc - that now happens while the inode is idle on > > > the LRU before it gets reclaimed by the shrinker. i.e. we trigger > > > block gc when the last reference to the inode goes away instead of > > > when it gets removed from memory by the shrinker.
> > > > For unlink, that now runs in the bacgrkoud until the inode unlink > > has been journalled and the cleared inode written to the backing > > inode cluster buffer. The inode is then no longer visisble to the > > journal and it can't be reallocated because it is still busy. We > > then change the inode state from I_WILL_FREE to I_FREEING and call > > evict(). The inode then gets torn down, and in ->destroy_inode we > > remove the inode from the radix tree, clear the per-ag busy record > > and free the inode via RCU as expected by the VFS. > > > > Ok, so this sort of sounds like these are separate things. I'm all for > creating more flexibility with the VFS to allow XFS to remove or > simplify codepaths, but this still depends on some form of grace period > tracking to avoid allocation of inodes that are free in the btrees but > still might have in-core struct inode's laying around, yes? > The reason I'm asking about this is because as this patch to avoid > recycling non-expired inodes becomes more complex in order to satisfy > performance requirements, longer term usefulness becomes more relevant. You say this like I haven't already thought about this.... > I don't want us to come up with some complex scheme to avoid RCU stalls > when there's already a plan to rip it out and replace it in a year or > so. OTOH if the resulting logic is part of that longer term strategy, > then this is less of a concern. .... and so maybe you haven't realised why I keep suggesting something along the lines of a busy inode mechanism similar to busy extent tracking? Essentially, we can't reallocate the inode until the previous use has been retired. Which means we'd create the busy inode record in xfs_inactive() before we free the inode and xfs_reclaim_inode() would remove the inode from the busy tree when it reclaims the inode and removes it from the radix tree after marking it dead for RCU lookup purposes. That would prevent reallocation of the inode until we can allocate a new in-core inode structure for the inode. In the lifted VFS case I describe, ->drop_inode() would result in background inactivation inserting the inode into the busy tree. Once that is all done and we call evict() on the inode, ->destroy_inode calls xfs_reclaim_inode() directly. IOWs, the busy inode mechanism works for both existing and future inactivation mechanisms. Now, let's take a step further back from this, and consider the current inode cache implementation. The fast and dirty method for tracking busy inodes is to use the fact that a busy inode is defined as being in the finobt whilst the in-core inode is in an IRECLAIMABLE state. Hence, at least initially, we don't need a separate tree to determine if an inode is "busy" efficiently. The allocation policy that selects the inode to allocate doesn't care what mechanism we use to determine if an inode is busy - it's just concerned with finding a non-busy inode efficiently. Hence we can use a simple "best, first, last" heuristic to determine if the finobt is likely to be largely made up of busy inodes and decide to allocate new inode chunks instead of searching the finobt for an unbusy inode. IOWs, the "busy inode tracking" implementation will need to change to be something more explicit as we move inactivation up in the VFS because the IRECLAIMABLE state goes away, but that doesn't change the allocation algorithm or heuristics that are based on detecting busy inodes at allocation time. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
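[For what it's worth, the explicit busy-inode record Dave sketches above (inserted from xfs_inactive(), cleared from xfs_reclaim_inode()) could start out as small as the sketch below. pag_busy_inodes and the helper names are hypothetical, mirroring the busy extent tree in spirit rather than implementation; initialisation and error handling are omitted.]

/* Sketch: explicit per-AG busy inode tracking -- not actual XFS code. */
static void
xfs_inode_mark_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	/* called from xfs_inactive() before the on-disk free is committed */
	xa_store(&pag->pag_busy_inodes, agino, xa_mk_value(1), GFP_NOFS);
}

static void
xfs_inode_clear_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	/* called from xfs_reclaim_inode() once RCU lookups can no longer see it */
	xa_erase(&pag->pag_busy_inodes, agino);
}

static bool
xfs_inode_busy(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	/* allocation-time check, replacing the implicit IRECLAIMABLE test */
	return xa_load(&pag->pag_busy_inodes, agino) != NULL;
}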
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-28 23:53 ` Dave Chinner @ 2022-01-31 13:28 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-01-31 13:28 UTC (permalink / raw) To: Dave Chinner; +Cc: Al Viro, linux-xfs, Ian Kent, rcu On Sat, Jan 29, 2022 at 10:53:13AM +1100, Dave Chinner wrote: > On Fri, Jan 28, 2022 at 09:11:07AM -0500, Brian Foster wrote: > > On Fri, Jan 28, 2022 at 09:18:17AM +1100, Dave Chinner wrote: > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > patch is obviously fine. > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > thread do look like that. > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > extent, we either select a different free extent to allocate, or if > > > > > we can't find any we force the journal to disk, wait for it to > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > again. > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > recycled this time. > > > > > > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > currently proposed scheme just involves too much latency (unless Paul > > > > chimes in with his expedited grace period variant, at which point I will > > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > have pretty much the same pattern of behavior as this patch: one > > > > synchronize_rcu() per batch. 
> > > > > > That's not really what I proposed - what I suggested was that if we > > > can't allocate a usable inode from the finobt, and we can't allocate > > > a new inode cluster from the AG (i.e. populate the finobt with more > > > inodes), only then call synchronise_rcu() and recycle an inode. > > > > > > > That's not how I read it... Regardless, that was my suggestion as well, > > so we're on the same page on that front. > > > > > We don't need to scan the inode cache or the finobt to determine if > > > there are reclaimable inodes immediately available - do a gang tag > > > lookup on the radix tree for newino. > > > If it comes back with an inode number that is not > > > equal to the node number we looked up, then we can allocate an > > > newino immediately. > > > > > > If it comes back with newino, then check the first inode in the > > > finobt. If that comes back with an inode that is not the first inode > > > in the finobt, we can immediately allocate the first inode in the > > > finobt. If not, check the last inode. if that fails, assume all > > > inodes in the finobt need recycling and allocate a new cluster, > > > pointing newino at it. > > > > > > > Hrm, I'll have to think about this some more. I don't mind something > > like this as a possible scanning allocation algorithm, but I don't love > > the idea of doing a few predictable btree/radix tree lookups and > > inferring broader AG state from that, particularly when I think it's > > possible to get more accurate information in a way that's easier and > > probably more efficient. > > > > For example, we already have counts of the number of reclaimable and > > free inodes in the perag. We could fairly easily add a counter to track > > the subset of reclaimable inodes that are unlinked. With something like > > that, it's easier to make higher level decisions like when to just > > allocate a new inode chunk (because the free inode pool consists mostly > > of reclaimable inodes) or just scanning through the finobt for a good > > candidate (because there are none or very few unlinked reclaimable > > inodes relative to the number of free inodes in the btree). > > > > So in general I think the two obvious ends of the spectrum (i.e. the > > repeated alloc/free workload I'm testing above vs. the tar/cp use case > > where there are many allocs and few unlinks) are probably the most > > straightforward to handle and don't require major search algorithm > > changes. It's the middle ground (i.e. a large number of free inodes > > with half or whatever more sitting in the radix tree) that I think > > requires some more thought and I don't quite have an answer for atm. I > > don't want to go off allocating new inode chunks too aggressively, but > > also don't want to turn the finobt allocation algorithm into something > > like the historical inobt search algorithm with poor worst case > > behavior. > > > > > Then we get another 64 inodes starting at the newino cursor we can > > > allocate from while we wait for the current RCU grace period to > > > expire for inodes already in the reclaimable state. An algorithm > > > like this will allow the free inode pool to resize automatically > > > based on the unlink frequency of the workload and RCU grace period > > > latency... > > > > > > > IOW, background reclaim only kicks in after 30s by default, > > > > > > 5 seconds, by default, not 30s. 
> > > > > > > xfs_reclaim_work_queue() keys off xfs_syncd_centisecs, which corresponds > > to xfs_params.syncd_timer, which is initialized as: > > > > .syncd_timer = { 1*100, 30*100, 7200*100}, > > > > Am I missing something? > > static void > xfs_reclaim_work_queue( > struct xfs_mount *mp) > { > > rcu_read_lock(); > if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) { > queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work, > msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10)); > } > rcu_read_unlock(); > } > Ah, thanks. > .... > > > > > > Really, though, we need the inactivation stuff to be done as part of > > > > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > > > to process final unlinks in the bakground and then call evict() > > > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > > > us get rid of background inode freeing and hence inode recycling > > > > > from XFS altogether. I think we _might_ be able to do this without > > > > > needing to change any of the logging code in XFS, but I haven't > > > > > looked any further than this into it as yet. > > > > > > > > > > > > > ... of whatever this ends up looking like. > > > > > > > > Can you elaborate on what you mean by processing unlinks in the > > > > background? I can see the value of being able to eliminate the recycle > > > > code in XFS, but wouldn't we still have to limit and throttle against > > > > background work to maintain sustained removal performance? > > > > > > Yes, but that's irrelevant because all we would be doing is slightly > > > changing where that throttling occurs (i.e. in > > > iput_final->drop_inode instead of iput_final->evict->destroy_inode). > > > > > > However, moving the throttling up the stack is a good thing because > > > it gets rid of the current problem with the inactivation throttling > > > blocking the shrinker via shrinker->super_cache_scan-> > > > prune_icache_sb->dispose_list->evict-> destroy_inode->throttle on > > > full inactivation queue because all the inodes need EOF block > > > trimming to be done. > > > > > > > What I'm trying to understand is whether inodes will have cycled through > > the requisite grace period before ->destroy_inode() or not, and if so, > > The whole point of moving stuff up in the VFS is that inodes > don't get recycled by XFS at all so we don't even have to think > about RCU grace periods anywhere inside XFS. > > > how that is done to avoid the sustained removal performance problem > > we've run into here (caused by the extra latency leading to increasing > > cacheline misses)..? > > The background work is done _before_ evict() is called by the VFS to > get the inode freed via RCU callbacks. The perf constraints are > unchanged, we just change the layer at which the background work is > performance. > Ok. > > > > IOW, what's > > > > the general teardown behavior you're getting at here, aside from what > > > > parts push into the vfs or not? > > > > > > ->drop_inode() triggers background inactivation for both blockgc and > > > inode unlink. For unlink, we set I_WILL_FREE so the VFS will not > > > attempt to re-use it, add the inode # to the internal AG "busy > > > inode" tree and return drop = true and the VFS then stops processing > > > that inode. 
For blockgc, we queue the work and return drop = false > > > and the VFS puts it onto the LRU. Now we have asynchronous > > > inactivation while the inode is still present and visible at the VFS > > > level. > > > > > > For background blockgc - that now happens while the inode is idle on > > > the LRU before it gets reclaimed by the shrinker. i.e. we trigger > > > block gc when the last reference to the inode goes away instead of > > > when it gets removed from memory by the shrinker. > > > > > > For unlink, that now runs in the background until the inode unlink > > > has been journalled and the cleared inode written to the backing > > > inode cluster buffer. The inode is then no longer visible to the > > > journal and it can't be reallocated because it is still busy. We > > > then change the inode state from I_WILL_FREE to I_FREEING and call > > > evict(). The inode then gets torn down, and in ->destroy_inode we > > > remove the inode from the radix tree, clear the per-ag busy record > > > and free the inode via RCU as expected by the VFS. > > > > > > > Ok, so this sort of sounds like these are separate things. I'm all for > > creating more flexibility with the VFS to allow XFS to remove or > > simplify codepaths, but this still depends on some form of grace period > > tracking to avoid allocation of inodes that are free in the btrees but > > still might have in-core struct inodes lying around, yes? > > > The reason I'm asking about this is because as this patch to avoid > > recycling non-expired inodes becomes more complex in order to satisfy > > performance requirements, longer term usefulness becomes more relevant. > > You say this like I haven't already thought about this.... > > > I don't want us to come up with some complex scheme to avoid RCU stalls > > when there's already a plan to rip it out and replace it in a year or > > so. OTOH if the resulting logic is part of that longer term strategy, > > then this is less of a concern. > > .... and so maybe you haven't realised why I keep suggesting > something along the lines of a busy inode mechanism similar to busy > extent tracking? > > Essentially, we can't reallocate the inode until the previous use > has been retired. Which means we'd create the busy inode record in > xfs_inactive() before we free the inode and xfs_reclaim_inode() > would remove the inode from the busy tree when it reclaims the inode > and removes it from the radix tree after marking it dead for RCU > lookup purposes. That would prevent reallocation of the inode until > we can allocate a new in-core inode structure for the inode. > > In the lifted VFS case I describe, ->drop_inode() would result in > background inactivation inserting the inode into the busy tree. Once > that is all done and we call evict() on the inode, ->destroy_inode > calls xfs_reclaim_inode() directly. IOWs, the busy inode mechanism > works for both existing and future inactivation mechanisms. > This is what I was trying to understand. The discussion to this point around eventually moving lifecycle bits into the VFS gave the impression that the grace period sequence would essentially be hidden from XFS, so that's why I've been asking how we expect to accomplish that. ISTM that's not necessarily the case... the notion of a free (on disk) inode that cannot be used due to a pending grace period still exists, it's just abstracted as a "busy inode" and used to implement a rule that such inodes cannot be reallocated until the VFS indicates so.
At that point we reclaim the struct inode so this presumably eliminates the need for the recycling logic and perhaps various other lifecycle related bits (that I've not thought through) in XFS, providing further simplification opportunities, etc. If I'm following the general idea correctly, this makes more sense to me. Thanks. Brian > Now, let's take a step further back from this, and consider the > current inode cache implementation. The fast and dirty method for > tracking busy inodes is to use the fact that a busy inode is defined > as being in the finobt whilst the in-core inode is in an > IRECLAIMABLE state. > > Hence, at least initially, we don't need a separate tree to > determine if an inode is "busy" efficiently. The allocation policy > that selects the inode to allocate doesn't care what mechanism we > use to determine if an inode is busy - it's just concerned with > finding a non-busy inode efficiently. Hence we can use a simple > "best, first, last" heuristic to determine if the finobt is likely > to be largely made up of busy inodes and decide to allocate new > inode chunks instead of searching the finobt for an unbusy inode. > > IOWs, the "busy extent tracking" implementation will need to change > to be something more explicit as we move inactivation up in the VFS > because the IRECLAIMABLE state goes away, but that doesn't change > the allocation algorithm or heuristics that are based on detecting > busy inodes at allocation time. > > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 36+ messages in thread
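As a rough illustration of the "best, first, last" idea quoted above (glossing over the fact that finobt records cover inode chunks with free masks rather than single inode numbers), a decision helper might look like the following sketch. The two xfs_finobt_*_agino() lookups are hypothetical stand-ins for finobt cursor lookups; the radix tree tag check is the existing XFS_ICI_RECLAIM_TAG mechanism.

	/*
	 * Sketch: probe both ends of the finobt. If both are still tagged
	 * reclaimable in the per-AG inode radix tree, assume the finobt is
	 * dominated by busy inodes and prefer allocating a new inode chunk.
	 */
	static bool
	xfs_finobt_mostly_busy(
		struct xfs_perag	*pag)
	{
		xfs_agino_t		first, last;
		bool			first_busy, last_busy;

		if (xfs_finobt_first_agino(pag, &first))	/* hypothetical */
			return false;			/* empty finobt */
		if (xfs_finobt_last_agino(pag, &last))		/* hypothetical */
			last = first;

		rcu_read_lock();
		first_busy = radix_tree_tag_get(&pag->pag_ici_root, first,
						XFS_ICI_RECLAIM_TAG);
		last_busy = (first == last) ? first_busy :
			    radix_tree_tag_get(&pag->pag_ici_root, last,
					       XFS_ICI_RECLAIM_TAG);
		rcu_read_unlock();

		return first_busy && last_busy;
	}

A caller that sees true here would point newino at a freshly allocated cluster rather than walking the finobt record by record looking for an unbusy inode.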
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-27 19:01 ` Brian Foster 2022-01-27 22:18 ` Dave Chinner @ 2022-01-28 21:39 ` Paul E. McKenney 2022-01-31 13:22 ` Brian Foster 1 sibling, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-01-28 21:39 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > Right, background inactivation does not improve performance - it's > > > > necessary to get the transactions out of the evict() path. All we > > > > wanted was to ensure that there were no performance degradations as > > > > a result of background inactivation, not that it was faster. > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > access when the batch size is increased, cpu profiles with 'perf > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > where I mention those things to Paul. > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > I'm not asking for that to happen this cycle and for backports Ian's > > > patch is obviously fine. > > > > Yes, but not in the near term. > > > > > What I really want to avoid is the situation when we are stuck with > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > thread do look like that. > > > > The simplest way I think is to have the XFS inode allocation track > > "busy inodes" in the same way we track "busy extents". A busy extent > > is an extent that has been freed by the user, but is not yet marked > > free in the journal/on disk. If we try to reallocate that busy > > extent, we either select a different free extent to allocate, or if > > we can't find any we force the journal to disk, wait for it to > > complete (hence unbusying the extents) and retry the allocation > > again. > > > > We can do something similar for inode allocation - it's actually a > > lockless tag lookup on the radix tree entry for the candidate inode > > number. If we find the reclaimable radix tree tag set, the we select > > a different inode. If we can't allocate a new inode, then we kick > > synchronize_rcu() and retry the allocation, allowing inodes to be > > recycled this time. > > I'm starting to poke around this area since it's become clear that the > currently proposed scheme just involves too much latency (unless Paul > chimes in with his expedited grace period variant, at which point I will > revisit) in the fast allocation/recycle path. ISTM so far that a simple > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > have pretty much the same pattern of behavior as this patch: one > synchronize_rcu() per batch. Apologies for being slow, but there have been some distractions. One of the distractions was trying to put together atheoretically attractive but massively overcomplicated implementation of poll_state_synchronize_rcu_expedited(). It currently looks like a somewhat suboptimal but much simpler approach is available. This assumes that XFS is not in the picture until after both the scheduler and workqueues are operational. 
And yes, the complicated version might prove necessary, but let's see if this whole thing is even useful first. ;-) In the meantime, if you want to look at an extremely unbaked view, here you go: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Thanx, Paul > IOW, background reclaim only kicks in after 30s by default, so the pool > of free inodes pretty much always consists of 100% reclaimable inodes. > On top of that, at smaller batch sizes, the pool tends to have a uniform > (!elapsed) grace period cookie, so a stall is required to be able to > allocate any of them. As the batch size increases, I do see the > population of free inodes start to contain a mix of expired and > non-expired grace period cookies. It's fairly easy to hack up an > internal icwalk scan to locate already expired inodes, but the problem > is that the recycle rate is so much faster than the grace period latency > that it doesn't really matter. We'll still have to stall by the time we > get to the non-expired inodes, and so we're back to one stall per batch > and the same general performance characteristic of this patch. > > So given all of this, I'm wondering about something like the following > high level inode allocation algorithm: > > 1. If the AG has any reclaimable inodes, scan for one with an expired > grace period. If found, target that inode for physical allocation. > > 2. If the AG free inode count == the AG reclaimable count and we know > all reclaimable inodes are most likely pending a grace period (because > the previous step failed), allocate a new inode chunk (and target it in > this allocation). > > 3. If the AG free inode count > the reclaimable count, scan the finobt > for an inode that is not present in the radix tree (i.e. Dave's logic > above). > > Each of those steps could involve some heuristics to maintain > predictable behavior and avoid large scans and such, but the general > idea is that the repeated alloc/free inode workload naturally populates > the AG with enough physical inodes to always be able to satisfy an > allocation without waiting on a grace period. IOW, this is effectively > similar behavior to if physical inode freeing was delayed to an rcu > callback, with the tradeoff of complicating the allocation path rather > than stalling in the inactivation pipeline. Thoughts? > > This of course is more involved than this patch (or similarly simple > variants of RCU delaying preexisting bits of code) and requires some > more investigation, but certainly shouldn't be a multi-year thing. The > question is probably more of whether it's enough complexity to justify > in the meantime... > > > > Are there any realistic prospects of having xfs_iget() deal with > > > reuse case by allocating new in-core inode and flipping whatever > > > references you've got in XFS journalling data structures to the > > > new copy? If I understood what you said on IRC correctly, that is... > > > > That's ... much harder. > > > > One of the problems is that once an inode has a log item attached to > > it, it assumes that it can be accessed without specific locking, > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > stuff that needs to be taken care of in XFS first, and the inode <-> > > log item relationship is tangled. > > > > I've been working towards removing that tangle - but taht stuff is > > quite a distance down my logging rework patch queue. 
THat queue has > > been stuck now for a year trying to get the first handful of rework > > and scalability modifications reviewed and merged, so I'm not > > holding my breathe as to how long a more substantial rework of > > internal logging code will take to review and merge. > > > > Really, though, we need the inactivation stuff to be done as part of > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > suspect we'll need some changes to iput_final()/evict() to allow us > > to process final unlinks in the bakground and then call evict() > > ourselves when the unlink completes. That way ->destroy_inode() can > > just call xfs_reclaim_inode() to free it directly, which also helps > > us get rid of background inode freeing and hence inode recycling > > from XFS altogether. I think we _might_ be able to do this without > > needing to change any of the logging code in XFS, but I haven't > > looked any further than this into it as yet. > > > > ... of whatever this ends up looking like. > > Can you elaborate on what you mean by processing unlinks in the > background? I can see the value of being able to eliminate the recycle > code in XFS, but wouldn't we still have to limit and throttle against > background work to maintain sustained removal performance? IOW, what's > the general teardown behavior you're getting at here, aside from what > parts push into the vfs or not? > > Brian > > > > Again, I'm not asking if it can be done this cycle; having a > > > realistic path to doing that eventually would be fine by me. > > > > We're talking a year at least, probably two, before we get there... > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
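For reference, the three-step allocation policy Brian sketches in the quoted text above could be expressed along these lines. The perag already carries free and reclaimable inode counts, as noted earlier in the thread, though the field names below are approximate; the enum, the expired-cookie scan helper, and the lock-free reads of the counters are hypothetical simplifications.

	enum xfs_ialloc_hint {
		XFS_IALLOC_RECYCLE_EXPIRED,	/* step 1: reuse an expired inode */
		XFS_IALLOC_NEW_CHUNK,		/* step 2: everything free is busy */
		XFS_IALLOC_SCAN_FINOBT,		/* step 3: plenty of unbusy inodes */
	};

	static enum xfs_ialloc_hint
	xfs_ialloc_pick_strategy(
		struct xfs_perag	*pag,
		xfs_agino_t		*aginop)
	{
		/*
		 * 1. If there are reclaimable inodes, look for one whose
		 *    grace period cookie has already expired and recycle it.
		 */
		if (pag->pag_ici_reclaimable &&		/* existing counter, name approximate */
		    !xfs_icwalk_find_expired(pag, aginop))	/* hypothetical */
			return XFS_IALLOC_RECYCLE_EXPIRED;

		/*
		 * 2. Every free inode is also reclaimable and, since step 1
		 *    failed, still waiting on a grace period: grow the pool.
		 */
		if (pag->pagi_freecount <= pag->pag_ici_reclaimable)
			return XFS_IALLOC_NEW_CHUNK;

		/*
		 * 3. There are free inodes with no in-core footprint: search
		 *    the finobt for one not present in the radix tree.
		 */
		return XFS_IALLOC_SCAN_FINOBT;
	}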
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-28 21:39 ` Paul E. McKenney @ 2022-01-31 13:22 ` Brian Foster 2022-02-01 22:00 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-01-31 13:22 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > necessary to get the transactions out of the evict() path. All we > > > > > wanted was to ensure that there were no performance degradations as > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > where I mention those things to Paul. > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > patch is obviously fine. > > > > > > Yes, but not in the near term. > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > thread do look like that. > > > > > > The simplest way I think is to have the XFS inode allocation track > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > is an extent that has been freed by the user, but is not yet marked > > > free in the journal/on disk. If we try to reallocate that busy > > > extent, we either select a different free extent to allocate, or if > > > we can't find any we force the journal to disk, wait for it to > > > complete (hence unbusying the extents) and retry the allocation > > > again. > > > > > > We can do something similar for inode allocation - it's actually a > > > lockless tag lookup on the radix tree entry for the candidate inode > > > number. If we find the reclaimable radix tree tag set, the we select > > > a different inode. If we can't allocate a new inode, then we kick > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > recycled this time. > > > > I'm starting to poke around this area since it's become clear that the > > currently proposed scheme just involves too much latency (unless Paul > > chimes in with his expedited grace period variant, at which point I will > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > have pretty much the same pattern of behavior as this patch: one > > synchronize_rcu() per batch. > > Apologies for being slow, but there have been some distractions. > One of the distractions was trying to put together atheoretically > attractive but massively overcomplicated implementation of > poll_state_synchronize_rcu_expedited(). 
It currently looks like a > somewhat suboptimal but much simpler approach is available. This > assumes that XFS is not in the picture until after both the scheduler > and workqueues are operational. > No worries.. I don't think that would be a roadblock for us. ;) > And yes, the complicated version might prove necessary, but let's > see if this whole thing is even useful first. ;-) > Indeed. This patch only really requires a single poll/sync pair of calls, so assuming the expedited grace period usage plays nice enough with typical !expedited usage elsewhere in the kernel for some basic tests, it would be fairly trivial to port this over and at least get an idea of what the worst case behavior might be with expedited grace periods, whether it satisfies the existing latency requirements, etc. Brian > In the meantime, if you want to look at an extremely unbaked view, > here you go: > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > Thanx, Paul > > > IOW, background reclaim only kicks in after 30s by default, so the pool > > of free inodes pretty much always consists of 100% reclaimable inodes. > > On top of that, at smaller batch sizes, the pool tends to have a uniform > > (!elapsed) grace period cookie, so a stall is required to be able to > > allocate any of them. As the batch size increases, I do see the > > population of free inodes start to contain a mix of expired and > > non-expired grace period cookies. It's fairly easy to hack up an > > internal icwalk scan to locate already expired inodes, but the problem > > is that the recycle rate is so much faster than the grace period latency > > that it doesn't really matter. We'll still have to stall by the time we > > get to the non-expired inodes, and so we're back to one stall per batch > > and the same general performance characteristic of this patch. > > > > So given all of this, I'm wondering about something like the following > > high level inode allocation algorithm: > > > > 1. If the AG has any reclaimable inodes, scan for one with an expired > > grace period. If found, target that inode for physical allocation. > > > > 2. If the AG free inode count == the AG reclaimable count and we know > > all reclaimable inodes are most likely pending a grace period (because > > the previous step failed), allocate a new inode chunk (and target it in > > this allocation). > > > > 3. If the AG free inode count > the reclaimable count, scan the finobt > > for an inode that is not present in the radix tree (i.e. Dave's logic > > above). > > > > Each of those steps could involve some heuristics to maintain > > predictable behavior and avoid large scans and such, but the general > > idea is that the repeated alloc/free inode workload naturally populates > > the AG with enough physical inodes to always be able to satisfy an > > allocation without waiting on a grace period. IOW, this is effectively > > similar behavior to if physical inode freeing was delayed to an rcu > > callback, with the tradeoff of complicating the allocation path rather > > than stalling in the inactivation pipeline. Thoughts? > > > > This of course is more involved than this patch (or similarly simple > > variants of RCU delaying preexisting bits of code) and requires some > > more investigation, but certainly shouldn't be a multi-year thing. The > > question is probably more of whether it's enough complexity to justify > > in the meantime... 
> > > > > > Are there any realistic prospects of having xfs_iget() deal with > > > > reuse case by allocating new in-core inode and flipping whatever > > > > references you've got in XFS journalling data structures to the > > > > new copy? If I understood what you said on IRC correctly, that is... > > > > > > That's ... much harder. > > > > > > One of the problems is that once an inode has a log item attached to > > > it, it assumes that it can be accessed without specific locking, > > > etc. see xfs_inode_clean(), for example. So there's some life-cycle > > > stuff that needs to be taken care of in XFS first, and the inode <-> > > > log item relationship is tangled. > > > > > > I've been working towards removing that tangle - but taht stuff is > > > quite a distance down my logging rework patch queue. THat queue has > > > been stuck now for a year trying to get the first handful of rework > > > and scalability modifications reviewed and merged, so I'm not > > > holding my breathe as to how long a more substantial rework of > > > internal logging code will take to review and merge. > > > > > > Really, though, we need the inactivation stuff to be done as part of > > > the VFS inode lifecycle. I have some ideas on what to do here, but I > > > suspect we'll need some changes to iput_final()/evict() to allow us > > > to process final unlinks in the bakground and then call evict() > > > ourselves when the unlink completes. That way ->destroy_inode() can > > > just call xfs_reclaim_inode() to free it directly, which also helps > > > us get rid of background inode freeing and hence inode recycling > > > from XFS altogether. I think we _might_ be able to do this without > > > needing to change any of the logging code in XFS, but I haven't > > > looked any further than this into it as yet. > > > > > > > ... of whatever this ends up looking like. > > > > Can you elaborate on what you mean by processing unlinks in the > > background? I can see the value of being able to eliminate the recycle > > code in XFS, but wouldn't we still have to limit and throttle against > > background work to maintain sustained removal performance? IOW, what's > > the general teardown behavior you're getting at here, aside from what > > parts push into the vfs or not? > > > > Brian > > > > > > Again, I'm not asking if it can be done this cycle; having a > > > > realistic path to doing that eventually would be fine by me. > > > > > > We're talking a year at least, probably two, before we get there... > > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
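For context, the "single poll/sync pair of calls" Brian refers to above is the shape of the RCU usage in the patch under discussion: one call on the inactivation/destroy side to capture (and if necessary start) a grace period, and one on the recycle side that waits only if that grace period has not yet completed. The XFS hook names and the i_destroy_gp field name in the sketch below are placeholders; only the two RCU calls are the point.

	/* Destroy/inactivation side (placeholder hook name). */
	static void
	xfs_inodegc_note_gp(
		struct xfs_inode	*ip)
	{
		/* snapshot the current grace period, starting one if needed */
		ip->i_destroy_gp = start_poll_synchronize_rcu();
	}

	/* Allocation/recycle side (placeholder hook name). */
	static void
	xfs_iget_recycle_wait_gp(
		struct xfs_inode	*ip)
	{
		/* wait only if that grace period has not already elapsed */
		cond_synchronize_rcu(ip->i_destroy_gp);
	}

With the series Paul posts below, the same pair becomes start_poll_synchronize_rcu_expedited()/cond_synchronize_rcu_expedited(), which is why the port is expected to be trivial.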
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-01-31 13:22 ` Brian Foster @ 2022-02-01 22:00 ` Paul E. McKenney 2022-02-03 18:49 ` Paul E. McKenney 2022-02-07 13:30 ` Brian Foster 0 siblings, 2 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-02-01 22:00 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > where I mention those things to Paul. > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > patch is obviously fine. > > > > > > > > Yes, but not in the near term. > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > thread do look like that. > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > is an extent that has been freed by the user, but is not yet marked > > > > free in the journal/on disk. If we try to reallocate that busy > > > > extent, we either select a different free extent to allocate, or if > > > > we can't find any we force the journal to disk, wait for it to > > > > complete (hence unbusying the extents) and retry the allocation > > > > again. > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > a different inode. If we can't allocate a new inode, then we kick > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > recycled this time. > > > > > > I'm starting to poke around this area since it's become clear that the > > > currently proposed scheme just involves too much latency (unless Paul > > > chimes in with his expedited grace period variant, at which point I will > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > have pretty much the same pattern of behavior as this patch: one > > > synchronize_rcu() per batch. > > > > Apologies for being slow, but there have been some distractions. 
> > One of the distractions was trying to put together atheoretically > > attractive but massively overcomplicated implementation of > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > somewhat suboptimal but much simpler approach is available. This > > assumes that XFS is not in the picture until after both the scheduler > > and workqueues are operational. > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > And yes, the complicated version might prove necessary, but let's > > see if this whole thing is even useful first. ;-) > > > > Indeed. This patch only really requires a single poll/sync pair of > calls, so assuming the expedited grace period usage plays nice enough > with typical !expedited usage elsewhere in the kernel for some basic > tests, it would be fairly trivial to port this over and at least get an > idea of what the worst case behavior might be with expedited grace > periods, whether it satisfies the existing latency requirements, etc. > > Brian > > > In the meantime, if you want to look at an extremely unbaked view, > > here you go: > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing And here is a version that passes moderate rcutorture testing. So no obvious bugs. Probably a few non-obvious ones, though! ;-) This commit is on -rcu's "dev" branch along with this rcutorture addition: cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") I will carry these in -rcu's "dev" branch until at least the upcoming merge window, fixing bugs as and when they becom apparent. If I don't hear otherwise by that time, I will create a tag for it and leave it behind. The backport to v5.17-rc2 just requires removing: mutex_init(&rnp->boost_kthread_mutex); From rcu_init_one(). This line is added by this -rcu commit: 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") Please let me know how it goes! Thanx, Paul ------------------------------------------------------------------------ commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d Author: Paul E. McKenney <paulmck@kernel.org> Date: Mon Jan 31 16:55:52 2022 -0800 EXP rcu: Add polled expedited grace-period primitives This is an experimental proof of concept of polled expedited grace-period functions. These functions are get_state_synchronize_rcu_expedited(), start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), and cond_synchronize_rcu_expedited(), which are similar to get_state_synchronize_rcu(), start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. One limitation is that start_poll_synchronize_rcu_expedited() cannot be invoked before workqueues are initialized. Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. 
McKenney <paulmck@kernel.org> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 858f4d429946d..ca139b4b2d25f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) might_sleep(); } +static inline unsigned long get_state_synchronize_rcu_expedited(void) +{ + return get_state_synchronize_rcu(); +} + +static inline unsigned long start_poll_synchronize_rcu_expedited(void) +{ + return start_poll_synchronize_rcu(); +} + +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + return poll_state_synchronize_rcu(oldstate); +} + +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + cond_synchronize_rcu(oldstate); +} + extern void rcu_barrier(void); static inline void synchronize_rcu_expedited(void) diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 76665db179fa1..eb774e9be21bf 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); void rcu_momentary_dyntick_idle(void); void kfree_rcu_scheduler_running(void); bool rcu_gp_might_be_stalled(void); +unsigned long get_state_synchronize_rcu_expedited(void); +unsigned long start_poll_synchronize_rcu_expedited(void); +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); +void cond_synchronize_rcu_expedited(unsigned long oldstate); unsigned long get_state_synchronize_rcu(void); unsigned long start_poll_synchronize_rcu(void); bool poll_state_synchronize_rcu(unsigned long oldstate); diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 24b5f2c2de87b..5b61cf20c91e9 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -23,6 +23,13 @@ #define RCU_SEQ_CTR_SHIFT 2 #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) +/* + * Low-order bit definitions for polled grace-period APIs. + */ +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 +#define RCU_GET_STATE_USE_NORMAL 0x2 +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) + /* * Return the counter portion of a sequence number previously returned * by rcu_seq_snap() or rcu_seq_current(). diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e6ad532cffe78..5de36abcd7da1 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); */ bool poll_state_synchronize_rcu(unsigned long oldstate) { - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ return true; } @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); */ void cond_synchronize_rcu(unsigned long oldstate) { - if (!poll_state_synchronize_rcu(oldstate)) + if (!poll_state_synchronize_rcu(oldstate) && + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) synchronize_rcu(); } EXPORT_SYMBOL_GPL(cond_synchronize_rcu); @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) init_waitqueue_head(&rnp->exp_wq[3]); spin_lock_init(&rnp->exp_lock); mutex_init(&rnp->boost_kthread_mutex); + raw_spin_lock_init(&rnp->exp_poll_lock); + rnp->exp_seq_poll_rq = 0x1; + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); } } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 926673ebe355f..19fc9acce3ce2 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -128,6 +128,10 @@ struct rcu_node { wait_queue_head_t exp_wq[4]; struct rcu_exp_work rew; bool exp_need_flush; /* Need to flush workitem? */ + raw_spinlock_t exp_poll_lock; + /* Lock and data for polled expedited grace periods. */ + unsigned long exp_seq_poll_rq; + struct work_struct exp_poll_wq; } ____cacheline_internodealigned_in_smp; /* @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); static void check_cpu_stall(struct rcu_data *rdp); static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, const unsigned long gpssdelay); + +/* Forward declarations for tree_exp.h. */ +static void sync_rcu_do_polled_gp(struct work_struct *wp); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 1a45667402260..728896f374fee 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) destroy_work_on_stack(&rew.rew_work); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); + +/** + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + */ +unsigned long get_state_synchronize_rcu_expedited(void) +{ + if (rcu_gp_is_normal()) + return get_state_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + // Any prior manipulation of RCU-protected data must happen + // before the load from ->expedited_sequence. + smp_mb(); /* ^^^ */ + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); + +/* + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited + * RCU grace periods that it needs. 
+ */ +static void sync_rcu_do_polled_gp(struct work_struct *wp) +{ + unsigned long flags; + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); + unsigned long s; + + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + if (s & 0x1) + return; + while (!sync_exp_work_done(s)) + synchronize_rcu_expedited(); + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + if (!(s & 0x1) && !sync_exp_work_done(s)) + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + else + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); +} + +/** + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + * If the needed grace period is not already slated to start, initiates + * that grace period. + */ + +unsigned long start_poll_synchronize_rcu_expedited(void) +{ + unsigned long flags; + struct rcu_data *rdp; + struct rcu_node *rnp; + unsigned long s; + + if (rcu_gp_is_normal()) + return start_poll_synchronize_rcu_expedited() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + s = rcu_exp_gp_seq_snap(); + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); + rnp = rdp->mynode; + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { + rnp->exp_seq_poll_rq = s; + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + } + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + + return s | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); + +/** + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier call + * from which oldstate was obtained, return @true, otherwise return @false. + * If @false is returned, it is the caller's responsibility to invoke + * this function later on until it does return @true. Alternatively, + * the caller can explicitly wait for a grace period, for example, by + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly + * invoking synchronize_rcu_expedited(). + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!). + * Those needing to keep oldstate values for very long time periods + * (several hours even on 32-bit systems) should check them occasionally + * and either refresh them or set a flag indicating that the grace period + * has completed. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. 
+ */ +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (oldstate & RCU_GET_STATE_USE_NORMAL) + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) + return false; + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); + +/** + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier + * call from which oldstate was obtained, just return. Otherwise, invoke + * synchronize_rcu_expedited() to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. But + * counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!), + * so waiting for one additional grace period should be just fine. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. + */ +void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (poll_state_synchronize_rcu_expedited(oldstate)) + return; + if (oldstate & RCU_GET_STATE_USE_NORMAL) + synchronize_rcu_expedited(); + else + synchronize_rcu(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); ^ permalink raw reply related [flat|nested] 36+ messages in thread
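As a usage illustration of the four primitives added above (independent of XFS, and assuming only the semantics given in the kerneldoc comments), the caller-side pattern is the familiar cookie round trip:

	unsigned long cookie;

	/* Snapshot expedited grace-period state; start one if needed. */
	cookie = start_poll_synchronize_rcu_expedited();

	/* ... retire the object; do other useful work ... */

	/* Later, before reusing the object: */
	if (!poll_state_synchronize_rcu_expedited(cookie)) {
		/* Not expired yet: block for the remainder. */
		cond_synchronize_rcu_expedited(cookie);
	}

On kernels where expedited grace periods are disabled (rcupdate.rcu_normal=1), the cookie carries the RCU_GET_STATE_USE_NORMAL bit and the same calls fall back to ordinary grace periods, per the rcu_gp_is_normal() branches in the patch.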
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-01 22:00 ` Paul E. McKenney @ 2022-02-03 18:49 ` Paul E. McKenney 2022-02-07 13:30 ` Brian Foster 1 sibling, 0 replies; 36+ messages in thread From: Paul E. McKenney @ 2022-02-03 18:49 UTC (permalink / raw) To: Brian Foster Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu, quic_neeraju On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote: > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > patch is obviously fine. > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > thread do look like that. > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > extent, we either select a different free extent to allocate, or if > > > > > we can't find any we force the journal to disk, wait for it to > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > again. > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > recycled this time. > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > currently proposed scheme just involves too much latency (unless Paul > > > > chimes in with his expedited grace period variant, at which point I will > > > > revisit) in the fast allocation/recycle path. 
ISTM so far that a simple > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > have pretty much the same pattern of behavior as this patch: one > > > > synchronize_rcu() per batch. > > > > > > Apologies for being slow, but there have been some distractions. > > > One of the distractions was trying to put together atheoretically > > > attractive but massively overcomplicated implementation of > > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > > somewhat suboptimal but much simpler approach is available. This > > > assumes that XFS is not in the picture until after both the scheduler > > > and workqueues are operational. > > > > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > > > And yes, the complicated version might prove necessary, but let's > > > see if this whole thing is even useful first. ;-) > > > > > > > Indeed. This patch only really requires a single poll/sync pair of > > calls, so assuming the expedited grace period usage plays nice enough > > with typical !expedited usage elsewhere in the kernel for some basic > > tests, it would be fairly trivial to port this over and at least get an > > idea of what the worst case behavior might be with expedited grace > > periods, whether it satisfies the existing latency requirements, etc. > > > > Brian > > > > > In the meantime, if you want to look at an extremely unbaked view, > > > here you go: > > > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > And here is a version that passes moderate rcutorture testing. So no > obvious bugs. Probably a few non-obvious ones, though! ;-) > > This commit is on -rcu's "dev" branch along with this rcutorture > addition: > > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") > > I will carry these in -rcu's "dev" branch until at least the upcoming > merge window, fixing bugs as and when they becom apparent. If I don't > hear otherwise by that time, I will create a tag for it and leave > it behind. > > The backport to v5.17-rc2 just requires removing: > > mutex_init(&rnp->boost_kthread_mutex); > > From rcu_init_one(). This line is added by this -rcu commit: > > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") And with some alleged fixes of issues Neeraj found when reviewing this, perhaps most notably the ability to run on real-time kernels booted with rcupdate.rcu_normal=1. This version passes reasonably heavy-duty rcutorture testing. Must mean bugs in rcutorture... :-/ f93fa07011bd ("EXP rcu: Add polled expedited grace-period primitives") Again, please let me know how it goes! Thanx, Paul ------------------------------------------------------------------------ commit f93fa07011bd2460f222e570d17968baff21fa90 Author: Paul E. McKenney <paulmck@kernel.org> Date: Mon Jan 31 16:55:52 2022 -0800 EXP rcu: Add polled expedited grace-period primitives This is an experimental proof of concept of polled expedited grace-period functions. These functions are get_state_synchronize_rcu_expedited(), start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), and cond_synchronize_rcu_expedited(), which are similar to get_state_synchronize_rcu(), start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. One limitation is that start_poll_synchronize_rcu_expedited() cannot be invoked before workqueues are initialized. 
Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 858f4d429946d..ca139b4b2d25f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) might_sleep(); } +static inline unsigned long get_state_synchronize_rcu_expedited(void) +{ + return get_state_synchronize_rcu(); +} + +static inline unsigned long start_poll_synchronize_rcu_expedited(void) +{ + return start_poll_synchronize_rcu(); +} + +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + return poll_state_synchronize_rcu(oldstate); +} + +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + cond_synchronize_rcu(oldstate); +} + extern void rcu_barrier(void); static inline void synchronize_rcu_expedited(void) diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 76665db179fa1..eb774e9be21bf 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); void rcu_momentary_dyntick_idle(void); void kfree_rcu_scheduler_running(void); bool rcu_gp_might_be_stalled(void); +unsigned long get_state_synchronize_rcu_expedited(void); +unsigned long start_poll_synchronize_rcu_expedited(void); +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); +void cond_synchronize_rcu_expedited(unsigned long oldstate); unsigned long get_state_synchronize_rcu(void); unsigned long start_poll_synchronize_rcu(void); bool poll_state_synchronize_rcu(unsigned long oldstate); diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 24b5f2c2de87b..5b61cf20c91e9 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -23,6 +23,13 @@ #define RCU_SEQ_CTR_SHIFT 2 #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) +/* + * Low-order bit definitions for polled grace-period APIs. + */ +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 +#define RCU_GET_STATE_USE_NORMAL 0x2 +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) + /* * Return the counter portion of a sequence number previously returned * by rcu_seq_snap() or rcu_seq_current(). diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e6ad532cffe78..135d5e2bce879 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); */ bool poll_state_synchronize_rcu(unsigned long oldstate) { - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ return true; } @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); */ void cond_synchronize_rcu(unsigned long oldstate) { - if (!poll_state_synchronize_rcu(oldstate)) + if (!poll_state_synchronize_rcu(oldstate) || + WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) synchronize_rcu(); } EXPORT_SYMBOL_GPL(cond_synchronize_rcu); @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) init_waitqueue_head(&rnp->exp_wq[3]); spin_lock_init(&rnp->exp_lock); mutex_init(&rnp->boost_kthread_mutex); + raw_spin_lock_init(&rnp->exp_poll_lock); + rnp->exp_seq_poll_rq = 0x1; + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); } } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 926673ebe355f..19fc9acce3ce2 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -128,6 +128,10 @@ struct rcu_node { wait_queue_head_t exp_wq[4]; struct rcu_exp_work rew; bool exp_need_flush; /* Need to flush workitem? */ + raw_spinlock_t exp_poll_lock; + /* Lock and data for polled expedited grace periods. */ + unsigned long exp_seq_poll_rq; + struct work_struct exp_poll_wq; } ____cacheline_internodealigned_in_smp; /* @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); static void check_cpu_stall(struct rcu_data *rdp); static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, const unsigned long gpssdelay); + +/* Forward declarations for tree_exp.h. */ +static void sync_rcu_do_polled_gp(struct work_struct *wp); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 1a45667402260..4041988086830 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) destroy_work_on_stack(&rew.rew_work); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); + +/** + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + */ +unsigned long get_state_synchronize_rcu_expedited(void) +{ + if (rcu_gp_is_normal()) + return get_state_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + // Any prior manipulation of RCU-protected data must happen + // before the load from ->expedited_sequence, and this ordering is + // provided by rcu_exp_gp_seq_snap(). + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); + +/* + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited + * RCU grace periods that it needs. 
+ */ +static void sync_rcu_do_polled_gp(struct work_struct *wp) +{ + unsigned long flags; + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); + unsigned long s; + + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + if (s & 0x1) + return; + while (!sync_exp_work_done(s)) + synchronize_rcu_expedited(); + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + if (!(s & 0x1) && !sync_exp_work_done(s)) + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + else + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); +} + +/** + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + * If the needed grace period is not already slated to start, initiates + * that grace period. + */ + +unsigned long start_poll_synchronize_rcu_expedited(void) +{ + unsigned long flags; + struct rcu_data *rdp; + struct rcu_node *rnp; + unsigned long s; + + if (rcu_gp_is_normal()) + return start_poll_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + s = rcu_exp_gp_seq_snap(); + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); + rnp = rdp->mynode; + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { + rnp->exp_seq_poll_rq = s; + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + } + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + + return s | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); + +/** + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier call + * from which oldstate was obtained, return @true, otherwise return @false. + * If @false is returned, it is the caller's responsibility to invoke + * this function later on until it does return @true. Alternatively, + * the caller can explicitly wait for a grace period, for example, by + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly + * invoking synchronize_rcu_expedited(). + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!). + * Those needing to keep oldstate values for very long time periods + * (several hours even on 32-bit systems) should check them occasionally + * and either refresh them or set a flag indicating that the grace period + * has completed. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. 
+ */ +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (oldstate & RCU_GET_STATE_USE_NORMAL) + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) + return false; + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); + +/** + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier + * call from which oldstate was obtained, just return. Otherwise, invoke + * synchronize_rcu_expedited() to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. But + * counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!), + * so waiting for one additional grace period should be just fine. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. + */ +void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (poll_state_synchronize_rcu_expedited(oldstate)) + return; + if (oldstate & RCU_GET_STATE_USE_NORMAL) + synchronize_rcu(); + else + synchronize_rcu_expedited(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); ^ permalink raw reply related [flat|nested] 36+ messages in thread
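For reference, here is a minimal usage sketch of the four interfaces added by the patch above, assuming that patch is applied. Only the *_synchronize_rcu_expedited() calls come from the patch; the object, field, and helper names below are invented purely for illustration.

#include <linux/rcupdate.h>	/* declarations come via the rcutiny.h/rcutree.h hunks above */

/* Hypothetical object that gets torn down and later recycled. */
struct recycle_obj {
	unsigned long	gp_cookie;	/* cookie from start_poll_synchronize_rcu_expedited() */
	/* ... payload ... */
};

static void recycle_obj_teardown(struct recycle_obj *obj)
{
	/* Snapshot expedited GP state and nudge a grace period along. */
	obj->gp_cookie = start_poll_synchronize_rcu_expedited();
	/* ... queue obj on a deferred-free/reuse list (not shown) ... */
}

static void recycle_obj_reuse(struct recycle_obj *obj)
{
	/*
	 * If the grace period started at teardown has already elapsed,
	 * this returns immediately; otherwise it waits for an expedited
	 * (or, with rcupdate.rcu_normal=1, a normal) grace period.
	 */
	cond_synchronize_rcu_expedited(obj->gp_cookie);
	/* ... now safe to reinitialize obj without racing RCU readers ... */
}

The same shape works with the pre-existing get_state/start_poll/poll_state/cond_synchronize_rcu() calls; the expedited variants only shorten the wait taken on the slow path.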
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-01 22:00 ` Paul E. McKenney 2022-02-03 18:49 ` Paul E. McKenney @ 2022-02-07 13:30 ` Brian Foster 2022-02-07 16:36 ` Paul E. McKenney 1 sibling, 1 reply; 36+ messages in thread From: Brian Foster @ 2022-02-07 13:30 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote: > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > patch is obviously fine. > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > thread do look like that. > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > extent, we either select a different free extent to allocate, or if > > > > > we can't find any we force the journal to disk, wait for it to > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > again. > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > recycled this time. > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > currently proposed scheme just involves too much latency (unless Paul > > > > chimes in with his expedited grace period variant, at which point I will > > > > revisit) in the fast allocation/recycle path. 
ISTM so far that a simple > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > have pretty much the same pattern of behavior as this patch: one > > > > synchronize_rcu() per batch. > > > > > > Apologies for being slow, but there have been some distractions. > > > One of the distractions was trying to put together atheoretically > > > attractive but massively overcomplicated implementation of > > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > > somewhat suboptimal but much simpler approach is available. This > > > assumes that XFS is not in the picture until after both the scheduler > > > and workqueues are operational. > > > > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > > > And yes, the complicated version might prove necessary, but let's > > > see if this whole thing is even useful first. ;-) > > > > > > > Indeed. This patch only really requires a single poll/sync pair of > > calls, so assuming the expedited grace period usage plays nice enough > > with typical !expedited usage elsewhere in the kernel for some basic > > tests, it would be fairly trivial to port this over and at least get an > > idea of what the worst case behavior might be with expedited grace > > periods, whether it satisfies the existing latency requirements, etc. > > > > Brian > > > > > In the meantime, if you want to look at an extremely unbaked view, > > > here you go: > > > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > And here is a version that passes moderate rcutorture testing. So no > obvious bugs. Probably a few non-obvious ones, though! ;-) > > This commit is on -rcu's "dev" branch along with this rcutorture > addition: > > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") > > I will carry these in -rcu's "dev" branch until at least the upcoming > merge window, fixing bugs as and when they becom apparent. If I don't > hear otherwise by that time, I will create a tag for it and leave > it behind. > > The backport to v5.17-rc2 just requires removing: > > mutex_init(&rnp->boost_kthread_mutex); > > From rcu_init_one(). This line is added by this -rcu commit: > > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") > > Please let me know how it goes! > Thanks Paul. I gave this a whirl with a ported variant of this patch on top. There is definitely a notable improvement with the expedited grace periods. A few quick runs of the same batched alloc/free test (i.e. 10 sample) I had run against the original version: batch baseline baseline+bg test test+bg 1 889954 210075 552911 25540 4 879540 212740 575356 24624 8 924928 213568 496992 26080 16 922960 211504 518496 24592 32 844832 219744 524672 28608 64 579968 196544 358720 24128 128 667392 195840 397696 22400 256 624896 197888 376320 31232 512 572928 204800 382464 46080 1024 549888 174080 379904 73728 2048 522240 174080 350208 106496 4096 536576 167936 360448 131072 So this shows a major improvement in the case where the system is otherwise idle. We still aren't quite at the baseline numbers, but that's not really the goal here because those numbers are partly driven by the fact that we unsafely reuse recently freed inodes in cases where proper behavior would be to allocate new inode chunks for a period of time. The core test numbers are much closer to the single threaded allocation rate (55k-65k inodes/sec) on this setup, so that is quite positive. 
The "bg" variants are the same tests with 64 tasks doing unrelated pathwalk listings on a kernel source tree (on separate storage) concurrently in the background. The purpose of this was just to generate background (rcu) activity in the form of pathname lookups and whatnot and see how that impacts the results. This clearly affects both kernels, but the test kernel drops down closer to numbers reminiscent of the non-expedited grace period variant. Note that this impact seems to scale with increased background workload. With a similar test running only 8 background tasks, the test kernel is pretty consistently in the 225k-250k (per 10s) range across the set of batch sizes. That's about half the core test rate, so still not as terrible as the original variant. ;) In any event, this probably requires some thought/discussion (and more testing) on whether this is considered an acceptable change or whether we want to explore options to mitigate this further. I am still playing with some ideas to potentially mitigate grace period latency, so it might be worth seeing if anything useful falls out of that as well. Thoughts appreciated... Brian > Thanx, Paul > > ------------------------------------------------------------------------ > > commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d > Author: Paul E. McKenney <paulmck@kernel.org> > Date: Mon Jan 31 16:55:52 2022 -0800 > > EXP rcu: Add polled expedited grace-period primitives > > This is an experimental proof of concept of polled expedited grace-period > functions. These functions are get_state_synchronize_rcu_expedited(), > start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), > and cond_synchronize_rcu_expedited(), which are similar to > get_state_synchronize_rcu(), start_poll_synchronize_rcu(), > poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. > > One limitation is that start_poll_synchronize_rcu_expedited() cannot > be invoked before workqueues are initialized. > > Cc: Brian Foster <bfoster@redhat.com> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Ian Kent <raven@themaw.net> > Signed-off-by: Paul E. 
McKenney <paulmck@kernel.org> > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h > index 858f4d429946d..ca139b4b2d25f 100644 > --- a/include/linux/rcutiny.h > +++ b/include/linux/rcutiny.h > @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) > might_sleep(); > } > > +static inline unsigned long get_state_synchronize_rcu_expedited(void) > +{ > + return get_state_synchronize_rcu(); > +} > + > +static inline unsigned long start_poll_synchronize_rcu_expedited(void) > +{ > + return start_poll_synchronize_rcu(); > +} > + > +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + return poll_state_synchronize_rcu(oldstate); > +} > + > +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + cond_synchronize_rcu(oldstate); > +} > + > extern void rcu_barrier(void); > > static inline void synchronize_rcu_expedited(void) > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h > index 76665db179fa1..eb774e9be21bf 100644 > --- a/include/linux/rcutree.h > +++ b/include/linux/rcutree.h > @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); > void rcu_momentary_dyntick_idle(void); > void kfree_rcu_scheduler_running(void); > bool rcu_gp_might_be_stalled(void); > +unsigned long get_state_synchronize_rcu_expedited(void); > +unsigned long start_poll_synchronize_rcu_expedited(void); > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); > +void cond_synchronize_rcu_expedited(unsigned long oldstate); > unsigned long get_state_synchronize_rcu(void); > unsigned long start_poll_synchronize_rcu(void); > bool poll_state_synchronize_rcu(unsigned long oldstate); > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 24b5f2c2de87b..5b61cf20c91e9 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -23,6 +23,13 @@ > #define RCU_SEQ_CTR_SHIFT 2 > #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) > > +/* > + * Low-order bit definitions for polled grace-period APIs. > + */ > +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 > +#define RCU_GET_STATE_USE_NORMAL 0x2 > +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) > + > /* > * Return the counter portion of a sequence number previously returned > * by rcu_seq_snap() or rcu_seq_current(). > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index e6ad532cffe78..5de36abcd7da1 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); > */ > bool poll_state_synchronize_rcu(unsigned long oldstate) > { > - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { > + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { > smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ > return true; > } > @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); > */ > void cond_synchronize_rcu(unsigned long oldstate) > { > - if (!poll_state_synchronize_rcu(oldstate)) > + if (!poll_state_synchronize_rcu(oldstate) && > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) > synchronize_rcu(); > } > EXPORT_SYMBOL_GPL(cond_synchronize_rcu); > @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) > init_waitqueue_head(&rnp->exp_wq[3]); > spin_lock_init(&rnp->exp_lock); > mutex_init(&rnp->boost_kthread_mutex); > + raw_spin_lock_init(&rnp->exp_poll_lock); > + rnp->exp_seq_poll_rq = 0x1; > + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); > } > } > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index 926673ebe355f..19fc9acce3ce2 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -128,6 +128,10 @@ struct rcu_node { > wait_queue_head_t exp_wq[4]; > struct rcu_exp_work rew; > bool exp_need_flush; /* Need to flush workitem? */ > + raw_spinlock_t exp_poll_lock; > + /* Lock and data for polled expedited grace periods. */ > + unsigned long exp_seq_poll_rq; > + struct work_struct exp_poll_wq; > } ____cacheline_internodealigned_in_smp; > > /* > @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); > static void check_cpu_stall(struct rcu_data *rdp); > static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, > const unsigned long gpssdelay); > + > +/* Forward declarations for tree_exp.h. */ > +static void sync_rcu_do_polled_gp(struct work_struct *wp); > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 1a45667402260..728896f374fee 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) > destroy_work_on_stack(&rew.rew_work); > } > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); > + > +/** > + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state > + * > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > + * whether or not a full expedited grace period has elapsed in the meantime. > + */ > +unsigned long get_state_synchronize_rcu_expedited(void) > +{ > + if (rcu_gp_is_normal()) > + return get_state_synchronize_rcu() | > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > + > + // Any prior manipulation of RCU-protected data must happen > + // before the load from ->expedited_sequence. > + smp_mb(); /* ^^^ */ > + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; > +} > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); > + > +/* > + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited > + * RCU grace periods that it needs. 
> + */ > +static void sync_rcu_do_polled_gp(struct work_struct *wp) > +{ > + unsigned long flags; > + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); > + unsigned long s; > + > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > + s = rnp->exp_seq_poll_rq; > + rnp->exp_seq_poll_rq |= 0x1; > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > + if (s & 0x1) > + return; > + while (!sync_exp_work_done(s)) > + synchronize_rcu_expedited(); > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > + s = rnp->exp_seq_poll_rq; > + if (!(s & 0x1) && !sync_exp_work_done(s)) > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > + else > + rnp->exp_seq_poll_rq |= 0x1; > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > +} > + > +/** > + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period > + * > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > + * whether or not a full expedited grace period has elapsed in the meantime. > + * If the needed grace period is not already slated to start, initiates > + * that grace period. > + */ > + > +unsigned long start_poll_synchronize_rcu_expedited(void) > +{ > + unsigned long flags; > + struct rcu_data *rdp; > + struct rcu_node *rnp; > + unsigned long s; > + > + if (rcu_gp_is_normal()) > + return start_poll_synchronize_rcu_expedited() | > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > + > + s = rcu_exp_gp_seq_snap(); > + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); > + rnp = rdp->mynode; > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { > + rnp->exp_seq_poll_rq = s; > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > + } > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > + > + return s | RCU_GET_STATE_FROM_EXPEDITED; > +} > +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); > + > +/** > + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > + * > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > + * > + * If a full expedited RCU grace period has elapsed since the earlier call > + * from which oldstate was obtained, return @true, otherwise return @false. > + * If @false is returned, it is the caller's responsibility to invoke > + * this function later on until it does return @true. Alternatively, > + * the caller can explicitly wait for a grace period, for example, by > + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly > + * invoking synchronize_rcu_expedited(). > + * > + * Yes, this function does not take counter wrap into account. > + * But counter wrap is harmless. If the counter wraps, we have waited for > + * more than 2 billion grace periods (and way more on a 64-bit system!). > + * Those needing to keep oldstate values for very long time periods > + * (several hours even on 32-bit systems) should check them occasionally > + * and either refresh them or set a flag indicating that the grace period > + * has completed. > + * > + * This function provides the same memory-ordering guarantees that would > + * be provided by a synchronize_rcu_expedited() that was invoked at the > + * call to the function that provided @oldstate, and that returned at the > + * end of this function. 
> + */ > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); > + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) > + return false; > + smp_mb(); /* Ensure GP ends before subsequent accesses. */ > + return true; > +} > +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); > + > +/** > + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > + * > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > + * > + * If a full expedited RCU grace period has elapsed since the earlier > + * call from which oldstate was obtained, just return. Otherwise, invoke > + * synchronize_rcu_expedited() to wait for a full grace period. > + * > + * Yes, this function does not take counter wrap into account. But > + * counter wrap is harmless. If the counter wraps, we have waited for > + * more than 2 billion grace periods (and way more on a 64-bit system!), > + * so waiting for one additional grace period should be just fine. > + * > + * This function provides the same memory-ordering guarantees that would > + * be provided by a synchronize_rcu_expedited() that was invoked at the > + * call to the function that provided @oldstate, and that returned at the > + * end of this function. > + */ > +void cond_synchronize_rcu_expedited(unsigned long oldstate) > +{ > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > + if (poll_state_synchronize_rcu_expedited(oldstate)) > + return; > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > + synchronize_rcu_expedited(); > + else > + synchronize_rcu(); > +} > +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); > ^ permalink raw reply [flat|nested] 36+ messages in thread
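As a rough illustration of the "skip inodes in the radix tree, sync rcu if unsuccessful" idea referred to above, the allocation loop might take roughly the following shape. Every type and helper here is a hypothetical placeholder; only the overall pattern -- prefer candidates whose grace period has passed, fall back to one synchronize_rcu() per batch -- is taken from the thread.

#include <linux/errno.h>
#include <linux/rcupdate.h>

/* Hypothetical allocation context and candidate-selection helpers. */
struct alloc_ctx;
bool pick_non_busy_candidate(struct alloc_ctx *ctx, unsigned long *inop);
bool any_candidates(struct alloc_ctx *ctx);

static int alloc_inode_number(struct alloc_ctx *ctx, unsigned long *inop)
{
	for (;;) {
		/* Prefer a candidate that is not still "busy" under RCU. */
		if (pick_non_busy_candidate(ctx, inop))
			return 0;

		/* Nothing free at all, busy or otherwise. */
		if (!any_candidates(ctx))
			return -ENOSPC;

		/*
		 * Only busy (recently freed, grace period still pending)
		 * candidates remain: wait out a grace period so they become
		 * safe to recycle, then retry.  This works out to roughly
		 * one synchronize_rcu() per allocation batch.
		 */
		synchronize_rcu();
	}
}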
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-07 13:30 ` Brian Foster @ 2022-02-07 16:36 ` Paul E. McKenney 2022-02-10 4:09 ` Dave Chinner 0 siblings, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-02-07 16:36 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote: > > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > > patch is obviously fine. > > > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > > thread do look like that. > > > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > > extent, we either select a different free extent to allocate, or if > > > > > > we can't find any we force the journal to disk, wait for it to > > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > > again. > > > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > > recycled this time. > > > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > > currently proposed scheme just involves too much latency (unless Paul > > > > > chimes in with his expedited grace period variant, at which point I will > > > > > revisit) in the fast allocation/recycle path. 
ISTM so far that a simple > > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > > have pretty much the same pattern of behavior as this patch: one > > > > > synchronize_rcu() per batch. > > > > > > > > Apologies for being slow, but there have been some distractions. > > > > One of the distractions was trying to put together atheoretically > > > > attractive but massively overcomplicated implementation of > > > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > > > somewhat suboptimal but much simpler approach is available. This > > > > assumes that XFS is not in the picture until after both the scheduler > > > > and workqueues are operational. > > > > > > > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > > > > > And yes, the complicated version might prove necessary, but let's > > > > see if this whole thing is even useful first. ;-) > > > > > > > > > > Indeed. This patch only really requires a single poll/sync pair of > > > calls, so assuming the expedited grace period usage plays nice enough > > > with typical !expedited usage elsewhere in the kernel for some basic > > > tests, it would be fairly trivial to port this over and at least get an > > > idea of what the worst case behavior might be with expedited grace > > > periods, whether it satisfies the existing latency requirements, etc. > > > > > > Brian > > > > > > > In the meantime, if you want to look at an extremely unbaked view, > > > > here you go: > > > > > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > > > And here is a version that passes moderate rcutorture testing. So no > > obvious bugs. Probably a few non-obvious ones, though! ;-) > > > > This commit is on -rcu's "dev" branch along with this rcutorture > > addition: > > > > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") > > > > I will carry these in -rcu's "dev" branch until at least the upcoming > > merge window, fixing bugs as and when they becom apparent. If I don't > > hear otherwise by that time, I will create a tag for it and leave > > it behind. > > > > The backport to v5.17-rc2 just requires removing: > > > > mutex_init(&rnp->boost_kthread_mutex); > > > > From rcu_init_one(). This line is added by this -rcu commit: > > > > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") > > > > Please let me know how it goes! > > > > Thanks Paul. I gave this a whirl with a ported variant of this patch on > top. There is definitely a notable improvement with the expedited grace > periods. A few quick runs of the same batched alloc/free test (i.e. 10 > sample) I had run against the original version: > > batch baseline baseline+bg test test+bg > > 1 889954 210075 552911 25540 > 4 879540 212740 575356 24624 > 8 924928 213568 496992 26080 > 16 922960 211504 518496 24592 > 32 844832 219744 524672 28608 > 64 579968 196544 358720 24128 > 128 667392 195840 397696 22400 > 256 624896 197888 376320 31232 > 512 572928 204800 382464 46080 > 1024 549888 174080 379904 73728 > 2048 522240 174080 350208 106496 > 4096 536576 167936 360448 131072 > > So this shows a major improvement in the case where the system is > otherwise idle. 
We still aren't quite at the baseline numbers, but > that's not really the goal here because those numbers are partly driven > by the fact that we unsafely reuse recently freed inodes in cases where > proper behavior would be to allocate new inode chunks for a period of > time. The core test numbers are much closer to the single threaded > allocation rate (55k-65k inodes/sec) on this setup, so that is quite > positive. > > The "bg" variants are the same tests with 64 tasks doing unrelated > pathwalk listings on a kernel source tree (on separate storage) > concurrently in the background. The purpose of this was just to generate > background (rcu) activity in the form of pathname lookups and whatnot > and see how that impacts the results. This clearly affects both kernels, > but the test kernel drops down closer to numbers reminiscent of the > non-expedited grace period variant. Note that this impact seems to scale > with increased background workload. With a similar test running only 8 > background tasks, the test kernel is pretty consistently in the > 225k-250k (per 10s) range across the set of batch sizes. That's about > half the core test rate, so still not as terrible as the original > variant. ;) > > In any event, this probably requires some thought/discussion (and more > testing) on whether this is considered an acceptable change or whether > we want to explore options to mitigate this further. I am still playing > with some ideas to potentially mitigate grace period latency, so it > might be worth seeing if anything useful falls out of that as well. > Thoughts appreciated... So this fixes a bug, but results in many 10s of percent performance degradation? Ouch... Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate reuse of freed memory, but also requires pointer traversals to the memory to do a revalidation operation. (Sorry, no free lunch here!) Thanx, Paul > Brian > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d > > Author: Paul E. McKenney <paulmck@kernel.org> > > Date: Mon Jan 31 16:55:52 2022 -0800 > > > > EXP rcu: Add polled expedited grace-period primitives > > > > This is an experimental proof of concept of polled expedited grace-period > > functions. These functions are get_state_synchronize_rcu_expedited(), > > start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), > > and cond_synchronize_rcu_expedited(), which are similar to > > get_state_synchronize_rcu(), start_poll_synchronize_rcu(), > > poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. > > > > One limitation is that start_poll_synchronize_rcu_expedited() cannot > > be invoked before workqueues are initialized. > > > > Cc: Brian Foster <bfoster@redhat.com> > > Cc: Dave Chinner <david@fromorbit.com> > > Cc: Al Viro <viro@zeniv.linux.org.uk> > > Cc: Ian Kent <raven@themaw.net> > > Signed-off-by: Paul E. 
McKenney <paulmck@kernel.org> > > > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h > > index 858f4d429946d..ca139b4b2d25f 100644 > > --- a/include/linux/rcutiny.h > > +++ b/include/linux/rcutiny.h > > @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) > > might_sleep(); > > } > > > > +static inline unsigned long get_state_synchronize_rcu_expedited(void) > > +{ > > + return get_state_synchronize_rcu(); > > +} > > + > > +static inline unsigned long start_poll_synchronize_rcu_expedited(void) > > +{ > > + return start_poll_synchronize_rcu(); > > +} > > + > > +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + return poll_state_synchronize_rcu(oldstate); > > +} > > + > > +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + cond_synchronize_rcu(oldstate); > > +} > > + > > extern void rcu_barrier(void); > > > > static inline void synchronize_rcu_expedited(void) > > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h > > index 76665db179fa1..eb774e9be21bf 100644 > > --- a/include/linux/rcutree.h > > +++ b/include/linux/rcutree.h > > @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); > > void rcu_momentary_dyntick_idle(void); > > void kfree_rcu_scheduler_running(void); > > bool rcu_gp_might_be_stalled(void); > > +unsigned long get_state_synchronize_rcu_expedited(void); > > +unsigned long start_poll_synchronize_rcu_expedited(void); > > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); > > +void cond_synchronize_rcu_expedited(unsigned long oldstate); > > unsigned long get_state_synchronize_rcu(void); > > unsigned long start_poll_synchronize_rcu(void); > > bool poll_state_synchronize_rcu(unsigned long oldstate); > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > > index 24b5f2c2de87b..5b61cf20c91e9 100644 > > --- a/kernel/rcu/rcu.h > > +++ b/kernel/rcu/rcu.h > > @@ -23,6 +23,13 @@ > > #define RCU_SEQ_CTR_SHIFT 2 > > #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) > > > > +/* > > + * Low-order bit definitions for polled grace-period APIs. > > + */ > > +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 > > +#define RCU_GET_STATE_USE_NORMAL 0x2 > > +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) > > + > > /* > > * Return the counter portion of a sequence number previously returned > > * by rcu_seq_snap() or rcu_seq_current(). > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index e6ad532cffe78..5de36abcd7da1 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); > > */ > > bool poll_state_synchronize_rcu(unsigned long oldstate) > > { > > - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { > > + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && > > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { > > smp_mb(); /* Ensure GP ends before subsequent accesses. 
*/ > > return true; > > } > > @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); > > */ > > void cond_synchronize_rcu(unsigned long oldstate) > > { > > - if (!poll_state_synchronize_rcu(oldstate)) > > + if (!poll_state_synchronize_rcu(oldstate) && > > + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) > > synchronize_rcu(); > > } > > EXPORT_SYMBOL_GPL(cond_synchronize_rcu); > > @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) > > init_waitqueue_head(&rnp->exp_wq[3]); > > spin_lock_init(&rnp->exp_lock); > > mutex_init(&rnp->boost_kthread_mutex); > > + raw_spin_lock_init(&rnp->exp_poll_lock); > > + rnp->exp_seq_poll_rq = 0x1; > > + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); > > } > > } > > > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > > index 926673ebe355f..19fc9acce3ce2 100644 > > --- a/kernel/rcu/tree.h > > +++ b/kernel/rcu/tree.h > > @@ -128,6 +128,10 @@ struct rcu_node { > > wait_queue_head_t exp_wq[4]; > > struct rcu_exp_work rew; > > bool exp_need_flush; /* Need to flush workitem? */ > > + raw_spinlock_t exp_poll_lock; > > + /* Lock and data for polled expedited grace periods. */ > > + unsigned long exp_seq_poll_rq; > > + struct work_struct exp_poll_wq; > > } ____cacheline_internodealigned_in_smp; > > > > /* > > @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); > > static void check_cpu_stall(struct rcu_data *rdp); > > static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, > > const unsigned long gpssdelay); > > + > > +/* Forward declarations for tree_exp.h. */ > > +static void sync_rcu_do_polled_gp(struct work_struct *wp); > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 1a45667402260..728896f374fee 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) > > destroy_work_on_stack(&rew.rew_work); > > } > > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); > > + > > +/** > > + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state > > + * > > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > > + * whether or not a full expedited grace period has elapsed in the meantime. > > + */ > > +unsigned long get_state_synchronize_rcu_expedited(void) > > +{ > > + if (rcu_gp_is_normal()) > > + return get_state_synchronize_rcu() | > > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > > + > > + // Any prior manipulation of RCU-protected data must happen > > + // before the load from ->expedited_sequence. > > + smp_mb(); /* ^^^ */ > > + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; > > +} > > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); > > + > > +/* > > + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited > > + * RCU grace periods that it needs. 
> > + */ > > +static void sync_rcu_do_polled_gp(struct work_struct *wp) > > +{ > > + unsigned long flags; > > + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); > > + unsigned long s; > > + > > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > > + s = rnp->exp_seq_poll_rq; > > + rnp->exp_seq_poll_rq |= 0x1; > > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > > + if (s & 0x1) > > + return; > > + while (!sync_exp_work_done(s)) > > + synchronize_rcu_expedited(); > > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > > + s = rnp->exp_seq_poll_rq; > > + if (!(s & 0x1) && !sync_exp_work_done(s)) > > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > > + else > > + rnp->exp_seq_poll_rq |= 0x1; > > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > > +} > > + > > +/** > > + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period > > + * > > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() > > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine > > + * whether or not a full expedited grace period has elapsed in the meantime. > > + * If the needed grace period is not already slated to start, initiates > > + * that grace period. > > + */ > > + > > +unsigned long start_poll_synchronize_rcu_expedited(void) > > +{ > > + unsigned long flags; > > + struct rcu_data *rdp; > > + struct rcu_node *rnp; > > + unsigned long s; > > + > > + if (rcu_gp_is_normal()) > > + return start_poll_synchronize_rcu_expedited() | > > + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; > > + > > + s = rcu_exp_gp_seq_snap(); > > + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); > > + rnp = rdp->mynode; > > + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); > > + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { > > + rnp->exp_seq_poll_rq = s; > > + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); > > + } > > + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); > > + > > + return s | RCU_GET_STATE_FROM_EXPEDITED; > > +} > > +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); > > + > > +/** > > + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > > + * > > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > > + * > > + * If a full expedited RCU grace period has elapsed since the earlier call > > + * from which oldstate was obtained, return @true, otherwise return @false. > > + * If @false is returned, it is the caller's responsibility to invoke > > + * this function later on until it does return @true. Alternatively, > > + * the caller can explicitly wait for a grace period, for example, by > > + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly > > + * invoking synchronize_rcu_expedited(). > > + * > > + * Yes, this function does not take counter wrap into account. > > + * But counter wrap is harmless. If the counter wraps, we have waited for > > + * more than 2 billion grace periods (and way more on a 64-bit system!). > > + * Those needing to keep oldstate values for very long time periods > > + * (several hours even on 32-bit systems) should check them occasionally > > + * and either refresh them or set a flag indicating that the grace period > > + * has completed. 
> > + * > > + * This function provides the same memory-ordering guarantees that would > > + * be provided by a synchronize_rcu_expedited() that was invoked at the > > + * call to the function that provided @oldstate, and that returned at the > > + * end of this function. > > + */ > > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > > + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); > > + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) > > + return false; > > + smp_mb(); /* Ensure GP ends before subsequent accesses. */ > > + return true; > > +} > > +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); > > + > > +/** > > + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period > > + * > > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() > > + * > > + * If a full expedited RCU grace period has elapsed since the earlier > > + * call from which oldstate was obtained, just return. Otherwise, invoke > > + * synchronize_rcu_expedited() to wait for a full grace period. > > + * > > + * Yes, this function does not take counter wrap into account. But > > + * counter wrap is harmless. If the counter wraps, we have waited for > > + * more than 2 billion grace periods (and way more on a 64-bit system!), > > + * so waiting for one additional grace period should be just fine. > > + * > > + * This function provides the same memory-ordering guarantees that would > > + * be provided by a synchronize_rcu_expedited() that was invoked at the > > + * call to the function that provided @oldstate, and that returned at the > > + * end of this function. > > + */ > > +void cond_synchronize_rcu_expedited(unsigned long oldstate) > > +{ > > + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); > > + if (poll_state_synchronize_rcu_expedited(oldstate)) > > + return; > > + if (oldstate & RCU_GET_STATE_USE_NORMAL) > > + synchronize_rcu_expedited(); > > + else > > + synchronize_rcu(); > > +} > > +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); > > > ^ permalink raw reply [flat|nested] 36+ messages in thread
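To make the "revalidation operation" mentioned above concrete, here is a generic sketch of the lookup pattern SLAB_TYPESAFE_BY_RCU requires (the next reply explains why it does not fit the inode code). All names are invented; the point is only that an RCU reader may find memory that has already been recycled into a new object of the same type, so it must take a reference and then recheck identity.

#include <linux/atomic.h>
#include <linux/rcupdate.h>

/* Hypothetical cached object and index helpers. */
struct cached_obj {
	atomic_t	refcount;
	unsigned long	key;
	/* ... payload readers care about ... */
};
struct obj_index;
struct cached_obj *index_lookup(struct obj_index *idx, unsigned long key);
void put_obj(struct cached_obj *obj);

static struct cached_obj *lookup_obj(struct obj_index *idx, unsigned long key)
{
	struct cached_obj *obj;

	rcu_read_lock();
	obj = index_lookup(idx, key);
	/* The object may be mid-free; only proceed if a reference is obtained. */
	if (obj && !atomic_inc_not_zero(&obj->refcount))
		obj = NULL;
	rcu_read_unlock();

	/*
	 * With SLAB_TYPESAFE_BY_RCU the memory may already belong to a
	 * different object, so revalidate identity after taking the
	 * reference and bail out if it no longer matches.
	 */
	if (obj && READ_ONCE(obj->key) != key) {
		put_obj(obj);
		obj = NULL;
	}
	return obj;
}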
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-07 16:36 ` Paul E. McKenney @ 2022-02-10 4:09 ` Dave Chinner 2022-02-10 5:45 ` Paul E. McKenney 0 siblings, 1 reply; 36+ messages in thread From: Dave Chinner @ 2022-02-10 4:09 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Brian Foster, Al Viro, linux-xfs, Ian Kent, rcu On Mon, Feb 07, 2022 at 08:36:21AM -0800, Paul E. McKenney wrote: > On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate > reuse of freed memory, but also requires pointer traversals to the memory > to do a revalidation operation. (Sorry, no free lunch here!) Can't do that with inodes - newly allocated/reused inodes have to go through inode_init_always() which is the very function that causes the problems we have now with path-walk tripping over inodes in an intermediate re-initialised state because we recycled it inside a RCU grace period. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-10 4:09 ` Dave Chinner @ 2022-02-10 5:45 ` Paul E. McKenney 2022-02-10 20:47 ` Brian Foster 0 siblings, 1 reply; 36+ messages in thread From: Paul E. McKenney @ 2022-02-10 5:45 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, Al Viro, linux-xfs, Ian Kent, rcu On Thu, Feb 10, 2022 at 03:09:17PM +1100, Dave Chinner wrote: > On Mon, Feb 07, 2022 at 08:36:21AM -0800, Paul E. McKenney wrote: > > On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > > Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate > > reuse of freed memory, but also requires pointer traversals to the memory > > to do a revalidation operation. (Sorry, no free lunch here!) > > Can't do that with inodes - newly allocated/reused inodes have to go > through inode_init_always() which is the very function that causes > the problems we have now with path-walk tripping over inodes in an > intermediate re-initialised state because we recycled it inside a > RCU grace period. So not just no free lunch, but this is also not a lunch that is consistent with the code's dietary restrictions. From what you said earlier in this thread, I am guessing that you have some other fix in mind. Thanx, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH] xfs: require an rcu grace period before inode recycle 2022-02-10 5:45 ` Paul E. McKenney @ 2022-02-10 20:47 ` Brian Foster 0 siblings, 0 replies; 36+ messages in thread From: Brian Foster @ 2022-02-10 20:47 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Dave Chinner, Al Viro, linux-xfs, Ian Kent, rcu On Wed, Feb 09, 2022 at 09:45:44PM -0800, Paul E. McKenney wrote: > On Thu, Feb 10, 2022 at 03:09:17PM +1100, Dave Chinner wrote: > > On Mon, Feb 07, 2022 at 08:36:21AM -0800, Paul E. McKenney wrote: > > > On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote: > > > Another approach is to use SLAB_TYPESAFE_BY_RCU. This allows immediate > > > reuse of freed memory, but also requires pointer traversals to the memory > > > to do a revalidation operation. (Sorry, no free lunch here!) > > > > Can't do that with inodes - newly allocated/reused inodes have to go > > through inode_init_always() which is the very function that causes > > the problems we have now with path-walk tripping over inodes in an > > intermediate re-initialised state because we recycled it inside a > > RCU grace period. > > So not just no free lunch, but this is also not a lunch that is consistent > with the code's dietary restrictions. > > From what you said earlier in this thread, I am guessing that you have > some other fix in mind. > Yeah.. I've got an experiment running that essentially tracks pending inode grace period cookies and attempts to avoid them at allocation time. It's crude atm, but the initial numbers I see aren't that far off from the results produced by your expedited grace period mechanism. I see numbers mostly in the 40-50k cycles per second ballpark. This is somewhat expected because the current baseline behavior relies on unsafe reuse of inodes before a grace period has elapsed. We have to rely on more physical allocations to get around this, so the small batch alloc/free patterns simply won't be able to spin as fast. The difference I do see with this sort of explicit gp tracking is that the results remain much closer to the baseline kernel when background activity is ramped up. However, one of the things I'd like to experiment with is whether the combination of this approach and expedited grace periods provides any sort of opportunity for further optimization. For example, if we can identify that a grace period has elapsed between the time of ->destroy_inode() and when the queue processing ultimately marks the inode reclaimable, that might allow for some optimized allocation behavior. I see this occur occasionally with normal grace periods, but not quite frequent enough to make a difference. What I observe right now is that the same test above runs at much closer to the baseline numbers when using the ikeep mount option, so I may need to look into ways to mitigate the chunk allocation overhead.. Brian > Thanx, Paul > ^ permalink raw reply [flat|nested] 36+ messages in thread
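One hypothetical shape for the cookie-tracking idea described above, using only the existing (non-expedited) polled grace-period API. This is not the actual experiment being run; apart from the two RCU calls, every name here is invented for illustration.

#include <linux/rcupdate.h>

/* Hypothetical tracked object and queueing helpers. */
struct tracked_obj {
	unsigned long	destroy_gp;	/* grace-period cookie from teardown */
	/* ... */
};
void queue_for_deferred_processing(struct tracked_obj *obj);
void mark_safe_for_reuse(struct tracked_obj *obj);
void mark_reuse_deferred(struct tracked_obj *obj);

static void obj_destroy(struct tracked_obj *obj)
{
	/* Remember which grace period this teardown belongs to. */
	obj->destroy_gp = start_poll_synchronize_rcu();
	queue_for_deferred_processing(obj);
}

static void obj_deferred_process(struct tracked_obj *obj)
{
	/*
	 * By the time batched processing runs, the grace period has often
	 * already elapsed; if so, the object can be handed straight back to
	 * the allocator with no waiting.  Otherwise the allocation path is
	 * expected to skip it until the cookie reports completion.
	 */
	if (poll_state_synchronize_rcu(obj->destroy_gp))
		mark_safe_for_reuse(obj);
	else
		mark_reuse_deferred(obj);
}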
end of thread, other threads: [~2022-02-10 20:47 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)

2022-01-21 14:24 [PATCH] xfs: require an rcu grace period before inode recycle Brian Foster
2022-01-21 17:26 ` Darrick J. Wong
2022-01-21 18:33 ` Brian Foster
2022-01-22  5:30 ` Paul E. McKenney
2022-01-22 16:55 ` Paul E. McKenney
2022-01-24 15:12 ` Brian Foster
2022-01-24 16:40 ` Paul E. McKenney
2022-01-23 22:43 ` Dave Chinner
2022-01-24 15:06 ` Brian Foster
2022-01-24 15:02 ` Brian Foster
2022-01-24 22:08 ` Dave Chinner
2022-01-24 23:29 ` Brian Foster
2022-01-25  0:31 ` Dave Chinner
2022-01-25 14:40 ` Paul E. McKenney
2022-01-25 22:36 ` Dave Chinner
2022-01-26  5:29 ` Paul E. McKenney
2022-01-26 13:21 ` Brian Foster
2022-01-25 18:30 ` Brian Foster
2022-01-25 20:07 ` Brian Foster
2022-01-25 22:45 ` Dave Chinner
2022-01-27  4:19 ` Al Viro
2022-01-27  5:26 ` Dave Chinner
2022-01-27 19:01 ` Brian Foster
2022-01-27 22:18 ` Dave Chinner
2022-01-28 14:11 ` Brian Foster
2022-01-28 23:53 ` Dave Chinner
2022-01-31 13:28 ` Brian Foster
2022-01-28 21:39 ` Paul E. McKenney
2022-01-31 13:22 ` Brian Foster
2022-02-01 22:00 ` Paul E. McKenney
2022-02-03 18:49 ` Paul E. McKenney
2022-02-07 13:30 ` Brian Foster
2022-02-07 16:36 ` Paul E. McKenney
2022-02-10  4:09 ` Dave Chinner
2022-02-10  5:45 ` Paul E. McKenney
2022-02-10 20:47 ` Brian Foster