* [PATCH 0/3] xfs: RCU inode freeing and lookups V3
@ 2010-12-13 1:32 Dave Chinner
From: Dave Chinner @ 2010-12-13  1:32 UTC
To: xfs; +Cc: paulmck, eric.dumazet
This series converts the XFS inode cache to use RCU for freeing
inodes and RCU locking for all lookups. Lookups detect inodes that
have been freed by checking ip->i_ino and the XFS_IRECLAIM flag in
ip->i_flags under the ip->i_flags_lock.
Version 3:
- add patch for specific RCU freeing of inodes rather than relying
on an external tree to provide RCU freeing functionality. This
uses standard RCU freeing mechanisms rather than
SLAB_DESTROY_BY_RCU. This is kept as a separate patch to minimise
the merge conflict issues that may arise if the change is made
through VFS code first.
- Drop specific references to SLAB_DESTROY_BY_RCU, but keep all the
lookup guards for reclaimed/reused inodes as they are mostly the
same for both methods of RCU inode freeing.
Version 2:
- check ip->i_ino under the i_flags_lock on cache hit
- remove duplicate ip->i_ino == 0 check in xfs_dqrele_inode
- Fixed inode cluster walks to check for valid i_ino under
i_flags_lock.
- Fixed ag walk lookups to check i_ino under i_flags_lock.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/3] xfs: rcu free inodes
@ 2010-12-13  1:32 Dave Chinner
From: Dave Chinner @ 2010-12-13  1:32 UTC
To: xfs; +Cc: paulmck, eric.dumazet

From: Dave Chinner <dchinner@redhat.com>

Introduce RCU freeing of XFS inodes so that we can convert lookup
traversals to use rcu_read_lock() protection. This patch only
introduces the RCU freeing, to minimise the potential conflicts with
mainline if this is merged into mainline via a VFS patchset. It
abuses the i_dentry list for the RCU callback structure because the
VFS patches make this a union, so it is safe to use like this and it
simplifies any merge issues.

This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU.
The later lookup patches need the same "found free inode" protection
regardless of the RCU freeing method used, so once again the RCU
freeing method can be dealt with appropriately at merge time without
affecting any other code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iget.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index cdb1c25..9fae475 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -105,6 +105,18 @@ xfs_inode_alloc(
 }
 
 void
+__xfs_inode_free(
+	struct rcu_head		*head)
+{
+	struct inode		*inode = container_of((void *)head,
+						      struct inode, i_dentry);
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_zone_free(xfs_inode_zone, ip);
+}
+
+void
 xfs_inode_free(
 	struct xfs_inode	*ip)
 {
@@ -147,7 +159,7 @@ xfs_inode_free(
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
 
-	kmem_zone_free(xfs_inode_zone, ip);
+	call_rcu((struct rcu_head *)&VFS_I(ip)->i_dentry, __xfs_inode_free);
 }
 
 /*
-- 
1.7.2.3
* Re: [PATCH 1/3] xfs: rcu free inodes
@ 2010-12-14 20:23 Paul E. McKenney
From: Paul E. McKenney @ 2010-12-14 20:23 UTC
To: Dave Chinner; +Cc: eric.dumazet, xfs

On Mon, Dec 13, 2010 at 12:32:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Introduce RCU freeing of XFS inodes so that we can convert lookup
> traversals to use rcu_read_lock() protection. This patch only
> introduces the RCU freeing, to minimise the potential conflicts with
> mainline if this is merged into mainline via a VFS patchset. It
> abuses the i_dentry list for the RCU callback structure because the
> VFS patches make this a union, so it is safe to use like this and it
> simplifies any merge issues.
>
> This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU.
> The later lookup patches need the same "found free inode" protection
> regardless of the RCU freeing method used, so once again the RCU
> freeing method can be dealt with appropriately at merge time without
> affecting any other code.

There are only two call sites that free into xfs_inode_zone. One of
them is initialization, before readers have access to the data
structure, and the other is covered by this patch.

So looks good to me!

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iget.c |   14 +++++++++++++-
>  1 files changed, 13 insertions(+), 1 deletions(-)
>
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index cdb1c25..9fae475 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -105,6 +105,18 @@ xfs_inode_alloc(
>  }
>
>  void
> +__xfs_inode_free(
> +	struct rcu_head		*head)
> +{
> +	struct inode		*inode = container_of((void *)head,
> +						      struct inode, i_dentry);
> +	struct xfs_inode	*ip = XFS_I(inode);
> +
> +	INIT_LIST_HEAD(&inode->i_dentry);
> +	kmem_zone_free(xfs_inode_zone, ip);
> +}
> +
> +void
>  xfs_inode_free(
>  	struct xfs_inode	*ip)
>  {
> @@ -147,7 +159,7 @@ xfs_inode_free(
>  	ASSERT(!spin_is_locked(&ip->i_flags_lock));
>  	ASSERT(completion_done(&ip->i_flush));
>
> -	kmem_zone_free(xfs_inode_zone, ip);
> +	call_rcu((struct rcu_head *)&VFS_I(ip)->i_dentry, __xfs_inode_free);
> }
>
> /*
> --
> 1.7.2.3
* [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
@ 2010-12-13  1:32 Dave Chinner
From: Dave Chinner @ 2010-12-13  1:32 UTC
To: xfs; +Cc: paulmck, eric.dumazet

From: Dave Chinner <dchinner@redhat.com>

With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There are
also a lot more write lock acquisitions than there are read locks (4:1
ratio), so the read locking is not really buying us much in the way of
parallelism.

To avoid the read vs write contention, change the cache to use RCU locking
on the read side. To avoid needing to RCU free every single inode, use the
built in slab RCU freeing mechanism. This requires us to be able to detect
lookups of freed inodes, so ensure that every freed inode has an inode
number of zero and the XFS_IRECLAIM flag set. We already check the
XFS_IRECLAIM flag in the cache hit lookup path, but also add a check for a
zero inode number as well.

We can then convert all the read locking lookups to use RCU read side
locking and hence remove all read side locking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   27 ++++++++++++++++-----
 fs/xfs/xfs_iget.c           |   50 +++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_inode.c          |   52 +++++++++++++++++++++++++++++++++--------
 3 files changed, 98 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index afb0d7c..5ee02d7 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab(
 {
 	struct inode		*inode = VFS_I(ip);
 
+	/* check for stale RCU freed inode */
+	spin_lock(&ip->i_flags_lock);
+	if (!ip->i_ino)
+		goto out_unlock_noent;
+
+	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
+	if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
+		goto out_unlock_noent;
+	spin_unlock(&ip->i_flags_lock);
+
 	/* nothing to sync during shutdown */
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return EFSCORRUPTED;
 
-	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
-	if (xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
-		return ENOENT;
-
 	/* If we can't grab the inode, it must on it's way to reclaim. */
 	if (!igrab(inode))
 		return ENOENT;
@@ -72,6 +78,10 @@ xfs_inode_ag_walk_grab(
 
 	/* inode is valid */
 	return 0;
+
+out_unlock_noent:
+	spin_unlock(&ip->i_flags_lock);
+	return ENOENT;
 }
 
 STATIC int
@@ -98,12 +108,12 @@ restart:
 		int		error = 0;
 		int		i;
 
-		read_lock(&pag->pag_ici_lock);
+		rcu_read_lock();
 		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
 					(void **)batch, first_index,
 					XFS_LOOKUP_BATCH);
 		if (!nr_found) {
-			read_unlock(&pag->pag_ici_lock);
+			rcu_read_unlock();
 			break;
 		}
 
@@ -129,7 +139,7 @@ restart:
 		}
 
 		/* unlock now we've grabbed the inodes. */
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 
 		for (i = 0; i < nr_found; i++) {
 			if (!batch[i])
@@ -639,6 +649,9 @@ xfs_reclaim_inode_grab(
 	struct xfs_inode	*ip,
 	int			flags)
 {
+	/* check for stale RCU freed inode */
+	if (!ip->i_ino)
+		return 1;
 
 	/*
 	 * do some unlocked checks first to avoid unnecceary lock traffic.
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 9fae475..1e3b035 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -80,6 +80,7 @@ xfs_inode_alloc(
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
+	ASSERT(ip->i_ino == 0);
 
 	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
 	lockdep_set_class_and_name(&ip->i_iolock.mr_lock,
@@ -98,9 +99,6 @@ xfs_inode_alloc(
 	ip->i_size = 0;
 	ip->i_new_size = 0;
 
-	/* prevent anyone from using this yet */
-	VFS_I(ip)->i_state = I_NEW;
-
 	return ip;
 }
 
@@ -159,6 +157,16 @@ xfs_inode_free(
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
 
+	/*
+	 * Because we use RCU freeing we need to ensure the inode always
+	 * appears to be reclaimed with an invalid inode number when in the
+	 * free state. The ip->i_flags_lock provides the barrier against lookup
+	 * races.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags = XFS_IRECLAIM;
+	ip->i_ino = 0;
+	spin_unlock(&ip->i_flags_lock);
 	call_rcu((struct rcu_head *)&VFS_I(ip)->i_dentry, __xfs_inode_free);
 }
 
@@ -169,14 +177,32 @@ static int
 xfs_iget_cache_hit(
 	struct xfs_perag	*pag,
 	struct xfs_inode	*ip,
+	xfs_ino_t		ino,
 	int			flags,
-	int			lock_flags) __releases(pag->pag_ici_lock)
+	int			lock_flags) __releases(RCU)
 {
 	struct inode		*inode = VFS_I(ip);
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error;
 
+	/*
+	 * check for re-use of an inode within an RCU grace period due to the
+	 * radix tree nodes not being updated yet. We monitor for this by
+	 * setting the inode number to zero before freeing the inode structure.
+	 * If the inode has been reallocated and set up, then the inode number
+	 * will not match, so check for that, too.
+	 */
 	spin_lock(&ip->i_flags_lock);
+	if (ip->i_ino != ino) {
+		trace_xfs_iget_skip(ip);
+		XFS_STATS_INC(xs_ig_frecycle);
+		spin_unlock(&ip->i_flags_lock);
+		rcu_read_unlock();
+		/* Expire the grace period so we don't trip over it again. */
+		synchronize_rcu();
+		return EAGAIN;
+	}
+
 
 	/*
 	 * If we are racing with another cache hit that is currently
@@ -219,7 +245,7 @@ xfs_iget_cache_hit(
 		ip->i_flags |= XFS_IRECLAIM;
 
 		spin_unlock(&ip->i_flags_lock);
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 
 		error = -inode_init_always(mp->m_super, inode);
 		if (error) {
@@ -227,7 +253,7 @@ xfs_iget_cache_hit(
 			 * Re-initializing the inode failed, and we are in deep
 			 * trouble. Try to re-add it to the reclaim list.
 			 */
-			read_lock(&pag->pag_ici_lock);
+			rcu_read_lock();
 			spin_lock(&ip->i_flags_lock);
 
 			ip->i_flags &= ~XFS_INEW;
@@ -261,7 +287,7 @@ xfs_iget_cache_hit(
 
 		/* We've got a live one. */
 		spin_unlock(&ip->i_flags_lock);
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 		trace_xfs_iget_hit(ip);
 	}
 
@@ -275,7 +301,7 @@ xfs_iget_cache_hit(
 
 out_error:
 	spin_unlock(&ip->i_flags_lock);
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	return error;
 }
 
@@ -397,7 +423,7 @@ xfs_iget(
 	xfs_agino_t	agino;
 
 	/* reject inode numbers outside existing AGs */
-	if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
+	if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
 		return EINVAL;
 
 	/* get the perag structure and ensure that it's inode capable */
@@ -406,15 +432,15 @@ xfs_iget(
 
 again:
 	error = 0;
-	read_lock(&pag->pag_ici_lock);
+	rcu_read_lock();
 	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
 
 	if (ip) {
-		error = xfs_iget_cache_hit(pag, ip, flags, lock_flags);
+		error = xfs_iget_cache_hit(pag, ip, ino, flags, lock_flags);
 		if (error)
 			goto out_error_or_again;
 	} else {
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 		XFS_STATS_INC(xs_ig_missed);
 
 		error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 108c7a0..43ffd90 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2000,17 +2000,33 @@ xfs_ifree_cluster(
 		 */
 		for (i = 0; i < ninodes; i++) {
retry:
-			read_lock(&pag->pag_ici_lock);
+			rcu_read_lock();
 			ip = radix_tree_lookup(&pag->pag_ici_root,
 					XFS_INO_TO_AGINO(mp, (inum + i)));
 
-			/* Inode not in memory or stale, nothing to do */
-			if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
-				read_unlock(&pag->pag_ici_lock);
+			/* Inode not in memory, nothing to do */
+			if (!ip) {
+				rcu_read_unlock();
 				continue;
 			}
 
 			/*
+			 * because this is an RCU protected lookup, we could
+			 * find a recently freed or even reallocated inode
+			 * during the lookup. We need to check under the
+			 * i_flags_lock for a valid inode here. Skip it if it
+			 * is not valid, the wrong inode or stale.
+			 */
+			spin_lock(&ip->i_flags_lock);
+			if (ip->i_ino != inum + i ||
+			    __xfs_iflags_test(ip, XFS_ISTALE)) {
+				spin_unlock(&ip->i_flags_lock);
+				rcu_read_unlock();
+				continue;
+			}
+			spin_unlock(&ip->i_flags_lock);
+
+			/*
 			 * Don't try to lock/unlock the current inode, but we
 			 * _cannot_ skip the other inodes that we did not find
 			 * in the list attached to the buffer and are not
@@ -2019,11 +2035,11 @@ retry:
 			 */
 			if (ip != free_ip &&
 			    !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
-				read_unlock(&pag->pag_ici_lock);
+				rcu_read_unlock();
 				delay(1);
 				goto retry;
 			}
-			read_unlock(&pag->pag_ici_lock);
+			rcu_read_unlock();
 
 			xfs_iflock(ip);
 			xfs_iflags_set(ip, XFS_ISTALE);
@@ -2629,7 +2645,7 @@ xfs_iflush_cluster(
 
 	mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
 	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	read_lock(&pag->pag_ici_lock);
+	rcu_read_lock();
 	/* really need a gang lookup range call here */
 	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
 					first_index, inodes_per_cluster);
@@ -2640,9 +2656,21 @@ xfs_iflush_cluster(
 		iq = ilist[i];
 		if (iq == ip)
 			continue;
-		/* if the inode lies outside this cluster, we're done. */
-		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index)
-			break;
+
+		/*
+		 * because this is an RCU protected lookup, we could find a
+		 * recently freed or even reallocated inode during the lookup.
+		 * We need to check under the i_flags_lock for a valid inode
+		 * here. Skip it if it is not valid or the wrong inode.
+		 */
+		spin_lock(&ip->i_flags_lock);
+		if (!ip->i_ino ||
+		    (XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index) {
+			spin_unlock(&ip->i_flags_lock);
+			continue;
+		}
+		spin_unlock(&ip->i_flags_lock);
+
 		/*
 		 * Do an un-protected check to see if the inode is dirty and
 		 * is a candidate for flushing. These checks will be repeated
@@ -2692,7 +2720,7 @@ xfs_iflush_cluster(
 	}
 
 out_free:
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	kmem_free(ilist);
 out_put:
 	xfs_perag_put(pag);
@@ -2704,7 +2732,7 @@ cluster_corrupt_out:
 	/*
 	 * Corruption detected in the clustering loop. Invalidate the
	 * inode buffer and shut down the filesystem.
	 */
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
	/*
	 * Clean up the buffer. If it was B_DELWRI, just release it --
	 * brelse can handle it with no problems. If not, shut down the
-- 
1.7.2.3
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
@ 2010-12-14 21:18 Paul E. McKenney
From: Paul E. McKenney @ 2010-12-14 21:18 UTC
To: Dave Chinner; +Cc: eric.dumazet, xfs

On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> With delayed logging greatly increasing the sustained parallelism of inode
> operations, the inode cache locking is showing significant read vs write
> contention when inode reclaim runs at the same time as lookups. There are
> also a lot more write lock acquisitions than there are read locks (4:1
> ratio), so the read locking is not really buying us much in the way of
> parallelism.
>
> To avoid the read vs write contention, change the cache to use RCU locking
> on the read side. To avoid needing to RCU free every single inode, use the
> built in slab RCU freeing mechanism. This requires us to be able to detect
> lookups of freed inodes, so ensure that every freed inode has an inode
> number of zero and the XFS_IRECLAIM flag set. We already check the
> XFS_IRECLAIM flag in the cache hit lookup path, but also add a check for a
> zero inode number as well.
>
> We can then convert all the read locking lookups to use RCU read side
> locking and hence remove all read side locking.

OK, so the xfs_inode uses straight RCU, and therefore cannot be freed and
immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU,
which does allow the inode to be freed and immediately reallocated,
correct?

Some questions and comments below.

							Thanx, Paul

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Alex Elder <aelder@sgi.com>
> ---
>  fs/xfs/linux-2.6/xfs_sync.c |   27 ++++++++++++++++-----
>  fs/xfs/xfs_iget.c           |   50 +++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_inode.c          |   52 +++++++++++++++++++++++++++++++++--------
>  3 files changed, 98 insertions(+), 31 deletions(-)
>
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index afb0d7c..5ee02d7 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab(
>  {
>  	struct inode		*inode = VFS_I(ip);
>
> +	/* check for stale RCU freed inode */
> +	spin_lock(&ip->i_flags_lock);
> +	if (!ip->i_ino)
> +		goto out_unlock_noent;
> +
> +	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> +	if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
> +		goto out_unlock_noent;
> +	spin_unlock(&ip->i_flags_lock);
> +

OK, this works because the xfs_inode cannot be freed and reallocated (which
the RCU change should enforce). It is not clear to me that the above
would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless
some higher-level checks can reliably catch an inode changing identity
due to quick free and reallocation.

Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side
critical sections... I suggest a debug check for rcu_read_lock_held() to
catch any call paths that might have slipped through. At first glance,
it appears that RCU is replacing some of ->pag_ici_lock, but I could
easily be mistaken.

>  	/* nothing to sync during shutdown */
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return EFSCORRUPTED;
>
> -	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> -	if (xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
> -		return ENOENT;
> -
>  	/* If we can't grab the inode, it must on it's way to reclaim. */
>  	if (!igrab(inode))
>  		return ENOENT;
> @@ -72,6 +78,10 @@ xfs_inode_ag_walk_grab(
>
>  	/* inode is valid */
>  	return 0;
> +
> +out_unlock_noent:
> +	spin_unlock(&ip->i_flags_lock);
> +	return ENOENT;
>  }
>
>  STATIC int
> @@ -98,12 +108,12 @@ restart:
>  		int		error = 0;
>  		int		i;
>
> -		read_lock(&pag->pag_ici_lock);
> +		rcu_read_lock();
>  		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
>  					(void **)batch, first_index,
>  					XFS_LOOKUP_BATCH);
>  		if (!nr_found) {
> -			read_unlock(&pag->pag_ici_lock);
> +			rcu_read_unlock();
>  			break;
>  		}
>
> @@ -129,7 +139,7 @@ restart:
>  		}
>
>  		/* unlock now we've grabbed the inodes. */
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>
>  		for (i = 0; i < nr_found; i++) {
>  			if (!batch[i])
> @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab(
>  	struct xfs_inode	*ip,
>  	int			flags)
>  {
> +	/* check for stale RCU freed inode */
> +	if (!ip->i_ino)
> +		return 1;

Does this mean that we need to be under rcu_read_lock() here? If not,
how do we keep the inode from really being freed out from under us?
(Again, if we do need to be under rcu_read_lock(), I highly recommend
a debug check for rcu_read_lock_held().)

>  	/*
>  	 * do some unlocked checks first to avoid unnecceary lock traffic.
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index 9fae475..1e3b035 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -80,6 +80,7 @@ xfs_inode_alloc(
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
>  	ASSERT(!spin_is_locked(&ip->i_flags_lock));
>  	ASSERT(completion_done(&ip->i_flush));
> +	ASSERT(ip->i_ino == 0);
>
>  	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
>  	lockdep_set_class_and_name(&ip->i_iolock.mr_lock,
> @@ -98,9 +99,6 @@ xfs_inode_alloc(
>  	ip->i_size = 0;
>  	ip->i_new_size = 0;
>
> -	/* prevent anyone from using this yet */
> -	VFS_I(ip)->i_state = I_NEW;
> -
>  	return ip;
>  }
>
> @@ -159,6 +157,16 @@ xfs_inode_free(
>  	ASSERT(!spin_is_locked(&ip->i_flags_lock));
>  	ASSERT(completion_done(&ip->i_flush));
>
> +	/*
> +	 * Because we use RCU freeing we need to ensure the inode always
> +	 * appears to be reclaimed with an invalid inode number when in the
> +	 * free state. The ip->i_flags_lock provides the barrier against lookup
> +	 * races.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	ip->i_flags = XFS_IRECLAIM;
> +	ip->i_ino = 0;
> +	spin_unlock(&ip->i_flags_lock);

OK, good!

>  	call_rcu((struct rcu_head *)&VFS_I(ip)->i_dentry, __xfs_inode_free);
>  }
>
> @@ -169,14 +177,32 @@ static int
>  xfs_iget_cache_hit(
>  	struct xfs_perag	*pag,
>  	struct xfs_inode	*ip,
> +	xfs_ino_t		ino,
>  	int			flags,
> -	int			lock_flags) __releases(pag->pag_ici_lock)
> +	int			lock_flags) __releases(RCU)
>  {
>  	struct inode		*inode = VFS_I(ip);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error;
>
> +	/*
> +	 * check for re-use of an inode within an RCU grace period due to the
> +	 * radix tree nodes not being updated yet. We monitor for this by
> +	 * setting the inode number to zero before freeing the inode structure.
> +	 * If the inode has been reallocated and set up, then the inode number
> +	 * will not match, so check for that, too.
> +	 */
>  	spin_lock(&ip->i_flags_lock);
> +	if (ip->i_ino != ino) {
> +		trace_xfs_iget_skip(ip);
> +		XFS_STATS_INC(xs_ig_frecycle);
> +		spin_unlock(&ip->i_flags_lock);
> +		rcu_read_unlock();
> +		/* Expire the grace period so we don't trip over it again. */
> +		synchronize_rcu();

Hmmm... Interesting. Wouldn't the fact that we acquired the same lock
that was held after removing the inode guarantee that an immediate retry
would manage not to find this same inode again? If this is not the case,
then readers finding it again will not be protected by the RCU grace
period, right?

In short, I don't understand why the synchronize_rcu() is needed. If it
is somehow helping, that sounds to me like it is covering up a real bug
that should be fixed separately.

> +		return EAGAIN;
> +	}
> +
>
>  	/*
>  	 * If we are racing with another cache hit that is currently
> @@ -219,7 +245,7 @@ xfs_iget_cache_hit(
>  		ip->i_flags |= XFS_IRECLAIM;
>
>  		spin_unlock(&ip->i_flags_lock);
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>
>  		error = -inode_init_always(mp->m_super, inode);
>  		if (error) {
> @@ -227,7 +253,7 @@ xfs_iget_cache_hit(
>  			 * Re-initializing the inode failed, and we are in deep
>  			 * trouble. Try to re-add it to the reclaim list.
>  			 */
> -			read_lock(&pag->pag_ici_lock);
> +			rcu_read_lock();
>  			spin_lock(&ip->i_flags_lock);
>
>  			ip->i_flags &= ~XFS_INEW;
> @@ -261,7 +287,7 @@ xfs_iget_cache_hit(
>
>  		/* We've got a live one. */
>  		spin_unlock(&ip->i_flags_lock);
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>  		trace_xfs_iget_hit(ip);
>  	}
>
> @@ -275,7 +301,7 @@ xfs_iget_cache_hit(
>
>  out_error:
>  	spin_unlock(&ip->i_flags_lock);
> -	read_unlock(&pag->pag_ici_lock);
> +	rcu_read_unlock();
>  	return error;
>  }
>
> @@ -397,7 +423,7 @@ xfs_iget(
>  	xfs_agino_t	agino;
>
>  	/* reject inode numbers outside existing AGs */
> -	if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> +	if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)

For the above check to be safe, don't we need to already be in an RCU
read-side critical section? Or is something else protecting us?

>  		return EINVAL;
>
>  	/* get the perag structure and ensure that it's inode capable */
> @@ -406,15 +432,15 @@ xfs_iget(
>
>  again:
>  	error = 0;
> -	read_lock(&pag->pag_ici_lock);
> +	rcu_read_lock();
>  	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
>
>  	if (ip) {
> -		error = xfs_iget_cache_hit(pag, ip, flags, lock_flags);
> +		error = xfs_iget_cache_hit(pag, ip, ino, flags, lock_flags);
>  		if (error)
>  			goto out_error_or_again;
>  	} else {
> -		read_unlock(&pag->pag_ici_lock);
> +		rcu_read_unlock();
>  		XFS_STATS_INC(xs_ig_missed);
>
>  		error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 108c7a0..43ffd90 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2000,17 +2000,33 @@ xfs_ifree_cluster(
>  		 */
>  		for (i = 0; i < ninodes; i++) {
> retry:
> -			read_lock(&pag->pag_ici_lock);
> +			rcu_read_lock();
>  			ip = radix_tree_lookup(&pag->pag_ici_root,
>  					XFS_INO_TO_AGINO(mp, (inum + i)));
>
> -			/* Inode not in memory or stale, nothing to do */
> -			if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
> -				read_unlock(&pag->pag_ici_lock);
> +			/* Inode not in memory, nothing to do */
> +			if (!ip) {
> +				rcu_read_unlock();
>  				continue;
>  			}
>
>  			/*
> +			 * because this is an RCU protected lookup, we could
> +			 * find a recently freed or even reallocated inode

And here the inode might be immediately freed and then reallocated.
(In contrast, this cannot happen to the xfs_inode.)

> +			 * during the lookup. We need to check under the
> +			 * i_flags_lock for a valid inode here. Skip it if it
> +			 * is not valid, the wrong inode or stale.
> +			 */
> +			spin_lock(&ip->i_flags_lock);
> +			if (ip->i_ino != inum + i ||
> +			    __xfs_iflags_test(ip, XFS_ISTALE)) {
> +				spin_unlock(&ip->i_flags_lock);
> +				rcu_read_unlock();
> +				continue;
> +			}
> +			spin_unlock(&ip->i_flags_lock);
> +
> +			/*
>  			 * Don't try to lock/unlock the current inode, but we
>  			 * _cannot_ skip the other inodes that we did not find
>  			 * in the list attached to the buffer and are not
> @@ -2019,11 +2035,11 @@ retry:
>  			 */
>  			if (ip != free_ip &&
>  			    !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> -				read_unlock(&pag->pag_ici_lock);
> +				rcu_read_unlock();
>  				delay(1);
>  				goto retry;
>  			}
> -			read_unlock(&pag->pag_ici_lock);
> +			rcu_read_unlock();
>
>  			xfs_iflock(ip);
>  			xfs_iflags_set(ip, XFS_ISTALE);
> @@ -2629,7 +2645,7 @@ xfs_iflush_cluster(
>
>  	mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
>  	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> -	read_lock(&pag->pag_ici_lock);
> +	rcu_read_lock();
>  	/* really need a gang lookup range call here */
>  	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
>  					first_index, inodes_per_cluster);
> @@ -2640,9 +2656,21 @@ xfs_iflush_cluster(
>  		iq = ilist[i];
>  		if (iq == ip)
>  			continue;
> -		/* if the inode lies outside this cluster, we're done. */
> -		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index)
> -			break;
> +
> +		/*
> +		 * because this is an RCU protected lookup, we could find a
> +		 * recently freed or even reallocated inode during the lookup.
> +		 * We need to check under the i_flags_lock for a valid inode
> +		 * here. Skip it if it is not valid or the wrong inode.
> +		 */
> +		spin_lock(&ip->i_flags_lock);
> +		if (!ip->i_ino ||
> +		    (XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index) {
> +			spin_unlock(&ip->i_flags_lock);
> +			continue;
> +		}
> +		spin_unlock(&ip->i_flags_lock);
> +
>  		/*
>  		 * Do an un-protected check to see if the inode is dirty and
>  		 * is a candidate for flushing. These checks will be repeated
> @@ -2692,7 +2720,7 @@ xfs_iflush_cluster(
>  	}
>
>  out_free:
> -	read_unlock(&pag->pag_ici_lock);
> +	rcu_read_unlock();
>  	kmem_free(ilist);
>  out_put:
>  	xfs_perag_put(pag);
> @@ -2704,7 +2732,7 @@ cluster_corrupt_out:
>  	/*
>  	 * Corruption detected in the clustering loop. Invalidate the
>  	 * inode buffer and shut down the filesystem.
>  	 */
> -	read_unlock(&pag->pag_ici_lock);
> +	rcu_read_unlock();
>  	/*
>  	 * Clean up the buffer. If it was B_DELWRI, just release it --
>  	 * brelse can handle it with no problems. If not, shut down the
> --
> 1.7.2.3
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking 2010-12-14 21:18 ` Paul E. McKenney @ 2010-12-14 23:00 ` Dave Chinner 2010-12-15 1:05 ` Paul E. McKenney 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2010-12-14 23:00 UTC (permalink / raw) To: Paul E. McKenney; +Cc: eric.dumazet, xfs On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote: > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > With delayed logging greatly increasing the sustained parallelism of inode > > operations, the inode cache locking is showing significant read vs write > > contention when inode reclaim runs at the same time as lookups. There is > > also a lot more write lock acquistions than there are read locks (4:1 ratio) > > so the read locking is not really buying us much in the way of parallelism. > > > > To avoid the read vs write contention, change the cache to use RCU locking on > > the read side. To avoid needing to RCU free every single inode, use the built > > in slab RCU freeing mechanism. This requires us to be able to detect lookups of > > freed inodes, so enѕure that ever freed inode has an inode number of zero and > > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit > > lookup path, but also add a check for a zero inode number as well. > > > > We canthen convert all the read locking lockups to use RCU read side locking > > and hence remove all read side locking. > > OK, so the xfs_inode uses straight RCU, and there fore cannot be freed and > immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU, > which does allow the inode to be freed and immediately reallocated, > correct? No, the struct inode is embedded in the struct xfs_inode, so they have the same lifecycle. i.e. we don't separately allocate and free the struct inode. So it is all using straight RCU. > Some questions and comments below. 
> > Thanx, Paul > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > Reviewed-by: Alex Elder <aelder@sgi.com> > > --- > > fs/xfs/linux-2.6/xfs_sync.c | 27 ++++++++++++++++----- > > fs/xfs/xfs_iget.c | 50 +++++++++++++++++++++++++++++++---------- > > fs/xfs/xfs_inode.c | 52 +++++++++++++++++++++++++++++++++---------- > > 3 files changed, 98 insertions(+), 31 deletions(-) > > > > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c > > index afb0d7c..5ee02d7 100644 > > --- a/fs/xfs/linux-2.6/xfs_sync.c > > +++ b/fs/xfs/linux-2.6/xfs_sync.c > > @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab( > > { > > struct inode *inode = VFS_I(ip); > > > > + /* check for stale RCU freed inode */ > > + spin_lock(&ip->i_flags_lock); > > + if (!ip->i_ino) > > + goto out_unlock_noent; > > + > > + /* avoid new or reclaimable inodes. Leave for reclaim code to flush */ > > + if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM)) > > + goto out_unlock_noent; > > + spin_unlock(&ip->i_flags_lock); > > + > > OK, this works because the xfs_inode cannot be freed and reallocated (which > the RCU change should enforce). It is not clear to me that the above > would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless > some higher-level checks can reliably catch an inode changing identity > due to quick free and reallocation. In this case, I don't believe it matters if the inode has changed identity - we are walking for writeback, so if we get a reallocated inode we'll write it back if it is dirty. If it has not been reallocated or still being initialised, the !ip->i_ino and XFS_INEW|XFS_IRECLAIM checks are sufficient to avoid using the inode. I probably should add a comment to this effect, yes? > Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side > critical sections... I suggest a debug check for rcu_read_lock_held() to > catch any call paths that might have slipped through. Yes, good idea. 
> At first glance, > it appears that RCU is replacing some of ->pag_ici_lock, but I could > easily be mistaken. Correct, it is replacing the read side of the ->pag_ici_lock. > > @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab( > > struct xfs_inode *ip, > > int flags) > > { > > + /* check for stale RCU freed inode */ > > + if (!ip->i_ino) > > + return 1; > > Does this mean that we need to be under rcu_read_lock() here? If not, > how do we keep the inode from really being freed out from under us? Hmmm, I think I've mismerged a patch somewhere along the line. In this patch the reclaim tree walk is under the ->pag_ici_lock(), when in fact it should be under the rcu_read_lock(). Good catch, Paul. As it is, being under the ->pag_ici_lock means that the tree is consistent and we won't be seeing RCU freed inodes in the walk. Hence the code is functioning correctly, just not as was intended. > (Again, if we do need to be under rcu_read_lock(), I highly recommend > a debug check for rcu_read_lock_held().) Yup, which would have caught the merge screwup... .... > > > > + /* > > + * check for re-use of an inode within an RCU grace period due to the > > + * radix tree nodes not being updated yet. We monitor for this by > > + * setting the inode number to zero before freeing the inode structure. > > + * If the inode has been reallocated and set up, then the inode number > > + * will not match, so check for that, too. > > + */ > > spin_lock(&ip->i_flags_lock); > > + if (ip->i_ino != ino) { > > + trace_xfs_iget_skip(ip); > > + XFS_STATS_INC(xs_ig_frecycle); > > + spin_unlock(&ip->i_flags_lock); > > + rcu_read_unlock(); > > + /* Expire the grace period so we don't trip over it again. */ > > + synchronize_rcu(); > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock > that was held after removing the inode guarantee that an immediate retry > would manage not to find this same inode again? That is what I'm not sure of. 
I was more worried about resolving the contents of the radix tree nodes, not so much the inode itself. If a new traversal will resolve the tree correctly (which is what you are implying), then synchronize_rcu() is not needed.... > If this is not the case, then readers finding it again will not be > protected by the RCU grace period, right? > > In short, I don't understand why the synchronize_rcu() is needed. > If it is somehow helping, that sounds to me like it is covering up > a real bug that should be fixed separately. It isn't covering up a bug, it was more trying to be consistent with the rest of the xfs_inode lookup failures - we back off and try again later. If that is unnecessary to resolve the RCU lookup race, then it can be dropped. > > @@ -397,7 +423,7 @@ xfs_iget( > > xfs_agino_t agino; > > > > /* reject inode numbers outside existing AGs */ > > - if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount) > > + if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount) > > For the above check to be safe, don't we need to already be in an > RCU read-side critical section? Or is something else protecting us? "ino" is the inode number used as the lookup key to find the struct xfs_inode. This is ensuring we don't try to look up an inode number of zero given its new special meaning as a freed inode. Hence it can be safely validated outside the RCU read-side critical section as it is a constant. Thanks for the review, Paul. I'll fix up the issues you've pointed out and retest. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking 2010-12-14 23:00 ` Dave Chinner @ 2010-12-15 1:05 ` Paul E. McKenney 2010-12-15 2:50 ` Dave Chinner 2010-12-15 3:30 ` Nick Piggin 0 siblings, 2 replies; 14+ messages in thread From: Paul E. McKenney @ 2010-12-15 1:05 UTC (permalink / raw) To: Dave Chinner; +Cc: eric.dumazet, xfs On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote: > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote: > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > With delayed logging greatly increasing the sustained parallelism of inode > > > operations, the inode cache locking is showing significant read vs write > > > contention when inode reclaim runs at the same time as lookups. There is > > > also a lot more write lock acquistions than there are read locks (4:1 ratio) > > > so the read locking is not really buying us much in the way of parallelism. > > > > > > To avoid the read vs write contention, change the cache to use RCU locking on > > > the read side. To avoid needing to RCU free every single inode, use the built > > > in slab RCU freeing mechanism. This requires us to be able to detect lookups of > > > freed inodes, so enѕure that ever freed inode has an inode number of zero and > > > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit > > > lookup path, but also add a check for a zero inode number as well. > > > > > > We canthen convert all the read locking lockups to use RCU read side locking > > > and hence remove all read side locking. > > > > OK, so the xfs_inode uses straight RCU, and there fore cannot be freed and > > immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU, > > which does allow the inode to be freed and immediately reallocated, > > correct? > > No, the struct inode is embedded in the struct xfs_inode, so they > have the same lifecycle. i.e. 
we don't separately allocate and free > the struct inode. So it is all using straight RCU. > > > Some questions and comments below. > > > > Thanx, Paul > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > Reviewed-by: Alex Elder <aelder@sgi.com> > > > --- > > > fs/xfs/linux-2.6/xfs_sync.c | 27 ++++++++++++++++----- > > > fs/xfs/xfs_iget.c | 50 +++++++++++++++++++++++++++++++---------- > > > fs/xfs/xfs_inode.c | 52 +++++++++++++++++++++++++++++++++---------- > > > 3 files changed, 98 insertions(+), 31 deletions(-) > > > > > > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c > > > index afb0d7c..5ee02d7 100644 > > > --- a/fs/xfs/linux-2.6/xfs_sync.c > > > +++ b/fs/xfs/linux-2.6/xfs_sync.c > > > @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab( > > > { > > > struct inode *inode = VFS_I(ip); > > > > > > + /* check for stale RCU freed inode */ > > > + spin_lock(&ip->i_flags_lock); > > > + if (!ip->i_ino) > > > + goto out_unlock_noent; > > > + > > > + /* avoid new or reclaimable inodes. Leave for reclaim code to flush */ > > > + if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM)) > > > + goto out_unlock_noent; > > > + spin_unlock(&ip->i_flags_lock); > > > + > > > > OK, this works because the xfs_inode cannot be freed and reallocated (which > > the RCU change should enforce). It is not clear to me that the above > > would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless > > some higher-level checks can reliably catch an inode changing identity > > due to quick free and reallocation. > > In this case, I don't believe it matters if the inode has changed > identity - we are walking for writeback, so if we get a reallocated > inode we'll write it back if it is dirty. If it has not been > reallocated or still being initialised, the !ip->i_ino and > XFS_INEW|XFS_IRECLAIM checks are sufficient to avoid using the inode. > > I probably should add a comment to this effect, yes? One vote in favor from me. 
;-) > > Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side > > critical sections... I suggest a debug check for rcu_read_lock_held() to > > catch any call paths that might have slipped through. > > Yes, good idea. > > > At first glance, > > it appears that RCU is replacing some of ->pag_ici_lock, but I could > > easily be mistaken. > > Correct, it is replacing the read side of the ->pag_ici_lock. OK, good! > > > @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab( > > > struct xfs_inode *ip, > > > int flags) > > > { > > > + /* check for stale RCU freed inode */ > > > + if (!ip->i_ino) > > > + return 1; > > > > Does this mean that we need to be under rcu_read_lock() here? If not, > > how do we keep the inode from really being freed out from under us? > > Hmmm, I think I've mismerged a patch somewhere along the line. In > this patch the reclaim tree walk is under the ->pag_ici_lock(), when > in fact it should be under the rcu_read_lock(). Good catch, Paul. > > As it is, being under the ->pag_ici_lock means that the tree is > consistent and we won't be seeing RCU freed inodes in the walk. > Hence the code is functioning correctly, just not as wasss intended. > > > (Again, if we do need to be under rcu_read_lock(), I highly recommend > > a debug check for rcu_read_lock_held().) > > Yup, which would have caught the merge screwup... ;-) > .... > > > > > > + /* > > > + * check for re-use of an inode within an RCU grace period due to the > > > + * radix tree nodes not being updated yet. We monitor for this by > > > + * setting the inode number to zero before freeing the inode structure. > > > + * If the inode has been reallocated and set up, then the inode number > > > + * will not match, so check for that, too. 
> > > + */ > > > spin_lock(&ip->i_flags_lock); > > > + if (ip->i_ino != ino) { > > > + trace_xfs_iget_skip(ip); > > > + XFS_STATS_INC(xs_ig_frecycle); > > > + spin_unlock(&ip->i_flags_lock); > > > + rcu_read_unlock(); > > > + /* Expire the grace period so we don't trip over it again. */ > > > + synchronize_rcu(); > > > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock > > that was held after removing the inode guarantee that an immediate retry > > would manage not to find this same inode again? > > That is what I'm not sure of. I was more worried about resolving the > contents of the radix tree nodes, not so much the inode itself. If a > new traversal will resolve the tree correctly (which is what you are > implying), then synchronize_rcu() is not needed.... Here is the sequence of events that I believe must be in place: 1. Some CPU removes the vfs_inode from the radix tree, presumably while holding some lock whose identity does not matter here. 2. Before invoking call_rcu() on a given vfs_inode, this same CPU clears the inode number while holding ->i_flags_lock. If the CPU in #1 and #2 might be different, then either CPU #1 must hold ->i_flags_lock while removing the vfs_inode from the radix tree, or I don't understand how this can work even with the synchronize_rcu(). 3. Some CPU, possibly different than that in #1 and #2 above, executes the above code. If locking works correctly, it must see the earlier changes, so a subsequent access should see the results of #1 above, so that it won't see the element that was removed there. That said, I don't claim to understand either vfs or xfs very well, so I would be arbitrarily deeply confused. > > If this is not the case, then readers finding it again will not be > > protected by the RCU grace period, right? > > > > In short, I don't understand why the synchronize_rcu() is needed. > > If it is somehow helping, that sounds to me like it is covering up > > a real bug that should be fixed separately. 
> > It isn't covering up a bug, it was more tryingt o be consistent with > the rest of the xfs_inode lookup failures - we back off and try > again later. If that is unnecessary resolve the RCU lookup race, > then it can be dropped. OK, please let me know whether my sequence of steps above makes sense. > > > @@ -397,7 +423,7 @@ xfs_iget( > > > xfs_agino_t agino; > > > > > > /* reject inode numbers outside existing AGs */ > > > - if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount) > > > + if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount) > > > > For the above check to be safe, don't we need to already be in an > > RCU read-side critical section? Or is something else protecting us? > > "ino" is the inode number used as the lookup key to find the struct > xfs_inode. This is ensuring we don't try to look up an inode number > of zero given it's new special meaning as a freed inode. Hence it > can be safely validated outside the RCU read-side critical sectioni > as it is a constant. Ah, OK, got it! Thanx, Paul > Thanks for the review, Paul. I'll fix up the issues you've pointed > out and retest. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking 2010-12-15 1:05 ` Paul E. McKenney @ 2010-12-15 2:50 ` Dave Chinner 2010-12-15 6:34 ` Paul E. McKenney 2010-12-15 3:30 ` Nick Piggin 1 sibling, 1 reply; 14+ messages in thread From: Dave Chinner @ 2010-12-15 2:50 UTC (permalink / raw) To: Paul E. McKenney; +Cc: eric.dumazet, xfs On Tue, Dec 14, 2010 at 05:05:36PM -0800, Paul E. McKenney wrote: > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote: > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote: > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote: > > > > + /* > > > > + * check for re-use of an inode within an RCU grace period due to the > > > > + * radix tree nodes not being updated yet. We monitor for this by > > > > + * setting the inode number to zero before freeing the inode structure. > > > > + * If the inode has been reallocated and set up, then the inode number > > > > + * will not match, so check for that, too. > > > > + */ > > > > spin_lock(&ip->i_flags_lock); > > > > + if (ip->i_ino != ino) { > > > > + trace_xfs_iget_skip(ip); > > > > + XFS_STATS_INC(xs_ig_frecycle); > > > > + spin_unlock(&ip->i_flags_lock); > > > > + rcu_read_unlock(); > > > > + /* Expire the grace period so we don't trip over it again. */ > > > > + synchronize_rcu(); > > > > > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock > > > that was held after removing the inode guarantee that an immediate retry > > > would manage not to find this same inode again? > > > > That is what I'm not sure of. I was more worried about resolving the > > contents of the radix tree nodes, not so much the inode itself. If a > > new traversal will resolve the tree correctly (which is what you are > > implying), then synchronize_rcu() is not needed.... > > Here is the sequence of events that I believe must be in place: > s/vfs_inode/xfs_inode/g > 1. 
Some CPU removes the vfs_inode from the radix tree, presumably > while holding some lock whose identity does not matter here. Yes, the ->pag_ici_lock. > 2. Before invoking call_rcu() on a given vfs_inode, this same > CPU clears the inode number while holding ->i_flags_lock. Not necessarily the same CPU - there are points where we take sleeping locks for synchronisation with any remaining users, and we don't use preempt_disable() to prevent a change of CPU on a preemptible kernel. > If the CPU in #1 and #2 might be different, then either > CPU #1 must hold ->i_flags_lock while removing the vfs_inode > from the radix tree, or I don't understand how this can work > even with the synchronize_rcu(). I'm probably missing something, but why does the CPU we run call_rcu() to free the inode on matter w.r.t. which CPU it was deleted from the radix tree on? There is this comment in include/linux/radix-tree.h: * It is still required that the caller manage the synchronization and lifetimes * of the items. So if RCU lock-free lookups are used, typically this would mean * that the items have their own locks, or are amenable to lock-free access; and * that the items are freed by RCU (or only freed after having been deleted from * the radix tree *and* a synchronize_rcu() grace period). There is nothing there that mentions the items need to be freed on the same CPU as they were removed from the radix tree, or that the item lock needs to be held when the item is removed from the tree. AFAICT, the XFS code is following these guidelines. FWIW, this is where I got the idea of using synchronize_rcu() to ensure a lookup retry wouldn't see the same freed inode. I'm thinking that simply re-running the lookup will give the same guarantee because of the memory barriers in the radix tree lookup code... > 3. Some CPU, possibly different than that in #1 and #2 above, > executes the above code. 
If locking works correctly, it > must see the earlier changes, so a subsequent access should > see the results of #1 above, so that it won't see the element > that was removed there. Isn't the item validity considered separately to the tree traversal consistency? i.e. the item lock (i.e. ->i_flags_lock) provides a safe item validity check via the unlock->lock memory barrier, whilst the radix tree uses rcu_dereference() to provide memory barriers against the modifications? > That said, I don't claim to understand either vfs or xfs very well, so > I would be arbitrarily deeply confused. We might both be confused ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking 2010-12-15 2:50 ` Dave Chinner @ 2010-12-15 6:34 ` Paul E. McKenney 0 siblings, 0 replies; 14+ messages in thread From: Paul E. McKenney @ 2010-12-15 6:34 UTC (permalink / raw) To: Dave Chinner; +Cc: eric.dumazet, xfs On Wed, Dec 15, 2010 at 01:50:49PM +1100, Dave Chinner wrote: > On Tue, Dec 14, 2010 at 05:05:36PM -0800, Paul E. McKenney wrote: > > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote: > > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote: > > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote: > > > > > + /* > > > > > + * check for re-use of an inode within an RCU grace period due to the > > > > > + * radix tree nodes not being updated yet. We monitor for this by > > > > > + * setting the inode number to zero before freeing the inode structure. > > > > > + * If the inode has been reallocated and set up, then the inode number > > > > > + * will not match, so check for that, too. > > > > > + */ > > > > > spin_lock(&ip->i_flags_lock); > > > > > + if (ip->i_ino != ino) { > > > > > + trace_xfs_iget_skip(ip); > > > > > + XFS_STATS_INC(xs_ig_frecycle); > > > > > + spin_unlock(&ip->i_flags_lock); > > > > > + rcu_read_unlock(); > > > > > + /* Expire the grace period so we don't trip over it again. */ > > > > > + synchronize_rcu(); > > > > > > > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock > > > > that was held after removing the inode guarantee that an immediate retry > > > > would manage not to find this same inode again? > > > > > > That is what I'm not sure of. I was more worried about resolving the > > > contents of the radix tree nodes, not so much the inode itself. If a > > > new traversal will resolve the tree correctly (which is what you are > > > implying), then synchronize_rcu() is not needed.... > > > > Here is the sequence of events that I believe must be in place: > > s/vfs_inode/xfs_inode/g Good point! > > 1. 
Some CPU removes the vfs_inode from the radix tree, presumably > > while holding some lock whose identity does not matter here. > > Yes, the ->pag_ici_lock. OK. > > 2. Before invoking call_rcu() on a given vfs_inode, this same > > CPU clears the inode number while holding ->i_flags_lock. > > Not necessarily the same CPU - there are points where we take > sleeping locks for synchronisation with any remaining users, and > we don't use preempt_disable() to prevent a change of CPU on a > preemptible kernel. > > > If the CPU in #1 and #2 might be different, then either > > CPU #1 must hold ->i_flags_lock while removing the vfs_inode > > from the radix tree, or I don't understand how this can work > > even with the synchronize_rcu(). > > I'm probably missing something, but why does the CPU we run > call_rcu() to free the inode on matter w.r.t. which CPU it was > deleted from the radix tree on? I was oversimplifying. What matters is that the item was deleted from the radix tree unambiguously before it was passed to call_rcu(). There are a number of ways to make this happen: 1. Do the removal and the call_rcu() on the same CPU, in that order. 2. Do the removal while holding a given lock, and do the call_rcu() under a later critical section for that same lock. 3. Do the removal while holding lock A on CPU 1, then later acquire lock B on CPU 1, and then do the call_rcu() after a later acquisition of lock B on some other CPU. There are a bunch of other variations on this theme, but the key requirement is again that the call_rcu() happen unambiguously after the removal. Otherwise, how is the grace period supposed to guarantee that all RCU readers that might be accessing the removed xfs_inode really have completed? > There is this comment in include/linux/radix-tree.h: > > * It is still required that the caller manage the synchronization and lifetimes > * of the items. 
So if RCU lock-free lookups are used, typically this would mean > * that the items have their own locks, or are amenable to lock-free access; and > * that the items are freed by RCU (or only freed after having been deleted from > * the radix tree *and* a synchronize_rcu() grace period). > > There is nothing there that mentions the items need to be deleted on > the same CPU as they were removed from the radix tree or that the > item lock needs to be held when the item is removed from the tree. > AFAICT, the XFS code is following these guidelines. Well, the grace period (from either synchronize_rcu() or call_rcu()) does need to start unambiguously after the deletion from the radix tree. Should we upgrade the comment? > FWIW, this is where is got the idea of using synchronize_rcu() to > ensure a llokup retry wouldn't see the same freed inode. I'm > thinking that simply re-running the lookup will give the same > guarantee because of the memory barriers in the radix tree lookup > code... Maybe... But we do need to be sure, right? > > 3. Some CPU, possibly different than that in #1 and #2 above, > > executes the above code. If locking works correctly, it > > must see the earlier changes, so a subsequent access should > > see the results of #1 above, so that it won't see the element > > that was removed there. > > Isn't the item validity considered separately to the tree traversal > consistency? i.e. the item lock (i.e. ->i_flags_lock) provides a > safe item validity check via the unlock->lock memory barrier, whilst > the radix tree uses rcu_dereference() to provide memory barriers > against the modifications? But the safe validity check assumes that the RCU grace period starts unambiguously after the item has been removed. Violate that assumption, and all bets are off. > > That said, I don't claim to understand either vfs or xfs very well, so > > I would be arbitrarily deeply confused. 
> > We might both be confused ;) Sounds like the most likely possibility, now that you mention it. ;-) Thanx, Paul _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking 2010-12-15 1:05 ` Paul E. McKenney 2010-12-15 2:50 ` Dave Chinner @ 2010-12-15 3:30 ` Nick Piggin 2010-12-15 6:35 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Nick Piggin @ 2010-12-15 3:30 UTC (permalink / raw) To: linux-xfs Paul E. McKenney <paulmck <at> linux.vnet.ibm.com> writes: > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote: > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote: > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote: > > > > > > > > + /* > > > > + * check for re-use of an inode within an RCU grace period due to the > > > > + * radix tree nodes not being updated yet. We monitor for this by > > > > + * setting the inode number to zero before freeing the inode structure. > > > > + * If the inode has been reallocated and set up, then the inode number > > > > + * will not match, so check for that, too. > > > > + */ > > > > spin_lock(&ip->i_flags_lock); > > > > + if (ip->i_ino != ino) { > > > > + trace_xfs_iget_skip(ip); > > > > + XFS_STATS_INC(xs_ig_frecycle); > > > > + spin_unlock(&ip->i_flags_lock); > > > > + rcu_read_unlock(); > > > > + /* Expire the grace period so we don't trip over it again. */ > > > > + synchronize_rcu(); > > > > > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock > > > that was held after removing the inode guarantee that an immediate retry > > > would manage not to find this same inode again? > > > > That is what I'm not sure of. I was more worried about resolving the > > contents of the radix tree nodes, not so much the inode itself. If a > > new traversal will resolve the tree correctly (which is what you are > > implying), then synchronize_rcu() is not needed.... [...] > > > If this is not the case, then readers finding it again will not be > > > protected by the RCU grace period, right? > > > > > > In short, I don't understand why the synchronize_rcu() is needed. 
> > > If it is somehow helping, that sounds to me like it is covering up > > > a real bug that should be fixed separately. > > > > It isn't covering up a bug, it was more trying to be consistent with > > the rest of the xfs_inode lookup failures - we back off and try > > again later. If that is unnecessary to resolve the RCU lookup race, > > then it can be dropped. The RCU radix tree should have the same type of causality semantics as, say, loading and storing a single word, if that helps think about it. So the favourite sequence: CPU 0: x = 1; smp_wmb(); y = 1; CPU 1: r2 = y; smp_rmb(); r1 = x; Then r2 == 1 implies r1 == 1. I.e. if we "see" something has happened on a CPU (from another CPU), then we will also see everything that has previously happened on that CPU (provided the correct barriers are there). CPU 0: radix_tree_delete(&tree, idx); smp_wmb(); y = 1; CPU 1: r2 = y; smp_rmb(); r1 = radix_tree_lookup(&tree, idx); So if we see r2 == 1, then r1 will be NULL. In this case, if you can observe something that has happened after the inode is removed from the tree (ie. i_ino has changed), then you should not find it in the tree after a subsequent lookup (no synchronize_rcu required, just appropriate locking or barriers). BTW. I wondered if you can also do the radix_tree tag lookup for reclaim under RCU? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking 2010-12-15 3:30 ` Nick Piggin @ 2010-12-15 6:35 ` Dave Chinner 2010-12-15 8:05 ` Nick Piggin 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2010-12-15 6:35 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-xfs On Wed, Dec 15, 2010 at 03:30:43AM +0000, Nick Piggin wrote: > In this case, if you can observe something that has happened after the > inode is removed from the tree (ie. i_ino has changed), then you should > not find it in the tree after a subsequent lookup (no synchronize_rcu > required, just appropriate locking or barriers). Ok, that's what I thought was supposed to be the case. Thanks for confirming that, Nick. > BTW. I wondered if you can also do the radix_tree tag lookup for reclaim > under RCU? It's currently under the ->pag_ici_lock using a radix_tree_gang_lookup_tag, though I think this was a mismerge bug from an earlier version. I intended it to be under RCU as the "inode OK for reclaim" validation checks won't touch an inode that already has XFS_IRECLAIM set (i.e. already under reclaim or freed), so the reliability of tag lookups is not a big deal. The lookup probably needs to check if XFS_IRECLAIMABLE is set (rather than asserting it is set) so as to only reclaim inodes that are really in the reclaimable state. Note that ->i_flags_lock controls all the state changes, so it should provide the necessary item memory barriers to ensure that only reclaimable inodes are found for reclaim. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
  2010-12-15  6:35 ` Dave Chinner
@ 2010-12-15  8:05 ` Nick Piggin
  0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2010-12-15 8:05 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, Nick Piggin

On Wed, Dec 15, 2010 at 05:35:34PM +1100, Dave Chinner wrote:
> On Wed, Dec 15, 2010 at 03:30:43AM +0000, Nick Piggin wrote:
> > In this case, if you can observe something that has happened after the
> > inode is removed from the tree (ie. i_ino has changed), then you should
> > not find it in the tree after a subsequent lookup (no synchronize_rcu
> > required, just appropriate locking or barriers).
>
> Ok, that's what I thought was supposed to be the case. Thanks
> for confirming that, Nick.
>
> > BTW. I wondered if you can also do the radix_tree tag lookup for reclaim
> > under RCU?
>
> It's currently under the ->pag_ici_lock using a
> radix_tree_gang_lookup_tag, though I think this was a mismerge bug
> from an earlier version.
>
> I intended it to be under RCU as the "inode OK for reclaim"
> validation checks won't touch an inode that already has XFS_IRECLAIM
> set (i.e. already under reclaim or freed), so the
> reliability of tag lookups is not a big deal.

That would be nice. A while back I was seeing lock contention in
reclaim there, and tried some hacks to improve it, but this should
solve it properly.

The tag lookups should be reliable too -- we use them in pagecache
dirty writeout. Basically the same causality rules apply: if we
observe (given the required barriers or locks) an event we know to
have happened after the tag was set and before it is cleared, then
the tag lookup should find the item.

> The lookup probably needs to check if XFS_IRECLAIMABLE is set
> (rather than asserting it is set) so as to only reclaim
> inodes that are really in the reclaimable state. Note that
> ->i_flags_lock controls all the state changes, so it should provide
> the necessary memory barriers to ensure that only reclaimable
> inodes are found for reclaim.

If you are checking and clearing those flags under the lock, then I
think everything will be fine.

	lock()
	set deleted flag
	unlock()
	delete from radix tree

versus

	look up from radix tree
	lock()
	check deleted flag
	...

Then at this point we would know that if the deleted flag is not set,
the item cannot have been deleted from the radix tree by the first
sequence. The stores that delete the item from the radix tree may be
visible before the deleted flag is visible (due to release-barrier
ordering), but that's not a problem here.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 14+ messages in thread
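Nick's two sequences can be modelled in user space to show why the deleted-flag check is sufficient. The names and the single-slot "tree" below are invented for the illustration; pthread mutexes stand in for the kernel spinlocks, whose acquire/release semantics provide the ordering he describes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct item {
	bool		deleted;
	pthread_mutex_t	lock;
};

/* A one-slot stand-in for the radix tree. */
static struct item *slot;

static void item_init(struct item *it)
{
	it->deleted = false;
	pthread_mutex_init(&it->lock, NULL);
}

/* Sequence 1: mark the item deleted under its lock, then unlink it. */
static void delete_item(struct item *it)
{
	pthread_mutex_lock(&it->lock);
	it->deleted = true;
	pthread_mutex_unlock(&it->lock);
	slot = NULL;	/* tree removal happens after the flag is set */
}

/*
 * Sequence 2: look the item up, then validate it under its lock.
 * If the item was found in the tree but the flag is clear, sequence 1
 * cannot have removed it yet: the flag store precedes the tree removal,
 * and the unlock/lock pair orders the two accesses.
 */
static struct item *lookup_item(void)
{
	struct item *it = slot;
	bool dead;

	if (!it)
		return NULL;	/* not found in the tree */
	pthread_mutex_lock(&it->lock);
	dead = it->deleted;
	pthread_mutex_unlock(&it->lock);
	return dead ? NULL : it;
}
```

As in the email, the interesting property is one-directional: a lookup can race with the removal and still return the item briefly, but it can never report a clear flag for an item that sequence 1 has already unlinked.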
* [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock
  2010-12-13  1:32 [PATCH 0/2] xfs: RCU inode freeing and lookups V3 Dave Chinner
  2010-12-13  1:32 ` [PATCH 1/3] xfs: rcu free inodes Dave Chinner
  2010-12-13  1:32 ` [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking Dave Chinner
@ 2010-12-13  1:32 ` Dave Chinner
  2010-12-14 21:19   ` Paul E. McKenney
  2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2010-12-13 1:32 UTC (permalink / raw)
To: xfs; +Cc: paulmck, eric.dumazet

From: Dave Chinner <dchinner@redhat.com>

Now that we are using RCU protection for the inode cache lookups, the
lock is only needed on the modification side. Hence it is not necessary
for the lock to be a rwlock as there are no read side holders anymore.
Convert it to a spin lock to reflect its exclusive nature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_sync.c |   14 +++++++-------
 fs/xfs/xfs_ag.h             |    2 +-
 fs/xfs/xfs_iget.c           |   10 +++++-----
 fs/xfs/xfs_mount.c          |    2 +-
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 5ee02d7..d3b3b4f 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -602,12 +602,12 @@ xfs_inode_set_reclaim_tag(
 	struct xfs_perag *pag;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
 	__xfs_inode_set_reclaim_tag(pag, ip);
 	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
 	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
 }
 
@@ -808,12 +808,12 @@ reclaim:
 	 * added to the tree assert that it's been there before to catch
 	 * problems with the inode life time early on.
 	 */
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 	if (!radix_tree_delete(&pag->pag_ici_root,
 				XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino)))
 		ASSERT(0);
 	__xfs_inode_clear_reclaim(pag, ip);
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 
 	/*
 	 * Here we do an (almost) spurious inode lock in order to coordinate
@@ -877,14 +877,14 @@ restart:
 		struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 		int i;
 
-		write_lock(&pag->pag_ici_lock);
+		spin_lock(&pag->pag_ici_lock);
 		nr_found = radix_tree_gang_lookup_tag(
 				&pag->pag_ici_root,
 				(void **)batch, first_index,
 				XFS_LOOKUP_BATCH,
 				XFS_ICI_RECLAIM_TAG);
 		if (!nr_found) {
-			write_unlock(&pag->pag_ici_lock);
+			spin_unlock(&pag->pag_ici_lock);
 			break;
 		}
 
@@ -911,7 +911,7 @@ restart:
 		}
 
 		/* unlock now we've grabbed the inodes. */
-		write_unlock(&pag->pag_ici_lock);
+		spin_unlock(&pag->pag_ici_lock);
 
 		for (i = 0; i < nr_found; i++) {
 			if (!batch[i])
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 63c7a1a..58632cc 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -227,7 +227,7 @@ typedef struct xfs_perag {
 
 	atomic_t	pagf_fstrms;	/* # of filestreams active in this AG */
 
-	rwlock_t	pag_ici_lock;	/* incore inode lock */
+	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
 	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 1e3b035..ec440da 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -263,7 +263,7 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
-		write_lock(&pag->pag_ici_lock);
+		spin_lock(&pag->pag_ici_lock);
 		spin_lock(&ip->i_flags_lock);
 		ip->i_flags &= ~(XFS_IRECLAIMABLE | XFS_IRECLAIM);
 		ip->i_flags |= XFS_INEW;
@@ -276,7 +276,7 @@ xfs_iget_cache_hit(
 				&xfs_iolock_active, "xfs_iolock_active");
 
 		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
+		spin_unlock(&pag->pag_ici_lock);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -354,7 +354,7 @@ xfs_iget_cache_miss(
 		BUG();
 	}
 
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 
 	/* insert the new inode */
 	error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
@@ -369,14 +369,14 @@ xfs_iget_cache_miss(
 	ip->i_udquot = ip->i_gdquot = NULL;
 	xfs_iflags_set(ip, XFS_INEW);
 
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
 
 	*ipp = ip;
 	return 0;
 
 out_preload_end:
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
 	if (lock_flags)
 		xfs_iunlock(ip, lock_flags);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fe27338..52186c0 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -598,7 +598,7 @@ xfs_initialize_perag(
 		goto out_unwind;
 	pag->pag_agno = index;
 	pag->pag_mount = mp;
-	rwlock_init(&pag->pag_ici_lock);
+	spin_lock_init(&pag->pag_ici_lock);
 	mutex_init(&pag->pag_ici_reclaim_lock);
 	INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 	spin_lock_init(&pag->pag_buf_lock);
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock
  2010-12-13  1:32 ` [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock Dave Chinner
@ 2010-12-14 21:19 ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2010-12-14 21:19 UTC (permalink / raw)
To: Dave Chinner; +Cc: eric.dumazet, xfs

On Mon, Dec 13, 2010 at 12:32:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that we are using RCU protection for the inode cache lookups,
> the lock is only needed on the modification side. Hence it is not
> necessary for the lock to be a rwlock as there are no read side
> holders anymore. Convert it to a spin lock to reflect its exclusive
> nature.

Looks good!

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Alex Elder <aelder@sgi.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> [...]

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads: [~2010-12-15 8:03 UTC | newest]

Thread overview: 14+ messages -- links below jump to the message on this page:

2010-12-13  1:32 [PATCH 0/2] xfs: RCU inode freeing and lookups V3 Dave Chinner
2010-12-13  1:32 ` [PATCH 1/3] xfs: rcu free inodes Dave Chinner
2010-12-14 20:23   ` Paul E. McKenney
2010-12-13  1:32 ` [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-12-14 21:18   ` Paul E. McKenney
2010-12-14 23:00     ` Dave Chinner
2010-12-15  1:05       ` Paul E. McKenney
2010-12-15  2:50         ` Dave Chinner
2010-12-15  6:34           ` Paul E. McKenney
2010-12-15  3:30       ` Nick Piggin
2010-12-15  6:35         ` Dave Chinner
2010-12-15  8:05           ` Nick Piggin
2010-12-13  1:32 ` [PATCH 3/3] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-12-14 21:19   ` Paul E. McKenney