From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: eric.dumazet@gmail.com, xfs@oss.sgi.com
Subject: Re: [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking
Date: Mon, 8 Nov 2010 19:36:28 -0800 [thread overview]
Message-ID: <20101109033628.GN4032@linux.vnet.ibm.com> (raw)
In-Reply-To: <20101108230929.GA13299@infradead.org>
On Mon, Nov 08, 2010 at 06:09:29PM -0500, Christoph Hellwig wrote:
> This patch generally looks good to me, but with so much RCU magic I'd prefer
> if Paul & Eric could look over it.
Is there a git tree, tarball, or whatever? For example, I don't see
how this patch handles the case of an inode being freed just as an RCU
reader gains a reference to it, but then reallocated as some other inode
(so that ->ino is nonzero) before the RCU reader gets a chance to actually
look at the inode. But such a check might well be in the code that this
patch didn't change...
Thanx, Paul
> On Mon, Nov 08, 2010 at 07:55:10PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > With delayed logging greatly increasing the sustained parallelism of inode
> > operations, the inode cache locking is showing significant read vs write
> > contention when inode reclaim runs at the same time as lookups. There is
> > also a lot more write lock acquistions than there are read locks (4:1 ratio)
> > so the read locking is not really buying us much in the way of parallelism.
> >
> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so en??ure that ever freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> > lookup path, but also add a check for a zero inode number as well.
> >
> > We canthen convert all the read locking lockups to use RCU read side locking
> > and hence remove all read side locking.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Alex Elder <aelder@sgi.com>
> > ---
> > fs/xfs/linux-2.6/xfs_iops.c | 7 +++++-
> > fs/xfs/linux-2.6/xfs_sync.c | 13 +++++++++--
> > fs/xfs/quota/xfs_qm_syscalls.c | 3 ++
> > fs/xfs/xfs_iget.c | 44 ++++++++++++++++++++++++++++++---------
> > fs/xfs/xfs_inode.c | 22 ++++++++++++-------
> > 5 files changed, 67 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
> > index 8b46867..909bd9c 100644
> > --- a/fs/xfs/linux-2.6/xfs_iops.c
> > +++ b/fs/xfs/linux-2.6/xfs_iops.c
> > @@ -757,6 +757,8 @@ xfs_diflags_to_iflags(
> > * We don't use the VFS inode hash for lookups anymore, so make the inode look
> > * hashed to the VFS by faking it. This avoids needing to touch inode hash
> > * locks in this path, but makes the VFS believe the inode is validly hashed.
> > + * We initialise i_state and i_hash under the i_lock so that we follow the same
> > + * setup rules that the rest of the VFS follows.
> > */
> > void
> > xfs_setup_inode(
> > @@ -765,10 +767,13 @@ xfs_setup_inode(
> > struct inode *inode = &ip->i_vnode;
> >
> > inode->i_ino = ip->i_ino;
> > +
> > + spin_lock(&inode->i_lock);
> > inode->i_state = I_NEW;
> > + hlist_nulls_add_fake(&inode->i_hash);
> > + spin_unlock(&inode->i_lock);
>
> This screams for another VFS helper, even if it's XFS-specific for now.
> Having to duplicate inode.c-private locking rules in XFS seems a bit
> nasty to me.
>
> >
> > inode_sb_list_add(inode);
> > - hlist_nulls_add_fake(&inode->i_hash);
> >
> > inode->i_mode = ip->i_d.di_mode;
> > inode->i_nlink = ip->i_d.di_nlink;
> > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > index afb0d7c..9a53cc9 100644
> > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > @@ -53,6 +53,10 @@ xfs_inode_ag_walk_grab(
> > {
> > struct inode *inode = VFS_I(ip);
> >
> > + /* check for stale RCU freed inode */
> > + if (!ip->i_ino)
> > + return ENOENT;
>
> Assuming i_ino is never 0 is fine for XFS, unlike for the generic VFS
> code, so ACK.
>
> > /* nothing to sync during shutdown */
> > if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> > return EFSCORRUPTED;
> > @@ -98,12 +102,12 @@ restart:
> > int error = 0;
> > int i;
> >
> > - read_lock(&pag->pag_ici_lock);
> > + rcu_read_lock();
> > nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
> > (void **)batch, first_index,
> > XFS_LOOKUP_BATCH);
> > if (!nr_found) {
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> > break;
> > }
> >
> > @@ -129,7 +133,7 @@ restart:
> > }
> >
> > /* unlock now we've grabbed the inodes. */
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> >
> > for (i = 0; i < nr_found; i++) {
> > if (!batch[i])
> > @@ -639,6 +643,9 @@ xfs_reclaim_inode_grab(
> > struct xfs_inode *ip,
> > int flags)
> > {
> > + /* check for stale RCU freed inode */
> > + if (!ip->i_ino)
> > + return 1;
> >
> > /*
> > * do some unlocked checks first to avoid unnecceary lock traffic.
> > diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
> > index bdebc18..8b207fc 100644
> > --- a/fs/xfs/quota/xfs_qm_syscalls.c
> > +++ b/fs/xfs/quota/xfs_qm_syscalls.c
> > @@ -875,6 +875,9 @@ xfs_dqrele_inode(
> > struct xfs_perag *pag,
> > int flags)
> > {
> > + if (!ip->i_ino)
> > + return ENOENT;
> > +
>
> Why do we need the check here again? Having it in
> xfs_inode_ag_walk_grab should be enough.
>
> > /* skip quota inodes */
> > if (ip == ip->i_mount->m_quotainfo->qi_uquotaip ||
> > ip == ip->i_mount->m_quotainfo->qi_gquotaip) {
> > diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> > index 18991a9..edeb918 100644
> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> > @@ -69,6 +69,7 @@ xfs_inode_alloc(
> > ASSERT(atomic_read(&ip->i_pincount) == 0);
> > ASSERT(!spin_is_locked(&ip->i_flags_lock));
> > ASSERT(completion_done(&ip->i_flush));
> > + ASSERT(ip->i_ino == 0);
> >
> > mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
> >
> > @@ -86,9 +87,6 @@ xfs_inode_alloc(
> > ip->i_new_size = 0;
> > ip->i_dirty_releases = 0;
> >
> > - /* prevent anyone from using this yet */
> > - VFS_I(ip)->i_state = I_NEW;
> > -
> > return ip;
> > }
> >
> > @@ -135,6 +133,16 @@ xfs_inode_free(
> > ASSERT(!spin_is_locked(&ip->i_flags_lock));
> > ASSERT(completion_done(&ip->i_flush));
> >
> > + /*
> > + * because we use SLAB_DESTROY_BY_RCU freeing, ensure the inode
> > + * always appears to be reclaimed with an invalid inode number
> > + * when in the free state. The ip->i_flags_lock provides the barrier
> > + * against lookup races.
> > + */
> > + spin_lock(&ip->i_flags_lock);
> > + ip->i_flags = XFS_IRECLAIM;
> > + ip->i_ino = 0;
> > + spin_unlock(&ip->i_flags_lock);
> > kmem_zone_free(xfs_inode_zone, ip);
> > }
> >
> > @@ -146,12 +154,28 @@ xfs_iget_cache_hit(
> > struct xfs_perag *pag,
> > struct xfs_inode *ip,
> > int flags,
> > - int lock_flags) __releases(pag->pag_ici_lock)
> > + int lock_flags) __releases(RCU)
> > {
> > struct inode *inode = VFS_I(ip);
> > struct xfs_mount *mp = ip->i_mount;
> > int error;
> >
> > + /*
> > + * check for re-use of an inode within an RCU grace period due to the
> > + * radix tree nodes not being updated yet. We monitor for this by
> > + * setting the inode number to zero before freeing the inode structure.
> > + * We don't need to recheck this after taking the i_flags_lock because
> > + * the check against XFS_IRECLAIM will catch a freed inode.
> > + */
> > + if (ip->i_ino == 0) {
> > + trace_xfs_iget_skip(ip);
> > + XFS_STATS_INC(xs_ig_frecycle);
> > + rcu_read_unlock();
> > + /* Expire the grace period so we don't trip over it again. */
> > + synchronize_rcu();
> > + return EAGAIN;
> > + }
> > +
> > spin_lock(&ip->i_flags_lock);
> >
> > /*
> > @@ -195,7 +219,7 @@ xfs_iget_cache_hit(
> > ip->i_flags |= XFS_IRECLAIM;
> >
> > spin_unlock(&ip->i_flags_lock);
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> >
> > error = -inode_init_always(mp->m_super, inode);
> > if (error) {
> > @@ -203,7 +227,7 @@ xfs_iget_cache_hit(
> > * Re-initializing the inode failed, and we are in deep
> > * trouble. Try to re-add it to the reclaim list.
> > */
> > - read_lock(&pag->pag_ici_lock);
> > + rcu_read_lock();
> > spin_lock(&ip->i_flags_lock);
> >
> > ip->i_flags &= ~XFS_INEW;
> > @@ -231,7 +255,7 @@ xfs_iget_cache_hit(
> >
> > /* We've got a live one. */
> > spin_unlock(&ip->i_flags_lock);
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> > trace_xfs_iget_hit(ip);
> > }
> >
> > @@ -245,7 +269,7 @@ xfs_iget_cache_hit(
> >
> > out_error:
> > spin_unlock(&ip->i_flags_lock);
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> > return error;
> > }
> >
> > @@ -376,7 +400,7 @@ xfs_iget(
> >
> > again:
> > error = 0;
> > - read_lock(&pag->pag_ici_lock);
> > + rcu_read_lock();
> > ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> >
> > if (ip) {
> > @@ -384,7 +408,7 @@ again:
> > if (error)
> > goto out_error_or_again;
> > } else {
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> > XFS_STATS_INC(xs_ig_missed);
> >
> > error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 108c7a0..25becb1 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2000,13 +2000,14 @@ xfs_ifree_cluster(
> > */
> > for (i = 0; i < ninodes; i++) {
> > retry:
> > - read_lock(&pag->pag_ici_lock);
> > + rcu_read_lock();
> > ip = radix_tree_lookup(&pag->pag_ici_root,
> > XFS_INO_TO_AGINO(mp, (inum + i)));
> >
> > /* Inode not in memory or stale, nothing to do */
> > - if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
> > - read_unlock(&pag->pag_ici_lock);
> > + if (!ip || !ip->i_ino ||
> > + xfs_iflags_test(ip, XFS_ISTALE)) {
> > + rcu_read_unlock();
> > continue;
> > }
> >
> > @@ -2019,11 +2020,11 @@ retry:
> > */
> > if (ip != free_ip &&
> > !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> > delay(1);
> > goto retry;
> > }
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> >
> > xfs_iflock(ip);
> > xfs_iflags_set(ip, XFS_ISTALE);
> > @@ -2629,7 +2630,7 @@ xfs_iflush_cluster(
> >
> > mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
> > first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> > - read_lock(&pag->pag_ici_lock);
> > + rcu_read_lock();
> > /* really need a gang lookup range call here */
> > nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
> > first_index, inodes_per_cluster);
> > @@ -2640,6 +2641,11 @@ xfs_iflush_cluster(
> > iq = ilist[i];
> > if (iq == ip)
> > continue;
> > +
> > + /* check we've got a valid inode */
> > + if (!iq->i_ino)
> > + continue;
> > +
> > /* if the inode lies outside this cluster, we're done. */
> > if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index)
> > break;
> > @@ -2692,7 +2698,7 @@ xfs_iflush_cluster(
> > }
> >
> > out_free:
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> > kmem_free(ilist);
> > out_put:
> > xfs_perag_put(pag);
> > @@ -2704,7 +2710,7 @@ cluster_corrupt_out:
> > * Corruption detected in the clustering loop. Invalidate the
> > * inode buffer and shut down the filesystem.
> > */
> > - read_unlock(&pag->pag_ici_lock);
> > + rcu_read_unlock();
> > /*
> > * Clean up the buffer. If it was B_DELWRI, just release it --
> > * brelse can handle it with no problems. If not, shut down the
> > --
> > 1.7.2.3
> >
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> ---end quoted text---
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2010-11-09 3:35 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-08 8:55 [PATCH 00/16] xfs: current patch stack for 2.6.38 window Dave Chinner
2010-11-08 8:55 ` [PATCH 01/16] xfs: fix per-ag reference counting in inode reclaim tree walking Dave Chinner
2010-11-08 9:23 ` Christoph Hellwig
2010-11-08 8:55 ` [PATCH 02/16] xfs: move delayed write buffer trace Dave Chinner
2010-11-08 9:24 ` Christoph Hellwig
2010-11-08 8:55 ` [PATCH 03/16] [RFC] xfs: use generic per-cpu counter infrastructure Dave Chinner
2010-11-08 12:13 ` Christoph Hellwig
2010-11-09 0:20 ` Dave Chinner
2010-11-08 8:55 ` [PATCH 04/16] xfs: dynamic speculative EOF preallocation Dave Chinner
2010-11-08 11:43 ` Christoph Hellwig
2010-11-09 0:08 ` Dave Chinner
2010-11-08 8:55 ` [PATCH 05/16] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
2010-11-08 11:36 ` Christoph Hellwig
2010-11-08 23:56 ` Dave Chinner
2010-11-08 8:55 ` [PATCH 06/16] patch xfs-inode-hash-fake Dave Chinner
2010-11-08 9:19 ` Christoph Hellwig
2010-11-08 8:55 ` [PATCH 07/16] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-11-08 23:09 ` Christoph Hellwig
2010-11-09 0:24 ` Dave Chinner
2010-11-09 3:36 ` Paul E. McKenney [this message]
2010-11-09 5:04 ` Dave Chinner
2010-11-10 5:12 ` Paul E. McKenney
2010-11-10 6:20 ` Dave Chinner
2010-11-08 8:55 ` [PATCH 08/16] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-11-08 23:10 ` Christoph Hellwig
2010-11-08 8:55 ` [PATCH 09/16] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
2010-11-08 8:55 ` [PATCH 10/16] xfs: add a lru to the XFS buffer cache Dave Chinner
2010-11-08 23:19 ` Christoph Hellwig
2010-11-08 23:45 ` Dave Chinner
2010-11-08 8:55 ` [PATCH 11/16] xfs: connect up buffer reclaim priority hooks Dave Chinner
2010-11-08 11:25 ` Christoph Hellwig
2010-11-08 23:50 ` Dave Chinner
2010-11-08 8:55 ` [PATCH 12/16] xfs: bulk AIL insertion during transaction commit Dave Chinner
2010-11-08 8:55 ` [PATCH 13/16] xfs: reduce the number of AIL push wakeups Dave Chinner
2010-11-08 11:32 ` Christoph Hellwig
2010-11-08 23:51 ` Dave Chinner
2010-11-08 8:55 ` [PATCH 14/16] xfs: remove all the inodes on a buffer from the AIL in bulk Dave Chinner
2010-11-08 8:55 ` [PATCH 15/16] xfs: only run xfs_error_test if error injection is active Dave Chinner
2010-11-08 11:33 ` Christoph Hellwig
2010-11-08 8:55 ` [PATCH 16/16] xfs: make xlog_space_left() independent of the grant lock Dave Chinner
2010-11-08 14:17 ` [PATCH 00/16] xfs: current patch stack for 2.6.38 window Christoph Hellwig
2010-11-09 0:21 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101109033628.GN4032@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=eric.dumazet@gmail.com \
--cc=hch@infradead.org \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox