public inbox for linux-xfs@vger.kernel.org
From: Alex Elder <aelder@sgi.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking
Date: Tue, 14 Sep 2010 16:23:41 -0500
Message-ID: <1284499421.9701.69.camel@doink>
In-Reply-To: <1284461777-1496-6-git-send-email-david@fromorbit.com>

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With delayed logging greatly increasing the sustained parallelism of inode
> operations, the inode cache locking is showing significant read vs write
> contention when inode reclaim runs at the same time as lookups. There are
> also a lot more write lock acquisitions than there are read locks (4:1 ratio)
> so the read locking is not really buying us much in the way of parallelism.
> 
> To avoid the read vs write contention, change the cache to use RCU locking on
> the read side. To avoid needing to RCU free every single inode, use the built
> in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> freed inodes, so ensure that every freed inode has an inode number of zero and
> the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in the cache hit
> lookup path, but add a check for a zero inode number as well.
> 
> We can then convert all the read locking lookups to use RCU read side locking
> and hence remove all read side locking.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

I confess that I'm a little less than solid on this, but
that's a comment on me, not your code.  (After writing
all this I feel a bit better.)

I'll try to describe my understanding and you can reassure
me all is well...  It's quite a lot, but I'll call attention
to two things to look for:  a question about something in
xfs_reclaim_inode(); and a comment related to
xfs_iget_cache_hit().


First, you are replacing the use of a single rwlock for
protecting access to the per-AG in-core inode radix tree
with RCU for readers and a spinlock for writers.

This initially seemed strange to me, and unsafe, but I
now think it's OK because:
- the spinlock protects against concurrent writers
  interfering with each other
- the rcu_read_lock() is sufficient for ensuring readers
  have valid pointers, because the underlying structure
  is a radix tree, which uses rcu_assign_pointer() in
  order to change anything in the tree.
I'm still unsettled about the protection readers have
against a concurrent writer, but it's probably just
because this particular usage is new to me.
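
A minimal userspace sketch of that reader/writer split may
help.  It is an analogy only: it uses C11 atomics instead of
the kernel primitives, a single pointer slot instead of a
radix tree, and every name in it (pub_node, pub_insert,
pub_lookup) is made up for illustration.  It does not model
grace periods or freeing at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an object hanging off one tree slot. */
struct pub_node {
	unsigned long	ino;
};

static _Atomic(struct pub_node *) pub_slot;	/* one "tree slot" */
static atomic_flag pub_lock = ATOMIC_FLAG_INIT;	/* writer spinlock */

/* Writer side: writers exclude each other with the spinlock. */
void pub_insert(struct pub_node *np)
{
	assert(np != NULL);
	while (atomic_flag_test_and_set_explicit(&pub_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	/*
	 * Release store, in the spirit of rcu_assign_pointer(): the
	 * node's contents are visible before the pointer is.
	 */
	atomic_store_explicit(&pub_slot, np, memory_order_release);
	atomic_flag_clear_explicit(&pub_lock, memory_order_release);
}

/* Reader side: takes no lock, in the spirit of rcu_dereference(). */
struct pub_node *pub_lookup(void)
{
	return atomic_load_explicit(&pub_slot, memory_order_acquire);
}
```

The point of the sketch is only that a reader never sees a
half-published pointer; what the reader may do with an object
that is concurrently being freed is a separate question.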


Second, you are exploiting the SLAB_DESTROY_BY_RCU
feature so that each inode doesn't have to wait for
an RCU grace period when it's freed.  To use that
we need to check for and recognize a freed inode
after looking it up, since we have no guarantee
the radix tree has been updated to reflect the free
until after an RCU grace period has passed.  So zeroing
the i_ino field and setting XFS_IRECLAIM handles that.
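
As a rough userspace sketch of that convention (not the
patch's code): the structure, the flag value and both helper
names below are hypothetical stand-ins, and only the two
conditions described above are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for XFS_IRECLAIM; the value here is made up. */
#define STALE_IRECLAIM	0x1u

/* Hypothetical cut-down inode with just the fields the check needs. */
struct stale_inode {
	unsigned long	i_ino;
	unsigned int	i_flags;
};

/* Free side: mark the object before it goes back to the slab. */
void stale_mark_freed(struct stale_inode *ip)
{
	assert(ip != NULL);
	ip->i_flags |= STALE_IRECLAIM;
	ip->i_ino = 0;
}

/* Lookup side: true if the object just found must be skipped. */
bool stale_inode_check(const struct stale_inode *ip)
{
	assert(ip != NULL);
	return ip->i_ino == 0 || (ip->i_flags & STALE_IRECLAIM) != 0;
}
```

Every lookup listed below needs an equivalent of the second
helper somewhere between finding the pointer and using it.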

So I see these lookups:
- Two gang lookups in xfs_inode_ag_lookup(), which
  is called only by xfs_inode_ag_walk(), in turn
  called only by xfs_inode_ag_iterator().  The
  check in this case has to happen in the "execute"
  function passed in to xfs_inode_ag_walk() via
  xfs_inode_ag_iterator().  The affected functions
  are:
    - xfs_sync_inode_data().  This one calls
      xfs_sync_inode_valid() right away, which in
      your change now checks for a zero i_ino.
    - xfs_sync_inode_attr().  Same as above,
      handled by xfs_sync_inode_valid().
    - xfs_reclaim_inode().  This one should
      be fine, because it already has a test
      for the XFS_IRECLAIM flag being set, and
      ignores the inode if it is.  However, it
      has this line also:
        ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
      Your change doesn't set XFS_IRECLAIMABLE, so
*     I imagine if we get here inside that RCU window
*     we'd have a problem.  Am I wrong about this?
    - xfs_dqrele_inode().  This one again calls
      xfs_sync_inode_valid(), so should be covered.
- A lookup in xfs_iget().  This is handled by
  your change, by looking for a zero i_ino in
* xfs_iget_cache_hit().  (Please see the comment
  on this function in-line, below.)
- A lookup in xfs_ifree_cluster().  Handled by
  your change (now checks for zero i_ino).
- And a gang lookup in xfs_iflush_cluster().  This
  one is handled by your change (now checks each
  inode for a zero i_ino field).

OK, so I think that covers everything, but I have
that one question about xfs_reclaim_inode(), and
then I have one more comment below.



Despite all my commentary above...  The patch looks
good (consistent) to me.  I'm interested to hear
your feedback though.  Unless something major
changes, or I'm fundamentally misguided about
this stuff, you can consider it:

Reviewed-by: Alex Elder <aelder@sgi.com>


> ---
>  fs/xfs/linux-2.6/kmem.h        |    1 +
>  fs/xfs/linux-2.6/xfs_super.c   |    3 ++-
>  fs/xfs/linux-2.6/xfs_sync.c    |   12 ++++++------
>  fs/xfs/quota/xfs_qm_syscalls.c |    4 ++--

. . .

> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index b1ecc6f..f3a46b6 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c

. . .

> @@ -145,12 +153,26 @@ xfs_iget_cache_hit(
>  	struct xfs_perag	*pag,
>  	struct xfs_inode	*ip,
>  	int			flags,
> -	int			lock_flags) __releases(pag->pag_ici_lock)
> +	int			lock_flags) __releases(RCU)
>  {
>  	struct inode		*inode = VFS_I(ip);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error;
>  
> +	/*
> +	 * check for re-use of an inode within an RCU grace period due to the
> +	 * radix tree nodes not being updated yet. We monitor for this by
> +	 * setting the inode number to zero before freeing the inode structure.
> +	 */
> +	if (ip->i_ino == 0) {
> +		trace_xfs_iget_skip(ip);
> +		XFS_STATS_INC(xs_ig_frecycle);
> +		rcu_read_unlock();
> +		/* Expire the grace period so we don't trip over it again. */
> +		synchronize_rcu();

Since you're waiting for the end of the grace period here,
it seems a shame that the caller (xfs_iget()) will still
end up calling delay(1) before trying again.  It would
be nice if the delay could be avoided in that case.
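
One possible shape for that, as a userspace sketch only: the
callee reports whether it already waited, and the caller
sleeps only when it did not.  The "waited" out-parameter and
all the retry_* names are hypothetical; nothing here is taken
from xfs_iget() itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

int retry_calls;	/* lookup attempts made so far */
int retry_delays;	/* how many times the caller slept */

/* Stand-in for the delay(1) call in the caller. */
void retry_delay(void)
{
	retry_delays++;
}

/*
 * Stand-in for the cache-hit path: the first attempt sees a freed
 * inode, waits out the grace period itself, and says so via *waited.
 */
int retry_cache_hit(bool *waited)
{
	assert(waited != NULL);
	*waited = false;
	if (retry_calls++ == 0) {
		*waited = true;	/* synchronize_rcu() would go here */
		return EAGAIN;
	}
	return 0;
}

/* Caller: retry on EAGAIN, but sleep only if the callee did not wait. */
int retry_iget(void)
{
	bool waited;
	int error;

	while ((error = retry_cache_hit(&waited)) == EAGAIN) {
		if (!waited)
			retry_delay();
	}
	return error;
}
```

A distinct return value would do the same job as the extra
parameter; either way the second wait is skipped.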

> +		return EAGAIN;
> +	}
> +
>  	spin_lock(&ip->i_flags_lock);
>  
>  	/*

. . .

