All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alex Elder <aelder@sgi.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH 4/4] xfs: convert buffer cache hash to rbtree
Date: Mon, 13 Sep 2010 11:53:49 -0500	[thread overview]
Message-ID: <1284396829.2404.52.camel@doink> (raw)
In-Reply-To: <1283958778-28610-5-git-send-email-david@fromorbit.com>

On Thu, 2010-09-09 at 01:12 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache hash is starting to show typical hash scalability
> problems.  large scale testing is showing the number of cached items
> growing far larger than the hash can efficiently handle. Hence we
> need to move to a self-scaling cache indexing mechanism.
> 
> I have selected rbtrees for indexing becuse they can have O(log n)
> search scalability, and insert and remove cost is not excessive,
> even on large trees. Hence we should be able to cache large numbers
> of buffers without incurring the excessive cache miss search
> penalties that the hash is imposing on us.
> 
> To ensure we still have parallel access to the cache, we need
> multiple trees. Rather than hashing the buffers by disk address to
> select a tree, it seems more sensible to separate trees by typical
> access patterns. Most operations use buffers from within a single AG
> at a time, so rather than searching lots of different lists,
> separate the buffer indexes out into per-AG rbtrees. This means that
> searches during metadata operation have a much higher chance of
> hitting cache resident nodes, and that updates of the tree are less
> likely to disturb trees being accessed on other CPUs doing
> independent operations.

I found a bug in this, and have a bunch of other less
important comments for you to consider.  I haven't spent
any time contemplating your decision to use RB trees on
AG's but it seems quite reasonable.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c   |  139 +++++++++++++++++++++--------------------
>  fs/xfs/linux-2.6/xfs_buf.h   |   14 ++--
>  fs/xfs/linux-2.6/xfs_super.c |    6 +-
>  fs/xfs/xfs_ag.h              |    4 +
>  fs/xfs/xfs_mount.c           |    4 +-
>  5 files changed, 87 insertions(+), 80 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index b2b5dea..facd37e 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c

. . .

> @@ -432,14 +434,36 @@ _xfs_buf_find(
>  	ASSERT(!(range_length < (1 << btp->bt_sshift)));
>  	ASSERT(!(range_base & (xfs_off_t)btp->bt_smask));
>  
> -	hash = &btp->bt_hash[hash_long((unsigned long)ioff, btp->bt_hashshift)];
> -
> -	spin_lock(&hash->bh_lock);
> -
> -	list_for_each_entry_safe(bp, n, &hash->bh_list, b_hash_list) {
> -		ASSERT(btp == bp->b_target);
> -		if (bp->b_file_offset == range_base &&
> -		    bp->b_buffer_length == range_length) {
> +	/* get tree root */
> +	pag = xfs_perag_get(btp->bt_mp, xfs_daddr_to_agno(btp->bt_mp, ioff));
> +
> +	/* walk tree */
> +	spin_lock(&pag->pagbuf_lock);
> +	rbp = &pag->pagbuf_tree.rb_node;
> +	parent = NULL;

Could drop the above assignment and change the while
statement to read:  while ((parent = *rbp)) {
(I know that leads to a mildly controversial style.)

> +	bp = NULL;
> +	while (*rbp) {
> +		parent = *rbp;
> +		bp = rb_entry(parent, struct xfs_buf, b_rbnode);

Below here you might as well make use of the
value of "parent" in places where (*rbp) is used.
I realize you're just mimicking what's in the
rbtree.h file though...  If the result seems to
read funny, maybe you could rename "parent" to
be "entry" or something.

Here's the BUG:

> +		if (bp->b_file_offset < range_base)
> +			rbp = &(*rbp)->rb_left;
> +		else if (bp->b_file_offset > range_base)
> +			rbp = &(*rbp)->rb_right;

Your comparisons here are reversed.  In other words,
I believe it should be:

	if (range_base < bp->b_file_offset)	
		rbp = &parent->rb_left;
	else if (range_base > bp->b_file_offset)
		rbp = &parent->rb_right;

Maybe it mostly works in the order you have it,
but I'm pretty sure it's wrong.  (The "continue
searching to the right" below would be wrong though.)


> +		else {
> +			/*
> +			 * found a block offset match. If the range doesn't
> +			 * match, the only way this is allowed is if the buffer
> +			 * in the cache is stale and the transaction that made
> +			 * it stale has not yet committed. i.e. we are
> +			 * reallocating a busy extent. Skip this buffer and
> +			 * continue searching to the right for an exact match.
> +			 */
> +			if (bp->b_buffer_length != range_length) {
> +				ASSERT(bp->b_flags & XBF_STALE);

This assertion is new.  I trust it's correct, but
maybe it should go in separately (first).

> +				rbp = &(*rbp)->rb_right;
> +				continue;
> +			}
>  			atomic_inc(&bp->b_hold);
>  			goto found;
>  		}

When I first read the next hunk I thought you were
mistakenly not dropping the perag reference.  Later
I realized it was intentional--that the buffer holds
a reference on the perag until it (the buffer) is
released.  This is a minor point but I think it would
be helpful to see a comment explaining that.

> @@ -449,17 +473,20 @@ _xfs_buf_find(
>  	if (new_bp) {
>  		_xfs_buf_initialize(new_bp, btp, range_base,
>  				range_length, flags);
> -		new_bp->b_hash = hash;
> -		list_add(&new_bp->b_hash_list, &hash->bh_list);
> +		new_bp->b_pag = pag;
> +		rb_link_node(&new_bp->b_rbnode, parent, rbp);
> +		rb_insert_color(&new_bp->b_rbnode, &pag->pagbuf_tree);
> +		spin_unlock(&pag->pagbuf_lock);
>  	} else {
>  		XFS_STATS_INC(xb_miss_locked);
> +		spin_unlock(&pag->pagbuf_lock);
> +		xfs_perag_put(pag);
>  	}
> -
> -	spin_unlock(&hash->bh_lock);
>  	return new_bp;
>  
>  found:
> -	spin_unlock(&hash->bh_lock);
> +	spin_unlock(&pag->pagbuf_lock);
> +	xfs_perag_put(pag);
>  
>  	/* Attempt to get the semaphore without sleeping,
>  	 * if this does not work then we need to drop the
> @@ -810,27 +837,30 @@ void
>  xfs_buf_rele(
>  	xfs_buf_t		*bp)
>  {
> -	xfs_bufhash_t		*hash = bp->b_hash;
> +	struct xfs_perag	*pag = bp->b_pag;
>  
>  	trace_xfs_buf_rele(bp, _RET_IP_);
>  
> -	if (unlikely(!hash)) {
> +	if (!pag) {

I know this is not related to your change, but I'm
curious whether this can even happen, and if so, when?

>  		ASSERT(!bp->b_relse);
> +		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
>  		if (atomic_dec_and_test(&bp->b_hold))
>  			xfs_buf_free(bp);
>  		return;
>  	}
>  
> +	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
>  	ASSERT(atomic_read(&bp->b_hold) > 0);
> -	if (atomic_dec_and_lock(&bp->b_hold, &hash->bh_lock)) {
> +	if (atomic_dec_and_lock(&bp->b_hold, &pag->pagbuf_lock)) {
>  		if (bp->b_relse) {
>  			atomic_inc(&bp->b_hold);
> -			spin_unlock(&hash->bh_lock);
> +			spin_unlock(&pag->pagbuf_lock);
>  			(*(bp->b_relse)) (bp);

Drop the excess parentheses here (above) while you're at it.

>  		} else {
>  			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
> -			list_del_init(&bp->b_hash_list);
> -			spin_unlock(&hash->bh_lock);
> +			rb_erase(&bp->b_rbnode, &pag->pagbuf_tree);
> +			spin_unlock(&pag->pagbuf_lock);
> +			xfs_perag_put(pag);
>  			xfs_buf_free(bp);
>  		}
>  	}
> @@ -1433,57 +1463,25 @@ xfs_buf_iomove(
>   */
>  void
>  xfs_wait_buftarg(
> -	xfs_buftarg_t	*btp)
> +	struct xfs_buftarg	*btp)
>  {
> -	xfs_bufhash_t	*hash;
> -	uint		i;
> +	struct xfs_perag	*pag;
> +	uint			i;
>  
> -	for (i = 0; i < (1 << btp->bt_hashshift); i++) {
> -		hash = &btp->bt_hash[i];
> -		spin_lock(&hash->bh_lock);
> -		while (!list_empty(&hash->bh_list)) {
> -			spin_unlock(&hash->bh_lock);
> +	for (i = 0; i < btp->bt_mp->m_sb.sb_agcount; i++) {
> +		pag = xfs_perag_get(btp->bt_mp, i);
> +		spin_lock(&pag->pagbuf_lock);
> +		while (rb_first(&pag->pagbuf_tree)) {
> +			spin_unlock(&pag->pagbuf_lock);
>  			delay(100);
> -			spin_lock(&hash->bh_lock);
> +			spin_lock(&pag->pagbuf_lock);
>  		}
> -		spin_unlock(&hash->bh_lock);
> +		spin_unlock(&pag->pagbuf_lock);
> +		xfs_perag_put(pag);
>  	}
>  }

This suggestion is again not related to your change, but...
Would this basic structure (not tested) be better than
the above?

	int more;

	do {
		more = 0;
		for (i = 0; i < btp->bt_mp->m_sb.sb_agcount; i++) {
			pag = xfs_perag_get(btp->bt_mp, i);
			spin_lock(&pag->pagbuf_lock);
			if (rb_first(&pag->pagbuf_tree))
				more++;
			spin_unlock(&pag->pagbuf_lock);
			xfs_perag_put(pag);
		}
		if (more)
			delay(100);
	} while (more);
>  
>  /*
> - *	Allocate buffer hash table for a given target.
> - *	For devices containing metadata (i.e. not the log/realtime devices)
> - *	we need to allocate a much larger hash table.

. . . 

> @@ -1645,6 +1643,12 @@ xfs_alloc_buftarg(
>  
>  	btp = kmem_zalloc(sizeof(*btp), KM_SLEEP);
>  
> +	/*
> +	 * The buftarg cache should never be used by external devices.

I suggest that "buftarg cache" is not a good name for the
(now per-AG) buffer cache.

> +	 * Ensure we catch any users with extreme prejudice.
> +	 */
> +	btp->bt_mp = external ? NULL : mp;
> +
>  	btp->bt_dev =  bdev->bd_dev;
>  	btp->bt_bdev = bdev;
>  	if (xfs_setsize_buftarg_early(btp, bdev))
> @@ -1653,7 +1657,6 @@ xfs_alloc_buftarg(
>  		goto error;
>  	if (xfs_alloc_delwrite_queue(btp, fsname))
>  		goto error;
> -	xfs_alloc_bufhash(btp, external);
>  	return btp;
>  
>  error:
> diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
> index 802dc5e..3797ee8 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.h
> +++ b/fs/xfs/linux-2.6/xfs_buf.h
. . .
> @@ -171,8 +170,8 @@ typedef struct xfs_buf {
>  	wait_queue_head_t	b_waiters;	/* unpin waiters */
>  	struct list_head	b_list;
>  	xfs_buf_flags_t		b_flags;	/* status flags */
> -	struct list_head	b_hash_list;	/* hash table list */
> -	xfs_bufhash_t		*b_hash;	/* hash table list start */
> +	struct rb_node		b_rbnode;	/* rbtree node */
> +	struct xfs_perag	*b_pag;		/* rbtree root */

Rather, "contains rbtree root" (in the comment).

>  	xfs_buftarg_t		*b_target;	/* buffer target (device) */
>  	atomic_t		b_hold;		/* reference count */
>  	xfs_daddr_t		b_bn;		/* block number for I/O */

. . . 

> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
> index 4917d4e..e01d4cf 100644
> --- a/fs/xfs/xfs_ag.h
> +++ b/fs/xfs/xfs_ag.h
> @@ -230,6 +230,10 @@ typedef struct xfs_perag {
>  	rwlock_t	pag_ici_lock;	/* incore inode lock */
>  	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
>  	int		pag_ici_reclaimable;	/* reclaimable inodes */
> +
> +	/* buffer cache index */
> +	spinlock_t	pagbuf_lock;	/* lock for pagbuf_tree */
> +	struct rb_root	pagbuf_tree;	/* ordered tree of active buffers */

I know it's in some way consistent with the rest of the
naming scheme for fields in this structure, maybe these
could be named pag_buf_lock and pag_buf_tree.  (The rest
of the fields here have a sort of strange convention and
maybe they've got strong enough history that they should
be left alone.)

>  #endif
>  	int		pagb_count;	/* pagb slots in use */
>  } xfs_perag_t;

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2010-09-13 16:53 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-08 15:12 [RFC] [PATCH 0/4] Replace buffer cache hash with rbtrees Dave Chinner
2010-09-08 15:12 ` [PATCH 1/4] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
2010-09-09  1:33   ` Christoph Hellwig
2010-09-10  3:10     ` Dave Chinner
2010-09-10 21:17   ` Alex Elder
2010-09-08 15:12 ` [PATCH 2/4] xfs: use unhashed buffers for size checks Dave Chinner
2010-09-09  1:38   ` Christoph Hellwig
2010-09-10  3:14     ` Dave Chinner
2010-09-10 21:33   ` Alex Elder
2010-09-08 15:12 ` [PATCH 3/4] xfs: remove buftarg hash for external devices Dave Chinner
2010-09-09  1:39   ` Christoph Hellwig
2010-09-08 15:12 ` [PATCH 4/4] xfs: convert buffer cache hash to rbtree Dave Chinner
2010-09-09  1:51   ` Christoph Hellwig
2010-09-10  3:22     ` Dave Chinner
2010-09-13 16:59     ` Alex Elder
2010-09-13 16:53   ` Alex Elder [this message]
2010-09-14  7:13     ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1284396829.2404.52.camel@doink \
    --to=aelder@sgi.com \
    --cc=david@fromorbit.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.