public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/2] xfs: Fix issues found doing large inode count testing
@ 2010-08-23  0:15 Dave Chinner
  2010-08-23  0:15 ` [PATCH 1/2] xfs: don't do memory allocation under the CIL context lock Dave Chinner
  2010-08-23  0:15 ` [PATCH 2/2] xfs: improve buffer cache hash scalability Dave Chinner
  0 siblings, 2 replies; 6+ messages in thread
From: Dave Chinner @ 2010-08-23  0:15 UTC (permalink / raw)
  To: xfs

Two fixes - one that prevents stalls in the CIL commit subsystem when memory is
low, and one that brings CPU usage of the buffer cache hash down to something
more sensible.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] xfs: don't do memory allocation under the CIL context lock
  2010-08-23  0:15 [PATCH 0/2] xfs: Fix issues found doing large inode count testing Dave Chinner
@ 2010-08-23  0:15 ` Dave Chinner
  2010-08-23 10:32   ` Christoph Hellwig
  2010-08-23  0:15 ` [PATCH 2/2] xfs: improve buffer cache hash scalability Dave Chinner
  1 sibling, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2010-08-23  0:15 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Formatting items requires memory allocation when using delayed
logging. Currently that memory allocation is done while holding the
CIL context lock in read mode. This means that if memory allocation
takes some time (e.g. enters reclaim), we cannot push on the CIL
until the allocation(s) required by formatting complete. This can
stall CIL pushes for some time, and once a push is stalled so are
all new transaction commits.

Fix this by splitting the item formatting into two steps. The first
step, which does the allocation and memcpy() into the allocated
buffer, is now done outside the CIL context lock, and only the CIL
insert is done inside the CIL context lock. This avoids the stall
issue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c |   34 ++++++++++++++++++++++++++--------
 1 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 9768f24..ed575fb 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -270,15 +270,10 @@ xlog_cil_insert(
 static void
 xlog_cil_format_items(
 	struct log		*log,
-	struct xfs_log_vec	*log_vector,
-	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*start_lsn)
+	struct xfs_log_vec	*log_vector)
 {
 	struct xfs_log_vec *lv;
 
-	if (start_lsn)
-		*start_lsn = log->l_cilp->xc_ctx->sequence;
-
 	ASSERT(log_vector);
 	for (lv = log_vector; lv; lv = lv->lv_next) {
 		void	*ptr;
@@ -302,9 +297,24 @@ xlog_cil_format_items(
 			ptr += vec->i_len;
 		}
 		ASSERT(ptr == lv->lv_buf + lv->lv_buf_len);
+	}
+}
 
+static void
+xlog_cil_insert_items(
+	struct log		*log,
+	struct xfs_log_vec	*log_vector,
+	struct xlog_ticket	*ticket,
+	xfs_lsn_t		*start_lsn)
+{
+	struct xfs_log_vec *lv;
+
+	if (start_lsn)
+		*start_lsn = log->l_cilp->xc_ctx->sequence;
+
+	ASSERT(log_vector);
+	for (lv = log_vector; lv; lv = lv->lv_next)
 		xlog_cil_insert(log, ticket, lv->lv_item, lv);
-	}
 }
 
 static void
@@ -612,9 +622,17 @@ xfs_log_commit_cil(
 		return XFS_ERROR(EIO);
 	}
 
+	/*
+	 * do all the hard work of formatting items (including memory
+	 * allocation) outside the CIL context lock. This prevents stalling CIL
+	 * pushes when we are low on memory and a transaction commit spends a
+	 * lot of time in memory reclaim.
+	 */
+	xlog_cil_format_items(log, log_vector);
+
 	/* lock out background commit */
 	down_read(&log->l_cilp->xc_ctx_lock);
-	xlog_cil_format_items(log, log_vector, tp->t_ticket, commit_lsn);
+	xlog_cil_insert_items(log, log_vector, tp->t_ticket, commit_lsn);
 
 	/* check we didn't blow the reservation */
 	if (tp->t_ticket->t_curr_res < 0)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] xfs: improve buffer cache hash scalability
  2010-08-23  0:15 [PATCH 0/2] xfs: Fix issues found doing large inode count testing Dave Chinner
  2010-08-23  0:15 ` [PATCH 1/2] xfs: don't do memory allocation under the CIL context lock Dave Chinner
@ 2010-08-23  0:15 ` Dave Chinner
  2010-08-23 10:36   ` Christoph Hellwig
  2010-08-27 18:11   ` Alex Elder
  1 sibling, 2 replies; 6+ messages in thread
From: Dave Chinner @ 2010-08-23  0:15 UTC (permalink / raw)
  To: xfs

When doing large parallel file creates on a 16p machine, a large amount of
time is being spent in _xfs_buf_find(). A system-wide profile with perf top
shows this:

          1134740.00 19.3% _xfs_buf_find
           733142.00 12.5% __ticket_spin_lock

The problem is that the hash contains 45,000 buffers, but the hash table width
is only 256 chains. That means we've got around 175 buffers per chain, and
searching it is quite expensive. The hash table size needs to increase.

Secondly, every time we do a lookup, we promote the buffer we find to the head
of the hash chain. This dirties cachelines and invalidates them across all
CPUs that may have walked the hash chain recently, so every walk of the hash
chain is effectively a cold cache walk. Remove the promotion to avoid this
invalidation.

The results are:

          1045043.00 21.2% __ticket_spin_lock
           326184.00  6.6% _xfs_buf_find

A 70% drop in CPU usage when looking up buffers. Unfortunately, that does
not result in an increase in performance under this workload, as contention on
the inode_lock soaks up most of the reduction in CPU usage.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |    8 +-------
 fs/xfs/linux-2.6/xfs_buf.h |    1 -
 2 files changed, 1 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index ea79072..d72cf2b 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -440,12 +440,7 @@ _xfs_buf_find(
 		ASSERT(btp == bp->b_target);
 		if (bp->b_file_offset == range_base &&
 		    bp->b_buffer_length == range_length) {
-			/*
-			 * If we look at something, bring it to the
-			 * front of the list for next time.
-			 */
 			atomic_inc(&bp->b_hold);
-			list_move(&bp->b_hash_list, &hash->bh_list);
 			goto found;
 		}
 	}
@@ -1443,8 +1438,7 @@ xfs_alloc_bufhash(
 {
 	unsigned int		i;
 
-	btp->bt_hashshift = external ? 3 : 8;	/* 8 or 256 buckets */
-	btp->bt_hashmask = (1 << btp->bt_hashshift) - 1;
+	btp->bt_hashshift = external ? 3 : 12;	/* 8 or 4096 buckets */
 	btp->bt_hash = kmem_zalloc_large((1 << btp->bt_hashshift) *
 					 sizeof(xfs_bufhash_t));
 	for (i = 0; i < (1 << btp->bt_hashshift); i++) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index d072e5f..2a05614 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -137,7 +137,6 @@ typedef struct xfs_buftarg {
 	size_t			bt_smask;
 
 	/* per device buffer hash table */
-	uint			bt_hashmask;
 	uint			bt_hashshift;
 	xfs_bufhash_t		*bt_hash;
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] xfs: don't do memory allocation under the CIL context lock
  2010-08-23  0:15 ` [PATCH 1/2] xfs: don't do memory allocation under the CIL context lock Dave Chinner
@ 2010-08-23 10:32   ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2010-08-23 10:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] xfs: improve buffer cache hash scalability
  2010-08-23  0:15 ` [PATCH 2/2] xfs: improve buffer cache hash scalability Dave Chinner
@ 2010-08-23 10:36   ` Christoph Hellwig
  2010-08-27 18:11   ` Alex Elder
  1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2010-08-23 10:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> @@ -1443,8 +1438,7 @@ xfs_alloc_bufhash(
>  {
>  	unsigned int		i;
>  
> -	btp->bt_hashshift = external ? 3 : 8;	/* 8 or 256 buckets */
> -	btp->bt_hashmask = (1 << btp->bt_hashshift) - 1;
> +	btp->bt_hashshift = external ? 3 : 12;	/* 8 or 4096 buckets */

Not directly related to the patch, but we never have hashed buffers
on the log device.  And on the RT device we only use hashed buffers
twice to verify the device size.  With a little work on the latter
we can stop allocating the buf hash entirely for the RT and log
devices.

Anyway, the patch looks good for now, although we should try to get
rid of the hash long term.


Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] xfs: improve buffer cache hash scalability
  2010-08-23  0:15 ` [PATCH 2/2] xfs: improve buffer cache hash scalability Dave Chinner
  2010-08-23 10:36   ` Christoph Hellwig
@ 2010-08-27 18:11   ` Alex Elder
  1 sibling, 0 replies; 6+ messages in thread
From: Alex Elder @ 2010-08-27 18:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, 2010-08-23 at 10:15 +1000, Dave Chinner wrote:
> When doing large parallel file creates on a 16p machine, a large amount of
> time is being spent in _xfs_buf_find(). A system-wide profile with perf top
> shows this:
> 
>           1134740.00 19.3% _xfs_buf_find
>            733142.00 12.5% __ticket_spin_lock
> 
> The problem is that the hash contains 45,000 buffers, but the hash table width
> is only 256 chains. That means we've got around 175 buffers per chain, and
> searching it is quite expensive. The hash table size needs to increase.

My only comment on this is that 4096 buckets is
good now, but someday that may not be right either.
Is there any better way (based on the size of the
underlying block_device, maybe also taking into
account things like page size or CPU count) to
decide this hash size?

Either way it looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

. . .


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2010-08-27 18:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-23  0:15 [PATCH 0/2] xfs: Fix issues found doing large inode count testing Dave Chinner
2010-08-23  0:15 ` [PATCH 1/2] xfs: don't do memory allocation under the CIL context lock Dave Chinner
2010-08-23 10:32   ` Christoph Hellwig
2010-08-23  0:15 ` [PATCH 2/2] xfs: improve buffer cache hash scalability Dave Chinner
2010-08-23 10:36   ` Christoph Hellwig
2010-08-27 18:11   ` Alex Elder
