public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: xfs@oss.sgi.com
Subject: [PATCH 01/18] xfs: force background CIL push under sustained load
Date: Fri, 24 Sep 2010 22:30:59 +1000	[thread overview]
Message-ID: <1285331476-23015-2-git-send-email-david@fromorbit.com> (raw)
In-Reply-To: <1285331476-23015-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

I have been seeing occasional pauses in transaction throughput up to
30s long under heavy parallel workloads. The only notable thing was
that the xfsaild was trying to be active during the pauses, but
making no progress. It was running exactly 20 times a second (on the
50ms no-progress backoff), and the number of pushbuf events was
constant across this time as well.  IOWs, the xfsaild appeared to be
stuck on buffers that it could not push out.

Further investigation indicated that it was trying to push out inode
buffers that were pinned and/or locked. The xfsbufd was also getting
woken at the same frequency (by the xfsaild, no doubt) to push out
delayed write buffers. The xfsbufd was not making any progress
because all the buffers in the delwri queue were pinned. This scan-
and-make-no-progress dance went one in the trace for some seconds,
before the xfssyncd came along an issued a log force, and then
things started going again.

However, I noticed something strange about the log force - there
were way too many IO's issued. 516 log buffers were written, to be
exact. That added up to 129MB of log IO, which got me very
interested because it's almost exactly 25% of the size of the log.
He delayed logging code is suppose to aggregate the minimum of 25%
of the log or 8MB worth of changes before flushing. That's what
really puzzled me - why did a log force write 129MB instead of only
8MB?

Essentially what has happened is that no CIL pushes had occurred
since the previous tail push which cleared out 25% of the log space.
That caused all the new transactions to block because there wasn't
log space for them, but they kick the xfsaild to push the tail.
However, the xfsaild was not making progress because there were
buffers it could not lock and flush, and the xfsbufd could not flush
them because they were pinned. As a result, both the xfsaild and the
xfsbufd could not move the tail of the log forward without the CIL
first committing.

The cause of the problem was that the background CIL push, which
should happen when 8MB of aggregated changes have been committed, is
being held off by the concurrent transaction commit load. The
background push does a down_write_trylock() which will fail if there
is a concurrent transaction commit holding the push lock in read
mode. With 8 CPUs all doing transactions as fast as they can, there
was enough concurrent transaction commits to hold off the background
push until tail-pushing could no longer free log space, and the halt
would occur.

It should be noted that there is no reason why it would halt at 25%
of log space used by a single CIL checkpoint. This bug could
definitely violate the "no transaction should be larger than half
the log" requirement and hence result in corruption if the system
crashed under heavy load. This sort of bug is exactly the reason why
delayed logging was tagged as experimental....

The fix is to start blocking background pushes once the threshold
has been exceeded. Rework the threshold calculations to keep the
amount of log space a CIL checkpoint can use to below that of the
AIL push threshold to avoid the problem completely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log_cil.c  |   12 +++++++++---
 fs/xfs/xfs_log_priv.h |   37 +++++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index ed575fb..7e206fc 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -405,9 +405,15 @@ xlog_cil_push(
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_SLEEP|KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
 
-	/* lock out transaction commit, but don't block on background push */
+	/*
+	 * Lock out transaction commit, but don't block for background pushes
+	 * unless we are well over the CIL space limit. See the definition of
+	 * XLOG_CIL_HARD_SPACE_LIMIT() for the full explanation of the logic
+	 * used here.
+	 */
 	if (!down_write_trylock(&cil->xc_ctx_lock)) {
-		if (!push_seq)
+		if (!push_seq &&
+		    cil->xc_ctx->space_used < XLOG_CIL_HARD_SPACE_LIMIT(log))
 			goto out_free_ticket;
 		down_write(&cil->xc_ctx_lock);
 	}
@@ -422,7 +428,7 @@ xlog_cil_push(
 		goto out_skip;
 
 	/* check for a previously pushed seqeunce */
-	if (push_seq < cil->xc_ctx->sequence)
+	if (push_seq && push_seq < cil->xc_ctx->sequence)
 		goto out_skip;
 
 	/*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ced52b9..edcdfe0 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -426,13 +426,13 @@ struct xfs_cil {
 };
 
 /*
- * The amount of log space we should the CIL to aggregate is difficult to size.
- * Whatever we chose we have to make we can get a reservation for the log space
- * effectively, that it is large enough to capture sufficient relogging to
- * reduce log buffer IO significantly, but it is not too large for the log or
- * induces too much latency when writing out through the iclogs. We track both
- * space consumed and the number of vectors in the checkpoint context, so we
- * need to decide which to use for limiting.
+ * The amount of log space we allow the CIL to aggregate is difficult to size.
+ * Whatever we choose, we have to make sure we can get a reservation for the
+ * log space effectively, that it is large enough to capture sufficient
+ * relogging to reduce log buffer IO significantly, but it is not too large for
+ * the log or induces too much latency when writing out through the iclogs. We
+ * track both space consumed and the number of vectors in the checkpoint
+ * context, so we need to decide which to use for limiting.
  *
  * Every log buffer we write out during a push needs a header reserved, which
  * is at least one sector and more for v2 logs. Hence we need a reservation of
@@ -459,16 +459,21 @@ struct xfs_cil {
  * checkpoint transaction ticket is specific to the checkpoint context, rather
  * than the CIL itself.
  *
- * With dynamic reservations, we can basically make up arbitrary limits for the
- * checkpoint size so long as they don't violate any other size rules.  Hence
- * the initial maximum size for the checkpoint transaction will be set to a
- * quarter of the log or 8MB, which ever is smaller. 8MB is an arbitrary limit
- * right now based on the latency of writing out a large amount of data through
- * the circular iclog buffers.
+ * With dynamic reservations, we can effectively make up arbitrary limits for
+ * the checkpoint size so long as they don't violate any other size rules.
+ * Recovery imposes a rule that no transaction exceed half the log, so we are
+ * limited by that.  Furthermore, the log transaction reservation subsystem
+ * tries to keep 25% of the log free, so we need to keep below that limit or we
+ * risk running out of free log space to start any new transactions.
+ *
+ * In order to keep background CIL push efficient, we will set a lower
+ * threshold at which background pushing is attempted without blocking current
+ * transaction commits.  A separate, higher bound defines when CIL pushes are
+ * enforced to ensure we stay within our maximum checkpoint size bounds.
+ * threshold, yet give us plenty of space for aggregation on large logs.
  */
-
-#define XLOG_CIL_SPACE_LIMIT(log)	\
-	(min((log->l_logsize >> 2), (8 * 1024 * 1024)))
+#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
+#define XLOG_CIL_HARD_SPACE_LIMIT(log)	(3 * (log->l_logsize >> 4))
 
 /*
  * The reservation head lsn is not made up of a cycle number and block number.
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2010-09-24 12:30 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-24 12:30 [PATCH 0/18] xfs: metadata scalability V3 Dave Chinner
2010-09-24 12:30 ` Dave Chinner [this message]
2010-09-24 12:31 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
2010-09-24 12:31 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
2010-09-24 12:31 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
2010-09-24 12:31 ` [PATCH 05/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
2010-09-25 23:42   ` Christoph Hellwig
2010-09-27  1:09     ` Dave Chinner
2010-09-24 12:31 ` [PATCH 06/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate Dave Chinner
2010-09-24 12:31 ` [PATCH 07/18] xfs: introduced uncached buffer read primitve Dave Chinner
2010-09-24 12:31 ` [PATCH 08/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf Dave Chinner
2010-09-24 12:31 ` [PATCH 09/18] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
2010-09-24 12:31 ` [PATCH 10/18] xfs: use unhashed buffers for size checks Dave Chinner
2010-09-24 12:31 ` [PATCH 11/18] xfs: remove buftarg hash for external devices Dave Chinner
2010-09-24 12:31 ` [PATCH 12/18] xfs: split inode AG walking into separate code for reclaim Dave Chinner
2010-09-24 12:31 ` [PATCH 13/18] xfs: split out inode walk inode grabbing Dave Chinner
2010-09-25 16:31   ` Christoph Hellwig
2010-09-24 12:31 ` [PATCH 14/18] xfs: implement batched inode lookups for AG walking Dave Chinner
2010-09-25 16:32   ` Christoph Hellwig
2010-09-24 12:31 ` [PATCH 15/18] xfs: batch inode reclaim lookup Dave Chinner
2010-09-24 12:31 ` [PATCH 16/18] xfs: serialise inode reclaim within an AG Dave Chinner
2010-09-25 23:49   ` Christoph Hellwig
2010-09-27  0:56     ` Dave Chinner
2010-09-24 12:31 ` [PATCH 17/18] xfs: convert buffer cache hash to rbtree Dave Chinner
2010-09-24 12:31 ` [PATCH 18/18] xfs: pack xfs_buf structure more tightly Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2010-09-27  1:47 [PATCH 0/18] xfs: metadata scalability V4 Dave Chinner
2010-09-27  1:47 ` [PATCH 01/18] xfs: force background CIL push under sustained load Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1285331476-23015-2-git-send-email-david@fromorbit.com \
    --to=david@fromorbit.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox