All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alex Elder <aelder@sgi.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH] xfs: force background CIL push under sustained load
Date: Thu, 23 Sep 2010 13:53:55 -0500	[thread overview]
Message-ID: <1285268035.1973.109.camel@doink> (raw)
In-Reply-To: <1285208863-31489-1-git-send-email-david@fromorbit.com>

On Thu, 2010-09-23 at 12:27 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> I have been seeing relatively frequent pauses in transaction throughput up to
> 30s long under heavy parallel workloads. The only thing that seemed strange
> about them was that the xfsaild was active during the pauses, but making no
> progress. It was running exactly 20 times a second (on the 50ms no-progress
> backoff), and the number of pushbuf events was constant across this time as
> well.  IOWs, the xfsaild appeared to be stuck on buffers that it could not push
> out.
> 
> Further investigation indicated that it was trying to push out inode buffers
> that were pinned and/or locked. The xfsbufd was also getting woken at the same
> frequency (by the xfsaild, no doubt) to push out delayed write buffers. The
> xfsbufd was not making any progress because all the buffers in the delwri queue
> were pinned. This scan and make no progress dance went one in the trace for
> some seconds, before the xfssyncd came along an issued a log force, and then
> things started going again.
> 
> However, I noticed something strange about the log force - there were way too
> many IO's issued. 516 log buffers were written, to be exact. That added up to
> 129MB of log IO, which got me very interested because it's almost exactly 25%
> of the size of the log. That "25% of the log" is actually very important - it's
> the distance that tail-pushing tries to keep free in the log; i.e. where it
> pushes to. Further, the delayed logging code is suppose to aggregate the
> minimum of 25% of the log or 8MB worth of changes before flushing. That's what
> really puzzled me - why did a log force write 129MB instead of only 8MB?
> 
> Essentially what has happened is that no CIL pushes had occurred since the
> previous tail push which cleared out 25% of the log space. That caused all the
> new transactions to block because there wasn't log space for them, but they
> kick the xfsaild to push the tail. However, the xfsaild was not making
> progress because there were buffers it could not lock and flush, and the
> xfsbufd could not flush them because they were pinned. As a result, both
> the xfsaild and the xfsbufd could not move the tail of the log forward without
> the CIL first committing.
> 
> The cause of the problem was that the background CIL push, which should happen
> when 8MB of aggregated changes have been committed, is being held off by the
> concurrent transaction commit load. The background push does a
> down_write_trylock() which will fail if there is a concurrent transaction
> commit holding the push lock in read mode. With 8 CPUs all doing transactions
> as fast as they can, there was enough concurrent transaction commits to hold
> off the background push until tail-pushing could no longer free log space,
> and the halt would occur.
> 
> It should be noted that there is no reason why it would halt at 25% of log
> space used by a single CIL checkpoint. This bug could definitely violate the
> "no transaction should be larger than half the log" requirement and hence
> result in corruption if the system crashed under heavy load. This sort of
> bug is exactly the reason why delayed logging was tagged as experimental....
> 
> The fix is to start blocking background pushes once the threshold has been
> exceeded by 50%. This is still only 37.5% of log space at worst, so keeps us
> well short of the limit a transaction must not exceed.

This looks good.  You're making the CIL push be controlled by
low/high water marks, basically, with a "best effort" at the
low water mark (XLOG_CIL_SPACE_LIMIT()) and forced once
XLOG_CIL_HARD_SPACE_LIMIT() has been reached.  And the
important bug fix is that previously there was no point at
which the forced push would occur.

As usual I have a bit of gratuitous commentary below,
but this change looks good to me.

Reviewed-by: Alex Elder <aelder@sgi.com>


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  |   12 +++++++++---
>  fs/xfs/xfs_log_priv.h |   28 ++++++++++++++++++++--------
>  2 files changed, 29 insertions(+), 11 deletions(-)

. . .

> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index ced52b9..9776956 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -460,15 +460,27 @@ struct xfs_cil {
>   * than the CIL itself.
>   *
>   * With dynamic reservations, we can basically make up arbitrary limits for the
> - * checkpoint size so long as they don't violate any other size rules.  Hence
> - * the initial maximum size for the checkpoint transaction will be set to a
> - * quarter of the log or 8MB, which ever is smaller. 8MB is an arbitrary limit
> - * right now based on the latency of writing out a large amount of data through
> - * the circular iclog buffers.
> + * checkpoint size so long as they don't violate any other size rules. A key
> + * size we should try to keep below is the AIL push threshold which is 25% of
> + * the log. That is how much empty space in the log the transaction reservation
> + * subsystem tries to keep free. If we make transactions larger than that, then
> + * we risk requiring more space for the transaction commit than the AIL will
> + * try to free for us. Hence the maximum size we should allow is under 25% of
> + * the log. This will also keep us below the "no transaction shoul dbe more

s/shoul dbe/should be/

> + * than half the log" rule that recovery requires us to keep.
> + *
> + * Further, we need to make sure the background CIL push is efficient, which
> + * means we need to give the background push a chance to commit without
> + * blocking all the current transaction commits. Hence we need some space
> + * between the threshold and the 25% limit to allow background pushes to be
> + * tried, but not enforced. To make this simple and fast to calculate, set
> + * the background push threshold to 1/8th (12.5%) the size of the log, and then start
> + * enforcing the background push at 50% above this. i.e. at 3/16th or 18.75% of
> + * the log size. This should keep us well under the limits of the AIL pushing
> + * threshold, yet give us plenty of space for aggregation on large logs.
>   */

I think the above explanation is pretty good but I don't know that it's
as clear or concise as it could be.  I don't claim this is better but
I'll take a shot (I don't like offering criticism without suggesting
an alternative).

 * With dynamic reservations, we can basically make up arbitrary
 * limits for the checkpoint size so long as they don't violate any
 * other size rules.  Recovery imposes a rule that no transaction
 * exceed half the log, so we are limited by that.  Furthermore, the
 * log transaction reservation subsystem tries to keep 25% of the
 * log free, so we should keep below that limit or we risk not being
 * able to get the space we need.
 *
 * In order to keep background CIL push efficient, we will set a
 * lower threshold at which background pushing is attempted without
 * blocking current transaction commits.  A separate, higher bound
 * defines when CIL pushes are forced in order to ensure we stay
 * within our transaction size limits.

> -
> -#define XLOG_CIL_SPACE_LIMIT(log)	\
> -	(min((log->l_logsize >> 2), (8 * 1024 * 1024)))
> +#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
> +#define XLOG_CIL_HARD_SPACE_LIMIT(log)	(3 * (log->l_logsize >> 4))

Maybe "LIMIT" isn't quite the right name for these two.
(But I have no better suggestion.)

I don't really care much about this, but I'll take this
opportunity for a small rant.

The difference in calculation cost/speed between "x >> 3" and
"x / 8" is vanishingly small.  I think it is meaningful to use
a shift in places where a power-of-two is mandated, but in places
like this it suggests there is a constraint that simply doesn't
exist.  So for example, you could have chosen (log->logsize / 10)
as the "try pushing" value, and (log->logsize / 4 - 1) as the
"must push" value.

Anyway, this is all fine as-is, it just surprises me that
people still use this idiom when it makes such a negligible
actual difference in execution.

>  
>  /*
>   * The reservation head lsn is not made up of a cycle number and block number.



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2010-09-23 18:53 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-23  2:27 [PATCH] xfs: force background CIL push under sustained load Dave Chinner
2010-09-23 10:43 ` Christoph Hellwig
2010-09-23 18:53 ` Alex Elder [this message]
2010-09-24  1:18   ` Dave Chinner
2010-09-24 14:02     ` Alex Elder
2010-09-23 18:58 ` Alex Elder
2010-09-24  2:15   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1285268035.1973.109.camel@doink \
    --to=aelder@sgi.com \
    --cc=david@fromorbit.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.