All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ben Myers <bpm@sgi.com>
To: Juerg Haefliger <juergh@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Still seeing hangs in xlog_grant_log_space
Date: Mon, 21 May 2012 12:11:37 -0500	[thread overview]
Message-ID: <20120521171136.GR16099@sgi.com> (raw)
In-Reply-To: <CADLDEKssiOCVRknW3hYtxDxYHSyGr6qfepfai+UymsD6zMGopw@mail.gmail.com>

Hey Juerg,

On Sat, May 19, 2012 at 09:28:55AM +0200, Juerg Haefliger wrote:
> > On Wed, May 09, 2012 at 09:54:08AM +0200, Juerg Haefliger wrote:
> >> > On Sat, May 05, 2012 at 09:44:35AM +0200, Juerg Haefliger wrote:
> >> >> Did anybody have a chance to look at the data?
> >> >
> >> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
> >> >
> >> > Here you indicate that you have created a reproducer.  Can you post it to the list?
> >>
> >> Canonical attached them to the bug report that they filed yesterday:
> >> http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
> >
> > I'm interested in understanding to what extent the hang you see in production
> > on 2.6.38 is similar to the hang of the reproducer.  Mark is seeing a situation
> > where there is nothing on the AIL and is clogged up in the CIL, others are
> > seeing items on the AIL that don't seem to be making progress.  Could you
> > provide a dump or traces from a hang on a filesystem with a normal sized log?
> > Can the reproducer hit the hang eventually without resorting to the tiny log?
> 
> I'm not certain that the reproducer hang is identical to the
> production hang. One difference that I've noticed is that a reproducer
> hang can be cleared with an emergency sync while a production hang
> can't. I'm working on trying to get a trace from a production machine.

Hit this on a filesystem with a regular sized log over the weekend.  If you see
this again in production could you gather up task states?

echo t > /proc/sysrq-trigger

Mark and I have been looking at the dump.  There are few interesting items to point out.

1) xfs_sync_worker is blocked trying to get log reservation:

PID: 25374  TASK: ffff88013481c6c0  CPU: 3   COMMAND: "kworker/3:83"
 #0 [ffff88013481fb50] __schedule at ffffffff813aacac
 #1 [ffff88013481fc98] schedule at ffffffff813ab0c4
 #2 [ffff88013481fca8] xlog_grant_head_wait at ffffffffa0347b78 [xfs]
 #3 [ffff88013481fcf8] xlog_grant_head_check at ffffffffa03483e6 [xfs]
 #4 [ffff88013481fd38] xfs_log_reserve at ffffffffa034852c [xfs]
 #5 [ffff88013481fd88] xfs_trans_reserve at ffffffffa0344e64 [xfs]
 #6 [ffff88013481fdd8] xfs_fs_log_dummy at ffffffffa02ec138 [xfs]
 #7 [ffff88013481fdf8] xfs_sync_worker at ffffffffa02f7be4 [xfs]
 #8 [ffff88013481fe18] process_one_work at ffffffff8104c53b
 #9 [ffff88013481fe68] worker_thread at ffffffff8104f0e3
#10 [ffff88013481fee8] kthread at ffffffff8105395e
#11 [ffff88013481ff48] kernel_thread_helper at ffffffff813b3ae4

This means that it is not in a position to push the AIL.  It is clear that the
AIL has plenty of entries which can be pushed.

crash> xfs_ail 0xffff88022112b7c0,
struct xfs_ail {
...
  xa_ail = {
    next = 0xffff880144d1c318,
    prev = 0xffff880170a02078
  },
  xa_target = 0x1f00003063,

Here's the first item on the AIL:

ffff880144d1c318
struct xfs_log_item_t {
  li_ail = {
    next = 0xffff880196ea0858,
    prev = 0xffff88022112b7d0
  },
  li_lsn = 0x1f00001c63,		<--- less than xa_target
  li_desc = 0x0,
  li_mountp = 0xffff88016adee000,
  li_ailp = 0xffff88022112b7c0,
  li_type = 0x123b,
  li_flags = 0x1,
  li_bio_list = 0xffff88016afa5cb8,
  li_cb = 0xffffffffa034de00 <xfs_istale_done>,
  li_ops = 0xffffffffa035f620,
  li_cil = {
    next = 0xffff880144d1c368,
    prev = 0xffff880144d1c368
  },
  li_lv = 0x0,
  li_seq = 0x3b
}

So if xfs_sync_worker were not blocked on log reservation it would push these
items.

2) The CIL is waiting around too:

crash> xfs_cil_ctx 0xffff880144d1a9c0,
struct xfs_cil_ctx {
...
  space_used = 0x135f68, 

struct log {
...
  l_logsize = 0xa00000,

A00000/8
140000						<--- XLOG_CIL_SPACE_LIMIT

140000 - 135F68
A098

Looks like xlog_cil_push_background will not push the CIL while space used is
less than XLOG_CIL_SPACE_LIMIT, so that's not going anywhere either.

3) It may be unrelated to this bug, but we do have a race in the log
reservation code that hasn't been resolved... between when log_space_left
samples the grant heads and when the space is actually granted a bit later.
Maybe we can grant more space than intended.

If you can provide output of 'echo t > /proc/sysrq-trigger' it may be enough
information to determine if you're seeing the same problem we hit on Saturday.

Thanks,

Ben & Mark

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2012-05-21 17:06 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-23 12:09 Still seeing hangs in xlog_grant_log_space Juerg Haefliger
2012-04-23 14:38 ` Dave Chinner
2012-04-23 15:33   ` Juerg Haefliger
2012-04-23 23:58     ` Dave Chinner
2012-04-24  8:55       ` Juerg Haefliger
2012-04-24 12:07         ` Dave Chinner
2012-04-24 18:26           ` Juerg Haefliger
2012-04-25 22:38             ` Dave Chinner
2012-04-26 12:37               ` Juerg Haefliger
2012-04-26 22:44                 ` Dave Chinner
2012-04-26 23:00                   ` Juerg Haefliger
2012-04-26 23:07                     ` Dave Chinner
2012-04-27  9:04                       ` Juerg Haefliger
2012-04-27 11:09                         ` Dave Chinner
2012-04-27 13:07                           ` Juerg Haefliger
2012-05-05  7:44                             ` Juerg Haefliger
2012-05-07 17:19                               ` Ben Myers
2012-05-09  7:54                                 ` Juerg Haefliger
2012-05-10 16:11                                   ` Chris J Arges
2012-05-10 21:53                                     ` Mark Tinguely
2012-05-16 18:42                                     ` Ben Myers
2012-05-16 19:03                                       ` Chris J Arges
2012-05-16 21:29                                         ` Mark Tinguely
2012-05-18 10:10                                           ` Dave Chinner
2012-05-18 14:42                                             ` Mark Tinguely
2012-05-22 22:59                                               ` Dave Chinner
2012-06-06 15:00                                             ` Chris J Arges
2012-06-07  0:49                                               ` Dave Chinner
2012-05-17 20:55                                       ` Chris J Arges
2012-05-18 16:53                                         ` Chris J Arges
2012-05-18 17:19                                   ` Ben Myers
2012-05-19  7:28                                     ` Juerg Haefliger
2012-05-21 17:11                                       ` Ben Myers [this message]
2012-05-24  5:45                                         ` Juerg Haefliger
2012-05-24 14:23                                           ` Ben Myers
2012-05-07 22:59                               ` Dave Chinner
2012-05-09  7:35                                 ` Dave Chinner
2012-05-09 21:07                                   ` Mark Tinguely
2012-05-10  2:10                                     ` Mark Tinguely
2012-05-18  9:37                                       ` Dave Chinner
2012-05-18  9:31                                     ` Dave Chinner
2012-05-24 20:18 ` Peter Watkins
2012-05-25  6:28   ` Juerg Haefliger
2012-05-25 17:03     ` Peter Watkins
2012-06-05 23:54       ` Dave Chinner
2012-06-06 13:40         ` Brian Foster
2012-06-06 17:41           ` Mark Tinguely
2012-06-11 20:42             ` Chris J Arges
2012-06-11 23:53               ` Dave Chinner
2012-06-12 13:28                 ` Chris J Arges
2012-06-06 22:03           ` Mark Tinguely
2012-06-06 23:04             ` Brian Foster
2012-06-07  1:35           ` Dave Chinner
2012-06-07 14:16             ` Brian Foster
2012-06-08  0:28               ` Dave Chinner
2012-06-08 17:09                 ` Ben Myers
2012-06-11 20:59         ` Mark Tinguely
2012-06-05 15:21   ` Chris J Arges

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120521171136.GR16099@sgi.com \
    --to=bpm@sgi.com \
    --cc=juergh@gmail.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.