public inbox for linux-xfs@vger.kernel.org
From: Ben Myers <bpm@sgi.com>
To: Juerg Haefliger <juergh@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Still seeing hangs in xlog_grant_log_space
Date: Mon, 21 May 2012 12:11:37 -0500
Message-ID: <20120521171136.GR16099@sgi.com>
In-Reply-To: <CADLDEKssiOCVRknW3hYtxDxYHSyGr6qfepfai+UymsD6zMGopw@mail.gmail.com>

Hey Juerg,

On Sat, May 19, 2012 at 09:28:55AM +0200, Juerg Haefliger wrote:
> > On Wed, May 09, 2012 at 09:54:08AM +0200, Juerg Haefliger wrote:
> >> > On Sat, May 05, 2012 at 09:44:35AM +0200, Juerg Haefliger wrote:
> >> >> Did anybody have a chance to look at the data?
> >> >
> >> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
> >> >
> >> > Here you indicate that you have created a reproducer.  Can you post it to the list?
> >>
> >> Canonical attached them to the bug report that they filed yesterday:
> >> http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
> >
> > I'm interested in understanding to what extent the hang you see in production
> > on 2.6.38 is similar to the hang of the reproducer.  Mark is seeing a situation
> > where there is nothing on the AIL and is clogged up in the CIL, others are
> > seeing items on the AIL that don't seem to be making progress.  Could you
> > provide a dump or traces from a hang on a filesystem with a normal sized log?
> > Can the reproducer hit the hang eventually without resorting to the tiny log?
> 
> I'm not certain that the reproducer hang is identical to the
> production hang. One difference that I've noticed is that a reproducer
> hang can be cleared with an emergency sync while a production hang
> can't. I'm working on trying to get a trace from a production machine.

We hit this on a filesystem with a regular-sized log over the weekend.  If you
see this again in production, could you gather up the task states?

echo t > /proc/sysrq-trigger

Mark and I have been looking at the dump.  There are a few interesting items to point out.

1) xfs_sync_worker is blocked trying to get log reservation:

PID: 25374  TASK: ffff88013481c6c0  CPU: 3   COMMAND: "kworker/3:83"
 #0 [ffff88013481fb50] __schedule at ffffffff813aacac
 #1 [ffff88013481fc98] schedule at ffffffff813ab0c4
 #2 [ffff88013481fca8] xlog_grant_head_wait at ffffffffa0347b78 [xfs]
 #3 [ffff88013481fcf8] xlog_grant_head_check at ffffffffa03483e6 [xfs]
 #4 [ffff88013481fd38] xfs_log_reserve at ffffffffa034852c [xfs]
 #5 [ffff88013481fd88] xfs_trans_reserve at ffffffffa0344e64 [xfs]
 #6 [ffff88013481fdd8] xfs_fs_log_dummy at ffffffffa02ec138 [xfs]
 #7 [ffff88013481fdf8] xfs_sync_worker at ffffffffa02f7be4 [xfs]
 #8 [ffff88013481fe18] process_one_work at ffffffff8104c53b
 #9 [ffff88013481fe68] worker_thread at ffffffff8104f0e3
#10 [ffff88013481fee8] kthread at ffffffff8105395e
#11 [ffff88013481ff48] kernel_thread_helper at ffffffff813b3ae4

This means that it is not in a position to push the AIL, even though the AIL
clearly has plenty of entries that could be pushed.

crash> xfs_ail 0xffff88022112b7c0,
struct xfs_ail {
...
  xa_ail = {
    next = 0xffff880144d1c318,
    prev = 0xffff880170a02078
  },
  xa_target = 0x1f00003063,

Here's the first item on the AIL:

ffff880144d1c318
struct xfs_log_item_t {
  li_ail = {
    next = 0xffff880196ea0858,
    prev = 0xffff88022112b7d0
  },
  li_lsn = 0x1f00001c63,		<--- less than xa_target
  li_desc = 0x0,
  li_mountp = 0xffff88016adee000,
  li_ailp = 0xffff88022112b7c0,
  li_type = 0x123b,
  li_flags = 0x1,
  li_bio_list = 0xffff88016afa5cb8,
  li_cb = 0xffffffffa034de00 <xfs_istale_done>,
  li_ops = 0xffffffffa035f620,
  li_cil = {
    next = 0xffff880144d1c368,
    prev = 0xffff880144d1c368
  },
  li_lv = 0x0,
  li_seq = 0x3b
}

So if xfs_sync_worker were not blocked on log reservation it would push these
items.
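
To make the LSN comparison above concrete: an XFS LSN packs the log cycle
number into the upper 32 bits and the block number into the lower 32 bits, so
comparing two LSNs as plain integers orders them by cycle and then by block.
Here is a minimal Python sketch using the two values from the dump.  The
helper name split_lsn is made up for illustration and is not a kernel or
crash function.

```python
# Illustrative only: decode the two LSNs quoted from the crash dump.
# Assumes the usual XFS layout: cycle in the high 32 bits, block in the low 32.

def split_lsn(lsn):
    """Return (cycle, block) for a 64-bit XFS log sequence number."""
    return lsn >> 32, lsn & 0xFFFFFFFF

xa_target = 0x1F00003063   # push target from struct xfs_ail
li_lsn = 0x1F00001C63      # LSN of the first item on the AIL

target_cycle, target_block = split_lsn(xa_target)
item_cycle, item_block = split_lsn(li_lsn)

print(f"target: cycle {target_cycle}, block {target_block}")
print(f"item:   cycle {item_cycle}, block {item_block}")

# Same cycle, lower block: the item sits behind the push target.
item_is_pushable = li_lsn < xa_target
print("item below push target:", item_is_pushable)
```

Both values are in the same cycle, and the item's block number is lower than
the target's, which is why the item counts as eligible for pushing.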

2) The CIL is waiting around too:

crash> xfs_cil_ctx 0xffff880144d1a9c0,
struct xfs_cil_ctx {
...
  space_used = 0x135f68, 

struct log {
...
  l_logsize = 0xa00000,

0xA00000 / 8 = 0x140000				<--- XLOG_CIL_SPACE_LIMIT

0x140000 - 0x135F68 = 0xA098

Looks like xlog_cil_push_background will not push the CIL while space used is
less than XLOG_CIL_SPACE_LIMIT, so that's not going anywhere either.
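
The hex arithmetic above can be double-checked with a few lines of Python.
This is only a sketch of the calculation, taking the limit as l_logsize / 8 as
computed above; the variable names mirror the struct fields in the dump but
the script itself is not part of any tool.

```python
# Illustrative only: redo the hex arithmetic from the dump.
# The limit is taken as l_logsize / 8, as computed above.

l_logsize = 0xA00000     # from struct log
space_used = 0x135F68    # from struct xfs_cil_ctx

cil_space_limit = l_logsize // 8
headroom = cil_space_limit - space_used

print(f"log size:   {l_logsize:#x} ({l_logsize} bytes)")
print(f"CIL limit:  {cil_space_limit:#x} ({cil_space_limit} bytes)")
print(f"space used: {space_used:#x} ({space_used} bytes)")
print(f"headroom:   {headroom:#x} ({headroom} bytes)")

# A background push is only expected once space_used reaches the limit.
would_push = space_used >= cil_space_limit
print("background push would trigger:", would_push)
```

In decimal, the CIL is about 40 KB (41112 bytes) short of the threshold, so
the background push condition is not met.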

3) It may be unrelated to this bug, but we do have an unresolved race in the
log reservation code: it sits between when xlog_space_left samples the grant
heads and when the space is actually granted a bit later.  We may be able to
grant more space than intended.
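
The general shape of such a sample-then-grant race can be shown with a toy
model.  This is not the XFS grant-head code; the class ToyLog, its methods and
the numbers are all hypothetical, and the interleaving is written out by hand
rather than produced by real threads.  It only illustrates how two tasks that
both sample free space before either grant lands can be given more than
exists.

```python
# Toy model only: NOT the XFS grant-head code. It shows how a
# check-then-act gap can hand out more space than exists.

class ToyLog:
    def __init__(self, size):
        self.size = size
        self.granted = 0

    def space_left(self):
        # Step 1: sample the free space (the "check").
        return self.size - self.granted

    def grant(self, need):
        # Step 2: hand out the space (the "act"), done later.
        self.granted += need

log = ToyLog(size=100)
need = 80

# Both tasks sample before either grant lands.
sample_a = log.space_left()
sample_b = log.space_left()

if sample_a >= need:
    log.grant(need)
if sample_b >= need:   # stale sample: still looks like 100 free
    log.grant(need)

print("granted:", log.granted, "of", log.size)
print("overcommitted:", log.granted > log.size)
```

Whether the real reservation path can actually reach this state is exactly
the open question; the sketch only shows why the gap is worth looking at.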

If you can provide output of 'echo t > /proc/sysrq-trigger' it may be enough
information to determine if you're seeing the same problem we hit on Saturday.

Thanks,

Ben & Mark

