From: Ben Myers <bpm@sgi.com>
To: Juerg Haefliger <juergh@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Still seeing hangs in xlog_grant_log_space
Date: Mon, 21 May 2012 12:11:37 -0500
Message-ID: <20120521171136.GR16099@sgi.com>
In-Reply-To: <CADLDEKssiOCVRknW3hYtxDxYHSyGr6qfepfai+UymsD6zMGopw@mail.gmail.com>
Hey Juerg,
On Sat, May 19, 2012 at 09:28:55AM +0200, Juerg Haefliger wrote:
> > On Wed, May 09, 2012 at 09:54:08AM +0200, Juerg Haefliger wrote:
> >> > On Sat, May 05, 2012 at 09:44:35AM +0200, Juerg Haefliger wrote:
> >> >> Did anybody have a chance to look at the data?
> >> >
> >> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
> >> >
> >> > Here you indicate that you have created a reproducer. Can you post it to the list?
> >>
> >> Canonical attached them to the bug report that they filed yesterday:
> >> http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
> >
> > I'm interested in understanding to what extent the hang you see in production
> > on 2.6.38 is similar to the hang of the reproducer. Mark is seeing a situation
> > where there is nothing on the AIL and everything is clogged up in the CIL, others are
> > seeing items on the AIL that don't seem to be making progress. Could you
> > provide a dump or traces from a hang on a filesystem with a normal sized log?
> > Can the reproducer hit the hang eventually without resorting to the tiny log?
>
> I'm not certain that the reproducer hang is identical to the
> production hang. One difference that I've noticed is that a reproducer
> hang can be cleared with an emergency sync while a production hang
> can't. I'm working on trying to get a trace from a production machine.
We hit this on a filesystem with a regular-sized log over the weekend. If you see
this again in production, could you gather up the task states?
echo t > /proc/sysrq-trigger
Mark and I have been looking at the dump. There are a few interesting items to point out.
1) xfs_sync_worker is blocked trying to get a log reservation:
PID: 25374 TASK: ffff88013481c6c0 CPU: 3 COMMAND: "kworker/3:83"
#0 [ffff88013481fb50] __schedule at ffffffff813aacac
#1 [ffff88013481fc98] schedule at ffffffff813ab0c4
#2 [ffff88013481fca8] xlog_grant_head_wait at ffffffffa0347b78 [xfs]
#3 [ffff88013481fcf8] xlog_grant_head_check at ffffffffa03483e6 [xfs]
#4 [ffff88013481fd38] xfs_log_reserve at ffffffffa034852c [xfs]
#5 [ffff88013481fd88] xfs_trans_reserve at ffffffffa0344e64 [xfs]
#6 [ffff88013481fdd8] xfs_fs_log_dummy at ffffffffa02ec138 [xfs]
#7 [ffff88013481fdf8] xfs_sync_worker at ffffffffa02f7be4 [xfs]
#8 [ffff88013481fe18] process_one_work at ffffffff8104c53b
#9 [ffff88013481fe68] worker_thread at ffffffff8104f0e3
#10 [ffff88013481fee8] kthread at ffffffff8105395e
#11 [ffff88013481ff48] kernel_thread_helper at ffffffff813b3ae4
This means that it is not in a position to push the AIL, even though the AIL
clearly has plenty of entries that can be pushed.
crash> xfs_ail 0xffff88022112b7c0,
struct xfs_ail {
...
xa_ail = {
next = 0xffff880144d1c318,
prev = 0xffff880170a02078
},
xa_target = 0x1f00003063,
Here's the first item on the AIL:
ffff880144d1c318
struct xfs_log_item_t {
li_ail = {
next = 0xffff880196ea0858,
prev = 0xffff88022112b7d0
},
li_lsn = 0x1f00001c63, <--- less than xa_target
li_desc = 0x0,
li_mountp = 0xffff88016adee000,
li_ailp = 0xffff88022112b7c0,
li_type = 0x123b,
li_flags = 0x1,
li_bio_list = 0xffff88016afa5cb8,
li_cb = 0xffffffffa034de00 <xfs_istale_done>,
li_ops = 0xffffffffa035f620,
li_cil = {
next = 0xffff880144d1c368,
prev = 0xffff880144d1c368
},
li_lv = 0x0,
li_seq = 0x3b
}
So if xfs_sync_worker were not blocked on a log reservation, it would push
these items.
2) The CIL is waiting around too:
crash> xfs_cil_ctx 0xffff880144d1a9c0,
struct xfs_cil_ctx {
...
space_used = 0x135f68,
struct log {
...
l_logsize = 0xa00000,
0xa00000 / 8 = 0x140000 <--- XLOG_CIL_SPACE_LIMIT
0x140000 - 0x135f68 = 0xa098
It looks like xlog_cil_push_background will not push the CIL while space_used is
less than XLOG_CIL_SPACE_LIMIT, and here it is still 0xa098 bytes short, so
that's not going anywhere either.
3) It may be unrelated to this bug, but there is an unresolved race in the log
reservation code between when log_space_left samples the grant heads and when
the space is actually granted a bit later. We may be able to grant more space
than intended.
If you can provide output of 'echo t > /proc/sysrq-trigger' it may be enough
information to determine if you're seeing the same problem we hit on Saturday.
Thanks,
Ben & Mark
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs