From: Brian Foster <bfoster@redhat.com>
To: Hugo Kuo <hugo@swiftstack.com>
Cc: Darrell Bishop <darrell@swiftstack.com>, xfs@oss.sgi.com
Subject: Re: [XFS] Any process to a particular XFS device hung in D state forever.
Date: Wed, 20 Apr 2016 07:24:45 -0400
Message-ID: <20160420112445.GA2773@laptop.bfoster>
In-Reply-To: <CAJBkf_cNm3Xoun+PnzgdKEGz0zQx2JGNhS4KoTuDYjPm=78w+g@mail.gmail.com>
On Wed, Apr 20, 2016 at 01:49:49PM +0800, Hugo Kuo wrote:
> Hi XFS team,
>
>
> Here's the grouped lsof output for any open files on the problematic
> disks. The full log of xfs_repair -n is included in this gist as well.
> At the end of its run, xfs_repair recommends contacting the xfs mailing
> list.
>
> https://gist.github.com/HugoKuo/95613d7864aa0a1343615642b3309451
>
> Perhaps I should go ahead and reboot the machine and run xfs_repair
> again. Please find my answers inline.
>
Yes, repair is crashing in this case. Best to try xfs_repair after
you've rebooted and mounted/umounted the fs to replay the log. If it's
still crashing at that point, we'll probably want a metadata image of
the fs, if possible (though there's a good chance a newer xfsprogs has
the problem fixed).
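
For reference, a minimal sketch of that sequence in Python. The device
and mount point paths are hypothetical placeholders, not taken from this
thread, and the script only prints the commands unless dry_run is turned
off. It assumes the usual mount, umount and xfs_repair binaries are on
the PATH.

```python
import subprocess


def replay_then_check_commands(device, mountpoint):
    """Build the commands for: replay the log, then do a no-modify check."""
    return [
        ["mount", device, mountpoint],  # a clean mount replays the XFS log
        ["umount", mountpoint],         # unmount again so repair can open the device
        ["xfs_repair", "-n", device],   # -n: report problems, modify nothing
    ]


def run(commands, dry_run=True):
    """Print each command; when dry_run is False, execute and stop on failure.

    Returns the first (command, returncode) pair that failed, or None.
    """
    for cmd in commands:
        print(" ".join(cmd))
        if dry_run:
            continue
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return cmd, result.returncode
    return None


if __name__ == "__main__":
    # Hypothetical paths; substitute the real multipath device and mount point.
    cmds = replay_then_check_commands("/dev/mapper/EXAMPLE", "/mnt/EXAMPLE")
    run(cmds, dry_run=True)
```

A non-zero return from the last step is not necessarily a crash, since
xfs_repair -n also exits non-zero when it finds corruption, so the
output needs to be read either way.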
>
> On Wed, Apr 20, 2016 at 3:34 AM, Brian Foster <bfoster@redhat.com> wrote:
>
> >
> > So there's definitely some traces waiting on AGF locks and whatnot, but
> > also many traces that appear to be waiting on I/O. For example:
> >
>
> Yes, those I/O waits are the original problem of this thread. It looks
> like the disk was locked. All of these I/Os are waiting on the same disk
> (a multipath entry).
>
>
> >
> > kernel: swift-object- D 0000000000000008 0 2096 1605 0x00000000
> > kernel: ffff8877cc2378b8 0000000000000082 ffff8877cc237818 ffff887ff016eb68
> > kernel: ffff883fd4ab6b28 0000000000000046 ffff883fd4bd9400 00000001e7ea49d0
> > kernel: ffff8877cc237848 ffffffff812735d1 ffff885fa2e4a5f8 ffff8877cc237fd8
> > kernel: Call Trace:
> > kernel: [<ffffffff812735d1>] ? __blk_run_queue+0x31/0x40
> > kernel: [<ffffffff81539455>] schedule_timeout+0x215/0x2e0
> > kernel: [<ffffffff812757c9>] ? blk_peek_request+0x189/0x210
> > kernel: [<ffffffff8126d9b3>] ? elv_queue_empty+0x33/0x40
> > kernel: [<ffffffffa00040a0>] ? dm_request_fn+0x240/0x340 [dm_mod]
> > kernel: [<ffffffff815390d3>] wait_for_common+0x123/0x180
> > kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
> > kernel: [<ffffffffa0001036>] ? dm_unplug_all+0x36/0x50 [dm_mod]
> > kernel: [<ffffffffa0415b56>] ? _xfs_buf_read+0x46/0x60 [xfs]
> > kernel: [<ffffffffa040b417>] ? xfs_trans_read_buf+0x197/0x410 [xfs]
> > kernel: [<ffffffff815391ed>] wait_for_completion+0x1d/0x20
> > kernel: [<ffffffffa041503b>] xfs_buf_iowait+0x9b/0x100 [xfs]
> > kernel: [<ffffffffa040b417>] ? xfs_trans_read_buf+0x197/0x410 [xfs]
> > kernel: [<ffffffffa0415b56>] _xfs_buf_read+0x46/0x60 [xfs]
> > kernel: [<ffffffffa0415c1b>] xfs_buf_read+0xab/0x100 [xfs]
> >
> >
> > Are all of these swift processes running against independent storage, or
> > one big array? Also, can you tell (e.g., with iotop) whether progress is
> > being made here, albeit very slowly, or if the storage is indeed locked
> > up..?
> >
> There are 240+ swift processes running.
> All stuck swift processes were attempting to access the same disk.
> Monitoring I/O via iotop, I can confirm it's indeed locked up rather than
> slow: there's 0 activity on the problematic mount point.
>
>
> > In any event, given the I/O hangs, the fact that you're on an old distro
> > kernel and you have things like multipath enabled, it might be
> > worthwhile to see if you can rule out any multipath issues.
> >
> >
> Upgrading the kernel on CentOS 6.5 may not be an option for the time
> being, but it's definitely worth a try by picking one of the nodes for
> testing later. As for multipath, yes, I did suspect some mysterious
> problem with multipath + XFS under a certain load. But it looked more
> like an XFS- and inode-related issue, hence I started investigating from
> XFS. If there's no way to move forward in XFS, I might break the
> multipath and observe the result for a while.
>
It's hard to pinpoint something to the fs when there's a bunch of hung
I/Os. You probably want to track down the source of those problems
first.
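
One rough way to start, sketched in Python under a few assumptions: it
reads /proc/<pid>/stat for tasks in D state and samples /proc/diskstats
twice, flagging any device that has requests in flight but no change in
its completed read/write counts. The function names and the 5 second
interval are made up for illustration, and only main threads are
scanned, not /proc/<pid>/task.

```python
import os
import time


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    left, right = text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def d_state_tasks(proc="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    tasks = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            tasks.append((pid, comm))
    return tasks


def parse_diskstats(text):
    """Map device name -> (reads_completed, writes_completed, in_flight)."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 12:
            stats[f[2]] = (int(f[3]), int(f[7]), int(f[11]))
    return stats


def stalled_devices(before, after):
    """Devices with I/O in flight whose completion counts did not move."""
    return sorted(
        name
        for name, (reads, writes, in_flight) in after.items()
        if in_flight > 0 and name in before and before[name][:2] == (reads, writes)
    )


if __name__ == "__main__":
    with open("/proc/diskstats") as f:
        first = parse_diskstats(f.read())
    time.sleep(5)  # arbitrary sampling interval
    with open("/proc/diskstats") as f:
        second = parse_diskstats(f.read())
    print("D state:", d_state_tasks())
    print("possibly stalled:", stalled_devices(first, second))
```

If the multipath dm device shows up as stalled while its underlying
paths do not (or the reverse), that at least narrows down which layer to
look at next. This is a heuristic, not a diagnosis.
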
Brian
>
> >
> > 'umount -l' doesn't necessarily force anything. It just lazily unmounts
> > the fs from the namespace and cleans up the mount once all references
> > are dropped. I suspect the fs is still mounted internally.
> >
> > Brian
> >
> >
> Thanks // Hugo
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs