linux-raid.vger.kernel.org archive mirror
From: Les Stroud <les@lesstroud.com>
To: Shaohua Li <shli@kernel.org>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Process stuck in md_flush_request (state: D)
Date: Mon, 27 Feb 2017 21:58:37 -0500	[thread overview]
Message-ID: <4743144107696292033@unknownmsgid> (raw)
In-Reply-To: <1224510038.17134.1488242683070@vsaw28.prod.google.com>

Sent from my iPhone

> On Feb 27, 2017, at 7:44 PM, Shaohua Li <shli@kernel.org> wrote:
>
>> On Mon, Feb 27, 2017 at 01:48:00PM -0500, Les Stroud wrote:
>>
>>
>>
>>
>>> On Feb 27, 2017, at 1:28 PM, Shaohua Li <shli@kernel.org> wrote:
>>>
>>> On Mon, Feb 27, 2017 at 09:49:59AM -0500, Les Stroud wrote:
>>>> After a couple of weeks of one of our test instances hitting this problem every other day, they were all nice enough to operate without an issue for 9 days.  It finally recurred last night on one of the machines.
>>>>
>>>> It exhibits the same symptoms, and the call traces look as they did previously.  This particular instance is configured with the deadline scheduler.  I was able to capture the inflight counts you requested:
>>>>
>>>> $ cat /sys/block/xvd[abcde]/inflight
>>>>        0        0
>>>>        0        0
>>>>        0        0
>>>>        0        0
>>>>        0        0
>>>>
>>>> I’ve had this happen on instances with the deadline scheduler and the noop scheduler.  So far, it has not happened on an instance running noop with the raid filesystem (ext4) mounted with nobarrier.  The noop/nobarrier instances have not been running long enough for me to conclude that this works around the problem. Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.
>>>
>>> With nobarrier, ext4 doesn't send flush requests.
>>
>> So, could ext4’s flush request deadlock with md_flush_request?  Do they share a mutex of some sort? Could one of them be failing to acquire a mutex and not handling it?
>
> No, it shouldn't deadlock. I don't have any other reports of this issue. Yours is the only one.
>
>>>
>>>> Does any of this help with identifying the bug?  Is there any more information I can get that would be useful?
>>>
>>>
>>> Unfortunately I can't find anything fishy. Does the xvdX disk correctly
>>> handle flush requests? For example, you can do the same test with a single
>>> such disk and check if anything goes wrong.
>>
I'll test a single disk config.
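
Roughly what I have in mind for that (a sketch; the device name, mount
point, and loop count below are placeholders):

  # Single-disk flush test.  /dev/xvdf and /mnt/singletest are
  # placeholders; ext4 barriers are left at their default (on).
  mkfs.ext4 /dev/xvdf
  mkdir -p /mnt/singletest
  mount /dev/xvdf /mnt/singletest

  # Flush-heavy workload: small writes, each followed by an fsync,
  # so the disk sees a steady stream of flush requests.
  for i in $(seq 1 100000); do
      dd if=/dev/zero of=/mnt/singletest/f bs=4k count=1 conv=fsync 2>/dev/null
  done

If a single disk hangs the same way under that, it would point at the
virtual disk rather than md.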


>> Until recently, we had a number of these systems set up without raid0.  This issue never occurred on those systems.  Unfortunately, I can’t find a way to make it happen other than standing a server up and letting it run.
>>
>> I suppose I could try a different filesystem and see if that makes a difference (maybe ext3, xfs, etc).
>
> You could format an xvdX disk, do a test against it, and check if there is
> anything wrong. To be honest, I don't think it's a problem on the ext4 side
> either, but it's better to try other filesystems. If the xvdX device uses a
> proprietary driver, I highly recommend a check with a single such disk first.
>

These disks are AWS EBS volumes, so maybe it is an issue in the Xen
virtual block driver?  I'll see if Amazon support can give me any
information about what's happening below the OS.
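
In the meantime, I'll try to confirm from inside the guest what is
actually driving these devices, with something like this (the paths are
the usual ones, but the exact layout can vary by kernel):

  # Check whether the xvd devices are handled by the Xen PV block
  # front-end, and what the kernel reported at probe time.
  dmesg | grep -i blkfront
  ls -l /sys/block/xvda/device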

Is there any other output that might tell me what the process is waiting on?
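
For context, the call traces I mentioned came from something like the
following (assuming /proc/<pid>/stack is available and sysrq is enabled
via the kernel.sysrq sysctl):

  # Kernel stack of the stuck D-state process:
  cat /proc/<pid>/stack

  # Dump all uninterruptible (blocked) tasks into the kernel log:
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 60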

Thanx,
LES


> Thanks,
> Shaohua

Thread overview: 8+ messages
     [not found] <36A8825E-F387-4ED8-8672-976094B3BEBB@lesstroud.com>
2017-02-17 19:05 ` Process stuck in md_flush_request (state: D) Les Stroud
2017-02-17 20:06   ` Shaohua Li
2017-02-17 20:40     ` Les Stroud
2017-02-27 14:49       ` Les Stroud
2017-02-27 18:28         ` Shaohua Li
2017-02-27 18:48           ` Les Stroud
2017-02-28  0:44             ` Shaohua Li
     [not found]             ` <1224510038.17134.1488242683070@vsaw28.prod.google.com>
2017-02-28  2:58               ` Les Stroud [this message]
