linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Justin Bronder <jsbronder@gentoo.org>
To: linux-raid@vger.kernel.org
Subject: Re: Raid10 device hangs during resync and heavy I/O.
Date: Fri, 23 Jul 2010 11:47:01 -0400	[thread overview]
Message-ID: <20100723154701.GA2090@gmail.com> (raw)
In-Reply-To: <20100723131925.4b2bd54e@notabene>

[-- Attachment #1: Type: text/plain, Size: 7321 bytes --]

On 23/07/10 13:19 +1000, Neil Brown wrote:
> On Thu, 22 Jul 2010 14:49:33 -0400
> Justin Bronder <jsbronder@gentoo.org> wrote:
> 
> > On 16/07/10 14:46 -0400, Justin Bronder wrote:
> > 
> > I've done some more research that may potentially help. All of
> > the following was done with 2.6.34.1.
> > 
> > Still produces the hang:
> >     - Using cp (may take a bit longer).
> >     - Using jfs as the filesystem.
> >     - Dropping RESYNC_DEPTH to 32
> >     - Using the offset layout.
> > 
> > Does not produce the hang:
> >     - Using the near layout.
> >     - Using dd on the partition directly instead of on a
> >       filesystem via something like:
> >       dd if=/dev/${MD_DEV}p1 of=/dev/${MD_DEV}p1 seek=4001 bs=1M
> > 
> > 
> > As the barrier code is very similiar, I repeated a number of
> > these tests using raid1 instead of raid10.  In every case, I was
> > unable to cause the system to hang.  I focused on the barriers
> > due to the tracebacks in the previous email.  For the heck of it,
> > I added some tracing (patch below) where the reason for the hang
> > is fairly obvious.  Of course, how it happened isn't.
> > 
> > The last bit of the trace before the hang.
> 
> Thanks for doing this!
> 
> See below...

<previous trace cut>

> 
> 
> So the 'dd' process successfully waited for the barrier to be gone at
> 189.021179, and thus set pending to '1'.  It then submitted the IO request.
> We should then see swapper (or possibly some other thread) calling
> allow_barrier when the request completes.  But we don't.
> A request could possibly take many milliseconds to complete, but it shouldn't
> take seconds and certainly not minutes.
> 
> It might be helpful if you could run this again, and in make_request(), after
> the call to "wait_barrier()" print out:
>   bio->bi_sector, bio->bi_size, bio->bi_rw
> 
> I'm guessing that the last request that doesn't seem to complete will be
> different from the other in some important way.

Nothing stood out to me, but here's the tail end of a couple of different
traces.

           <...>-5047  [002]   207.023784: wait_barrier: in:  dd - w:0 p:11 b:0
           <...>-5047  [002]   207.023784: wait_barrier: out: dd - w:0 p:12 b:0
           <...>-5047  [002]   207.023785: make_request: dd - sector:7472001 sz:40960 rw:0
           <...>-4958  [002]   207.023872: raise_barrier: mid: md99_resync - w:0 p:12 b:1
           <...>-5047  [002]   207.024689: allow_barrier:     dd - w:0 p:11 b:1
           <...>-5047  [002]   207.024695: allow_barrier:     dd - w:0 p:10 b:1
           <...>-5047  [002]   207.024697: allow_barrier:     dd - w:0 p:9 b:1
           <...>-5047  [002]   207.024710: allow_barrier:     dd - w:0 p:8 b:1
           <...>-5047  [002]   207.024713: allow_barrier:     dd - w:0 p:7 b:1
           <...>-5047  [002]   207.026679: wait_barrier: in:  dd - w:0 p:7 b:1
          <idle>-0     [003]   207.043049: allow_barrier:     swapper - w:1 p:6 b:1
          <idle>-0     [003]   207.043058: allow_barrier:     swapper - w:1 p:5 b:1
          <idle>-0     [003]   207.043063: allow_barrier:     swapper - w:1 p:4 b:1
          <idle>-0     [003]   207.043070: allow_barrier:     swapper - w:1 p:3 b:1
          <idle>-0     [003]   207.043074: allow_barrier:     swapper - w:1 p:2 b:1
          <idle>-0     [003]   207.043079: allow_barrier:     swapper - w:1 p:1 b:1
          <idle>-0     [003]   207.043084: allow_barrier:     swapper - w:1 p:0 b:1
           <...>-4958  [003]   207.043108: raise_barrier: out: md99_resync - w:1 p:0 b:1
           <...>-4958  [003]   207.043150: raise_barrier: in:  md99_resync - w:1 p:0 b:1
           <...>-4957  [003]   207.051206: lower_barrier:     md99_raid10 - w:1 p:0 b:0
           <...>-5047  [002]   207.051215: wait_barrier: out: dd - w:0 p:1 b:0
           <...>-5047  [002]   207.051216: make_request: dd - sector:7472081 sz:20480 rw:0
           <...>-4958  [003]   207.051218: raise_barrier: mid: md99_resync - w:0 p:1 b:1
           <...>-5047  [002]   207.051227: wait_barrier: in:  dd - w:0 p:1 b:1
          <idle>-0     [002]   207.058929: allow_barrier:     swapper - w:1 p:0 b:1
           <...>-4958  [003]   207.058938: raise_barrier: out: md99_resync - w:1 p:0 b:1
           <...>-4958  [003]   207.059044: raise_barrier: in:  md99_resync - w:1 p:0 b:1
           <...>-4957  [003]   207.067171: lower_barrier:     md99_raid10 - w:1 p:0 b:0
           <...>-5047  [002]   207.067179: wait_barrier: out: dd - w:0 p:1 b:0
           <...>-5047  [002]   207.067180: make_request: dd - sector:7472121 sz:3584 rw:0
           <...>-4958  [003]   207.067182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
           <...>-5047  [002]   207.067184: wait_barrier: in:  dd - w:0 p:1 b:1



          <idle>-0     [000]   463.231730: allow_barrier:     swapper - w:2 p:4 b:1
          <idle>-0     [000]   463.231739: allow_barrier:     swapper - w:2 p:3 b:1
          <idle>-0     [000]   463.231746: allow_barrier:     swapper - w:2 p:2 b:1
          <idle>-0     [000]   463.231765: allow_barrier:     swapper - w:2 p:1 b:1
          <idle>-0     [000]   463.231774: allow_barrier:     swapper - w:2 p:0 b:1
           <...>-5004  [000]   463.231792: raise_barrier: out: md99_resync - w:2 p:0 b:1
           <...>-5004  [000]   463.232005: raise_barrier: in:  md99_resync - w:2 p:0 b:1
           <...>-5003  [001]   463.232453: lower_barrier:     md99_raid10 - w:2 p:0 b:0
           <...>-5009  [000]   463.232463: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
           <...>-5009  [000]   463.232464: make_request: flush-9:99 - sector:13931137 sz:61440 rw:1
           <...>-5105  [001]   463.232466: wait_barrier: out: dd - w:0 p:2 b:0
           <...>-5105  [001]   463.232467: make_request: dd - sector:7204393 sz:40960 rw:0
           <...>-5009  [000]   463.232476: wait_barrier: in:  flush-9:99 - w:0 p:2 b:0
           <...>-5009  [000]   463.232477: wait_barrier: out: flush-9:99 - w:0 p:3 b:0
           <...>-5009  [000]   463.232477: make_request: flush-9:99 - sector:13931257 sz:3584 rw:1
           <...>-5009  [000]   463.232481: wait_barrier: in:  flush-9:99 - w:0 p:3 b:0
           <...>-5009  [000]   463.232482: wait_barrier: out: flush-9:99 - w:0 p:4 b:0
           <...>-5009  [000]   463.232483: make_request: flush-9:99 - sector:13931264 sz:512 rw:1
           <...>-5105  [001]   463.232492: wait_barrier: in:  dd - w:0 p:4 b:0
           <...>-5105  [001]   463.232493: wait_barrier: out: dd - w:0 p:5 b:0
           <...>-5105  [001]   463.232494: make_request: dd - sector:7204473 sz:3584 rw:0
           <...>-5004  [000]   463.232495: raise_barrier: mid: md99_resync - w:0 p:5 b:1
           <...>-5105  [001]   463.232496: wait_barrier: in:  dd - w:0 p:5 b:1
           <...>-5009  [000]   463.232522: wait_barrier: in:  flush-9:99 - w:1 p:5 b:1
          <idle>-0     [000]   463.232726: allow_barrier:     swapper - w:2 p:4 b:1
          <idle>-0     [001]   463.240520: allow_barrier:     swapper - w:2 p:3 b:1
          <idle>-0     [000]   463.240946: allow_barrier:     swapper - w:2 p:2 b:1
          <idle>-0     [000]   463.240955: allow_barrier:     swapper - w:2 p:1 b:1

Thanks,

-- 
Justin Bronder

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

  reply	other threads:[~2010-07-23 15:47 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-16 18:46 Raid10 device hangs during resync and heavy I/O Justin Bronder
2010-07-16 18:49 ` Justin Bronder
2010-07-22 18:49 ` Justin Bronder
2010-07-23  3:19   ` Neil Brown
2010-07-23 15:47     ` Justin Bronder [this message]
2010-08-02  2:29       ` Neil Brown
2010-08-02  2:58         ` Neil Brown
2010-08-02 20:37           ` Justin Bronder
2010-08-07 11:22             ` Neil Brown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100723154701.GA2090@gmail.com \
    --to=jsbronder@gentoo.org \
    --cc=linux-raid@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).