From: Justin Bronder <jsbronder@gentoo.org>
To: linux-raid@vger.kernel.org
Subject: Re: Raid10 device hangs during resync and heavy I/O.
Date: Fri, 23 Jul 2010 11:47:01 -0400
Message-ID: <20100723154701.GA2090@gmail.com>
In-Reply-To: <20100723131925.4b2bd54e@notabene>
On 23/07/10 13:19 +1000, Neil Brown wrote:
> On Thu, 22 Jul 2010 14:49:33 -0400
> Justin Bronder <jsbronder@gentoo.org> wrote:
>
> > On 16/07/10 14:46 -0400, Justin Bronder wrote:
> >
> > I've done some more research that may potentially help. All of
> > the following was done with 2.6.34.1.
> >
> > Still produces the hang:
> > - Using cp (may take a bit longer).
> > - Using jfs as the filesystem.
> > - Dropping RESYNC_DEPTH to 32
> > - Using the offset layout.
> >
> > Does not produce the hang:
> > - Using the near layout.
> > - Using dd on the partition directly instead of on a
> >   filesystem via something like:
> >     dd if=/dev/${MD_DEV}p1 of=/dev/${MD_DEV}p1 seek=4001 bs=1M
> >
> >
> > As the barrier code is very similar, I repeated a number of
> > these tests using raid1 instead of raid10. In every case, I was
> > unable to cause the system to hang. I focused on the barriers
> > due to the tracebacks in the previous email. For the heck of it,
> > I added some tracing (patch below) where the reason for the hang
> > is fairly obvious. Of course, how it happened isn't.
> >
> > The last bit of the trace before the hang.
>
> Thanks for doing this!
>
> See below...
<previous trace cut>
>
>
> So the 'dd' process successfully waited for the barrier to be gone at
> 189.021179, and thus set pending to '1'. It then submitted the IO request.
> We should then see swapper (or possibly some other thread) calling
> allow_barrier when the request completes. But we don't.
> A request could possibly take many milliseconds to complete, but it shouldn't
> take seconds and certainly not minutes.
>
> It might be helpful if you could run this again, and in make_request(), after
> the call to "wait_barrier()" print out:
> bio->bi_sector, bio->bi_size, bio->bi_rw
>
> I'm guessing that the last request that doesn't seem to complete will be
> different from the others in some important way.

Nothing stood out to me, but here's the tail end of a couple of different
traces.

<...>-5047 [002] 207.023784: wait_barrier: in: dd - w:0 p:11 b:0
<...>-5047 [002] 207.023784: wait_barrier: out: dd - w:0 p:12 b:0
<...>-5047 [002] 207.023785: make_request: dd - sector:7472001 sz:40960 rw:0
<...>-4958 [002] 207.023872: raise_barrier: mid: md99_resync - w:0 p:12 b:1
<...>-5047 [002] 207.024689: allow_barrier: dd - w:0 p:11 b:1
<...>-5047 [002] 207.024695: allow_barrier: dd - w:0 p:10 b:1
<...>-5047 [002] 207.024697: allow_barrier: dd - w:0 p:9 b:1
<...>-5047 [002] 207.024710: allow_barrier: dd - w:0 p:8 b:1
<...>-5047 [002] 207.024713: allow_barrier: dd - w:0 p:7 b:1
<...>-5047 [002] 207.026679: wait_barrier: in: dd - w:0 p:7 b:1
<idle>-0 [003] 207.043049: allow_barrier: swapper - w:1 p:6 b:1
<idle>-0 [003] 207.043058: allow_barrier: swapper - w:1 p:5 b:1
<idle>-0 [003] 207.043063: allow_barrier: swapper - w:1 p:4 b:1
<idle>-0 [003] 207.043070: allow_barrier: swapper - w:1 p:3 b:1
<idle>-0 [003] 207.043074: allow_barrier: swapper - w:1 p:2 b:1
<idle>-0 [003] 207.043079: allow_barrier: swapper - w:1 p:1 b:1
<idle>-0 [003] 207.043084: allow_barrier: swapper - w:1 p:0 b:1
<...>-4958 [003] 207.043108: raise_barrier: out: md99_resync - w:1 p:0 b:1
<...>-4958 [003] 207.043150: raise_barrier: in: md99_resync - w:1 p:0 b:1
<...>-4957 [003] 207.051206: lower_barrier: md99_raid10 - w:1 p:0 b:0
<...>-5047 [002] 207.051215: wait_barrier: out: dd - w:0 p:1 b:0
<...>-5047 [002] 207.051216: make_request: dd - sector:7472081 sz:20480 rw:0
<...>-4958 [003] 207.051218: raise_barrier: mid: md99_resync - w:0 p:1 b:1
<...>-5047 [002] 207.051227: wait_barrier: in: dd - w:0 p:1 b:1
<idle>-0 [002] 207.058929: allow_barrier: swapper - w:1 p:0 b:1
<...>-4958 [003] 207.058938: raise_barrier: out: md99_resync - w:1 p:0 b:1
<...>-4958 [003] 207.059044: raise_barrier: in: md99_resync - w:1 p:0 b:1
<...>-4957 [003] 207.067171: lower_barrier: md99_raid10 - w:1 p:0 b:0
<...>-5047 [002] 207.067179: wait_barrier: out: dd - w:0 p:1 b:0
<...>-5047 [002] 207.067180: make_request: dd - sector:7472121 sz:3584 rw:0
<...>-4958 [003] 207.067182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
<...>-5047 [002] 207.067184: wait_barrier: in: dd - w:0 p:1 b:1
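
The lines above (timestamps 207.x) end with dd entering wait_barrier while
the counters read p:1 b:1. The sketch below is a toy Python model, not the
kernel code, of why that state cannot move on its own. It assumes that w, p
and b are the waiting, pending and barrier counters, that normal I/O may
leave wait_barrier only while the barrier is zero, and that the resync thread
may leave raise_barrier only once pending is zero. The class and method names
are hypothetical, and the RESYNC_DEPTH limit is left out.

```python
from dataclasses import dataclass


@dataclass
class BarrierState:
    """Toy, single-threaded model of the three counters printed in the trace."""

    waiting: int = 0  # w: tasks blocked in wait_barrier (assumed meaning)
    pending: int = 0  # p: normal requests submitted but not yet completed
    barrier: int = 0  # b: non-zero while resync holds the barrier (assumed)

    def can_pass_wait_barrier(self) -> bool:
        # Normal I/O may proceed only while no barrier is raised.
        return self.barrier == 0

    def can_finish_raise_barrier(self) -> bool:
        # Resync may proceed only once every pending request has completed.
        return self.pending == 0

    def wait_barrier_out(self) -> None:
        # A normal request gets past the barrier and is counted as pending.
        assert self.can_pass_wait_barrier()
        self.pending += 1

    def allow_barrier(self) -> None:
        # A normal request completes.
        self.pending -= 1

    def raise_barrier_mid(self) -> None:
        # Resync raises the barrier, then has to wait for pending to drain.
        self.barrier += 1

    def lower_barrier(self) -> None:
        self.barrier -= 1

    def stuck(self) -> bool:
        # Neither side can move until someone calls allow_barrier().
        return (not self.can_pass_wait_barrier()
                and not self.can_finish_raise_barrier())


if __name__ == "__main__":
    state = BarrierState()
    state.wait_barrier_out()   # "wait_barrier: out" with p going to 1
    state.raise_barrier_mid()  # "raise_barrier: mid" with b going to 1
    print("stuck before completion:", state.stuck())
    state.allow_barrier()      # the completion that never shows up in the trace
    print("stuck after completion:", state.stuck())
```

In this model the only way out of the p:1 b:1 state is an allow_barrier call
for the outstanding request, which is the call missing from the end of the
trace.
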
<idle>-0 [000] 463.231730: allow_barrier: swapper - w:2 p:4 b:1
<idle>-0 [000] 463.231739: allow_barrier: swapper - w:2 p:3 b:1
<idle>-0 [000] 463.231746: allow_barrier: swapper - w:2 p:2 b:1
<idle>-0 [000] 463.231765: allow_barrier: swapper - w:2 p:1 b:1
<idle>-0 [000] 463.231774: allow_barrier: swapper - w:2 p:0 b:1
<...>-5004 [000] 463.231792: raise_barrier: out: md99_resync - w:2 p:0 b:1
<...>-5004 [000] 463.232005: raise_barrier: in: md99_resync - w:2 p:0 b:1
<...>-5003 [001] 463.232453: lower_barrier: md99_raid10 - w:2 p:0 b:0
<...>-5009 [000] 463.232463: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
<...>-5009 [000] 463.232464: make_request: flush-9:99 - sector:13931137 sz:61440 rw:1
<...>-5105 [001] 463.232466: wait_barrier: out: dd - w:0 p:2 b:0
<...>-5105 [001] 463.232467: make_request: dd - sector:7204393 sz:40960 rw:0
<...>-5009 [000] 463.232476: wait_barrier: in: flush-9:99 - w:0 p:2 b:0
<...>-5009 [000] 463.232477: wait_barrier: out: flush-9:99 - w:0 p:3 b:0
<...>-5009 [000] 463.232477: make_request: flush-9:99 - sector:13931257 sz:3584 rw:1
<...>-5009 [000] 463.232481: wait_barrier: in: flush-9:99 - w:0 p:3 b:0
<...>-5009 [000] 463.232482: wait_barrier: out: flush-9:99 - w:0 p:4 b:0
<...>-5009 [000] 463.232483: make_request: flush-9:99 - sector:13931264 sz:512 rw:1
<...>-5105 [001] 463.232492: wait_barrier: in: dd - w:0 p:4 b:0
<...>-5105 [001] 463.232493: wait_barrier: out: dd - w:0 p:5 b:0
<...>-5105 [001] 463.232494: make_request: dd - sector:7204473 sz:3584 rw:0
<...>-5004 [000] 463.232495: raise_barrier: mid: md99_resync - w:0 p:5 b:1
<...>-5105 [001] 463.232496: wait_barrier: in: dd - w:0 p:5 b:1
<...>-5009 [000] 463.232522: wait_barrier: in: flush-9:99 - w:1 p:5 b:1
<idle>-0 [000] 463.232726: allow_barrier: swapper - w:2 p:4 b:1
<idle>-0 [001] 463.240520: allow_barrier: swapper - w:2 p:3 b:1
<idle>-0 [000] 463.240946: allow_barrier: swapper - w:2 p:2 b:1
<idle>-0 [000] 463.240955: allow_barrier: swapper - w:2 p:1 b:1
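
To compare the requests without reading the raw lines, something like the
following Python sketch can tabulate the make_request entries. It assumes the
line format shown above, that sector: counts 512-byte sectors and that sz: is
in bytes. The function name parse_requests and the end_sector field are
hypothetical and only serve the illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: sector: is in 512-byte units, sz: in bytes

# Matches the make_request lines of the trace format shown above.
LINE_RE = re.compile(
    r"make_request: (?P<task>\S+) - "
    r"sector:(?P<sector>\d+) sz:(?P<size>\d+) rw:(?P<rw>\d+)"
)


def parse_requests(trace_text):
    """Return one dict per make_request line, in trace order."""
    requests = []
    for line in trace_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # barrier lines and anything else are skipped
        sector = int(match.group("sector"))
        size = int(match.group("size"))
        requests.append({
            "task": match.group("task"),
            "sector": sector,
            "size": size,
            "rw": int(match.group("rw")),
            # First sector after the request, for checking contiguity.
            "end_sector": sector + size // SECTOR_BYTES,
        })
    return requests


# Three lines copied from the first trace above.
SAMPLE = """\
<...>-5047 [002] 207.023785: make_request: dd - sector:7472001 sz:40960 rw:0
<...>-5047 [002] 207.051216: make_request: dd - sector:7472081 sz:20480 rw:0
<...>-5047 [002] 207.067180: make_request: dd - sector:7472121 sz:3584 rw:0
"""

if __name__ == "__main__":
    for req in parse_requests(SAMPLE):
        print(req["task"], req["sector"], req["size"], req["rw"],
              req["end_sector"])
```

Feeding a whole saved trace to parse_requests and looking at the last entry
per task gives the request that was in flight when the trace stops.
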
Thanks,
--
Justin Bronder