linux-raid.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.de>
To: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: After 0->10 takeover process hangs at "wait_barrier"
Date: Thu, 3 Feb 2011 18:35:48 +1100
Message-ID: <20110203183548.3b875693@notabene.brown>
In-Reply-To: <BE2BFE91933D1B4089447C6448604080670E9A24@irsmsx503.ger.corp.intel.com>

On Wed, 2 Feb 2011 12:15:28 +0000 "Wojcik, Krzysztof"
<krzysztof.wojcik@intel.com> wrote:

> Neil,
> 
> I would like to return to the problem with the raid0->raid10 takeover operation.
> I observed the following symptoms:
> 1. After a raid0->raid10 takeover we have an array with 2 missing disks. When we add a disk for rebuild, the recovery process starts as expected but does not finish: it stops at about 90% and the md126_resync process hangs in "D" state.
> 2. Similar behavior occurs when we have a mounted raid0 array and execute a takeover to raid10. When we then try to unmount the array, the umount process hangs in "D" state.
> 
> In both scenarios the processes hang in the same function, wait_barrier in raid10.c.
> The process waits in the "wait_event_lock_irq" macro until the "!conf->barrier" condition becomes true. In the scenarios above that never happens.
> 
> The issue does not appear if, after the takeover, we stop the array and assemble it again; we can then rebuild the disks without problems. This indicates that the raid0->raid10 takeover does not initialize all array parameters properly.
> 
> Do you have any suggestions as to what I can do to get closer to solving this problem?

Yes.

Towards the end of level_store, after calling pers->run, we call
mddev_resume.
This calls pers->quiesce(mddev, 0).

With RAID10, that calls lower_barrier.
However raise_barrier hadn't been called on that 'conf' yet,
so conf->barrier becomes negative, which is bad.
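
To illustrate the imbalance described above, here is a minimal userspace
sketch. It is not the raid10.c code: struct toy_conf and the toy_* functions
are hypothetical stand-ins, and the only things taken from this thread are
that lower_barrier runs without a matching raise_barrier and that
wait_barrier waits for "!conf->barrier".

```c
#include <stdbool.h>

/* Toy stand-in for the barrier counter kept in the raid10 conf. */
struct toy_conf {
	int barrier;	/* 0 means normal I/O may pass */
};

/* Models raise_barrier(): one more holder of the barrier. */
void toy_raise_barrier(struct toy_conf *conf)
{
	conf->barrier++;
}

/* Models lower_barrier(): one holder fewer. */
void toy_lower_barrier(struct toy_conf *conf)
{
	conf->barrier--;
}

/* Models the condition that wait_barrier() sleeps on. */
bool toy_io_may_proceed(const struct toy_conf *conf)
{
	return !conf->barrier;
}

/* Fresh conf, then a resume that lowers a barrier nobody raised. */
int toy_barrier_after_unpaired_resume(void)
{
	struct toy_conf conf = { .barrier = 0 };

	toy_lower_barrier(&conf);
	return conf.barrier;
}
```

In this model the counter rests at a non-zero value after the unpaired
lower, so toy_io_may_proceed() stays false while the array is idle, and any
later balanced raise/lower pair returns it to that same non-zero value.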

Maybe raid10_takeover_raid0 should call raise_barrier on the conf
before returning it.
I suspect that is the right approach, but I would need to review some
of the code in various levels to make sure it makes sense, and would
need to add some comments to clarify this.
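
The effect of raising the barrier in the takeover path can be checked with
a small userspace model. This is only a sketch under stated assumptions:
the function names are hypothetical, a plain int stands in for
conf->barrier, and the single decrement stands in for the lower_barrier
call reached through mddev_resume.

```c
#include <stdbool.h>

/* Toy model only: returns the barrier count left after the resume step. */
int toy_barrier_after_takeover(bool raise_in_takeover)
{
	int barrier = 0;	/* fresh conf built by the takeover */

	if (raise_in_takeover)
		barrier++;	/* takeover raises the barrier before returning */

	barrier--;		/* resume -> quiesce(mddev, 0) -> lower */
	return barrier;
}

/* Normal I/O waits until the count is exactly zero. */
bool toy_wait_barrier_would_block(int barrier)
{
	return barrier != 0;
}
```

With raise_in_takeover set, the model ends with a count of zero and waiters
are not blocked; without it the count ends below zero and they are.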

Could you just try that one change and see if it fixed the problem?

i.e.

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 69b6595..10b636d 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2467,7 +2467,7 @@ static void *raid10_takeover_raid0(mddev_t *mddev)
 		list_for_each_entry(rdev, &mddev->disks, same_set)
 			if (rdev->raid_disk >= 0)
 				rdev->new_raid_disk = rdev->raid_disk * 2;
-		
+	conf->barrier++;
 	return conf;
 }
 


Thanks,
NeilBrown
