* [PATCH 0/2] FIX: Process hangs at wait_barrier
@ 2011-02-04 13:18 Krzysztof Wojcik
  2011-02-04 13:18 ` [PATCH 1/2] FIX: md: process hangs at wait_barrier after 0->10 takeover Krzysztof Wojcik
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Krzysztof Wojcik @ 2011-02-04 13:18 UTC (permalink / raw)
  To: neilb
  Cc: linux-raid, wojciech.neubauer, adam.kwolek, dan.j.williams,
	ed.ciechanowski

These patches resolve a problem with a process hanging in wait_barrier()
after a raid0->raid10 takeover.
The first patch resolves this particular problem.
The solution is similar to the RAID1 barrier implementation.
The second is a proposal for general protection against the barrier
becoming negative.

---

Krzysztof Wojcik (2):
      FIX: md: process hangs at wait_barrier after 0->10 takeover
      FIX: md: Prevent barrier become negative


 drivers/md/raid1.c  |    3 ++-
 drivers/md/raid10.c |    9 ++++++---
 2 files changed, 8 insertions(+), 4 deletions(-)

-- 
Krzysztof Wojcik


* [PATCH 1/2] FIX: md: process hangs at wait_barrier after 0->10 takeover
  2011-02-04 13:18 [PATCH 0/2] FIX: Process hangs at wait_barrier Krzysztof Wojcik
@ 2011-02-04 13:18 ` Krzysztof Wojcik
  2011-02-04 13:18 ` [PATCH 2/2] FIX: md: Prevent barrier become negative Krzysztof Wojcik
  2011-02-08  0:50 ` [PATCH 0/2] FIX: Process hangs at wait_barrier NeilBrown
  2 siblings, 0 replies; 4+ messages in thread
From: Krzysztof Wojcik @ 2011-02-04 13:18 UTC (permalink / raw)
  To: neilb
  Cc: linux-raid, wojciech.neubauer, adam.kwolek, dan.j.williams,
	ed.ciechanowski

The following symptoms were observed:
1. After a raid0->raid10 takeover operation we have an array with 2
missing disks.
When we add a disk for rebuild, the recovery process starts as expected
but does not finish: it stops at about 90% and the md126_resync process
hangs in "D" state.
2. Similar behavior occurs when we have a mounted raid0 array and
execute a takeover to raid10. When we then try to unmount the array,
the umount process hangs in "D" state.

In both scenarios the processes hang in the same function, wait_barrier
in raid10.c.
The process waits in the macro "wait_event_lock_irq" until the
"!conf->barrier" condition becomes true.
In the scenarios above this never happens.

The reason is that at the end of level_store, after calling pers->run,
we call mddev_resume. With RAID10 this calls pers->quiesce(mddev, 0),
which calls lower_barrier.
However, raise_barrier has not been called on that 'conf' yet,
so conf->barrier becomes negative and the "!conf->barrier" condition
is no longer true.

This patch sets conf->barrier = 1 after the takeover
operation. This prevents the barrier from becoming negative when
lower_barrier() is called.
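
For illustration only, the sequence can be modelled in userspace.
This is a minimal sketch, not kernel code: struct conf_model and the
model_* helpers are hypothetical names, locking and wait queues are
left out, and it assumes that a freshly set up conf starts with
barrier == 0.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the raid10 conf; only the counter is modelled. */
struct conf_model {
	int barrier;
};

/* Mirrors the unguarded decrement done by lower_barrier(). */
static void model_lower_barrier(struct conf_model *conf)
{
	conf->barrier--;
}

/* wait_barrier() sleeps until !conf->barrier, so any non-zero value blocks. */
static bool model_wait_barrier_blocks(const struct conf_model *conf)
{
	return conf->barrier != 0;
}

/*
 * Models the end of level_store(): takeover produces a conf with the given
 * initial barrier, then mddev_resume() leads to a single lower_barrier().
 * Returns true if a later wait_barrier() would block.
 */
static bool model_hangs_after_takeover(int initial_barrier)
{
	struct conf_model conf = { .barrier = initial_barrier };

	model_lower_barrier(&conf);
	return model_wait_barrier_blocks(&conf);
}
```

In this model, model_hangs_after_takeover(0) reports a blocked waiter
(the counter ends at -1), while model_hangs_after_takeover(1) does not
(the counter ends at 0).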

Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
---
 drivers/md/raid10.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 69b6595..3b607b2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2463,11 +2463,13 @@ static void *raid10_takeover_raid0(mddev_t *mddev)
 	mddev->recovery_cp = MaxSector;
 
 	conf = setup_conf(mddev);
-	if (!IS_ERR(conf))
+	if (!IS_ERR(conf)) {
 		list_for_each_entry(rdev, &mddev->disks, same_set)
 			if (rdev->raid_disk >= 0)
 				rdev->new_raid_disk = rdev->raid_disk * 2;
-		
+		conf->barrier = 1;
+	}
+
 	return conf;
 }
 



* [PATCH 2/2] FIX: md: Prevent barrier become negative
  2011-02-04 13:18 [PATCH 0/2] FIX: Process hangs at wait_barrier Krzysztof Wojcik
  2011-02-04 13:18 ` [PATCH 1/2] FIX: md: process hangs at wait_barrier after 0->10 takeover Krzysztof Wojcik
@ 2011-02-04 13:18 ` Krzysztof Wojcik
  2011-02-08  0:50 ` [PATCH 0/2] FIX: Process hangs at wait_barrier NeilBrown
  2 siblings, 0 replies; 4+ messages in thread
From: Krzysztof Wojcik @ 2011-02-04 13:18 UTC (permalink / raw)
  To: neilb
  Cc: linux-raid, wojciech.neubauer, adam.kwolek, dan.j.williams,
	ed.ciechanowski

In some situations the barrier counter may become negative.
Calling lower_barrier with barrier=0 makes the barrier negative.
This is a harmful situation and may cause a process to hang when we
call wait_barrier().

This patch adds a condition to the lower_barrier function:
the barrier counter is decremented only if it is raised.
This prevents the barrier variable from becoming negative.
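
As a rough userspace illustration of the guarded decrement (a sketch
only: struct conf_model and model_lower_barrier_guarded are
hypothetical names, and the spinlock and wake_up are omitted):

```c
/* Hypothetical stand-in for conf_t; only the counter is modelled. */
struct conf_model {
	int barrier;
};

/* Decrement only when the barrier is raised, so it cannot drop below 0. */
static void model_lower_barrier_guarded(struct conf_model *conf)
{
	if (conf->barrier > 0)
		conf->barrier--;
}
```

With this guard, a call made while barrier is 0 leaves it at 0, and a
call made while barrier is 2 leaves it at 1.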

Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
---
 drivers/md/raid1.c  |    3 ++-
 drivers/md/raid10.c |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a23ffa3..fa7077b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -660,7 +660,8 @@ static void lower_barrier(conf_t *conf)
 	unsigned long flags;
 	BUG_ON(conf->barrier <= 0);
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
+	if (conf->barrier > 0)
+		conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3b607b2..c9e46a9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -727,7 +727,8 @@ static void lower_barrier(conf_t *conf)
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
+	if (conf->barrier > 0)
+		conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }



* Re: [PATCH 0/2] FIX: Process hangs at wait_barrier
  2011-02-04 13:18 [PATCH 0/2] FIX: Process hangs at wait_barrier Krzysztof Wojcik
  2011-02-04 13:18 ` [PATCH 1/2] FIX: md: process hangs at wait_barrier after 0->10 takeover Krzysztof Wojcik
  2011-02-04 13:18 ` [PATCH 2/2] FIX: md: Prevent barrier become negative Krzysztof Wojcik
@ 2011-02-08  0:50 ` NeilBrown
  2 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2011-02-08  0:50 UTC (permalink / raw)
  To: Krzysztof Wojcik
  Cc: linux-raid, wojciech.neubauer, adam.kwolek, dan.j.williams,
	ed.ciechanowski

On Fri, 04 Feb 2011 14:18:18 +0100 Krzysztof Wojcik
<krzysztof.wojcik@intel.com> wrote:

> Patches resolve problem with process crash at wait_barrier()
> after raid0->raid10 takeover.
> First patch resolve this particular problem.
> Solution is similar to RAID1 barrier implementation.
> Second is proposal for general protection against barrier
> become negative.
> 
> ---
> 
> Krzysztof Wojcik (2):
>       FIX: md: process hangs at wait_barrier after 0->10 takeover

Applied, thanks.


>       FIX: md: Prevent barrier become negative

If we are ever trying to make 'barrier' negative, that is a bug somewhere.
So I would prefer:

   BUG_ON(conf->barrier <= 0);
   conf->barrier--;
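
The behaviour of such a check can be sketched in userspace as follows.
This is only an illustration under stated assumptions: MODEL_BUG_ON
and model_bug_hit are hypothetical stand-ins that record the condition
instead of halting, which is where the kernel's BUG_ON() would stop.

```c
#include <stdbool.h>

/* Set at the point where the kernel's BUG_ON() would halt (hypothetical). */
static bool model_bug_hit;

#define MODEL_BUG_ON(cond) do { if (cond) model_bug_hit = true; } while (0)

/* Hypothetical stand-in for conf_t; only the counter is modelled. */
struct conf_model {
	int barrier;
};

/* Report an unbalanced lower_barrier() instead of silently ignoring it. */
static void model_lower_barrier_checked(struct conf_model *conf)
{
	MODEL_BUG_ON(conf->barrier <= 0);
	conf->barrier--;
}
```

The point of this form is that a caller lowering a barrier it never
raised is made visible at the call, rather than being masked.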

NeilBrown


> 
> 
>  drivers/md/raid1.c  |    3 ++-
>  drivers/md/raid10.c |    9 ++++++---
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 

