* [PATCH 0/1] Recent breakage of DM RAID
@ 2013-11-24 23:30 Jonathan Brassow
2013-11-24 23:30 ` [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes Jonathan Brassow
From: Jonathan Brassow @ 2013-11-24 23:30 UTC (permalink / raw)
To: jbrassow, neilb, linux-raid, dm-devel
There have been two commits that have broken DM RAID:
1) 773ca82 (v3.12-rc1) raid5: make release_stripe lockless
2) 566c09c raid5: relieve lock contention in get_active_stripe()
I've provided a patch for the first, but I don't have a patch ready for
the second yet. Since 3.12 is already out, I figured I'd better send the
patch for the first problem now.
Simply creating a RAID 4/5/6 device via device-mapper (or LVM) is sufficient
to cause the hang to manifest.
Jonathan Brassow (1):
MD/DM RAID: Fix hang due to recent RAID5 locking changes
drivers/md/dm-raid.c | 1 -
drivers/md/md.c | 5 ++++-
2 files changed, 4 insertions(+), 2 deletions(-)
--
1.7.7.6
* [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-24 23:30 [PATCH 0/1] Recent breakage of DM RAID Jonathan Brassow
@ 2013-11-24 23:30 ` Jonathan Brassow
2013-11-25 0:03 ` NeilBrown
From: Jonathan Brassow @ 2013-11-24 23:30 UTC (permalink / raw)
To: jbrassow, neilb, linux-raid, dm-devel
When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
that were created via device-mapper (dm-raid.c) to hang on creation.
This is not necessarily the fault of that commit, but perhaps of the way
dm-raid.c was setting up and activating devices.
Device-mapper allows I/O and memory allocations in the constructor
(i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
until a 'resume' is issued (i.e. raid_resume()). It has been problematic
(at least in the past) to call mddev_resume before mddev_suspend was
called, but this is how DM behaves - CTR then resume. To solve the
problem, raid_ctr() was setting up the structures, calling md_run(), and
then also calling mddev_suspend(). The stage was then set for raid_resume()
to call mddev_resume().
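In outline, the pairing looked roughly like this (a paraphrased sketch of the
pre-patch dm-raid.c flow, not a verbatim excerpt):

/* Paraphrased sketch of the pre-patch flow - not verbatim kernel code. */
static int raid_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
	struct raid_set *rs;
	int ret;

	/* ... parse parameters, allocate rs, set up rs->md ... */
	ret = md_run(&rs->md);		/* the array is briefly live here */
	if (ret)
		return ret;
	/* ... register callbacks ... */
	mddev_suspend(&rs->md);		/* park it until DM issues a resume */
	return 0;
}

static void raid_resume(struct dm_target *ti)
{
	struct raid_set *rs = ti->private;

	mddev_resume(&rs->md);		/* DM's 'resume' maps onto mddev_resume() */
}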
Commit 773ca82 caused a change in behavior during raid5.c:run().
'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
stripe cache and increments 'active_stripes'.
'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
it waits for 'active_stripes' to reduce to 0 - which never happens.
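The wait that never completes is, in simplified form (paraphrasing the raid5
quiesce path, not quoting it exactly):

/* Simplified sketch - not verbatim kernel code.  mddev_suspend() ends up in
 * the raid5 personality's quiesce hook, which waits for the stripe cache to
 * drain before returning. */
static void raid5_quiesce_sketch(struct r5conf *conf)
{
	/* grow_one_stripe() raised 'active_stripes', but the lockless
	 * release_stripe() no longer drops it back to 0, so this condition
	 * can never become true and raid_ctr() hangs here. */
	wait_event(conf->wait_for_stripe,
		   atomic_read(&conf->active_stripes) == 0);
}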
You could argue that the MD personalities should be able to handle either
a suspend or a resume after 'md_run' is called, but they can't really handle
either. To fix this, I've removed the call to mddev_suspend in raid_ctr and
I've made the call to the personality's 'quiesce' function within
mddev_resume dependent on whether the device is currently suspended.
This patch is suitable and recommended for 3.12.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
---
drivers/md/dm-raid.c | 1 -
drivers/md/md.c | 5 ++++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 4880b69..cdad87c 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1249,7 +1249,6 @@ static int raid_ctr(struct dm_target *ti, unsigned argc, char **argv)
rs->callbacks.congested_fn = raid_is_congested;
dm_table_add_target_callbacks(ti->table, &rs->callbacks);
- mddev_suspend(&rs->md);
return 0;
size_mismatch:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 561a65f..383980d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -359,9 +359,12 @@ EXPORT_SYMBOL_GPL(mddev_suspend);
void mddev_resume(struct mddev *mddev)
{
+ int should_quiesce = mddev->suspended;
+
mddev->suspended = 0;
wake_up(&mddev->sb_wait);
- mddev->pers->quiesce(mddev, 0);
+ if (should_quiesce)
+ mddev->pers->quiesce(mddev, 0);
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
md_wakeup_thread(mddev->thread);
--
1.7.7.6
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-24 23:30 ` [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes Jonathan Brassow
@ 2013-11-25 0:03 ` NeilBrown
2013-11-25 14:20 ` Brassow Jonathan
From: NeilBrown @ 2013-11-25 0:03 UTC (permalink / raw)
To: Jonathan Brassow; +Cc: linux-raid, dm-devel
On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
wrote:
> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
> that were created via device-mapper (dm-raid.c) to hang on creation.
> This is not necessarily the fault of that commit, but perhaps the way
> dm-raid.c was setting-up and activating devices.
>
> Device-mapper allows I/O and memory allocations in the constructor
> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
> (at least in the past) to call mddev_resume before mddev_suspend was
> called, but this is how DM behaves - CTR then resume. To solve the
> problem, raid_ctr() was setting up the structures, calling md_run(), and
> then also calling mddev_suspend(). The stage was then set for raid_resume()
> to call mddev_resume().
>
> Commit 773ca82 caused a change in behavior during raid5.c:run().
> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
> stripe cache and increments 'active_stripes'.
> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
> it waits for 'active_stripes' to reduce to 0 - which never happens.
Hi Jon,
this sounds like the same bug that is fixed by
commit ad4068de49862b083ac2a15bc50689bb30ce3e44
Author: majianpeng <majianpeng@gmail.com>
Date: Thu Nov 14 15:16:15 2013 +1100
raid5: Use slow_path to release stripe when mddev->thread is null
which is already en-route to 3.12.x. Could you check if it fixes the bug for
you?
Thanks,
NeilBrown
>
> You could argue that the MD personalities should be able to handle either
> a suspend or a resume after 'md_run' is called, but it can't really handle
> either. To fix this, I've removed the call to mddev_suspend in raid_ctr and
> I've made the call to the personality's 'quiesce' function within
> mddev_resume dependent on whether the device is currently suspended.
>
> This patch is suitable and recommended for 3.12.
>
> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
> ---
> drivers/md/dm-raid.c | 1 -
> drivers/md/md.c | 5 ++++-
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 4880b69..cdad87c 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -1249,7 +1249,6 @@ static int raid_ctr(struct dm_target *ti, unsigned argc, char **argv)
> rs->callbacks.congested_fn = raid_is_congested;
> dm_table_add_target_callbacks(ti->table, &rs->callbacks);
>
> - mddev_suspend(&rs->md);
> return 0;
>
> size_mismatch:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 561a65f..383980d 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -359,9 +359,12 @@ EXPORT_SYMBOL_GPL(mddev_suspend);
>
> void mddev_resume(struct mddev *mddev)
> {
> + int should_quiesce = mddev->suspended;
> +
> mddev->suspended = 0;
> wake_up(&mddev->sb_wait);
> - mddev->pers->quiesce(mddev, 0);
> + if (should_quiesce)
> + mddev->pers->quiesce(mddev, 0);
>
> set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> md_wakeup_thread(mddev->thread);
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-25 0:03 ` NeilBrown
@ 2013-11-25 14:20 ` Brassow Jonathan
2013-11-25 19:08 ` Brassow Jonathan
From: Brassow Jonathan @ 2013-11-25 14:20 UTC (permalink / raw)
To: NeilBrown
Cc: linux-raid@vger.kernel.org Raid, device-mapper development,
Jonathan Brassow
On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
> wrote:
>
>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
>> that were created via device-mapper (dm-raid.c) to hang on creation.
>> This is not necessarily the fault of that commit, but perhaps the way
>> dm-raid.c was setting-up and activating devices.
>>
>> Device-mapper allows I/O and memory allocations in the constructor
>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
>> (at least in the past) to call mddev_resume before mddev_suspend was
>> called, but this is how DM behaves - CTR then resume. To solve the
>> problem, raid_ctr() was setting up the structures, calling md_run(), and
>> then also calling mddev_suspend(). The stage was then set for raid_resume()
>> to call mddev_resume().
>>
>> Commit 773ca82 caused a change in behavior during raid5.c:run().
>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
>> stripe cache and increments 'active_stripes'.
>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
>> it waits for 'active_stripes' to reduce to 0 - which never happens.
>
> Hi Jon,
> this sounds like the same bug that is fixed by
>
> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
> Author: majianpeng <majianpeng@gmail.com>
> Date: Thu Nov 14 15:16:15 2013 +1100
>
> raid5: Use slow_path to release stripe when mddev->thread is null
>
> which is already en-route to 3.12.x. Could you check if it fixes the bug for
> you?
Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-25 14:20 ` Brassow Jonathan
@ 2013-11-25 19:08 ` Brassow Jonathan
2013-11-26 5:27 ` NeilBrown
From: Brassow Jonathan @ 2013-11-25 19:08 UTC (permalink / raw)
To: Brassow Jonathan
Cc: NeilBrown, linux-raid@vger.kernel.org Raid,
device-mapper development
On Nov 25, 2013, at 8:20 AM, Brassow Jonathan wrote:
>
> On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
>
>> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
>> wrote:
>>
>>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
>>> that were created via device-mapper (dm-raid.c) to hang on creation.
>>> This is not necessarily the fault of that commit, but perhaps the way
>>> dm-raid.c was setting-up and activating devices.
>>>
>>> Device-mapper allows I/O and memory allocations in the constructor
>>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
>>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
>>> (at least in the past) to call mddev_resume before mddev_suspend was
>>> called, but this is how DM behaves - CTR then resume. To solve the
>>> problem, raid_ctr() was setting up the structures, calling md_run(), and
>>> then also calling mddev_suspend(). The stage was then set for raid_resume()
>>> to call mddev_resume().
>>>
>>> Commit 773ca82 caused a change in behavior during raid5.c:run().
>>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
>>> stripe cache and increments 'active_stripes'.
>>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
>>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
>>> it waits for 'active_stripes' to reduce to 0 - which never happens.
>>
>> Hi Jon,
>> this sounds like the same bug that is fixed by
>>
>> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
>> Author: majianpeng <majianpeng@gmail.com>
>> Date: Thu Nov 14 15:16:15 2013 +1100
>>
>> raid5: Use slow_path to release stripe when mddev->thread is null
>>
>> which is already en-route to 3.12.x. Could you check if it fixes the bug for
>> you?
>
> Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.
Yes, this patch fixes the issue in 3.12-rc1+.
However, there is still a problem I'm searching for that was introduced in commit 566c09c (at least that's what I get when bisecting).
The problem only shows up when I have taken a snapshot of a RAID5 device and only if I have cycled the device before adding the snapshot:
1> lvcreate --type raid5 -i 3 -L 20M -n lv vg
2> lvchange -an vg/lv
3> lvchange -ay vg/lv
4> lvcreate -s vg/lv -L 50M -n snap
5> lvchange -an vg/lv
6> lvchange -ay vg/lv -- BUG: line 292 of raid5.c
The current bug triggers on the 'BUG_ON(atomic_read(&conf->active_stripes)==0)' in do_release_stripe(). I'm not sure why yet.
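For context, the top of do_release_stripe() looks roughly like this
(paraphrased, not an exact quote of the tree I'm running):

static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
			      struct list_head *temp_inactive_list)
{
	BUG_ON(!list_empty(&sh->lru));
	BUG_ON(atomic_read(&conf->active_stripes) == 0);	/* <-- the BUG that fires */
	/* ... move the stripe to the appropriate list ... */
}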
brassow
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-25 19:08 ` Brassow Jonathan
@ 2013-11-26 5:27 ` NeilBrown
2013-11-26 14:32 ` Brassow Jonathan
From: NeilBrown @ 2013-11-26 5:27 UTC (permalink / raw)
To: Brassow Jonathan
Cc: linux-raid@vger.kernel.org Raid, device-mapper development
On Mon, 25 Nov 2013 13:08:56 -0600 Brassow Jonathan <jbrassow@redhat.com>
wrote:
>
> On Nov 25, 2013, at 8:20 AM, Brassow Jonathan wrote:
>
> >
> > On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
> >
> >> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
> >> wrote:
> >>
> >>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
> >>> that were created via device-mapper (dm-raid.c) to hang on creation.
> >>> This is not necessarily the fault of that commit, but perhaps the way
> >>> dm-raid.c was setting-up and activating devices.
> >>>
> >>> Device-mapper allows I/O and memory allocations in the constructor
> >>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
> >>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
> >>> (at least in the past) to call mddev_resume before mddev_suspend was
> >>> called, but this is how DM behaves - CTR then resume. To solve the
> >>> problem, raid_ctr() was setting up the structures, calling md_run(), and
> >>> then also calling mddev_suspend(). The stage was then set for raid_resume()
> >>> to call mddev_resume().
> >>>
> >>> Commit 773ca82 caused a change in behavior during raid5.c:run().
> >>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
> >>> stripe cache and increments 'active_stripes'.
> >>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
> >>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
> >>> it waits for 'active_stripes' to reduce to 0 - which never happens.
> >>
> >> Hi Jon,
> >> this sounds like the same bug that is fixed by
> >>
> >> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
> >> Author: majianpeng <majianpeng@gmail.com>
> >> Date: Thu Nov 14 15:16:15 2013 +1100
> >>
> >> raid5: Use slow_path to release stripe when mddev->thread is null
> >>
> >> which is already en-route to 3.12.x. Could you check if it fixes the bug for
> >> you?
> >
> > Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.
>
> Yes, this patch fixes the issue in 3.12-rc1+.
>
> However, there is still a problem I'm searching for that was introduced in commit 566c09c (at least that's what I get when bisecting).
>
> The problem only shows up when I have taken a snapshot of a RAID5 device and only if I have cycled the device before adding the snapshot:
> 1> lvcreate --type raid5 -i 3 -L 20M -n lv vg
> 2> lvchange -an vg/lv
> 3> lvchange -ay vg/lv
> 4> lvcreate -s vg/lv -L 50M -n snap
> 5> lvchange -an vg/lv
> 6> lvchange -ay vg/lv -- BUG: line 292 of raid5.c
>
> The current bug triggers on the 'BUG_ON(atomic_read(&conf->active_stripes)==0)' in do_release_stripe(). I'm not sure why yet.
>
> brassow
I've had a look and I must say I'm not sure either.
I keep wondering if something is wrong with the locking in get_active_stripe.
The region covered by device_lock is now much smaller, with the whole now
covered by hash_locks[hash]. I cannot see a problem with the locking but I
might be missing something. A missing atomic_inc of active_stripes in there
could cause your problem.
As you can easily reproduce, could you try expanding the range covered by
device_lock to be the whole branch where sh is not NULL? If that makes a
difference it would be quite instructive. I don't hold high hopes though.
Thanks,
NeilBrown
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-26 5:27 ` NeilBrown
@ 2013-11-26 14:32 ` Brassow Jonathan
2013-11-26 22:34 ` Brassow Jonathan
From: Brassow Jonathan @ 2013-11-26 14:32 UTC (permalink / raw)
To: NeilBrown
Cc: linux-raid@vger.kernel.org Raid, device-mapper development,
Jonathan Brassow, Shaohua Li
On Nov 25, 2013, at 11:27 PM, NeilBrown wrote:
> On Mon, 25 Nov 2013 13:08:56 -0600 Brassow Jonathan <jbrassow@redhat.com>
> wrote:
>
>>
>> On Nov 25, 2013, at 8:20 AM, Brassow Jonathan wrote:
>>
>>>
>>> On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
>>>
>>>> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
>>>> wrote:
>>>>
>>>>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
>>>>> that were created via device-mapper (dm-raid.c) to hang on creation.
>>>>> This is not necessarily the fault of that commit, but perhaps the way
>>>>> dm-raid.c was setting-up and activating devices.
>>>>>
>>>>> Device-mapper allows I/O and memory allocations in the constructor
>>>>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
>>>>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
>>>>> (at least in the past) to call mddev_resume before mddev_suspend was
>>>>> called, but this is how DM behaves - CTR then resume. To solve the
>>>>> problem, raid_ctr() was setting up the structures, calling md_run(), and
>>>>> then also calling mddev_suspend(). The stage was then set for raid_resume()
>>>>> to call mddev_resume().
>>>>>
>>>>> Commit 773ca82 caused a change in behavior during raid5.c:run().
>>>>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
>>>>> stripe cache and increments 'active_stripes'.
>>>>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
>>>>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
>>>>> it waits for 'active_stripes' to reduce to 0 - which never happens.
>>>>
>>>> Hi Jon,
>>>> this sounds like the same bug that is fixed by
>>>>
>>>> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
>>>> Author: majianpeng <majianpeng@gmail.com>
>>>> Date: Thu Nov 14 15:16:15 2013 +1100
>>>>
>>>> raid5: Use slow_path to release stripe when mddev->thread is null
>>>>
>>>> which is already en-route to 3.12.x. Could you check if it fixes the bug for
>>>> you?
>>>
>>> Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.
>>
>> Yes, this patch fixes the issue in 3.12-rc1+.
>>
>> However, there is still a problem I'm searching for that was introduced in commit 566c09c (at least that's what I get when bisecting).
>>
>> The problem only shows up when I have taken a snapshot of a RAID5 device and only if I have cycled the device before adding the snapshot:
>> 1> lvcreate --type raid5 -i 3 -L 20M -n lv vg
>> 2> lvchange -an vg/lv
>> 3> lvchange -ay vg/lv
>> 4> lvcreate -s vg/lv -L 50M -n snap
>> 5> lvchange -an vg/lv
>> 6> lvchange -ay vg/lv -- BUG: line 292 of raid5.c
>>
>> The current bug triggers on the 'BUG_ON(atomic_read(&conf->active_stripes)==0)' in do_release_stripe(). I'm not sure why yet.
>>
>> brassow
>
> I've had a look and I must say I'm not sure either.
> I keep wondering if something is wrong with the locking in get_active_stripe.
> The region covered by device_lock is not much smaller with the whole now
> covered by hash_locks[hash]. I cannot see a problem with the locking but I
> might be missing something. A missing atomic_inc of active_stripes in there
> could cause your problem.
>
> As you can easily reproduce, could you try expanding the range covered by
> device_lock to be the whole branch where sh is not NULL. If that makes a
> difference it would be quite instructive. I don't hold high hopes though.
Sure, I'll try that.
I've also found that I hit the BUG() on line 693 of raid5.c:get_active_stripe().
thanks,
brassow
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-26 14:32 ` Brassow Jonathan
@ 2013-11-26 22:34 ` Brassow Jonathan
2013-11-27 3:12 ` NeilBrown
From: Brassow Jonathan @ 2013-11-26 22:34 UTC (permalink / raw)
To: Brassow Jonathan
Cc: NeilBrown, linux-raid@vger.kernel.org Raid,
device-mapper development, Shaohua Li
On Nov 26, 2013, at 8:32 AM, Brassow Jonathan wrote:
>
> On Nov 25, 2013, at 11:27 PM, NeilBrown wrote:
>
>> On Mon, 25 Nov 2013 13:08:56 -0600 Brassow Jonathan <jbrassow@redhat.com>
>> wrote:
>>
>>>
>>> On Nov 25, 2013, at 8:20 AM, Brassow Jonathan wrote:
>>>
>>>>
>>>> On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
>>>>
>>>>> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
>>>>> wrote:
>>>>>
>>>>>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
>>>>>> that were created via device-mapper (dm-raid.c) to hang on creation.
>>>>>> This is not necessarily the fault of that commit, but perhaps the way
>>>>>> dm-raid.c was setting-up and activating devices.
>>>>>>
>>>>>> Device-mapper allows I/O and memory allocations in the constructor
>>>>>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
>>>>>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
>>>>>> (at least in the past) to call mddev_resume before mddev_suspend was
>>>>>> called, but this is how DM behaves - CTR then resume. To solve the
>>>>>> problem, raid_ctr() was setting up the structures, calling md_run(), and
>>>>>> then also calling mddev_suspend(). The stage was then set for raid_resume()
>>>>>> to call mddev_resume().
>>>>>>
>>>>>> Commit 773ca82 caused a change in behavior during raid5.c:run().
>>>>>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
>>>>>> stripe cache and increments 'active_stripes'.
>>>>>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
>>>>>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
>>>>>> it waits for 'active_stripes' to reduce to 0 - which never happens.
>>>>>
>>>>> Hi Jon,
>>>>> this sounds like the same bug that is fixed by
>>>>>
>>>>> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
>>>>> Author: majianpeng <majianpeng@gmail.com>
>>>>> Date: Thu Nov 14 15:16:15 2013 +1100
>>>>>
>>>>> raid5: Use slow_path to release stripe when mddev->thread is null
>>>>>
>>>>> which is already en-route to 3.12.x. Could you check if it fixes the bug for
>>>>> you?
>>>>
>>>> Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.
>>>
>>> Yes, this patch fixes the issue in 3.12-rc1+.
>>>
>>> However, there is still a problem I'm searching for that was introduced in commit 566c09c (at least that's what I get when bisecting).
>>>
>>> The problem only shows up when I have taken a snapshot of a RAID5 device and only if I have cycled the device before adding the snapshot:
>>> 1> lvcreate --type raid5 -i 3 -L 20M -n lv vg
>>> 2> lvchange -an vg/lv
>>> 3> lvchange -ay vg/lv
>>> 4> lvcreate -s vg/lv -L 50M -n snap
>>> 5> lvchange -an vg/lv
>>> 6> lvchange -ay vg/lv -- BUG: line 292 of raid5.c
>>>
>>> The current bug triggers on the 'BUG_ON(atomic_read(&conf->active_stripes)==0)' in do_release_stripe(). I'm not sure why yet.
>>>
>>> brassow
>>
>> I've had a look and I must say I'm not sure either.
>> I keep wondering if something is wrong with the locking in get_active_stripe.
>> The region covered by device_lock is not much smaller with the whole now
>> covered by hash_locks[hash]. I cannot see a problem with the locking but I
>> might be missing something. A missing atomic_inc of active_stripes in there
>> could cause your problem.
>>
>> As you can easily reproduce, could you try expanding the range covered by
>> device_lock to be the whole branch where sh is not NULL. If that makes a
>> difference it would be quite instructive. I don't hold high hopes though.
>
> Sure, I'll try that.
>
> I've also found that I hit the BUG() on line 693 of raid5.c:get_active_stripe().
Neil,
That fixed it - I no longer see any BUG()s or hangs in any of my tests.
I simply moved the device_lock to cover the whole 'if (atomic_read(&sh->count)) ... else' conditional. It seems that sh->count (and other related bits) is being changed between the time it is checked and the time the lock is acquired.
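Roughly, the change I tested looks like this (a sketch of the idea only; I haven't cleaned up the exact hunk yet):

	} else {
		spin_lock(&conf->device_lock);	/* now taken before the count check */
		if (atomic_read(&sh->count)) {
			/* ... existing BUG_ON() sanity check ... */
		} else {
			if (!test_bit(STRIPE_HANDLE, &sh->state))
				atomic_inc(&conf->active_stripes);
			/* ... remove sh from its lru, clear sh->group ... */
		}
		spin_unlock(&conf->device_lock);
	}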
If I had to guess, I'd say it was possible to have the following condition:
Thread1-1: get_active_stripe(): 'if (atomic_read(&sh->count))' => FALSE
Thread2-1: Acquire 'device_lock'
Thread2-2: Enter release_stripe_list
Thread2-3: Clear STRIPE_ON_RELEASE_LIST while in release_stripe_list
Thread2-4: call __release_stripe->do_release_stripe->atomic_dec_and_test(&sh->count)
Thread2-5: Exit release_stripe_list
Thread2-6: Release 'device_lock'
Thread1-2: Acquire 'device_lock'
Thread1-3: check !STRIPE_ON_RELEASE_LIST --> call BUG();
Right now, get_active_stripe is checking 'count' and then waiting for the lock. The lock in the current case is simply allowing more time to set the condition that we BUG on. Moving the lock ensures that 'count' is checked properly along with STRIPE_ON_RELEASE_LIST after they have been manipulated.
If you think the analysis is correct, I can try to clean-up the language a little bit and submit the patch.
brassow
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-26 22:34 ` Brassow Jonathan
@ 2013-11-27 3:12 ` NeilBrown
2013-11-27 10:02 ` Shaohua Li
2013-11-27 16:00 ` Brassow Jonathan
From: NeilBrown @ 2013-11-27 3:12 UTC (permalink / raw)
To: Brassow Jonathan
Cc: linux-raid@vger.kernel.org Raid, device-mapper development,
Shaohua Li
On Tue, 26 Nov 2013 16:34:58 -0600 Brassow Jonathan <jbrassow@redhat.com>
wrote:
>
> On Nov 26, 2013, at 8:32 AM, Brassow Jonathan wrote:
>
> >
> > On Nov 25, 2013, at 11:27 PM, NeilBrown wrote:
> >
> >> On Mon, 25 Nov 2013 13:08:56 -0600 Brassow Jonathan <jbrassow@redhat.com>
> >> wrote:
> >>
> >>>
> >>> On Nov 25, 2013, at 8:20 AM, Brassow Jonathan wrote:
> >>>
> >>>>
> >>>> On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
> >>>>
> >>>>> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
> >>>>> wrote:
> >>>>>
> >>>>>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
> >>>>>> that were created via device-mapper (dm-raid.c) to hang on creation.
> >>>>>> This is not necessarily the fault of that commit, but perhaps the way
> >>>>>> dm-raid.c was setting-up and activating devices.
> >>>>>>
> >>>>>> Device-mapper allows I/O and memory allocations in the constructor
> >>>>>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
> >>>>>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
> >>>>>> (at least in the past) to call mddev_resume before mddev_suspend was
> >>>>>> called, but this is how DM behaves - CTR then resume. To solve the
> >>>>>> problem, raid_ctr() was setting up the structures, calling md_run(), and
> >>>>>> then also calling mddev_suspend(). The stage was then set for raid_resume()
> >>>>>> to call mddev_resume().
> >>>>>>
> >>>>>> Commit 773ca82 caused a change in behavior during raid5.c:run().
> >>>>>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
> >>>>>> stripe cache and increments 'active_stripes'.
> >>>>>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
> >>>>>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
> >>>>>> it waits for 'active_stripes' to reduce to 0 - which never happens.
> >>>>>
> >>>>> Hi Jon,
> >>>>> this sounds like the same bug that is fixed by
> >>>>>
> >>>>> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
> >>>>> Author: majianpeng <majianpeng@gmail.com>
> >>>>> Date: Thu Nov 14 15:16:15 2013 +1100
> >>>>>
> >>>>> raid5: Use slow_path to release stripe when mddev->thread is null
> >>>>>
> >>>>> which is already en-route to 3.12.x. Could you check if it fixes the bug for
> >>>>> you?
> >>>>
> >>>> Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.
> >>>
> >>> Yes, this patch fixes the issue in 3.12-rc1+.
> >>>
> >>> However, there is still a problem I'm searching for that was introduced in commit 566c09c (at least that's what I get when bisecting).
> >>>
> >>> The problem only shows up when I have taken a snapshot of a RAID5 device and only if I have cycled the device before adding the snapshot:
> >>> 1> lvcreate --type raid5 -i 3 -L 20M -n lv vg
> >>> 2> lvchange -an vg/lv
> >>> 3> lvchange -ay vg/lv
> >>> 4> lvcreate -s vg/lv -L 50M -n snap
> >>> 5> lvchange -an vg/lv
> >>> 6> lvchange -ay vg/lv -- BUG: line 292 of raid5.c
> >>>
> >>> The current bug triggers on the 'BUG_ON(atomic_read(&conf->active_stripes)==0)' in do_release_stripe(). I'm not sure why yet.
> >>>
> >>> brassow
> >>
> >> I've had a look and I must say I'm not sure either.
> >> I keep wondering if something is wrong with the locking in get_active_stripe.
> >> The region covered by device_lock is not much smaller with the whole now
> >> covered by hash_locks[hash]. I cannot see a problem with the locking but I
> >> might be missing something. A missing atomic_inc of active_stripes in there
> >> could cause your problem.
> >>
> >> As you can easily reproduce, could you try expanding the range covered by
> >> device_lock to be the whole branch where sh is not NULL. If that makes a
> >> difference it would be quite instructive. I don't hold high hopes though.
> >
> > Sure, I'll try that.
> >
> > I've also found that I hit the BUG() on line 693 of raid5.c:get_active_stripe().
I have another report of that BUG - from Fengguang Wu. I suspect it and
your other BUG are related.
>
> Neil,
>
> That fixed it - I no longer see any BUG()s or hangs in any of my tests.
Excellent. It means we are on the right track at least.
>
> I simply move the device_lock to cover the whole 'if (atomic_read(&sh->count)) ... else' conditional. Seems that the sh->count (and other related bits) are being changed between the time it is checked and the time the lock is acquired.
>
> If I had to guess, I'd say it was possible to have the following condition:
> Thread1-1: get_active_stripe(): 'if (atomic_read(&sh->count))' => FALSE
> Thread2-1: Acquire 'device_lock'
> Thread2-2: Enter release_stripe_list
> Thread2-3: Clear STRIPE_ON_RELEASE_LIST while in release_stripe_list
> Thread2-4: call __release_stripe->do_release_stripe->atomic_dec_and_test(&sh->count)
> Thread2-5: Exit release_stripe_list
> Thread2-6: Release 'device_lock'
> Thread1-2: Acquire 'device_lock'
> Thread1-3: check !STRIPE_ON_RELEASE_LIST --> call BUG();
>
> Right now, get_active_stripe is checking 'count' and then waiting for the lock. The lock in the current case is simply allowing more time to set the condition that we BUG on. Moving the lock ensures that 'count' is checked properly along with STRIPE_ON_RELEASE_LIST after they have been manipulated.
>
> If you think the analysis is correct, I can try to clean-up the language a little bit and submit the patch.
If a stripe is on the released_stripes list (and has STRIPE_ON_RELEASE_LIST set),
it means that a call to __release_stripe is pending, so the ->count must
be at least 1.
So we would need some other actor to be holding device_lock, incrementing
->count, and then placing the stripe on the released_stripes list. I don't
think that is possible.
I think I see the race now though (don't know why I couldn't see it
yesterday .. but knowing it must be there from your testing must have
helped).
I think the race is with __get_priority_stripe(). That is the only place
other than get_active_stripe() where we can increment ->count from zero.
So:
1-1: get_active_stripe() sees ->count is zero
1-2: tries to get device_lock and blocks
2-1: meanwhile handle_active_stripes is holding ->device_lock and
chooses this stripe from handle list. It deletes it from the
->lru and increments the count - then releases ->device_lock
1-3: get_active_stripe() sees that ->lru is empty, and complains.
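(For reference, the competing path, paraphrased from __get_priority_stripe() - not an exact quote:)

	/* Under ->device_lock: pull a stripe off a handle list and take a
	 * reference, racing with the unlocked ->count check above. */
	sh = list_entry(handle_list->next, typeof(*sh), lru);
	list_del_init(&sh->lru);	/* ->lru is now empty ...          */
	atomic_inc(&sh->count);		/* ... and ->count goes from 0 to 1 */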
Your other BUG is triggered by a different race caused by the same locking
problem.
->active_stripes counts the number of stripes which have a non-zero ->count,
or have STRIPE_HANDLE set. STRIPE_HANDLE can only be set when ->count is
non-zero, so we just need to update this whenever count reaches zero or
leaves zero while STRIPE_HANDLE is not set.
This would previously always happen under device_lock. However
get_active_stripe() now increments ->count from zero under ->hash_locks.
This could race with __release_stripe(), which only needs device_lock.
So
1-1: get_active_stripe notices that ->count is not zero and skips locking
with device_lock
2-1: __release_stripe() decrements ->count to zero and calls
do_release_stripe which, as STRIPE_HANDLE is not set, decrements
->active_stripes
1-2: get_active_stripe proceeds to increment ->count from zero without a
matching increment for ->active_stripes
so now ->active_stripes is 1 too small and your BUG is not far away.
So I think we do need to extend the lock region exactly as in the patch you
tried.
Sad thing is that Shaohua originally had the code like that but I talked him
out of it. :-(
http://www.spinics.net/lists/raid/msg44290.html
It seems a shame to be taking device_lock here rather than just hash_locks.
It might be possible to relax that a bit, but first we would need to measure
if it was worth it.
This is the patch I'm thinking of for now. It also fixes up the two BUGs as
both of them are wrong.
STRIPE_ON_RELEASE_LIST has no effect on the value of ->lru, so it shouldn't
be included. And while STRIPE_EXPANDING justifies an active stripe_head
being on an lru, it is never ever set for an inactive stripe_head (which is
always on an lru) so it shouldn't be there either.
Does that all sound convincing?
Thanks a lot,
NeilBrown
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 47da0af6322b..7384b1ec8783 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -678,26 +678,23 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
} else
init_stripe(sh, sector, previous);
} else {
+ spin_lock(&conf->device_lock);
if (atomic_read(&sh->count)) {
BUG_ON(!list_empty(&sh->lru)
&& !test_bit(STRIPE_EXPANDING, &sh->state)
&& !test_bit(STRIPE_ON_UNPLUG_LIST, &sh->state)
- && !test_bit(STRIPE_ON_RELEASE_LIST, &sh->state));
+ );
} else {
- spin_lock(&conf->device_lock);
if (!test_bit(STRIPE_HANDLE, &sh->state))
atomic_inc(&conf->active_stripes);
- if (list_empty(&sh->lru) &&
- !test_bit(STRIPE_ON_RELEASE_LIST, &sh->state) &&
- !test_bit(STRIPE_EXPANDING, &sh->state))
- BUG();
+ BUG_ON(list_empty(&sh->lru));
list_del_init(&sh->lru);
if (sh->group) {
sh->group->stripes_cnt--;
sh->group = NULL;
}
- spin_unlock(&conf->device_lock);
}
+ spin_unlock(&conf->device_lock);
}
} while (sh == NULL);
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-27 3:12 ` NeilBrown
@ 2013-11-27 10:02 ` Shaohua Li
2013-11-27 16:00 ` Brassow Jonathan
From: Shaohua Li @ 2013-11-27 10:02 UTC (permalink / raw)
To: NeilBrown
Cc: Brassow Jonathan, linux-raid@vger.kernel.org Raid,
device-mapper development
On Wed, Nov 27, 2013 at 02:12:08PM +1100, NeilBrown wrote:
> On Tue, 26 Nov 2013 16:34:58 -0600 Brassow Jonathan <jbrassow@redhat.com>
> wrote:
>
> >
> > On Nov 26, 2013, at 8:32 AM, Brassow Jonathan wrote:
> >
> > >
> > > On Nov 25, 2013, at 11:27 PM, NeilBrown wrote:
> > >
> > >> On Mon, 25 Nov 2013 13:08:56 -0600 Brassow Jonathan <jbrassow@redhat.com>
> > >> wrote:
> > >>
> > >>>
> > >>> On Nov 25, 2013, at 8:20 AM, Brassow Jonathan wrote:
> > >>>
> > >>>>
> > >>>> On Nov 24, 2013, at 6:03 PM, NeilBrown wrote:
> > >>>>
> > >>>>> On Sun, 24 Nov 2013 17:30:43 -0600 Jonathan Brassow <jbrassow@redhat.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> When commit 773ca82 was made in v3.12-rc1, it caused RAID4/5/6 devices
> > >>>>>> that were created via device-mapper (dm-raid.c) to hang on creation.
> > >>>>>> This is not necessarily the fault of that commit, but perhaps the way
> > >>>>>> dm-raid.c was setting-up and activating devices.
> > >>>>>>
> > >>>>>> Device-mapper allows I/O and memory allocations in the constructor
> > >>>>>> (i.e. raid_ctr()), but nominal and recovery I/O should not be allowed
> > >>>>>> until a 'resume' is issued (i.e. raid_resume()). It has been problematic
> > >>>>>> (at least in the past) to call mddev_resume before mddev_suspend was
> > >>>>>> called, but this is how DM behaves - CTR then resume. To solve the
> > >>>>>> problem, raid_ctr() was setting up the structures, calling md_run(), and
> > >>>>>> then also calling mddev_suspend(). The stage was then set for raid_resume()
> > >>>>>> to call mddev_resume().
> > >>>>>>
> > >>>>>> Commit 773ca82 caused a change in behavior during raid5.c:run().
> > >>>>>> 'setup_conf->grow_stripes->grow_one_stripe' is called which creates the
> > >>>>>> stripe cache and increments 'active_stripes'.
> > >>>>>> 'grow_one_stripe->release_stripe' doesn't actually decrement 'active_stripes'
> > >>>>>> anymore. The side effect of this is that when raid_ctr calls mddev_suspend,
> > >>>>>> it waits for 'active_stripes' to reduce to 0 - which never happens.
> > >>>>>
> > >>>>> Hi Jon,
> > >>>>> this sounds like the same bug that is fixed by
> > >>>>>
> > >>>>> commit ad4068de49862b083ac2a15bc50689bb30ce3e44
> > >>>>> Author: majianpeng <majianpeng@gmail.com>
> > >>>>> Date: Thu Nov 14 15:16:15 2013 +1100
> > >>>>>
> > >>>>> raid5: Use slow_path to release stripe when mddev->thread is null
> > >>>>>
> > >>>>> which is already en-route to 3.12.x. Could you check if it fixes the bug for
> > >>>>> you?
> > >>>>
> > >>>> Sure, I'll check. Just reading the subject of the patch, I have high hopes. The slow path decrements 'active_stripes', which was causing the above problem... I'll make sure though.
> > >>>
> > >>> Yes, this patch fixes the issue in 3.12-rc1+.
> > >>>
> > >>> However, there is still a problem I'm searching for that was introduced in commit 566c09c (at least that's what I get when bisecting).
> > >>>
> > >>> The problem only shows up when I have taken a snapshot of a RAID5 device and only if I have cycled the device before adding the snapshot:
> > >>> 1> lvcreate --type raid5 -i 3 -L 20M -n lv vg
> > >>> 2> lvchange -an vg/lv
> > >>> 3> lvchange -ay vg/lv
> > >>> 4> lvcreate -s vg/lv -L 50M -n snap
> > >>> 5> lvchange -an vg/lv
> > >>> 6> lvchange -ay vg/lv -- BUG: line 292 of raid5.c
> > >>>
> > >>> The current bug triggers on the 'BUG_ON(atomic_read(&conf->active_stripes)==0)' in do_release_stripe(). I'm not sure why yet.
> > >>>
> > >>> brassow
> > >>
> > >> I've had a look and I must say I'm not sure either.
> > >> I keep wondering if something is wrong with the locking in get_active_stripe.
> > >> The region covered by device_lock is not much smaller with the whole now
> > >> covered by hash_locks[hash]. I cannot see a problem with the locking but I
> > >> might be missing something. A missing atomic_inc of active_stripes in there
> > >> could cause your problem.
> > >>
> > >> As you can easily reproduce, could you try expanding the range covered by
> > >> device_lock to be the whole branch where sh is not NULL. If that makes a
> > >> difference it would be quite instructive. I don't hold high hopes though.
> > >
> > > Sure, I'll try that.
> > >
> > > I've also found that I hit the BUG() on line 693 of raid5.c:get_active_stripe().
>
> I have another report of that BUG - from Fengguang Wu. I suspect it and
> you're other BUG are related.
>
> >
> > Neil,
> >
> > That fixed it - I no longer see any BUG()s or hangs in any of my tests.
>
> Excellent. It means we are on the right track at least.
>
> >
> > I simply move the device_lock to cover the whole 'if (atomic_read(&sh->count)) ... else' conditional. Seems that the sh->count (and other related bits) are being changed between the time it is checked and the time the lock is acquired.
> >
> > If I had to guess, I'd say it was possible to have the following condition:
> > Thread1-1: get_active_stripe(): 'if (atomic_read(&sh->count))' => FALSE
> > Thread2-1: Acquire 'device_lock'
> > Thread2-2: Enter release_stripe_list
> > Thread2-3: Clear STRIPE_ON_RELEASE_LIST while in release_stripe_list
> > Thread2-4: call __release_stripe->do_release_stripe->atomic_dec_and_test(&sh->count)
> > Thread2-5: Exit release_stripe_list
> > Thread2-6: Release 'device_lock'
> > Thread1-2: Acquire 'device_lock'
> > Thread1-3: check !STRIPE_ON_RELEASE_LIST --> call BUG();
> >
> > Right now, get_active_stripe is checking 'count' and then waiting for the lock. The lock in the current case is simply allowing more time to set the condition that we BUG on. Moving the lock ensures that 'count' is checked properly along with STRIPE_ON_RELEASE_LIST after they have been manipulated.
> >
> > If you think the analysis is correct, I can try to clean-up the language a little bit and submit the patch.
>
> If a stripe is on the released_stripes list (and has STRIPE_ON_RELEASE_LIST)
> set, it means that a call to __release_stripe is pending, so the ->count must
> be at least 1.
> So we would need some other actor to be holding device_lock, incrementing
> ->count, and then placing the stripe on the released_stripes list. I don't
> think that is possible.
>
> I think I see the race now though (don't know why I couldn't see it
> yesterday .. but knowing it must be there from your testing must have
> helped).
>
> I think the race is with __get_priority_stripe(). That is the only place
> other than get_active_stripe() where we can increment ->count from zero.
> So:
> 1-1: get_active_stripe() sees ->count is zero
> 1-2: tried to get device_lock and blocks
> 2-1: meanwhile handle_active_stripes is holding ->device_lock and
> chooses this stripe from handle list. It deletes it from the
> ->lru and increments the count - then releases ->device_lock
> 1-3: get_active_stripe() sees that ->lru is empty, and complains.
>
> You other BUG is triggered by a difference race caused by the same locking
> problem.
>
> ->active_stripes counts the number of stripes which have a non-zero ->count,
> or have STRIPE_HANDLE set. STRIPE_HANDLE can only be set when ->count is
> non-zero, so we just need to update this whenever count reaches zero or
> leaves zero while STRIPE_HANDLE is set.
> This would previously always happen under device_lock. However
> get_active_stripe() now increments ->count from zero under ->hash_locks.
> This could race with __release_stripe() which only needs device_lock
> So
> 1-1: get_active_stripe notices that ->count is not zero and skips locking
> with device_lock
> 2-1: __release_stripe() decrements ->count to zero and calls
> do_release_stripe which, as STRIPE_HANDLE is not set, decrements
> ->active_stripes
> 1-1: get_active_stripe proceeds to increment ->count from zero without a
> matching increment for ->active_stripes
>
> so now ->active_stripes is 1 too small and your BUG is not far away.
>
>
> So I think we do need to extend the lock region exactly as in the patch you
> tried.
> Sad thing is that Shaohua originally had the code like that but I talked him
> out of it. :-(
> http://www.spinics.net/lists/raid/msg44290.html
>
> It seems a shame to be taking device_lock here rather than just hash_locks.
> It might be possible to relax that a bit, but first we would need to measure
> if it was worth it.
Good catch! Relaxing the device_lock-protected region like this is hardly a big
win, I bet.
Reviewed-by: Shaohua Li <shli@kernel.org>
* Re: [PATCH 1/1] MD/DM RAID: Fix hang due to recent RAID5 locking changes
2013-11-27 3:12 ` NeilBrown
2013-11-27 10:02 ` Shaohua Li
@ 2013-11-27 16:00 ` Brassow Jonathan
From: Brassow Jonathan @ 2013-11-27 16:00 UTC (permalink / raw)
To: NeilBrown
Cc: linux-raid@vger.kernel.org Raid, device-mapper development,
Shaohua Li
On Nov 26, 2013, at 9:12 PM, NeilBrown wrote:
>
> If a stripe is on the released_stripes list (and has STRIPE_ON_RELEASE_LIST)
> set, it means that a call to __release_stripe is pending, so the ->count must
> be at least 1.
> So we would need some other actor to be holding device_lock, incrementing
> ->count, and then placing the stripe on the released_stripes list. I don't
> think that is possible.
>
> I think I see the race now though (don't know why I couldn't see it
> yesterday .. but knowing it must be there from your testing must have
> helped).
>
> I think the race is with __get_priority_stripe(). That is the only place
> other than get_active_stripe() where we can increment ->count from zero.
> So:
> 1-1: get_active_stripe() sees ->count is zero
> 1-2: tried to get device_lock and blocks
> 2-1: meanwhile handle_active_stripes is holding ->device_lock and
> chooses this stripe from handle list. It deletes it from the
> ->lru and increments the count - then releases ->device_lock
> 1-3: get_active_stripe() sees that ->lru is empty, and complains.
>
> You other BUG is triggered by a difference race caused by the same locking
> problem.
>
> ->active_stripes counts the number of stripes which have a non-zero ->count,
> or have STRIPE_HANDLE set. STRIPE_HANDLE can only be set when ->count is
> non-zero, so we just need to update this whenever count reaches zero or
> leaves zero while STRIPE_HANDLE is set.
> This would previously always happen under device_lock. However
> get_active_stripe() now increments ->count from zero under ->hash_locks.
> This could race with __release_stripe() which only needs device_lock
> So
> 1-1: get_active_stripe notices that ->count is not zero and skips locking
> with device_lock
> 2-1: __release_stripe() decrements ->count to zero and calls
> do_release_stripe which, as STRIPE_HANDLE is not set, decrements
> ->active_stripes
> 1-1: get_active_stripe proceeds to increment ->count from zero without a
> matching increment for ->active_stripes
>
> so now ->active_stripes is 1 too small and your BUG is not far away.
>
>
> So I think we do need to extend the lock region exactly as in the patch you
> tried.
> Sad thing is that Shaohua originally had the code like that but I talked him
> out of it. :-(
> http://www.spinics.net/lists/raid/msg44290.html
>
> It seems a shame to be taking device_lock here rather than just hash_locks.
> It might be possible to relax that a bit, but first we would need to measure
> if it was worth it.
>
> This is the patch I'm thinking of for now. It also fixes up the two BUGs as
> both of them are wrong.
> STRIPE_ON_RELEASE_LIST has no effect on the value of ->lru, so it shouldn't
> be included. And while STRIPE_EXPANDING justifies an active stripe_head
> being on an lru, it is never ever set for an inactive stripe_head (which is
> always on an lru) so it shouldn't be there either.
>
> Does that all sound convincing?
Yes. The proof is in the testing for me though, and things look good there.
brassow