Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-12  0:59 UTC (permalink / raw)
  To: Jinpu Wang, linux-raid, Shaohua Li; +Cc: Nate Dailey
In-Reply-To: <CAMGffE=1kp7gMhG+dFxTvhD6VkRuSfbuXbf9vYSWwGG+=vvhDA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 184 bytes --]

On Sat, Nov 26 2016, Jinpu Wang wrote:
> [  810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0

What is this?  I cannot find that function in the upstream kernel.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: What is the solution for USB HDs
From: Adam Goryachev @ 2016-12-11 22:14 UTC (permalink / raw)
  To: juca, linux-raid
In-Reply-To: <13e51316a11df7b0c9f5ab89392361de@juan-carlos.info>

On 12/12/16 04:14, juca@juan-carlos.info wrote:
> Dear Linux-RAID Developers,
>
> I always wanted to have a small home server with a RAID-array to store 
> pesonal and work files.
>
> So I decided to create a fileserver with a raspberry pi and a RAID-1 
> Array. For this I bought two USB 1TB HDs.
>
> After a while I noticed that Mdadm was marking my drives as failed or 
> even worse as faulty. This was strange since both HDs are new.
>
> So I tested them. I deleted the RAID partitions completely with 
> WIPEFS. Then I created Ext4 partitions and run a benchmark of the 
> partitions in ubuntu. Everything is fine.
>
> I created then again my RAID array. To my surprise one drive was again 
> faulty. I googled a little  and I found that my USB drives went to 
> sleep mode and when they were woken up, one of them changed name 
> /dev/sdb/ -> /dev/sdc. Mdadm was unable to find the "new" drive and 
> marked it as faulty. Re-add didn't work. Adding it again went into an 
> unnecessary and long sync of 1TB.
>
> I googled the problem and learnt about UDEV. It turns out that Linux 
> gracefully creates symlinks to the disks-partitions under 
> /dev/disks/by-id/... I thought that I could use this symlinks to fix 
> the problem with Mdadm, but mdadm just followed the links to find 
> again the changing /dev/sd* names.
>
> I continued my journey and found more about the UDEV rules. I created 
> a couple of them, using the serial number of the disks. I turns out 
> that changing the name is not anymore allowed. (NAME="sdb" is ignored) 
> There is even a warning about this. See here: 
> http://unix.stackexchange.com/questions/119593/is-there-a-way-to-change-device-names-in-dev-directory
> "NAME="pendrak" ignored, kernel device nodes can not be renamed; 
> please fix it in /etc/udev/rules.d/99-local.rules:1"
>
> My last desperate idea was to create a cron job that reads directly 
> from the disks, to keep them awake and avoid a name change. Something 
> like this:
>
> dd if=/dev/disk/by-id/usb-TOSHIBA_External_USB_3.0_20151214504D3-0:0 
> bs=1024 count=$(($[1<<10]*10)) skip=$(($SKIP)) ibs=1024 of=/dev/null 
> iflag=direct status=none 2>&1
>
> But even with that workaround the name of a disk just changed and 
> Mdadm marked it as faulty. --- I'm out of ideas.
>
> So I'm stuck with this wonderful situation:
>
> - MDADM will follow symlinks and ignore the /dev/disk/by-id/... 
> identification.
> - UDEV does not allow to change the name of a device.
> - The names will apparently change no matter what I do.
>
> I'm starting to regret my decision to create this array. I expected 
> some challenges but it is turning to be ridiculously difficult. Almost 
> unusable.
>
> Maybe you have a solution for this.
>
> In any case it would be great if you consider some long-term 
> out-of-the-box solution for this new reality of USB HDs. My ideas 
> would be:
>
> -- Allow to create arrays with symlinks 
> ("/dev/disk/by-id/usb-TOSHIBA_External_USB_3.0_20151214504D3-0:0")
>
> -- Include a keep-alive functionality that avoids names to change 
> randomly.
>
> -- Store HD ids to recognize disks.
>

Hi,
I think you are going down the wrong path. It isn't for Linux MD to 
magically "keep drives awake" or to detect which drive is the one you 
want to use.

Also, you haven't provided enough information/detail on *why* the drives 
are vanishing. For a drive to enter sleep mode, and then wake up again, 
it shouldn't change device name. The only time it will change device 
name (in my experience) is if you unplug it, and then re-plug it.
Given you are using a Raspberry Pi, and I've quite some experience with 
them, my initial question would be to confirm that you are providing 
sufficient power to the RPi, and to the 2 x USB HDD's, if there is not 
enough power, then one drive could easily "die" and then wake up again 
when the other drive reduces its power demand. One of the best ways to 
solve this is to:
a) Use desktop drives (ie, drives that have their own power supply, and 
don't just draw power from USB
b) Use a powered USB hub (ie, a USB hub that has it's own power supply)
c) Use USB flash drives, or SSD's, I suspect both will have a lower 
power demand

If you are still having problems after that, then please try to post 
more details on what happens when the drives vanish (eg, kernel logs, 
system logs, etc).

Also, you might consider an alternative SBC, some have SATA ports 
already available (RPi are the only SBC's I've used, but there are many 
other options/variations).

Regards,
Adam



-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply

* What is the solution for USB HDs
From: juca @ 2016-12-11 17:14 UTC (permalink / raw)
  To: linux-raid

Dear Linux-RAID Developers,

I always wanted to have a small home server with a RAID-array to store 
pesonal and work files.

So I decided to create a fileserver with a raspberry pi and a RAID-1 
Array. For this I bought two USB 1TB HDs.

After a while I noticed that Mdadm was marking my drives as failed or 
even worse as faulty. This was strange since both HDs are new.

So I tested them. I deleted the RAID partitions completely with WIPEFS. 
Then I created Ext4 partitions and run a benchmark of the partitions in 
ubuntu. Everything is fine.

I created then again my RAID array. To my surprise one drive was again 
faulty. I googled a little  and I found that my USB drives went to sleep 
mode and when they were woken up, one of them changed name /dev/sdb/ -> 
/dev/sdc. Mdadm was unable to find the "new" drive and marked it as 
faulty. Re-add didn't work. Adding it again went into an unnecessary and 
long sync of 1TB.

I googled the problem and learnt about UDEV. It turns out that Linux 
gracefully creates symlinks to the disks-partitions under 
/dev/disks/by-id/... I thought that I could use this symlinks to fix the 
problem with Mdadm, but mdadm just followed the links to find again the 
changing /dev/sd* names.

I continued my journey and found more about the UDEV rules. I created a 
couple of them, using the serial number of the disks. I turns out that 
changing the name is not anymore allowed. (NAME="sdb" is ignored) There 
is even a warning about this. See here: 
http://unix.stackexchange.com/questions/119593/is-there-a-way-to-change-device-names-in-dev-directory
"NAME="pendrak" ignored, kernel device nodes can not be renamed; please 
fix it in /etc/udev/rules.d/99-local.rules:1"

My last desperate idea was to create a cron job that reads directly from 
the disks, to keep them awake and avoid a name change. Something like 
this:

dd if=/dev/disk/by-id/usb-TOSHIBA_External_USB_3.0_20151214504D3-0:0 
bs=1024 count=$(($[1<<10]*10)) skip=$(($SKIP)) ibs=1024 of=/dev/null 
iflag=direct status=none 2>&1

But even with that workaround the name of a disk just changed and Mdadm 
marked it as faulty. --- I'm out of ideas.

So I'm stuck with this wonderful situation:

- MDADM will follow symlinks and ignore the /dev/disk/by-id/... 
identification.
- UDEV does not allow to change the name of a device.
- The names will apparently change no matter what I do.

I'm starting to regret my decision to create this array. I expected some 
challenges but it is turning to be ridiculously difficult. Almost 
unusable.

Maybe you have a solution for this.

In any case it would be great if you consider some long-term 
out-of-the-box solution for this new reality of USB HDs. My ideas would 
be:

-- Allow to create arrays with symlinks 
("/dev/disk/by-id/usb-TOSHIBA_External_USB_3.0_20151214504D3-0:0")

-- Include a keep-alive functionality that avoids names to change 
randomly.

-- Store HD ids to recognize disks.

Thank you.

Best regards,

Juan Carlos Carvajal Bermúdez
+43 650 477 0005
juca@juan-carlos.info
Beingasse 17/2/3
AT 1150, Wien
www.juan-carlos.info

^ permalink raw reply

* Re: md: Combine two kmalloc() calls into one in sb_equal()
From: Bernd Schubert @ 2016-12-09 22:04 UTC (permalink / raw)
  To: SF Markus Elfring; +Cc: linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <b992386f-bf20-8483-d7dd-ee4c29625ecb@users.sourceforge.net>



On 09.12.2016 22:58, SF Markus Elfring wrote:
>> Irrelevant, the variable is not used before checking it.
>
> * Will it be more appropriate to attempt another memory allocation only if
>   the previous one succeeded already?
>
> * Can it be a bit more efficient to duplicate only the required data
>   in a single function call before?

How many memory allocations do you expect to fail?

^ permalink raw reply

* Re: md: Combine two kmalloc() calls into one in sb_equal()
From: SF Markus Elfring @ 2016-12-09 21:58 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <a33e87c6-9b47-96e2-a877-b86bc8949eb6@fastmail.fm>

> Irrelevant, the variable is not used before checking it.

* Will it be more appropriate to attempt another memory allocation only if
  the previous one succeeded already?

* Can it be a bit more efficient to duplicate only the required data
  in a single function call before?

Regards,
Markus

^ permalink raw reply

* Re: [PATCH] md: Combine two kmalloc() calls into one in sb_equal()
From: Joe Perches @ 2016-12-09 21:57 UTC (permalink / raw)
  To: Al Viro; +Cc: SF Markus Elfring, linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <20161209213030.GC1555@ZenIV.linux.org.uk>

On Fri, 2016-12-09 at 21:30 +0000, Al Viro wrote:
> On Fri, Dec 09, 2016 at 11:05:14AM -0800, Joe Perches wrote:
> > On Fri, 2016-12-09 at 19:30 +0100, SF Markus Elfring wrote:
> > > From: Markus Elfring <elfring@users.sourceforge.net>
> > > Date: Fri, 9 Dec 2016 19:09:13 +0100
> > > 
> > > The function "kmalloc" was called in one case by the function "sb_equal"
> > > without checking immediately if it failed.
> > > This issue was detected by using the Coccinelle software.
> > > 
> > > Perform the desired memory allocation (and release at the end)
> > > by a single function call instead.
> > > 
> > > Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")
> > 
> > Making a change does not mean fixes.
> > 
> > There's nothing particularly _wrong_ with the code as-is.
> > 
> > 2 kmemdup calls might make the code more obvious.
> > 
> > There's a small optimization possible in that only the
> > first MB_SB_GENERIC_CONSTANT_WORDS of the struct are
> > actually compared.  Alloc and copy of both entire structs
> > is inefficient and unnecessary.
> > 
> > Perhaps something like the below would be marginally
> > better/faster, but the whole thing is dubious.
> > 
> > static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
> > {
> > 	int ret;
> > 	void *tmp1, *tmp2;
> > 
> > 	tmp1 = kmemdup(sb1, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);
> > 	tmp2 = kmemdup(sb2, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);
> > 
> > 	if (!tmp1 || !tmp2) {
> > 		ret = 0;
> > 		goto out;
> > 	}
> > 
> > 	/*
> > 	 * nr_disks is not constant
> > 	 */
> > 	((mdp_super_t *)tmp1)->nr_disks = 0;
> > 	((mdp_super_t *)tmp2)->nr_disks = 0;
> > 
> > 	ret = memcmp(tmp1, tmp2, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32)) == 0;
> > 
> > out:
> > 	kfree(tmp1);
> > 	kfree(tmp2);
> > 	return ret;
> > }
> 
> May I politely inquire if either of you has actually bothered to read the
> code and figure out what it does?  This is grotesque...
> 
> For really slow: we have two objects.  We want to check if anything in the
> 128-byte chunks in their beginnings other than one 32bit field happens to be
> different.  For that we
> 	* allocate two 128-byte pieces of memory
> 	* *copy* our objects into those
> 	* forcibly zero the field in question in both of those copies
> 	* compare the fuckers
> 	* free them
> 
> And you two are discussing whether it's better to combine allocations of those
> copies into a single 256-byte allocation?  Really?

No.  May I suggest you read my suggestion?
At no point did I suggest a single allocation.

I think the single allocation is silly and just
makes the code harder to read.

>   _IF_ it is a hot path,
> the obvious optimization would be to avoid copying that crap in the first
> place - simply by
> 	return memcmp(sb1, sb2, offsetof(mdp_super_t, nr_disks)) ||
> 	       memcmp(&sb1->nr_disks + 1, &sb2->nr_disks + 1,
> 			MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32) -
> 			offsetof(mdp_super_t, nr_disks) - 4);

That's all true, but Markus has enough trouble reading simple
code without trying to explain to him what offsetof does.

btw:  the "- 4" should be " - sizeof(__u32)" just for consistency
with the line above it.

> If it is _not_ a hot path, why bother with it at all?

exactly.

^ permalink raw reply

* Re: [PATCH] md: Combine two kmalloc() calls into one in sb_equal()
From: Al Viro @ 2016-12-09 21:30 UTC (permalink / raw)
  To: Joe Perches
  Cc: SF Markus Elfring, linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <1481310314.5946.40.camel@perches.com>

On Fri, Dec 09, 2016 at 11:05:14AM -0800, Joe Perches wrote:
> On Fri, 2016-12-09 at 19:30 +0100, SF Markus Elfring wrote:
> > From: Markus Elfring <elfring@users.sourceforge.net>
> > Date: Fri, 9 Dec 2016 19:09:13 +0100
> > 
> > The function "kmalloc" was called in one case by the function "sb_equal"
> > without checking immediately if it failed.
> > This issue was detected by using the Coccinelle software.
> > 
> > Perform the desired memory allocation (and release at the end)
> > by a single function call instead.
> > 
> > Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")
> 
> Making a change does not mean fixes.
> 
> There's nothing particularly _wrong_ with the code as-is.
> 
> 2 kmemdup calls might make the code more obvious.
> 
> There's a small optimization possible in that only the
> first MB_SB_GENERIC_CONSTANT_WORDS of the struct are
> actually compared.  Alloc and copy of both entire structs
> is inefficient and unnecessary.
> 
> Perhaps something like the below would be marginally
> better/faster, but the whole thing is dubious.
> 
> static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
> {
> 	int ret;
> 	void *tmp1, *tmp2;
> 
> 	tmp1 = kmemdup(sb1, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);
> 	tmp2 = kmemdup(sb2, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);
> 
> 	if (!tmp1 || !tmp2) {
> 		ret = 0;
> 		goto out;
> 	}
> 
> 	/*
> 	 * nr_disks is not constant
> 	 */
> 	((mdp_super_t *)tmp1)->nr_disks = 0;
> 	((mdp_super_t *)tmp2)->nr_disks = 0;
> 
> 	ret = memcmp(tmp1, tmp2, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32)) == 0;
> 
> out:
> 	kfree(tmp1);
> 	kfree(tmp2);
> 	return ret;
> }

May I politely inquire if either of you has actually bothered to read the
code and figure out what it does?  This is grotesque...

For really slow: we have two objects.  We want to check if anything in the
128-byte chunks in their beginnings other than one 32bit field happens to be
different.  For that we
	* allocate two 128-byte pieces of memory
	* *copy* our objects into those
	* forcibly zero the field in question in both of those copies
	* compare the fuckers
	* free them

And you two are discussing whether it's better to combine allocations of those
copies into a single 256-byte allocation?  Really?  _IF_ it is a hot path,
the obvious optimization would be to avoid copying that crap in the first
place - simply by
	return memcmp(sb1, sb2, offsetof(mdp_super_t, nr_disks)) ||
	       memcmp(&sb1->nr_disks + 1, &sb2->nr_disks + 1,
			MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32) -
			offsetof(mdp_super_t, nr_disks) - 4);
If it is _not_ a hot path, why bother with it at all?

^ permalink raw reply

* Re: md: Combine two kmalloc() calls into one in sb_equal()
From: Bernd Schubert @ 2016-12-09 21:18 UTC (permalink / raw)
  To: SF Markus Elfring; +Cc: linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <a7693826-b1d1-882a-44ce-5dca9d2fe851@users.sourceforge.net>



On 09.12.2016 20:54, SF Markus Elfring wrote:
>> So where did you get the idea from that it is not checked immediately?
>
> Is another variable assignment performed so far before the return value
> is checked from a previous function call?

Irrelevant, the variable is not used before checking it.

^ permalink raw reply

* Re: md: Combine two kmalloc() calls into one in sb_equal()
From: Joe Perches @ 2016-12-09 20:51 UTC (permalink / raw)
  To: SF Markus Elfring; +Cc: linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <e25d2f8a-8ee9-c541-dac4-a254502df293@users.sourceforge.net>

On Fri, 2016-12-09 at 21:05 +0100, SF Markus Elfring wrote:
> > 	tmp1 = kmemdup(sb1, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);
> 
> Is a function available in the Linux programming interface which would duplicate
> the beginning of two array elements in a single call directly?

No.  If there were, there would be a good argument to remove it.

^ permalink raw reply

* Re: md: Combine two kmalloc() calls into one in sb_equal()
From: SF Markus Elfring @ 2016-12-09 20:05 UTC (permalink / raw)
  To: Joe Perches; +Cc: linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <1481310314.5946.40.camel@perches.com>

> 	tmp1 = kmemdup(sb1, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);

Is a function available in the Linux programming interface which would duplicate
the beginning of two array elements in a single call directly?

Regards,
Markus

^ permalink raw reply

* Re: md: Combine two kmalloc() calls into one in sb_equal()
From: SF Markus Elfring @ 2016-12-09 19:54 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-raid, Shaohua Li, LKML, kernel-janitors
In-Reply-To: <71044708-c5e4-3a14-b162-dfc600311a60@fastmail.fm>

> So where did you get the idea from that it is not checked immediately?

Is another variable assignment performed so far before the return value
is checked from a previous function call?

Regards,
Markus

^ permalink raw reply

* Re: [PATCH] md: Combine two kmalloc() calls into one in sb_equal()
From: Bernd Schubert @ 2016-12-09 19:09 UTC (permalink / raw)
  To: SF Markus Elfring, linux-raid, Shaohua Li; +Cc: LKML, kernel-janitors
In-Reply-To: <f4df733d-8e06-8ba1-c4c3-800d8c5615ba@users.sourceforge.net>



On 09.12.2016 19:30, SF Markus Elfring wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Fri, 9 Dec 2016 19:09:13 +0100
>
> The function "kmalloc" was called in one case by the function "sb_equal"
> without checking immediately if it failed.

Err, your patch actually *replaces* the check. So where did you get the 
idea from that it is not checked immediately?

[...]

> -	tmp1 = kmalloc(sizeof(*tmp1),GFP_KERNEL);
> -	tmp2 = kmalloc(sizeof(*tmp2),GFP_KERNEL);
> -
> -	if (!tmp1 || !tmp2) {
> -		ret = 0;
> -		goto abort;
> -	}

This is not immediately?



Bernd

^ permalink raw reply

* Re: [PATCH] md: Combine two kmalloc() calls into one in sb_equal()
From: Joe Perches @ 2016-12-09 19:05 UTC (permalink / raw)
  To: SF Markus Elfring, linux-raid, Shaohua Li; +Cc: LKML, kernel-janitors
In-Reply-To: <f4df733d-8e06-8ba1-c4c3-800d8c5615ba@users.sourceforge.net>

On Fri, 2016-12-09 at 19:30 +0100, SF Markus Elfring wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Fri, 9 Dec 2016 19:09:13 +0100
> 
> The function "kmalloc" was called in one case by the function "sb_equal"
> without checking immediately if it failed.
> This issue was detected by using the Coccinelle software.
> 
> Perform the desired memory allocation (and release at the end)
> by a single function call instead.
> 
> Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")

Making a change does not mean fixes.

There's nothing particularly _wrong_ with the code as-is.

2 kmemdup calls might make the code more obvious.

There's a small optimization possible in that only the
first MB_SB_GENERIC_CONSTANT_WORDS of the struct are
actually compared.  Alloc and copy of both entire structs
is inefficient and unnecessary.

Perhaps something like the below would be marginally
better/faster, but the whole thing is dubious.

static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
{
	int ret;
	void *tmp1, *tmp2;

	tmp1 = kmemdup(sb1, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);
	tmp2 = kmemdup(sb2, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32), GFP_KERNEL);

	if (!tmp1 || !tmp2) {
		ret = 0;
		goto out;
	}

	/*
	 * nr_disks is not constant
	 */
	((mdp_super_t *)tmp1)->nr_disks = 0;
	((mdp_super_t *)tmp2)->nr_disks = 0;

	ret = memcmp(tmp1, tmp2, MD_SB_GENERIC_CONSTANT_WORDS * sizeof(__u32)) == 0;

out:
	kfree(tmp1);
	kfree(tmp2);
	return ret;
}

> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---
>  drivers/md/md.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b088668269b0..86caf2536255 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -843,15 +843,12 @@ static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
>  	int ret;
>  	mdp_super_t *tmp1, *tmp2;
>  
> -	tmp1 = kmalloc(sizeof(*tmp1),GFP_KERNEL);
> -	tmp2 = kmalloc(sizeof(*tmp2),GFP_KERNEL);
> -
> -	if (!tmp1 || !tmp2) {
> -		ret = 0;
> -		goto abort;
> -	}
> +	tmp1 = kmalloc(2 * sizeof(*tmp1), GFP_KERNEL);
> +	if (!tmp1)
> +		return 0;
>  
>  	*tmp1 = *sb1;
> +	tmp2 = tmp1 + 1;
>  	*tmp2 = *sb2;
>  
>  	/*
> @@ -861,9 +858,7 @@ static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
>  	tmp2->nr_disks = 0;
>  
>  	ret = (memcmp(tmp1, tmp2, MD_SB_GENERIC_CONSTANT_WORDS * 4) == 0);
> -abort:
>  	kfree(tmp1);
> -	kfree(tmp2);
>  	return ret;
>  }

^ permalink raw reply

* [PATCH] md: Combine two kmalloc() calls into one in sb_equal()
From: SF Markus Elfring @ 2016-12-09 18:30 UTC (permalink / raw)
  To: linux-raid, Shaohua Li; +Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Fri, 9 Dec 2016 19:09:13 +0100

The function "kmalloc" was called in one case by the function "sb_equal"
without checking immediately if it failed.
This issue was detected by using the Coccinelle software.

Perform the desired memory allocation (and release at the end)
by a single function call instead.

Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/md/md.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b088668269b0..86caf2536255 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -843,15 +843,12 @@ static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
 	int ret;
 	mdp_super_t *tmp1, *tmp2;
 
-	tmp1 = kmalloc(sizeof(*tmp1),GFP_KERNEL);
-	tmp2 = kmalloc(sizeof(*tmp2),GFP_KERNEL);
-
-	if (!tmp1 || !tmp2) {
-		ret = 0;
-		goto abort;
-	}
+	tmp1 = kmalloc(2 * sizeof(*tmp1), GFP_KERNEL);
+	if (!tmp1)
+		return 0;
 
 	*tmp1 = *sb1;
+	tmp2 = tmp1 + 1;
 	*tmp2 = *sb2;
 
 	/*
@@ -861,9 +858,7 @@ static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
 	tmp2->nr_disks = 0;
 
 	ret = (memcmp(tmp1, tmp2, MD_SB_GENERIC_CONSTANT_WORDS * 4) == 0);
-abort:
 	kfree(tmp1);
-	kfree(tmp2);
 	return ret;
 }
 
-- 
2.11.0

^ permalink raw reply related

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-09 15:36 UTC (permalink / raw)
  To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <87inqt1vzk.fsf@notabene.neil.brown.name>

On Fri, Dec 9, 2016 at 7:01 AM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Dec 08 2016, Jinpu Wang wrote:
>
>
> This number:
>
>>   nr_pending = {
>>     counter = 1
>>   },
>
>
>
> and this number:
>
>>   nr_pending = {
>>     counter = 856
>>   },
>
> might be interesting.
>
> There are 855 requested on the list.  Add the one that is currently
> being retried give 856, which is nr_pending for the device that failed.
> But nr_pending on the device that didn't fail is 1.  I would expect
> zero.
> When a read or write requests succeeds, rdev_dec_pending() is called
> immediately so this should quickly go to zero.
>
> It seems as though there must be a request to the loop device that is
> stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly
> inside read_balance) and the call to generic_make_request().
> I cannot yet see how that would happen.
>
> Can you check if the is a repeatable observation?  Is nr_pending.counter
> always '1' on the loop device?
>
> Thanks,
> NeilBrown

Hi Neil,

Yes, it's repreatable observation. I triggered again, this time
  nr_pending = 1203,
  nr_waiting = 8,
  nr_queued = 1201,

conf->retry_list has 1175 entries.
on conf->bio_end_io_list has 26 entries.
Totol is 1201, match nr_queued.

in md_rdev healthy one loop1 has 1 nr_pending.
faulty one ibnbd1 has 1076.

crash> struct md_rdev 0xffff880228880400
struct md_rdev {
  same_set = {
    next = 0xffff88023202a200,
    prev = 0xffff8800b64c6018
  },
  sectors = 2095104,
  mddev = 0xffff8800b64c6000,
  last_events = 17764573,
  meta_bdev = 0x0,
  bdev = 0xffff8800b60ce080,
  sb_page = 0xffffea0002bd3040,
  bb_page = 0xffffea0002dc76c0,
  sb_loaded = 1,
  sb_events = 166,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff880037962af0 "dev-ibnbd0",
    entry = {
      next = 0xffff880228880480,
      prev = 0xffff880228880480
    },
    parent = 0xffff8800b64c6050,
    kset = 0x0,
    ktype = 0xffffffffa0501300 <rdev_ktype>,
    sd = 0xffff88022bfc12d0,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 581,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8802288804c8,
      prev = 0xffff8802288804c8
    }
  },
  desc_nr = 0,
  raid_disk = 0,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 18446744073709551615,
    journal_tail = 18446744073709551615
  },
  nr_pending = {
    counter = 1176
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff88022bfc1348,
  badblocks = {
    count = 0,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff8802289aa000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 0
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 0,
    size = 0
  }
}

crash> struct md_rdev 0xffff88023202a200
struct md_rdev {
  same_set = {
    next = 0xffff8800b64c6018,
    prev = 0xffff880228880400
  },
  sectors = 2095104,
  mddev = 0xffff8800b64c6000,
  last_events = 37178561,
  meta_bdev = 0x0,
  bdev = 0xffff8800b60d09c0,
  sb_page = 0xffffea0008af7580,
  bb_page = 0xffffea0002e69380,
  sb_loaded = 1,
  sb_events = 167,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff88023521ec30 "dev-loop1",
    entry = {
      next = 0xffff88023202a280,
      prev = 0xffff88023202a280
    },
    parent = 0xffff8800b64c6050,
    kset = 0x0,
    ktype = 0xffffffffa0501300 <rdev_ktype>,
    sd = 0xffff88022bc0a708,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff88023202a2c8,
      prev = 0xffff88023202a2c8
    }
  },
crash> struct md_rdev 0xffff88023202a200
struct md_rdev {
  same_set = {
    next = 0xffff8800b64c6018,
    prev = 0xffff880228880400
  },
  sectors = 2095104,
  mddev = 0xffff8800b64c6000,
  last_events = 37178561,
  meta_bdev = 0x0,
  bdev = 0xffff8800b60d09c0,
  sb_page = 0xffffea0008af7580,
  bb_page = 0xffffea0002e69380,
  sb_loaded = 1,
  sb_events = 167,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff88023521ec30 "dev-loop1",
    entry = {
      next = 0xffff88023202a280,
      prev = 0xffff88023202a280
    },
    parent = 0xffff8800b64c6050,
    kset = 0x0,
    ktype = 0xffffffffa0501300 <rdev_ktype>,
    sd = 0xffff88022bc0a708,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff88023202a2c8,
      prev = 0xffff88023202a2c8
    }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 18446744073709551615,
    journal_tail = 18446744073709551615
  },
  nr_pending = {
    counter = 1
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff88022bc0a780,
  badblocks = {
    count = 0,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff88022bff0000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 164
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 0,
    size = 0
  }
}



Thanks

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-09 15:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <87inqt1vzk.fsf@notabene.neil.brown.name>

On Fri, Dec 9, 2016 at 7:01 AM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Dec 08 2016, Jinpu Wang wrote:
>
>
> This number:
>
>>   nr_pending = {
>>     counter = 1
>>   },
>
>
>
> and this number:
>
>>   nr_pending = {
>>     counter = 856
>>   },
>
> might be interesting.
>
> There are 855 requested on the list.  Add the one that is currently
> being retried give 856, which is nr_pending for the device that failed.
> But nr_pending on the device that didn't fail is 1.  I would expect
> zero.
> When a read or write requests succeeds, rdev_dec_pending() is called
> immediately so this should quickly go to zero.
>
> It seems as though there must be a request to the loop device that is
> stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly
> inside read_balance) and the call to generic_make_request().
> I cannot yet see how that would happen.
>
> Can you check if the is a repeatable observation?  Is nr_pending.counter
> always '1' on the loop device?
>
> Thanks,
> NeilBrown

Hi Neil,

Yes, it's repreatable observation. I triggered again, this time
  nr_pending = 1203,
  nr_waiting = 8,
  nr_queued = 1201,

conf->retry_list has 1175 entries.
on conf->bio_end_io_list has 26 entries.
Totol is 1201, match nr_queued.

in md_rdev healthy one loop1 has 1 nr_pending.
faulty one ibnbd1 has 1076.

crash> struct md_rdev 0xffff880228880400
struct md_rdev {
  same_set = {
    next = 0xffff88023202a200,
    prev = 0xffff8800b64c6018
  },
  sectors = 2095104,
  mddev = 0xffff8800b64c6000,
  last_events = 17764573,
  meta_bdev = 0x0,
  bdev = 0xffff8800b60ce080,
  sb_page = 0xffffea0002bd3040,
  bb_page = 0xffffea0002dc76c0,
  sb_loaded = 1,
  sb_events = 166,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff880037962af0 "dev-ibnbd0",
    entry = {
      next = 0xffff880228880480,
      prev = 0xffff880228880480
    },
    parent = 0xffff8800b64c6050,
    kset = 0x0,
    ktype = 0xffffffffa0501300 <rdev_ktype>,
    sd = 0xffff88022bfc12d0,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 581,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8802288804c8,
      prev = 0xffff8802288804c8
    }
  },
  desc_nr = 0,
  raid_disk = 0,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 18446744073709551615,
    journal_tail = 18446744073709551615
  },
  nr_pending = {
    counter = 1176
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff88022bfc1348,
  badblocks = {
    count = 0,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff8802289aa000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 0
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 0,
    size = 0
  }
}

crash> struct md_rdev 0xffff88023202a200
struct md_rdev {
  same_set = {
    next = 0xffff8800b64c6018,
    prev = 0xffff880228880400
  },
  sectors = 2095104,
  mddev = 0xffff8800b64c6000,
  last_events = 37178561,
  meta_bdev = 0x0,
  bdev = 0xffff8800b60d09c0,
  sb_page = 0xffffea0008af7580,
  bb_page = 0xffffea0002e69380,
  sb_loaded = 1,
  sb_events = 167,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff88023521ec30 "dev-loop1",
    entry = {
      next = 0xffff88023202a280,
      prev = 0xffff88023202a280
    },
    parent = 0xffff8800b64c6050,
    kset = 0x0,
    ktype = 0xffffffffa0501300 <rdev_ktype>,
    sd = 0xffff88022bc0a708,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff88023202a2c8,
      prev = 0xffff88023202a2c8
    }
  },
crash> struct md_rdev 0xffff88023202a200
struct md_rdev {
  same_set = {
    next = 0xffff8800b64c6018,
    prev = 0xffff880228880400
  },
  sectors = 2095104,
  mddev = 0xffff8800b64c6000,
  last_events = 37178561,
  meta_bdev = 0x0,
  bdev = 0xffff8800b60d09c0,
  sb_page = 0xffffea0008af7580,
  bb_page = 0xffffea0002e69380,
  sb_loaded = 1,
  sb_events = 167,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff88023521ec30 "dev-loop1",
    entry = {
      next = 0xffff88023202a280,
      prev = 0xffff88023202a280
    },
    parent = 0xffff8800b64c6050,
    kset = 0x0,
    ktype = 0xffffffffa0501300 <rdev_ktype>,
    sd = 0xffff88022bc0a708,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff88023202a2c8,
      prev = 0xffff88023202a2c8
    }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 18446744073709551615,
    journal_tail = 18446744073709551615
  },
  nr_pending = {
    counter = 1
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff88022bc0a780,
  badblocks = {
    count = 0,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff88022bff0000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 164
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 0,
    size = 0
  }
}



Thanks

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: raid0 vs. mkfs
From: Coly Li @ 2016-12-09  7:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Avi Kivity, NeilBrown, linux-raid, linux-block
In-Reply-To: <20161208191906.ealqsudmgopn7wa3@kernel.org>

On 2016/12/9 上午3:19, Shaohua Li wrote:
> On Fri, Dec 09, 2016 at 12:44:57AM +0800, Coly Li wrote:
>> On 2016/12/8 上午12:59, Shaohua Li wrote:
>>> On Wed, Dec 07, 2016 at 07:50:33PM +0800, Coly Li wrote:
>> [snip]
>>> Thanks for doing this, Coly! For raid0, this totally makes sense. The raid0
>>> zones make things a little complicated though. I just had a brief look of your
>>> proposed patch, which looks really complicated. I'd suggest something like
>>> this:
>>> 1. split the bio according to zone boundary.
>>> 2. handle the splitted bio. since the bio is within zone range, calculating
>>> the start and end sector for each rdev should be easy.
>>>
>>
>> Hi Shaohua,
>>
>> Thanks for your suggestion! I try to modify the code by your suggestion,
>> it is even more hard to make the code that way ...
>>
>> Because even split bios for each zone, all the corner cases still exist
>> and should be taken care in every zoon. The code will be more complicated.
> 
> Not sure why it makes the code more complicated. Probably I'm wrong, but Just
> want to make sure we are in the same page: split the bio according to zone
> boundary, then handle the splitted bio separately. Calculating end/start point
> of each rdev for the new bio within a zone should be simple. we then clone a
> bio for each rdev and dispatch. So for example:
> Disk 0: D0 D2 D4 D6 D7
> Disk 1: D1 D3 D5
> zone 0 is from D0 - D5, zone 1 is from D6 - D7
> If bio is from D1 to D7, we split it to 2 bios, one is D1 - D5, the other D6 - D7.
> For D1 - D5, we dispatch 2 bios. D1 - D5 for disk 1, D2 - D4 for disk 0
> For D6 - D7, we just dispatch to disk 0.
> What kind of corner case makes this more complicated?
>  

Let me explain the corner cases.

When upper layer code issues a DISCARD bio, the bio->bi_iter.bi_sector
may not be chunk size aligned, and bio->bi_iter.bi_size may not be
(chunk_sects*nb_dev) sectors aligned. In raid0, we can't simply round
up/down them into chunk size aligned number, otherwise data
lost/corruption will happen.

Therefore for each DISCARD bio that raid0_make_request() receive, the
beginning and ending parts of this bio should be treat very carefully.
All the corner cases *come from here*, they are not about number of
zones or rdevs, it is about whether bio->bi_iter.bi_sector and
bio->bi_iter.bi_size are chunk size aligned or not.

- beginning of the bio
  If bio->bi_iter.bi_sector is not chunk size aligned, current raid0
code will split the beginning part into split bio which only contains
sectors from bio->bi_iter.sector to next chunk size aligned offset, and
issue this bio by generic_make_request(). But in
discard_large_discard_bio() we can't issue the split bio now, we have to
record lenth of this split bio into a per-device structure, and issue a
split bio after all the sectors of the DISCARD bio are calculated. So I
use recs[first_rdev_idx].bi_sector to rcoard bi_iter.bi_sector of the
split bio, recs[first_rdev_idx].sectors to record length of this split
bio, and recs[first_rdev_idx].rdev to record address of the real device
where the split bio will be sent to.

  After the first non-chunk-size aligned part handled, we need to look
into the next rdev for next chunk of DISCARD bio. Now the sector offset
is chunk size aligned, but there are still two condition:
  1) If the first_rdev is not the last read device in current zone, then
on the next real device, its per-device bio will start on a round down
chunk offset of recs[first_rdev_idx].bi_sector. If
recs[first_rdev_idx].bi_sector is chunk size aligned, the next real
device's per-device bio will start at the same chunk offset.

  2) If first_rdev is the last real device in current zone, then next
rdev to handle is the number 0 real device among
conf->strip_zone[zone_idx].nb_dev real devices. In this case, the
bi_iter.bi_setor of the per-device bio of this real device, is the chunk
offset next to chunk which recs[first_rdev_idx].bi_sector hits on
recs[first_rdev_idx].rdev.

- ending part of the bio
  If length of the DISCARD bio is not (chunk_sects * nb_dev) sectors
aligned, after we handled one or more (chunk_sects*nb_dev)) aligned
sectors, the rested sectors are less then chunk_sects*nb_dev, but these
sectors may still fit in some rested_chunks and rested_sectors. We also
need to handle them carefully.
  If rested_chunks is not 0, because chunks are linearly allocated on
each real device in current zone, we need to add chunk_sects to
recs[rdev_idx].sectors, where rdev_idx starts from 0 to
strip_zone[zone_idx].nb_dev - 1. When we move to next real device
(rdev_idx ++), we need to reduce one chunk from rested_chunk
(rested_chunk --). If rested_chunk reaches 0, we start to handle
rested_sectors.
  rested_sectors will be added to the next real device, we just simply
add rested_sectors to recs[rdev_idx].sectors. There is a corner case
that after the calculation of DISCARD bio, rested_chunks is 0, and
rested_sectors is not 0. In this case, we only need to add
rested_sectors to number 0 real device of the zone, which is recs[0].

A DISCARD bio is probably to only cover one raid0 zone, but all the
above corner cases have to be taken care. Therefore submitting split
bios for each zone, does not make the code simpler.

To handle the details correctly is really boring, if I can explain to
you face to face, that will be much easier.

>>> This will create slightly more bio to each rdev (not too many, since there
>>> aren't too many zones in practice) and block layer should easily merge these
>>> bios without much overhead. The benefit is a much simpler implementation.
>>>
>>>> I compose a prototype patch, the code is not simple, indeed it is quite
>>>> complicated IMHO.
>>>>
>>>> I do a little research, some NVMe SSDs support whole device size
>>>> DISCARD, also I observe mkfs.xfs sends out a raid0 device size DISCARD
>>>> bio to block layer. But raid0_make_request() only receives 512KB size
>>>> DISCARD bio, block/blk-lib.c:__blkdev_issue_discard() splits the
>>>> original large bio into 512KB small bios, the limitation is from
>>>> q->limits.discard_granularity.
>>>
>>> please adjust the max discard sectors for the queue. The original setting is
>>> chunk size.
>>
>> This is a powerful suggestion, I change the max_discard_sectors to raid0
>> size, and fix some bugs, now the patch looks working well. The
>> performance number is not bad.
>>
>> On 4x3TB NVMe raid0, format it with mkfs.xfs. Current upstream kernel
>> spends 306 seconds, the patched kernel spends 15 seconds. I see average
>> request size increases from 1 chunk (1024 sectors) to 2048 chunks
>> (2097152 sectors).
>>
>> I don't know why the bios are still be split before raid0_make_request()
>> receives them, after I set q->limits.max_discard_sectors to
>> mddev->array_sectors. Can anybody give me a hint ?
> 
> That probably is from disk_stack_limits. try set the max_discard_sectors after it.
> 

I know why the bio is still split, (1<<32)/512 = 8388608, in
__blkdev_issue_discard(), req_sects is decided by:
	/* Make sure bi_size doesn't overflow */
	req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
I try to simply modify bi_size from unsigned int to unsigned long, and
change the above limit to ULONG_MAX>>9, kernel panics. It seems change
bi_sector from unsigned int to unsigned long is not simple.

If we don't change bi_iter.bi_size to unsigned long, this is the best
effort now.

>> Here I attach the RFC v2 patch, if anybody wants to try it, please do it
>> and response the result :-)
>>
>> I will take time to write a very detailed commit log and code comments
>> to make this patch more easier to be understood. Ugly code, that's what
>> I have to pay to gain better performance ....
> 
> Can't say I like it :). Hard to read and the memory allocation is ugly. Please
> check if there is simpler solution first before writting detailed commit log.

I don't like it neither ....
I can imagine how hard to read it, because even explain it in text is
difficult... A lot of details to take care, and the calculation should
be exactly match in the end.

I tried to avoid to allocate memory for recs[], but without a data
structure to record the per-devcie bio attribution, I cannot find a
better way to write the code. There is only one thing makes me to be 1%+
comfortable is, which is bio_split() also allocates memory ^_^

Coly

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-09  6:01 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <CAMGffEntREcP-A=td4m3LRp1oHtxR=MMH_=a4N49UPvzFj-02A@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 884 bytes --]

On Thu, Dec 08 2016, Jinpu Wang wrote:

This number:

>   nr_pending = {
>     counter = 1
>   },

and this number:

>   nr_pending = {
>     counter = 856
>   },

might be interesting.

There are 855 requested on the list.  Add the one that is currently
being retried give 856, which is nr_pending for the device that failed.
But nr_pending on the device that didn't fail is 1.  I would expect
zero.
When a read or write requests succeeds, rdev_dec_pending() is called
immediately so this should quickly go to zero.

It seems as though there must be a request to the loop device that is
stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly
inside read_balance) and the call to generic_make_request().
I cannot yet see how that would happen.

Can you check if the is a repeatable observation?  Is nr_pending.counter
always '1' on the loop device?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 3/3] md: separate flags for superblock changes
From: Shaohua Li @ 2016-12-09  5:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid
In-Reply-To: <87lgvp1zld.fsf@notabene.neil.brown.name>

On Fri, Dec 09, 2016 at 03:43:58PM +1100, Neil Brown wrote:
> On Fri, Dec 09 2016, Shaohua Li wrote:
> 
> > The mddev->flags are used for different purposes. There are a lot of
> > places we check/change the flags without masking unrelated flags, we
> > could check/change unrelated flags. These usage are most for superblock
> > write, so spearate superblock related flags. This should make the code
> > clearer and also fix real bugs.
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> 
> That real bug would be:
> 
> >  		md_wakeup_thread(mddev->thread);
> > -		wait_event(mddev->sb_wait, mddev->flags == 0 ||
> > +		wait_event(mddev->sb_wait, mddev->sb_flags == 0 ||
> >  			   test_bit(MD_RECOVERY_INTR, &mddev->recovery));

Yes, for the reshape hang.

> 
> mddev->flags used to be called mddev->sb_dirty, before
> Commit: 850b2b420cd5 ("[PATCH] md: replace magic numbers in sb_dirty with well defined bit flags")
> 
> Then we added lots of other flags.  Now we are going back to the same
> idea :-)
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> 
> Thanks,
> NeilBrown



^ permalink raw reply

* Re: [PATCH 1/3] md: takeover should clear unrelated bits
From: Shaohua Li @ 2016-12-09  5:15 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid
In-Reply-To: <87oa0l1zoo.fsf@notabene.neil.brown.name>

On Fri, Dec 09, 2016 at 03:41:59PM +1100, Neil Brown wrote:
> On Fri, Dec 09 2016, Shaohua Li wrote:
> 
> > When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
> > will be accidentally set, but raid5 doesn't support it. The same is true
> > for the MD_HAS_JOURNAL bit.
> >
> > Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate)
> > Cc: NeilBrown <neilb@suse.com>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/raid0.c | 5 +++++
> >  drivers/md/raid1.c | 5 ++++-
> >  drivers/md/raid5.c | 6 +++++-
> >  3 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> > index e628f18..a162fed 100644
> > --- a/drivers/md/raid0.c
> > +++ b/drivers/md/raid0.c
> > @@ -539,8 +539,11 @@ static void *raid0_takeover_raid45(struct mddev *mddev)
> >  	mddev->delta_disks = -1;
> >  	/* make sure it will be not marked as dirty */
> >  	mddev->recovery_cp = MaxSector;
> > +	clear_bit(MD_HAS_JOURNAL, &mddev->flags);
> > +	clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
> >  
> >  	create_strip_zones(mddev, &priv_conf);
> > +
> >  	return priv_conf;
> >  }
> 
> 
> This looks like a good fix, but it is a little fragile.
> If a new bit is added, we would need to go through all the takeover
> functions and check if it needs to be cleared, and we would forget.
> It might be better if each personality defined a VALID_MD_FLAGS or
> whatever, and the takeover function always did
>  set_mask_bits(&mddev->flags, 0, VALID_MD_FLAGS);

This is fragile indeed. My original idea is to clear the flags in level_store,
but it doesn't work well as different personablities have different valid
falgs. The suggestion looks good. Thanks for the review.

Thanks,
Shaohua

> 
> But that can come later.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> 
> Thanks,
> NeilBrown
> 
> 
> >  
> > @@ -580,6 +583,7 @@ static void *raid0_takeover_raid10(struct mddev *mddev)
> >  	mddev->degraded = 0;
> >  	/* make sure it will be not marked as dirty */
> >  	mddev->recovery_cp = MaxSector;
> > +	clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> >  
> >  	create_strip_zones(mddev, &priv_conf);
> >  	return priv_conf;
> > @@ -622,6 +626,7 @@ static void *raid0_takeover_raid1(struct mddev *mddev)
> >  	mddev->raid_disks = 1;
> >  	/* make sure it will be not marked as dirty */
> >  	mddev->recovery_cp = MaxSector;
> > +	clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> >  
> >  	create_strip_zones(mddev, &priv_conf);
> >  	return priv_conf;
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 94e0afc..efc2e74 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -3243,9 +3243,12 @@ static void *raid1_takeover(struct mddev *mddev)
> >  		mddev->new_layout = 0;
> >  		mddev->new_chunk_sectors = 0;
> >  		conf = setup_conf(mddev);
> > -		if (!IS_ERR(conf))
> > +		if (!IS_ERR(conf)) {
> >  			/* Array must appear to be quiesced */
> >  			conf->array_frozen = 1;
> > +			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
> > +			clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
> > +		}
> >  		return conf;
> >  	}
> >  	return ERR_PTR(-EINVAL);
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 6bf3c26..3e6a2a0 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -7811,6 +7811,7 @@ static void *raid45_takeover_raid0(struct mddev *mddev, int level)
> >  static void *raid5_takeover_raid1(struct mddev *mddev)
> >  {
> >  	int chunksect;
> > +	void *ret;
> >  
> >  	if (mddev->raid_disks != 2 ||
> >  	    mddev->degraded > 1)
> > @@ -7832,7 +7833,10 @@ static void *raid5_takeover_raid1(struct mddev *mddev)
> >  	mddev->new_layout = ALGORITHM_LEFT_SYMMETRIC;
> >  	mddev->new_chunk_sectors = chunksect;
> >  
> > -	return setup_conf(mddev);
> > +	ret = setup_conf(mddev);
> > +	if (!IS_ERR_VALUE(ret))
> > +		clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> > +	return ret;
> >  }
> >  
> >  static void *raid5_takeover_raid6(struct mddev *mddev)
> > -- 
> > 2.9.3
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply

* Re: [PATCH 3/3] md: separate flags for superblock changes
From: NeilBrown @ 2016-12-09  4:43 UTC (permalink / raw)
  To: Shaohua Li, linux-raid
In-Reply-To: <a836dcdaef1cecec422f492cca1c4e525034d0e6.1481240632.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

On Fri, Dec 09 2016, Shaohua Li wrote:

> The mddev->flags are used for different purposes. There are a lot of
> places we check/change the flags without masking unrelated flags, we
> could check/change unrelated flags. These usage are most for superblock
> write, so spearate superblock related flags. This should make the code
> clearer and also fix real bugs.
>
> Signed-off-by: Shaohua Li <shli@fb.com>

That real bug would be:

>  		md_wakeup_thread(mddev->thread);
> -		wait_event(mddev->sb_wait, mddev->flags == 0 ||
> +		wait_event(mddev->sb_wait, mddev->sb_flags == 0 ||
>  			   test_bit(MD_RECOVERY_INTR, &mddev->recovery));

??

mddev->flags used to be called mddev->sb_dirty, before
Commit: 850b2b420cd5 ("[PATCH] md: replace magic numbers in sb_dirty with well defined bit flags")

Then we added lots of other flags.  Now we are going back to the same
idea :-)

Reviewed-by: NeilBrown <neilb@suse.com>

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 1/3] md: takeover should clear unrelated bits
From: NeilBrown @ 2016-12-09  4:41 UTC (permalink / raw)
  To: Shaohua Li, linux-raid
In-Reply-To: <8ba65915901de01043eb08cfadd07d65767776fb.1481240632.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 3839 bytes --]

On Fri, Dec 09 2016, Shaohua Li wrote:

> When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
> will be accidentally set, but raid5 doesn't support it. The same is true
> for the MD_HAS_JOURNAL bit.
>
> Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate)
> Cc: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid0.c | 5 +++++
>  drivers/md/raid1.c | 5 ++++-
>  drivers/md/raid5.c | 6 +++++-
>  3 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index e628f18..a162fed 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -539,8 +539,11 @@ static void *raid0_takeover_raid45(struct mddev *mddev)
>  	mddev->delta_disks = -1;
>  	/* make sure it will be not marked as dirty */
>  	mddev->recovery_cp = MaxSector;
> +	clear_bit(MD_HAS_JOURNAL, &mddev->flags);
> +	clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
>  
>  	create_strip_zones(mddev, &priv_conf);
> +
>  	return priv_conf;
>  }


This looks like a good fix, but it is a little fragile.
If a new bit is added, we would need to go through all the takeover
functions and check if it needs to be cleared, and we would forget.
It might be better if each personality defined a VALID_MD_FLAGS or
whatever, and the takeover function always did
 set_mask_bits(&mddev->flags, 0, VALID_MD_FLAGS);

??

But that can come later.

Reviewed-by: NeilBrown <neilb@suse.com>

Thanks,
NeilBrown


>  
> @@ -580,6 +583,7 @@ static void *raid0_takeover_raid10(struct mddev *mddev)
>  	mddev->degraded = 0;
>  	/* make sure it will be not marked as dirty */
>  	mddev->recovery_cp = MaxSector;
> +	clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>  
>  	create_strip_zones(mddev, &priv_conf);
>  	return priv_conf;
> @@ -622,6 +626,7 @@ static void *raid0_takeover_raid1(struct mddev *mddev)
>  	mddev->raid_disks = 1;
>  	/* make sure it will be not marked as dirty */
>  	mddev->recovery_cp = MaxSector;
> +	clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>  
>  	create_strip_zones(mddev, &priv_conf);
>  	return priv_conf;
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 94e0afc..efc2e74 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -3243,9 +3243,12 @@ static void *raid1_takeover(struct mddev *mddev)
>  		mddev->new_layout = 0;
>  		mddev->new_chunk_sectors = 0;
>  		conf = setup_conf(mddev);
> -		if (!IS_ERR(conf))
> +		if (!IS_ERR(conf)) {
>  			/* Array must appear to be quiesced */
>  			conf->array_frozen = 1;
> +			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
> +			clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
> +		}
>  		return conf;
>  	}
>  	return ERR_PTR(-EINVAL);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6bf3c26..3e6a2a0 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7811,6 +7811,7 @@ static void *raid45_takeover_raid0(struct mddev *mddev, int level)
>  static void *raid5_takeover_raid1(struct mddev *mddev)
>  {
>  	int chunksect;
> +	void *ret;
>  
>  	if (mddev->raid_disks != 2 ||
>  	    mddev->degraded > 1)
> @@ -7832,7 +7833,10 @@ static void *raid5_takeover_raid1(struct mddev *mddev)
>  	mddev->new_layout = ALGORITHM_LEFT_SYMMETRIC;
>  	mddev->new_chunk_sectors = chunksect;
>  
> -	return setup_conf(mddev);
> +	ret = setup_conf(mddev);
> +	if (!IS_ERR_VALUE(ret))
> +		clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> +	return ret;
>  }
>  
>  static void *raid5_takeover_raid6(struct mddev *mddev)
> -- 
> 2.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 2/3] md: MD_RECOVERY_NEEDED is set for mddev->recovery
From: NeilBrown @ 2016-12-09  4:31 UTC (permalink / raw)
  To: Shaohua Li, linux-raid
In-Reply-To: <8c8e0b996b721fb9747fb7b7d312d59dbba3b4c4.1481240632.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 989 bytes --]

On Fri, Dec 09 2016, Shaohua Li wrote:

> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 84dc891..5e66648 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6856,7 +6856,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>  		/* need to ensure recovery thread has run */
>  		wait_event_interruptible_timeout(mddev->sb_wait,
>  						 !test_bit(MD_RECOVERY_NEEDED,
> -							   &mddev->flags),
> +							   &mddev->recovery),
>  						 msecs_to_jiffies(5000));
>  	if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
>  		/* Need to flush page cache, and ensure no-one else opens
> -- 
> 2.9.3

That was careless!

It would be good to add

Fixes: 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.")

to this.

Reviewed-by: NeilBrown <neilb@suse.com>

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH 3/3] md: separate flags for superblock changes
From: Shaohua Li @ 2016-12-08 23:48 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb
In-Reply-To: <cover.1481240632.git.shli@fb.com>

The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/bitmap.c      |   4 +-
 drivers/md/dm-raid.c     |   4 +-
 drivers/md/md.c          | 115 ++++++++++++++++++++++++-----------------------
 drivers/md/md.h          |  16 ++++---
 drivers/md/multipath.c   |   2 +-
 drivers/md/raid1.c       |  12 ++---
 drivers/md/raid10.c      |  22 ++++-----
 drivers/md/raid5-cache.c |   6 +--
 drivers/md/raid5.c       |  26 +++++------
 9 files changed, 106 insertions(+), 101 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index c462157..9fb2cca 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1623,7 +1623,7 @@ void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector, bool force)
 		   atomic_read(&bitmap->mddev->recovery_active) == 0);
 
 	bitmap->mddev->curr_resync_completed = sector;
-	set_bit(MD_CHANGE_CLEAN, &bitmap->mddev->flags);
+	set_bit(MD_SB_CHANGE_CLEAN, &bitmap->mddev->sb_flags);
 	sector &= ~((1ULL << bitmap->counts.chunkshift) - 1);
 	s = 0;
 	while (s < sector && s < bitmap->mddev->resync_max_sectors) {
@@ -2296,7 +2296,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
 		/* Ensure new bitmap info is stored in
 		 * metadata promptly.
 		 */
-		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		md_wakeup_thread(mddev->thread);
 	}
 	rv = 0;
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 6d53810..953159d 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -2011,7 +2011,7 @@ static int super_load(struct md_rdev *rdev, struct md_rdev *refdev)
 		sb->compat_features = cpu_to_le32(FEATURE_FLAG_SUPPORTS_V190);
 
 		/* Force writing of superblocks to disk */
-		set_bit(MD_CHANGE_DEVS, &rdev->mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &rdev->mddev->sb_flags);
 
 		/* Any superblock is better than none, choose that if given */
 		return refdev ? 0 : 1;
@@ -3497,7 +3497,7 @@ static void rs_update_sbs(struct raid_set *rs)
 	struct mddev *mddev = &rs->md;
 	int ro = mddev->ro;
 
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	mddev->ro = 0;
 	md_update_sb(mddev, 1);
 	mddev->ro = ro;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5e66648..c15e234 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -729,7 +729,7 @@ static void super_written(struct bio *bio)
 		md_error(mddev, rdev);
 		if (!test_bit(Faulty, &rdev->flags)
 		    && (bio->bi_opf & MD_FAILFAST)) {
-			set_bit(MD_NEED_REWRITE, &mddev->flags);
+			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
 			set_bit(LastDev, &rdev->flags);
 		}
 	} else
@@ -780,7 +780,7 @@ int md_super_wait(struct mddev *mddev)
 {
 	/* wait for all superblock writes that were scheduled to complete */
 	wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
-	if (test_and_clear_bit(MD_NEED_REWRITE, &mddev->flags))
+	if (test_and_clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
 		return -EAGAIN;
 	return 0;
 }
@@ -2322,24 +2322,24 @@ void md_update_sb(struct mddev *mddev, int force_change)
 
 	if (mddev->ro) {
 		if (force_change)
-			set_bit(MD_CHANGE_DEVS, &mddev->flags);
+			set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		return;
 	}
 
 repeat:
 	if (mddev_is_clustered(mddev)) {
-		if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
+		if (test_and_clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags))
 			force_change = 1;
-		if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags))
+		if (test_and_clear_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags))
 			nospares = 1;
 		ret = md_cluster_ops->metadata_update_start(mddev);
 		/* Has someone else has updated the sb */
 		if (!does_sb_need_changing(mddev)) {
 			if (ret == 0)
 				md_cluster_ops->metadata_update_cancel(mddev);
-			bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING),
-							 BIT(MD_CHANGE_DEVS) |
-							 BIT(MD_CHANGE_CLEAN));
+			bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
+							 BIT(MD_SB_CHANGE_DEVS) |
+							 BIT(MD_SB_CHANGE_CLEAN));
 			return;
 		}
 	}
@@ -2355,10 +2355,10 @@ void md_update_sb(struct mddev *mddev, int force_change)
 
 	}
 	if (!mddev->persistent) {
-		clear_bit(MD_CHANGE_CLEAN, &mddev->flags);
-		clear_bit(MD_CHANGE_DEVS, &mddev->flags);
+		clear_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+		clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		if (!mddev->external) {
-			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 			rdev_for_each(rdev, mddev) {
 				if (rdev->badblocks.changed) {
 					rdev->badblocks.changed = 0;
@@ -2378,9 +2378,9 @@ void md_update_sb(struct mddev *mddev, int force_change)
 
 	mddev->utime = ktime_get_real_seconds();
 
-	if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
+	if (test_and_clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags))
 		force_change = 1;
-	if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags))
+	if (test_and_clear_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags))
 		/* just a clean<-> dirty transition, possibly leave spares alone,
 		 * though if events isn't the right even/odd, we will have to do
 		 * spares after all
@@ -2472,14 +2472,14 @@ void md_update_sb(struct mddev *mddev, int force_change)
 	}
 	if (md_super_wait(mddev) < 0)
 		goto rewrite;
-	/* if there was a failure, MD_CHANGE_DEVS was set, and we re-write super */
+	/* if there was a failure, MD_SB_CHANGE_DEVS was set, and we re-write super */
 
 	if (mddev_is_clustered(mddev) && ret == 0)
 		md_cluster_ops->metadata_update_finish(mddev);
 
 	if (mddev->in_sync != sync_req ||
-	    !bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING),
-			       BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_CLEAN)))
+	    !bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
+			       BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_CLEAN)))
 		/* have to write it out again */
 		goto repeat;
 	wake_up(&mddev->sb_wait);
@@ -2523,7 +2523,7 @@ static int add_bound_rdev(struct md_rdev *rdev)
 	}
 	sysfs_notify_dirent_safe(rdev->sysfs_state);
 
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	if (mddev->degraded)
 		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -2640,7 +2640,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 			if (err == 0) {
 				md_kick_rdev_from_array(rdev);
 				if (mddev->pers) {
-					set_bit(MD_CHANGE_DEVS, &mddev->flags);
+					set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 					md_wakeup_thread(mddev->thread);
 				}
 				md_new_event(mddev);
@@ -3651,7 +3651,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	}
 	blk_set_stacking_limits(&mddev->queue->limits);
 	pers->run(mddev);
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	mddev_resume(mddev);
 	if (!mddev->thread)
 		md_update_sb(mddev, 1);
@@ -3846,7 +3846,7 @@ resync_start_store(struct mddev *mddev, const char *buf, size_t len)
 	if (!err) {
 		mddev->recovery_cp = n;
 		if (mddev->pers)
-			set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 	}
 	mddev_unlock(mddev);
 	return err ?: len;
@@ -3920,7 +3920,7 @@ array_state_show(struct mddev *mddev, char *page)
 			st = read_auto;
 			break;
 		case 0:
-			if (test_bit(MD_CHANGE_PENDING, &mddev->flags))
+			if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
 				st = write_pending;
 			else if (mddev->in_sync)
 				st = clean;
@@ -3958,7 +3958,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 		spin_lock(&mddev->lock);
 		if (st == active) {
 			restart_array(mddev);
-			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 			md_wakeup_thread(mddev->thread);
 			wake_up(&mddev->sb_wait);
 			err = 0;
@@ -3969,7 +3969,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 					mddev->in_sync = 1;
 					if (mddev->safemode == 1)
 						mddev->safemode = 0;
-					set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 				}
 				err = 0;
 			} else
@@ -4035,7 +4035,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 					mddev->in_sync = 1;
 					if (mddev->safemode == 1)
 						mddev->safemode = 0;
-					set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 				}
 				err = 0;
 			} else
@@ -4049,7 +4049,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			err = restart_array(mddev);
 			if (err)
 				break;
-			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 			wake_up(&mddev->sb_wait);
 			err = 0;
 		} else {
@@ -5378,7 +5378,7 @@ int md_run(struct mddev *mddev)
 		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 
-	if (mddev->flags & MD_UPDATE_SB_FLAGS)
+	if (mddev->sb_flags)
 		md_update_sb(mddev, 0);
 
 	md_new_event(mddev);
@@ -5473,6 +5473,7 @@ static void md_clean(struct mddev *mddev)
 	mddev->level = LEVEL_NONE;
 	mddev->clevel[0] = 0;
 	mddev->flags = 0;
+	mddev->sb_flags = 0;
 	mddev->ro = 0;
 	mddev->metadata_type[0] = 0;
 	mddev->chunk_sectors = 0;
@@ -5525,7 +5526,7 @@ static void __md_stop_writes(struct mddev *mddev)
 
 	if (mddev->ro == 0 &&
 	    ((!mddev->in_sync && !mddev_is_clustered(mddev)) ||
-	     (mddev->flags & MD_UPDATE_SB_FLAGS))) {
+	     mddev->sb_flags)) {
 		/* mark array as shutdown cleanly */
 		if (!mddev_is_clustered(mddev))
 			mddev->in_sync = 1;
@@ -5608,13 +5609,13 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 		 * which will now never happen */
 		wake_up_process(mddev->sync_thread->tsk);
 
-	if (mddev->external && test_bit(MD_CHANGE_PENDING, &mddev->flags))
+	if (mddev->external && test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
 		return -EBUSY;
 	mddev_unlock(mddev);
 	wait_event(resync_wait, !test_bit(MD_RECOVERY_RUNNING,
 					  &mddev->recovery));
 	wait_event(mddev->sb_wait,
-		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
 	mddev_lock_nointr(mddev);
 
 	mutex_lock(&mddev->open_mutex);
@@ -6234,7 +6235,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
 		md_cluster_ops->remove_disk(mddev, rdev);
 
 	md_kick_rdev_from_array(rdev);
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	if (mddev->thread)
 		md_wakeup_thread(mddev->thread);
 	else
@@ -6303,7 +6304,7 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
 
 	rdev->raid_disk = -1;
 
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	if (!mddev->thread)
 		md_update_sb(mddev, 1);
 	/*
@@ -6460,9 +6461,11 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
 
 	mddev->max_disks     = MD_SB_DISKS;
 
-	if (mddev->persistent)
+	if (mddev->persistent) {
 		mddev->flags         = 0;
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		mddev->sb_flags         = 0;
+	}
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 
 	mddev->bitmap_info.default_offset = MD_SB_BYTES >> 9;
 	mddev->bitmap_info.default_space = 64*2 - (MD_SB_BYTES >> 9);
@@ -7007,11 +7010,11 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 			/* If a device failed while we were read-only, we
 			 * need to make sure the metadata is updated now.
 			 */
-			if (test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
+			if (test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags)) {
 				mddev_unlock(mddev);
 				wait_event(mddev->sb_wait,
-					   !test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
-					   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+					   !test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags) &&
+					   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
 				mddev_lock_nointr(mddev);
 			}
 		} else {
@@ -7766,8 +7769,8 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
 			mddev->in_sync = 0;
-			set_bit(MD_CHANGE_CLEAN, &mddev->flags);
-			set_bit(MD_CHANGE_PENDING, &mddev->flags);
+			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 			md_wakeup_thread(mddev->thread);
 			did_change = 1;
 		}
@@ -7776,7 +7779,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	if (did_change)
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	wait_event(mddev->sb_wait,
-		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
 }
 EXPORT_SYMBOL(md_write_start);
 
@@ -7797,7 +7800,7 @@ EXPORT_SYMBOL(md_write_end);
  * attempting a GFP_KERNEL allocation while holding the mddev lock.
  * Must be called with mddev_lock held.
  *
- * In the ->external case MD_CHANGE_PENDING can not be cleared until mddev->lock
+ * In the ->external case MD_SB_CHANGE_PENDING can not be cleared until mddev->lock
  * is dropped, so return -EAGAIN after notifying userspace.
  */
 int md_allow_write(struct mddev *mddev)
@@ -7812,8 +7815,8 @@ int md_allow_write(struct mddev *mddev)
 	spin_lock(&mddev->lock);
 	if (mddev->in_sync) {
 		mddev->in_sync = 0;
-		set_bit(MD_CHANGE_CLEAN, &mddev->flags);
-		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+		set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 		if (mddev->safemode_delay &&
 		    mddev->safemode == 0)
 			mddev->safemode = 1;
@@ -7823,7 +7826,7 @@ int md_allow_write(struct mddev *mddev)
 	} else
 		spin_unlock(&mddev->lock);
 
-	if (test_bit(MD_CHANGE_PENDING, &mddev->flags))
+	if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
 		return -EAGAIN;
 	else
 		return 0;
@@ -8058,7 +8061,7 @@ void md_do_sync(struct md_thread *thread)
 			    j > mddev->recovery_cp)
 				mddev->recovery_cp = j;
 			update_time = jiffies;
-			set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 		}
 
@@ -8206,8 +8209,8 @@ void md_do_sync(struct md_thread *thread)
 	/* set CHANGE_PENDING here since maybe another update is needed,
 	 * so other nodes are informed. It should be harmless for normal
 	 * raid */
-	set_mask_bits(&mddev->flags, 0,
-		      BIT(MD_CHANGE_PENDING) | BIT(MD_CHANGE_DEVS));
+	set_mask_bits(&mddev->sb_flags, 0,
+		      BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
 
 	spin_lock(&mddev->lock);
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
@@ -8307,12 +8310,12 @@ static int remove_and_add_spares(struct mddev *mddev,
 			if (!test_bit(Journal, &rdev->flags))
 				spares++;
 			md_new_event(mddev);
-			set_bit(MD_CHANGE_DEVS, &mddev->flags);
+			set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		}
 	}
 no_add:
 	if (removed)
-		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	return spares;
 }
 
@@ -8385,7 +8388,7 @@ void md_check_recovery(struct mddev *mddev)
 	if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
 		return;
 	if ( ! (
-		(mddev->flags & MD_UPDATE_SB_FLAGS & ~ (1<<MD_CHANGE_PENDING)) ||
+		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 		test_bit(MD_RELOAD_SB, &mddev->flags) ||
@@ -8423,7 +8426,7 @@ void md_check_recovery(struct mddev *mddev)
 			md_reap_sync_thread(mddev);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 			goto unlock;
 		}
 
@@ -8451,7 +8454,7 @@ void md_check_recovery(struct mddev *mddev)
 			    mddev->recovery_cp == MaxSector) {
 				mddev->in_sync = 1;
 				did_change = 1;
-				set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			}
 			if (mddev->safemode == 1)
 				mddev->safemode = 0;
@@ -8460,7 +8463,7 @@ void md_check_recovery(struct mddev *mddev)
 				sysfs_notify_dirent_safe(mddev->sysfs_state);
 		}
 
-		if (mddev->flags & MD_UPDATE_SB_FLAGS)
+		if (mddev->sb_flags)
 			md_update_sb(mddev, 0);
 
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
@@ -8556,7 +8559,7 @@ void md_reap_sync_thread(struct mddev *mddev)
 		if (mddev->pers->spare_active(mddev)) {
 			sysfs_notify(&mddev->kobj, NULL,
 				     "degraded");
-			set_bit(MD_CHANGE_DEVS, &mddev->flags);
+			set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		}
 	}
 	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
@@ -8571,7 +8574,7 @@ void md_reap_sync_thread(struct mddev *mddev)
 			rdev->saved_raid_disk = -1;
 
 	md_update_sb(mddev, 1);
-	/* MD_CHANGE_PENDING should be cleared by md_update_sb, so we can
+	/* MD_SB_CHANGE_PENDING should be cleared by md_update_sb, so we can
 	 * call resync_finish here if MD_CLUSTER_RESYNC_LOCKED is set by
 	 * clustered raid */
 	if (test_and_clear_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags))
@@ -8637,8 +8640,8 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 			sysfs_notify(&rdev->kobj, NULL,
 				     "unacknowledged_bad_blocks");
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
-		set_mask_bits(&mddev->flags, 0,
-			      BIT(MD_CHANGE_CLEAN) | BIT(MD_CHANGE_PENDING));
+		set_mask_bits(&mddev->sb_flags, 0,
+			      BIT(MD_SB_CHANGE_CLEAN) | BIT(MD_SB_CHANGE_PENDING));
 		md_wakeup_thread(rdev->mddev->thread);
 		return 1;
 	} else
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5c08f84..e38936d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -213,9 +213,6 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 struct md_cluster_info;
 
 enum mddev_flags {
-	MD_CHANGE_DEVS,		/* Some device status has changed */
-	MD_CHANGE_CLEAN,	/* transition to or from 'clean' */
-	MD_CHANGE_PENDING,	/* switch from 'clean' to 'active' in progress */
 	MD_ARRAY_FIRST_USE,	/* First use of array, needs initialization */
 	MD_CLOSING,		/* If set, we are closing the array, do not open
 				 * it then */
@@ -231,11 +228,15 @@ enum mddev_flags {
 				 * supported as calls to md_error() will
 				 * never cause the array to become failed.
 				 */
-	MD_NEED_REWRITE,	/* metadata write needs to be repeated */
 };
-#define MD_UPDATE_SB_FLAGS (BIT(MD_CHANGE_DEVS) | \
-			    BIT(MD_CHANGE_CLEAN) | \
-			    BIT(MD_CHANGE_PENDING))	/* If these are set, md_update_sb needed */
+
+enum mddev_sb_flags {
+	MD_SB_CHANGE_DEVS,		/* Some device status has changed */
+	MD_SB_CHANGE_CLEAN,	/* transition to or from 'clean' */
+	MD_SB_CHANGE_PENDING,	/* switch from 'clean' to 'active' in progress */
+	MD_SB_NEED_REWRITE,	/* metadata write needs to be repeated */
+};
+
 struct mddev {
 	void				*private;
 	struct md_personality		*pers;
@@ -243,6 +244,7 @@ struct mddev {
 	int				md_minor;
 	struct list_head		disks;
 	unsigned long			flags;
+	unsigned long			sb_flags;
 
 	int				suspended;
 	atomic_t			active_io;
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 589b807..9fa2d6c 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -208,7 +208,7 @@ static void multipath_error (struct mddev *mddev, struct md_rdev *rdev)
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	set_bit(Faulty, &rdev->flags);
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 	pr_err("multipath: IO failure on %s, disabling IO path.\n"
 	       "multipath: Operation continuing on %d IO paths.\n",
 	       bdevname(rdev->bdev, b),
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index efc2e74..a1f3fbe 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1517,8 +1517,8 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	 * if recovery is running, make sure it aborts.
 	 */
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-	set_mask_bits(&mddev->flags, 0,
-		      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+	set_mask_bits(&mddev->sb_flags, 0,
+		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
 	pr_crit("md/raid1:%s: Disk failure on %s, disabling device.\n"
 		"md/raid1:%s: Operation continuing on %d devices.\n",
 		mdname(mddev), bdevname(rdev->bdev, b),
@@ -2464,10 +2464,10 @@ static void raid1d(struct md_thread *thread)
 	md_check_recovery(mddev);
 
 	if (!list_empty_careful(&conf->bio_end_io_list) &&
-	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
+	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 			while (!list_empty(&conf->bio_end_io_list)) {
 				list_move(conf->bio_end_io_list.prev, &tmp);
 				conf->nr_queued--;
@@ -2521,7 +2521,7 @@ static void raid1d(struct md_thread *thread)
 			generic_make_request(r1_bio->bios[r1_bio->read_disk]);
 
 		cond_resched();
-		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
+		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
 	blk_finish_plug(&plug);
@@ -2724,7 +2724,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 							min_bad, 0
 					) && ok;
 			}
-		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		*skipped = 1;
 		put_buf(r1_bio);
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 525ca99..ab5e862 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1138,12 +1138,12 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		bio->bi_iter.bi_sector < conf->reshape_progress))) {
 		/* Need to update reshape_position in metadata */
 		mddev->reshape_position = conf->reshape_progress;
-		set_mask_bits(&mddev->flags, 0,
-			      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+		set_mask_bits(&mddev->sb_flags, 0,
+			      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
 		md_wakeup_thread(mddev->thread);
 		raid10_log(conf->mddev, "wait reshape metadata");
 		wait_event(mddev->sb_wait,
-			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+			   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
 
 		conf->reshape_safe = mddev->reshape_position;
 	}
@@ -1652,8 +1652,8 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
-	set_mask_bits(&mddev->flags, 0,
-		      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+	set_mask_bits(&mddev->sb_flags, 0,
+		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	pr_crit("md/raid10:%s: Disk failure on %s, disabling device.\n"
 		"md/raid10:%s: Operation continuing on %d devices.\n",
@@ -2761,10 +2761,10 @@ static void raid10d(struct md_thread *thread)
 	md_check_recovery(mddev);
 
 	if (!list_empty_careful(&conf->bio_end_io_list) &&
-	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
+	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 			while (!list_empty(&conf->bio_end_io_list)) {
 				list_move(conf->bio_end_io_list.prev, &tmp);
 				conf->nr_queued--;
@@ -2822,7 +2822,7 @@ static void raid10d(struct md_thread *thread)
 		}
 
 		cond_resched();
-		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
+		if (mddev->sb_flags & ~(1<<MD_SB_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
 	blk_finish_plug(&plug);
@@ -4209,7 +4209,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 	spin_unlock_irq(&conf->device_lock);
 	mddev->raid_disks = conf->geo.raid_disks;
 	mddev->reshape_position = conf->reshape_progress;
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 
 	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
@@ -4404,9 +4404,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 		else
 			mddev->curr_resync_completed = conf->reshape_progress;
 		conf->reshape_checkpoint = jiffies;
-		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		md_wakeup_thread(mddev->thread);
-		wait_event(mddev->sb_wait, mddev->flags == 0 ||
+		wait_event(mddev->sb_wait, mddev->sb_flags == 0 ||
 			   test_bit(MD_RECOVERY_INTR, &mddev->recovery));
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
 			allow_barrier(conf);
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index de8a4ed..6d1a150 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1206,8 +1206,8 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	 * there is a deadlock. We workaround this issue with a trylock.
 	 * FIXME: we could miss discard if we can't take reconfig mutex
 	 */
-	set_mask_bits(&mddev->flags, 0,
-		BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+	set_mask_bits(&mddev->sb_flags, 0,
+		BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
 	if (!mddev_trylock(mddev))
 		return;
 	md_update_sb(mddev, 1);
@@ -2197,7 +2197,7 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
 	struct mddev *mddev = log->rdev->mddev;
 
 	log->rdev->journal_tail = cp;
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 }
 
 static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3e6a2a0..d40e94d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -961,7 +961,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 			if (bad < 0) {
 				set_bit(BlockedBadBlocks, &rdev->flags);
 				if (!conf->mddev->external &&
-				    conf->mddev->flags) {
+				    conf->mddev->sb_flags) {
 					/* It is very unlikely, but we might
 					 * still need to write out the
 					 * bad block log - better give it
@@ -2547,8 +2547,8 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
-	set_mask_bits(&mddev->flags, 0,
-		      BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_PENDING));
+	set_mask_bits(&mddev->sb_flags, 0,
+		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
 	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
 		"md/raid:%s: Operation continuing on %d devices.\n",
 		mdname(mddev),
@@ -4761,7 +4761,7 @@ static void handle_stripe(struct stripe_head *sh)
 	}
 
 	if (!bio_list_empty(&s.return_bi)) {
-		if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
+		if (test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
 			spin_lock_irq(&conf->device_lock);
 			bio_list_merge(&conf->return_bi, &s.return_bi);
 			spin_unlock_irq(&conf->device_lock);
@@ -5617,9 +5617,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 		mddev->reshape_position = conf->reshape_progress;
 		mddev->curr_resync_completed = sector_nr;
 		conf->reshape_checkpoint = jiffies;
-		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		md_wakeup_thread(mddev->thread);
-		wait_event(mddev->sb_wait, mddev->flags == 0 ||
+		wait_event(mddev->sb_wait, mddev->sb_flags == 0 ||
 			   test_bit(MD_RECOVERY_INTR, &mddev->recovery));
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			return 0;
@@ -5715,10 +5715,10 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 		mddev->reshape_position = conf->reshape_progress;
 		mddev->curr_resync_completed = sector_nr;
 		conf->reshape_checkpoint = jiffies;
-		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
-			   !test_bit(MD_CHANGE_DEVS, &mddev->flags)
+			   !test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags)
 			   || test_bit(MD_RECOVERY_INTR, &mddev->recovery));
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			goto ret;
@@ -5993,10 +5993,10 @@ static void raid5d(struct md_thread *thread)
 	md_check_recovery(mddev);
 
 	if (!bio_list_empty(&conf->return_bi) &&
-	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
+	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		struct bio_list tmp = BIO_EMPTY_LIST;
 		spin_lock_irq(&conf->device_lock);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 			bio_list_merge(&tmp, &conf->return_bi);
 			bio_list_init(&conf->return_bi);
 		}
@@ -6043,7 +6043,7 @@ static void raid5d(struct md_thread *thread)
 			break;
 		handled += batch_size;
 
-		if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
+		if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
 			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
 			spin_lock_irq(&conf->device_lock);
@@ -7640,7 +7640,7 @@ static int raid5_start_reshape(struct mddev *mddev)
 	}
 	mddev->raid_disks = conf->raid_disks;
 	mddev->reshape_position = conf->reshape_progress;
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 
 	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
@@ -7906,7 +7906,7 @@ static int raid5_check_reshape(struct mddev *mddev)
 			conf->chunk_sectors = new_chunk ;
 			mddev->chunk_sectors = new_chunk;
 		}
-		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		md_wakeup_thread(mddev->thread);
 	}
 	return check_reshape(mddev);
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/3] md: MD_RECOVERY_NEEDED is set for mddev->recovery
From: Shaohua Li @ 2016-12-08 23:48 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb
In-Reply-To: <cover.1481240632.git.shli@fb.com>

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 84dc891..5e66648 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6856,7 +6856,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 		/* need to ensure recovery thread has run */
 		wait_event_interruptible_timeout(mddev->sb_wait,
 						 !test_bit(MD_RECOVERY_NEEDED,
-							   &mddev->flags),
+							   &mddev->recovery),
 						 msecs_to_jiffies(5000));
 	if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
 		/* Need to flush page cache, and ensure no-one else opens
-- 
2.9.3


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox